Poster No:
2253
Submission Type:
Abstract Submission
Authors:
Mathieu Dugré1, Yohan Chatelain1, Tristan Glatard2
Institutions:
1Concordia University, Montreal, Quebec, 2Concordia University (Department of Computer Science and Software Engineering), Montreal, Quebec
Introduction:
Computationally expensive pipelines have been one of the main bottlenecks in processing large subject cohorts in neuroimaging, and their execution time can hinder clinical applications where timely data analysis is needed. To leverage the higher statistical power of larger cohorts and enable more applications, the community is constantly seeking novel ways to speed up data processing with HPC, cloud computing, GPU accelerators, or Deep Learning models [Henschel et al., 2020; Hoffmann et al., 2022]. Optimizing a pipeline requires understanding and characterizing its bottlenecks. We characterize the computational profile of several commonly adopted MRI pre-processing pipelines. The results of our analysis can serve as a reference for future efforts to optimize MRI pre-processing workflows.
Methods:
We focus on profiling fMRIPrep [Esteban et al., 2019], a commonly used pipeline for anatomical and functional MRI pre-processing and analysis. For finer granularity, we also profile the sub-pipelines of fMRIPrep, which include tools widely used in the community (ANTs brainExtraction, ANTs registrationSyN, FSL FAST, FSL MCFLIRT, FSL FLIRT, FreeSurfer recon-all). We used the OpenNeuro ds004513 v1.0.2 dataset [Castrillon et al., 2023], with anatomical, functional, and diffusion data from 20 healthy individuals acquired in two cohorts: a cohort with nine participants (mean age=43 yrs, std=7 yrs; 4 females) and a replication cohort with eleven participants (mean age=27 yrs, std=5 yrs; 6 females). We measured CPU time, percentage of memory-boundness, and percentage of floating-point operations with the VTune profiler, which collects these metrics with low overhead and at varying levels of granularity. To obtain human-readable information from the profiler, we re-compiled each application with debug_info using Docker images, then converted the Docker images to Singularity images using docker2singularity. We profiled each application with a single thread to simplify the analysis, and with 32 threads to understand the performance of its multi-threaded implementation. For profiling, we used dedicated compute nodes with two 16-core Intel(R) Xeon(R) Gold 6130 CPUs, 250 GiB of RAM, Rocky Linux 8, and Linux kernel 4.18.0-477.10.1.el8_lustre.x86_64. Data was transferred to the compute nodes before profiling the applications.
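As an illustration of the kind of analysis behind our per-function rankings, CPU-time shares can be computed from a VTune hotspots report exported as CSV (e.g., with `vtune -report hotspots -format csv`). The function names and column labels below are made up for this sketch and are not taken from the actual profiles.

```python
import csv
import io

# Hypothetical excerpt of a VTune hotspots report exported as CSV;
# function names and CPU times here are illustrative only.
report = """Function,CPU Time
interp_trilinear,812.4
gaussian_smooth,305.1
resample_volume,97.0
read_nifti,12.3
"""

rows = list(csv.DictReader(io.StringIO(report)))
total = sum(float(r["CPU Time"]) for r in rows)

# Sort by decreasing CPU time and accumulate each function's share of
# the total, mirroring a long-tail analysis of per-function cost.
rows.sort(key=lambda r: float(r["CPU Time"]), reverse=True)
cumulative = 0.0
for r in rows:
    cumulative += float(r["CPU Time"])
    print(f"{r['Function']:<18}{100 * cumulative / total:6.1f}%")
```

In this toy report, the top function alone accounts for about two thirds of the total CPU time, the kind of skew that makes targeted optimization worthwhile.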
Results:
The average CPU time per function across all tested pipelines follows a long-tail distribution (Figure 1); thus, future optimization efforts can focus on a small number of functions. A detailed analysis of the CPU time, contribution to total pipeline makespan, and memory-boundness of each application shows that interpolation is a primary bottleneck for all profiled applications (Figure 2 shows FreeSurfer recon-all). We suggest that future efforts concentrate on optimizing interpolation techniques, using Deep Learning drop-in alternatives, or developing reduced-precision techniques specialized for interpolation. Furthermore, we observed a surprising slowdown for ANTs brainExtraction (1.14x) and registrationSyN (1.23x) when using the built-in single-precision option compared to the double-precision default, although both versions completed after a similar number of iterations. We speculate that this slowdown is due to less optimized single-precision implementations in the ITK library. Finally, we observed poor multi-thread scaling for FreeSurfer, which we speculate results from a suboptimal OpenMP scheduling policy.
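The interpolation bottleneck is consistent with the low arithmetic intensity of the operation itself: each interpolated sample gathers several scattered voxels but performs only a handful of floating-point operations on them. The following generic trilinear-interpolation sketch (not the profiled pipelines' actual code) illustrates the access pattern:

```python
import numpy as np

def trilinear(vol, pts):
    """Generic trilinear interpolation sketch: each output sample
    gathers 8 scattered voxels but performs only a few multiply-adds,
    so the kernel tends to be memory-bound rather than compute-bound."""
    i0 = np.floor(pts).astype(int)
    f = pts - i0
    x, y, z = i0.T
    fx, fy, fz = f.T
    out = np.zeros(len(pts))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of each corner of the surrounding voxel cube.
                w = ((fx if dx else 1 - fx)
                     * (fy if dy else 1 - fy)
                     * (fz if dz else 1 - fz))
                out += w * vol[x + dx, y + dy, z + dz]  # scattered loads
    return out

rng = np.random.default_rng(0)
vol = rng.random((64, 64, 64))            # source volume
pts = rng.uniform(0, 62.9, size=(1000, 3))  # sampling coordinates
vals = trilinear(vol, pts)
```

Because the eight loads per sample are scattered across the volume, a cache line fetched for one sample is rarely reused for the next, which matches the high memory-boundness we measured for interpolation-heavy functions.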
The code developed to conduct our benchmarks is available on GitHub: https://github.com/mathdugre/mri-bottleneck

·Figure 1. Functions sorted by decreasing average CPU time (in seconds). Pipelines included: ANTs brainExtraction, ANTs registrationSyN, FSL FAST, FSL MCFLIRT, FSL FLIRT, FreeSurfer recon-all.

·Figure 2. Functions sorted by the CPU time of their modules, then by their own. Cumulative makespan percentage shown with green dots. The functions using interpolation are IDs 4, 8, 15, and 26.
Conclusions:
We performed a detailed performance profiling of fMRIPrep and several of its sub-workflows. Overall, only a few functions contribute to the majority of the computation time, and interpolation is the main bottleneck. ANTs applications were faster with double precision than with single precision, and FreeSurfer suffered from OpenMP bottlenecks when using multiple threads.
Modeling and Analysis Methods:
Image Registration and Computational Anatomy
Methods Development
Motion Correction and Preprocessing
Neuroinformatics and Data Sharing:
Workflows 2
Informatics Other 1
Keywords:
Computational Neuroscience
Computing
FUNCTIONAL MRI
Informatics
STRUCTURAL MRI
Workflows
Other - Performance
1|2 indicates the priority used for review
References:
Castrillon, G. (2023). The energetic costs of the human connectome. OpenNeuro. [Dataset] doi:10.18112/openneuro.ds004513.v1.0.2
Esteban, O. (2019). fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature Methods, 16(1), 111-116.
Henschel, L. (2020). FastSurfer - a fast and accurate deep learning based neuroimaging pipeline. NeuroImage, 219, 117012.
Hoffmann, M. (2022). SynthMorph: learning contrast-invariant registration without acquired images. IEEE Transactions on Medical Imaging, 41(3), 543-558.