Poster No:
2262
Submission Type:
Abstract Submission
Authors:
Rastko Ciric1, Russell Poldrack2
Institutions:
1Stanford University, Mountain View, CA, 2Stanford University, Stanford, CA
First Author:
Rastko Ciric, Stanford University, Mountain View, CA
Co-Author:
Russell Poldrack, Stanford University, Stanford, CA
Introduction:
Brain mapping and other data-intensive sciences have adopted data organization conventions that are formalised, structured, and principled (e.g., [1]). However, these conventions are often at odds with optimal practices for online training of large-scale machine learning models. For instance, volumetric brain image datasets often explicitly represent thousands of extraneous voxels, and MRI datasets are usually scattered across many files that are non-contiguous on disk. One-size-fits-all workflows are also unsuitable in the setting of online training: even after preprocessing, data frequently require additional transformations before they are ready for consumption by a model, and different models have different input data demands. Together, these design limitations manifest as high latency when data batches are loaded naively; because of the large size of fMRI data and limited memory bandwidth, data load operations can become a bottleneck in model training.
Methods:
We introduce entense, a software library that transforms neuroimaging datasets into archive formats designed for downstream ingestion by widely used machine learning libraries such as PyTorch[2] and JAX[3]. We do not introduce new data formats but instead leverage TFRecords[4] and tar archives[5] for interoperability with existing libraries. entense is built upon a compositional design that allows registration of custom write operations, thereby supporting future extensibility to additional formats, such as HDF5 or GIfTI.
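While entense's programming interface is not detailed here, a registry-based design for write operations might look like the following minimal sketch; all names are hypothetical and do not reflect entense's actual API:

    # Hypothetical registry-based writer design; `WRITERS` and
    # `register_writer` are illustrative names, not entense's API.
    from typing import Callable, Dict

    WRITERS: Dict[str, Callable] = {}

    def register_writer(fmt: str):
        """Return a decorator that registers a write operation for `fmt`."""
        def decorator(fn: Callable) -> Callable:
            WRITERS[fmt] = fn
            return fn
        return decorator

    @register_writer("tfrecord")
    def write_tfrecord(records, path):
        ...  # serialise each record, e.g. as a tf.train.Example

    @register_writer("tar")
    def write_tar(records, path):
        ...  # pack each record's arrays and metadata into a tar shard

Under a design of this kind, a new format such as HDF5 could be supported by registering one additional write operation, without modifying the core routines.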
Results:
The core of entense is a sequence of three abstract routines that approximately implements the ETL (extract-transform-load) process used in industrial data engineering. Under the paradigm of compositional functional programming, these core routines are composable with functional atoms called primitives, which modify or concretise their behaviour. The first core routine is responsible for locating paths corresponding to each data record, matching variables from tabular datasets (e.g., diagnostic, demographic, or behavioural) with these records, and collating all data instances in a data frame representation. Primitives can provide this routine with a schema to use when identifying data records (e.g., fMRIPrep-BIDS[1,6]; HCP[7]); a set of pre-transformations to apply to paths (e.g., download from a URL; synchronise using DataLad[8]); instructions on which files (e.g., volumes, surfaces, confounds) to stage to each instance's record; and filtering protocols for either dropping incomplete records or pivoting the aggregate data frame to specify the level (e.g., subject or image) corresponding to a single data instance.
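As a rough illustration of what this extract step produces (not entense's API; the directory layout, filenames, and column names below are assumptions for a BIDS-like derivative dataset):

    # Self-contained sketch of the extract step for a BIDS-like layout;
    # all paths and column names are assumptions.
    from pathlib import Path
    import pandas as pd

    root = Path("/data/ds000001/derivatives/fmriprep")

    # 1. Locate the file paths that constitute each data record.
    records = pd.DataFrame({
        "bold": sorted(root.glob("sub-*/func/*desc-preproc_bold.nii.gz")),
    })
    records["participant_id"] = records["bold"].map(
        lambda p: p.name.split("_")[0]
    )

    # 2. Match tabular variables (e.g., demographics) to each record.
    participants = pd.read_csv(root / "participants.tsv", sep="\t")
    frame = records.merge(participants, on="participant_id", how="left")

    # 3. Filter: drop records that are missing required variables.
    frame = frame.dropna(subset=["age"])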
The second core routine is responsible for workflow assembly: it "compiles" a sequence of transformation instructions into a directed acyclic graph (DAG) and then introspects it, applying a greedy heuristic that fuses contiguous transformations into workflow nodes subject to resource restrictions and the inferred demands of each transformation. Primitives specify the transformation sequence (e.g., masking, filtering, confound removal) and the node fusion heuristic. Transformations are implemented using the same backend as the hypercoil differentiable programming system[9], supporting just-in-time compilation for hardware accelerators such as GPUs. The assembled workflow can run either serially or in distributed mode, in which case each node is wrapped in a node of an equivalent Pydra workflow[10].
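For illustration, a greedy fusion pass over a linear transformation sequence might be sketched as below; entense operates on a full DAG, and the per-transformation memory estimates here are assumptions:

    # Sketch of greedy node fusion over a linear sequence of
    # (callable, estimated-memory) pairs; the cost model is assumed.
    import functools
    import jax

    def fuse(transforms, memory_limit):
        """Greedily pack contiguous transformations into workflow nodes,
        closing each node when its estimated memory demand hits the limit."""
        nodes, current, cost = [], [], 0.0
        for fn, demand in transforms:
            if current and cost + demand > memory_limit:
                nodes.append(current)       # close this node, start a new one
                current, cost = [], 0.0
            current.append(fn)
            cost += demand
        if current:
            nodes.append(current)
        return nodes

    def compile_node(fns):
        """Compose a fused node and JIT-compile it for accelerators."""
        composed = functools.reduce(lambda f, g: (lambda x: g(f(x))), fns)
        return jax.jit(composed)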
The final core routine implements the archival, or record-writing, process, which includes calling the assembled workflow. Primitives specify the output format(s) and whether and how any products of the transformation should be disaggregated across different archive files or data shards. entense also provides utilities for performing data splits, either randomly or on a stratified or representative basis, and schemes for "sharding" data over multiple files.
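As an example of such a sharding scheme, following the tar-shard convention of [5] (shard size and naming below are assumptions, not entense defaults):

    # Minimal sharding sketch in the style of tar-based shard formats;
    # shard size and naming are assumptions.
    import io
    import tarfile

    def write_shards(records, prefix, shard_size=1000):
        """Write (key, extension, payload-bytes) records across tar shards."""
        for i in range(0, len(records), shard_size):
            shard_path = f"{prefix}-{i // shard_size:06d}.tar"
            with tarfile.open(shard_path, "w") as tar:
                for key, ext, payload in records[i:i + shard_size]:
                    info = tarfile.TarInfo(name=f"{key}.{ext}")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))

Contiguous shards of this kind can be streamed sequentially during training, amortising I/O latency across many records.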
Conclusions:
We introduce entense, a software system that transforms neuroimaging data into formats suited to online training of large-scale models.
Modeling and Analysis Methods:
Methods Development
Other Methods
Neuroinformatics and Data Sharing:
Workflows 1
Informatics Other 2
Keywords:
Computing
Data Organization
FUNCTIONAL MRI
Informatics
Open-Source Code
Open-Source Software
Preprint
Workflows
1|2 indicates the priority used for review
References:
[1] Gorgolewski et al. (2016) The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3, 160044.
[2] Paszke et al. (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS.
[3] Frostig et al. (2018) Compiling machine learning programs via high-level tracing. SysML.
[4] Murray et al. (2021) tf.data: A Machine Learning Data Processing Framework. Proceedings of the VLDB Endowment, 14(12).
[5] Aizman et al. (2019) High Performance I/O For Large Scale Deep Learning. IEEE International Conference on Big Data.
[6] Esteban et al. (2019) fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature Methods, 16, 111–116.
[7] Glasser et al. (2013) The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124.
[8] Halchenko et al. (2021) DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software 6(63), 3262.
[9] Ciric et al. (2022) Differentiable programming for functional connectomics. Proceedings of Machine Learning Research, Vol. 193, 419–455.
[10] Jarecka et al. (2020) Pydra - a flexible and lightweight dataflow engine for scientific analyses. Proceedings of the 19th Python in Science Conference.