From PubMed to a DataFrame: An ecosystem for mining the biomedical literature

Poster No:

1861 

Submission Type:

Abstract Submission 

Authors:

Kendra Oudyk1, Jérôme Dockès2, Mohammad Torabi1, Alejandro De La Vega3, Jean-Baptiste Poline1

Institutions:

1McGill University, Montreal, Quebec, 2INRIA, Paris, Paris Region, 3University of Texas at Austin, Austin, TX

First Author:

Kendra Oudyk  
McGill University
Montreal, Quebec

Co-Author(s):

Jérôme Dockès, PhD  
INRIA
Paris, Paris Region
Mohammad Torabi  
McGill University
Montreal, Quebec
Alejandro De La Vega  
University of Texas at Austin
Austin, TX
Jean-Baptiste Poline  
McGill University
Montreal, Quebec

Introduction:

Meta-research is important for summarizing results (as in meta-analyses and reviews) and for pointing out issues with different aspects of research, like methods, authorship, and publishing. Meta-research should be 'living' because the scientific literature is continuously changing. Further, with thousands of publications each year, we need systematic or (semi-)automated approaches for indexing, aggregating, and summarizing the literature.
An important challenge to meta-research projects is the construction of an appropriate dataset (i.e., set of articles). One must download a large number of articles and extract the relevant text, metadata, and often the stereotactic coordinates of results. Ideally this would be done automatically, but doing this well is a software-engineering problem that most researchers are not prepared for.
We facilitate living meta-research by providing tools that are as reproducible, scalable, and accessible as possible. Here we introduce a set of inter-operable tools that help with collecting and labelling articles.

Methods:

Figure 1 shows the tools that we have created and which stage(s) of a meta-research project they apply to.

Pubget is a command-line tool for downloading and processing articles from PubMed Central. It builds upon the code used to create NeuroQuery. Given a search query or a list of PMCIDs, it provides the matching articles in their original XML format, in addition to CSV files containing: (i) metadata such as authors or publication year, (ii) the full text, and (iii) the activation coordinates. Pubget can extract term-frequency features and run NeuroQuery's or NeuroSynth's analyses. It can prepare a NiMARE (nimare.readthedocs.io) dataset, making a wide range of meta-analysis methods easy to apply. It can also be extended with plugins.
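As a rough illustration of how CSV outputs of this kind can be combined downstream, the sketch below joins a metadata table and a coordinates table on a shared article identifier and counts the coordinates reported per article. The two in-memory tables are made-up stand-ins, and the column names (pmcid, publication_year, x, y, z) and the helper names are assumptions for this example, not a specification of pubget's files.

```python
import csv
import io
from collections import Counter

# Made-up stand-ins for a metadata table and a coordinates table.
# Column names are assumptions for illustration only.
METADATA_CSV = """pmcid,title,publication_year
101,Study A,2019
102,Study B,2021
103,Study C,2021
"""

COORDINATES_CSV = """pmcid,x,y,z
101,-42.0,20.0,4.0
101,38.0,-60.0,12.0
103,0.0,52.0,-8.0
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def coordinates_per_article(metadata_rows, coordinate_rows):
    """Attach to each article the number of coordinates that share its pmcid."""
    counts = Counter(row["pmcid"] for row in coordinate_rows)
    return [
        {
            "pmcid": row["pmcid"],
            "publication_year": int(row["publication_year"]),
            "n_coordinates": counts.get(row["pmcid"], 0),
        }
        for row in metadata_rows
    ]


summary = coordinates_per_article(
    read_rows(METADATA_CSV), read_rows(COORDINATES_CSV)
)
for entry in summary:
    print(entry)
```

With real output, the two strings would be replaced by the contents of the corresponding files; the same join could equally be written with a DataFrame library.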

Labelbuddy is a simple and lightweight desktop application for labelling texts, which manages annotations with a regular file (a SQLite database). Pubget's output can be imported directly into labelbuddy. Labelbuddy imports and exports its data in a simple JSON format and offers a command-line interface, making it well-suited for projects organized around a git repository. An example repository containing over 1,800 annotations can be found at https://neurodatascience.github.io/labelbuddy-annotations/.
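To make the idea of a simple JSON exchange format concrete, here is a minimal sketch that writes documents as one JSON object per line and recovers the labelled text from character-offset annotations. The field names used here (text, metadata, start_char, end_char, label_name) and both helper functions are assumptions for illustration; the labelbuddy documentation should be consulted for the exact format.

```python
import json


def to_jsonl(articles):
    """Serialize articles as JSON Lines: one document object per line."""
    lines = []
    for article in articles:
        doc = {
            "text": article["text"],
            "metadata": {"pmcid": article["pmcid"]},
        }
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def labelled_spans(text, annotations):
    """Return (label, annotated substring) pairs from character offsets."""
    return [
        (a["label_name"], text[a["start_char"]:a["end_char"]])
        for a in annotations
    ]


# Made-up example document and annotation.
articles = [{"pmcid": 101, "text": "We scanned 24 healthy adults."}]
jsonl = to_jsonl(articles)

annotations = [
    {"start_char": 11, "end_char": 13, "label_name": "n_participants"}
]
spans = labelled_spans(articles[0]["text"], annotations)
print(jsonl, spans)
```

Because both the documents and the annotations are plain text files, they can be versioned and reviewed in a git repository like any other source file.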

Pubextract is a Python package containing plugins for pubget. Its functions can be called from the command line along with the code to download the papers. This enables a researcher to, for example, automatically extract features from the set of articles they download, such as author genders and locations, demographic information like the number and ages of participants, as well as which papers contain certain user-provided terms (e.g., "FSL" or "UKBioBank").
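The term-detection idea can be sketched in a few lines. The function below is a hypothetical stand-in written for this example, not pubextract's implementation or API: it flags, for each article, which user-provided terms occur as whole words (case-insensitively) in the text. The example texts are made up.

```python
import re


def find_terms(texts, terms):
    """Map each article ID to {term: True/False} for whole-word matches."""
    patterns = {
        # Lookarounds keep a term from matching inside a longer word.
        term: re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        for term in terms
    }
    return {
        article_id: {
            term: bool(pattern.search(text))
            for term, pattern in patterns.items()
        }
        for article_id, text in texts.items()
    }


# Made-up article texts keyed by a made-up identifier.
texts = {
    "101": "Preprocessing used FSL 6.0.",
    "102": "Data came from the UKBioBank cohort.",
    "103": "Images were analysed with SPM12.",
}
matches = find_terms(texts, ["FSL", "UKBioBank"])
print(matches)
```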
Figure 1. Workflow and tools for meta-research that is reproducible, scalable, and accessible.

Results:

To illustrate the use of this ecosystem, we replicated and extended the investigation of sample sizes from Poldrack et al. (2017). We downloaded articles with pubget, designed a heuristic to extract participant counts and demographics, and validated it on 100 articles that we annotated with labelbuddy. We further compared this heuristic approach with sample-size extraction by GPT-3.5 (Figure 2A). As shown in Figure 2C, the median sample size has continued to increase since 2015.
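For readers unfamiliar with this kind of analysis, the sketch below shows one very simple way to extract a participant count from text and summarize it by year. It is a toy regular-expression heuristic written for this example, not the heuristic used in the study; the pattern, the choice to keep the largest count mentioned, the function names, and the example sentences are all assumptions.

```python
import re
from collections import defaultdict
from statistics import median

# Toy pattern: a number followed by a word that denotes participants.
PARTICIPANT_PATTERN = re.compile(
    r"\b(\d+)\s+(?:healthy\s+)?"
    r"(?:participants|subjects|volunteers|adults|patients)\b",
    re.IGNORECASE,
)


def extract_sample_size(text):
    """Return the largest participant count mentioned, or None if none."""
    counts = [int(m.group(1)) for m in PARTICIPANT_PATTERN.finditer(text)]
    return max(counts) if counts else None


def median_by_year(articles):
    """Median extracted sample size per year, skipping articles with none."""
    sizes = defaultdict(list)
    for year, text in articles:
        size = extract_sample_size(text)
        if size is not None:
            sizes[year].append(size)
    return {year: median(values) for year, values in sorted(sizes.items())}


# Made-up (year, text) pairs.
articles = [
    (2015, "We recruited 20 healthy participants."),
    (2015, "Data from 30 subjects were analysed."),
    (2020, "A total of 48 participants completed the task."),
    (2020, "No sample size is reported here."),
]
medians = median_by_year(articles)
print(medians)
```

A heuristic like this misses counts written as words and can confuse subgroup sizes with the total, which is why validating against manual annotations matters.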

We also ran pubget on a query matching a larger number of articles (over 9,000) and compared the meta-analytic maps obtained with pubget's --fit_neurosynth option to those from neurosynth.org. Results are similar for frequent terms, but for rare terms, pubget's use of the full text produces more powerful analyses.
Figure 2. Results of replicating and extending the work of Poldrack et al. (2017) on tracking the sample size of neuroimaging experiments over time.

Conclusions:

We facilitate downloading, annotating, and preparing articles for meta-research in a way that is highly reproducible, scalable, and accessible. This can be key for living meta-research that furthers our understanding of an ever-increasing body of literature. We hope that discussions at the OHBM 2024 meeting will help us tailor these tools to the needs of the neuroimaging community.

Modeling and Analysis Methods:

Activation (e.g., BOLD task-fMRI)
Methods Development 1
Other Methods

Neuroinformatics and Data Sharing:

Informatics Other 2

Keywords:

Data Organization
Meta-Analysis
Other - meta-research, literature mining

1|2Indicates the priority used for review

References:

Poldrack, R. A. et al. (2017). Scanning the horizon: towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115-126.