Generating Gene Embedding for Early Diagnosis of Alzheimer's Disease

Poster No:

873

Submission Type:

Abstract Submission

Authors:

Kyeongho Kim¹, YeonJu Park¹, Jong-Min Lee²

Institutions:

¹Department of Artificial Intelligence, Hanyang University, Seoul, Republic of Korea, ²Department of Electronic Engineering, Hanyang University, Seoul, Republic of Korea

First Author:

Kyeongho Kim
Department of Artificial Intelligence, Hanyang University
Seoul, Republic of Korea

Co-Author(s):

YeonJu Park
Department of Artificial Intelligence, Hanyang University
Seoul, Republic of Korea

Jong-Min Lee
Department of Electronic Engineering, Hanyang University
Seoul, Republic of Korea

Introduction:

Alzheimer's disease (AD) is characterized by memory dysfunction and language disorders. Its high prevalence among individuals aged 65 and older is a significant concern for the elderly. As the global population ages, the number of Alzheimer's patients is predicted to approximately triple by 2050 [1]. Consequently, extensive recent research has led to the development of new drugs for AD. However, currently developed drugs slow the progression of the disease rather than restore health [2]. Early diagnosis is therefore necessary so that medication can begin as early as possible. Techniques for early diagnosis of AD mainly use neuroimaging data, but the accuracy of early diagnosis based on neuroimaging data is limited. Given that the estimated heritability of AD is 60%-80% [3], genetic data are expected to play an important role in early diagnosis. However, current representations such as minor allele counts and one-hot vectors are too simplistic to capture complex genetic data. We therefore use deep learning to generate gene embeddings from the genes used in early diagnosis of AD.

Methods:

In this study, we use 623 individuals from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset for whom single nucleotide polymorphism (SNP) information is available. The cohort includes 285 patients with AD and 338 cognitively normal controls (NC). We use the 10 genes most closely associated with AD.
SNPs are represented in diploid form: each individual carries two complete sets of chromosomes, one from each parent. However, natural language processing (NLP) models typically take a single sentence as input. We therefore transform the diploid data into single-sentence form. Because each allele is one of four letters (A, C, G, and T), the two alleles at a site form one of ten distinct pairs once order is ignored (A/C and C/A count as the same pair), and each pair is assigned its own letter. We then use the k-mer method to tokenize the resulting SNP sequence, treating each k-mer as one word.
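
The pairing and tokenization steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the ten-letter alphabet (base letters for homozygous pairs, IUPAC ambiguity codes for heterozygous pairs), the function names `encode_diploid` and `kmer_tokens`, the overlapping stride of 1, and the value k=3 are all assumptions, since the abstract does not specify them.

```python
# Hypothetical 10-letter alphabet: homozygous pairs keep their base letter,
# heterozygous pairs borrow IUPAC ambiguity codes. The abstract does not say
# which letters the authors actually used.
PAIR_TO_SYMBOL = {
    ("A", "A"): "A", ("C", "C"): "C", ("G", "G"): "G", ("T", "T"): "T",
    ("A", "C"): "M", ("A", "G"): "R", ("A", "T"): "W",
    ("C", "G"): "S", ("C", "T"): "Y", ("G", "T"): "K",
}


def encode_diploid(first_copy, second_copy):
    """Collapse two equal-length allele strings into one 10-letter string."""
    if len(first_copy) != len(second_copy):
        raise ValueError("both chromosome copies must have the same length")
    # Sorting each pair makes A/C and C/A map to the same symbol.
    return "".join(
        PAIR_TO_SYMBOL[tuple(sorted(pair))]
        for pair in zip(first_copy, second_copy)
    )


def kmer_tokens(sequence, k=3):
    """Slide a window of width k over the sequence (stride 1 is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy alleles, invented for illustration only.
sentence = encode_diploid("ACGTA", "CCGAA")
tokens = kmer_tokens(sentence, k=3)
print(sentence, tokens)
```

With the two toy strings shown, the five positions collapse to the single sentence `MCGWA`, which yields the 3-mers `MCG`, `CGW`, and `GWA`.
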
We use Bidirectional Encoder Representations from Transformers (BERT) to create meaningful embedding vectors for the k-mer tokens. BERT is trained with two unsupervised tasks, masked language modeling (MLM) and next sentence prediction (NSP). We use only MLM because each input to our model contains information about a single person. In each subject's SNPs, 15% of the input tokens are randomly masked, and BERT learns to predict the correct token from the hidden vector of each mask token.
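
The masking step can be illustrated with a short Python sketch. It assumes the simplest variant, in which every selected token is replaced by a `[MASK]` placeholder; the original BERT recipe replaces only 80% of the selected tokens with `[MASK]`, and the abstract does not say which variant was used. The function name `mask_tokens`, the `seed` parameter, and the rule of masking at least one token are hypothetical choices made for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Mask a random subset of tokens for masked language modeling.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    each masked position and None elsewhere, so a loss would be computed
    only where the model has to recover a hidden token.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Assumption: mask at least one token even for very short inputs.
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Toy k-mer tokens, invented for illustration only.
toy_tokens = ["MCG", "CGW", "GWA", "WAC", "ACT", "CTK", "TKR",
              "KRS", "RSY", "SYA", "YAC", "ACG", "CGT", "GTM",
              "TMA", "MAA", "AAC", "ACC", "CCG", "CGG"]
masked, labels = mask_tokens(toy_tokens, seed=0)
print(masked)
```

For a 20-token input, 15% corresponds to 3 masked positions; the model's target at each of those positions is the token stored in `labels`.
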
To create gene embeddings, the embedded k-mer tokens are fed into a Long Short-Term Memory (LSTM) network, which is trained to classify the 10 genes. A single gene's nucleotide sequence is passed through BERT, which yields a 120-dimensional embedding for each token. These embeddings are then fed into the LSTM sequentially. We take the final output, which captures the meaning of the entire nucleotide sequence, as the 120-dimensional embedding of the gene's sequence.
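
The data flow from token embeddings to a gene embedding can be sketched with a toy LSTM forward pass in plain Python. This is only an illustration under stated assumptions: the weights are random and untrained, the token embeddings are random stand-ins for BERT outputs, the class `TinyLSTM` and the linear classification head are hypothetical, and the abstract does not describe how the classifier is attached to the LSTM. Only the 120-dimensional sizes and the 10 gene classes come from the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TinyLSTM:
    """Single-layer LSTM forward pass with random, untrained weights."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        width = input_dim + hidden_dim  # each gate sees [x_t ; h_{t-1}]
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_dim)]
            for gate in self.GATES
        }

    def forward(self, token_embeddings):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in token_embeddings:
            z = list(x) + h
            pre = {gate: matvec(self.weights[gate], z) for gate in self.GATES}
            i_gate = [sigmoid(v) for v in pre["input"]]
            f_gate = [sigmoid(v) for v in pre["forget"]]
            o_gate = [sigmoid(v) for v in pre["output"]]
            cand = [math.tanh(v) for v in pre["candidate"]]
            c = [f * ck + i * g for f, ck, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(ck) for o, ck in zip(o_gate, c)]
        return h  # final hidden state, used here as the sequence embedding


EMBED_DIM, NUM_GENES, NUM_TOKENS = 120, 10, 8
rng = random.Random(1)
# Random stand-ins for the per-token BERT embeddings of one gene.
token_embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                    for _ in range(NUM_TOKENS)]
gene_embedding = TinyLSTM(EMBED_DIM, EMBED_DIM).forward(token_embeddings)

# Hypothetical linear head mapping the gene embedding to 10 gene classes.
head = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
        for _ in range(NUM_GENES)]
gene_probs = softmax(matvec(head, gene_embedding))
print(len(gene_embedding), len(gene_probs))
```

In practice a deep learning framework would replace this hand-written loop; the sketch only shows that the last hidden state summarizes the whole token sequence and can feed a 10-way classifier.
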

Figure 1. Overall structure of the model

Results:

The LSTM was trained on 5,000 gene samples and evaluated on 1,230 gene samples. Results were obtained with a cross-entropy loss on the predictions of a model that classifies the 10 types of genes. Figure 2 shows the t-SNE plot and the confusion matrix for the classification of the 10 types of genes. The average classification performance across the 10 types of genes is 69.2%. Additionally, to investigate whether the gene embeddings influence early diagnosis of AD, we conducted early AD diagnosis using T1-weighted images and the gene embeddings.
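
As a rough illustration of how such figures can be computed, the sketch below builds a confusion matrix and averages the per-class accuracy. The labels are invented toy values, not the study's data, and the abstract does not state whether the reported 69.2% is a per-class average or an overall accuracy, so the averaging rule here is an assumption. All function names are hypothetical.

```python
import math


def cross_entropy(probs, true_class):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(probs[true_class])


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[k] / sum(row) if sum(row) else 0.0
            for k, row in enumerate(matrix)]


def mean_accuracy(matrix):
    scores = per_class_accuracy(matrix)
    return sum(scores) / len(scores)


# Toy example with 3 classes instead of 10.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
print(matrix, per_class_accuracy(matrix), mean_accuracy(matrix))
```

In the toy example the per-class accuracies are 0.5, 1.0, and 0.5, so the averaged value is 2/3.
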

Figure 2. t-SNE and confusion matrix representing the classification results of 10 types of genes

Conclusions:

In this study, we used BERT and an LSTM to train gene embeddings. We then conducted early AD diagnosis using T1-weighted images and the gene embeddings. In future work, early diagnosis of AD will use not only T1-weighted images and gene embeddings but also other types of data.

Genetics:

Genetics Other ¹

Modeling and Analysis Methods:

Other Methods ²

Keywords:

Machine Learning

MRI

Open Data

¹|² Indicates the priority used for review

References:

1. Hao, Xiaoke, et al. "Multi-modal Self-paced Locality Preserving Learning for Diagnosis of Alzheimer’s Disease." IEEE Transactions on Cognitive and Developmental Systems (2022).
2. Sims, John R., et al. "Donanemab in early symptomatic Alzheimer disease: the TRAILBLAZER-ALZ 2 randomized clinical trial." JAMA 330.6 (2023): 512-527.
3. Lagisetty, Yashwanth, et al. "Identification of risk genes for Alzheimer’s disease by gene embedding." Cell Genomics 2.9 (2022).
4. Ji, Yanrong, et al. "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome." Bioinformatics 37.15 (2021): 2112-2120.
5. Cahyawijaya, Samuel, et al. "SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study." Proceedings of the 21st Workshop on Biomedical Language Processing. 2022.
6. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of NAACL-HLT. Vol. 1. 2019.
7. Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.