Computational Frameworks for Identifying and Visualizing SNP-SNP Interactions in Alzheimer’s Disease Risk

The main aim of this paper is to perform a case-control study to detect high-ranking risk genes and the most significant SNP-SNP interactions for a better understanding of the disease’s etiology. Early and accurate disease detection is essential for several reasons. While there is no known cure for Alzheimer’s, behavioral symptoms can be managed with a variety of drugs and coping techniques [27]. Treatment early in the disease course may help maintain daily functioning for a while, and most drugs work best for people in the early or middle stages of the condition. Owing to advances in data mining and machine learning, physicians and researchers may now investigate biological markers, or biomarkers, of complex illnesses in living organisms.

This paper presents two frameworks for identifying accurate genetic biomarkers for early AD diagnosis, increasing diagnostic accuracy. Framework I of this work demonstrated a method that combined MDR and ensemble learning techniques. The promising findings of this framework presented significant genes and SNP-SNP interactions that effectively and precisely contribute to AD risk. In framework II, top-ranking SNPs that may contribute to the risk of AD were identified through SNP-SNP interactions by integrating DNN with the MDR constructive induction algorithm. This model can be explained using certain methodologies, such as SHAP, which also addresses the black-box design of DNN. These frameworks’ main aim is to identify SNP-SNP interactions that aid in developing an understanding of the disease process and provide a crucial foundational step for a PM strategy.

The proposed workflow is shown in Fig. 1. In the first step, the genotype dataset was acquired, and preprocessing procedures were applied. Then a dimensionality reduction method was used to obtain a subset of SNP features from the very large initial feature set. Once the SNP subsets were chosen, they were examined using ensemble learning or deep learning methods to find the SNPs with the greatest risk susceptibility linked to AD. The next stage was to use MDR to investigate interactions among the SNPs most associated with risk susceptibility.

Fig. 1

MDR results are commonly adjusted using the Benjamini–Hochberg (BH) false discovery rate (FDR) correction. Because MDR is often used to test large numbers of gene-gene or gene-environment interactions (combinations of SNPs), the risk of false positives is high. To address this, FDR methods like Benjamini-Hochberg are applied to the permutation-based p-values generated by the MDR analysis.
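The Benjamini-Hochberg adjustment described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `benjamini_hochberg` and the five p-values are hypothetical, chosen only to show the step-up procedure on permutation-based p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical permutation p-values for five candidate SNP combinations.
p_perm = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_perm)
significant = [q <= 0.05 for q in q_values]
```

In this toy input the third and fourth combinations have raw p-values below 0.05 but no longer pass the 0.05 threshold after adjustment, which is the kind of false positive the correction is meant to remove.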

To ensure the statistical robustness of our findings, we conducted 1,000 permutations of the diagnostic labels. These permutations were performed inside the nested cross-validation loop, encompassing the entire pipeline from MDR feature selection to DNN training. This approach (Full-Pipeline Permutation) ensures that the reported p-values account for the model selection process and the high dimensionality of the ADNI dataset. Furthermore, all p-values were adjusted using Benjamini-Hochberg FDR correction to account for multiple testing across the SNP combinations and models evaluated.
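A label-permutation test of this kind can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: `score_fn` is a hypothetical stand-in for the full nested procedure (MDR feature selection through DNN training), and the toy scorer below applies a fixed rule rather than refitting anything.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def permutation_p_value(labels, score_fn, n_permutations=1000, seed=0):
    """Empirical p-value: how often shuffled labels score at least as well."""
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: risk-allele counts and diagnostic labels.
toy_genotypes = [0, 0, 0, 0, 0, 0, 1, 2, 2, 1, 2, 1]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def toy_pipeline_score(labels):
    # Simplification: a fixed rule; a real pipeline would be refit per permutation.
    predictions = [1 if g >= 1 else 0 for g in toy_genotypes]
    return balanced_accuracy(labels, predictions)


p_value = permutation_p_value(toy_labels, toy_pipeline_score)
```

Passing the whole pipeline in as `score_fn` is what makes the p-value reflect the model selection step as well as the final classifier.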

Finally, the most significant SNP-SNP interactions were detected and visualized to predict and classify AD.

Dataset

The dataset was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [28]. It includes genotypes for 127 normal individuals and 304 patients. The SNP count for both groups is 730,524. ADNI is a benchmark dataset that contains several data types from research participants, collected using a uniform set of protocols to minimize inconsistencies and facilitate the exploration of the disease’s processes.

Data Preprocessing

Preparing the data is an essential step in obtaining meaningful results. In this paper, the following preprocessing stages were employed. First, the diagnostic information was used to categorize each participant’s phenotype as either a patient or a normal person, giving 127 unaffected and 304 affected individuals. The ADNI dataset was then examined using quality control (QC) procedures to eliminate low-quality SNPs and reduce the possibility of erroneous results. The following QC steps were implemented with PLINK [29]:

a) Individuals with a high percentage (more than 10%) of missing genotyping data were removed from the analysis.

b) SNPs with a missing genotyping rate above 10% were removed; only SNPs with a genotyping rate of 90% or higher were retained.

c) SNPs with a minor allele frequency (MAF) of less than 10% were likewise removed.

After the QC procedures were applied, 431 individuals remained (304 cases and 127 controls), and the total number of SNPs was reduced to 530,750. SNPs with MAF below 10% were excluded to minimize noise and enhance the stability of the predictive model. Low-frequency variants are more prone to genotyping errors and may produce unreliable estimates, particularly in high-dimensional datasets with limited sample sizes. As this study emphasizes predictive modeling, focusing on common variants supports better generalization and helps reduce the risk of overfitting. Overall, these thresholds represent a practical balance between maintaining data quality and ensuring model robustness.
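The three QC thresholds can be illustrated on an in-memory genotype matrix. The study ran these steps in PLINK; the sketch below is only a plain-Python analogue. The function `qc_filter`, the coding of genotypes as allele counts 0/1/2 with `None` for missing calls, and the order in which the filters are applied are all assumptions made for illustration.

```python
def qc_filter(genotypes, max_sample_missing=0.10, max_snp_missing=0.10, min_maf=0.10):
    """Return (kept_row_indices, kept_column_indices) after the three QC filters."""
    n_snps = len(genotypes[0])
    # a) Drop individuals whose fraction of missing calls exceeds the threshold.
    kept_rows = [
        i for i, row in enumerate(genotypes)
        if sum(g is None for g in row) / n_snps <= max_sample_missing
    ]
    if not kept_rows:
        return [], []
    kept_cols = []
    for j in range(n_snps):
        calls = [genotypes[i][j] for i in kept_rows]
        observed = [g for g in calls if g is not None]
        # b) Drop SNPs whose missing genotyping rate exceeds the threshold.
        missing_rate = 1 - len(observed) / len(calls)
        if not observed or missing_rate > max_snp_missing:
            continue
        # c) Drop SNPs whose minor allele frequency is below the threshold.
        freq = sum(observed) / (2 * len(observed))
        if min(freq, 1 - freq) < min_maf:
            continue
        kept_cols.append(j)
    return kept_rows, kept_cols


# Hypothetical toy matrix: 10 typed individuals plus one with no calls, 12 SNPs.
base_row = [1, 0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]
toy = [list(base_row) for _ in range(10)]
toy.append([None] * 12)   # removed by filter (a)
toy[0][3] = None          # SNP 3 is missing in 2 of 10 kept individuals,
toy[1][3] = None          # so it is removed by filter (b)
kept_rows, kept_cols = qc_filter(toy)
```

In the toy matrix, SNP 1 (all genotypes 0) and SNP 2 (all genotypes 2) are monomorphic and fail the MAF filter, while SNP 3 fails the missingness filter.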

After that, a linkage disequilibrium (LD) pruning stage was employed to improve the efficacy of genetic association research for complex disorders. LD pruning of the ADNI dataset yielded 447,538 SNPs. SNP-disease association tests were then employed to reduce the extensive processing demands and assess the statistical relationship of each SNP with AD [8]. In this paper, basic association, logistic model, and Fisher’s exact tests were applied using PLINK. Non-significant SNPs (p-value greater than 0.01) were excluded. Intersecting the SNP results from the applied association tests yielded 3,502 candidate SNPs. However, this number was still very large. As a result, feature selection methods were utilized to identify the most important SNPs.

Proposed Genetic Biomarkers Discovery (Integrated EL-MDR) Framework I

This section explains the first proposed framework to detect and visualize significant AD biomarkers. Ensemble learning is a machine learning approach that trains multiple models to address a problem. In contrast, traditional machine learning classifiers learn a single hypothesis from the training data [30].

Framework I employs ensemble learning techniques to identify significant SNPs that contribute to AD risk through SNP-SNP interactions. The union of the top twenty SNPs from each ranking yielded seventy-six potential predisposition SNPs. Subsequently, a multi-locus interaction analysis was conducted on these SNPs to identify and visualize the most important SNP-SNP interactions. A key feature of this proposed framework is its ability to implement a high-dimensional model to explore SNP-SNP interactions among genetic variants.

Dimensionality Reduction

Feature reduction methods were applied to eliminate insignificant and redundant features from the obtained dataset. The Tuned ReliefF (TuRF) method is well suited to the complex character of biological data and is most effective for large data problems [31]. TuRF was employed because it enhances the efficiency of the widely known ReliefF method. It is a common example of the filter strategy, in which a method is used to rank and sort the dataset’s features. The TuRF approach adds an iterative component to SNP filtering, which makes it efficient: in each iteration, the insignificant, low-ranked SNPs are excluded and the remaining SNPs are re-scored. Specifically, assuming R iterations and a total of N SNPs, the N/R least discriminative SNPs are removed in each iteration. After applying TuRF, the 3,502 SNPs were reduced to 1,050. To help prevent overfitting, the best SNPs were retained and the redundant SNPs were eliminated using this feature selection technique.
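The iterative removal step can be sketched as follows. This is a simplified illustration, not the implementation used in the study: `relief_scores` is a basic Relief scorer (one nearest hit and one nearest miss per sample) standing in for full ReliefF, and `turf`, its stopping rule (`n_keep`), and the toy data are hypothetical.

```python
def relief_scores(X, y, features):
    """Simplified Relief scores over the given feature subset."""
    scores = {f: 0.0 for f in features}

    def dist(a, b):
        return sum(abs(X[a][f] - X[b][f]) for f in features)

    for i in range(len(X)):
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(i, j))
        miss = min(misses, key=lambda j: dist(i, j))
        for f in features:
            # Reward features that differ on the nearest miss, penalize the hit.
            scores[f] += abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])
    return scores


def turf(X, y, n_keep, n_iterations):
    """Re-score the surviving SNPs each round and drop the N/R lowest-ranked."""
    remaining = list(range(len(X[0])))
    per_round = max(1, len(remaining) // n_iterations)
    while len(remaining) > n_keep:
        scores = relief_scores(X, y, remaining)
        remaining.sort(key=lambda f: scores[f], reverse=True)
        n_drop = min(per_round, len(remaining) - n_keep)
        remaining = remaining[:len(remaining) - n_drop]
    return sorted(remaining)


# Hypothetical toy data: SNP 0 tracks the label, SNPs 1-3 carry no signal.
X_toy = [
    [0, 0, 1, 2], [0, 1, 2, 0], [0, 2, 0, 1], [0, 1, 1, 1],
    [2, 0, 1, 2], [2, 1, 2, 0], [2, 2, 0, 1], [2, 1, 1, 1],
]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
selected = turf(X_toy, y_toy, n_keep=1, n_iterations=4)
```

Re-scoring after every removal is the point of the iterative variant: once uninformative SNPs are gone, the nearest-neighbour distances are less dominated by noise.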

Ensemble Learning Algorithms

Ensemble learning classifiers have shown favorable results in detecting complex disorders [17]. CART, Random Forest with Gini index and permutation importance, and XGBoost techniques were used in this paper to detect the twenty highest-scoring SNPs of each method. Subsequently, these SNPs were combined to investigate their potential for identifying non-linear SNP-SNP interactions. To address the problem of unbalanced data, this study presents a framework that uses ensemble learning approaches to improve the performance metrics for the classes that include most of the participants [30].
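Combining the per-method rankings amounts to taking a set union of each method's top-k SNPs. The sketch below assumes each method's importances are available as a dictionary; the SNP identifiers, scores, and function names are hypothetical.

```python
def top_k(importances, k):
    """Names of the k features with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def union_of_top_k(rankings, k):
    """Union of the top-k features across all ranking methods."""
    selected = set()
    for importances in rankings.values():
        selected.update(top_k(importances, k))
    return sorted(selected)


# Hypothetical importance scores from four rankings.
rankings = {
    "cart": {"snp_a": 0.50, "snp_b": 0.30, "snp_c": 0.10},
    "rf_gini": {"snp_a": 0.40, "snp_c": 0.35, "snp_b": 0.05},
    "rf_permutation": {"snp_d": 0.20, "snp_a": 0.15, "snp_b": 0.01},
    "xgboost": {"snp_b": 0.60, "snp_d": 0.30, "snp_a": 0.10},
}
candidate_snps = union_of_top_k(rankings, k=2)
```

Because SNPs selected by more than one method are counted once, the union is at most (number of rankings) × k in size and usually somewhat smaller.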

The Random Forest algorithm is regarded as a black-box technique [32]. Permutation and Gini importance are used to determine the feature importance of random forests [33], and both were used in this work to identify important SNPs. A range of possible values was specified for each hyperparameter, and the RandomizedSearchCV approach was employed to tune the random forest and find the optimal values. The highest-ranking SNPs under each importance measure were then examined for their ability to detect non-linear SNP-SNP interactions.

The XGBoost algorithm was applied through XGBoost’s scikit-learn compatible API, using trees as base learners [34]. XGBoost was used in this work because it ranks SNPs by estimating feature importance from a trained predictive model. The top-ranked SNPs obtained by this technique are used to identify significant complex genes and SNP-SNP interactions that may indicate AD risk.

Classification and Regression Trees (CART), a non-parametric decision tree learning method, was also used. CART assigns importance ratings based on the decrease in the splitting criterion, such as entropy or Gini, that is used to choose split points [35]. The best split was determined to be the SNP with the lowest entropy. The feature importance property of the fitted tree was used to obtain the relative importance rating of every input SNP. The top-ranking SNPs from CART were then examined for their potential to yield high-ranked SNP-SNP interactions.
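The criterion decrease that drives these importance ratings can be shown with a short worked example. The functions below are an illustrative sketch of the standard entropy and Gini definitions; the threshold-based split on a genotype coded 0/1/2 and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(genotypes, labels, threshold, criterion=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)
    return criterion(labels) - weighted


# Hypothetical SNP whose genotype separates cases (1) from controls (0).
toy_snp = [0, 0, 1, 2]
toy_status = [0, 0, 1, 1]
gain_entropy = impurity_decrease(toy_snp, toy_status, threshold=0)
gain_gini = impurity_decrease(toy_snp, toy_status, threshold=0, criterion=gini)
```

A tree's importance for a SNP is, in essence, the sum of such decreases over every node that splits on it, so a SNP that produces purer children accumulates a higher rating.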

Multi-Locus Interaction Analysis

The highest-ranked SNPs detected by the ensemble learning methods were combined to give 76 potential predisposition SNPs. Significant epistatic interactions were then investigated by applying multi-locus interaction analysis with MDR to these SNPs. MDR combines the genotypes from two or more SNPs into a single feature with low-risk and high-risk categories. Performing an exhaustive search for epistatic interactions in a GWAS is computationally costly, particularly for high-order SNP-SNP interactions, and the number of multi-locus combinations grows rapidly with the number of markers.
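The constructive induction step of MDR for a single pair of SNPs can be sketched as follows. This is an illustrative outline rather than the software used in the study; the function names, the tie rule (a cell whose case-to-control ratio equals the overall ratio is labelled high-risk), and the handling of unseen genotype combinations are assumptions.

```python
from collections import defaultdict


def mdr_fit(snp_a, snp_b, labels):
    """Return the set of two-locus genotype cells labelled high-risk."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("both cases and controls are required")
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for a, b, y in zip(snp_a, snp_b, labels):
        counts[(a, b)][y] += 1
    high_risk = set()
    for cell, (controls, cases) in counts.items():
        # cases/controls >= n_cases/n_controls, written without division.
        if cases * n_controls >= controls * n_cases:
            high_risk.add(cell)
    return high_risk


def mdr_transform(snp_a, snp_b, high_risk):
    """Collapse two SNPs into one binary feature (1 = high-risk cell)."""
    return [1 if (a, b) in high_risk else 0 for a, b in zip(snp_a, snp_b)]


# Hypothetical epistatic pattern: risk only when the two genotypes differ.
toy_a = [0, 0, 1, 1, 0, 0, 1, 1]
toy_b = [0, 1, 0, 1, 0, 1, 0, 1]
toy_y = [0, 1, 1, 0, 0, 1, 1, 0]
cells = mdr_fit(toy_a, toy_b, toy_y)
mdr_feature = mdr_transform(toy_a, toy_b, cells)
```

Neither SNP in the toy data is associated with the label on its own, yet the combined feature separates cases from controls, which is the kind of interaction MDR is designed to expose.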

Consequently, by determining the highest-ranking SNPs, a novel framework combining ensemble learning and MDR techniques has been presented to lessen some of the MDR method’s shortcomings. MDR is an important machine learning method that identifies interactions of genetic variants in complex diseases. MDR was employed to assess two to five-way interactions. Due to the computational complexity of analyzing these SNP interaction models, finding the significant interactions was restricted to the interactions of the top twenty rankings produced by the used ensemble techniques.
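The computational argument can be made concrete by counting the candidate interaction models. The helper below is a small illustration; its name is hypothetical, and the SNP counts are taken from the text (20 SNPs per ranking, 76 after merging).

```python
from math import comb


def n_interaction_models(n_snps, orders=(2, 3, 4, 5)):
    """Number of k-way SNP combinations for each interaction order k."""
    return {k: comb(n_snps, k) for k in orders}


models_20 = n_interaction_models(20)
models_76 = n_interaction_models(76)
```

With 20 SNPs there are 190 pairs and 15,504 five-way combinations; with 76 SNPs there are 2,850 pairs and about 18.5 million five-way combinations. Running the same count on the full 3,502-SNP set shows why an exhaustive search is avoided.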

These ensemble methods produced rankings of each SNP’s contribution to AD. Merging the top twenty SNPs from each ranking gave a total of seventy-six SNPs. Statistical interaction analysis was then applied to these seventy-six SNPs to discover significant SNP-SNP interactions. The coding SNPs mapped to 38 genes, including 27 known genes associated with AD and 11 newly discovered genes, as explained in Table 1. The new genes could potentially be indicators for AD. The previously detected AD genes are reported in data sources such as Open Targets Platform and TWAS Hub – Genes [36, 37].

Table 1 AD discovered genes (framework I)

The biological significance of the newly identified genes in Table 1 is described as follows. The identified gene set comprises protein-coding genes, long non-coding RNAs, and several uncharacterized loci, underscoring the complex and polygenic nature of Alzheimer’s disease [24]. PDE1C and NSUN7 represent functionally relevant candidates, implicating cyclic nucleotide signaling and RNA methylation, respectively, in disease-associated regulatory mechanisms that may affect neuronal function and gene expression.

The presence of multiple long intergenic non-coding RNAs (LINC02880, LINC01482, and LINC01837) suggests a potential role for non-coding regulatory networks in modulating Alzheimer’s-related pathways, including neuroinflammation and transcriptional control. In contrast, several LOC-designated genes lack functional annotation and likely represent novel or low-expression transcripts identified through high-throughput analyses, requiring further validation to determine their biological relevance [25].

KRTAP27-1, a gene primarily associated with structural keratin biology, shows no known neural function, indicating that its association with Alzheimer’s disease may be indirect. Collectively, these findings highlight emerging regulatory and epigenetic components in Alzheimer’s disease beyond classical amyloid and tau mechanisms and emphasize the need for further functional studies to elucidate their roles in disease pathogenesis [37].

To discover novel SNP-SNP interactions connected to AD, an MDR-based method that makes use of ensemble learning techniques was proposed. The findings demonstrate that this framework can detect significant risk genes and SNP-SNP interactions that could aid in better investigating AD etiology.

Proposed Genetic Biomarkers Discovery (Integrated DNN-MDR) Framework II

The second proposed framework detects the most significant AD biomarkers, which is a vital starting step within a PM strategy. DNNs have shown strong results in detecting complex diseases in recent years [38]. In this paper, a DNN was combined with SHAP to discover the highest-ranked SNPs associated with AD through SNP-SNP interactions. The model was examined at the global level, which is built on aggregations of Shapley values, to investigate its general behavior. The goal of this step is to obtain the SHAP feature importance for the applied DNN [39].

The features were listed in order of decreasing significance. After that, multi-locus interaction analysis was performed on these detected SNPs using MDR to discover vital SNP-SNP interactions. The prediction accuracy of SNP-SNP interactions was evaluated using 10-fold cross-validation. Finally, the presented framework shows important complex genes and SNP-SNP interactions.

Deep Neural Network (DNN)

Complex non-linear relationships, or non-linearity in the genotype-phenotype link, can be handled using a DNN. DNN is a robust machine learning method that has drawn much attention for its achievements in several tasks in the biomedical domain [40]. To improve AD diagnosis, a fully connected six-layer network structure was employed in the proposed work. The network was built with one input layer, a first hidden layer (h1) and a second hidden layer (h2) with dropout, and one output layer. Two hidden layers were used, since adding more hidden layers might increase the model’s complexity and occasionally induce overfitting.

First, the input layer receives rows of data with 3,502 attributes, matching the dimensionality of the feature set. The second hidden layer has 876 nodes and uses the ReLU activation function. Finally, the output layer consists of a single node that uses the sigmoid activation function. Figure 2 shows the graph of the proposed DNN model. When the DNN was applied to the ADNI dataset, the model achieved a classification accuracy of 70.53%, recall of 98%, precision of 66%, and F1-score of 79%.
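The forward pass of such a network at inference time can be sketched in a few lines. This is a toy illustration, not the trained model: the layer sizes are deliberately tiny rather than the 3,502 inputs and 876 second-layer nodes reported above, the weights are random, the ReLU activation in the first hidden layer is an assumption (the text specifies it only for the second), and dropout is omitted because it is inactive at inference.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights in a small range and zero biases (hypothetical init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(genotype_row, layers):
    """Input -> h1 (ReLU, assumed) -> h2 (ReLU) -> single sigmoid output."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(dense(genotype_row, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return sigmoid(dense(h2, w3, b3)[0])


rng = random.Random(0)
n_inputs, n_h1, n_h2 = 8, 6, 4  # toy sizes, not those of the paper
layers = [init_layer(n_inputs, n_h1, rng),
          init_layer(n_h1, n_h2, rng),
          init_layer(n_h2, 1, rng)]
probability = forward([0, 1, 2, 0, 1, 1, 2, 0], layers)
predicted_case = probability >= 0.5
```

The sigmoid output is read as the probability of the AD class, and thresholding it at 0.5 gives the binary prediction from which accuracy, recall, and precision are computed.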

Fig. 2

Model Interpretability With SHAP

In this study, SHAP was utilized to assess SNPs from a DL algorithm trained on ADNI data that might contribute to risk of AD through SNP-SNP interactions. SHAP was employed to provide a numerical value of credit to every input feature [41]. Each feature is given an importance rating for a specific prediction using SHAP technique, which unifies explanations of predictions. Finding the significant features with large absolute Shapley values is the primary objective of utilizing SHAP feature importance.
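To make the notion of credit assignment concrete, the sketch below computes exact Shapley values for a tiny model by enumerating every feature coalition. It is an illustration of the underlying definition, not of the SHAP library, which relies on approximations for models with thousands of inputs. The toy model, the all-zero baseline, and the function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    d = len(x)

    def value(subset):
        masked = [x[j] if j in subset else baseline[j] for j in range(d)]
        return model(masked)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(g):
    # Hypothetical risk score: SNP 0 and SNP 1 interact, SNP 2 acts additively.
    return g[0] * g[1] + 0.5 * g[2]


phi = shapley_values(toy_model, x=[1, 1, 1], baseline=[0, 0, 0])
```

In this example the interaction term is split equally between the two interacting SNPs, and the three values sum to the difference between the model output at `x` and at the baseline.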

In this work, interactions between SNPs from a deep learning model that may increase the risk of AD were evaluated using SHAP. Global XAI has been employed to show the effect of the input variables as an aggregate to the entire dataset. Figure 3 demonstrates the top twenty features that help DNN build decisions. The average Shapley value for each feature across all instances is displayed in this figure.
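A global ranking of this kind is commonly obtained by averaging the absolute Shapley values of each feature over all instances. The sketch below assumes the per-instance values are already available as a matrix (rows are participants, columns are SNPs); the numbers, SNP names, and function name are hypothetical, and whether the figure uses the mean absolute value or another aggregate is not stated in the text.

```python
def global_shap_ranking(shap_matrix, feature_names, top_k=20):
    """Rank features by mean absolute Shapley value across instances."""
    n = len(shap_matrix)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical Shapley values for three participants and three SNPs.
toy_shap = [
    [0.20, -0.10, 0.00],
    [-0.40, 0.10, 0.05],
    [0.30, -0.30, 0.00],
]
toy_names = ["snp_1", "snp_2", "snp_3"]
ranking = global_shap_ranking(toy_shap, toy_names, top_k=2)
```

Taking absolute values before averaging keeps a SNP that pushes some predictions up and others down from cancelling itself out in the ranking.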

Fig. 3

Multi-Locus Interaction Analysis Using MDR

This work proposes an unconventional technique to minimize some of the drawbacks of the MDR method by identifying the top-ranking SNPs through the integration of DNN, SHAP, and MDR. The aim of the presented framework is to detect complex genes and important SNP-SNP interactions associated with AD. First, the DNN was combined with SHAP to identify the 20 highest-ranked SNPs. After that, pairwise, three-way, four-way, and five-way interaction models related to AD were detected using MDR, and statistical interaction analysis of these twenty SNPs was performed to discover the most important SNP-SNP interactions. The coding SNPs mapped to eleven genes, comprising six known complex genes and five novel genes, as shown in Table 2.

Table 2 AD discovered genes (framework II)

The biological significance of the newly identified genes in Table 2 is described as follows. The newly identified genes include both protein-coding and non-coding elements with potential regulatory and metabolic relevance [25]. LINC01830 is a long intergenic non-coding RNA that may play a role in transcriptional and epigenetic regulation, processes frequently implicated in complex diseases. AGBL1 is involved in post-translational modification of tubulin and contributes to microtubule stability and intracellular transport, suggesting a role in cellular structure and neuronal function.

IZUMO4, a member of the IZUMO gene family, is associated with membrane interaction and cell–cell communication, indicating possible involvement in signaling or fusion-related mechanisms. SLC13A3 encodes a sodium-dependent dicarboxylate transporter that regulates cellular uptake of key metabolic intermediates, linking it to energy metabolism and oxidative stress pathways [36].

LOC101929507 represents a poorly characterized genomic locus that may function as a regulatory element or non-coding RNA. Collectively, these genes are associated with regulatory, structural, and metabolic processes, supporting their potential biological relevance and warranting further functional investigation [37].

Evaluation Criteria

Classification accuracy, recall, precision, and F1-score were used to assess the prediction performance of the DNN model and the employed ensemble learning techniques in order to identify the top SNPs. These SNPs were then examined for their contribution to AD risk through interactions. The performance of the predictive SNP-SNP interactions was assessed using 10-fold cross-validation on the training and testing datasets. In this work, balanced accuracy (BA) was selected as the metric of model fit for all cross-validation experiments [42].
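The evaluation metrics can be written out from a binary confusion matrix. The sketch below uses the standard definitions; the function names and the confusion counts are hypothetical, while the precision and recall passed to `f1_from_precision_recall` are the DNN figures reported earlier.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics; assumes every denominator is non-zero."""
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for an imbalanced test set.
metrics = classification_metrics(tp=90, fp=20, tn=30, fn=10)

# Consistency check on the reported DNN figures (precision 66%, recall 98%).
reported_f1 = f1_from_precision_recall(0.66, 0.98)
```

The reported precision and recall give an F1-score of about 0.79, in agreement with the stated 79%. In the hypothetical counts, accuracy is 0.80 while balanced accuracy is 0.75, which illustrates why BA is the more conservative choice when cases outnumber controls.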
