BrainInsights: a comprehensive framework for pre-processing, analysis, and interpretation of neuroimaging data using traditional statistics and machine learning

Abstract

Neuroimaging provides an in-depth understanding of brain structure and function, yet the complexity of the data poses significant analytical challenges. Current frameworks suffer from limited scalability, poor integration with traditional statistics, and the need for a programming background, all of which keep researchers from focusing on neuroscience questions. To address these limitations, we present BrainInsights, an integrated and automated GUI-based pipeline ecosystem designed to facilitate the flexible analysis of multi-modal or multi-parametric neuroimaging data. The framework comprises three core tools: MARIA (MAgnetic Resonance Imaging data Analysis and inspection tool) for data inspection and hypothesis testing, ML Pipeline for automated feature selection and model construction, and ML DaViz for model evaluation and bio-signature generation. Deployed as a Singularity container, the system ensures reproducibility and scalability across computing environments. We validated BrainInsights using diverse datasets, including multi-parametric MRI studies of Anorexia Nervosa, Crohn’s disease, and Rheumatoid Arthritis. Specifically, the framework distinguished young Anorexia Nervosa patients from controls with a balanced accuracy of 65%, while in the PreCePRA trial, it predicted Rheumatoid Arthritis treatment response with a balanced accuracy of up to 95.4% using functional pain markers. The results demonstrate the ability of the framework to separate subgroups and treatment outcomes, and to bridge hypothesis-driven statistical analysis with data-driven machine learning analysis. By enabling interpretability tools like SHAP, BrainInsights empowers researchers to move beyond “black-box” modeling to uncover stable, biologically plausible bio-signatures. Ultimately, this framework aids in accelerating the translation of complex neuroimaging data into meaningful clinical insights.

1 Introduction

Neuroimaging is an essential tool in neuroscience, helping researchers understand the brain’s complexity and its intricate network of neurons. Neuroimaging modalities such as magnetic resonance imaging (MRI), positron emission tomography (PET), and electroencephalography (EEG) provide distinct perspectives on the brain’s anatomical structure, function, and connectivity. These tools are powerful for neuroscience research and are instrumental in studying both brain function and a range of neurological and psychiatric conditions (Biessmann et al., 2011; Tulay et al., 2019). Multi-modal imaging, i.e., integrating different neuroimaging modalities, can provide a more comprehensive understanding of the brain and enhance the insights gained from neuroimaging studies (Biessmann et al., 2011). However, multi-modal neuroimaging data analysis comes with a significant challenge: the resulting dataset is more complex and thus requires more sophisticated analysis to process it and extract meaningful information.

Current machine learning (ML) frameworks used in neuroimaging, such as WEKA (Frank, 2005) and RapidMiner (Mierswa et al., 2006), face several limitations. These platforms often struggle with scalability when dealing with large neuroimaging datasets and suffer from reproducibility issues when analyses are moved between computing environments. Furthermore, they typically require significant programming expertise, posing a barrier for researchers who wish to focus on neuroscience questions rather than technical implementation. A prominent deficiency is the poor integration of traditional statistical tests with modern ML methods; another is the absence of integrated, intuitive visualization tools (such as chord or connectome plots) to convey findings effectively.

This paper presents BrainInsights, an integrated GUI-based framework designed to overcome the above challenges and facilitate a flexible and systematic analysis of multi-parametric/multi-modal neuroimaging data. BrainInsights is a collection of three powerful tools:

MARIA (a preliminary data analysis and visualization tool).

ML Pipeline (machine learning framework comprising various algorithms and feature selection algorithms that run on multi-modal/multi-parametric or subsets of data).

ML DaViz (an evaluation tool for different feature selection algorithms and classification algorithms obtained from the machine learning framework).

It is important to clarify that BrainInsights is designed to operate on derived, feature-extracted neuroimaging data in tabular formats rather than on raw voxel-level images. The framework does not perform initial raw image pre-processing (such as motion correction, skull stripping, or registration) but instead ingests the statistical outputs of established pipelines such as Freesurfer, VBM, or DBM.

Together these tools seamlessly address the issues of traditional statistical analysis, hypothesis generation and testing, feature selection, building machine learning models, and visualization of individual features and classification results. In addition to providing advanced data exploration and hypothesis testing, this novel framework enables the generation of interpretable and actionable insights from neuroimaging data. It bridges the gap between traditional statistical analysis and modern machine learning (ML) techniques in an easy-to-use GUI-based interface.

The primary goal of BrainInsights is to simplify the data analysis process and enable researchers to uncover hidden patterns in data and generate characteristic “bio-signatures” for specific cohorts and/or treatment success. In this context, a bio-signature is a distinct and measurable set of features, derived from brain structure, function, or clinical data, that collectively serves as a signature for a specific condition, such as differentiating treatment responders from non-responders. Depending on the research question, the framework allows for traditional hypothesis-driven statistical assessments but also provides methods for data-driven evaluation, thereby leading to new hypotheses and findings. The framework has been designed to be adaptable: it can incorporate multiple neuroimaging modalities and parameters as well as clinical data from multi-center, multi-measurement (repeated measurements at different time points) studies, for both animal and human data. It provides the ability to test datasets with multiple feature selection and ML classification algorithms, cross-evaluate them, and establish bio-signatures, e.g., for different diseases or to differentiate responders from non-responders to drugs. Individual features, atlases, or modalities can be inspected and evaluated interactively through the available plots or as tables. MARIA and ML DaViz are GUI-based, while the ML Pipeline is completely automated via scripts. High-quality output is available as images and structured text. The framework is implemented and deployed as a Singularity container, making it reproducible, platform-independent, and easily scalable on HPC clusters.

To demonstrate the capabilities of BrainInsights, we conducted extensive experiments using several datasets, including a multi-centric multi-parametric MRI PreCePRA dataset with additional clinical data (Hess et al., 2025; Schenker et al., 2021), a multi-parametric MRI Anorexia dataset (Mendez-Torrijos et al., 2024), a multi-parametric MRI Crohn’s disease study (Hess et al., 2015), among others. The results exemplify our framework’s versatility and robustness and highlight its potential to advance the neuroimaging data analysis landscape for both human and animal studies.

2 Framework design, data structure, and analysis workflow

2.1 BrainInsights overview

The BrainInsights workflow begins by ingesting tabular data in common formats such as Excel, CSV, or SYLK, as produced by any standard analysis pipeline for the data in use. The framework automatically converts these inputs into a unified, structured format suitable for use across all its components.

While designed for neuroimaging, the architecture is data-agnostic and can process any tabular data, including detailed clinical or demographic information. A key feature enhancing this flexibility is the optional external group assignment file. This allows users to dynamically define or update subgroups for analysis, for example based on age, gender, handedness, treatment success, or combinations thereof. New group definitions can be added as columns, enabling efficient re-analysis of the data without re-importing or re-processing the entire dataset.
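The role of the group assignment file can be illustrated with a short sketch. It is written in Python purely for illustration (the framework itself is implemented in R), and the column names `subject_id`, `diagnosis`, and `age_group`, the helper functions, and all values are hypothetical. It only shows the general idea of attaching group labels to already-imported data, not the framework's actual implementation.

```python
import csv
import io

# Hypothetical group assignment table; in practice this would be a file on disk.
GROUP_FILE = """subject_id,diagnosis,age_group
S01,AN,Young
S02,HC,Young
S03,AN,Adult
"""

# Already-imported feature data (subject -> feature values); values are made up.
features = {
    "S01": {"total_gray_volume": 610.2},
    "S02": {"total_gray_volume": 655.8},
    "S03": {"total_gray_volume": 590.4},
}


def load_groups(text):
    """Read the group assignment table into {subject_id: {column: label}}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["subject_id"]: {k: v for k, v in row.items() if k != "subject_id"}
            for row in reader}


def assign_groups(features, groups, columns):
    """Label each subject by combining one or more group columns (e.g. Young_AN)."""
    labels = {}
    for subject in features:
        if subject in groups:  # subjects without an assignment stay unlabeled
            labels[subject] = "_".join(groups[subject][c] for c in columns)
    return labels


groups = load_groups(GROUP_FILE)
by_diagnosis = assign_groups(features, groups, ["diagnosis"])
by_age_and_diagnosis = assign_groups(features, groups, ["age_group", "diagnosis"])
print(by_age_and_diagnosis)
```

Because the labels live outside the feature data, switching from one grouping to another only requires re-reading the small assignment table; the feature data itself is untouched.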

Figure 1 provides a schematic overview of the BrainInsights framework. The specific processes within each of the four color-coded components are detailed in the flowchart in Figure 2.

Flowchart depicting a machine learning workflow starting from data import, splitting into modules for MARIA, ML pipeline, and ML DaViz, with parallel stages for data gathering, preprocessing, statistical testing, hypothesis testing, data inspection, model configuration, feature comparison, and bio-signature building, all interconnected by arrows to indicate process flow.

Workflow of the BrainInsights framework: the diagram illustrates the integrated structure of the framework. The left side outlines the major components to provide a higher-level overview, while the right side details the key processes within each module. Dotted lines indicate the flow of data, and solid lines represent the conceptual workflow. The framework starts with data import and pre-processing (blue) and branches into three main modules: MARIA for statistical analysis and hypothesis testing (red), ML Pipeline for machine learning model construction (green), and ML DaViz for model evaluation and bio-signature generation (purple).

Flowchart illustrating the BrainInsights platform, divided into Data Import, ML Pipeline, ML DaViz, and MARIA modules with sequential steps for data processing, machine learning, data visualization, statistical assessment, correlation analysis, and dimensionality reduction, represented with labeled boxes and connecting arrows.

BrainInsights overview: a detailed flowchart of each of the four parts of the framework. The Data Import section imports raw datasets, applies the necessary imputation and filtering, and saves the data. The MARIA section can use this data for inspection and hypothesis testing in the form of plots and tables. The ML Pipeline uses the data from the data import step to build ML models after feature selection. The results from the ML Pipeline can be inspected and analyzed with ML DaViz. The pipelines inside the larger blue box (Data Import and ML Pipeline) are accessible via scripts and may take longer to run, fitting a “do it and go home” batch process. The tools in the large red box (MARIA and ML DaViz) are interactive and part of a GUI, allowing for easy exploration and testing.

DATA IMPORT (Blue Section): Handles the initial data import and pre-processing steps, including data cleaning and harmonization.

MARIA (Red Section): MAgnetic Resonance Imaging data Analysis and inspection tool is used for data inspection and hypothesis testing. It performs traditional statistical analyses, group comparisons, dimensionality reduction, and clustering on pre-processed data.

ML PIPELINE (Green Section): This is the core Machine Learning engine. It automates key steps including data pre-processing, filtering, hyperparameter tuning, feature engineering, and building predictive models.

ML DaViz (Purple Section): Machine Learning Analysis and Data Visualization tool takes the output from the ML Pipeline to evaluate model performance and interpret results. The workflow allows users to compare feature distributions and build bio-signatures using a variety of visualizations including violin, chord and spider plots to highlight the most significant features in the data.

The specific input and output specifications for each of these modular components are detailed in Supplementary Table 1.

2.2 Target user profiles and accessibilities

BrainInsights is designed to be accessible to a wide range of researchers, regardless of their programming background. To achieve this, we focus on two distinct ways people interact with the framework.

2.2.1 For clinical and neuroscience researchers

The MARIA and ML DaViz modules are designed to let users focus on science rather than on writing code. These components are built as interactive Shiny-based GUIs that function entirely through a “point-and-click” interface. A detailed visual guide to the GUI navigation and functional modules is provided in the Supplementary Methods (cf. Supplementary Figure 1). This allows researchers to perform traditional statistics, visualize data distributions, generate hypotheses (e.g., by dimensionality reduction), and interpret machine learning results without needing any programming expertise.

2.2.2 For data scientists and ML practitioners

For users who need to scale up their analysis, the ML Pipeline provides an automated approach for batch processing. While the specific analysis steps are managed through a single Excel configuration file, executing the pipeline does require a basic comfort level with a Linux terminal. This setup is designed to run with Singularity containers, making it easy to manage large-scale jobs on High-Performance Computing (HPC) clusters.

2.3 BrainInsights environment and workflow

All tools and scripts were developed using the R programming language V4.3.1 (R Core Team, 2021). The R packages used are listed in Table 1. To ensure that the results are reproducible and the software remains platform-independent, the entire implementation is provided as a Singularity container (Kurtzer et al., 2017). This allows the tools to scale seamlessly from local workstations to High-Performance Computing (HPC) clusters.

Data manipulation: dplyr (Wickham et al., 2018), tidyr (Wickham et al., 2024), stringr (Wickham, 2023)

Visualization: ggplot2 (Wickham, 2016), plotly (Sievert, 2020), heatmaply (Galili et al., 2017), fmsb (Nakazawa, 2024), ggbeeswarm (Clarke and Sherrill-Mix, 2024), circlize (Gu et al., 2014)

Imputations: VIM (Kowarik and Templ, 2016), mice (Zhang, 2016)

Dimensionality reduction, machine learning and feature selection: caret (Kuhn, 2008), Boruta (Kursa and Rudnicki, 2010), mixOmics (Rohart et al., 2017), Recursive Feature Elimination (Chen and Jeong, 2007), RandomForest (Liaw and Wiener, 2002), Hmisc (Harrell, 2024), Rtsne (Krijthe et al., 2018), uwot (Melville, 2025), e1071 (Dimitriadou, 2009), rSimca (Filzmoser and Todorov, 2013; Filzmoser and Vlaskova, 2022), rferns (Kursa, 2014), XGBoost (Chen et al., 2019), catboost (Dorogush et al., 2018), nnet (Ripley et al., 2016)

Parallel processing: purrr (Wickham and Henry, 2023), foreach (Microsoft and Weston, 2022), doParallel (Corporation and Weston, 2022)

Miscellaneous: shiny (Chang et al., 2023)

Overview of primary R libraries utilized within the BrainInsights ecosystem for data processing, machine learning and visualization.

The workflow is designed to move from script-based data preparation to interactive analysis:

Data Ingestion: While the analysis modules are interactive, the initial data import and pre-processing steps are handled via R scripts. This means a basic understanding of programming is required at the outset to convert raw, feature-extracted neuroimaging data into the framework’s structured format. Additionally, as BrainInsights is dedicated to downstream analysis, it is expected that raw image processing (including motion correction, skull stripping, and registration) has been performed externally using modality-specific pipelines such as fMRIPrep, Freesurfer, or VBM. Once this is complete, the data can be used directly in both the MARIA and ML Pipeline modules.

The Machine Learning Cycle: Machine learning within the framework is a two-stage process. First, the ML Pipeline executes the computationally heavy tasks, such as feature engineering, hyperparameter tuning, and model building. Once processing is complete, the results are explored in ML DaViz for evaluation and visual interpretation of bio-signatures.

2.4 Software verification and quality assurance

To ensure the reliability of the framework’s core analytical engine, we implemented a structured verification process. Individual R functions, specifically those responsible for data cleaning, curation, and missing value imputation, were verified using the testthat package (Wickham, 2011). The unit testing suite was designed to confirm that the software handles edge cases, such as unexpected non-numeric entries or features with high missingness, without compromising the integrity of the analysis pipeline. Code coverage was measured using the covr package (Hester, 2023), and the analysis demonstrates a high degree of testing density across the framework. The Data Import module achieves 95.11% coverage, MARIA 93.08%, ML DaViz 93.32%, and the ML Pipeline 90.11%.

Furthermore, the deployment of BrainInsights within a Singularity container functions as environment-level verification. By freezing the versions of all R package dependencies, we eliminate dependency drift and ensure that the software behaves identically across different computing platforms, from local machines to HPC clusters. This is complemented by the inclusion of a sample configuration file (provided as a converted PDF of the native YAML settings), which details the exact hyperparameters, random seeds, and algorithms used, allowing researchers to replicate the reported classification outcomes precisely. Reliability is further supported by the iterative validation described in Case Study 3, where ML Pipeline outputs are visually cross-checked in MARIA to identify potential technical artifacts or systematic processing errors.

2.5 Data import

2.5.1 Data structure and compatibility

BrainInsights is designed to be data-agnostic, ingesting processed tabular data instead of raw voxel-level images or region-by-time-series matrices. The framework assumes that raw neuroimaging scans have already been processed using modality-specific software (e.g., Freesurfer, VBM, or fMRIPrep) to generate derived features. To accommodate the wide range of supported multi-modal and multi-parametric imaging data and the numerous atlases available for each modality (for example, those provided by Freesurfer (Fischl and Dale, 2000; Fischl et al., 2002) for MRI data), we use three primary organizational templates for input:

Region-based Metric Tables: This format handles anatomical measurements (VBM, Ashburner and Friston, 2000; DBM, Davatzikos et al., 1996; Freesurfer), DTI, and functional summary maps such as ALFF and ReHo. Each row corresponds to a regional label from a specific atlas, while columns represent extracted measures such as volume, surface area, or mean intensity. For resting-state fMRI, the framework expects these pre-calculated regional summary statistics or flattened connectivity vectors rather than raw nodal time-series.

Graph Theoretical Parameters: Network-based data is organized into two distinct structures. Global metrics are formatted with rows representing a threshold or density range and columns representing network-wide properties like small-worldness or total edge count. Nodal parameters follow the region-based format, where rows are brain structures and columns are local metrics like clustering coefficients or betweenness centrality.

Clinical and Demographic Data: These Excel files (.xlsx) can be imported with features as rows (one file per participant) or as combined tables where columns represent specific variables like age, blood parameters or other clinical scores.

Regardless of the initial input template, the framework automatically flattens and concatenates these files into a unified subject-by-feature matrix. The specific input and output requirements for using each module as a standalone tool are summarized in Supplementary Table 1. The framework uses an early fusion strategy, in which cross-modal features are integrated into a single high-dimensional space prior to feature selection and model training. This design choice enables the algorithms to identify complex, non-linear interactions across different imaging modalities and clinical markers that might be overlooked if each subset were analyzed in isolation. While early fusion is effective for identifying unified bio-signatures, we are evaluating late fusion or hybrid strategies for future iterations to better handle datasets characterized by asymmetries in feature dimensionality or varying information density across modalities. In the structured output, each row represents a single participant and each column represents a unique “feature” defined by the combination of the modality/pipeline, atlas, brain structure, and measure.

For example:

This standardized schema ensures that the components of BrainInsights remain modular; while they are part of an integrated ecosystem, each tool can be used independently. The structured data can be saved with the lightweight RDS (R data structure) or high-speed Feather formats.
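The flattening and early fusion described in this section can be sketched as follows. This is an illustrative Python sketch, not the framework's R code: the separator `|`, the atlas label `DK`, the modality label `RSFMRI`, the function names, and all values are assumptions chosen only to show how a feature name could combine modality, atlas, structure, and measure.

```python
# Hypothetical region-based metric table for one subject:
# rows are atlas regions, columns are extracted measures.
region_table = {
    "inferiorparietal": {"volume": 12450.0, "thickness": 2.41},
    "precuneus": {"volume": 9870.0, "thickness": 2.35},
}


def flatten_table(modality, atlas, table, sep="|"):
    """Turn one region-by-measure table into a flat {feature_name: value} dict."""
    flat = {}
    for structure, measures in table.items():
        for measure, value in measures.items():
            name = sep.join([modality, atlas, structure, measure])
            flat[name] = value
    return flat


def fuse(subject_tables):
    """Early fusion: concatenate all flattened tables of a subject into one row."""
    row = {}
    for modality, atlas, table in subject_tables:
        row.update(flatten_table(modality, atlas, table))
    return row


# One subject with two made-up sources: an anatomical table and a
# functional summary table.
row = fuse([
    ("ANATOMY", "DK", region_table),
    ("RSFMRI", "DK", {"precuneus": {"alff": 0.87}}),
])
for name in sorted(row):
    print(name, row[name])
```

Stacking one such row per participant yields the subject-by-feature matrix; because every source ends up in the same row, feature selection sees all modalities at once.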

2.5.2 Data pre-processing pipeline

The data pre-processing pipeline begins with a quality assessment followed by a data curation step designed to correct errors, handle NA and NaN values, and remove unexpected non-numeric entries, thereby ensuring data integrity. Missing data is managed in a two-step process. First, as an optional step to avoid introducing analytical bias, features with a high percentage of missing values (e.g., > 30%) can be dropped. Second, for the remaining features, several imputation methods are available. While the choice is user-dependent, the pipeline provides guidance, generally recommending simpler methods such as mean/median imputation for lower rates of missingness and more advanced methods, such as K-Nearest Neighbors imputation based on feature similarity or Multivariate Imputation by Chained Equations (MICE), for more complex scenarios. Specific recommendations for handling various rates of missingness are detailed in Supplementary Table 2.
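The two-step handling of missing data can be sketched as below. This is a minimal Python illustration under stated assumptions, not the framework's R implementation: the function names and data are hypothetical, the 30% threshold is the example value from the text, and simple mean imputation stands in for the more advanced options (KNN, MICE) that the pipeline also offers.

```python
from statistics import mean


def to_number(value):
    """Coerce an entry to float; non-numeric or NaN entries become None (missing)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if number != number else number  # NaN is the only value != itself


def curate_and_impute(columns, max_missing=0.30):
    """Drop features above the missingness threshold, mean-impute the rest."""
    result = {}
    for name, raw_values in columns.items():
        values = [to_number(v) for v in raw_values]
        observed = [v for v in values if v is not None]
        missing_rate = 1 - len(observed) / len(values)
        if not observed or missing_rate > max_missing:
            continue  # feature dropped: too many missing entries
        fill = mean(observed)
        result[name] = [fill if v is None else v for v in values]
    return result


# Made-up feature columns (one value per subject).
columns = {
    "volume": [10.0, "n/a", 12.0, 14.0],   # 25% missing -> imputed
    "thickness": [2.1, None, "NaN", 2.5],  # 50% missing -> dropped
    "area": ["3.0", 4.0, 5.0, 6.0],        # numeric string is accepted
}
clean = curate_and_impute(columns)
print(clean)
```

Curation runs before the missingness check on purpose: an unexpected text entry counts as missing, so it contributes to the decision whether a feature is dropped.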

Following imputation, optional data scaling and normalization can be applied. The available methods include standard scaling, robust scaling, min-max scaling, and logarithmic or Box-Cox transformations. The mathematical formulas and typical applications are summarized in Supplementary Table 3. Additionally, a specialized within-subject normalization option is available that can be particularly helpful for longitudinal data analysis. This approach applies the selected scaling method to the repeated measurements within each subject, controlling for individual baseline differences. In cases where a subject has only a single measurement, normalization is instead performed using the group average. Furthermore, all imputation and scaling methods can be applied flexibly, either across the entire dataset or within various subgroups.
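Within-subject normalization can be illustrated with standard scaling as the selected method. The Python sketch below is only an illustration under assumptions, not the framework's R code: the function names and values are hypothetical, and the fallback for single-measurement subjects is interpreted here as scaling with the mean and standard deviation of all measurements in the group, which is one possible reading of "group average".

```python
from statistics import mean, pstdev


def zscore(values, center, spread):
    """Standard scaling; a zero spread maps every value to 0.0."""
    return [(v - center) / spread if spread > 0 else 0.0 for v in values]


def within_subject_scale(measurements):
    """Scale each subject's repeated measurements by that subject's own mean and SD.

    Subjects with a single measurement fall back to the mean and SD of all
    measurements in the group (an assumed reading of "group average").
    """
    all_values = [v for values in measurements.values() for v in values]
    group_mean, group_sd = mean(all_values), pstdev(all_values)
    scaled = {}
    for subject, values in measurements.items():
        if len(values) > 1:
            scaled[subject] = zscore(values, mean(values), pstdev(values))
        else:
            scaled[subject] = zscore(values, group_mean, group_sd)
    return scaled


# Made-up values of one feature at up to two time points per subject.
measurements = {"S01": [10.0, 14.0], "S02": [20.0, 22.0], "S03": [16.0]}
scaled = within_subject_scale(measurements)
print(scaled)
```

In this toy example S01 and S02 have very different baselines, yet both map to the same scaled trajectory, which is the intended effect of removing individual baseline differences.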

2.5.3 Practical implementation guidelines

BrainInsights was designed with practical considerations in mind, to streamline research workflows and support robust analysis, as outlined below.

2.5.3.1 Standard workflow procedures

The data import process is straightforward. The data import pipeline converts feature-extracted neuroimaging data from different tabular formats into a standardized R data structure. This standardized format works seamlessly with the other tools within the BrainInsights framework. Additionally, an external group assignment file in Excel format supports dynamic group analyses without repeatedly reloading the data.

2.5.3.2 Troubleshooting protocols

The data import pipeline is equipped with an extensive logging feature that helps troubleshoot issues with the data. During the import process, detailed quality statistics are generated to help flag potential issues early. After import, users can quickly inspect the data using MARIA to visually detect outliers or unexpected distributions that may point to data quality problems, as demonstrated in Case Study 3.

2.5.3.3 Best practice recommendations

For optimal performance, we recommend the following:

Ensure that all raw datasets are consistent in structure and in the feature extraction methods used. Where possible, register the data to a common space before feature extraction, since features are difficult to combine when each parametric neuroimaging dataset is in a different space.

Use a consistent naming convention across all data sources; the convention itself can be user-defined.

Choose imputation methods carefully to avoid introducing bias into the datasets. For low missing rates, simple imputation methods may suffice. However, as the missing rates increase, advanced methods like MICE can provide more accurate estimates.

Leverage the group assignment file for efficient and flexible subgroup analyses without having to reload and reprocess datasets.

2.6 MARIA (MAgnetic Resonance Imaging data Analysis and inspection tool)

MARIA is an R/Shiny-based GUI tool designed for the comprehensive analysis and visualization of feature-extracted neuroimaging data. It integrates data inspection, traditional statistical analysis, and advanced visualization techniques into a single, user-friendly platform.

2.6.1 Data inspection and visualization

2.6.1.1 Basic visualization plots

To facilitate understanding of the dataset’s structure, MARIA includes standard plot types such as line plots, histograms, density plots, scatter plots (2D/3D), bee swarm plots and whisker plots. These visualizations are widely used in neuroimaging research for examining temporal changes, distributions, relationships between variables, and between-group differences.

These tools allow users to quickly identify patterns, outliers, or potential issues in the data.

Figure 3 showcases examples of these plots.

Four-panel data visualization comparing Adult_AN, Adult_HC, Young_AN, and Young_HC groups: A shows a line plot of Inferiorparietal Volume across two measurements with error bars; B displays a whisker plot of Total Grey Volume by category; C presents density curves for Inferiorparietal Volume by category; D features a scatter plot of BMI versus Total Grey Volume with color-coded trend lines for each group.

Basic visualization plots in MARIA. (A) A line plot displaying mean values with standard error of the mean (SEM) of inferior parietal volume across two measurement points (M01 and M02) for all four groups. (B) The density distribution of inferior parietal volume measurements across all groups, highlighting the different patterns between adult and young subjects with and without anorexia. (C) Whisker plots comparing the total gray volume distribution across the four subject categories (Adult_AN, Adult_HC, Young_AN, Young_HC), with individual data points overlaid to show the spread within each group. (D) A scatter plot with total gray volume on the x-axis and BMI on the y-axis, with regression lines for each group showing the relationship between brain volume and BMI.

2.6.1.2 Advanced plots

Beyond basic inspection, MARIA enables more integrated views assessing the relationships across multiple features or dimensions:

Correlation Plots (Auto and Cross-Correlation): Used to examine intra- and inter-dataset relationships (e.g., structural–structural or structure–clinical parameter associations).

Spider (Radar) Plots: Interactive visualizations for comparing multiple metrics (e.g., mean, percentage change) across selected features or ROIs, useful in summarizing outputs from statistical or machine learning pipelines.

Figure 4 presents examples of advanced plots.

Panel A displays a table summarizing T-test results comparing anorexia patients and healthy controls across multiple imaging features, while Panel B presents a summary table listing feature categories such as modality, atlas, structure, and measure. Panel C is a heatmap showing a measure auto-correlation plot that differentiates healthy controls and anorexia patients with a color scale indicating correlation intensity. Panel D is a cross-correlation heatmap linking imaging features to clinical measures, with color intensity representing correlation strength.

Traditional statistical analysis and correlation plots. (A) A detailed t-test table comparing specific brain measurements between anorexia and healthy control groups, with significant differences highlighted in various brain structures. (B) A summary of features tested, with the most prominent features (ANATOMY, T1_GRAYWHITE, INFERIOR PARIETAL, and INTENSITY-STD) highlighted in red ovals, which correspond to measures also utilized in Figure 3. (C) An autocorrelation plot with healthy controls in the top triangle and anorexia patients in the bottom triangle, demonstrating how ANATOMY T1 GW INTENSITY-STD, i.e., intensity standard deviation, correlates within each group. (D) A cross-correlation heatmap showing relationships between clinical scores and ANATOMY T1 GW INTENSITY-STD, further illustrating pattern differences between groups. This analysis builds upon the visualization in Figure 3 by quantifying the statistical significance of the observed group differences.

2.6.2 Traditional statistical analysis

2.6.2.1 Correlation analysis

MARIA provides tools to calculate and visualize correlation coefficients, which measure the strength and direction of relationships between variables. The tool presents correlation tables with p-values and supports multiple p-value correction options to ensure the robustness of the findings.
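As an illustration of auto- and cross-correlation between feature sets, the following Python sketch computes Pearson coefficients. It is not MARIA's implementation: the function names and data are hypothetical, Pearson is assumed as the coefficient, and p-values and their correction are omitted for brevity.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cross_correlation(features_a, features_b):
    """Correlate every feature in one set with every feature in another."""
    return {(name_a, name_b): pearson(xa, xb)
            for name_a, xa in features_a.items()
            for name_b, xb in features_b.items()}


# Made-up values for four subjects.
imaging = {"gray_volume": [600.0, 620.0, 640.0, 660.0],
           "intensity_std": [5.0, 4.0, 6.0, 3.0]}
clinical = {"bmi": [15.0, 16.0, 17.0, 18.0]}

auto = cross_correlation(imaging, imaging)    # auto-correlation: a set against itself
cross = cross_correlation(imaging, clinical)  # cross-correlation: imaging vs. clinical
print(cross)
```

Computing the same matrix separately for two groups (for example patients and controls) gives the two halves that a split heatmap would compare.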

Figure 4C illustrates how these plots can be used to compare correlation patterns between different groups (e.g., anorexic patients vs. healthy controls). Figure 4D further shows how correlation analysis can be combined with hierarchical clustering to reveal patterns in complex datasets, for example, relationships between voxel intensity standard deviations across anatomical structures and clinical measures.

2.6.2.2 T-test analysis

T-tests (paired and homo-/heteroscedastic unpaired, one- and two-sided) are employed to determine whether there are significant differences between the means of two groups. MARIA provides detailed output tables that include t-values, group-wise means, mean differences (which indicate the direction of change), percentage changes, uncorrected p-values, and adjusted p-values. To control for multiple comparisons, users can select from several available p-value correction methods, including Bonferroni, Holm, Hochberg, Hommel, Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), and False Discovery Rate (FDR).
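The quantities reported in such a table, and the effect of multiple-comparison correction, can be sketched as follows. This Python sketch is illustrative only and is not MARIA's code: the function names and data are hypothetical, only the heteroscedastic (Welch) unpaired case is shown, the percentage change is assumed to be relative to the second group, the p-value of the test itself is omitted, and just three of the listed correction methods are implemented.

```python
from math import sqrt
from statistics import mean, variance


def welch_summary(group_a, group_b):
    """Heteroscedastic (Welch) comparison of two independent groups."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    se2_a = variance(group_a) / len(group_a)  # squared standard error, group A
    se2_b = variance(group_b) / len(group_b)  # squared standard error, group B
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "mean_diff": mean_a - mean_b,  # sign gives the direction of change
        "pct_change": 100 * (mean_a - mean_b) / mean_b,  # assumed: relative to B
        "t": (mean_a - mean_b) / sqrt(se2_a + se2_b),
        # Welch-Satterthwaite degrees of freedom
        "df": (se2_a + se2_b) ** 2
              / (se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)),
    }


def adjust_pvalues(pvalues, method="BH"):
    """Adjust p-values with the Bonferroni, Holm, or Benjamini-Hochberg method."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * n
    if method == "bonferroni":
        return [min(1.0, p * n) for p in pvalues]
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):
            running = max(running, (n - rank) * pvalues[i])  # keep monotone
            adjusted[i] = min(1.0, running)
        return adjusted
    if method == "BH":
        running = 1.0
        for rank in range(n - 1, -1, -1):  # walk from the largest p-value down
            i = order[rank]
            running = min(running, pvalues[i] * n / (rank + 1))
            adjusted[i] = running
        return adjusted
    raise ValueError(f"unknown method: {method}")


# Made-up values of one feature in two groups.
summary = welch_summary([2.0, 4.0, 6.0], [5.0, 7.0, 9.0])
print(summary)

# Made-up uncorrected p-values from four feature-wise tests.
raw_p = [0.01, 0.04, 0.03, 0.005]
for method in ("bonferroni", "holm", "BH"):
    print(method, [round(p, 4) for p in adjust_pvalues(raw_p, method)])
```

Comparing the three outputs shows the usual ordering: Bonferroni is the most conservative, Holm is never more conservative than Bonferroni, and BH controls the false discovery rate rather than the family-wise error rate, so it typically leaves more features significant.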

Figure 4A provides an example of such a table, showing the most significant features when comparing anorexia patients with healthy controls. This detailed output allows researchers to quickly identify which features show the most substantial group differences. Furthermore, as shown in Figure 4B, MARIA can summarize t-test results at different levels of data organization, such as by modality or atlas, or down to the level of individual structures or measures, providing a higher-level view of where significant differences are concentrated.

2.6.2.3 Other analysis

MARIA also supports other traditional statistical methods for multi-group comparisons, such as ANOVA, ANCOVA, and MANOVA, which are not detailed here.

2.6.3 Clustering and dimensionality reduction analysis

MARIA offers several clustering and dimensionality reduction techniques that help researchers visualize and explore high-dimensional neuroimaging data. These techniques are useful for investigating and visualizing group separation and for identifying complex, non-linear relationships. Figure 5 illustrates three of the approaches implemented in MARIA: K-means clustering on PCA components, t-SNE, and UMAP.

Three-panel figure comparing dimensionality reduction and clustering of four categories (Adult_AN, Adult_HC, Young_AN, Young_HC) using k-means clustering on PCA (panel A), t-SNE (panel B), and UMAP (panel C). Each plot uses colors and shapes for categories and clusters, with convex hulls or ellipses highlighting category groupings.

Dimensionality reduction plots. Clusters are clouded by study groups (Adult_AN, Adult_HC, Young_AN, Young_HC). (A) K-means clustering results on PCA components, comparing identified cluster patterns (shapes) with actual categories (colors). (B) t-SNE reduced dimensions, highlighting non-linear relationships between groups, along with centroids and ellipses for each category. (C) UMAP dimensional reduction, preserving both local and global data structures, while also presenting the standard deviation and ellipsoids. All visualizations were generated after feature selection using sPLS-DA. The analysis platform allows for flexible visualization options, including 2D/3D representations, custom coloring schemes (by measurement, center, or user-defined groups), various data scaling methods, and alternative feature selection approaches (supervised methods like sPLS-DA and Boruta, or unsupervised methods like PCA).

2.6.3.1 Techniques supported

Principal Component Analysis (PCA): PCA is used to reduce the dimensionality of the data by identifying the principal components that capture the most variance.

Linear Discriminant Analysis (LDA): LDA finds linear combinations of features that best separate different classes.

t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear technique that reduces high-dimensional data to two or three dimensions while preserving local neighborhood structure, making it useful for visualization.

Uniform Manifold Approximation and Projection (UMAP): UMAP is used to preserve both the global and local structure of high-dimensional data in a lower-dimensional space, aiding in the visualization and exploration of complex datasets.
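To illustrate the idea behind the simplest of these techniques, the sketch below extracts the first principal component of a small, made-up data matrix using power iteration on the covariance matrix. It is a teaching example in plain Python under stated assumptions (hypothetical function names and data), not the implementation used in MARIA, which would rely on established library routines.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (divides by n - 1) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(rows, n_iter=200):
    """Return (direction, explained variance ratio, scores) for PC1."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    # Deterministic start vector; this would fail in the rare case that it is
    # exactly orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    scores = [sum(r[j] * v[j] for j in range(d)) for r in x]
    return v, eigenvalue / total_variance, scores


# Four made-up subjects with three features; the first two features co-vary.
data = [[2.0, 1.9, 0.5], [4.1, 4.0, 0.4], [6.0, 6.2, 0.6], [8.1, 7.9, 0.5]]
direction, explained, scores = first_principal_component(data)
print(direction, explained, scores)
```

The `scores` are the one-dimensional coordinates that would be plotted along the first axis of a PCA scatter plot; further components follow the same principle after removing the variance already explained.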

2.6.3.2 Visualization

As shown in Figure 5, MARIA provides interactive 2D and 3D scatter plots that represent high-dimensional data in a simplified form for each dimensionality reduction technique. These plots can be color-coded by different variables or user-defined groups, both to visualize clusters and patterns and to identify the parameters that separate them. Figure 5A illustrates how K-means clustering can be visualized in 2D, while Figures 5B,C show how t-SNE and UMAP results can be displayed in 2D and 3D, respectively. These visualizations allow researchers to explore potential separations between groups (such as anorexic vs. healthy groups for both adults and young subjects) and identify clusters or patterns in the data.

2.6.4 Flexibility and user-friendly design

2.6.4.1 External group information

MARIA allows users to incorporate external group information files to add and analyze new sub-groups or subsets of data. This feature enhances the flexibility and efficiency of data analysis by enabling dynamic group comparisons without reloading and processing the data.
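The general idea can be sketched as a lookup of subject identifiers against data that is already in memory. The example below is a hedged illustration only: the file layout (a CSV with `subject_id` and `group` columns), the function name `regroup`, and all values are assumptions, since the actual group file format used by MARIA is not described here.

```python
import csv
import io
from collections import defaultdict


def regroup(loaded, group_file):
    """Return {group label: [subject ids]} for subjects already loaded.

    `loaded` maps subject ids to their feature values; `group_file` is any
    open text file in the assumed two-column CSV layout.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(group_file):
        subject = row["subject_id"]
        if subject in loaded:  # skip ids that are not part of the loaded data
            groups[row["group"]].append(subject)
    return dict(groups)


# Data that was loaded and processed once (made-up values)
loaded = {
    "s01": {"feature_a": 0.42},
    "s02": {"feature_a": 0.37},
    "s03": {"feature_a": 0.51},
}

# A new grouping supplied later, here as an in-memory stand-in for a file
group_csv = io.StringIO(
    "subject_id,group\n"
    "s01,responder\n"
    "s02,non_responder\n"
    "s03,responder\n"
    "s99,responder\n"
)

new_groups = regroup(loaded, group_csv)
print(new_groups)
```

Because only the mapping from subjects to groups changes, the feature data itself does not have to be read or processed again.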

2.6.4.2 Publication-quality outputs

MARIA is equipped to generate and save publication-quality plots and tables from various statistical analyses. This functionality streamlines the process of integrating results into presentations or research papers, making it easier for researchers to present their findings.

2.7 ML Pipeline (machine learning pipeline)

The ML Pipeline is an automated and flexible R framework designed to apply a comprehensive machine learning workflow to multi-parametric neuroimaging and clinical data. The entire process, from data preparation to model evaluation, is controlled by a single user-defined configuration file. This file allows researchers to specify all key parameters, including:

The data splitting strategy (e.g., training/test/validation percentages).

The cross-validation method (e.g., K-fold or Monte Carlo), along with the number of iterations and number of folds.

The specific feature selection and classification algorithms to be used.

The random seed for ensuring reproducible results.

A sample configuration file, converted from YAML to human-readable PDF format for transparency, is provided in the Supplementary material.
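As a rough illustration of how a configuration of this kind can drive a reproducible Monte Carlo cross-validation, the sketch below uses a Python dictionary as a stand-in for the configuration file. Every key name and value is hypothetical and does not reflect the actual schema of the ML Pipeline, which is written in R; the point is only to show how a fixed seed makes repeated random splits repeatable.

```python
import random

# Hypothetical configuration; key names are assumptions, not the real schema.
config = {
    "seed": 42,
    "cross_validation": {"method": "monte_carlo", "iterations": 3},
    "split": {"train": 0.75, "validation": 0.25},
    "feature_selection": "boruta",
    "classifier": "random_forest",
}


def monte_carlo_splits(sample_ids, cfg):
    """Yield (train, validation) id lists, one pair per iteration.

    A dedicated random generator seeded from the configuration makes the
    sequence of splits identical on every run.
    """
    rng = random.Random(cfg["seed"])
    n_train = round(len(sample_ids) * cfg["split"]["train"])
    for _ in range(cfg["cross_validation"]["iterations"]):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


subjects = [f"s{i:02d}" for i in range(1, 9)]  # eight made-up subject ids
first_run = list(monte_carlo_splits(subjects, config))
second_run = list(monte_carlo_splits(subjects, config))

for train, validation in first_run:
    print(len(train), "train /", len(validation), "validation:", validation)
```

Unlike K-fold cross-validation, the validation sets of different Monte Carlo iterations may overlap, because each iteration draws a fresh random split.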

Once configured, the workflow runs automatically within a Singularity container, which encapsulates all necessary software packages and dependencies. This containerized approach ensures reproducibility and platform independence and allows the entire analysis to be scaled on high-performance computing (HPC) clusters. The full workflow is depicted in Figure 6.

Workflow diagram for n time monte-carlo cross validation showing sequential steps: data import, preprocessing (imputation and filtering), data splitting into training and validation, feature selection, model building and classification, evaluation metrics, and data visualization, each represented by icons and brief lists of tasks or algorithms under each category.

ML pipeline framework. This flowchart illustrates the automated and sequential workflow of the ML Pipeline, from initial data import to final model evaluation. Each stage is user-configurable via a central configuration file. All images were obtained from flaticon.com.

2.7.1 Data preparation and validation strategy

The initial phase involves thorough data preparation and pre-processing, which are crucial for ensuring the quality and relevance of the data used for model training.
