Enhancing dementia and cognitive decline detection with large language models and speech representation learning

Abstract

Dementia poses a major challenge to individuals and public health systems. Detecting cognitive decline through spontaneous speech offers a promising, non-invasive avenue for diagnosis of mild cognitive impairment (MCI) and dementia, enabling timely intervention and improved outcomes. This study describes our submission to the PROCESS Signal Processing Grand Challenge (ICASSP 2025), which tasked participants with predicting cognitive decline from speech samples. Our method combines eGeMAPS features from openSMILE, HuBERT (a self-supervised speech representation model), and GPT-4o, OpenAI's state-of-the-art large language model. These are integrated with custom LSTM and ResMLP neural networks, and supported by Scikit-learn regressors/classifiers for both cognitive score regression and dementia classification. Our regression model based on LightGBM achieved an RMSE of 2.7775, placing us 10th out of 80 teams globally and surpassing the RoBERTa baseline by 7.5%. For the three-class classification task (Dementia/MCI/Control), our LSTM model obtained an F1-score of 0.5521, ranking 20th of 106 and marginally outperforming the best baseline. We trained models on speech data from 157 study participants, with independent evaluation performed on a separate test set of 40 individuals. We found that integrating large language models with self-supervised speech representations enhances the detection of cognitive decline. The proposed approach offers a scalable, data-driven method for early cognitive screening and may support emerging applications in neuropsychological informatics.

1 Introduction

Dementia represents a major public health challenge in developed nations, where the proportion of elderly individuals is steadily increasing. Among its causes, Alzheimer's disease (AD) is the most prevalent, accounting for 60%–80% of cases. In 2019, an estimated 57.4 million people worldwide were living with dementia—a figure projected to rise to 152.8 million by 2050 (Nichols et al., 2022).

Dementia is a leading cause of disability among those aged 60 and above, with prevalence rates in Europe rising exponentially with age—reaching 5.05% (95% CI, 4.73–5.39), and notably higher in women (7.13%) than men (3.31%) (Niu et al., 2017). According to the Global Burden of Disease Study 2021, neurological conditions, including dementia, are among the foremost contributors to disability globally (Ferrari et al., 2024).

This issue is particularly significant in European countries. For example, Poland is projected to experience a doubling in the number of people with dementia, from 525,084 in 2018 to 1,075,099 by 2050, despite an overall population decline. This increase, which slightly exceeds the European average, is driven by rapid growth in the elderly population, especially those over 80 years of age (Georges et al., 2020). Cognitive impairment encompasses deficits in memory, language, attention, and executive function that surpass normal aging, often leading to diminished independence and adverse outcomes such as increased falls, hospitalisations, financial difficulties, and heightened caregiver burden (Fowler et al., 2025).

Early detection is therefore essential, as timely diagnosis enables interventions that may slow cognitive decline and delay or prevent progression to dementia (Fowler et al., 2025; Ruzi et al., 2025). Early identification also allows clinicians to address reversible causes, optimize management of comorbidities, mitigate safety risks, and preserve patients' quality of life (Fowler et al., 2025). Automated analysis of speech has emerged as a non-invasive, cost-effective method for detecting cognitive decline, facilitating rapid and frequent monitoring without the need for specialist personnel (König et al., 2015).

Recent advances in voice-based cognitive assessment tools have demonstrated their efficacy as early screening methods. Neurodegenerative processes often manifest as subtle changes in speech and language, even at prodromal stages of dementia. Numerous studies report that speech and voice biomarkers can distinguish cognitively impaired individuals from healthy older adults with high accuracy, often around 80% for mild impairment (Martínez-Nicolás et al., 2021). Furthermore, clinical evaluations indicate high acceptance of voice-based screening among older adults (Ruzi et al., 2025). For instance, a recent mobile application for MCI achieved performance comparable to standard in-person tests, with approximately 86% user acceptance (Ruzi et al., 2025).

Speech is a well-established early indicator of cognitive deficits, including dementia (Bucks et al., 2000). Detection approaches frequently combine acoustic and linguistic feature extraction, achieving binary classification accuracy on brief spontaneous speech samples of up to 88%, with recall rates as high as 0.92 (Jarrold et al., 2014). Such methods offer the potential for fully automated, near real-time screening and can supplement traditional diagnostic processes (Weiner et al., 2016). Speech-language pathologists contribute expertise in discourse coherence, fluency, and pragmatic use—features sensitive to early-stage dementia (Boschi et al., 2017).

Dementia in speech is often detected through multi-step data processing, including voice activity detection, speaker diarisation, and extraction of acoustic and/or linguistic features. Studies have shown that speech segments as short as 2.5 min can be informative, with optimal results achieved using segments between 10 and 15 min (Weiner et al., 2018). Deep convolutional neural networks have significantly advanced the field (Krizhevsky et al., 2012; Cheplygina et al., 2019), and transfer learning consistently improves performance across a range of tasks (Kornblith et al., 2019).

Incorporating clinical insights from neuropsychologists is crucial when developing speech-based biomarkers for early dementia detection. Neuropsychologists are adept at interpreting how cognitive deficits manifest in natural language, particularly in areas such as discourse coherence, semantic content, and executive control. For example, Mueller et al. (2018) demonstrated that individuals with subclinical cognitive impairment exhibited measurable declines in connected speech fluency and semantic richness prior to deficits appearing on conventional neuropsychological tests. Similarly, associations have been observed between amyloid-beta positivity and accelerated decline in word-level content during spontaneous speech, highlighting the value of neuropsychological expertise in contextualizing speech biomarkers (Mueller et al., 2021).

Recently, large language model (LLM)-aided feature engineering has been employed to predict AD and related dementias in an explainable and accurate manner. Kashyap et al. (2025) utilized OpenAI's GPT-4 to extract patient concept features from the Oxford Textbook of Medicine.

These findings highlight the potential of voice-based digital tools as accurate, scalable, and patient-friendly solutions for early detection and ongoing monitoring of cognitive impairment. The development and application of algorithms and data science methods within neuropsychological informatics is rapidly advancing, driven by the need for non-invasive, accessible, and ecologically valid approaches. The ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) challenge series, held at Interspeech conferences in 2020 and 2021, exemplifies this progress, focusing on robust, generalisable models suitable for real-world clinical deployment (Luz et al., 2021b).

A growing body of research demonstrates that effective dementia detection can be achieved by combining acoustic and linguistic features extracted from spontaneous speech. Linguistic analysis, whether derived from automatic speech recognition (ASR) output or manual transcripts, has reliably identified dementia-related changes (Weiner et al., 2017). However, studies indicate that integrating acoustic and linguistic modalities yields only modest improvements over using either approach alone (Cummins et al., 2020; Rohanian et al., 2020), highlighting both the promise and current limitations of multi-modal speech analysis.

Given these challenges and opportunities, this study aims to:

Develop and evaluate novel multimodal approaches combining speech representations with large language model-derived features for automated dementia and cognitive decline detection.

Assess the performance of these methods using a standardized benchmark dataset that enables direct comparison with state-of-the-art approaches in speech-based dementia and cognitive decline detection.

Compare our proposed methods against established baselines to demonstrate their effectiveness for early cognitive screening.

Test new LLM-derived features that can inform healthcare professionals about language and speech patterns associated with cognitive decline.

2 Methods

In this section, we present our proposed methods for early-stage dementia detection, which leverage acoustic, paralinguistic, and pre-trained features. The overall workflow is illustrated in Figure 1 and described in detail below.

Figure 1. The general flow diagram for the methods proposed in this study. Audio input is transcribed using Whisper; features are engineered using an LLM (GPT-4o), HuBERT, and eGeMAPS. Modeling options include a PyTorch LSTM, a PyTorch ResMLP, XGBoost, GBT, RandomForest, and LightGBM. Outputs are labeled as Dementia, MCI, HC, or Converted-MMSE.

Our approach builds upon prior work by Chlasta and Wołk (2021), who introduced a two-step classification framework for detecting cognitive impairment due to AD. Their method utilized VGGish, a deep pre-trained TensorFlow model, as an audio feature extractor, followed by classical machine learning classifiers implemented in Scikit-learn. This approach achieved 59.1% accuracy—approximately 3% higher than the best-performing baseline binary classification models using acoustic features from the ADReSS challenge. Additionally, they proposed DemCNN, a convolutional neural network trained directly on raw waveforms, which reached an accuracy of 63.6%, outperforming the baseline by 7%. These findings suggest that transfer learning with pre-trained audio models (such as VGGish) can be more effective than handcrafted acoustic features in dementia detection tasks.

We also draw upon subsequent work by Chen et al. (2021), who explored multimodal approaches by integrating both acoustic and linguistic information. They extracted a wide range of acoustic features (MFCCs, GeMAPS, eGeMAPS, ComParE-2016, and IS10-Paraling) and combined them with linguistic features derived from ASR transcriptions, including LIWC and contextual embeddings from Bidirectional Encoder Representations from Transformers (BERT). Using various fusion strategies and an ensemble of the top 10 classifiers based on training performance, their model achieved a binary classification accuracy of 81.69% on the test set—exceeding the baseline of 78.87%. Importantly, their findings highlight the value of combining acoustic and textual modalities, even when ASR introduces transcription noise, and demonstrate that ensemble learning can significantly boost both accuracy and robustness in AD detection.

We participated in both tasks of the 2025 ICASSP PROCESS Signal Processing Grand Challenge (Grand Challenge at ICASSP 2025, 2025). The classification task involves developing models to distinguish between three groups of patient speech [Healthy Control (HC), MCI, or Dementia groups] using the PROCESS dataset, while the regression task predicts the corresponding MMSE score of speakers from their speech.

The challenge organizers provided six baseline machine learning models utilizing both acoustic and linguistic features for the analysis of spontaneous speech pathology. Acoustic approaches included emobase (Eyben et al., 2010), ComParE 2013 (Eyben et al., 2013), Multi-resolution Cochleagram features (MRCG) (Chen et al., 2014), the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) (Eyben et al., 2015), and a minimal feature set (Luz, 2017).

Our approach is designed as a screening tool for early detection of cognitive decline, rather than for formal clinical diagnosis. We present models supporting both the classification and regression tasks, and compare our results to the respective baselines in Tables 1 and 2.

| Model No. | Features | Prompt | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|
| 1-ResMLP | eGeMAPS | CT+VT | 59.4 | 17.9 | 58.6 | 27.3 |
| 2-LSTM | HuBERT+LLM | CT+VT | 58.7 | 52.6 | 58.7 | 55.2 |
| 3-XGBoost | HuBERT | CT | 68.1 | 52.1 | 55.1 | 53.5 |
| Baseline 1 | SVC (eGeMAPS) | CT | 57.5 | 53.5 | 61.2 | 55.0 |
| | | VT | 45.0 | 38.8 | 39.1 | 38.3 |
| | | CT+VT | 50.0 | 41.7 | 42.3 | 41.7 |
| Baseline 2 | RFC (eGeMAPS) | CT | 60.0 | 71.7 | 49.9 | 53.3 |
| | | VF | 52.5 | 33.1 | 35.9 | 33.9 |
| | | CT+VF | 52.5 | 66.1 | 43.9 | 47.4 |
| Baseline 3 | RoBERTa-Classifier | CT | 52.5 | 36.1 | 39.7 | 36.8 |
| | | VF | 55.0 | 35.6 | 38.1 | 35.6 |
| | | CT+VF | 52.5 | 32.2 | 35.9 | 32.9 |

Table 1. Our approaches (best in bold) vs the best baseline accuracy classification results (on the test set) for the Cookie Theft (CT) task, semantic fluency, phonemic fluency (VF), and their combination (CT+VF) are presented in terms of accuracy (Acc.), macro-precision (Prec.), macro-recall (Rec.), and macro-F1 score, all reported as percentages.

Our approach (HuBERT+LLM) in bold vs the baselines using acoustic features (eGeMAPS) and RoBERTa.

| Model No. | Features | Prompt | RMSE |
|---|---|---|---|
| Model 1-GBT | HuBERT+LLM | CT+VT | 3.3877 |
| Model 2-RandomForest | eGeMAPS | CT+VT | 5.2870 |
| Model 3-LightGBM | HuBERT+LLM | CT+VT | 2.7775 |
| Baseline Model 1 | SVR (eGeMAPS) | CT | 4.4000 |
| Baseline Model 2 | RFR (eGeMAPS) | CT+VF | 3.1700 |
| Baseline Model 3 | RoBERTa-Regression | CT+VF | 2.9850 |

Table 2. Our results on the regression task using the PROCESS test set.

Our approach (HuBERT+LLM) in bold vs the baselines using acoustic features (eGeMAPS) and RoBERTa.

In our approach, we analyse a comprehensive set of acoustic and linguistic indicators to detect cognitive decline. Acoustic features such as prosodic variations, articulation rate, speech rate, pause frequency and duration, pitch variability, articulation precision, and spectral characteristics are extracted using HuBERT. Linguistic markers, including lexical diversity (e.g., type-token ratio), syntactic complexity (e.g., mean length of utterance), semantic coherence, and discourse organization, are derived by GPT-4o from Whisper-generated transcriptions. These features are known to correlate with cognitive decline and serve as potential identifiers of dementia and MCI. The measures selected are consistent with observed impairments in dementia, such as impoverished vocabulary, reduced syntactic structure, and topic drift, and align with established neuropsychological constructs for speech and language assessment.

All preprocessing steps—including feature extraction via HuBERT and transcription using Whisper—were carried out on a local workstation equipped with 32 GB RAM and a 12th Generation Intel Core i7-12700H processor (2.3 GHz, 14 cores, 20 logical threads). While GPU acceleration was assessed using an NVIDIA GeForce RTX 3060 Laptop GPU (6 GB VRAM), it did not yield significant gains in processing speed for the selected tasks. Experimental workflows followed an AutoML strategy (Dataiku, 2025), employing well-established libraries such as Scikit-learn (Pedregosa et al., 2011), XGBoost (Chen and Guestrin, 2016), and LightGBM (Ke et al., 2017), alongside custom neural network models implemented with PyTorch (Paszke et al., 2019).

We adopted the official PROCESS dataset split, assigning 80% of participants (n = 157; codes 001–157) to the training set and the remaining 20% (n = 40; codes 001–040) to the test set.

2.1 Dataset

We utilized the dataset provided by the PROCESS Signal Processing Grand Challenge (Tao et al., 2025), which was divided into training and test sets. The training set comprised 926 MB of data from 157 participants, each represented by a unique folder (Process-rec-XXX) containing audio recordings in .wav format and manual transcriptions from three neuropsychological tasks:

Semantic Fluency task: Participants were asked, “Please name as many animals as you can in a minute.” This task is analogous to naming tasks in standard cognitive assessments, primarily evaluating language abilities and naming skills to detect potential issues in language comprehension and expression.

Phonemic Fluency task: Participants were instructed, “Please say as many words beginning with the letter ‘P' as you can. Any word beginning with ‘P' except for names of people such as Peter, or countries such as Portugal.” A one-minute time limit was imposed. This task is similar to those used in cognitive assessments to test verbal fluency and executive functions related to language.

Cookie Theft picture task: Participants were prompted to describe a visual scene from the picture, evaluating spontaneous narrative speech and language coherence.

The corpus provides diagnostic class labels alongside cognitive assessment scores. Several standardized measures are included, such as the Montreal Cognitive Assessment (MoCA) (Nasreddine et al., 2005), Mini-Cognitive Examination (MCE) (Vilalta-Franch et al., 1996), and Addenbrooke's Cognitive Examination-III (ACE-III) (Hsieh et al., 2013), reflecting differences in clinical practice across dementia and stroke pathways. Additionally, a unified Mini-Mental State Examination (MMSE) (Cockrell and Folstein, 2002) score is provided for some participants, derived by converting MoCA, MCE, or ACE-III scores. This converted MMSE score forms the basis of the regression task, while the diagnostic labels support the classification task.

In addition to audio and transcripts, the dataset included a metadata file (dem-info.csv) providing subject-level demographic and clinical information. This metadata comprised the dataset partition (train or development), diagnostic class (HC, MCI, or Dementia), gender, age, and cognitive status measured by the MMSE. Where MMSE scores were not directly available, they were estimated from MoCA or ACE-III assessments using validated conversion formulas (Fasnacht, 2023).

Among the 157 training participants, 82 were labeled HC, 59 MCI, and 16 Dementia. MMSE scores were available for 69 individuals. Age information was missing for 24 participants and was imputed using the average age across the dataset (66 years), marked with an asterisk. For the remaining 133 individuals, the age range was 23–94 years. Table 3 presents the demographic summary for the PROCESS training and development dataset. This dataset was used exclusively for training our models.

| PROCESS dataset statistics | Value |
|---|---|
| Dementia cases | 16 |
| MCI cases | 59 |
| Healthy controls | 82 |
| Total participants | 157 |
| Female participants | 81 |
| Male participants | 75 |
| Participants with age info | 133 |
| Min Age | 23 |
| Max Age | 94 |
| MMSE scores available | 69 |
| Mean MMSE | 27.36 |

Table 3. Demographic summary for the PROCESS training and development dataset.

A separate test dataset, comprising 236 MB of data from 40 additional study participants, was provided for evaluation. The test dataset contains audio recordings for the same three neuropsychological tasks, but does not include manual transcriptions. This design ensures that model performance is assessed on previously unseen data, reflecting real-world deployment scenarios. Table 4 presents the demographic summary for the PROCESS test dataset.

| PROCESS dataset statistics | Value |
|---|---|
| Total participants | 40 |
| Female participants | 21 |
| Male participants | 19 |
| Participants with age info | 40 |
| Min age | 45 |
| Max age | 86 |

Table 4. Demographic summary for the PROCESS test dataset.

For each participant in the test set, our classification models produced a diagnostic label (Healthy Control, Mild Cognitive Impairment, or Dementia), whereas our regression models predicted the MMSE score for each individual.

2.1.1 Acoustic features (eGeMAPS)

To extract acoustic biomarkers relevant to cognitive state, we employed the openSMILE toolkit (Eyben et al., 2010) configured with the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPSv02), a standardized set of 88 features designed for paralinguistic and clinical voice analysis (Eyben et al., 2015).

Our pipeline processed three distinct audio recordings per subject—CT, PF, and SF—each located within structured directories corresponding to a different speech task. Using openSMILE, we computed summary statistics (functionals) over low-level descriptors such as fundamental frequency (F0), shimmer, jitter, harmonic-to-noise ratio (HNR), spectral flux, and Mel-frequency cepstral coefficients (MFCCs), all of which have demonstrated high discriminative power in dementia-related speech studies (König et al., 2015; Martínez-Nicolás et al., 2021).

In parallel, we used Parselmouth (Jadoul et al., 2018), a Python interface to Praat, to extract additional temporal and perturbation-based features—mean pitch, pitch variation, mean intensity, local jitter, local shimmer, and mean HNR—which supplement the eGeMAPS set with clinically interpretable measures of vocal fold stability and voice quality.

All extracted features were tabulated for downstream analysis. This dual-extraction strategy is consistent with current best practices in computational paralinguistics and neurocognitive speech analysis, which emphasize the integration of spectral, prosodic, and phonatory metrics (Schuller et al., 2013; Eyben et al., 2015).

2.1.2 HuBERT

We utilized HuBERT (Hsu et al., 2021), a self-supervised learning model for speech representation that has demonstrated superior performance in tasks such as automatic speech recognition and speaker verification. The pre-trained HuBERT model was used to extract high-level features from the audio recordings.

For each recording, we extracted 1,024-dimensional feature vectors using the HuBERT model. These features served as inputs to our classification and regression models. To access HuBERT, we utilized the Hugging Face Transformers library (Wolf et al., 2020), which provides a user-friendly interface for working with pre-trained models. Feature extraction was performed by passing the audio recordings through HuBERT and obtaining the representations from the final hidden layer. Specifically, HuBERT-based acoustic embeddings were extracted from the recordings of all three tasks (SF, PF, and CT) and concatenated to form a unified feature representation for each individual. These representations were subsequently used in the modeling tasks.
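The pooling and concatenation step described above can be sketched as follows. This is a minimal illustration and not the authors' code: mean pooling over time is an assumption (the text states only that final-layer representations were used), random arrays stand in for real HuBERT outputs so the snippet runs without downloading a model, and the names `pool_hidden_states` and `build_subject_vector` are hypothetical.

```python
import numpy as np

HIDDEN_DIM = 1024            # per-recording embedding size stated in the text
TASKS = ("SF", "PF", "CT")   # semantic fluency, phonemic fluency, Cookie Theft


def pool_hidden_states(hidden_states: np.ndarray) -> np.ndarray:
    """Collapse a (frames, HIDDEN_DIM) matrix into a single vector.

    Mean pooling over the time axis is an assumption made for this sketch.
    """
    if hidden_states.ndim != 2 or hidden_states.shape[1] != HIDDEN_DIM:
        raise ValueError("expected an array of shape (frames, 1024)")
    return hidden_states.mean(axis=0)


def build_subject_vector(per_task_states: dict) -> np.ndarray:
    """Concatenate the pooled embeddings of the three tasks in a fixed order."""
    return np.concatenate([pool_hidden_states(per_task_states[t]) for t in TASKS])


# Random stand-ins for final-layer outputs; recordings differ in length.
rng = np.random.default_rng(0)
fake_states = {
    "SF": rng.normal(size=(120, HIDDEN_DIM)),
    "PF": rng.normal(size=(95, HIDDEN_DIM)),
    "CT": rng.normal(size=(300, HIDDEN_DIM)),
}
subject_vector = build_subject_vector(fake_states)
print(subject_vector.shape)  # (3072,)
```

Keeping the task order fixed means every participant's vector has the same layout, so a downstream model always sees a given task's features in the same columns.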

2.1.3 LLM (Whisper + GPT-4o)

As part of our analysis, we explicitly extracted and scored clinically relevant speech and language features from each task using large language models (LLMs). For every participant, LLMs were employed to evaluate the following dimensions:

Cookie theft description (CT): Content accuracy, language fluency, grammar and syntax, organization, and awareness of key details.

Phonemic fluency task (PF): Phonemic fluency, repetition errors, intrusion errors, pace and effort, and total valid words.

Semantic fluency task (SF): Semantic fluency, repetition errors, intrusion errors, pace and effort, semantic clustering and switching, and total valid items.

These dimensions were selected based on neuropsychological literature and clinical practice, reflecting core aspects of cognitive and linguistic function affected by dementia and MCI. Each feature was scored on a scale from 0 (full mental capabilities) to 10 (heavy dementia) by prompting the LLM with standardized templates (see Table 5), ensuring that our feature set is both interpretable and clinically meaningful. The resulting scores were subsequently used as input to our classification and regression models.

| Task | Prompt template |
|---|---|
| CT (Cookie theft description) | Evaluate the cognitive capabilities of the person based on The Cookie Theft description task from Mini-Mental State Examination (MMSE). Provide a score for key aspects like Content Accuracy, Language Fluency, Grammar and Syntax, Organization, Awareness of Key Details and a summary score from 0 (full mental capabilities) to 10 (heavy dementia), along with a brief reasoning for the score. Here is the transcription with patient's response marked as Pat: and other persons' responses that should be dismissed marked as Oth: |
| PF (Phonemic fluency test) | Evaluate the cognitive capabilities of the person based on Phonemic Fluency Test task from Mini-Mental State Examination (MMSE). Provide a score for key aspects like Phonemic Fluency, Repetition Errors, Intrusion Errors, Pace and Effort, Total Valid Words and a summary score from 0 (full mental capabilities) to 10 (heavy dementia), along with a brief reasoning for the score. Here is the transcription with patient's response marked as Pat: and other persons' responses that should be dismissed marked as Oth:. |
| SF (Semantic fluency test) | Evaluate the cognitive capabilities of the person based on Semantic Fluency Test task from Mini-Mental State Examination (MMSE). Provide a score for key aspects like Semantic Fluency, Repetition Errors, Intrusion Errors, Pace and Effort, Semantic Clustering and Switching, Total Valid Items and a summary score from 0 (full mental capabilities) to 10 (heavy dementia), along with a brief reasoning for the score. Here is the transcription with patient's response marked as Pat: and other persons' responses that should be dismissed marked as Oth:. |

Table 5. LLM prompt templates for extracting features from three speech tasks: Cookie Theft (CT), Phonemic Fluency (PF), and Semantic Fluency (SF).

None of these tasks are part of the MMSE. Prompts were designed to assess cognitive-linguistic performance with scores from 0 (no impairment) to 10 (severe impairment).

For feature extraction using an LLM, we first transcribed all audio recordings from the PROCESS study participants using Whisper (Radford et al., 2022), a state-of-the-art ASR system developed by OpenAI. Whisper is a transformer-based model trained on a large, diverse corpus of multilingual audio data, enabling accurate transcription across various languages and dialects. Transcriptions were generated using the official Whisper library and subsequently served as input to OpenAI's GPT-4o (Hurst et al., 2024) for downstream analytical tasks.
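Turning a free-text LLM reply into numeric features could look like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the reply format (`<Feature name>: <number>` on each line) is assumed, the example reply is hand-written, the actual API call to the LLM is omitted, and `build_prompt` and `parse_scores` are hypothetical helper names.

```python
import re

# Feature names requested by the Cookie Theft prompt in Table 5.
CT_FEATURES = [
    "Content Accuracy",
    "Language Fluency",
    "Grammar and Syntax",
    "Organization",
    "Awareness of Key Details",
]


def build_prompt(template: str, transcript: str) -> str:
    """Append the speaker-tagged transcript (Pat:/Oth:) to a prompt template."""
    return f"{template.strip()}\n\n{transcript.strip()}"


def parse_scores(reply: str, features: list) -> dict:
    """Extract one 0-10 score per feature from a free-text reply.

    Assumes each score appears as '<Feature name>: <number>'. Features that
    are missing or out of range are returned as None rather than guessed.
    """
    scores = {}
    for name in features:
        pattern = rf"{re.escape(name)}\s*:\s*(\d+(?:\.\d+)?)"
        match = re.search(pattern, reply, re.IGNORECASE)
        value = float(match.group(1)) if match else None
        scores[name] = value if value is not None and 0 <= value <= 10 else None
    return scores


# Hypothetical reply, written by hand for illustration only.
example_reply = """Content Accuracy: 2
Language Fluency: 3
Grammar and Syntax: 1
Organization: 2
Summary score: 2, mostly coherent description with minor omissions."""

ct_scores = parse_scores(example_reply, CT_FEATURES)
print(ct_scores)
```

Returning `None` for a score that cannot be found keeps parsing failures visible, so they can be handled explicitly (for example by re-prompting) instead of silently entering the feature table.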

2.2 Classification method

We conducted a two-step experimental procedure to detect dementia and MCI, as illustrated in Figure 1. For the classification task, we developed models that leveraged both acoustic and language-derived features to predict cognitive impairment. The HuBERT-based acoustic embeddings were combined with 12 interpretable, high-level features derived from large language models (LLMs), encompassing measures such as CT Language Fluency, PF Total Valid Words, and SF Repetition Errors.

Classification models were trained using several algorithms, including Support Vector Machines (SVM) with a radial basis function (RBF) kernel, Gradient Boosted Trees (GBT), LightGBM, and XGBoost, implemented via Dataiku's DSS platform (Dataiku, 2025). All classifiers utilized the full feature set, which comprised HuBERT-based embeddings from verbal fluency tasks (SF, PF) and a discourse elicitation task (CT), along with 12 high-level features derived from the LLM. Performance was assessed on held-out test data to ensure generalizability. To enhance robustness, ensemble learning techniques were applied by aggregating predictions from classifiers trained on individual feature subsets (e.g., HuBERT_Features_SF, PF, and CT), yielding a total of nine model variants. Final decisions were derived through decision-level fusion. In cases of prediction ties, the label “dementia” was assigned to mitigate the risk of under-diagnosis, consistent with clinical practice, where false positives (Type I errors) are generally considered less harmful than false negatives (Type II errors).
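The decision-level fusion rule can be illustrated with a small majority-vote function. This is a sketch under assumptions: the text does not specify the exact voting scheme, so plain (unweighted) majority voting is assumed here, and the function name `fuse_predictions` is hypothetical. Only the tie-breaking rule (assign "Dementia") is taken from the description above.

```python
from collections import Counter


def fuse_predictions(votes: list, tie_label: str = "Dementia") -> str:
    """Majority vote over per-model labels; any tie resolves to `tie_label`."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    winners = [label for label, n in counts.items() if n == top]
    # A unique winner is returned as-is; a tie falls back to the cautious label.
    return winners[0] if len(winners) == 1 else tie_label


# Nine model variants voting on one participant (toy labels).
nine_votes = ["MCI", "MCI", "HC", "MCI", "Dementia", "HC", "MCI", "HC", "MCI"]
print(fuse_predictions(nine_votes))                               # MCI

# Two labels share the top count, so the tie rule applies.
print(fuse_predictions(["HC", "HC", "MCI", "MCI", "Dementia"]))   # Dementia
```

Note that under this literal reading a tie between HC and MCI also resolves to "Dementia"; a deployment might prefer a narrower rule, but that would go beyond what the text states.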

We further implemented a neural network-based classification model using the LSTMSpeechClassifier, a custom PyTorch module designed to handle high-dimensional speech features. The architecture begins with a batch normalization layer to standardize input features, followed by a two-layer Long Short-Term Memory (LSTM) network with 256 hidden units, which captures temporal dependencies and sequential dynamics inherent in the speech data. The output sequence from the LSTM is passed through a fully connected linear layer to generate logits for three diagnostic categories (output_dim = 3).

During the forward pass, input features are normalized and reshaped to incorporate a temporal sequence dimension before being processed by the LSTM; the final hidden state is then mapped to class predictions via the classifier layer. This design enables the model to combine robust feature-level normalization with sequential modeling of cognitive status. The training process converged within 20 epochs, and early stopping based on validation loss was employed to mitigate overfitting.
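A minimal PyTorch sketch of a classifier with this shape is given below. It follows the description (batch normalization, two-layer LSTM with 256 hidden units, linear layer to three logits), but several details are assumptions: the input is treated as a sequence of length one, the input size of 3084 (three 1,024-dimensional embeddings plus 12 LLM features) is inferred rather than stated, and attribute names are illustrative. It is not the authors' implementation.

```python
import torch
from torch import nn


class LSTMSpeechClassifier(nn.Module):
    """BatchNorm -> 2-layer LSTM (256 units) -> linear classifier."""

    def __init__(self, input_dim: int, hidden_dim: int = 256,
                 num_layers: int = 2, output_dim: int = 3):
        super().__init__()
        self.norm = nn.BatchNorm1d(input_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)             # (batch, input_dim), standardized features
        x = x.unsqueeze(1)           # (batch, 1, input_dim): add a sequence axis
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers, batch, hidden_dim)
        return self.classifier(h_n[-1])  # final hidden state of the top layer


torch.manual_seed(0)
model = LSTMSpeechClassifier(input_dim=3084)  # 3 * 1024 + 12, an assumption
model.eval()                                  # use running BatchNorm statistics
with torch.no_grad():
    logits = model(torch.randn(4, 3084))      # four synthetic participants
print(logits.shape)  # torch.Size([4, 3])
```

The logits would typically be passed to `nn.CrossEntropyLoss` during training, with early stopping monitored on a validation split as described above.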

We also implemented a custom neural network model termed ResidualMLP, developed as a PyTorch module for processing high-dimensional speech features using a deep multilayer perceptron (MLP) architecture enhanced with residual connections. The network comprises sequential fully connected layers with decreasing hidden dimensions of 512, 384, 256, and 128 units. Each layer is followed by batch normalization and non-linear activation to ensure stable learning and effective feature transformation. Residual connections are introduced between layers of matching dimensions to facilitate gradient flow and mitigate vanishing gradient issues, thereby improving training dynamics and model convergence. This architecture enables efficient learning from complex speech embeddings by leveraging deep nonlinear transformations while retaining stability through architectural design.
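One way to realize such a network is sketched below. Because the hidden widths decrease from layer to layer, this sketch places each residual connection inside a same-width block that follows every projection; that placement, the ReLU activation, and the input size of 88 (one eGeMAPS vector) are assumptions, since the text does not specify them.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Same-width layer whose input is added back to its output."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # skip connection between matching dimensions


class ResidualMLP(nn.Module):
    """MLP with hidden widths 512, 384, 256, 128 and residual blocks."""

    def __init__(self, input_dim: int, hidden_dims=(512, 384, 256, 128),
                 output_dim: int = 3):
        super().__init__()
        layers = []
        prev = input_dim
        for dim in hidden_dims:
            # Project to the new width, then refine with a residual block.
            layers += [nn.Linear(prev, dim), nn.BatchNorm1d(dim), nn.ReLU(),
                       ResidualBlock(dim)]
            prev = dim
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(prev, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))


torch.manual_seed(0)
mlp = ResidualMLP(input_dim=88)   # 88 eGeMAPS functionals, illustrative only
mlp.eval()
with torch.no_grad():
    out = mlp(torch.randn(4, 88))
print(out.shape)  # torch.Size([4, 3])
```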

2.3 Regression method

For the regression task, the goal was to predict participants' cognitive performance, quantified via the Converted-MMSE score, based solely on speech-derived features. The input representation combined low-level acoustic embeddings and high-level linguistic attributes. Specifically, HuBERT-based embeddings were extracted from three speech tasks: a discourse elicitation task (CT) and two verbal fluency tasks (SF, PF). Each task produced a 1024-dimensional feature vector. In parallel, twelve cognitively and linguistically informative features derived from large language models (LLMs) were incorporated, including syntactic complexity from CT (e.g., grammar and clause structure), semantic fluency metrics from SF, and repetition patterns from PF. All features were concatenated into a single high-dimensional input vector, enabling the model to jointly leverage fine-grained acoustic signals and abstract semantic-linguistic cues for cognitive status estimation.

To assess performance across various algorithmic families, we employed an AutoML strategy (Dataiku, 2025), which facilitated systematic hyperparameter tuning and model selection. The following regression model families were evaluated in this study: Gradient Boosting Trees (GBT), Random Forests, and LightGBM.

All regression models were evaluated on held-out test data to ensure generalizability. Hyperparameter optimization for tree-based methods was performed via cross-validation, with the aim of minimizing root mean squared error (RMSE) while limiting overfitting. Since the official Challenge test labels were withheld, we conducted post-hoc analyses on the released validation subset (40 samples, of which 21 had Converted-MMSE scores) to further compare models and assess demographic effects.
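Because only some participants have a Converted-MMSE score, the error metric has to be computed over the labelled subset. The sketch below shows one way to do this; the function name `rmse_on_labelled` and the toy values are hypothetical and are not taken from the challenge data.

```python
import math
from typing import Optional, Sequence


def rmse_on_labelled(y_true: Sequence[Optional[float]],
                     y_pred: Sequence[float]) -> float:
    """RMSE over the participants that have a reference score (not None)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t is not None]
    if not pairs:
        raise ValueError("no labelled participants to evaluate")
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


# Toy values for illustration only; None marks a missing Converted-MMSE score.
reference = [30.0, None, 27.0, 24.0, None]
predicted = [28.0, 26.5, 27.0, 26.0, 29.0]
print(round(rmse_on_labelled(reference, predicted), 4))  # 1.633
```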

Convergence of the boosted tree models (LightGBM and XGBoost) was managed internally via the early stopping mechanisms provided by the Dataiku Data Science Studio platform (Dataiku, 2025). Because these models are decision tree ensembles, they are not trained in epochs. Instead, the number of boosting iterations is selected from validation-set performance, which removes the need to set the training duration manually, while the remaining hyperparameters were tuned by cross-validation as described above.
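Early stopping for boosted trees can be illustrated with scikit-learn's `GradientBoostingRegressor`, used here as a stand-in for the LightGBM and XGBoost mechanisms managed by Dataiku. The data are synthetic, and the patience of 10 rounds and the 20% validation fraction are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                           # synthetic features
y = 25 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)   # synthetic target

model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on boosting iterations
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,  # internal hold-out used to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
model.fit(X, y)
print(model.n_estimators_, "of", model.n_estimators, "rounds used")
```

The fitted attribute `n_estimators_` reports how many boosting rounds were actually kept, which plays the role that the epoch count plays for neural networks.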

We employed three complementary feature streams. First, HuBERT embeddings (Base model, 12 transformer layers) were extracted from speech recordings, capturing contextualized acoustic–phonetic information. Second, LLM-derived textual embeddings were obtained by prompting transcripts with both cognitive task (CT) and verbal task (VT) contexts; embeddings were generated using a transformer-based encoder and concatenated across prompts. Third, eGeMAPS acoustic features (88-dimensional, hand-crafted low-level descriptors) were extracted using the openSMILE toolkit. For composite systems, these representations were concatenated without additional dimensionality reduction, allowing the downstream models to determine feature relevance.

For modeling, we used three classical machine learning approaches with fixed configurations: (1) Gradient Boosted Trees (GBT) with 300 estimators, maximum depth of 6, and learning rate 0.1; (2) Random Forest (RF) with 200 estimators and maximum depth of 12; and (3) LightGBM with 500 boosting rounds, maximum depth of 7, and learning rate 0.05. All models used default class weighting and early stopping where available. Transformer-based features were used as frozen embeddings (no fine-tuning). Detailed hyperparameters are provided in Supplementary Table 11.
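The fixed configurations listed above could be instantiated as in the following sketch. It uses scikit-learn estimators for GBT and RF and adds LightGBM only if the package is installed; the actual Dataiku implementations may differ, and the data, split, and random seeds are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hyperparameters taken from the text; estimator classes are an assumption.
models = {
    "GBT": GradientBoostingRegressor(
        n_estimators=300, max_depth=6, learning_rate=0.1, random_state=0),
    "RF": RandomForestRegressor(n_estimators=200, max_depth=12, random_state=0),
}
try:  # LightGBM is treated as optional in this sketch
    from lightgbm import LGBMRegressor
    models["LightGBM"] = LGBMRegressor(
        n_estimators=500, max_depth=7, learning_rate=0.05, random_state=0)
except ImportError:
    pass

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 15))             # synthetic stand-in features
y = 25 + 2 * X[:, 0] + rng.normal(0, 1, 120)   # synthetic MMSE-like target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
print(scores)  # hold-out RMSE per model family
```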

3 Experiments and results

Table 1 presents a summary of our classification results on the multi-task speech data from the PROCESS test set, where participants were categorized into HC, MCI, or Dementia groups. In addition to the F1 score, we report standard metrics including accuracy, precision, and recall. Our regression results—predicting participants' cognitive performance as measured by the MMSE score (Cockrell and Folstein, 2002)—are shown in Table 2.
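For reference, the four classification metrics can be computed as in the sketch below. The labels are invented toy values, and macro-averaging over the three classes is an assumption, since the averaging scheme is not stated here.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for illustration only; not taken from the PROCESS data.
y_true = ["HC", "HC", "HC", "MCI", "MCI", "MCI", "Dementia", "Dementia"]
y_pred = ["HC", "HC", "MCI", "MCI", "MCI", "HC", "Dementia", "MCI"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # unweighted mean over classes

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Macro-averaging weights the three classes equally, which matters when the Dementia group is smaller than the HC and MCI groups.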

Overall, the findings highlight the efficacy of multi-modal feature integration, specifically the combination of HuBERT embeddings with LLM-derived linguistic features (HuBERT+LLM), in improving model performance relative to the baselines across both classification and regression tasks.

HuBERT and LLM features (transformer-based) played a central role in capturing higher-order temporal and semantic structure, which proved critical for forecasting cognitive scores, whereas eGeMAPS served as a traditional baseline.

In regression analysis (Table 2), LightGBM trained on HuBERT+LLM features yielded the lowest RMSE (2.7775), demonstrating superior prediction accuracy relative to models relying solely on handcrafted acoustic features (e.g., SVR RMSE = 4.4000) or RoBERTa-based regression models (RMSE = 2.9850).

This best-performing configuration, LightGBM Model 3, ranked 10th out of 80 submissions in the regression task. Its RMSE of 2.7775 represents a 7.5% improvement over the best baseline model, which was based on RoBERTa and reported an RMSE of 2.9850.
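The metric and the relative comparison can be made explicit with a short sketch. The helper `rmse` is a hypothetical name introduced for this example. Two common conventions for relative improvement are shown; the reported 7.5% appears to correspond to the second one, in which the gap is expressed relative to our own RMSE.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


ours, baseline = 2.7775, 2.9850  # RMSE values reported in the text

reduction_vs_baseline = (baseline - ours) / baseline  # gap relative to the baseline RMSE
gap_vs_ours = baseline / ours - 1                     # gap relative to our RMSE

print(f"{reduction_vs_baseline:.1%} {gap_vs_ours:.1%}")  # 7.0% 7.5%
```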

SHAP-based feature importance analysis of our LightGBM model (Supplementary Figure 2) shows that individual features exert only modest influence, with the most informative (e.g., SFT Semantic Clustering and Switching, PFT Total Valid Words, CTD Awareness of Key Details) contributing 2%–3% each. A larger number of semantic fluency and error-related variables (e.g., intrusion errors, phonemic fluency) contributed at the 1%–2% level. Their presence among the top-ranked predictors suggests that even relatively low-weighted LLM- and HuBERT-derived signals provide complementary value when integrated into the multimodal framework.
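One common way to obtain percentage contributions of this kind is to normalize the mean absolute SHAP value of each feature by the total. The sketch below applies that normalization to a synthetic SHAP matrix; whether the percentages above were computed in exactly this way is an assumption, and the feature names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical names

# Stand-in for an (n_samples, n_features) SHAP matrix, such as the output of
# shap.TreeExplainer(model).shap_values(X) for a fitted tree model.
shap_values = rng.normal(0, [0.3, 0.2, 0.1, 0.05], size=(21, 4))

mean_abs = np.abs(shap_values).mean(axis=0)  # global importance per feature
share = 100 * mean_abs / mean_abs.sum()      # percentage of total attribution

for name, pct in sorted(zip(feature_names, share), key=lambda t: -t[1]):
    print(f"{name}: {pct:.1f}%")
```

With several thousand input dimensions, shares of only 1% to 3% for the top features are consistent with importance being spread thinly across many predictors.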

In
