Speech pattern disorders in verbally fluent individuals with autism spectrum disorder: a machine learning analysis

Abstract

Introduction:

Diagnosing Autism Spectrum Disorder (ASD) in verbally fluent individuals based on speech patterns in examiner-patient dialogues is challenging because speech-related symptoms are often subtle and heterogeneous. This study aimed to identify distinctive speech characteristics associated with ASD by analyzing recorded dialogues from the Autism Diagnostic Observation Schedule (ADOS-2).

Methods:

We analyzed examiner-participant dialogues from ADOS-2 Module 4 and extracted 40 speech-related features categorized into intonation, volume, rate, pauses, spectral characteristics, chroma, and duration. These acoustic and prosodic features were processed using advanced speech analysis tools and used to train machine learning models to classify ASD participants into two subgroups: those with and without A2-defined speech pattern abnormalities. Model performance was evaluated using cross-validation and standard classification metrics.

Results:

Using all 40 features, the support vector machine (SVM) achieved an F1-score of 84.49%. After removing Mel-Frequency Cepstral Coefficients (MFCC) and Chroma features to focus on prosodic, rhythmic, energy, and selected spectral features aligned with ADOS-2 A2 scores, performance improved, achieving 85.77% accuracy and an F1-score of 86.27%. Spectral spread and spectral centroid emerged as key features in the reduced set, while MFCC 6 and Chroma 4 also contributed significantly in the full feature set.

Discussion:

These findings demonstrate that a compact, diverse set of non-MFCC and selected spectral features effectively characterizes speech abnormalities in verbally fluent individuals with ASD. The approach highlights the potential of context-aware, data-driven models to complement clinical assessments and enhance understanding of speech-related manifestations in ASD.

1 Introduction

Autism spectrum disorder (ASD) is a developmental condition that presents considerable challenges in social interaction, communication, and behavior (Leekam et al., 2011; Lord et al., 2018, 2020). In the United States, ASD affects approximately 1 in 36 children and 1 in 45 adults, making it a critical public health concern (Maenner, 2020; Dietz et al., 2020). Despite its prevalence, diagnosing ASD is complex, relying heavily on subjective assessments of behavior and the clinical expertise of specialists. This subjectivity can lead to inconsistencies in the accuracy and timing of diagnoses across regions and populations. These complexities are compounded by differences in diagnostic standards and healthcare availability across regions, resulting in delayed diagnoses and limiting early intervention opportunities for many families (Daniels and Mandell, 2014).

ASD diagnosis is traditionally conducted through clinical interviews and behavioral observations, often following standardized tools such as the Autism Diagnostic Observation Schedule (ADOS) (Lord et al., 1999). ADOS-2 consists of five modules, each tailored to different age groups and language abilities, ranging from nonverbal toddlers to verbally fluent adults. Module 1 is designed for minimally verbal children, Module 2 for those with some phrase speech, Module 3 for verbally fluent children and adolescents, Module 4 for verbally fluent adults, and the Toddler Module for children under 30 months of age. This structured approach allows clinicians to assess social communication, interaction, and restricted or repetitive behaviors across diverse developmental stages. However, these methods require extensive clinician expertise, leading to potential inconsistencies in diagnosis and accessibility issues in underserved areas (Matson and Kozlowski, 2011; Elsabbagh et al., 2012). ADOS-2 assessments require trained clinicians who can administer structured tasks, score behavioral responses, and interpret results based on standardized criteria. This specialized training is costly and time-intensive, contributing to a shortage of qualified professionals, especially in regions with limited healthcare resources. Moreover, ASD evaluations are often expensive, requiring multiple clinical visits, making it difficult for families in lower-income communities to access timely assessments. As a result, there is increasing interest in technology-driven approaches that can enhance diagnostic consistency and accessibility (Fletcher-Watson and Happé, 2019; Song et al., 2019; Rezaee, 2025).

One promising approach is the use of speech analysis for ASD detection. Speech is a fundamental mode of communication, and research suggests that individuals with ASD often exhibit distinctive speech characteristics, including atypical intonation, altered rhythm, abnormal speech rate, and variations in pitch modulation (Mody and Belliveau, 2013; Pickles et al., 2009; Vogindroukas et al., 2022; Martin and Rouas, 2024). These abnormalities can emerge in development, offering a potential biomarker for ASD diagnosis (Bonneh et al., 2011). Advances in computational speech processing enable precise analysis of these features, paving the way for non-invasive, scalable, and cost-effective diagnostic tools that could complement existing clinical methods.

Recent advancements in machine learning have further expanded the possibilities for ASD diagnosis by enabling automated detection of behavioral and linguistic patterns (Wang et al., 2015; Ruan et al., 2021, 2023; Zhang et al., 2022). For example, machine learning techniques have been applied to digital behavioral phenotyping (Perochon et al., 2023) and automated analysis of gestures and facial expressions from video recordings (Lakkapragada et al., 2022; Krishnappa Babu et al., 2023). Natural language processing (NLP) has also been applied to electronic health records to derive ASD phenotypes (Zhao et al., 2022). Speech features are increasingly recognized as digital biomarkers in clinical decision support (Sariyanidi et al., 2025). Advances in representation learning, such as GANs and self-supervised models, have demonstrated improved ASD speech recognition performance, even in data-limited conditions (Sohn et al., 2025; Al Futaisi et al., 2025). On a different scale, Rajagopalan et al. (2024) showed that robust prediction can be achieved with minimal feature sets across large cohorts, while multi-modal approaches such as facial expression analysis are emerging as valuable complements to speech-based diagnosis (Mahmood et al., 2025). Building on these successes, leveraging ML for speech analysis offers a promising and relatively unexplored direction in ASD diagnosis.

This study targets verbally fluent individuals assessed with ADOS-2 Module 4 and classifies participants with vs. without A2-defined speech abnormalities. Our goal is not to distinguish ASD from non-ASD; rather, we examine how machine learning can characterize speech-related abnormalities within this subgroup and how such models might complement clinical practice. This research focuses on the following key objectives:

Comprehensive speech feature extraction: we employed advanced signal processing techniques to extract 40 distinct speech features, grouped into prosodic, rhythmic, spectral, and energy-related categories, to capture subtle ASD-related speech patterns.

Machine learning-based classification: we applied machine learning models to classify participants with vs. without ADOS-2 A2-defined speech abnormalities, providing an objective framework for analyzing atypical prosody and rhythm.

Complementary clinical insight: rather than diagnosing ASD per se, this study evaluates whether acoustic speech features can support the characterization of speech abnormalities in verbally fluent individuals with ASD, serving as a data-driven complement to traditional clinical assessments.

This study represents a significant methodological advancement in diagnosis of speech abnormalities in ASD by integrating machine learning with detailed speech analysis. The use of a comprehensive set of speech features, combined with sophisticated machine learning techniques, offers a notable improvement over traditional diagnostic methods. This approach holds the potential for more accurate and earlier detection of ASD, which is critical for timely intervention. Ultimately, the research aims to contribute to personalized treatment and management strategies, enhancing outcomes for individuals with ASD and providing a scalable, objective solution for clinical use. This work focuses on autistic individuals assessed with ADOS-2 Module 4 (verbally fluent adolescents and adults); accordingly, findings pertain to this subgroup rather than the autism spectrum as a whole.

2 Methods

2.1 Caltech audio dataset

2.1.1 Autism Diagnostic Observation Schedule (ADOS)

The Autism Diagnostic Observation Schedule, Second Edition (ADOS-2) (Lord et al., 1999; American Psychiatric Association et al., 2013) is a widely used standardized instrument for diagnosing ASD. Module 4 of ADOS-2 is specifically designed for verbally fluent adolescents and adults, typically aged 16 and older, and differs from other modules intended for younger or non-verbal individuals. This study focuses on the A2 score, which assesses abnormalities in speech patterns, including intonation, volume, rate, and rhythm. Details for each A2 score level are provided in Table 1.

| Score | Description |
|---|---|
| 0 | Appropriately varying intonation, reasonable volume, and normal rate of speech, with regular rhythm coordinated with breathing. |
| 1 | Little variation in pitch and tone; rather flat or exaggerated intonation, but not obviously peculiar, OR slightly unusual volume, AND/OR speech that tends to be somewhat unusually slow, fast, or jerky. |
| 2 | Speech that is clearly abnormal for ANY of the following reasons: slow and halting; inappropriately rapid; jerky and irregular in rhythm (other than ordinary stutter/stammer), such that there is some interference with intelligibility; odd intonation or inappropriate pitch and stress; markedly flat and toneless ("mechanical"); consistently abnormal volume. |
| 7 | Stutter or stammer or other fluency disorder (if odd intonation is also present, code 1 or 2 accordingly). |

Table 1. Speech abnormalities associated with autism (intonation/volume/rhythm/rate).

2.1.2 ADOS interview audio dataset

The ADOS sessions were conducted sequentially, involving 15 structured scenario tasks designed to elicit responses across a range of communicative and social interactions (see Table 2). These tasks allow clinicians to capture meaningful speech and behavioral data, including intonation and speech rate, for analysis. In this study, the Caltech Audio Dataset (Zhang et al., 2022) includes 33 verbally fluent participants with ASD (26 male, 7 female), aged 16-37 years. The average age of ASD participants was 23.45 ± 4.76 years. Nine of these individuals were assessed twice, approximately six months apart, yielding a total of 42 recording sessions. As shown in Figure 1, 19 participants exhibited speech abnormalities (A2 ≥ 1), while 14 participants received an A2 score of 0. Based on this distribution, the recordings were grouped into ASD with vs. without speech-related abnormalities. To enhance granularity and contextual specificity, each session was further segmented into 15 structured scenario tasks, resulting in 42 × 15 = 630 scenario-level samples, which served as the basic units for subsequent binary classification analyses.

| Scenario | Name | Explanation |
|---|---|---|
| S1 | Construction Task | Involves the participant engaging in a task that requires constructing or assembling a set structure, testing spatial and motor skills, rather than communicative abilities. |
| S2 | Telling a Story from a Book | Primarily a monologic task where the participant recounts a story from a book, differing from spontaneous dialogic interactions. |
| S3 | Description of a Picture | Participants describe a picture, testing their ability to interpret visual information and articulate a coherent description. |
| S4 | Conversation and Reporting | Focuses on the ability to engage in back-and-forth conversation and to report on past events. |
| S5 | Current Work and School | Discusses participants' current educational and occupational engagements. |
| S6 | Social Difficulties and Annoyance | Elicits experiences of social challenges and annoyances. |
| S7 | Emotions | Requires participants to express and identify emotions. |
| S8 | Demonstration Task | Requires the participant to demonstrate how to use an item or explain a process, which does not involve interactive communication with an examiner. |
| S9 | Cartoons | Involves interpreting sequences and explaining cartoon strips. |
| S10 | Break | A pause or intermission in the assessment, involving no communicative or cognitive tasks. |
| S11 | Daily Living | Covers daily routines and personal care tasks. |
| S12 | Friends, Relationships, and Marriage | Discusses personal relationships and social norms regarding friendships and marital status. |
| S13 | Loneliness | Addresses feelings and situations of loneliness and isolation. |
| S14 | Plans and Hopes | Involves discussing future aspirations and plans. |
| S15 | Creating a Story | Tests creative storytelling abilities in an unstructured task. |

Table 2. Overview of scenario tasks in the ADOS-2 Module 4 diagnostic process.

Figure 1. Distribution of ADOS-2 Module 4 A2 scores across subjects (0 = normal intonation, 1 = mildly atypical intonation, 2 = markedly atypical intonation). The bar chart shows A2 score (0, 1, 2) on the horizontal axis and number of subjects on the vertical axis; scores of 0 and 2 have around 14 subjects each, and score 1 has 16 subjects.

In addition, although the age range (16-37 years) may overlap with vocal maturation for some participants, we did not explicitly control for or model potential pubertal voice changes. Because our feature set includes acoustic descriptors (e.g., spectral measures), such effects cannot be fully ruled out; we therefore acknowledge this as a limitation and a direction for future, age-stratified analyses.

2.2 Feature extraction for identification of autism speech disorder

Feature extraction plays a crucial role in the analysis of speech data, especially in understanding complex disorders like ASD. It involves quantifying various aspects of speech that may reveal traits associated with ASD. For this study, a comprehensive set of speech features was extracted from recorded dialogues, grouped based on their relevance to ASD. Prosodic speech features, including the number of syllables, pauses, rate of speech, articulation rate, speaking duration, original duration, balance, and frequency, were extracted using the “Myprosody” tool (Shahab, 2025). This tool integrates multiple speech feature extraction methods, providing a detailed analysis of prosodic elements. Additionally, features such as Mel-Frequency Cepstral Coefficients (MFCCs), spectrograms, and chromagrams were extracted using “pyAudioAnalysis” (Giannakopoulos, 2015), enriching the dataset with diverse audio representations that are essential for analyzing ASD-related speech patterns. These features are described below and summarized in Table 3.

| No. | Category | Features | Explanation | Count |
|---|---|---|---|---|
| 1 | Intonation | Frequency | Fundamental frequency, related to the pitch of the voice. | 1 |
| | | MFCCs | Mel Frequency Cepstral Coefficients, capture timbral aspects that are crucial for intonation. | 13 |
| 2 | Volume | Energy | Measures the signal's loudness. | 1 |
| | | Entropy of Energy | Indicates variation in loudness within a frame. | 1 |
| 3 | Rhythm | Zero Crossing Rate (ZCR) | Reflects the number of times the waveform crosses zero, related to the frequency of the signal. | 1 |
| 4 | Rate | Rate of Speech | Measures how fast words are spoken. | 1 |
| | | Number of Syllables | Counts the syllables, indicating speech density and pace. | 1 |
| 5 | Pause | Number of Pauses | Total pauses, reflecting speech interruptions and flow. | 1 |
| | | Balance | Ratio of speaking to pausing, indicates rhythmic flow. | 1 |
| 6 | Spectral | Spectral Centroid | Center of gravity, affects perceived pitch and sharpness. | 1 |
| | | Spectral Spread | Measures the width of the spectrum, related to the sharpness of sound. | 1 |
| | | Spectral Rolloff | The frequency below which 90% of energy lies, indicates the shape. | 1 |
| | | Spectral Flux | Measures the changes between frames, indicates rhythm changes. | 1 |
| | | Spectral Entropy | Reflects the entropy of spectral distribution, a complexity measure. | 1 |
| 7 | Chroma | Chroma | A set of 12 coefficients each representing a semitone within an octave, used in harmony analysis. | 12 |
| 8 | Duration | Speaking Duration | Measures speaking time (excluding fillers and pause). | 1 |
| | | Original Duration | Measures speaking time (including fillers and pause). | 1 |

Table 3. Detailed categorization of speech features into relevant categories, with explanations and specific feature counts, tailored for comprehensive speech pattern analysis in clinical assessments such as autism.

Each category of features captures different characteristics of speech that are potentially altered in ASD:

- Prosody features such as pitch (fundamental frequency) variations and speech rate are directly related to the emotional and syntactical aspects of speech, which are often atypical in ASD.

- Energy and Zero Crossing Rate provide basic information about the speech amplitude and frequency, which are useful for detecting abnormalities in speech loudness and pitch changes.

- Spectral and Chroma features reflect the quality of sound and harmony in speech. These features are sophisticated and can detect subtleties in speech that are not apparent through simple auditory observation.

- MFCCs and their deltas offer a robust representation of speech based on the human auditory system's perception of the frequency scales, essential for identifying nuanced discrepancies in how individuals with ASD perceive and produce sounds.

By analyzing these features using machine learning models, we aim to identify patterns that are indicative of ASD, thereby assisting in the objective and efficient diagnosis of the disorder.
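To make a few of the descriptors above concrete, the following is a minimal NumPy sketch that computes energy, zero crossing rate, spectral centroid, and spectral spread for a single frame. It is an illustration under stated assumptions, not the pyAudioAnalysis or Myprosody implementation: the function name `frame_features`, the 16 kHz sampling rate, the 50 ms frame, and the synthetic 440 Hz tone are all hypothetical, and the centroid and spread are reported in Hz rather than in any library-specific normalization.

```python
import numpy as np


def frame_features(frame: np.ndarray, fs: int) -> dict:
    """Compute four short-term descriptors for one audio frame (illustrative)."""
    # Energy: mean squared amplitude of the frame.
    energy = float(np.mean(frame ** 2))
    # Zero crossing rate: fraction of adjacent sample pairs whose sign differs.
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    # Magnitude spectrum over non-negative frequencies.
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Normalize the spectrum so it can be treated as a distribution over frequency.
    weights = magnitude / (magnitude.sum() + 1e-12)
    # Spectral centroid: magnitude-weighted mean frequency (Hz).
    centroid = float(np.sum(freqs * weights))
    # Spectral spread: magnitude-weighted standard deviation around the centroid (Hz).
    spread = float(np.sqrt(np.sum(((freqs - centroid) ** 2) * weights)))
    return {"energy": energy, "zcr": zcr, "centroid": centroid, "spread": spread}


# Hypothetical input: a 50 ms frame of a pure 440 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 440 * t)
features = frame_features(tone, fs)
print(features)
```

For a pure tone the centroid sits at the tone frequency and the spread is close to zero; real speech frames would show a broader spectrum and therefore a larger spread.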

2.3 Classification models for diagnosis of speech abnormalities in ASD and analysis

To classify ASD-related speech patterns, we employed six machine learning algorithms, selected based on their effectiveness in speech processing and biomedical signal classification. The classification process follows three major stages:

(1) Model selection based on suitability for structured and unstructured speech features,

(2) Feature selection and optimization to improve performance, and

(3) Model interpretability to analyze which speech features contribute most to classification.

Model selection rationale

Each model was selected based on its unique advantages in handling high-dimensional, speech-derived features:

Support Vector Machine (SVM) (Cortes, 1995): Works well in high-dimensional spaces and can handle non-linear decision boundaries using Radial Basis Function (RBF) kernels.

Random Forest (RF) (Breiman, 2001): An ensemble learning approach that enhances prediction stability by aggregating multiple decision trees.

Gradient Boosting (GB) (Friedman, 2001): Sequentially builds trees to correct errors of previous iterations, optimizing for complex non-linear relationships.

Adaptive Boosting (AdaBoost) (Freund and Schapire, 1997): Assigns higher weights to misclassified samples, which can improve generalization but makes the model sensitive to noisy samples.

K-Nearest Neighbors (KNN) (Fix and Hodges, 1951): A distance-based classifier, useful when labels have well-separated clusters in feature space.

Naïve Bayes (NB) (Rish et al., 2001): A probabilistic model assuming feature independence, known for fast training and robust results in speech applications.

Each model was implemented in Python (Scikit-Learn) and trained using 5-fold cross-validation to assess robustness.

Hyperparameter tuning

Hyperparameters were optimized using grid search and random search techniques:

Grid Search: Exhaustive search of pre-defined parameter sets for SVM, Random Forest, and Boosting models.

Random Search: Used for KNN and AdaBoost, where sampling over parameter space provides efficient exploration.
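The two search strategies can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. This is a hedged illustration only: the parameter grids, the F1 scoring choice, the number of random draws, and the synthetic data from `make_classification` are assumptions made for the example and are not the search spaces used in the study.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a 15-feature speech dataset (hypothetical).
X, y = make_classification(n_samples=200, n_features=15, random_state=0)

# Grid search: every combination in a pre-defined grid is evaluated.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["scale", "auto"]},  # assumed grid
    scoring="f1",
    cv=5,
)
grid.fit(X, y)

# Random search: a fixed number of configurations is sampled from distributions.
rand = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "n_neighbors": randint(1, 16),        # assumed range: 1 to 15 neighbors
        "weights": ["uniform", "distance"],
    },
    n_iter=10,
    scoring="f1",
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print("Best SVM parameters:", grid.best_params_)
print("Best KNN parameters:", rand.best_params_)
```

With grouped data such as repeated recordings from the same participant, the `cv` argument would need a group-aware splitter so that tuning does not leak participant identity across folds.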

Each model's hyperparameter settings are detailed in Table 4.

| Model | Hyperparameters |
|---|---|
| SVM | C=0.1, Kernel=RBF, Gamma=scale, Tolerance=1e-3, Max Iterations=-1 |
| RF | Trees=100, Max Depth=None, Min Samples Split=10, Min Samples Leaf=5, Bootstrap=True |
| GB | Learning Rate=0.1, Trees=100, Max Depth=3, Min Samples Split=5, Subsample=0.8 |
| AdaBoost | Estimators=50, Learning Rate=1.0, Base Estimator=Decision Stump, Algorithm=SAMME.R |
| KNN | K=5, Distance=Euclidean, Weights=Uniform, Algorithm=Auto, Leaf Size=30 |
| NB | Distribution=Gaussian, Variance Smoothing=1e-9 |

Table 4. Machine learning models and hyperparameter settings for ASD classification.
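As a rough guide to how the listed settings map onto scikit-learn estimators, the six models could be instantiated as follows. The mapping of table entries to constructor arguments is our reading of the settings, and `random_state=0` is an assumption added for reproducibility. The AdaBoost `Algorithm=SAMME.R` option is left out because recent scikit-learn releases deprecate and then remove that argument; the default base estimator is already a depth-1 decision tree (a decision stump).

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "SVM": SVC(C=0.1, kernel="rbf", gamma="scale", tol=1e-3, max_iter=-1),
    "RF": RandomForestClassifier(
        n_estimators=100,
        max_depth=None,
        min_samples_split=10,
        min_samples_leaf=5,
        bootstrap=True,
        random_state=0,  # assumption: not stated in the source
    ),
    "GB": GradientBoostingClassifier(
        learning_rate=0.1,
        n_estimators=100,
        max_depth=3,
        min_samples_split=5,
        subsample=0.8,
        random_state=0,  # assumption: not stated in the source
    ),
    # Default base estimator is a decision stump; the algorithm argument is
    # omitted on purpose (see the note above).
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0),
    "KNN": KNeighborsClassifier(
        n_neighbors=5,
        metric="euclidean",
        weights="uniform",
        algorithm="auto",
        leaf_size=30,
    ),
    "NB": GaussianNB(var_smoothing=1e-9),
}

for name, model in models.items():
    print(name, type(model).__name__)
```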

The performance of each model was assessed using multiple metrics, including accuracy, precision, recall, and F1-score, calculated through cross-validation across the dataset.

In addition, we employed 5-fold GroupKFold cross-validation to evaluate model performance, ensuring that recordings from the same participant were not split across folds. This choice was made to balance bias and variance in model evaluation, given the limited dataset size.
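A participant-grouped evaluation of this kind can be sketched as below. The sample bookkeeping follows the counts given earlier (33 participants, 9 of them assessed twice, 15 tasks per session, 630 samples), but everything else is assumed for illustration: the features are random numbers, the assignment of labels to specific participants is arbitrary, and the `StandardScaler` step is our addition rather than a documented part of the pipeline.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_participants, n_tasks, n_features = 33, 15, 15

# 9 participants have two sessions, 24 have one: 42 sessions in total.
sessions_per_participant = np.array([2] * 9 + [1] * 24)
participant_of_session = np.repeat(np.arange(n_participants), sessions_per_participant)

# Each session contributes one sample per scenario task: 42 x 15 = 630 samples.
groups = np.repeat(participant_of_session, n_tasks)

# Hypothetical labels: 19 participants with A2 >= 1, 14 with A2 = 0.
label_per_participant = np.array([1] * 19 + [0] * 14)
y = label_per_participant[groups]

# Synthetic features with a small class-dependent shift (not real speech data).
X = rng.normal(size=(len(groups), n_features)) + 0.8 * y[:, None]

cv = GroupKFold(n_splits=5)
pipeline = make_pipeline(StandardScaler(), SVC(C=0.1, kernel="rbf", gamma="scale"))
scores = cross_validate(
    pipeline, X, y, groups=groups, cv=cv, scoring=["accuracy", "f1"]
)

# Confirm that no participant appears in both the training and the test split.
no_leakage = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in cv.split(X, y, groups)
)
print("Mean accuracy:", scores["test_accuracy"].mean())
print("Participant leakage avoided:", no_leakage)
```

Grouping by participant rather than by session matters here because the nine participants with two sessions would otherwise contribute near-duplicate voices to both sides of a split.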

2.4 Feature importance evaluation

To enhance transparency in ASD classification, we applied several interpretability techniques to analyze feature contributions. Shapley Additive Explanations (SHAP) (Lundberg and Lee, 2017) was employed to estimate the impact of each speech feature on model predictions. SHAP values were computed for all samples, allowing us to examine both individual and global feature influences. SHAP was chosen because it provides consistent, theoretically grounded attributions that are model-agnostic, making it especially suitable for comparing feature relevance across diverse classifiers (e.g., SVM, Random Forest, Gradient Boosting). Alternative methods such as LIME, permutation importance, or partial dependence plots (PDP) were considered; however, SHAP was prioritized due to its ability to capture both local and global interpretability in a unified framework. We acknowledge that SHAP is computationally more expensive than these alternatives, and this aspect is discussed further in the Limitations section. This approach provided insight into how changes in speech characteristics affect classification probability, facilitating a better understanding of model decisions.

For tree-based models such as Random Forest and Gradient Boosting, feature importance was derived using the Mean Decrease in Impurity (MDI) metric. This method ranks features based on their contribution to reducing uncertainty in classification. Additionally, we applied permutation importance to models that do not natively provide feature rankings, such as SVM and KNN. By randomly shuffling each feature and measuring its effect on model performance, we identified the most influential features for ASD classification.
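The two ranking approaches described here can be illustrated with scikit-learn alone. In this sketch the data are synthetic, with only the first of five features carrying signal, so both rankings should place that feature first. SHAP is not shown because it relies on a separate library; the split ratio, number of repeats, and seeds are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)  # synthetic labels: only feature 0 is informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Mean Decrease in Impurity: available directly on fitted tree ensembles.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
mdi = forest.feature_importances_

# Permutation importance: shuffle one feature at a time on held-out data and
# record the drop in score; works for models without native rankings.
svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
perm = permutation_importance(svm, X_test, y_test, n_repeats=20, random_state=0)
perm_mean = perm.importances_mean

print("MDI ranking:", np.argsort(mdi)[::-1])
print("Permutation ranking:", np.argsort(perm_mean)[::-1])
```

Permutation importance is computed on held-out data here; computing it on training data tends to overstate the importance of features the model has overfit.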

Given that ADOS-2 Module 4 consists of 15 structured tasks, we conducted a scenario-specific feature analysis to investigate whether feature importance varies across different conversational contexts. This analysis involved computing SHAP values separately for each task, allowing us to assess how models rely on specific speech features under varying conditions.

To further interpret model decisions, we incorporated visualization techniques, including SHAP summary plots, feature importance rankings, and scenario-wise importance heatmaps. These visual tools help illustrate patterns in speech-related features and aid in understanding how classification decisions are made. By integrating multiple interpretability methods, we aimed to ensure that our models remain transparent and suitable for potential clinical applications.

The combination of SHAP analysis, feature ranking, and visualization techniques allows for a comprehensive assessment of model behavior. These interpretability methods provide essential insights for refining ASD classification models, validating the consistency of learned patterns, and supporting future improvements in automated diagnostic tools.

3 Results

3.1 Experimental setup

To evaluate model performance, we applied a supervised classification framework using the extracted speech features. All experiments were conducted in Python (Scikit-learn) with 5-fold cross-validation to ensure robustness and reduce overfitting. Models were trained and tested on both feature sets described in Section 2.3 (the full 40-feature set and the reduced 15-feature set). We assessed diagnostic performance using four standard classification metrics:

Accuracy: The proportion of correctly classified samples out of all samples.

Precision: The proportion of predicted positive cases that are true positives, measuring the reliability of positive predictions.

Recall (Sensitivity): The proportion of true positive cases correctly identified, reflecting the ability to capture actual positive cases (here, participants with A2-defined speech abnormalities).

F1-score: The harmonic mean of precision and recall, balancing the trade-off between false positives and false negatives.

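Formally, given true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$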

These metrics are widely used in medical classification tasks and provide complementary perspectives on diagnostic reliability. Accuracy summarizes overall performance, precision emphasizes avoiding false positives, recall emphasizes capturing true cases, and the F1-score balances both aspects.
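As a small worked example of the four metrics, the function below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 100 samples.
example = classification_metrics(tp=40, fp=5, fn=10, tn=45)
print(example)
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and the F1-score is 80/95, which shows how F1 falls between precision and recall when the two differ.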

3.2 Analysis of speech pattern features

To explore the relationships between these features, we calculated Pearson correlation coefficients, measuring the degree and direction of linear relationships (see Figure 2). This approach is crucial for identifying redundancies, interdependencies, and unique contributions of each feature, which can enhance model interpretability and performance by mitigating multicollinearity. Several notable patterns emerge:

High Correlation Among Rate-Based Features: The rate of speech and articulation rate are strongly correlated, confirming that faster speech naturally leads to a greater number of syllables articulated per unit time. This redundancy suggests that only one of these features may be necessary for robust classification.

Duration and Pause-Related Measures: Speaking duration, original duration, and balance also show moderate-to-strong correlations, reflecting the intertwined nature of fluency, pause frequency, and overall timing. Longer utterances often correspond with proportionally longer pauses, which are captured in the balance measure.

Spectral and Prosodic Overlap: Several spectral features (e.g., spectral spread, centroid, and flux) cluster together, indicating they capture related aspects of energy distribution and spectral sharpness. This suggests potential dimensionality reduction opportunities for spectral descriptors.

Zero Crossing Rate (ZCR): Notably, ZCR exhibits a relatively high correlation with spectral flux and spectral centroid. This indicates that temporal fluctuations in signal polarity are linked to changes in frequency distribution and energy transitions. Since ZCR is a simple and computationally inexpensive measure, its strong correlation with more complex spectral descriptors suggests it may serve as a lightweight proxy for certain spectral dynamics in ASD-related speech analysis.

MFCC and Chroma Clusters: MFCCs are highly intercorrelated, as expected given their derivation from the same cepstral representation. Similarly, the 12 Chroma features show block-wise correlations, particularly between adjacent chroma bands, reflecting harmonic relationships inherent in speech tonality.

Figure 2. Heatmap of Pearson correlation coefficients among all extracted speech features, including Number of Syllables, Speech Rate, and MFCCs, among others. The color scale ranges from -1.0 to 1.0 and represents the strength and direction of correlations (red = strong positive, blue = strong negative); the diagonal of red squares reflects each feature's correlation with itself.

These findings highlight redundancy across certain features (e.g., rate measures, MFCCs, Chroma coefficients) as well as unique contributions (e.g., ZCR, spectral spread). This informed our decision to test both a full 40-feature set and a reduced 15-feature set, ensuring that classification models are not unduly biased by collinear predictors.
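The redundancy check described in this section can be sketched as a Pearson correlation matrix followed by a scan for highly correlated pairs. The three feature columns below are synthetic stand-ins, and the 0.9 threshold is an assumed cut-off for the example; the source does not state a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic stand-ins: articulation rate is built to track speech rate closely,
# while the third column is independent noise.
speech_rate = rng.normal(size=n_samples)
articulation_rate = speech_rate + 0.1 * rng.normal(size=n_samples)
unrelated = rng.normal(size=n_samples)

names = ["speech_rate", "articulation_rate", "unrelated"]
X = np.column_stack([speech_rate, articulation_rate, unrelated])

# Pearson correlation between columns (features), not rows (samples).
corr = np.corrcoef(X, rowvar=False)

threshold = 0.9  # assumed cut-off for flagging redundancy
redundant_pairs = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(redundant_pairs)
```

A flagged pair signals that one of the two features may be dropped or that the pair should be interpreted jointly in any importance analysis.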

3.3 Classification and analysis of ASD using speech features

In this study, two distinct feature sets were used for classification: (1) all 40 features (including MFCCs and Chroma), and (2) 15 selected features after excluding MFCCs and Chroma. This comparison allows us to assess whether the MFCC and Chroma features are necessary for classification, especially in cases where computational simplicity is prioritized.
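The arithmetic behind the two feature sets can be checked with a short sketch: the full set contains 13 MFCCs and 12 Chroma coefficients, so removing them leaves 40 - 13 - 12 = 15 features. The column names below are hypothetical labels chosen for the example; only the per-category counts come from the feature table.

```python
# Hypothetical column names; counts per category follow the feature table.
full_features = (
    ["f0"]                                        # fundamental frequency
    + [f"mfcc_{i}" for i in range(1, 14)]         # 13 MFCCs
    + ["energy", "energy_entropy"]                # volume
    + ["zcr"]                                     # rhythm
    + ["speech_rate", "n_syllables"]              # rate
    + ["n_pauses", "balance"]                     # pause
    + [
        "spectral_centroid",
        "spectral_spread",
        "spectral_rolloff",
        "spectral_flux",
        "spectral_entropy",
    ]
    + [f"chroma_{i}" for i in range(1, 13)]       # 12 Chroma coefficients
    + ["speaking_duration", "original_duration"]  # duration
)

# Reduced set: drop every MFCC and Chroma column.
reduced_features = [
    name for name in full_features if not name.startswith(("mfcc_", "chroma_"))
]

print(len(full_features), len(reduced_features))
```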

Results with all 40 features: Table 5 summarizes model performance when using all 40 features. SVM achieved the highest F1-score (84.49%) and the highest precision (90.39%), while Random Forest achieved the highest accuracy (85.05%). The SVM result underscores its robustness in capturing nuanced ASD-related speech patterns across a comprehensive feature set.

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| SVM | 0.8360 ± 0.1334 | 0.9039 ± 0.0788 | 0.7974 ± 0.1523 | 0.8449 ± 0.1192 |
| Random Forest | 0.8505 ± 0.0899 | 0.8733 ± 0.0481 | 0.8215 ± 0.1190 | 0.8423 ± 0.0724 |
| AdaBoost | 0.8253 ± 0.0815 | 0.8197 ± 0.0904 | 0.8190 ± 0.1100 | 0.8153 ± 0.0842 |
| Naive Bayes | 0.7776 ± 0.0906 | 0.7542 ± 0.0833 | 0.7800 ± 0.1216 | 0.7630 ± 0.0885 |
| KNN | 0.8349 ± 0.0912 | 0.8153 ± 0.0411 | 0.8202 ± 0.1232 | 0.8146 ± 0.0752 |
| Gradient Boosting | 0.8415 ± 0.0751 | 0.8296 ± 0.0741 | 0.8318 ± 0.1094 | |
