Polarity-considered EEG microstates improve classification accuracy of oddball stimulus

Abstract

Brain–computer interfaces (BCIs) require efficient feature extraction and dimensionality reduction from high-dimensional neural signals. Electroencephalogram (EEG) microstate analysis is a rapid and noise-resistant approach that classifies instantaneous EEG states into several spatial distribution patterns (templates). Previous BCI studies using the EEG microstate approach have typically used aggregated metrics, such as duration, frequency of occurrence, or time coverage, and have rarely applied pointwise microstate labeling as temporally ordered, one-dimensional sequences for robust classification. Moreover, the physiological relevance of EEG topographic polarity has often been overlooked, despite its potential to reveal smoother state transitions and align with event-related potential components. In this study, we applied polarity-considered microstate labeling to stimulus-driven classification in an oddball paradigm. EEG data from 40 healthy participants (20 per response type) were analyzed across three factors: stimulus modality (auditory or visual), modality condition (unimodal or cross-modal), and response type (key-response task or mental counting task). Preprocessed 32-channel EEG data were labeled with microstate templates (A–E ± topographical polarity) using a winner-take-all approach, and the resulting sequences were classified using multiple machine-learning models. The results showed that tree-based ensemble models (Random Forest, XGBoost, and CatBoost) achieved the most stable and accurate performance in the key-response task with cross-modal visual targets. These models reached an area under the receiver operating characteristic curve above 0.8 and a mean F1 score of 0.83. Preserving polarity improved classification by approximately 20% across tasks, doubling the label-space granularity and revealing temporal patterns aligned with the N200 and P300 components. 
Visual stimuli generally outperformed auditory stimuli, and cross-modal benefits emerged primarily in key-response tasks. These findings demonstrate that polarity-considered microstate labeling enhances classification accuracy and interpretability in BCIs. This method highlights the potential for real-time applications, such as P300 spellers and multimodal attention monitoring.

1 Introduction

Brain–computer interface (BCI) systems analyze brain signals to control external devices and are promising for a wide range of applications, including medical care and daily life support. A key requirement for BCI implementation is the extraction and dimensionality reduction of the features from high-dimensional brain signals. The P300 speller is a character input system that utilizes event-related potentials (ERPs) elicited by focusing attention on flashing characters, as first proposed by Farwell and Donchin (1988). Although BCI systems are considered practical for individuals with severe motor impairments such as those with amyotrophic lateral sclerosis, they have not yet achieved widespread usability. A major barrier is the limited classification performance and detection accuracy of the P300 component in single-trial analyses. Farwell and Donchin (1988) identified the Pz electrode (located over the parietal region) as the primary site for observing the P300 component, and Rezeika et al. (2018) emphasized its parietal predominance. However, Krusienski et al. (2008) demonstrated that using several electrodes improved classification accuracy. These findings suggest that expanding spatial coverage can enhance P300 detection performance.

Electroencephalogram (EEG) microstate analysis (Lehmann et al., 1987; Michel and Koenig, 2018) classifies multi-channel EEG signals into spatial patterns (templates) and extracts the characteristics of each template as features. This approach facilitates the interpretation of brain activity during task performance. Moreover, employing microstate analysis with many electrodes enhances the interpretability of brain states involved in EEG-based classification. This study applied EEG microstate analysis in a BCI system to detect P300 components and extract features reflecting more comprehensive brain dynamics.

Existing BCI studies utilizing EEG microstate analysis (Cui et al., 2023; Xiong et al., 2024; Wang et al., 2023; Kim and Kim, 2020; Zhao et al., 2025) have typically generated four to five microstate templates and extracted low-dimensional features such as mean duration, frequency of occurrence, time coverage, and transition probability. Cui et al. (2023) examined inter-subject variability in motor imagery BCI performance by analyzing four types of EEG microstates using the above metrics. However, to the best of our knowledge, EEG microstate labeling has not been previously applied to BCI systems, in which each EEG time point is assigned to the template with the highest spatial correlation. Although Zhao et al. (2025) employed sequence-to-sequence deep learning to predict microstate transitions for online applications, their approach targeted temporal forecasting rather than stimulus classification. The EEG microstate labeling approach enables the extraction of one-dimensional features that are computationally efficient and robust to noise, facilitating the practical deployment of future BCI systems.

Previous studies (Lehmann et al., 1987; Michel and Koenig, 2018; Cui et al., 2023; Xiong et al., 2024; Wang et al., 2023; Kim and Kim, 2020; Zhao et al., 2025) have typically ignored the polarity (positive/negative) of EEG signals in microstate analyses. However, Kashihara et al. (2025a) proposed that incorporating polarity into microstate labeling would enable smoother transitions and improve the detection of age-related changes in brain dynamics. Based on this insight, we hypothesized that incorporating polarity into microstate templates by adding polarity-inverted versions of the existing templates would allow the extraction of more informative features. Mahini et al. (2024) suggested that the topographical changes observed in the oddball response, particularly components such as N200 and P300, may reflect polarity-inverted isomorphic states. Therefore, this study introduces polarity-considered microstate labeling to achieve more effective feature extraction. Kashihara et al. (2025a) suggested that polarity reflects neural sources and structural constraints. Hence, the inclusion of polarity may allow microstates to capture neural activity with greater physiological significance. In addition, it is important to examine differences between conventional ERP measures based on raw EEG signals and ERP-like representations derived from microstate labeling. To this end, this study compared microstate sequences across stimulus conditions and analyzed time-resolved microstate label distributions. These analyses aimed to visualize structural differences in stimulus response patterns and identify brain states associated with successful recognition.

The oddball paradigm (Squires et al., 1975) is a widely used cognitive task for studying P300 responses in BCI research. This paradigm randomly presents infrequent (“target”) stimuli among frequent (“standard”) stimuli. We explored stimulus classification within the oddball paradigm to develop a BCI system incorporating polarity-considered microstate labeling for improved classification accuracy. We hypothesized that refining the task environment would lead to further improvements in performance. Kashihara et al. (2025b) used three experimental factors in the oddball paradigm: (i) the modality of infrequent stimuli (auditory or visual), (ii) the combination of frequent and infrequent stimuli (unimodal or cross-modal), and (iii) the type of response to infrequent stimuli (key-response task or count task). Existing studies are inconsistent regarding whether visual (Katayama and Polich, 1999; Polich and Heine, 1996) or auditory stimuli (Bennington and Polich, 1999) elicit stronger responses. Research on stimulus combinations (Brown et al., 2006; Brown et al., 2007) suggests that late components, including the P300, tend to be enhanced under cross-modal conditions, whereas early components remain unaffected. Regarding response type, some studies have reported enhancement of the P300 with key-response tasks (Brázdil et al., 2003; Kotchoubey, 2014), others with count tasks (Barrett et al., 1987; Salisbury et al., 2001), and still others have reported no significant difference (Starr et al., 1995). Nevertheless, from the perspective of classification performance in oddball-based BCIs, cross-modal combinations of frequent and infrequent stimuli offer potential advantages. These experimental factors are known to influence ERP components, and evidence suggests that polarity-considered microstate labeling can capture similar brain responses (Kashihara et al., 2025b).
Accordingly, applying polarity-considered microstate labeling to classification tasks under varied experimental conditions may help identify optimal task settings for this method.

Therefore, the present study focused on the following three objectives: (I) to evaluate the classification performance of P300 responses using microstate labeling, which differs from conventional approaches; (II) to investigate the potential improvement in performance achieved by utilizing polarity-considered microstate labeling, a recent advancement in the field; and (III) to examine the correspondence between classification performance under different experimental conditions in the oddball paradigm and the application of these labeling methods. For objective (I), we evaluated whether the classification performance using one-dimensional microstate sequences exceeded the chance level. Although linear classifiers, such as linear discriminant analysis and stepwise linear discriminant analysis (SWLDA), have been widely adopted for P300 classification, with SWLDA achieving over 90% accuracy in previous studies (Krusienski et al., 2008), this study compares a broader set of models (Support Vector Machine: SVM, Random Forest, Logistic Regression, XGBoost, CatBoost, and K-means). For objective (II), we evaluated whether polarity-considered labeling performed better than conventional polarity-ignored labeling. Additionally, we explored the neuroscientific underpinnings of this approach from the perspective of the topographical dynamics of brain activity. For objective (III), we investigated whether known differences in ERP responses under various oddball conditions, such as stimulus modality, modality condition, and response type, were reflected in the classification performance using polarity-considered microstate labeling. Based on these objectives, we discuss the implications of our results for future applications in BCI system development.

2 Materials and methods

2.1 Experiment data

For classification, we used data from the oddball task conducted by Kashihara et al. (2025b). Two datasets were collected, each from 20 healthy participants in their 20s and 30s (40 participants in total). These datasets were approved by the Ethics Committee of ATR (approval numbers: 21–144 and 21–143). The experiment was designed to examine the three factors described in Section 1. Four stimulus conditions were used: auditory only (unimodal auditory: uniA), visual only (unimodal visual: uniV), high-frequency visual with low-frequency auditory (cross-modal auditory target: croA), and high-frequency auditory with low-frequency visual (cross-modal visual target: croV). The modality conditions were counterbalanced across participants. Auditory stimuli consisted of pure tones at 1,000 Hz for standard stimuli and 2,000 Hz for target stimuli. The visual stimuli consisted of circles as the standard stimuli and stars as the target stimuli. Two task groups were defined for responses to target stimuli: the key-response task (keyresTask), in which participants (all right-handed) were instructed to press the spacebar with their right index or middle finger as quickly and accurately as possible, and the counting task (countTask), in which participants silently counted the number of target stimuli. In the countTask, stimulus frequency assignments were counterbalanced across participants. The keyresTask group included 12 females and eight males (mean age = 30.7 years, SD = 6.9), and the countTask group included 10 females and 10 males (mean age = 25.0 years, SD = 5.3). EEG recordings were performed using an R-Net 32-channel system (Brain Products GmbH, Gilching, Germany) and a BrainAmp MR plus amplifier (Brain Products GmbH). The 32 electrode positions followed the international 10–10 system (Fp1, Fp2, Fz, F3, F4, F7, F8, F9, F10, FC1, FC2, FC5, FC6, Cz, C3, C4, T7, T8, CP1, CP2, CP5, CP6, Pz, P3, P4, P7, P8, P9, P10, Oz, O1, and O2).
The EEG sampling rate was set to 500 Hz for the keyresTask and 5,000 Hz for the countTask. Each trial consisted of a 200 ms stimulus presentation and a 1,000 ms interstimulus interval. A total of 320 standard and 80 target stimuli were presented, with target stimuli pseudo-randomized to avoid consecutive occurrences. In keyresTask, trials containing errors, either false alarms (incorrect key press) or misses (no response), were excluded from the analysis. After artifact rejection, the minimum numbers of valid standard and target trials across participants were 312 and 68, respectively. Further details can be found in a study by Kashihara et al. (2025b).

2.2 Preprocessing

The EEG signals acquired in the experiment were resampled at 1000 Hz. A bandpass filter from 1 to 45 Hz was then applied using a finite impulse response filter. Subsequently, bad-channel detection and rejection as well as Artifact Subspace Reconstruction (ASR) (Mullen et al., 2015) were performed using the EEGLAB clean_rawdata plugin. EEG channels were identified as bad and removed when any of the following criteria were satisfied (keyresTask: mean number of removed channels = 1.50, SD = 1.54; countTask: mean number of removed channels = 1.83, SD = 1.63): a flat signal persisting for more than 5 s, line noise exceeding 4 standard deviations relative to the total channel signal, or a correlation coefficient with neighboring channels of less than 0.85. Subsequently, ASR was applied to reconstruct nonstationary noise. Sliding windows with RMS values exceeding 10 standard deviations relative to an automatically identified clean calibration dataset were reconstructed based on the remaining components. Next, the data were re-referenced to the average reference, after which Independent Component Analysis (ICA) was performed using Adaptive Mixture ICA (AMICA) (Palmer et al., 2008). Equivalent current dipoles were then estimated for each independent component (IC) using the DIPFIT plug-in, followed by bilateral dipole fitting using the fitTwoDipoles plug-in (Piazza et al., 2016). ICs were further classified using ICLabel (Pion-Tonachini et al., 2019). ICs were retained only when labeled as “brain” by ICLabel, exhibiting a dipole residual variance of less than 15% (Artoni et al., 2014), and localized within brain regions; all other ICs were removed. In addition, ICs with amplitudes exceeding 150 μV following ICA decomposition were rejected as artifactual components. This procedure resulted in an average of 12.63 retained ICs (SD = 3.37) in keyresTask and 13.60 ICs (SD = 3.26) in countTask. 
The EEG data were then segmented into epochs ranging from 200 ms before to 1,000 ms after the stimulus onset. All preprocessing steps prior to epoching were performed in MATLAB (R2019a, MathWorks, Natick, MA, United States) using EEGLAB (v2019.1) (Delorme and Makeig, 2004) and its associated plugins.

An overview of the post-epoch data processing pipeline is shown in Figure 1. For each channel, the mean amplitude was computed within the specified time windows. The resulting spatial distribution at each time point was correlated with the templates provided by Kashihara et al. (2025a), and the template with the highest correlation was assigned using a winner-take-all approach. The templates created by Kashihara et al. (2025a) were derived from the LEMON dataset (Babayan et al., 2019) using modified K-means clustering, resulting in five spatial maps. By including polarity-inverted versions of each, a total of 10 microstates (A± to E±) were obtained (Figure 1, center left). The template-matching results were then encoded as integers from zero to nine. Finally, a random subset of standard trials was selected to balance the dataset and equalize the number of trials between the standard and target stimuli.
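The winner-take-all labeling step can be sketched as follows. This is a minimal illustration rather than the authors' code, and the ordering of the integer codes (templates A–E first, then their polarity-inverted copies) is an assumption:

```python
import numpy as np

def label_microstates(data, templates):
    """Assign each time point to the template with the highest spatial
    correlation (winner-take-all), preserving polarity by matching
    against each template and its sign-inverted copy separately.

    data      : (n_channels, n_times) EEG segment
    templates : (n_templates, n_channels) microstate maps (e.g., A-E)
    Returns an integer label sequence of length n_times; codes
    0..n_templates-1 are the original maps, the remaining codes their
    polarity-inverted copies (ordering assumed here for illustration).
    """
    # Stack polarity-inverted versions after the originals.
    signed = np.vstack([templates, -templates])
    # Pearson correlation between each time point's map and each template.
    d = data - data.mean(axis=0, keepdims=True)
    t = signed - signed.mean(axis=1, keepdims=True)
    d /= np.linalg.norm(d, axis=0, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    corr = t @ d                      # (2 * n_templates, n_times)
    return corr.argmax(axis=0)
```

Because the inverted maps are matched explicitly, rather than taking the absolute correlation as in polarity-ignored labeling, the sign of the topography survives in the label sequence.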


Overview of the encoding and classification processes, where encoding denotes the transformation of EEG epochs (−200 to 1,000 ms relative to stimulus onset) into one-dimensional microstate label sequences, and classification denotes the binary discrimination of standard versus target stimuli in the oddball paradigm using these labeled sequences. The EEG topographies shown in the center represent a set of 10 microstate templates derived from the LEMON dataset (Babayan et al., 2019), obtained using modified K-means clustering with five clusters and their corresponding polarity-inverted maps. The resulting one-dimensional microstate label sequences (right) were used as inputs to machine learning models for stimulus classification. Model performance was evaluated using five-fold cross-validation, and F1 scores were computed from the predicted and true labels, then averaged across folds.

2.3 Label occurrence frequency

To investigate the temporal dynamics of the microstate labels obtained in Section 2.2, the occurrence frequency of each label was calculated within sliding time windows of 100 ms, advanced in 25 ms steps (i.e., 75% overlap). For each time window, the occurrence frequency was defined as the total count of each of the 10 template labels across all trials divided by the total number of trials.
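A sketch of this windowed count, assuming the counts are normalized by trials × window samples so that values lie in [0, 1] (the paper's exact normalization divides by the number of trials; the window-length factor here is an added assumption for scale):

```python
import numpy as np

def label_occurrence(labels, sfreq=1000, win_ms=100, step_ms=25, n_states=10):
    """Occurrence frequency of each microstate label in sliding windows.

    labels : (n_trials, n_times) integer label sequences (0..n_states-1)
    Returns an (n_windows, n_states) array; each row gives, for one
    window, the fraction of (trial, sample) points carrying each label.
    """
    win = int(win_ms * sfreq / 1000)      # 100 ms window in samples
    step = int(step_ms * sfreq / 1000)    # 25 ms step (75% overlap)
    n_trials, n_times = labels.shape
    starts = list(range(0, n_times - win + 1, step))
    freq = np.zeros((len(starts), n_states))
    for w, s in enumerate(starts):
        chunk = labels[:, s:s + win]      # all trials, one window
        counts = np.bincount(chunk.ravel(), minlength=n_states)
        freq[w] = counts / (n_trials * win)
    return freq
```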

2.4 Classification

2.4.1 Temporal generalization matrix

To evaluate how temporal patterns generalize across trials, a temporal generalization matrix was constructed based on the processed data using a time window size of 100 ms, advanced in 25 ms steps (i.e., 75% overlap), following the approach of Veillette et al. (2023). To classify standard versus target stimuli in this matrix, a model was trained using a sliding window of three consecutive time points (i.e., a target time point and its immediate neighbors), and predictions were generated by shifting the test time window across time points. Classification performance was assessed using the F1 score (Equation 1).
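The train-at-one-time, test-at-another scheme above can be sketched as a toy function; the actual study trained on three consecutive time points and grid-searched the SVM, which is omitted here for brevity:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

def temporal_generalization(X_train, y_train, X_test, y_test):
    """Train a classifier at each time point and evaluate it at every
    time point, yielding an (n_times, n_times) matrix of F1 scores.
    X arrays have shape (n_trials, n_times, n_features)."""
    n_times = X_train.shape[1]
    tgm = np.zeros((n_times, n_times))
    for t_tr in range(n_times):
        clf = SVC().fit(X_train[:, t_tr, :], y_train)
        for t_te in range(n_times):
            tgm[t_tr, t_te] = f1_score(y_test, clf.predict(X_test[:, t_te, :]))
    return tgm
```

High off-diagonal values in such a matrix indicate that a pattern learned at one latency generalizes to another, as reported here between the 0–250 ms and post-500 ms ranges.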

The F1 score (Equation 1) is the harmonic mean of precision and recall, F1 = 2TP / (2TP + FP + FN); it increases with the number of true positives (TP) and decreases as false positives (FP) or false negatives (FN) increase. TP refers to correctly classified target stimuli; FP, standard stimuli misclassified as targets; TN, correctly classified standard stimuli; FN, target stimuli misclassified as standard. The temporal generalization matrix aims to identify the most suitable time periods for classification. While Veillette et al. (2023) used logistic regression to classify EEG signals related to the sense of agency during self-executed and unexecuted motor tasks based on electromyogram information, this study employed an SVM, which also allows for nonlinear classification, to evaluate performance. The SVM was implemented using the Python-based scikit-learn package (version 1.5.2). Its hyperparameters were optimized through a grid search on the validation dataset, testing C values of 0.01, 0.1, 1, 10, and 100; kernel types of linear, rbf, and sigmoid; and gamma options of scale, auto, and 1.
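The stated hyperparameter grid maps directly onto scikit-learn's grid search. In this sketch, X and y are random placeholders standing in for windowed microstate-label features and stimulus codes, not experimental data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hyperparameter grid exactly as stated in the text.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "sigmoid"],
    "gamma": ["scale", "auto", 1],
}

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(40, 3)).astype(float)  # 3 consecutive windows
y = np.repeat([0, 1], 20)                            # standard / target codes

# F1-scored 5-fold grid search; refits the best model on all of X.
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
```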

2.4.2 Classification using six models

Based on the results in Section 2.4.1, the analysis was restricted to periods suitable for classification, and six different models were trained to predict whether each trial corresponded to a standard or target stimulus. The F1 score was used as an evaluation metric. For this analysis, the time window size was set to 10 ms, advanced in 1 ms steps (i.e., 90% overlap). Although this differs from the previous window size, a smaller window is expected to yield a higher temporal precision. The six classification models used were SVM, Random Forest, Logistic Regression, XGBoost, CatBoost, and K-means. These models were selected based on previous studies (Barrett et al., 1987; Salisbury et al., 2001; Starr et al., 1995), emphasizing those known for their high classification performance and computational efficiency. Model performance was evaluated using five-fold cross-validation at the trial level. For each fold, the training data were further split into training and validation sets using five-fold cross-validation to optimize the hyperparameters (Supplementary Table 1) using Python (version 3.12.7). The classifier was then retrained on the full training set using the best parameters, and predictions were made on the test set, which comprised 20% of the total data. The F1 scores of the five folds were averaged to obtain the final mean F1 score. The same analysis was performed for the polarity-ignored version.
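The nested cross-validation scheme described above can be sketched for one of the six models (Random Forest here); the grid contents and toy data are illustrative, not the study's actual values:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def nested_cv_f1(X, y, grid, seed=0):
    """Trial-level nested CV: outer 5-fold for evaluation, inner 5-fold
    grid search for tuning; the best model is refit on the full
    training split before predicting the held-out 20%."""
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, test in outer.split(X, y):
        inner = GridSearchCV(RandomForestClassifier(random_state=seed),
                             grid, scoring="f1", cv=5)
        inner.fit(X[train], y[train])   # refits best params on full train
        scores.append(f1_score(y[test], inner.predict(X[test])))
    return float(np.mean(scores))

# Toy data: 10 windowed label features per trial; the first feature
# determines the class, so a tree ensemble should recover it easily.
rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(100, 10)).astype(float)
y = (X[:, 0] > 4).astype(int)
grid = {"n_estimators": [50], "max_depth": [None, 3]}
mean_f1 = nested_cv_f1(X, y, grid)
```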

2.4.3 Evaluation

The performances of the classification models were evaluated using the area under the receiver operating characteristic curve (AUC). According to Hanley and McNeil (1982), AUC represents the probability that a randomly selected positive instance is ranked higher than a randomly selected negative instance, with 0.5 representing chance-level performance. To assess the effects of the three experimental factors described in Section 1, namely, the modality of target stimuli, the combination of standard and target stimuli, and the response type to target stimuli, a three-way mixed-design analysis of variance (ANOVA) was conducted. This design was appropriate because response type (keyresTask vs. countTask) constituted a between-subject factor, as separate participant groups performed each task, whereas the remaining factors were within-subject variables. This analysis used the average F1 score of the classification model that demonstrated the most stable and accurate AUC performance. To evaluate the influence of polarity on classification outcomes, we also calculated F1 scores using microstate templates that did not account for polarity. Furthermore, to examine whether the classification performance depended on the number of templates, we computed the F1 scores under three additional conditions: (1) using four templates, (2) using seven clusters derived in a data-driven manner (7_dat), and (3) using five clusters with two additional artificially created templates generated by averaging the centroids of microstates A and B and microstates D and E, respectively (7_art). The results were analyzed using ANOVA.
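The Hanley and McNeil (1982) interpretation of the AUC can be checked numerically on toy scores: the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative classifier scores for 3 standard (0) and 2 target (1) trials.
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.45, 0.35, 0.80, 0.40])

auc = roc_auc_score(y_true, y_score)

# Direct pairwise count: P(score of random positive > score of random negative).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = float(np.mean([p > n for p in pos for n in neg]))
```

Here five of the six positive–negative pairs are correctly ordered, so both quantities equal 5/6; an AUC of 0.5 corresponds to chance-level ordering.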

2.5 Time-resolved microstate label analysis

To examine temporal differences in microstate label distributions between standard and target stimuli, EEG data from −200 to 400 ms relative to stimulus onset were segmented using a 10 ms sliding window with 90% overlap, and microstate labeling was applied to each segment. The resulting labels were one-hot encoded and averaged across trials for each participant. Permutation-based t-tests with cluster correction were conducted at each time point across participants, with shuffling restricted to the temporal dimension. In addition, false discovery rate correction was applied to incorporate multiple comparisons across microstate labels. To assess whether such differences were linked to successful classification, the same procedure was repeated separately for correctly and incorrectly classified trials using a high-performing Random Forest model.
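The one-hot encoding and trial-averaging step can be sketched as follows; the permutation tests and corrections are omitted here, and this is a simplified illustration of the per-participant averaging only:

```python
import numpy as np

def onehot_trial_average(labels, n_states=10):
    """One-hot encode integer microstate labels and average across
    trials, giving per-time-point label proportions for one participant.

    labels : (n_trials, n_times) integer array with values 0..n_states-1
    Returns an (n_times, n_states) array whose rows sum to 1.
    """
    onehot = np.eye(n_states)[labels]   # (n_trials, n_times, n_states)
    return onehot.mean(axis=0)
```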

2.6 Topographic reconstruction via microstate weights

To evaluate potential information loss when reducing EEG signals to microstate label representations, we reconstructed the temporal evolution of EEG topographies based on the averaged microstate label distributions obtained in Section 2.5. At each time point, the one-hot-encoded microstate label averages were multiplied by the corresponding microstate template maps. The weighted linear sum across all templates was computed to reconstruct the EEG topography at each time point. To quantify the fidelity of this reconstruction, we calculated the Global Dissimilarity (DISS; Equation 2) between the reconstructed topographies and grand-averaged EEG signals (i.e., across all trials and participants), following the approach described by Murray et al. (2008).

In Equation 2, DISS is computed as DISS = sqrt((1/N) * sum_i (u_i / GFP_u − v_i / GFP_v)^2), where u represents the reconstructed EEG signals, v represents the grand-averaged EEG signals, i denotes the channel index, N is the total number of channels, and GFP_u and GFP_v denote the global field power of each map. DISS is mathematically equivalent to the Global Map Dissimilarity (Lehmann and Skrandies, 1980), differing only by a scaling factor, and is defined as the Euclidean distance between normalized voltage maps. Importantly, this measure is equivalent to a rescaled spatial correlation between maps. Thus, DISS provides a scale-independent measure of topographic dissimilarity, ranging from zero (identical maps) to two (polarity-inverted maps), and reflects differences in the spatial configuration of scalp potentials rather than amplitude.
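The DISS computation reduces to a few lines; this sketch follows the Murray et al. (2008) definition, with mean-centering standing in for average referencing:

```python
import numpy as np

def diss(u, v):
    """Global dissimilarity between two scalp maps: the Euclidean
    distance between average-referenced, GFP-normalized maps.
    Ranges from 0 (identical topography) to 2 (polarity-inverted)."""
    u = u - u.mean()                   # average reference
    v = v - v.mean()
    u = u / np.sqrt(np.mean(u ** 2))   # divide by global field power
    v = v / np.sqrt(np.mean(v ** 2))
    return float(np.sqrt(np.mean((u - v) ** 2)))
```

Note that scaling a map by any positive factor leaves DISS unchanged, which is why it isolates topographic shape from amplitude.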

3 Results

3.1 Label occurrence frequency

Figure 2 shows the average label occurrence rates for EEG microstate labels across participants for the uniV condition. As shown in Figure 2A, during the keyresTask, target visual stimuli elicited higher occurrence rates of C+ and E+ than standard stimuli within the first 200 ms following stimulus onset. This was followed by a transition to D+, C−, and E− (300–400 ms) and a return to C+ and E+ after 500 ms. A similar pattern was observed for the countTask (Figure 2B).


EEG microstate occurrence rates in the uniV condition for the keyresTask and countTask. (A) Results for the key-response task. From top to bottom: the occurrence rate during standard (frequent) stimuli, the occurrence rate during target (infrequent) stimuli, and the difference between target and standard stimuli. The horizontal axis represents time (ms), and the vertical axis indicates template labels. Color bars represent occurrence rates. For standard and target stimuli (top and middle panels), a sequential colormap was used, with higher values shown in warmer colors. For the difference map (bottom panel), a diverging colormap was applied, where red indicates higher occurrence during target stimuli relative to standard stimuli and blue indicates the opposite. (B) Results for the count task presented in the same format as in (A).

3.2 Result of temporal generalization matrix

The results of the temporal generalization matrix are shown in Figure 3. As shown in Figures 3A,C, F1 scores were low during the pre-stimulus period (−200 to 0 ms) and showed a clear diagonal pattern, with higher scores when the training and testing time points corresponded. Notably, training on data from the 0 to 250 ms range increased the F1 scores when testing on data after 500 ms. Conversely, training on the data after 500 ms improved classification for testing in the 0–250 ms range. Figures 3B,D show that regardless of response type, the croV condition yielded the highest F1 scores (keyresTask: F1 score at 275 ms = 0.745 ± 0.025; countTask: F1 score at 300 ms = 0.680 ± 0.020). By contrast, the uniA condition produced the lowest scores (keyresTask: F1 score at 175 ms = 0.615 ± 0.020; countTask: F1 score at 325 ms = 0.564 ± 0.013). Across all experiments, the average F1 scores across participants peaked in the 0–400 ms range. Therefore, the 0–400 ms time window was used in subsequent classification analyses.


Temporal generalization matrices and time-resolved F1 scores. (A,C) Present the temporal generalization matrices averaged across all participants (n = 20) for the keyresTask and countTask, respectively (horizontal axis: testing time; vertical axis: training time). (B,D) Show the mean F1 scores and 95% confidence intervals across participants along the diagonal of the matrices shown in (A,C), respectively.

3.3 Results of classification using six models

3.3.1 Effects of stimulus modality, modality conditions, and response type

Following the procedure described in Section 2.4.2, the six classification models were applied to the 0–400 ms time window, which showed the highest F1 scores (Section 3.2). The results are shown in Figure 4. All models, except K-means, achieved classification performance above the chance level regardless of the response type, with the highest performance observed in the croV condition. However, the F1 scores for the SVM and Logistic Regression models were close to the chance level for the uniA condition in the countTask.


F1 scores for each machine learning model across task types and stimulus conditions. (A) Results from keyresTask. (B) Results from countTask. Color represents stimulus condition (blue: uniA, orange: uniV, green: croA, red: croV). Error bar indicates 95% confidence intervals across participants. The red dashed line represents the chance level (0.5). Across models, classification performance varied depending on task type and stimulus condition, with croV in keyresTask yielding the highest F1 scores. By contrast, K-means clustering remained approximately at the chance level across all conditions.

Similar trends were observed across the five classification models, excluding K-means, for both task types. A correspondence matrix was constructed to visualize these relationships (Supplementary Figure 1). Additionally, the AUC was calculated as a performance metric for each classification model. The AUC distributions for the highest-performing (croV) and lowest-performing (uniA) conditions are shown in Figure 5, where Random Forest, XGBoost, and CatBoost demonstrated the most stable and accurate classification performance. Notably, in the keyresTask under the croV condition, the average AUC exceeded 0.8, even when SVM and Logistic Regression were included, indicating highly reliable classification results.


AUC bar plot for each machine learning model. (A) Results from the conditions with the lowest F1 scores (uniA) in both keyresTask and countTask. (B) Results from the conditions with the highest F1 scores (croV). The vertical axis represents the AUC value, providing an additional evaluation of classification performance across models. Among the classifiers, the tree-based models—Random Forest, XGBoost, and CatBoost—achieved the most stable and reliable performance.

Guided by the AUC results, the three decision-tree-based models (Random Forest, XGBoost, and CatBoost), which showed the most stable and accurate performance, were selected, and a three-way mixed-design ANOVA was conducted on their average scores. The analysis revealed a significant second-order interaction effect [F(1, 38) = 4.88, p < 0.05, ηp² = 0.11]. With respect to the modality of the target stimuli, visual stimuli yielded significantly higher F1 scores than auditory stimuli across all conditions except the cross-modal condition in countTask (Figure 6A). For stimulus modality combinations, cross-modal conditions resulted in significantly higher F1 scores than unimodal conditions, except when the target stimulus was auditory in keyresTask (Figure 6B). With respect to response type, keyresTask outperformed countTask under the uniA, uniV, and croV conditions, indicating that the effect of response type depended on the stimulus combination (Figure 6C). These results suggest that the highest classification performance was achieved when frequent auditory stimuli were paired with infrequent visual targets, and that keyresTask enabled more accurate classification than countTask.
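The effect sizes reported alongside each F statistic are partial eta squared values, which can be recovered directly from F and its degrees of freedom. A small sketch of that standard relation:

```python
def partial_eta_squared(F, df_effect, df_error):
    """Partial eta squared from an F statistic:
    eta_p^2 = (F * df_effect) / (F * df_effect + df_error)."""
    return (F * df_effect) / (F * df_effect + df_error)

# Second-order interaction reported above: F(1, 38) = 4.88
print(round(partial_eta_squared(4.88, 1, 38), 2))    # -> 0.11

# Polarity main effect in keyresTask (Section 3.3.2): F(1, 19) = 190.28
print(round(partial_eta_squared(190.28, 1, 19), 2))  # -> 0.91
```

Both values reproduce the effect sizes reported in the text, confirming the conventional ηp² formula was used.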


F1 scores for the three experimental factors in the oddball task. (A) Target stimulus modality (auditory vs. visual). (B) Modality combination (unimodal vs. cross-modal). (C) Response type (key-response vs. count). Error bars represent 95% confidence intervals across participants. Significant effects were primarily observed for visual stimuli in the target-modality factor, for cross-modal compared with unimodal conditions in the keyresTask with visual targets, and for the keyresTask compared with the countTask in conditions with auditory infrequent stimuli.

3.3.2 Effects of polarity and number of templates

Figure 7 presents a matrix of participant-averaged F1 scores across conditions with and without polarity and across different numbers of templates. All five models, excluding K-means, yielded higher classification scores when polarity-considered templates were used.
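The polarity-preserving winner-take-all labeling that yields the signed template classes (A+ vs. A−, etc.) can be sketched in NumPy. This is a generic illustration under assumed preprocessing (average-referenced, unit-norm maps), not the study's exact pipeline:

```python
import numpy as np

def label_microstates(eeg, templates):
    """Winner-take-all labeling of each EEG sample.
    eeg: (n_samples, n_channels); templates: (n_templates, n_channels).
    Returns (idx, sign): idx is the template with the largest absolute
    spatial correlation; sign (+1/-1) preserves topographic polarity,
    which polarity-ignored labeling would discard."""
    def norm(x):
        # Average-reference and scale each map to unit norm so the
        # dot product equals the spatial correlation
        x = x - x.mean(axis=1, keepdims=True)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    corr = norm(eeg) @ norm(templates).T            # (n_samples, n_templates)
    idx = np.abs(corr).argmax(axis=1)               # best template, polarity ignored
    sign = np.sign(corr[np.arange(len(idx)), idx])  # polarity of the winning match
    return idx, sign.astype(int)
```

With five base templates (A–E), keeping the sign doubles the label space to ten classes, which is the granularity increase referred to in the analysis above.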


Effects of the number of templates and polarity consideration on classification performance. (A) Participant-averaged F1 scores for the keyresTask. (B) Participant-averaged F1 scores for the countTask. Rows represent polarity conditions, and columns indicate the number of templates. The figure illustrates how polarity-considered labeling (polar) generally improved classification performance compared with polarity-ignored labeling (non), whereas varying the number of templates had only a limited effect.

An ANOVA was conducted using the average F1 scores from the five classification models (excluding K-means) to examine the effects of polarity and the number of template clusters, limited to cluster sizes of four and five (Figure 8). A significant interaction between cluster number and polarity was observed in keyresTask [F(1, 19) = 4.78, p < 0.05, ηp² = 0.20], but not in countTask [F(1, 19) = 2.74, p = 0.11, ηp² = 0.13]. Adding polarity significantly improved classification accuracy in both tasks [keyresTask: 19.8% improvement, F(1, 19) = 190.28, p < 0.001, ηp² = 0.91; countTask: 19.5% improvement, F(1, 19) = 290.26, p < 0.001, ηp² = 0.94]. Regarding the number of clusters, performance was significantly higher with five clusters than with four [keyresTask: F(1, 19) = 11.09, p < 0.01, ηp² = 0.37; countTask: F(1, 19) = 4.59, p < 0.05, ηp² = 0.20].


Effects of polarity and number of templates on F1 scores for the keyresTask (A) and countTask (B). Error bars represent 95% confidence intervals across participants. In the keyresTask, polarity-considered labeling (polar) performed significantly better than polarity-ignored labeling (non-polar), and using five templates yielded significantly higher F1 scores than using four. In the countTask, only the main effect of polarity reached statistical significance.

3.4 Temporal differences in microstate label distributions

To identify temporal regions where microstate label distributions differed between standard and target stimuli, time-resolved statistical comparisons were performed following the procedure described in Section 2.5 (Figure 9). In the croV condition of the keyresTask, microstates A+, B+, and C+ showed significantly higher average label values for standard stimuli in the 200–350 ms range. Additionally, microstate D+ exhibited a significant increase for standard stimuli at approximately 250 ms, and E+ showed a similar pattern near 300 ms. By contrast, microstate E+ showed significantly higher values for target stimuli at approximately 200 ms, and C− (250–300 ms), D− (approximately 250 ms), and E− (300–350 ms) also favored the target condition. In countTask, no significant differences were observed for microstates A+, C−, or D−, in contrast to keyresTask. Only microstate B+ showed a significant difference near 350 ms, with higher values for standard stimuli.
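A generic pointwise version of such a time-resolved comparison could look like the following. The study's actual test and multiple-comparison handling are specified in its Section 2.5, so this paired t-test sketch is only an assumed stand-in:

```python
import numpy as np
from scipy.stats import ttest_rel

def timepoint_differences(std_vals, tgt_vals, alpha=0.05):
    """Pointwise standard-vs-target comparison for one microstate class.
    std_vals, tgt_vals: (n_participants, n_timepoints) arrays of
    per-participant average label values. Returns a boolean mask of
    timepoints where a paired t-test is significant.
    NOTE: the study's exact test and correction (Section 2.5) may differ."""
    _, p = ttest_rel(std_vals, tgt_vals, axis=0)
    return p < alpha
```

Applying such a mask per signed microstate class (A+ through E−) yields the latency windows reported above, e.g. the 200–350 ms differences that overlap the N200 and P300 components.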
