APCformer: an aggregation-perception enhanced convolutional transformer network for MI-EEG decoding

Abstract

Electroencephalogram (EEG) decoding is essential for brain-computer interface (BCI) systems to predict brain activity. However, existing methods usually suffer from two core problems: (1) existing networks lack effective interaction mechanisms and insufficiently capture spatial-temporal dynamic features, leading to the loss of critical fine-grained information; (2) the modeling of long-range dependencies and local features is unbalanced, making it difficult to adapt to the temporal characteristics of EEG signals. To address these issues, this paper proposes an Aggregation-Perception Enhanced Convolutional Transformer (APCformer) network. The network adopts a branch-interactive structure as its main body and jointly extracts shallow features via multi-scale spatial-temporal convolution; an Adaptive Feature Recalibration (AFR) module is embedded to realize cross-scale feature interaction and enhance critical fine-grained features. A Position-aware Enhancement (PAE) module integrates learnable positional encoding, improving the ability of deep networks to characterize the temporal positional relationships of EEG sequences and enhancing adaptability to temporal dynamic features. We further propose a Sparse Information Aggregation Transformer (SAT), which combines Aggregation-Attention with Top-Attention to balance the modeling of global long-term dependencies and local fine-grained features. Experimental results on the public BCI-IV 2a and BCI-IV 2b datasets show that APCformer achieves superior performance in EEG decoding tasks, with average decoding accuracies of 85.53% and 89.15%, respectively. These results highlight APCformer's strong capability in handling complex EEG features and dynamic patterns, effectively improving the efficiency and accuracy of EEG decoding.

1 Introduction

Brain–computer interfaces (BCIs), a cutting-edge technology emerging from the intersection of neuroscience, biomedicine, and engineering, enable direct communication between human thoughts and external devices. Their applications are gradually extending into everyday life (Willett et al., 2021; Mansour et al., 2025). With the rapid development of BCI, motor imagery-based electroencephalography (MI-EEG) has been widely utilized in BCI research (Hsu and Cheng, 2023). By decoding the brain's electrical activity, MI-EEG allows individuals to operate various devices using only their thoughts, offering vast possibilities for future lifestyles (Liu et al., 2025). However, EEG signals are inherently weak physiological electrical signals characterized by nonlinearity, non-stationarity, and high dimensionality. They are also highly susceptible to external interference, which makes accurate decoding highly challenging (Guo et al., 2025; Ke et al., 2024).

Accurate EEG signal decoding is a critical foundation for the stable operation of BCI systems. At the application level, BCI technology is closely tied to EEG decoding, with the realization of the former largely depending on the effectiveness of the latter. In EEG decoding, effective feature selection plays a pivotal role. Traditional feature extraction methods primarily focus on the time, frequency, and spatial domains, yet these approaches typically capture only a single, specific aspect of EEG information (Guan et al., 2025). As a result, recent research has explored multi-domain feature fusion methods that combine various types of features to obtain richer EEG representations (Zhang et al., 2021; Chen et al., 2025). Although such approaches can extract discriminative multi-domain features, they often rely on prior knowledge during the extraction process, which may lead to the loss of a substantial amount of valuable information (Geng et al., 2022).

Continuous breakthroughs in EEG decoding technology play a decisive role in advancing efficient human–computer interaction. Early EEG decoding methods were primarily based on traditional machine learning algorithms (Zhao et al., 2019). For example, Richhariya and Tanveer (2018) achieved epileptic signal recognition by improving the support vector machine (SVM), and Wei et al. (2021) performed EEG classification using the filter bank common spatial pattern (FBCSP) method. With the enhancement of computational power, deep learning methods have gradually replaced traditional machine learning approaches and become mainstream, among which convolutional neural networks (CNN) are the most widely used for EEG decoding. Schirrmeister et al. (2017) explored end-to-end EEG decoding using a deep convolutional neural network (DeepConvNet), achieving performance comparable to that of FBCSP. Lawhern et al. (2018) developed EEGNet, a compact network specifically designed for EEG analysis based on depthwise separable convolution, which achieved high performance across various benchmark tasks. Salami et al. (2022) proposed an interpretable network (ITNet) derived from CNN that integrates Inception modules and dilated causal convolutions, leading to improved performance on public datasets. However, these CNN-based models tend to overemphasize local features when dealing with the complex and dynamic characteristics of EEG signals. As a result, they struggle to capture long-range dependencies within EEG data, which limits further improvements in EEG decoding accuracy.

In contrast, the Transformer architecture, by virtue of the powerful sequence modeling and long-range dependency capture capabilities of the self-attention mechanism, has provided novel approaches for the global temporal feature modeling of EEG signals, making Transformer-based decoding methods a research focus in the MI-EEG field. To balance the advantages of CNNs in local feature extraction and the global modeling capabilities of Transformers, many studies have combined different model structures with Transformers, aiming to improve overall performance through model complementarity. For example, Song et al. (2022) proposed the EEG Conformer model, which extracts local spatio-temporal features using convolution while employing a self-attention mechanism to capture global temporal dependencies, achieving accurate EEG signal decoding. Gong et al. (2023) designed the Attention-based Convolutional Transformer Neural Network (ACTNN), which demonstrated excellent performance and cross-subject robustness on the SEED emotion dataset. Zhao W. et al. (2024) employed a convolutional module similar to EEGNet to extract local features and used a multi-head attention mechanism to model global dependencies, with their proposed CTNet showing outstanding performance on corresponding datasets. However, these models may overlook fine-grained local features when extracting EEG representations.

In recent years, a number of cutting-edge studies have further expanded the technical boundaries of the Transformer architecture in the field of MI-EEG decoding. Liu et al. (2024) proposed MSVTNet, which combines multi-scale CNNs and Transformers to extract local spatial-temporal features and cross-scale global coupled features, respectively, and uses an auxiliary branch loss to mitigate parameter imbalance, improving both decoding performance and robustness. Han et al. (2025) proposed a spatial-spectral and temporal dual-prototype learning framework (SST-DPN), which realizes efficient feature modeling through spatial-spectral fusion and multi-scale variance pooling. It introduces dual-prototype learning for the first time to enhance the intra-class aggregation and inter-class discriminability of features, which can alleviate small-sample overfitting, and it offers high accuracy and computational efficiency without requiring preprocessing. Zhao et al. (2025) proposed CNNViT-MILF-a, a dual-branch collaborative architecture that uses CNNs to extract local spatial-temporal features, models global temporal dependencies through a Vision Transformer (ViT), and adopts a ViT-dominated late fusion strategy to integrate local and global features, realizing the synergistic enhancement of CNNs and ViT.

In addition, extensive research has focused on multi-scale feature fusion, collaborative optimization of CNN and Transformer, feature representation enhancement, and related directions, producing a series of improved CNN-Transformer architectures for MI-EEG decoding that provide an important reference for this paper. Tao et al. (2023) proposed the ADFCNN model, which combines dual-scale fusion with an attention mechanism to overcome the limitations of single-scale feature extraction and improve model performance. Yang and Liu (2024) proposed MSFCNNet, which enhances feature extraction and the capture of spectral and spatial features through a multi-head self-attention mechanism and multi-scale inputs. Zhao et al. proposed MSCFormer (Zhao et al., 2025b) and TCANet (Zhao et al., 2025a), both of which are advanced models for MI-EEG classification. MSCFormer extracts local spatiotemporal features through a multi-branch multi-scale convolutional network and further improves decoding performance by using a Transformer to model global dependencies. TCANet gradually integrates local and global features through multi-scale convolution modules, time-domain convolution modules, and stacked multi-head self-attention, and has been systematically evaluated on public datasets with good results.

Although the aforementioned works have optimized CNN and Transformer architectures from various aspects and made progress in MI-EEG decoding, further improvements are still needed. First, in terms of multi-scale feature modeling, most methods simply concatenate multi-scale features after parallel extraction, which leaves multi-scale feature extraction insufficient and provides no effective cross-scale interaction or adaptive screening mechanism. For example, MSVTNet, MSCFormer, TCANet and other methods do not include a dedicated cross-branch interaction mode and fuse information only through feature concatenation. This prevents information sharing among features at different scales and the enhancement of critical fine-grained features, and easily leads to the loss of effective information with low discriminability. Second, because EEG signals have strong temporal dependence, most decoding models ignore fine-grained information along the spatial and temporal dimensions while learning deep features, so the temporal integrity of deep fine-grained features cannot be ensured. For instance, ADFCNN, MSFCNNet and other methods adopt only the standard Transformer encoder structure, either superimposing fixed positional encoding on the input layer or using no positional information at all; in deep networks, positional information gradually degrades with convolution operations, which affects the model's temporal modeling of EEG signals. Third, most methods cannot focus on local high-value features while modeling long-range dependencies, resulting in an imbalance between global long-range dependency modeling and local fine-grained feature modeling. For example, CNNViT-MILF-a, MSCFormer, and TCANet all rely on the global self-attention mechanism of Transformers, which has high computational complexity; while it improves global feature modeling, it weakens the capture of local fine-grained features, making it difficult to optimize decoding accuracy and computational efficiency at the same time.

To address the limitations discussed above, we propose a multi-scale interactive APCformer network for EEG signal decoding. The research focuses on optimizing and improving the CNN-Transformer hybrid architecture, enhancing the network's ability to capture local and global fine-grained features and their cross-scale correlations. The main contributions of this work are as follows:

To address the issues of the absence of multi-scale feature interaction and the insufficient capture of spatiotemporal dynamic fine-grained features, a multi-scale information interaction sharing network is designed. It breaks through the limitation of fixed receptive fields in convolution and incorporates an AFR module. Through cross-branch interaction, it enhances the model's perception and learning ability of key fine-grained features, achieving the complementation and selection of multi-scale features.

To address the issue of poor temporal dynamic adaptability of EEG signals in deep networks, a PAE module was constructed to focus on the identification of deep fine-grained features, while integrating learnable feature position encoding information to deeply model the temporal correlations of EEG sequences and enhance the model's adaptability to the non-stationary temporal features of EEG.

To address the imbalance between global long-term dependency modeling and local detail modeling, the SAT module is proposed. The EEG sequence is divided into blocks using a sliding window. Through sparse filtering and core representation aggregation, Aggregation-Attention is coupled with Top-Attention to balance the model's decoding accuracy and efficiency.

We conducted experiments on the BCI-IV 2a and BCI-IV 2b public datasets. Compared with the mainstream methods in the field, APCformer demonstrated superior learning ability and performance.

The rest of this article is organized as follows. Section 2 provides a detailed description of the network architecture. Section 3 introduces the experimental design. Section 4 focuses on comparative experiments and analyses. Section 5 discusses the results and future work. Finally, Section 6 summarizes this article.

2 Methods

2.1 Overview of APCformer

In order to effectively decode the spatiotemporal dynamic information of EEG signals, we propose APCformer, a novel EEG decoding network, whose overall architecture is illustrated in Figure 1. The network comprises five main components: Spatio-temporal convolution module (STConv), AFR, PAE, SAT, and classification module. Batch EEG data are first fed into the STConv to extract multi-scale shallow local features. The AFR module then highlights key spatiotemporal features, which are subsequently passed into the PAE module to extract deep fine-grained features and perform adaptive feature encoding. These enhanced features are further refined by the SAT module to extract long-range dependencies and local associations, achieving effective fusion of local and global representations. Finally, the classification module outputs the decoding result. The remainder of this section provides a detailed explanation of the APCformer architecture and its components. Specific parameters are listed in Table 1.

Figure 1. Overall architecture of the APCformer network, which consists of five main components: the STConv module, the AFR module, the PAE module, the SAT module, and the classification module. CBAM denotes the Convolutional Block Attention Module.

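| Module | Layer | Size | Output |
|---|---|---|---|
| Input | - | B = 32 | (1, C, T) |
| STConv | Branch-1: Temporal Conv | (F1, 1, Fs/2) | (F1, C, T) |
| | Branch-2: Temporal Conv | (F1, 1, Fs/4) | (F1, C, T) |
| | Branch-3: Temporal Conv | (F1, 1, Fs/8) | (F1, C, T) |
| | General: Spatial Conv | (F2, C, 1) | (F2, 1, T) |
| | General: BatchNorm | axis = –1 | - |
| | General: Activation | ELU | - |
| | General: AvgPooling | (1, 75) | (F2, 1, T/75) |
| | General: Dropout | 0.3 | - |
| AFR | Add | - | (F2, 1, T/75) |
| | CBAM | - | (F2, 1, T/75) |
| Fusion | Concatenation | - | (F2, 1, T/25) |
| PAE | Enhance Conv | (F2, 1, 3) | (F2, 1, T/25) |
| | Enhance Conv | (F2, 1, 7) | (F2, 1, T/25) |
| | BatchNorm | axis = –1 | - |
| | Add | - | (F2, 1, T/25) |
| | Activation | ELU | - |
| | AvgPooling | (1, 3) | (F2, 1, T/75) |
| | Dropout | 0.3 | - |
| | Rearrange | - | (T/75, F2) |
| | PE | - | (T/75, F2) |
| SAT | Attention | - | (T/75, F2) |
| Classifier | Flatten | - | (F2×T/75) |
| | Dense | Softmax | Class |

Table 1. Parameters of the APCformer.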

Here, C is the number of channels, T the number of sampling points, Fs the sampling frequency, B the batch size, and Class the number of classes; F1 and F2 are the numbers of filters in the corresponding convolutional layers, with F1 = 16 and F2 = 32. The specific parameters of SAT can be found in Section 4.4 (Model parameter sensitivity analysis).
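The output sizes in Table 1 can be traced with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions that the table does not state: fractional sizes such as Fs/4 and T/75 are floored to integers, the temporal convolutions use "same" padding so that T is preserved, and the three branches are concatenated along the time axis. The function name `apcformer_shapes` is hypothetical, and Fs = 250 Hz is assumed for the example call.

```python
def apcformer_shapes(C, T, Fs, F1=16, F2=32, n_branches=3):
    """Trace the feature-map shapes listed in Table 1 (batch axis omitted)."""
    shapes = [("Input", (1, C, T))]
    # Temporal kernels Fs/2, Fs/4, Fs/8, floored to integers (assumption).
    kernels = [Fs // 2, Fs // 4, Fs // 8]
    for b, k in enumerate(kernels, start=1):
        # "Same" padding is assumed, so the temporal length T is preserved.
        shapes.append((f"Branch-{b} temporal conv (1, {k})", (F1, C, T)))
    shapes.append(("Spatial conv (C, 1)", (F2, 1, T)))
    pooled = T // 75                     # AvgPooling (1, 75), floor division
    shapes.append(("AvgPooling (1, 75)", (F2, 1, pooled)))
    fused = n_branches * pooled          # branches concatenated along time
    shapes.append(("Concatenation", (F2, 1, fused)))
    deep = fused // 3                    # PAE AvgPooling (1, 3)
    shapes.append(("PAE AvgPooling (1, 3)", (F2, 1, deep)))
    shapes.append(("Rearrange + PE", (deep, F2)))
    shapes.append(("Flatten", (F2 * deep,)))
    return shapes


# Example: 22 channels and 1,000 samples per trial, assuming Fs = 250 Hz.
for name, shape in apcformer_shapes(C=22, T=1000, Fs=250):
    print(f"{name:36s} {shape}")
```

Under these assumptions, a 1,000-sample trial yields a sequence of 13 tokens with 32 features each at the SAT input.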

2.2 Spatio-temporal convolution module

To overcome the limitations of a single convolutional receptive field and enable multi-scale feature processing, the shallow local features of EEG are extracted through a multi-scale STConv, which enhances feature recognition without increasing network depth. This first core component of the network is inspired by Schirrmeister et al. (2017) and Lawhern et al. (2018) and comprises three branches. Each branch consists, in order, of a temporal convolutional layer, a spatial convolutional layer, a batch normalization layer, an ELU activation layer, an average pooling layer, and a Dropout layer. The branches share the same composition, each using a two-layer convolution as the shallow feature encoder, but differ in their temporal convolutions: following Zhao W. et al. (2024), each branch uses 16 large-kernel filters at a different scale, namely (1, Fs/2), (1, Fs/4) and (1, Fs/8), so that differentiated receptive fields extract rich temporal features. The spatial convolution uses 32 kernels of size (C, 1), matching the number of EEG channels. After convolution, batch normalization is applied to alleviate the covariate shift problem, and ELU enhances the nonlinear expressive ability of the model. Subsequently, the average pooling layer reduces redundant features, and Dropout reduces the risk of overfitting. The module finally outputs more representative shallow spatio-temporal features.

2.3 Adaptive feature recalibration

Cross-scale correlation can effectively enhance the model's ability to perceive multi-scale features and improve its performance in recognizing complex patterns and generalizing. In multi-branch network structures, the branches differ in feature scale and temporal locality, yet their features are internally correlated. Fully exploiting this correlation greatly benefits subsequent feature extraction (Zhao P. et al., 2024; Cai et al., 2025). The AFR module is designed for this purpose and shares information among branches through interactive feature stacking. Its interaction structure reflects the inherent characteristics of EEG signals: each branch independently encodes a distinct temporal scale and temporal locality and therefore contains irreplaceable fine-grained discriminative features. If the branches were merged directly, the unique feature information of each branch would be blurred and diluted, and discriminative information in the weak EEG signal would be lost at the source (Cai et al., 2025). The interaction structure of the AFR module instead retains the features of each branch and adds the features of the other branches, allowing each branch to obtain complementary information from the others while fully preserving its own feature representation. This avoids the loss of key information and strengthens cross-scale association among branches.

Inspired by Tang et al. (2023), the AFR module builds on this interaction mechanism and introduces the Convolutional Block Attention Module (CBAM) (Woo et al., 2018) to recalibrate the channel and spatial features, enhancing the model's sensitivity to important information. Since EEG signals contain both rhythmic feature differences across channels and local spatial distribution patterns along the time dimension, CBAM acts on the interaction features that retain branch specificity and is well suited to the characteristics of EEG signals. Through the joint calibration of channel and spatial attention, CBAM enhances the weights of key channels within each branch and further focuses on the local time periods with significant discriminative properties in the time series. The structure is illustrated in Figure 2. In CBAM, the channel attention processes the input features through global max pooling and global average pooling and feeds the results into a multilayer perceptron (MLP). The two outputs are then added and passed through the Sigmoid activation function to generate channel weights. The spatial attention also processes the input features through global max pooling and global average pooling, and then concatenates the two results along the channel dimension. It then generates spatial weights through convolution and the Sigmoid activation function. These weights are applied successively to the channel features and spatial features. Finally, the features of each branch are fused and passed to the PAE for deep refinement.
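The channel-then-spatial recalibration described above can be sketched in NumPy for a single branch. This is a minimal illustration rather than the implementation used in APCformer: it follows the standard CBAM layout of Woo et al. (2018), in which the spatial attention pools across channels; the weights are random placeholders; the reduction ratio `r` is hypothetical; and because the feature map has height 1, a zero-padded 3 × 3 convolution is reduced here to a length-3 convolution along time.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cbam(x, w1, w2, conv_k):
    """Recalibrate one branch. x: (F, L) features; returns an (F, L) array."""
    # Channel attention: max/avg pooling over time, shared MLP, add, sigmoid.
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)          # ReLU hidden layer

    ch = sigmoid(mlp(x.max(axis=1)) + mlp(x.mean(axis=1)))   # (F,)
    x = x * ch[:, None]

    # Spatial attention: max/avg pooling over channels, stack, convolve, sigmoid.
    pooled = np.stack([x.max(axis=0), x.mean(axis=0)])       # (2, L)
    padded = np.pad(pooled, ((0, 0), (1, 1)))                # zero padding in time
    length = x.shape[1]
    sp = np.array([np.sum(padded[:, t:t + 3] * conv_k) for t in range(length)])
    return x * sigmoid(sp)[None, :]


rng = np.random.default_rng(0)
F, L, r = 32, 13, 4                      # r: hypothetical reduction ratio
x = rng.standard_normal((F, L))
w1 = rng.standard_normal((F // r, F)) * 0.1
w2 = rng.standard_normal((F, F // r)) * 0.1
conv_k = rng.standard_normal((2, 3)) * 0.1
y = cbam(x, w1, w2, conv_k)
print(y.shape)
```

Both attention maps lie in (0, 1), so the module can only rescale features; it never changes their sign or shape.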

Figure 2. CBAM structure, mainly composed of channel attention and spatial attention. Conv denotes a 3 × 3 convolution.

2.4 Position aware enhancement

In previous decoding models, fine-grained information along the spatial and temporal dimensions was often lost. PAE adopts a parallel small-kernel convolution structure to capture deep, fine-grained local features (Altaheri et al., 2022), as shown in Figure 3. Specifically, PAE further extracts features along the temporal dimension using 32 convolutional kernels of sizes (1, 3) and (1, 7). After batch normalization and feature fusion, the features are activated by the ELU function. An average pooling layer with kernel size (1, 3) and a dropout layer are then applied to optimize the model and address the EEG adaptability issues that arise as the network deepens. Subsequently, the feature dimensions are rearranged and a learnable positional encoder is introduced: a trainable matrix with the same dimensionality as the input features is randomly initialized from a standard normal distribution. During training, the parameters of the positional encoder take part in backpropagation and are updated continuously with the rest of the network, so that the encoder learns positional correlations on its own. In the feature fusion stage, the position vectors generated by the positional encoder are added element-wise to the rearranged features, enabling the network to perceive the position of each feature (Dosovitskiy et al., 2020; Xie et al., 2022). Finally, the features carrying positional encoding information are passed to the SAT to explore complex dependency relationships, further enhancing APCformer's ability to recognize EEG signals.
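The last step of PAE, rearranging the feature map into a token sequence and adding a learnable positional encoding, can be illustrated with a short NumPy sketch. The sizes are assumed from Table 1 with T = 1,000, the gradient is a random stand-in for a value that backpropagation would supply, and the learning rate is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
F2, L = 32, 13                               # filters and sequence length (T/75)

features = rng.standard_normal((F2, 1, L))   # PAE feature map before rearranging
tokens = features.squeeze(1).T               # rearrange (F2, 1, L) -> (L, F2)

pos = rng.standard_normal((L, F2))           # trainable matrix, N(0, 1) initialised
encoded = tokens + pos                       # element-wise fusion

# The fusion is additive, so d(loss)/d(pos) equals d(loss)/d(encoded):
# the encoding receives a gradient whenever the fused tokens do.
grad_encoded = rng.standard_normal((L, F2))  # stand-in for a backpropagated gradient
lr = 0.001                                   # hypothetical learning rate
pos_updated = pos - lr * grad_encoded

print(encoded.shape)
```

Unlike a fixed sinusoidal table, this matrix carries no prior notion of order; any positional structure it ends up encoding comes from training.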

Figure 3. Structure of the PAE module, in which features undergo parallel enhancement convolutions followed by a dimensional transformation, after which positional encoding information is added to the sequence.

2.5 Sparse information aggregation transformer

Global modeling approaches based on attention mechanisms face a trade-off between computational complexity and feature preservation (Vaswani et al., 2017): excessive focus on global dependencies can weaken local feature emphasis, while overemphasizing local features may result in loss of global context. The SAT is designed to balance global and local features while improving EEG data processing efficiency. The SAT structure is shown in Figure 4.

Figure 4. Structure of the SAT, which divides the sequence into multiple blocks using a sliding window, computes the mean attention of each block through Aggregation-Attention, and selects important blocks with Top-Attention.

2.5.1 Sliding window

The sliding window partitions the input sequence into blocks, replacing the approach of feeding the entire sequence directly into subsequent layers. Using a sliding window of length $s$, the input sequence of length $n$ is divided into $m$ blocks with a stride of $d$, where $m = \lfloor (n - s)/d \rfloor + 1$ and $m < n$. When $d \ge s$, the blocks do not overlap; when $d < s$, adjacent blocks overlap. Each block contains $s$ consecutive tokens, producing a total of $n_s = m \times s$ tokens, with $n_s > n$ in the overlapping case. Compared to simply dividing the sequence into equal-sized, non-overlapping blocks, the sliding window approach increases the total number of tokens, which leads to higher computational cost. However, it partially compensates for the loss of inter-block connections. By adjusting the stride, the degree of overlap between blocks can be controlled, allowing the network to maintain block feature independence while enhancing global correlations. This design makes the overall network more flexible.
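The partitioning can be made concrete with a small sketch. It assumes the block count $m = \lfloor (n - s)/d \rfloor + 1$, so trailing tokens that do not fill a whole window are dropped, and it uses 1-based positions. The helper name `sliding_blocks` and the values of `s` and `d` are hypothetical; only n = 13 follows from a 1,000-sample trial.

```python
def sliding_blocks(n, s, d):
    """Return the 1-based token positions covered by each block."""
    m = (n - s) // d + 1                      # number of blocks
    return [[(j - 1) * d + i for i in range(1, s + 1)] for j in range(1, m + 1)]


# Overlapping case (d < s): 13 tokens, window 4, stride 1.
blocks = sliding_blocks(n=13, s=4, d=1)
print(len(blocks))                            # 10
print(blocks[0], blocks[-1])                  # [1, 2, 3, 4] [10, 11, 12, 13]
print(sum(len(b) for b in blocks))            # 40

# Non-overlapping case (d >= s): every token appears at most once.
print(sliding_blocks(n=12, s=4, d=4))
```

With these illustrative values the sequence expands from 13 to 40 tokens, which shows the extra cost that overlap introduces in exchange for shared context between neighbouring blocks.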

2.5.2 Aggregation-attention

SAT introduces the concept of sparse attention (Beltagy et al., 2020; Huang et al., 2024), which eliminates the need to compute attention between every pair of positions. Each of the $m$ blocks produced by the sliding window is averaged: the $s$ tokens within a block are aggregated into a single block-level representation, yielding global coarse-grained features. If all tokens within the blocks are expanded in sequence, the position of the $i$th token in the $j$th block is $(j-1) \times d + i$. Let $K_{j,i}$ and $V_{j,i}$ denote the key and value of the $i$th token in the $j$th block, and let the corresponding block-level key and value vectors be $\bar{K}_j$ and $\bar{V}_j$, respectively. These can be formalized as follows:

$$\bar{K}_j = \frac{1}{s}\sum_{i=1}^{s} K_{j,i}, \qquad \bar{V}_j = \frac{1}{s}\sum_{i=1}^{s} V_{j,i}$$

where $1 \le j \le m$ and $1 \le i \le s$. Each query $Q_i$ is dot-multiplied with all $\bar{K}_j$ to compute attention scores. Then the weighted sum is obtained through $\bar{V}_j$, and the output of the aggregation attention path can be expressed as follows:

$$\mathrm{Attention}(Q_i, K_{avg}, V_{avg}) = \mathrm{softmax}\left(\frac{Q_i K_{avg}^{\top}}{\sqrt{d_k}}\right) V_{avg}$$

where $K_{avg}$ and $V_{avg}$ are the sets of $\bar{K}_j$ and $\bar{V}_j$, denoted as $K_{avg} = \{\bar{K}_1, \ldots, \bar{K}_m\}$ and $V_{avg} = \{\bar{V}_1, \ldots, \bar{V}_m\}$, and $d_k$ is the key dimension. By employing a block aggregation approach, the global features of the entire sequence can be obtained, and the computational complexity can be reduced from $O(n^2)$ to $O(n \cdot m)$, thereby enhancing the efficiency of sequence processing.
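A single-head NumPy sketch of the aggregation path follows. It is a minimal illustration under stated assumptions: block-level keys and values are plain means over each window, the scores use the usual scaled dot product with softmax, and the function name, the block count $m = \lfloor (n - s)/d \rfloor + 1$, and the sizes are illustrative rather than taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def aggregation_attention(Q, K, V, s, d):
    """Attend from n queries to m block-averaged keys and values."""
    n, dk = K.shape
    m = (n - s) // d + 1                                   # number of blocks
    starts = np.arange(m) * d                              # 0-based block starts
    K_avg = np.stack([K[a:a + s].mean(axis=0) for a in starts])   # (m, dk)
    V_avg = np.stack([V[a:a + s].mean(axis=0) for a in starts])   # (m, dv)
    scores = Q @ K_avg.T / np.sqrt(dk)                     # (n, m), not (n, n)
    return softmax(scores) @ V_avg, scores


rng = np.random.default_rng(0)
n, dk = 13, 32
Q, K, V = (rng.standard_normal((n, dk)) for _ in range(3))
out, scores = aggregation_attention(Q, K, V, s=4, d=1)
print(out.shape, scores.shape)
```

The score matrix has one column per block instead of one per token, which is where the reduction from $O(n^2)$ to $O(n \cdot m)$ comes from. With s = d = 1 every block is a single token and the function reduces to ordinary full attention.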

2.5.3 Top-attention

Relying solely on aggregated blocks inevitably leads to feature information loss. Therefore, all blocks are selectively evaluated to retain the key feature blocks, minimizing accuracy loss during the modeling of long-range dependencies. Top-k attention (Chen et al., 2023) is integrated on top of Aggregation-Attention. Specifically, the importance of each block is first evaluated using the attention scores already obtained for that block, which avoids introducing new computational overhead. The k blocks with the largest attention scores are then selected. In the EEG sequence scenario of this study, the input sequence to the Top-Attention module is relatively short (only T/75), so the number of blocks m obtained after window partitioning is approximately 10, which is small. The selection of k must therefore balance sparsity and feature retention. Sparse filtering in Top-Attention aims to focus on a small number of key blocks, reducing computational redundancy while preventing non-critical information from diluting the attention weights. When the input sequence is short and the total number of blocks is small, setting k too large makes the number of selected blocks approach the total number of blocks, which provides almost no effective sparsity. In this case, Top-Attention degenerates into a near full-block computation and loses the reduced complexity that sparse selection is meant to provide. In addition, introducing too many non-key blocks disperses the attention weights and weakens the model's ability to focus on core features. Both effects contradict the design goal of improving efficiency and precision through sparsity. Therefore, k is generally chosen to be less than half of the total number of blocks m. A mask matrix M that depends on the attention score matrix Sm is then generated. The selected k index positions are assigned a score of 1, while the remaining positions are assigned a score of −∞, forcing the weights of non-top blocks to approach zero. The attention score Sk can be expressed as:

where $S_m \in \mathbb{R}^{n \times m}$ is the attention score matrix. Subsequently, the $s$ original tokens of the top blocks are recovered, and the position of the $i$th token in the $j$th top block can be denoted as $(t_j - 1) \times d + i$, where $t_j$ is the index of that block in the original partition. After the original tokens of the top blocks are unfolded in sequence, the total number of tokens is $n_k = k \times s$, with $k \ll m$. At this point, the original key-value pairs are restored to $K_{top}$ and $V_{top}$, and the attention is then calculated with $Q_i$, which can be represented as:

$$\mathrm{Attention}(Q_i, K_{top}, V_{top}) = \mathrm{softmax}\left(\frac{Q_i K_{top}^{\top}}{\sqrt{d_k}}\right) V_{top}$$

where $K_{top}$ and $V_{top}$ are the sets of the original keys $K_{t_j,i}$ and values $V_{t_j,i}$ of the selected blocks, with $1 \le j \le k$ and $1 \le i \le s$, and $d_k$ is the key dimension. Top-Attention filters crucial blocks for computation, significantly reducing computational load compared to processing the original input sequence. By sparsely selecting k, complexity is further optimized to O(n·k), cutting computational redundancy while retaining key fine-grained features. Finally, the SAT output combines the results from Aggregation-Attention and Top-Attention and is used in multi-head attention to achieve local and global perception of EEG information.
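The block selection and token recovery can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the top-k blocks are chosen separately for each query, selection is done by direct indexing instead of an explicit mask matrix, tokens shared by overlapping blocks are counted once per block, and the values of `s`, `d`, and `k` are illustrative. The block-level scores are recomputed inline so that the example is self-contained.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def top_attention(Q, K, V, scores, s, d, k):
    """scores: (n, m) block-level scores reused from the aggregation path."""
    n, dk = Q.shape
    out = np.empty((n, V.shape[1]))
    top = np.argsort(-scores, axis=1)[:, :k]          # (n, k) best blocks per query
    for i in range(n):
        # Recover the s original tokens of each selected block (0-based positions).
        idx = np.concatenate([np.arange(j * d, j * d + s) for j in top[i]])
        w = softmax(Q[i] @ K[idx].T / np.sqrt(dk))    # weights over k * s tokens
        out[i] = w @ V[idx]
    return out, top


rng = np.random.default_rng(0)
n, dk, s, d, k = 13, 32, 4, 1, 3                      # k < m / 2, with m = 10
Q, K, V = (rng.standard_normal((n, dk)) for _ in range(3))
m = (n - s) // d + 1
K_avg = np.stack([K[j * d:j * d + s].mean(axis=0) for j in range(m)])
scores = Q @ K_avg.T / np.sqrt(dk)
out, top = top_attention(Q, K, V, scores, s, d, k)
print(out.shape, top.shape)
```

Each query attends to only k × s recovered tokens. Setting k = m removes the sparsity entirely, which is the degenerate case the text warns about for short sequences.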

2.6 Classification

The classification module receives the features processed by SAT. After two fully connected layers with Dropout layers in between, the softmax function computes the prediction probabilities for each class. The network is trained with a cross-entropy loss function, which measures the difference between the model's predicted probability distribution and the true labels and can be represented as follows:

$$\mathcal{L} = -\frac{1}{N_b}\sum_{i=1}^{N_b}\sum_{c=1}^{M} y_{i,c}\,\log \hat{y}_{i,c}$$

where $N_b$ denotes the batch size, $M$ denotes the number of classes, $y$ represents the true sample label, and $\hat{y}$ represents the predicted sample label.
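As a worked example, the batch-averaged cross-entropy can be computed directly, assuming one-hot labels and softmax probabilities. The function name and the small `eps` guard against log(0) are additions for illustration.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot labels and probability rows."""
    n_b = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total -= sum(yc * math.log(pc + eps) for yc, pc in zip(y, p))
    return total / n_b


# Four-class example: a uniform guess versus a confident correct prediction.
labels = [[1, 0, 0, 0], [0, 0, 1, 0]]
uniform = [[0.25] * 4, [0.25] * 4]
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01]]
print(cross_entropy(labels, uniform))
print(cross_entropy(labels, confident))
```

A uniform guess over four classes gives a loss of ln 4, about 1.386, which is a useful reference point for the start of training on a four-class task.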

3 Experimental design

3.1 Dataset preparation

MI is a mainstream EEG paradigm in which subjects generate electrical signals in their brains purely by imagination. This paper uses two MI datasets, BCI Competition IV 2a (Brunner et al., 2008) and BCI Competition IV 2b (Leeb et al., 2008), as experimental data. The effectiveness of the proposed APCformer decoding algorithm is verified through cross-session EEG experiments. The details of the datasets are as follows.

3.1.1 Dataset I

The BCI Competition IV 2a dataset contains four-class motor imagery data (left hand, right hand, feet, and tongue) from 9 subjects. Each subject completed two sessions, which are used as the training set and the test set, respectively. Each session consists of 6 runs, and each run contains 48 trials. At the beginning of a trial (t = 0 s), a fixation cross appears. At t = 2 s, a cue pointing up, down, left, or right is shown for 1.25 s. The subject then performs the corresponding motor imagery until the cross disappears at t = 6 s, followed by a 2 s rest period. The entire process lasts approximately 8 s. In this paper, a 0.5–30 Hz band-pass filter is applied to this dataset. For each trial, the 2–6 s MI segment is retained. Based on the number of segments, the number of channels, and the number of samples, a single-subject training dataset and test dataset are constructed, each with a size of [288, 22, 1,000].
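The tensor size follows from simple arithmetic, shown here as a short check. The sampling rate of 250 Hz is an assumption made for this sketch (it is the value at which a 4 s segment yields 1,000 samples), and `epoch_bounds` is a hypothetical helper.

```python
def epoch_bounds(t_start, t_end, fs):
    """Convert a time window in seconds into sample indices [start, stop)."""
    return int(t_start * fs), int(t_end * fs)


fs = 250                                   # Hz, assumed sampling rate
start, stop = epoch_bounds(2, 6, fs)       # 2-6 s MI segment
n_trials = 6 * 48                          # runs per session x trials per run
n_channels = 22
shape = (n_trials, n_channels, stop - start)
print(start, stop)                         # 500 1500
print(shape)                               # (288, 22, 1000)
```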

3.1.2 Dataset II

The BCI Competition IV 2b dataset consists of two-class (left-hand and right-hand) MI-EEG data from 9 subjects. Each subject's data comprise 5 sessions, which differ in the type of visual feedback. The first two sessions have no visual feedback and consist of 4 runs, with 20 trials for each hand in each run. At the beginning of a trial, a fixation cross appears and then changes to a left or right arrow, which is displayed for about 1.25 s. The subject performs the motor imagery from 3 to 7 s, followed by a 1.5 s rest period. The entire trial lasts 8.5 to 9.5 s. The last three sessions have visual feedback and consist of 6 runs, with 10 trials for each hand in each run. At the beginning of a trial, a gray smiley face is displayed. At t = 3 s, the subject is required to imagine moving the smiley face to the left or right. If the imagined direction is correct, the face turns green; if it is incorrect, the face turns red and becomes a sad face. At t = 7.5 s, the screen goes blank and a 1 to 2 s rest period follows. In this paper, a 0.5–30 Hz band-pass filter is applied, and the 3–7 s MI segment of each trial is extracted for subsequent analysis. The resulting single-subject training and test sets have a size of [n, 3, 1,000], corresponding to the number of segments, the number of channels, and the number of samples, where n denotes the number of trials completed by each subject.
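The 0.5–30 Hz band-pass step applied to both datasets can be illustrated with a simple stand-in. The source does not state which filter design is used, so the sketch below swaps in an idealized FFT-based (brick-wall) filter purely for illustration; practical EEG pipelines typically use IIR or FIR designs instead. The function name `fft_bandpass`, the assumed 250 Hz sampling rate, and the synthetic test signal are all assumptions.

```python
import numpy as np

def fft_bandpass(x, fs, low_hz=0.5, high_hz=30.0):
    """Idealized band-pass: zero every FFT bin outside [low_hz, high_hz].

    x  : (..., samples) signal, filtered along the last axis
    fs : sampling rate in Hz
    """
    n = x.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.fft.rfft(x, axis=-1)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[..., ~keep] = 0.0  # discard out-of-band components
    return np.fft.irfft(spectrum, n=n, axis=-1)

fs = 250                                   # assumed sampling rate in Hz
t = np.arange(1000) / fs                   # one 4 s segment
in_band = np.sin(2 * np.pi * 10 * t)       # 10 Hz component, inside the pass band
out_band = np.sin(2 * np.pi * 50 * t)      # 50 Hz component, outside the pass band
trial = np.tile(in_band + out_band, (3, 1))  # 3 channels, as in Dataset II

filtered = fft_bandpass(trial, fs)
print(filtered.shape, np.abs(filtered[0] - in_band).max())
```

In this toy signal the 50 Hz component is removed and the 10 Hz component is preserved, which is the behavior the band-pass step is meant to achieve on real recordings.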

3.1.3 Data preprocessing

EEG data can be represented as $X \in \mathbb{R}^{C \times T}$, where $C$ denotes the number of EEG channels and $T$ represents the number of sampling points along the time dimension. First, a band-pass filter is applied to extract the task-relevant 0.5–30 Hz frequency band from the EEG signals, suppressing noise and artifact components. Each MI-EEG segment has its own temporal characteristics. Standardizing uniformly across segments or channels tends to mask the baseline fluctuations of individual segments and channel-specific differences. Therefore, Z-score standardization is applied separately to each channel of each segment (Song et al., 2022; Zhao W. et al., 2024). Specifically, let $x_{n,c,t}$ denote the EEG signal at the $t$-th sampling point of channel $c$ in segment $n$, and let $z_{n,c,t}$ denote the standardized signal. For each segment $n$ and each channel $c$, the mean $\mu_{n,c}$ over all sampling points, which reflects the local baseline level, is computed as:

$$\mu_{n,c} = \frac{1}{T}\sum_{t=1}^{T} x_{n,c,t}$$

where $T$ is the total number of sampling points in a single segment, so that the mean is computed only over the current channel of the current segment. Next, the unbiased standard deviation $\sigma_{n,c}$ over all sampling points, which quantifies the local amplitude dispersion, is computed as:

$$\sigma_{n,c} = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T}\left(x_{n,c,t}-\mu_{n,c}\right)^{2}}$$

where the denominator is taken as $T-1$ to reduce the statistical bias arising from the finite number of sampling points in a single segment. Finally, the standardized signal $z_{n,c,t}$ is obtained as:

$$z_{n,c,t} = \frac{x_{n,c,t}-\mu_{n,c}}{\sigma_{n,c}}$$

After the Z-score transformation, each channel of each segment has a mean of 0 and a standard deviation of 1, which reduces the volatility and non-stationarity of the data.
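The per-segment, per-channel standardization described above can be sketched in NumPy as follows. This is a minimal sketch rather than the authors' code: the function name `zscore_per_segment_channel` and the synthetic data are assumptions, and the array layout (segments, channels, samples) follows the dataset sizes given earlier.

```python
import numpy as np

def zscore_per_segment_channel(x):
    """Standardize each channel of each segment independently.

    x : (segments, channels, samples)
    The statistics are taken along the time axis only, and ddof=1 gives
    the (T - 1) denominator for the standard deviation.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, ddof=1, keepdims=True)
    return (x - mu) / sigma

# Synthetic stand-in: 4 segments, 22 channels, 1000 samples,
# with a non-zero baseline and non-unit scale.
rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((4, 22, 1000))

z = zscore_per_segment_channel(x)
print(z.mean(axis=-1).round(6)[0, :3], z.std(axis=-1, ddof=1).round(6)[0, :3])
```

Because the mean and standard deviation are computed along the last axis with `keepdims=True`, no statistics are shared between segments or between channels.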

3.1.4 Data augmentation

Training on EEG data in small-sample scenarios is highly prone to overfitting, which severely limits the accuracy and generalization performance of the model. In this paper, the Segmentation and Reconstruction (S&R) technique is applied in the time domain to augment the EEG signals (
