DeepRespNet: a hybrid attention-recurrent framework for non-contact respiratory rate estimation

Abstract

Introduction:

The noncontact measurement of respiratory rate (RR) has gained considerable attention recently due to its relevance to remote healthcare and continuous physiological monitoring. However, existing camera-based approaches often exhibit reduced accuracy in subjects having darker skin tones, primarily owing to melanin absorption and variations in illumination that degrade remote photoplethysmographic (rPPG) signal quality. This study aims to develop a robust deep learning framework that ensures reliable RR estimation across diverse skin tones.

Methods:

A hybrid deep learning framework, referred to as DeepRespNet, is proposed that jointly analyzes rPPG and motion signals extracted from facial and thoracic regions in RGB video sequences. The core feature encoder, termed RespFormer, integrates spatiotemporal convolution with multi-head self-attention to capture both local and long-range respiratory patterns. The learned representations are further processed by a separate bidirectional long short-term memory (BiLSTM) network to model temporal coherence and generate physiologically stable respiratory waveforms. Optical-flow-based motion features and rPPG color variations are combined to form a multi-channel respiratory representation.

Results:

The proposed framework was evaluated on a multi-subject dataset with synchronized reference respiration signals. It achieved mean absolute errors of 0.45 breaths per minute (BPM) for light-skinned subjects and 0.80 BPM for dark-skinned subjects. Bland–Altman and cross-dataset analyses further confirm strong agreement and consistent performance across skin tone groups.

Conclusion:

The proposed framework enables reliable and skin-tone-aware noncontact respiratory rate estimation. Initial findings indicate its potential suitability for camera-based respiratory monitoring in remote healthcare and telemedicine applications, though further validation on larger populations is required.

1 Introduction

Respiratory rate (RR) is a vital physiological parameter and an important early indicator of respiratory distress, sleep apnea, and other cardiorespiratory conditions (Maharaj et al., 2015; Rambaud-Althaus et al., 2015). Traditional methods of RR monitoring, such as chest belts, spirometry, or capnography, provide reliable measurements but typically require contact-based sensors. These can be uncomfortable, may restrict movement, and are unsuitable for long-term or remote monitoring applications, especially in telemedicine settings (Gwak et al., 2023). Over the past decade, non-contact approaches to respiratory monitoring have emerged as promising alternatives. These methods can be classified into one of two categories. The first category is remote photoplethysmography (rPPG), or camera-based PPG, which derives physiological signals from color variations in the skin captured via RGB camera (Bauer et al., 2008). The other category comprises motion-based approaches, which extract respiratory signals by tracking subtle chest, torso, or facial motion using optical flow, motion magnification, or feature tracking.

The rPPG-based estimation of RR and heart rate (HR) has been widely studied, and accurate cardiorespiratory monitoring can be achieved by using standard cameras under controlled conditions (Mateu-Mateus et al., 2021; Kashevnik et al., 2021). These methods leverage variations in blood volume pulse linked to physiological rhythms, thereby enabling noncontact measurement at a distance. Chrominance-based approaches, such as the CHROM method, have shown enhanced robustness to motion and illumination compared with blind source separation (BSS) techniques (de Haan and Jeanne, 2013). Similarly, a new method named GRGB rPPG employs linear RGB transformations to improve performance under indoor lighting and motion conditions (Haugg et al., 2023). Moreover, BSS methods like ICA and PCA, along with envelope-based fitting, have improved pulse morphology reconstruction in noisy environments, emphasizing the importance of advanced signal processing for robust rPPG estimation (Sun et al., 2023).

Despite these strengths, rPPG has certain limitations. One drawback is its sensitivity to skin pigmentation: melanin attenuates optical signals, which reduces signal-to-noise ratios (SNRs) for darker-skinned individuals. Several studies have reported a measurable difference in performance across Fitzpatrick skin types, raising concerns about demographic fairness (Addison et al., 2018; Dasari et al., 2021). Meta-analyses further confirm that imaging-based PPG techniques consistently underperform for darker skin tones (Nowara et al., 2020). More recent investigations on publicly available rPPG datasets have identified persistently higher error rates in darker-skinned subjects, highlighting structural bias in existing benchmarks (Bondarenko et al., 2025). Additionally, Diverse R-PPG (Chari et al., 2020) has demonstrated that while HR estimation can be evaluated across a wide range of skin tones, performance disparities remain significant. To overcome these drawbacks, approaches such as PhysFlow (Comas et al., 2024) have introduced skin-tone-aware augmentation strategies to improve HR estimation robustness. However, these studies primarily focus on HR estimation, and the influence of skin tone variation on camera-based RR estimation has not been thoroughly investigated, which is the research gap addressed in this work.

Several widely used public benchmark datasets exist for physiological signal estimation, including PURE (Stricker et al., 2014), UBFC-rPPG (Bobbia et al., 2019), COHFACE (Heusch et al., 2017), VIPL-HR (Niu et al., 2018), and MAHNOB-HCI (Soleymani et al., 2011). These datasets primarily focus on blood volume pulse (BVP) and HR estimation, with limited emphasis on RR estimation, as summarized in Table 1. Therefore, this study introduces a custom dataset aimed at addressing these gaps by focusing on RR estimation across diverse skin tones.

| Dataset | No. of subjects | Physiological signal | Ground truth sensor |
|---|---|---|---|
| PURE [Stricker et al., 2014] | 10 | BVP, HR, SpO2 | CMS50E pulse oximeter |
| UBFC-rPPG [Bobbia et al., 2019] | 42 | BVP, HR, SpO2 | CMS50E pulse oximeter |
| VIPL-HR [Niu et al., 2018] | 107 | BVP, HR, SpO2 | CONTEC CMS60C BVP sensor |
| MAHNOB-HCI [Soleymani et al., 2011] | 27 | ECG, RR | CONTEC CMS60C BVP sensor |
| COHFACE [Heusch et al., 2017] | 40 | BVP, RR | Respiration belt |
| TokyoTech rPPG [Maki et al., 2019] | 9 | BVP, HR | ProComp Infiniti T7500M |
| MMSE-HR [Rapczynski et al., 2019] | 40 | HR, RR, Blood pressure | Biopac Mp150 BVP Sensor |
| MMPD [Tang et al., 2023] | 33 | BVP, HR | Finger PPG (HKG-07 C+) at 30 H |

Table 1. Available public datasets.

In motion-based respiratory estimation, optical flow and motion magnification techniques are used to capture subtle thoracic displacements to monitor breathing patterns. These methods perform effectively under low illumination or partial occlusion, as they rely on geometric motion cues rather than variations in pixel intensity (Pediaditis et al., 2022; Ahani et al., 2024). Recent studies have integrated optical flow with deep spatiotemporal networks, enabling stable respiration tracking across different poses and lighting conditions (Manne et al., 2023). Such approaches are particularly beneficial when rPPG quality is poor, such as for subjects with darker skin tones, as they provide a reliable, motion-driven alternative to noncontact respiratory monitoring.

Despite these strengths, motion-based approaches also face notable limitations. Basic optical flow is highly sensitive to non-respiratory body movements, camera jitter, or background interference, which often introduces noise into the breathing signal (Gwak et al., 2022). Motion magnification techniques, although effective at amplifying subtle respiratory movement, can also unintentionally enhance irrelevant motion or noise, particularly in recordings by a handheld or unstable camera, unless additional stabilization techniques are applied (Alam et al., 2017).

Recently, deep learning-based approaches have gained attention for RR estimation by integrating data-driven feature extraction with the modeling of temporal and spectral dynamics. Convolutional neural networks (CNNs) effectively learn the spatial representations of skin pixels, improving robustness against illumination changes and noise compared to traditional signal processing techniques (Shihab, 2025). Recurrent models such as long short-term memory (LSTM) networks capture long-term temporal dependencies, enhancing the accuracy of cycle-level predictions in noisy sequences (Lin et al., 2019). Attention mechanisms further refine these models by adaptively emphasizing informative temporal and spatial features, thereby improving generalization to real-world conditions (Chen et al., 2024). Moreover, generative architectures, including adversarial and CycleGAN-based frameworks, have been explored to reconstruct respiratory components from weak or corrupted signals, demonstrating strong adaptability under low signal-to-noise conditions (Aqajari et al., 2021).

While deep learning has shown strong potential, most existing approaches face limitations that affect reliability and generalization. Many rely on a single region of interest (ROI), typically the face or chest, making them susceptible to signal degradation caused by posture changes, occlusions, or localized motion artifacts (Romano et al., 2021; Yu et al., 2019a). Systems that use facial regions alone often fail under masks, hair occlusion, or non-frontal poses (Speth et al., 2022). Additionally, skin tone variability remains underexplored, even though higher melanin concentration absorbs a larger portion of the incident visible light. This reduces the amount of reflected light available for rPPG extraction, weakening the optical signal and potentially affecting performance across demographic groups (Dasari et al., 2021; Bondarenko et al., 2025). Finally, while deep learning has shown strong potential in contactless HR estimation (Chen and McDuff, 2018), its integration into RR estimation remains limited and has prevented current models from fully leveraging both physiological priors and data-driven learning (Park and Hong, 2023).

To address these limitations, this study develops a unified learning framework that jointly analyzes rPPG and motion-derived cues from facial and thoracic regions. The framework first employs a lightweight RespFormer module, which integrates spatio-temporal convolution with multi-head self-attention to capture both local and long-range respiratory patterns and enhance respiration-related feature representations. These features are then processed by a bidirectional long short-term memory (BiLSTM) network, which models temporal dependencies in the frame sequence and enables accurate reconstruction of the respiratory waveform. The key contributions of this study are as follows:

A novel RespFormer feature encoder that integrates convolutional and multi-head attention mechanisms to effectively capture spatiotemporal respiratory cues from RGB-derived signals.

A bidirectional temporal modeling module using BiLSTM to improve cycle-level coherence and enhance the stability of estimated respiratory waveforms.

A dual-ROI strategy incorporating both facial rPPG and thoracic motion signals to improve robustness under challenging illumination and subject variability.

A skin-tone-aware evaluation protocol that analyzes performance across light and dark skin groups, providing insights into the fairness and cross-demographic consistency of the proposed method.

This manuscript is organized into three main sections. The Materials and Methods section describes the proposed framework, including signal extraction, preprocessing, the hybrid RespFormer-BiLSTM architecture, and experimental setup. The Results section presents performance metrics, comparative analysis, and validation across skin tone groups. The Discussion section interprets findings, addresses limitations, and outlines future research directions.

2 Materials and methods

This work proposes a noncontact RR estimation framework using RGB video as the primary input. Each frame provides high-resolution spatial and color information across the red, green, and blue channels, thereby enabling reliable ROI detection and tracking. The system integrates facial and chest ROIs as complementary sources, with facial regions offering rPPG signals and chest regions encoding respiratory motion. The overall processing pipeline is illustrated in Figure 1.

Flowchart illustrating a pipeline for estimating respiratory rate from input video frames, showing stages of ROI processing, signal extraction, denoising, deep neural network analysis (RespFormer and BiLSTM), and final respiratory rate estimation with corresponding signal graphs and network diagram.

Figure 1. Overall workflow of the DeepRespNet framework.

2.1 Experimental setup and data collection

The experimental data were collected using a Sony Alpha 7 III camera recording at a resolution of 6000 × 4000 pixels and 30 fps. The camera was positioned 1 m from each subject (Figure 2) to maintain consistency in framing and image quality. Each recording captured both the facial and upper chest regions to represent respiration-induced motion under ambient lighting conditions of approximately 500 lux, measured using a digital lux meter. A uniform green background was used to reduce visual noise, whereas other environmental conditions remained natural. Participants were seated comfortably and instructed to breathe normally without motion restrictions, ensuring realistic acquisition conditions.

Research setup showing a person seated in front of a green backdrop, wearing a chest-mounted device connected to a medical monitor on the table, with a camera on a tripod positioned nearby and a laptop displaying data recording software on a chair beside the table.

Figure 2. Overview of the data acquisition setup showing camera position, subject posture, and the placement of the reference respiration belt.

The dataset consisted of 28 participants (14 male, 14 female) from diverse ethnic groups, including South Asian, Middle Eastern, and East Asian populations, all aged between 25 and 62 years (mean ± SD: 28.71 ± 8.37 years). The recorded respiratory rates ranged from 6 to 30 BPM, covering slow to normal resting breathing patterns. Each recording session lasted 60 s. Participants were categorized into five skin tone groups (types I–V) based on the Fitzpatrick skin type scale. Classification was performed through visual inspection of facial and forearm skin under controlled indoor illumination conditions. This classification represents a subjective visual grouping for analytical purposes rather than a clinical dermatological diagnosis. Consistent with prior work in camera-based physiological monitoring (Talukdar et al., 2023), participants were subsequently divided into lighter (types I–II, n = 17) and darker (types III–V, n = 11) skin tone groups for comparative analysis. This diversity enabled an evaluation of model consistency across varying skin tones and facial characteristics, though the sample size limits broader generalization claims. The study was conducted in accordance with Institutional Review Board (IRB) regulations under the corresponding approval, and informed consent forms were collected from all participants prior to the experiment.

Ground-truth respiratory signals were simultaneously recorded using a Go Direct Respiration Belt (Code: GDX-RB), worn around the thorax (Figure 2). The sensor output was recorded in CSV format for synchronization and comparison with the non-contact respiratory signals extracted from the proposed framework.

Since the camera and respiration belt operate as independent acquisition systems, temporal synchronization was performed in the post-processing step by matching their dominant breathing cycles using cross-correlation analysis. The time delay corresponding to the maximum cross-correlation between the two signals was used to determine the temporal offset, after which a constant temporal shift was applied to the camera-derived signal to align it with the belt reference. Prior to alignment, both signals were resampled to the same temporal resolution based on the camera frame rate.
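The alignment step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`resample_to_rate`, `estimate_lag`, `align`) are hypothetical, and linear interpolation is assumed for the resampling, which the text does not specify.

```python
import numpy as np
from scipy import signal


def resample_to_rate(times, values, fs):
    """Linearly interpolate a sampled series onto a uniform grid at fs Hz."""
    grid = np.arange(times[0], times[-1], 1.0 / fs)
    return grid, np.interp(grid, times, values)


def estimate_lag(camera_sig, belt_sig):
    """Lag (in samples) at which the cross-correlation is maximal.

    A positive lag means the camera signal is delayed relative to the belt.
    """
    cam = (camera_sig - np.mean(camera_sig)) / (np.std(camera_sig) + 1e-12)
    belt = (belt_sig - np.mean(belt_sig)) / (np.std(belt_sig) + 1e-12)
    xcorr = signal.correlate(cam, belt, mode="full")
    lags = signal.correlation_lags(len(cam), len(belt), mode="full")
    return int(lags[np.argmax(xcorr)])


def align(camera_sig, belt_sig):
    """Apply a constant shift so both signals start at the same instant."""
    lag = estimate_lag(camera_sig, belt_sig)
    if lag > 0:        # camera is late: drop its leading samples
        camera_sig = camera_sig[lag:]
    elif lag < 0:      # camera is early: drop leading belt samples
        belt_sig = belt_sig[-lag:]
    n = min(len(camera_sig), len(belt_sig))
    return camera_sig[:n], belt_sig[:n], lag
```

For strictly periodic breathing the correlation peak repeats every breathing cycle, so in practice the search range for the lag may need to be limited to less than one cycle.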

All preprocessing, signal processing, and model implementation were developed in Python 3.12. Video processing and region extraction were implemented using OpenCV and NumPy, signal processing operations including filtering and cross-correlation were performed using SciPy, and the deep learning components were implemented in PyTorch (version 2.9.1).

2.2 ROI extraction

The initial step involves acquiring RGB video sequences of the subject, followed by detection of the facial and chest regions, the two areas that carry blood volume pulse variations and respiration-induced motion, respectively. Facial detection was performed using the Haar cascade classifier (Viola and Jones, 2001), selected for its computational efficiency and real-time capability on resource-constrained devices. The classifier computes Haar-like features as intensity differences between adjacent rectangular regions, as defined in Equation 1:

$$ f = \sum_{i \in R_{\text{white}}} I(i) - \sum_{i \in R_{\text{black}}} I(i) \tag{1} $$

where I(i) denotes the pixel intensity at location i, and $R_{\text{white}}$ and $R_{\text{black}}$ represent the white and black rectangular regions in the Haar kernel, respectively. After the facial bounding box (x, y, w, h) was detected, the location of the chest ROI was geometrically inferred by projecting a second region directly below the face. The vertical and horizontal boundaries of the chest ROI were defined in Equations 2 and 3, respectively.

$$ y + h \le y_{\text{chest}} \le y + \alpha h \tag{2} $$

$$ x - \beta w \le x_{\text{chest}} \le x + w + \beta w \tag{3} $$

where α and β are scaling factors that determine the vertical and horizontal extent of the chest region relative to the detected face. The values were set to α = 2.5 and β = 0.2, determined empirically across all subjects in the dataset to identify the configuration that most consistently captured the upper thoracic region across varying body sizes and camera distances. The resulting ROI extends approximately 2.5h below the top of the face bounding box and is expanded horizontally by 20% of the face width on both sides. These proportional factors were selected to reliably include the region where respiratory motion is most prominent, while maintaining spatial consistency across subjects.
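The geometric projection can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the helper name `chest_roi` is hypothetical, the chest region is assumed to start at the bottom edge of the face box, and clipping to the frame boundary is added for safety even though the text does not mention it.

```python
def chest_roi(face_box, frame_shape, alpha=2.5, beta=0.2):
    """Infer a chest box (x, y, w, h) from a face box (x, y, w, h).

    alpha and beta follow the values reported in the text; the top edge
    (bottom of the face box) is an assumption of this sketch.
    """
    x, y, w, h = face_box
    frame_h, frame_w = frame_shape[:2]
    x0 = max(0, int(round(x - beta * w)))            # widen by beta*w on the left
    x1 = min(frame_w, int(round(x + w + beta * w)))  # and on the right
    y0 = min(frame_h, y + h)                         # start just below the face
    y1 = min(frame_h, int(round(y + alpha * h)))     # end alpha*h below the face top
    return x0, y0, x1 - x0, y1 - y0
```

For a face box of (100, 50, 80, 100) in a 640 × 480 frame, this yields a chest box 112 pixels wide and 150 pixels tall, starting at (84, 150).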

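Skin segmentation was then performed using dual color-space thresholds in the YCrCb and HSV domains to improve robustness to illumination changes. In the YCrCb space, skin pixels were identified based on chrominance ranges of the Cr and Cb channels, while in the HSV space, thresholds were applied to the hue and saturation components. The corresponding binary masks were defined in Equations 4 and 5, respectively.

$$ M_{\text{YCrCb}} = \begin{cases} 1, & Cr_{\min} \le Cr \le Cr_{\max} \ \text{and} \ Cb_{\min} \le Cb \le Cb_{\max} \\ 0, & \text{otherwise} \end{cases} \tag{4} $$

$$ M_{\text{HSV}} = \begin{cases} 1, & H_{\min} \le H \le H_{\max} \ \text{and} \ S_{\min} \le S \le S_{\max} \\ 0, & \text{otherwise} \end{cases} \tag{5} $$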

where Cr, Cb, H, and S denote the chrominance and hue-saturation components of each pixel, and the thresholds ($Cr_{\min}$, $Cr_{\max}$, $Cb_{\min}$, $Cb_{\max}$) and ($H_{\min}$, $H_{\max}$, $S_{\min}$, $S_{\max}$) define the skin-color ranges in the respective color spaces. The threshold ranges were adopted from previously reported skin-detection models in the YCrCb and HSV color spaces (Chai and Ngan, 1999), which characterize typical skin chrominance and hue-saturation distributions across varying illumination conditions.

Pixels satisfying either condition formed the final skin mask $M_{\text{skin}}$, as expressed in Equation 6:

$$ M_{\text{skin}} = M_{\text{YCrCb}} \lor M_{\text{HSV}} \tag{6} $$

The resulting mask $M_{\text{skin}}$ was further refined using Gaussian smoothing and morphological filtering to suppress noise and small isolated regions (Saryanto and Andriyani, 2024). The final face and chest ROIs were extracted using Equation 7:

$$ \text{ROI} = I \odot M_{\text{skin}} \tag{7} $$

where I denotes the original RGB frame. This ensures that only skin-reflective regions are preserved, minimizing interference from background or clothing. Figure 3 illustrates this process, showing the detected ROIs and corresponding binary skin masks across consecutive frames. This dual-ROI configuration effectively captures plethysmographic variations from the face (Verkruysse et al., 2008) and respiration-induced motion from the chest, thereby providing a reliable mechanical coupling of breathing dynamics (Cheng et al., 2023).
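The thresholding, union, and masking steps (Equations 4 to 7) can be sketched with NumPy as below. The sketch assumes the frame has already been converted to YCrCb and HSV (for example with OpenCV) and omits the Gaussian and morphological refinement. The numeric thresholds are illustrative placeholders only, since the paper does not list the values it adopted; the Cr/Cb range is the one commonly attributed to Chai and Ngan (1999), and the HSV range is a hypothetical choice on OpenCV's 0–179 hue scale.

```python
import numpy as np

# Illustrative thresholds, not the values used in the study.
CR_RANGE, CB_RANGE = (133, 173), (77, 127)
H_RANGE, S_RANGE = (0, 25), (40, 255)


def skin_mask(ycrcb, hsv):
    """Boolean mask that is True where either color-space rule fires."""
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    h, s = hsv[..., 0], hsv[..., 1]
    m_ycrcb = ((cr >= CR_RANGE[0]) & (cr <= CR_RANGE[1])
               & (cb >= CB_RANGE[0]) & (cb <= CB_RANGE[1]))
    m_hsv = ((h >= H_RANGE[0]) & (h <= H_RANGE[1])
             & (s >= S_RANGE[0]) & (s <= S_RANGE[1]))
    return m_ycrcb | m_hsv          # union of the two masks


def apply_mask(frame, mask):
    """Zero out every pixel of an (H, W, 3) frame that is not skin."""
    return frame * mask[..., None].astype(frame.dtype)
```

Taking the union rather than the intersection keeps a pixel whenever one color space still recognizes it as skin, which is the behavior the text describes.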

Top row shows three images of a person with eyes and mouth blacked out, each labeled as Frame 100, 200, and 300, with red and blue rectangles around the face and upper body. Bottom row displays corresponding skin mask binary images labeled as Skin Mask 100, 200, and 300, highlighting the facial area in white with red outlines against a black background.

Figure 3. Illustration of ROI extraction and skin masking across frames. The upper row shows detected face (red) and chest (blue) ROIs, and the lower row shows corresponding binary skin masks highlighting skin-reflective regions.

2.3 ROI processing

After the face and chest ROIs were extracted, a preprocessing stage was applied to enhance signal reliability for subsequent analysis. Non-skin pixels identified during segmentation were masked to zero intensity, thus ensuring that only skin-reflective regions contributed to feature computation. To suppress high-frequency noise from illumination flicker, sensor variability, or minor pixel motion, temporal smoothing was applied using a Gaussian filter. Each pixel intensity sequence x(t) was convolved with a Gaussian kernel $G_\sigma$, as defined in Equation 8.

$$ \tilde{x}(t) = \sum_{\tau=-k}^{k} G_\sigma(\tau)\, x(t-\tau), \qquad G_\sigma(\tau) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\tau^2}{2\sigma^2}\right) \tag{8} $$

where σ controls the smoothing bandwidth and k denotes the kernel half-width. This step effectively removed rapid fluctuations while preserving low-frequency components corresponding to respiratory modulations of 0.1–0.6 Hz.
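A minimal NumPy sketch of this temporal smoothing is given below. The values of `sigma` and `k` are hypothetical, since the paper does not report them; the kernel is normalized to unit sum over its finite support, and edge replication is assumed at the sequence boundaries.

```python
import numpy as np


def gaussian_kernel(sigma, k):
    """Discrete Gaussian with half-width k, normalized to sum to one."""
    tau = np.arange(-k, k + 1)
    g = np.exp(-(tau ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()


def smooth_temporal(x, sigma=2.0, k=6):
    """Convolve a 1-D intensity sequence with a Gaussian kernel.

    The sequence is edge-padded by k samples so the output has the same
    length as the input.
    """
    padded = np.pad(np.asarray(x, dtype=float), k, mode="edge")
    return np.convolve(padded, gaussian_kernel(sigma, k), mode="valid")
```

At 30 fps, a kernel of this size spans well under one second, which is short relative to a breathing cycle in the 0.1–0.6 Hz band.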

In addition, frame normalization was applied to minimize inter-subject variability in skin tone and illumination. For each frame $I_f$, pixel intensities were rescaled as defined in Equation 9.

$$ \hat{I}_f = \frac{I_f - \mu_f}{\sigma_f} \tag{9} $$

where $\mu_f$ and $\sigma_f$ denote the per-frame mean and standard deviation of pixel values within the skin mask, respectively. These steps standardized and denoised the ROIs, with the aim of providing uniform feature quality across varying skin tones (Talukdar et al., 2023).
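The per-frame normalization of Equation 9 can be sketched as follows. The function name and the small `eps` guard against division by zero are assumptions of this example, and non-skin pixels are kept at zero as described earlier in this section.

```python
import numpy as np


def normalize_frame(frame, mask, eps=1e-8):
    """Standardize skin pixels of one frame to zero mean and unit variance.

    frame: 2-D (or H x W x C) array; mask: H x W boolean skin mask.
    Pixels outside the mask are left at zero.
    """
    skin = frame[mask].astype(np.float64)
    mu, sd = skin.mean(), skin.std()
    out = np.zeros(frame.shape, dtype=np.float64)
    out[mask] = (skin - mu) / (sd + eps)
    return out
```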

2.4 Signal processing

From the pre-processed ROIs, two complementary physiological signals were derived to capture respiration-related variations: an rPPG-based signal and a motion-based respiratory component. The rPPG signal was obtained by averaging green-channel intensities across skin pixels, as the green band provides the highest signal-to-noise ratio (SNR) for capturing subtle blood volume variations (Verkruysse et al., 2008). Simultaneously, respiration-induced motion was estimated using the Lucas–Kanade optical flow algorithm (Queiroz et al., 2020), with vertical displacements weighted toward thoracic regions to emphasize breathing dynamics. These two signal modalities capture distinct and complementary physiological information.
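The green-channel averaging can be illustrated as below; the optical-flow branch is not shown. The function name is hypothetical, and frames are assumed to be H × W × 3 arrays in which index 1 is the green channel (true for both RGB and OpenCV's BGR ordering).

```python
import numpy as np


def green_trace(frames, masks):
    """Mean green intensity over skin pixels, one value per frame."""
    trace = np.empty(len(frames), dtype=np.float64)
    for t, (frame, mask) in enumerate(zip(frames, masks)):
        green = frame[..., 1]
        # A frame with no detected skin yields NaN rather than a fake value.
        trace[t] = green[mask].mean() if mask.any() else np.nan
    return trace
```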

Both signals were first detrended to remove slow baseline variations caused by illumination changes and sensor drift. The detrended signals were then bandpass-filtered (0.1–0.6 Hz) using a fourth-order Butterworth filter to suppress residual low-frequency drift and high-frequency noise while preserving respiratory oscillations. To avoid phase distortion, the filter was implemented using a zero-phase forward–backward filtering procedure. The Butterworth filter response is defined in Equation 10:

$$ |H(j\omega)|^2 = \frac{1}{1 + \left(\omega / \omega_c\right)^{2n}} \tag{10} $$

where $\omega_c$ represents the cutoff frequency and n is the filter order.
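A possible SciPy realization of the detrending and zero-phase bandpass step is sketched below. Linear detrending is an assumption, since the paper does not name the detrending method, and `fs=30.0` follows the camera frame rate reported earlier. Note that SciPy's `butter` with `N=4` and a bandpass specification produces an eighth-order filter, and forward–backward filtering applies the magnitude response twice; whether the authors counted the order this way is not stated.

```python
import numpy as np
from scipy import signal


def bandpass_respiration(x, fs=30.0, low=0.1, high=0.6, order=4):
    """Detrend, then apply a zero-phase Butterworth bandpass filter."""
    x = signal.detrend(np.asarray(x, dtype=float), type="linear")
    # Second-order sections are numerically safer for such low cutoffs.
    sos = signal.butter(order, [low, high], btype="bandpass",
                        fs=fs, output="sos")
    return signal.sosfiltfilt(sos, x)   # forward-backward, zero phase
```

A quick sanity check is to feed in a 0.25 Hz sinusoid with added drift and a 3 Hz disturbance and confirm that the output spectrum peaks at 0.25 Hz.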

Following the filtering, signals were normalized using a three-sigma rule to ensure comparability across subjects with different skin tones. The normalization is expressed in Equation 11.

$$ x_{\text{norm}}(t) = \frac{x(t) - \mu}{3\sigma} \tag{11} $$

where $x(t)$ is the input signal, $\mu$ is the temporal mean, and $\sigma$ is the temporal standard deviation of the signal.

To facilitate temporal analysis, overlapping windows of 200–300 frames were extracted with 80%–90% overlap, balancing temporal resolution and data continuity. To further investigate the relative contribution of different ROIs, a weighted fusion of face and chest signals was analyzed using Equation 12:

$$ S_{\text{fused}}(t) = w_c\, S_{\text{chest}}(t) + w_f\, S_{\text{face}}(t), \qquad w_c + w_f = 1 \tag{12} $$

where $S_{\text{chest}}$ and $S_{\text{face}}$ are the chest and face signals, respectively, and $w_c$ and $w_f$ are their fusion weights. A higher weight was assigned to the chest component due to its stronger physiological coupling and higher SNR relative to the face signal (Schrumpf et al., 2019). A grid search over different weight combinations was performed to evaluate how these regions contribute to respiratory representation (Table 2).

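| Chest weight | Face weight | MAE (Light) | MAE (Dark) |
|---|---|---|---|
| 0.5 | 0.5 | 0.72 | 1.15 |
| 0.6 | 0.4 | 0.61 | 1.02 |
| 0.7 | 0.3 | 0.53 | 0.91 |
| **0.8** | **0.2** | **0.45** | **0.80** |
| 0.9 | 0.1 | 0.52 | 0.88 |
| 1.0 (chest only) | 0.0 | 0.58 | 0.95 |

Table 2. Ablation study on signal fusion weights.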

MAE values are reported in BPM. Bold values indicate the selected configuration. The 0.8/0.2 weighting achieves optimal performance by leveraging the high-SNR chest signal while retaining complementary facial information.
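The normalization, fusion, and windowing steps can be sketched together as follows. Several points are assumptions of this example: the three-sigma rule is read as division by three standard deviations, the window length of 256 frames and 87.5% overlap are arbitrary picks from within the reported ranges, and all function names are hypothetical. The default weights follow the selected 0.8/0.2 configuration.

```python
import numpy as np


def three_sigma_normalize(x, eps=1e-8):
    """Center the signal and scale it by three standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (3.0 * x.std() + eps)


def fuse(chest, face, w_chest=0.8, w_face=0.2):
    """Weighted sum of the chest and face signals."""
    return w_chest * np.asarray(chest) + w_face * np.asarray(face)


def sliding_windows(x, length=256, overlap=0.875):
    """Stack overlapping windows; requires len(x) >= length."""
    step = max(1, int(round(length * (1.0 - overlap))))
    starts = range(0, len(x) - length + 1, step)
    return np.stack([x[s:s + length] for s in starts])
```

With these settings a 60 s recording at 30 fps (1800 samples) yields 49 windows with a hop of 32 frames.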

This analysis (Figure 4) shows that chest motion signals exhibit a higher SNR, indicating stronger and more stable respiratory patterns. The facial rPPG signal exhibits a lower SNR but captures complementary physiological variations influenced by blood perfusion. These findings suggest that while chest motion provides the dominant respiratory signal, rPPG signals from facial regions provide additional physiological information that enhances robustness under varying skin tones and environmental conditions. Overall, this signal processing strategy highlights the physiological relevance of combining multiple ROIs, as both regions provide complementary insights into respiratory activity.

Box plot compares SNR (dB) across four categories: Face Motion, Chest Motion, Face PPG, and Chest PPG. Median SNR values are labeled as 1.4 dB, 5.2 dB, 2.8 dB, and 0.9 dB, respectively. Each colored box shows spread and range, highlighting Chest Motion with the highest median and range, and Chest PPG with the lowest median SNR. Title reads “SNR Comparison” and axes are labeled accordingly.

Figure 4. SNR comparison of respiration signals extracted from face and chest ROIs. Chest motion exhibits the highest reliability, which justifies the weighted fusion scheme.

2.5 DeepRespNet architecture

The proposed DeepRespNet framework is designed to fuse motion- and color-based physiological cues from two key ROIs: the face and the chest. From each region, an rPPG signal and a motion signal are extracted, producing a four-channel temporal input. These channels are first synchronized and normalized, and then transformed into a unified tensor of shape (Batch, Time, 128), where the final dimension corresponds to the projected feature embedding. The model processes this tensor using a hybrid architecture composed of a RespFormer encoder, which integrates convolutional feature extraction with multi-head self-attention, and a BiLSTM module to capture both local and long-range respiratory dynamics. The following subsections describe each component of this architecture.

2.5.1 RespFormer encoder

The first stage of the framework is the RespFormer encoder, which integrates convolutional feature extraction and multi-head self-attention to generate physiologically meaningful representations from the rPPG and motion signals. As illustrated in Figure 5, the input sequence is first processed through three one-dimensional convolutional layers (Conv1–Conv3) with progressively decreasing kernel sizes (7, 5, and 3). This hierarchical design enables the network to capture both low-frequency respiratory oscillations and subtle temporal variations. Each convolutional block includes ReLU activation, batch normalization, and dropout to stabilize the training and reduce overfitting. The one-dimensional convolution operation is mathematically expressed in Equation 13:

$$ y(t) = \sum_{i=0}^{L-1} w_i\, x(t-i) + b \tag{13} $$

where L denotes the kernel size, $w_i$ represents the learnable weights, and b is the bias term.

Diagram of the RespFormer neural network architecture showing a four-channel input of rPPG and motion data passing through multiple convolution layers, multi-head attention, bidirectional LSTM, and further convolution layers to output a predicted respiratory waveform. A legend explains colored blocks as Conv, Relu, Batchnorm, and Dropout layers.

Figure 5. Overall architecture of the DeepRespNet framework. The 4-channel rPPG–motion input is first processed by the RespFormer module to extract multi-scale temporal features. A two-layer BiLSTM then models bidirectional temporal dependencies, and Conv4–Conv6 reconstruct the final respiratory waveform.

The multi-head attention module (Figure 6) refines these feature maps by enabling global interactions across the sequence. The attention mechanism (Vaswani et al., 2017) computes pairwise dependencies across the entire sequence, which enables the model to capture long-range temporal relationships. For each attention head, the input is projected into query (Q), key (K), and value (V) representations, and the output is computed using Equation 14:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{14} $$

where $d_k$ is the dimension of the key vectors, and the scaling factor $1/\sqrt{d_k}$ prevents the softmax function from producing extremely small gradients for large dimensionalities.

Diagram illustrates the architecture of multi-head attention in transformers, showing linear projections of input into queries, keys, and values, scaled dot-product attention, concatenation, and final outputs with residual connection and layer normalization.

Figure 6. Illustration of the multi-head attention mechanism. The left part shows the scaled dot-product attention, and the right part depicts the multi-head structure with eight parallel heads followed by concatenation and normalization.

Eight attention heads are used to capture multi-scale temporal dependencies. Their outputs are concatenated, linearly transformed, and then combined with the input through a residual connection followed by layer normalization to stabilize optimization. This mechanism allows the RespFormer to emphasize temporally coherent respiratory cues while suppressing unrelated fluctuations.
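The encoder described in this subsection could be sketched in PyTorch roughly as follows. This is not the authors' code: the class names, the intermediate channel widths (32 and 64), the dropout rate, and the ordering of operations inside each block are assumptions, while the kernel sizes (7, 5, 3), the 128-dimensional embedding, and the eight attention heads follow the text.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Conv1d followed by ReLU, batch normalization, and dropout."""

    def __init__(self, c_in, c_out, kernel, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.BatchNorm1d(c_out),
            nn.Dropout(p_drop),
        )

    def forward(self, x):
        return self.net(x)


class RespFormerSketch(nn.Module):
    """Three conv blocks, then 8-head self-attention with residual + LayerNorm."""

    def __init__(self, in_channels=4, d_model=128, n_heads=8, p_drop=0.1):
        super().__init__()
        self.convs = nn.Sequential(
            ConvBlock(in_channels, 32, 7, p_drop),
            ConvBlock(32, 64, 5, p_drop),
            ConvBlock(64, d_model, 3, p_drop),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=p_drop, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time, channels); Conv1d expects (batch, channels, time).
        z = self.convs(x.transpose(1, 2)).transpose(1, 2)
        attended, _ = self.attn(z, z, z)       # self-attention: Q = K = V = z
        return self.norm(z + attended)         # residual, then layer norm
```

Same-length padding is used in every convolution so that the time axis is preserved and the output keeps the (Batch, Time, 128) shape expected by the recurrent stage.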

2.5.2 BiLSTM module

Following the RespFormer encoder, a two-layer BiLSTM (Figure 7) models bidirectional temporal dependencies by leveraging both past and future contexts. Each LSTM unit updates its gates according to Equation 15.

$$ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{15} $$

The memory cell is then updated using Equation 16:

$$ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{16} $$

and the hidden state is computed as expressed in Equation 17:

$$ h_t = o_t \odot \tanh(c_t) \tag{17} $$

where $f_t$, $i_t$, and $o_t$ denote the forget, input, and output gates, respectively; $W$ and $U$ represent the learnable weight matrices; and $b$ are bias terms. Here $\sigma(\cdot)$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, $x_t$ is the input at time step t, and $h_{t-1}$ is the previous hidden state.

Diagram illustrating a two-layer bidirectional LSTM network architecture. Each layer contains a sequence of forward (FW) and backward (BW) LSTM cells, with input and output batches labeled along the bottom and top respectively. Connections between LSTM cells, concatenation of outputs, and directionality of data flow are shown. Legends define FW as Forward LSTM and BW as Backward LSTM.

Figure 7. Two-layer BiLSTM architecture showing forward (FW) and backward (BW) flows. Outputs from both directions are concatenated at each timestep to form the final representation.

In the bidirectional configuration, the hidden states from the forward and backward passes are concatenated as shown in Equation 18:

$$ h_t = \left[\, \overrightarrow{h}_t \; ; \; \overleftarrow{h}_t \,\right] \tag{18} $$

In the time-unrolled view (Figure 7), the input sequence $X_t, X_{t+1}, \ldots, X_{t+n}$ represents 128-dimensional feature vectors, forming a tensor of shape (Batch, Time, 128). The BiLSTM produces corresponding outputs $Y_t, Y_{t+1}, \ldots, Y_{t+n}$ of the same dimension. This bidirectional temporal modeling captures the dynamics of the full breathing cycle, enhancing temporal coherence and robustness in RR estimation.
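A two-layer bidirectional LSTM with matching input and output widths can be sketched in PyTorch as below. The hidden size of 64 per direction is an assumption chosen so that the concatenated output is again 128-dimensional, as the text states. The single 1 × 1 convolution that maps features to a waveform is only a placeholder for the Conv4–Conv6 reconstruction layers, whose details are not given here.

```python
import torch
from torch import nn


class BiLSTMHead(nn.Module):
    """Two-layer BiLSTM followed by a placeholder waveform projection."""

    def __init__(self, d_model=128, hidden=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(d_model, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Placeholder for Conv4-Conv6: one pointwise conv to a single channel.
        self.out = nn.Conv1d(2 * hidden, 1, kernel_size=1)

    def forward(self, z):
        # z: (batch, time, d_model); y concatenates forward and backward states.
        y, _ = self.lstm(z)                              # (batch, time, 2*hidden)
        wave = self.out(y.transpose(1, 2)).squeeze(1)    # (batch, time)
        return y, wave
```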

Table 3 summarizes the training configuration and hyperparameters used to optimize the proposed RespFormer–BiLSTM framework. Subject-level k-fold cross-validation (k = 5) is used, where subjects are split into mutually exclusive folds. In each fold, the model is trained on multiple subjects and tested on unseen subjects, ensuring no subject overlap between the training and test splits. The same evaluation protocol was applied to each public dataset individually.
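The subject-level split can be sketched with NumPy as follows. The function name and the fixed random seed are assumptions of this example; scikit-learn's `GroupKFold` offers comparable grouping behavior.

```python
import numpy as np


def subject_kfold(subject_ids, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with no subject shared across them.

    subject_ids: one subject label per sample (e.g. per window).
    """
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)                       # randomize fold membership
    for fold_subjects in np.array_split(subjects, k):
        test_mask = np.isin(subject_ids, fold_subjects)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

Splitting by subject rather than by window matters here because the heavily overlapping windows from one recording are nearly identical, so a window-level split would leak test information into training.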

Parameter | Value
