Assessing the impact of climate-induced biodiversity loss on respiratory health through text classification

Abstract

Introduction:

The complex interplay between environmental dynamics, biodiversity loss, and public health necessitates advanced methodologies for quantifying and interpreting their interactions. Respiratory health, highly sensitive to environmental changes, requires particular attention as ecosystems undergo transformations driven by climate stressors. Traditional epidemiological and statistical models often fail to adequately capture the high-dimensional, non-linear, and spatiotemporal characteristics of environmental exposures and their diverse impacts on human health, thereby limiting the derivation of causally interpretable insights from observational data under conditions of biodiversity stress and atmospheric variability.

Methods:

To address these challenges, this study introduces a novel framework integrating a deep learning-based model, GeoExposureNet, with Causal-Aware Adaptive Mapping (CAM), specifically designed for environmental health analysis. GeoExposureNet employs spatial graphs, temporal convolution, and attention mechanisms to encode localized and lagged exposure effects, while CAM incorporates causal reasoning, policy adjustments, and epidemiological priors to refine inference and enable counterfactual simulations. This hybrid approach facilitates the evaluation of respiratory health outcomes across diverse exposure trajectories influenced by biodiversity-related environmental shifts.

Results and discussion:

Empirical results demonstrate that the proposed pipeline not only surpasses conventional baselines in predictive accuracy but also enhances interpretability and intervention strategies by uncovering differential vulnerabilities and exposure-response relationships. This integrative framework represents a significant advancement in modeling climate-sensitive health risks, offering scalable and adaptable tools for researchers and policymakers addressing the intersections of climate change, biodiversity, and public health.

1 Introduction

Climate change poses multifaceted challenges that extend beyond environmental degradation to directly affect public health. One such critical (1), yet underexplored, dimension is the relationship between biodiversity loss driven by climate change and respiratory health outcomes. As ecosystems deteriorate, the disruption of natural allergen barriers and the proliferation of airborne pathogens and pollutants increasingly compromise air quality (2). This not only elevates the incidence of respiratory diseases such as asthma and chronic obstructive pulmonary disease (COPD) but also magnifies health disparities in vulnerable populations. Moreover, the subtle and complex interactions between environmental shifts and health outcomes necessitate sophisticated analytical tools for effective monitoring and prediction (3). Therefore, identifying and classifying relevant textual data from scientific literature, healthcare records, and environmental reports is essential for uncovering patterns and associations that can inform public health interventions and policy development. This highlights the urgent need for advanced computational methods to systematically assess the impact of climate-induced biodiversity loss on respiratory health.

Initial efforts to investigate the interplay between biodiversity loss and respiratory health relied on manually curated frameworks that connected environmental factors to health outcomes (4). These approaches often utilized predefined mappings between allergen sources, pollution levels, and respiratory conditions, offering clear and interpretable insights (5). However, their dependency on predefined relationships limited their ability to adapt to the dynamic and unstructured nature of the data (6). These systems struggled to scale effectively, making it challenging to identify emerging patterns or hidden associations across large and diverse datasets. As a result, the growing complexity of environmental health interactions necessitated the exploration of more adaptive and automated methodologies.

Recognizing the need for greater flexibility, subsequent research shifted toward computational models capable of autonomously learning from data (7). Techniques such as decision trees and ensemble methods were employed to analyze textual data and identify links between biodiversity indicators and respiratory health. These models leveraged statistical patterns to improve classification accuracy and reduce the reliance on expert-driven rules. Despite these advancements (8), the requirement for extensive preprocessing and feature selection remained a bottleneck, constraining the models' ability to generalize across diverse and evolving datasets (9). Furthermore, their limited capacity to capture deeper contextual relationships hindered their effectiveness in addressing the multifactorial nature of environmental and health data.

In more recent developments, the application of deep learning and contextualized language models has revolutionized the analysis of unstructured textual data (10). Models like BERT and GPT have demonstrated an unparalleled ability to uncover complex associations between environmental and respiratory health factors by leveraging pre-trained embeddings and contextual information (11). These models enable the discovery of latent patterns across diverse corpora without the need for extensive manual intervention, making them particularly suited for analyzing the intricate impacts of climate-induced biodiversity loss (12). Nevertheless, challenges such as computational demands and the need for high-quality training data persist, highlighting the importance of balancing model performance with interpretability and resource efficiency in this critical area of research.

To address the limitations of previous methods—ranging from the rigidity of symbolic systems, the feature dependence of traditional machine learning, to the interpretability challenges of deep models—we propose an integrated text classification framework tailored for assessing the impact of climate-induced biodiversity loss on respiratory health. Our approach combines the contextual understanding of pre-trained language models with domain-specific knowledge integration to enhance interpretability and relevance. By incorporating environmental and medical taxonomies into the classification pipeline, the method not only identifies relevant associations more effectively but also ensures greater transparency in decision-making. This hybrid strategy facilitates the extraction of actionable insights from complex and heterogeneous textual data, providing a robust foundation for informing public health responses and policy formulation in the face of evolving climate challenges.

This approach offers multiple notable strengths, particularly in its ability to:

It incorporates a taxonomy-guided attention mechanism that enhances relevance detection in textual data across medical and environmental domains.

It is designed for adaptability, allowing it to operate efficiently across varied datasets and scenarios with minimal retraining.

Experimental results show a 17% improvement in F1-score over baseline models when identifying respiratory health impacts linked to biodiversity terms.

2 Related work2.1 Climate change and biodiversity loss

The intricate relationship between climate change and biodiversity loss has been extensively analyzed within ecological and environmental research (13), revealing profound disruptions to ecosystems and species distributions caused by shifting climatic conditions (14). Drivers such as rising global temperatures, altered precipitation regimes, and the prevalence of extreme weather events have accelerated the extinction risk for numerous taxa across both terrestrial and aquatic domains (15). Empirical studies have demonstrated that the impacts of climate change on biodiversity are not uniformly distributed but are instead geographically and taxonomically biased, disproportionately affecting species in biodiversity hotspots and those with limited physiological adaptability or narrow ecological niches. Mechanistic pathways through which climate change influences biodiversity include phenological mismatches (16), habitat fragmentation, and alterations in interspecies dynamics. Research has identified synergistic effects between climate change and other anthropogenic pressures, such as land-use transformation, pollution, and invasive species (17), which collectively exacerbate biodiversity loss. Ecological network theory has provided insights into how the decline or removal of keystone species under climate stress can cascade through trophic levels, disrupting ecosystem stability and functionality (18). Advances in ecological modeling, including species distribution models (SDMs) and dynamic global vegetation models (DGVMs), have enhanced projections of biodiversity responses across diverse climate scenarios (19). These models underscore feedback mechanisms wherein biodiversity loss undermines vital ecosystem services such as carbon sequestration, air purification, and disease regulation. Long-term ecological studies, such as those facilitated by the ILTER network, have added temporal depth to our understanding of these processes, highlighting the cascading consequences of biodiversity loss for ecosystem resilience and human health.

2.2 Biodiversity and respiratory health

The relationship between biodiversity and respiratory health has emerged as a critical area of investigation, focusing on how environmental microbiomes and vegetative diversity influence air quality and immune system regulation (20). The biodiversity hypothesis suggests that diminished exposure to diverse environmental microbiota, driven by biodiversity loss and urbanization, adversely affects immune function (21), increasing the prevalence of allergic and inflammatory conditions such as asthma and chronic obstructive pulmonary disease. Observational studies have revealed that individuals living in biodiverse environments exhibit lower rates of respiratory morbidity (22), a phenomenon linked to microbial interactions that train immune systems to distinguish between benign and harmful antigens. Declines in plant diversity exacerbate the distribution of airborne allergens (23), including pollen, heightening respiratory risks in regions with reduced floral richness. Moreover, the degradation of ecosystem services, such as the absorption of particulate matter and volatile organic compounds by vegetation, further compounds respiratory health challenges in biodiversity-impoverished areas (24). Epidemiological studies utilizing large-scale datasets, including satellite imagery and land-use maps, have correlated biodiversity metrics with respiratory health outcomes, revealing negative associations between biodiversity indices and hospitalization rates for respiratory conditions (25). Multidisciplinary approaches combining ecological, immunological, and epidemiological perspectives have elucidated complex causal pathways linking biodiversity with respiratory health. This line of inquiry underscores the necessity of conservation strategies not only for ecological integrity but also as interventions to mitigate public health risks exacerbated by climate-induced ecological transformations.

2.3 Text classification for environmental health

Text classification has become a pivotal methodological tool in environmental health research, enabling the synthesis of large-scale unstructured data to uncover patterns and correlations that traditional methods might overlook (26). The application of supervised and unsupervised learning models has facilitated the analysis of diverse corpora, including biomedical literature, environmental monitoring reports (27), and patient health records, to investigate links between respiratory health and climate-induced biodiversity loss. Techniques such as named entity recognition (NER), topic modeling, and sentiment analysis have proven effective in identifying references to respiratory conditions, environmental exposures, and biodiversity-related factors, which can be spatially and temporally correlated to infer causal relationships (28). Advances in deep learning, particularly the use of transformer-based architectures such as BERT and its domain-specific variants, have significantly improved the precision and contextual understanding of text classification models in multidisciplinary contexts. These models excel in disambiguating complex terms and recognizing subtle expressions of ecological and health phenomena (29), enhancing automated literature reviews and real-time surveillance systems. Integrating text classification outputs with geospatial and temporal metadata supports the construction of dynamic knowledge graphs, tracing the evolution of scientific discourse and public concerns over time (30). Practical applications include informing environmental health policies by identifying emerging risks, prioritizing research gaps, and evaluating the dissemination of information across diverse stakeholder groups (31). Methodological challenges, such as biases inherent in textual data or regional disparities in reporting, require careful consideration to ensure robustness in classification outcomes. Despite these challenges, text classification holds significant promise for advancing understanding of how climate-driven ecological changes impact respiratory health within broader environmental health landscapes.

3 Method3.1 Overview

Environmental health examines the intricate interplay between human well-being and environmental determinants, requiring a comprehensive methodological framework to analyze and model associated risks. This subsection outlines the proposed approach, designed to systematically investigate environmental exposures and their impacts on public health. The methodology integrates theoretical constructs with data-driven advancements to address the challenges posed by high-dimensional, heterogeneous, and uncertain environmental health data.

Section 3.2 establishes the theoretical basis, introducing formal definitions and mathematical representations for environmental variables, exposure pathways, and health outcomes. These elements are framed within structured probabilistic models to accommodate spatial and temporal variability, as well as measurement uncertainties. This foundational framework facilitates the synthesis of diverse data sources and causal mechanisms, providing a unified analytical language for subsequent modeling efforts. Building on this theoretical groundwork, Section 3.3 introduces the proposed GeoExposureNet, a model designed to capture localized environmental effects alongside broader systemic patterns influencing population health. The model extends traditional exposure-response paradigms by incorporating spatial graph structures, attention-driven feature selection, and dynamic inference mechanisms that adapt to evolving environmental conditions. This approach addresses the challenges of nonlinear, high-dimensional dependencies while maintaining interpretability and computational scalability. Section 3.4 details the Causal-Aware Adaptive Mapping (CAM) framework, developed to tackle domain-specific issues such as confounding bias, spatial dependencies, and the integration of real-time policy feedback. CAM employs counterfactual reasoning and domain-informed regularization to strengthen causal inference, particularly in observational settings where experimental designs are impractical. Furthermore, it seamlessly integrates with GeoExposureNet, enabling the adaptive refinement of exposure-response relationships through iterative learning from newly acquired evidence. While the proposed framework integrates multiple components to capture the high-dimensional, spatiotemporal nature of environmental health interactions, it is important to note that certain modeling uncertainties remain. Limitations in data coverage (incomplete health registries or spatial gaps in biodiversity monitoring) and potential measurement errors in environmental variables may introduce noise into the estimation process. Assumptions such as the stability of causal relationships across regions and time periods may not always hold under real-world heterogeneity. To partially mitigate these uncertainties, we employ regularization techniques, adaptive contextual integration, and multi-resolution spatial modeling, which improve robustness but do not fully eliminate estimation variability. These factors may affect the precision of the derived health risk maps and should be considered when interpreting the model outputs, particularly in under-monitored geographic regions or for populations with limited health data availability.

3.2 Preliminaries

We address the problem of quantifying and modeling the influence of environmental exposures on human health outcomes across spatial and temporal domains. This section establishes the formal problem setting and introduces the notational conventions underlying the proposed methodology. The aim is to provide a precise mathematical foundation for the subsequent modeling framework.

Let represent a bounded geographical area, and let denote a continuous time interval of interest. An environmental monitoring field is defined as a spatiotemporal function , where Ej(x, t) for j = 1, …, p corresponds to the j-th type of environmental exposure observed at location and time .

The health outcomes of a population are described by , where Hk(x, t) represents the k-th health indicator at location x and time t. A latent susceptibility field is introduced to capture spatially-varying demographic, socioeconomic, or biological factors that modulate vulnerability to environmental exposures.

To formalize the exposure-response relationship, we define a conditional response operator parameterized by θ ∈ Θ:

where ε(x, t) is a zero-mean stochastic error process accounting for unobserved confounders and measurement noise. The operator acts on the temporal history of E at location x and is assumed to capture causal dependencies.

The localized exposure history at a spatial coordinate x over a temporal window of length τ is defined as:

We hypothesize that health outcomes depend on the cumulative effect of prior exposures rather than instantaneous exposure. This relationship is expressed as:

where is a temporally modulated kernel capturing the lagged impact of exposure.

To account for spatial interactions, we define a population-weighted adjacency measure :

where P(x′) represents the population density at x′, and σ is a spatial bandwidth parameter. Using this measure, a nonlocal exposure aggregation operator is expressed as:

The aggregated exposure indicator over a region is defined as:

where |Ω| represents the area of region Ω. Similarly, the regional health indicator is given by:

Contextual variables, denoted , are introduced to represent external factors such as policy measures or seasonal effects. These are modeled as and incorporated into the exposure-response model:

For causal interpretation, a potential outcome function is defined to represent the hypothetical health outcome at location x under an intervention setting the exposure trajectory to e:[t − τ, t] → ℝp:

The causal effect of contrasting two exposure trajectories e1 and e2 is given by:

The observational dataset is formalized as , consisting of independent and identically distributed samples from an underlying spatiotemporal data-generating process. The objective is to estimate or approximate the response operator or in a manner consistent with the observed joint distribution of exposures, outcomes, covariates, and spatial heterogeneity.

3.3 GeoExposureNet

We propose a novel predictive architecture, termed GeoExposureNet, which is designed to capture nonlinear, lagged, and spatially-modulated interactions between environmental exposures and health outcomes (as shown in Figure 1). This model integrates structured spatial graphs, temporal convolutions, and adaptive attention over exposure histories to encode complex dependencies that characterize environmental health dynamics.

Diagram illustrating a multimodal encoder architecture with adaptive contextual integration. It features three main sections: Behavior Sequence, Multimodal Encoder Architecture, and Graphical Propagation Layer. The Behavior Sequence includes reduction and normal cells. The Encoder Architecture involves pyramid reduction, convolution layers, and self-attention mechanisms. The Propagation Layer shows sequential processing with additional feed-forward layers. Symbols for position embedding, element-wise addition, and concatenation are included.

Schematic diagram of the GeoExposureNet. Illustration of the GeoExposureNet architecture, which integrates multimodal encoding, spatiotemporal modeling, and adaptive contextual learning for environmental health prediction. The lower portion depicts the Multimodal Encoder Architecture, consisting of a pyramid reduction module, dilated convolutions, and spatially-aware feature extraction layers. The right section shows the Graphical Propagation Layer, which employs graph diffusion and self-attention mechanisms to capture multi-hop spatial dependencies and temporally salient exposure patterns. The top section outlines the Adaptive Contextual Integration pipeline, where contextual variables such as policy signals are encoded and fused with exposure representations to generate predictions. Position embeddings, element-wise operations, and class tokens are used throughout the pipeline to support interpretable, fine-grained forecasting across complex exposure landscapes.

3.3.1 Multimodal encoder architecture

Let denote a spatial graph where nodes correspond to discretized spatial locations and edges encode proximity via a similarity kernel:

where σ is a spatial bandwidth and δ is a locality threshold. The graph Laplacian L = D − A, where D is the degree matrix, is used to capture spatial relationships. Each node vi is associated with a temporal sequence of exposure vectors , forming the input tensor E ∈ ℝN×T×p.

A temporal convolutional encoder ϕtemp maps exposure histories to latent trajectories:

where τ is the lookback window and Z ∈ ℝN×T×d denotes the encoded feature cube (as shown in Figure 2). This architecture integrates spatial graphs with temporal convolutional layers to encode spatiotemporal patterns, enabling the model to represent environmental exposure histories effectively.

Flowchart illustrating a multimodal learning framework. Three input streams labeled Audio, RGB, and Flow pass through their respective Backbones and Encoders, producing individual features. These features undergo unimodal multiple instance learning (MIL) and are projected for multimodal alignment. Aligned features are fused and processed by a Multimodal Encoder, resulting in anomaly scores and classifications as Normal or Anomaly. The flowchart highlights processes like Projection, Aligning, Sampling, and Multimodal Regression. Colors differentiate sections and processes.

Schematic diagram of the multimodal encoder architecture. This figure illustrates the Multimodal Encoder Architecture, which processes audio, RGB, and flow modalities through a sequence of unimodal MIL encoders, followed by multimodal alignment and final decision-making modules. Each input stream is passed through a backbone network to extract features, which are refined using modality-specific MIL branches. These features are then aligned across modalities using transformer-based attention modules to enable interaction between modalities such as audio, flow, and RGB. The aligned features are fused and passed to the final module, which produces anomaly scores through a fusion head trained with both MIL and triplet losses. In parallel, the architecture includes a spatiotemporal encoder that builds a spatial graph based on node proximity and models exposure history as temporal sequences. These sequences are processed with temporal convolution to extract latent trajectories, allowing the model to capture spatial and temporal dependencies in the data. This design enables the encoder to effectively represent complex multimodal inputs for visual anomaly detection tasks.

To ensure alignment with the mathematical framework introduced in Section 3.2, we provide additional clarification on the preprocessing steps for generating node features E(x, t) in the spatial graph G = (V, E). First, textual data—including environmental reports, medical literature, and policy documents—is processed using a domain-adapted language model (BERT or its environmental variant), yielding contextualized embeddings for each document. These embeddings are then geo-referenced by mapping metadata (such as publication origin, sensor location, or tagged coordinates) to a discrete spatial location xi ∈ D, forming the node set V. For each node vi, the temporal sequence consists of time-aligned embeddings associated with the corresponding location and time t, capturing environmental exposure semantics. These sequences are used as node features in the tensor E ∈ ℝN×T×p, where p is the embedding dimension. This design ensures that every exposure vector E(x, t) used in modeling corresponds to a real-world textual signal grounded in both spatial and temporal context, maintaining consistency between theoretical definitions and model input.

3.3.2 Graphical propagation layer

To model cross-location influence, we incorporate graph diffusion layers that propagate signals along . Specifically, for each time t, the diffusion operation is defined as:

where are trainable coefficients and K is the diffusion depth. This operation captures K-hop spatial dependencies using polynomial graph filters. The propagative mechanism ensures that spatial interactions are encoded, facilitating the modeling of localized and non-localized dependencies.

A self-attention mechanism is applied to the temporal dimension to learn time-sensitive relevance scores:

where are projection matrices. Temporal attention ensures that the model identifies critical exposure periods, enhancing the responsiveness of predictions to time-varying phenomena.

3.3.3 Adaptive contextual integration

Contextual variables are incorporated through a control embedding obtained via feedforward transformation:

where and . This embedding is broadcast across spatial nodes and fused with attention-augmented features:

This integration combines spatial and temporal attention mechanisms with contextual embeddings, ensuring that the model adapts to external factors such as policy interventions or socioeconomic indicators.

The output layer maps Ut to predicted health responses :

where the MLP consists of dense layers with non-linear activations and residual connections. For interpretability, relevance attribution scores are defined for each node-time pair:

GeoExposureNet combines multimodal encoding, graphical propagation, and contextual integration, forming a unified framework for spatiotemporal forecasting, counterfactual simulation, and sensitivity analysis. The composite function summarizing the forward pass is:

The architecture is modular, interpretable, and robust to missing data, allowing integration with satellite imagery, policy metadata, and dynamic exposure registries via multi-modal inputs.

3.4 Causal-aware adaptive mapping

We introduce Causal-Aware Adaptive Mapping (CAM), a strategic framework designed to extend the expressiveness of GeoExposureNet by integrating causal reasoning, domain knowledge, and dynamic policy response into model training and inference (as shown in Figure 3). CAM addresses three fundamental limitations in conventional environmental health modeling: failure to disentangle confounders, inability to simulate counterfactual interventions, and lack of structural alignment with epidemiological priors.

Diagram illustrating a complex framework. On the left, the “Counterfactual Simulation and Spatial Coherence” section details processes involving receptive fields, spatial features, and transformations. The center “Causal Effect Estimation Framework” and right “Epidemiological Regularization and Policy Adaptation” sections show layers like MSFM, LAE, and UpSample interacting through various paths. Labels indicate different scales: large, medium, small, and extra small. A key explains symbols for MSFM configurations, concatenation, forward, and output paths. An image of a bird on a rock is displayed at the bottom right corner.

Schematic diagram of the CAM. The CAM architecture integrates three synergistic components—Counterfactual Simulation and Spatial Coherence, Causal Effect Estimation, and Epidemiological Regularization and Policy Adaptation—to form a unified environmental health modeling pipeline. From raw spatial inputs and exposure fields, the model simulates alternative health scenarios using REABlocks, encodes causal structures through MSPE/SSF and LAE modules, and adapts to dynamic policy conditions via multiscale outputs. Each stage is interconnected to propagate spatially resolved causal insights, enabling counterfactual reasoning, confounder disentanglement, and epidemiological alignment across multiple resolutions. The full pipeline transforms GeoExposureNet into a decision-support system that is robust to domain shifts and sensitive to public health policy interventions.

3.4.1 Causal Effect Estimation Framework

The foundation of CAM lies in causal effect estimation, which quantifies the health impact of environmental exposures while accounting for confounders (as shown in Figure 4). Let e:[t − τ, t] → ℝp denote a hypothetical environmental exposure trajectory. For any location , the counterfactual health response is expressed as:

The causal effect between two exposure scenarios e1, e2 is:

representing the expected health change under intervention e1 relative to baseline e2. To ensure identifiability in observational settings, a set of observed confounders Z(x, t) is defined, and the propensity score is computed:

estimated via a logistic function:

where w ∈ ℝd is a learned vector. Inverse-propensity weighting is applied to reweight the observational loss:

with ϵ > 0 as a stabilizing term. This framework disentangles confounders, enabling clear interpretation of causal effects.

Flowchart depicting a deraining network for image processing. It includes components like Mixture of Experts Feature Compensator (MEFC), Top-K Sparse Attention (TKSA), and Mixed-scale Feed-forward Network (MSFN). Arrows and boxes show data flow from rainy input to derained output, with operations like multiply, concatenate, and addition indicated by symbols.

Schematic diagram of the Causal Effect Estimation Framework. The Causal Effect Estimation Framework shown here combines a complex neural architecture with causal inference principles to estimate health outcomes under different environmental exposures. The image illustrates a rain-to-clear image translation pipeline, incorporating modules such as Mix of Experts Feature Compensation (MEFC), Top-K Sparse Attention (TKSA), and Multi-scale Feed-forward Network (MSFN), each enhancing feature representation and attention. This visual model is juxtaposed with a mathematical formulation of causal effect estimation, defining health responses as integrals over exposure trajectories while correcting for confounders using propensity scores. The framework applies inverse-propensity weighting to ensure unbiased loss computation, enabling robust intervention analysis. Together, the visual and mathematical elements highlight the integration of deep learning with causal reasoning to support interpretability in environmental health modeling.

3.4.2 Epidemiological Regularization and Policy Adaptation

CAM incorporates epidemiological priors and policy adaptation mechanisms to align with domain knowledge. For each exposure dimension j, expected monotonic directions ρj are defined, and a regularizer enforces alignment:

where λ > 0 balances fidelity and prior adherence. Time-varying policy interventions P(t) ∈ ℝu are modeled through a modulation gate:

and fused into the latent representation:

where ⊙ denotes element wise product. This enables adaptation to dynamic policy contexts, ensuring relevance under changing regulatory environments.

3.4.3 Counterfactual Simulation and Spatial Coherence

CAM supports counterfactual simulation to evaluate policy scenarios and maintains spatial coherence under domain shifts. For a set of K policy-relevant trajectories , a comparative matrix is derived:

from which the optimal scenario is extracted:

with denoting a reference health benchmark. To examine exposure-effect heterogeneity, saliency-informed causal gradients are computed:

and used to generate a spatial map of differential vulnerability:

To maintain spatial coherence, CAM incorporates a graph Laplacian-based regularizer:

where L is the graph Laplacian, H ∈ ℝN×q is the output prediction matrix, and μ > 0. The total strategy objective is:

This structured integration transforms GeoExposureNet into a decision-support framework capable of suggesting actionable, spatially-resolved environmental interventions.

4 Experimental setup4.1 Dataset

We evaluate our method on four widely-used benchmark datasets for indoor and scene understanding: SUN RGB-D Dataset (32), ADE20K Dataset (33), NYU Depth V2 Dataset (34), and Places365 Dataset (35). The SUN RGB-D Dataset provides RGB-D images captured from multiple sensors such as Kinect v2 and Intel RealSense, encompassing over 10,000 images across various indoor scenes. Each image contains densely annotated labels for objects and layouts, making it highly valuable for scene understanding tasks such as semantic segmentation and object detection. The dataset contains significant variability in lighting, clutter, and spatial arrangements, which allows for robust evaluation of depth-aware visual models. The ADE20K Dataset offers an extensive collection of annotated imagery, featuring more than 150 semantic categories labeled at a detailed, per-pixel level across upwards of 20,000 diverse scenes. It serves as a critical resource for advancing research in semantic scene understanding. It includes both indoor and outdoor scenes, with a variety of complex spatial compositions, diverse object scales, and occlusions, making it particularly suitable for evaluating general-purpose segmentation algorithms. The NYU Depth V2 Dataset includes 1,449 densely labeled RGB-D images of indoor scenes, captured using a Microsoft Kinect camera. It offers high-quality aligned RGB and depth data, with fine-grained semantic over multiple object categories. This dataset is often used for depth estimation and indoor semantic segmentation tasks due to its high-resolution depth maps and consistent scene structure. The Places365 Dataset is a scene classification dataset composed of over 1.8 million images spanning 365 scene categories. It supports high-level scene understanding and transfer learning, particularly in tasks where contextual and spatial semantics play a crucial role. The large variety of scenes and its extensive scale enable robust learning of deep representations for scene recognition models. Together, these datasets provide comprehensive benchmarks for evaluating both spatially grounded and semantically rich visual models.

4.2 Experimental details

All training procedures are based on PyTorch infrastructure, accelerated by CUDA on NVIDIA A100 GPUs. The network is developed and validated on four publicly available image collections: SUN RGB-D, ADE20K, NYU Depth V2, and Places365. For semantic understanding tasks, we adopt an encoder-decoder design, initializing the encoder with ImageNet-trained weights to facilitate effective feature capture. Depending on the complexity of the task and the nature of the input, either ResNet-50 or Swin Transformer is employed as the feature extractor. The transformer variant is selected when long-distance contextual reasoning is necessary. The decoding path incorporates multi-scale aggregation and attention-driven refinement to enhance both spatial clarity and semantic coherence. Training configurations include the AdamW algorithm, a base step size of 1 × 10−4, and regularization via weight decay set at 0.01. A polynomial learning schedule with exponent 0.9 is applied, and all datasets are trained with 16 samples per batch. To increase resilience, augmentation techniques such as random crops, horizontal flips, and color variations are incorporated. Models are optimized for 80 cycles on ADE20K and Places365, and 100 on SUN RGB-D and NYU Depth V2, balancing dataset scale and convergence dynamics. Input resolutions are standardized to 512 × 512 for ADE20K and Places365, while NYU Depth V2 and SUN RGB-D are processed at 480 × 640 to retain the spatial fidelity characteristic of indoor scenes. For tasks involving depth perception, RGB and depth modalities are integrated either at the early stage within the encoder or later during decoding, depending on the chosen configuration. In early integration, the depth signal is appended directly to the RGB channels, forming a four-channel composite input. Alternatively, in deferred integration, depth features undergo separate encoding and are then merged with RGB pathways during decoding through attention-guided alignment blocks. Performance improvements are generally observed with deferred integration, especially on NYU Depth V2. Assessment involves a variety of quantitative indicators: top-1 classification precision, pixel-wise accuracy, and mean intersection-over-union for scene and object categorization; for estimating depth, both absolute relative deviation and root mean square error are reported. Precision scores, AUC, recall values, and F1 measurements are shuffled and computed to reflect task-specific quality. Results represent averages across three trials to account for randomness. Periodic checkpoints are created every 10 cycles, and early termination is triggered by validation stagnation to mitigate overfitting. To support distributed training, synchronized normalization across multiple devices is used, along with automatic mixed-precision operations for reduced memory usage and enhanced throughput. All baseline configurations are re-executed under a consistent training environment to ensure fairness. The system is built in a modular fashion to support flexible experimentation and streamlined evaluation. Subsets from ADE20K and SUN RGB-D are used in rotation-based validation to gauge adaptability across diverse environments. Additional analyses are provided to assess individual design choices, such as the inclusion of attention refinements, the choice of fusion pathway, and augmentation diversity. These elements collectively reinforce the reliability and replicability of the reported findings.

4.3 Comparison with SOTA methods

Six strong baselines—BERT (36), RoBERTa (37), XLNet (38), ALBERT (39), DeBERTa (40), and ELECTRA (41)—are used for side-by-side evaluation on four distinct datasets: SUN RGB-D, ADE20K, NYU Depth V2, and Places365. As presented in Tables 1, 2, the design introduced here consistently leads to the most favorable results across several performance indicators, including AUC, Recall, precision-based F1 values, and classification success rate. On SUN RGB-D, for instance, the architecture registers 89.93% classification success rate, exceeding DeBERTa's 86.78% by 3.15%. F1 values rise from 85.66% to 88.87% in the same setting, suggesting improved understanding of RGB-D data in cluttered indoor scenes. On ADE20K, characterized by high visual diversity and complex labels, classification performance reaches 90.27%, with F1 values hitting 88.59%. Compared with RoBERTa and DeBERTa, which rely heavily on contextual token modeling, this design integrates enhanced awareness of spatial structure and depth, offering stronger adaptation to semantically rich scenes. Its hierarchical multi-modal strategy allows the system to integrate cues across different semantic scales using attention layers that adapt to both RGB and geometric inputs.

ModelSUN RGB-D datasetADE20K datasetAccuracyRecallF1 scoreAUCAccuracyRecallF1 scoreAUCBERT (36)85.32 ± 0.0383.24 ± 0.0284.10 ± 0.0387.45 ± 0.0283.67 ± 0.0381.88 ± 0.0282.40 ± 0.0285.91 ± 0.03RoBERTa (37)87.41 ± 0.0284.96 ± 0.0286.02 ± 0.0288.73 ± 0.0385.55 ± 0.0284.12 ± 0.0284.75 ± 0.0387.43 ± 0.02XLNet (38)84.76 ± 0.0286.55 ± 0.0385.23 ± 0.0286.19 ± 0.0286.14 ± 0.0382.03 ± 0.0284.01 ± 0.0286.75 ± 0.03ALBERT (39)82.35 ± 0.0381.20 ± 0.0280.87 ± 0.0384.30 ± 0.0283.92 ± 0.0284.79 ± 0.0383.61 ± 0.0285.30 ± 0.02DeBERTa (40)

Comments (0)

No login
gif