This section offers an in-depth analysis of the datasets utilized in the training and testing phases of the developed models, as well as a discussion of the features of the specialized blocks in the model architecture and the methodologies employed.
DatasetIn this study, the open-access Kvasir dataset was employed for training and evaluating the developed models. This dataset is organized into three primary anatomical landmarks and three clinically significant findings, along with two categories related to endoscopic polyp removal procedures. The dataset’s sorting and annotation were carried out by skilled endoscopists. Data collection was performed using endoscopic equipment at healthcare facilities affiliated with Vestre Viken Health Trust (VV) in Norway, which operates four hospitals and serves a population of approximately 470,000. The training data were sourced from Bærum Hospital, home to a large gastroenterology department. Expert annotators from VV and the Cancer Registry of Norway (CRN) meticulously labeled the images. The Kvasir dataset contains images that have been categorized and validated by experienced endoscopists, covering anatomical features, pathological conditions, and endoscopic procedures within the GI tract. The images, with resolutions ranging from 720 × 576 to 1920 × 1072 pixels, are organized into directories based on content. Additionally, the Kvasir-SEG dataset, derived from Kvasir v2, includes 1000 polyp images along with their corresponding ground truth masks. The dataset is 46.2 MB in size, with image resolutions varying from 332 × 487 to 1920 × 1072 pixels [22]. Example images from the Kvasir dataset are provided in Fig. 1. The dataset can be accessed through the following link: https://www.kaggle.com/datasets/abdallahwagih/kvasir-dataset-for-classification-and-segmentation
Fig. 1Sample images Kvasir dataset
Efficient strategies for optimizing convolutional operationsConvolutional Neural Networks (CNNs) are deep learning architectures widely used in areas such as image recognition and processing. These architectures are designed to extract diverse features from input images, enabling them to perform tasks such as classification, object detection, and segmentation. The fundamental operation of Convolutional Neural Networks (CNNs), convolution, can take several forms, such as standard convolution, depthwise convolution, and atrous (dilated) convolution. However, directly applying these convolution techniques can lead to an increase in the number of model parameters and computational complexity, presenting challenges in terms of optimization. In this study, to enhance the performance of convolution operations, standard, depthwise, and atrous convolutions have been adapted into asymmetric convolution techniques and applied more efficiently. The standard convolution operation can be described by the general formula shown in Eq. 1.
$$y\left(i,j\right)=\sum_^\sum_^x\left(i+m,j+n\right)*k(m,n)$$
(1)
In Eq. 1, \(x(i,j)\) denote the pixel value of the input image at position \((i,j)\), while \(k\left(m,n\right)\) represents the \(M\times N\) filter kernel. The value at position \(\left(i,j\right)\) of the resulting output feature map is denoted as \(y\left(i,j\right)\), and \(M,N\) specify the dimensions of the filter kernel.. While Eq. 1 represents the fundamental mathematical structure of convolution operations in CNNs, the asymmetric convolution techniques employed in this study aim to reduce computational complexity and enhance efficiency.
Asymmetric convolution offers an alternative approach to traditional convolution methods, aiming to significantly reduce the model’s parameter count and computational cost. This method involves decomposing a two-dimensional \(n\times n\) convolution kernel into two sequential one-dimensional convolution operations. In the initial step, a vertical convolution kernel of size \(n\times 1\) is applied, followed by a horizontal convolution kernel of size \(n\times 1\) to complete the operation. This transformation allows the computation to be performed with only \(2n\) parameters, compared to the \(^\) parameters required when using a traditional \(n\times n\) kernel. As a result, both memory consumption is reduced and computational cost is optimized. The mathematical expression of the asymmetric convolution operation can be defined as shown in Eq. 2.
$$y\left(i,j\right)=\sum_^\left[\sum_^x\left(i+m,j\right)*_(m)\right]*_(n)$$
(2)
In Eq. 2, \(x(i,j)\) represents the pixel value at position \((i,j)\) of the input image, \(_(m)\) denotes the vertical convolution kernel of size \(n\times 1\), \(_(n)\) is the horizontal convolution kernel of size \(1\times n\), \(y(i,j)\) represents the value at position \(\left(i,j\right)\) of the output feature map, and \(M,N\) refer to the kernel dimensions. Equation 2 clearly illustrates the operation structure of asymmetric convolution and the advantages it provides. Unlike traditional methods, this approach offers an optimization that enhances performance while reducing computational cost.
Atrous convolution, unlike traditional convolution methods, is a technique where spaced zero values are added within the convolution kernel. This method aims to increase the effective receptive field of the kernel while maintaining the number of pixels, thus improving computational efficiency. Atrous convolution can be expressed as shown in Eq. 3.
$$y\left(i,j\right)=\sum_^\sum_^x(i+r\cdot m,j+r\cdot n)\cdot k(m,n)$$
(3)
One of the key parameters used in atrous convolution is the dilation rate, which refers to the number of zeros added between consecutive elements of the kernel. When the dilation rate \(r=1\), atrous convolution operates in the same way as standard convolution. However, when the dilation rate \(r>1\), the receptive field of the kernel is expanded, allowing information to be gathered from a larger region. This enables the consideration of a broader contextual area without increasing the kernel size. Atrous convolution is particularly favored in tasks such as segmentation and other applications that require dense feature maps, as it provides a wider contextual awareness. Visual representations of convolution kernels with different dilation rates are shown in Fig. 2.
Fig. 2Illustration of convolution kernels with varying dilation rates
Atrous convolution is an effective technique that expands the spatial sensing capacity of the convolution kernel, maintaining resolution without additional parameter burden or computational cost. By using the dilation rate, this method achieves an enhanced contextual coverage, enabling the extraction of more complex and richer features. The extended effective receptive field of an \(nxn\) convolution kernel with a dilation rate a is expressed as shown in Eq. 4.
$$_=\left[1+\left(n-1\right)*a\right]*\left[1+\left(n-1\right)*a\right]$$
(4)
In Eq. 4, the dilation rate a determines the width of the gaps left between the pixels on both axes of the kernel. The kernels used in atrous convolution can be transformed into an asymmetric structure, resulting in two one-dimensional expanded kernels with different dilation rates along the \(x\) and \(y\) axes. These kernels are denoted as \(_\) and \(_\), respectively. The mathematical expressions of the expanded receptive fields along the \(x\) and \(y\) directions are shown in Eqs. 5 and 6.
$$_=\left[1+\left(n-1\right)*_\right]$$
(5)
$$_=\left[1+\left(n-1\right)*_\right]$$
(6)
In Eqs. 5 and 6\(_\) and \(_\) represent the dilation rates along the \(x\) and \(y\) axes, respectively. This strategy aims to optimize the computational cost while expanding the receptive field of the kernels. The total spatial receptive field is determined by the product of the kernels through the sequential application of asymmetric kernels. This process allows for capturing more detailed features while providing an expanded contextual awareness. The total spatial receptive field is calculated as shown in Eq. 7.
$$_=\left[1+\left(n-1\right)*_\right]*\left[1+\left(n-1\right)*_\right]$$
(7)
In Eq. 7, the combined effect of the expanded kernels along the \(x\) and \(y\) axes is expressed, allowing the atrous convolution kernel to gather information from a larger area. This approach provides notable benefits in terms of enhanced contextual understanding and improved computational efficiency, especially in tasks like image segmentation and object recognition. In this research, the asymmetric atrous convolution technique is investigated as an alternative to conventional atrous convolution methods, aiming to reduce the number of parameters and enhance computational performance. Asymmetric atrous convolution optimizes processing times by expanding the receptive field while minimizing parameter usage. This approach provides significant benefits, especially during the processing of large and complex datasets, by optimizing memory usage and increasing process speed. Depthwise convolution is a method that processes multi-channel input data with kernels specifically defined for each channel. This method independently processes each input channel, enabling more efficient extraction of channel-specific features. Depthwise convolution is performed in a two-stage process: first, a convolution is applied along the x-axis with an \(n\times 1\) kernel, and then the process is completed along the y-axis with a \(1\times n\) kernel. As a result of this sequential process, an n × n receptive field is created. Asymmetric convolutions provide parameter efficiency compared to standard convolution operations. For example, when the kernel size is \(k=3\), asymmetric convolutions use 33% fewer parameters than standard methods. This reduces the computational load significantly while improving processing speed and memory usage without any loss in performance. The asymmetric convolution strategy is implemented through the sequential application of two 1D convolutions. This method minimizes the overall number of parameters without compromising the model’s ability to learn, thereby enabling efficient high-performance inference in sophisticated models. Focused on parameter optimization, this method was developed to achieve improvements in memory savings and processing time, especially for deep learning-based tasks. The model developed in this study aims to reduce both computational cost and improve output quality by utilizing optimized asymmetric atrous convolution blocks.
Efficient hybrid attentional atrous convolution (EHAAC) moduleMedical imaging technologies are critically important for the detection and segmentation of abnormalities in the human gastrointestinal (GI) system. This process plays a vital role in areas such as early diagnosis, treatment planning, and disease management. In response to these needs, the Efficient Hybrid Attentional Atrous Convolution (EHAAC) module has been developed to accurately detect abnormalities in endoscopic images of the GI system and optimize this process. EHAAC enhances contextual sensitivity, providing an effective tool for identifying and segmenting clinically significant findings such as polyps, ulcers, or inflammation. This module ensures clearer and more accurate delineation of target regions, thereby increasing diagnostic precision and facilitating better management of treatment processes. Furthermore, by combining expanded atrous convolution with attention mechanisms, it can efficiently analyze complex tissue structures. The structural components and functioning of the EHAAC module are detailed in Fig. 3. This structure aims to both improve accuracy in clinical applications and accelerate real-time diagnostic processes.
Fig. 3Structural components and functioning of the efficient hybrid attentional atrous convolution (EHAAC) module
The integration of Atrous Convolutions with the Efficient Channel Attention Module (ECAM) seeks to enhance efficiency by detecting semantic features across a broader area in segmentation tasks, all while preserving resolution. This strategy aids in removing redundant information, thereby improving the network’s overall performance. Atrous convolution blocks process the input data with different dilation rates, allowing each block to extract features at various scales. The ECAM module is employed during this feature extraction process to filter the features based on their semantic importance. By processing feature maps with atrous convolutions at different dilation rates, this module enhances the semantic emphasis. This operation can be defined by Eq. 8.
$$ESAM\left(F\right)=_,\text\}}\upsigma (^\left(F\right))\odot ^\left(F\right)$$
(8)
The ECAM module focuses on the channel level, highlighting important features. Following the Global Average Pooling (GAP) operation, the mean value of each channel is transformed into a weight vector, facilitating more effective feature extraction. The processing flow of the ECAM module is expressed as shown in Eq. 9.
$$ECAM\left(_\right)=_\odot (_\odot\upsigma (_^}\odot\upsigma (_^}\odot _(_))))$$
(9)
Finally, the EHAAC module applies dilated convolutions on the feature map processed by the ECAM. This operation is expressed as shown in Eq. 10. This process improves the network’s efficiency and accuracy, leading to more accurate results in the segmentation process.
$$EHAAC\left(F\right)=^\left[_+\sum_,\text\}}\upsigma (^\left(F\right))\odot ^\left(F\right)\right]$$
(10)
The combined structure shown in Eq. 10 provides an effective solution for high-accuracy and fast segmentation tasks, while also optimizing computational efficiency.
Encoder decoder residual blockIn this study, a neural network architecture has been developed to detect and segment abnormalities in the human gastrointestinal (GI) system. The proposed model is designed to identify abnormalities at the pixel level in images captured using remote sensing technology. It facilitates the efficient transmission of both local and contextual information, ensuring accurate classification of abnormalities. To optimize information flow, the Encoder-Decoder-Residual (EDR) block is incorporated into the network architecture, specifically tailored for abnormality detection. The EDR block features asymmetric convolution layers, allowing for deep and efficient processing of information. This design enhances the model’s ability to detect gastrointestinal abnormalities with greater precision and clarity. Figure 4 provides a detailed schematic of this innovative EDR block, which contributes to improving the model’s accuracy.
Fig. 4Schematic configuration of the encoder-decoder-residual (EDR) block
In the evaluation process outlined within the structural framework shown in Fig. 4, the input feature map F is first processed through a 1 × 1 convolutional layer, followed by batch normalization and ReLU activation, yielding \(_^}\). This operation is mathematically represented by Eq. 11.
$$_^}=ReLU(BN\left(_\left(F\right)\right))$$
(11)
Subsequently, \(_^}\) is processed by two different convolutional blocks (Asymmetric Convolutional Block (ACB) and Depthwise Asymmetric Convolutional Block (ADB)), resulting in \(_^}\) and \(_^}\), as expressed in Eqs. 12 and 13, respectively.
$$_^}=ACB\left(_^}\right)$$
(12)
The two feature maps expressed in Eqs. 12 and 13 are then combined to create \(_\), as shown in Eq. 14, for a richer flow of information.
The operation expressed in Eq. 14 is repeated to enhance the features and strengthen the flow, producing a second feature map, \(_^}\), as expressed in Eq. 15.
$$_^}=EDRBlock(_^})$$
(15)
Finally, the obtained \(_^}\) and \(_^}\) are combined to enhance the performance of the deep learning model, resulting in the final output, \(_\), as expressed in Eq. 16.
This approach guarantees that crucial information is retained in the deeper layers of the model, thereby enhancing its ability to detect abnormalities with higher accuracy.
Proposed segmentation model (GISegNet)This study introduces a novel convolutional neural network model, GISegNet, designed for the detection and segmentation of abnormalities in the human gastrointestinal (GI) system. The model comprises two primary components: the encoder and the decoder. The encoder includes sampling blocks and the Efficient Hybrid Attentional Atrous Convolution (EHAAC) module, which improves the model’s performance by integrating attention mechanisms with atrous convolution, thereby preserving both local and semantic information. The sampling blocks, in combination with Encoder-Decoder-Residual (EDR) blocks and max pooling layers, enable the model to detect even minor anomalies. The architecture of the GISegNet segmentation model is illustrated in Fig. 5.
Fig. 5Overall architecture of the GISegNet segmentation model
The developed GISegNet model’s decoder section processes the input feature maps using 3 × 3 kernels and is then resampled with 2 × 2 stride transpose convolution layers. During this process, residual connections ensure the preservation and enhancement of features. The final feature maps are processed using 1 × 1 convolution and a sigmoid activation function, resulting in binary segmentation outputs. GISegNet is an effective model designed with an optimized structure to accurately detect abnormalities in the GI system.
Data preparation process for classificationIn the conducted study, an eight-class gastrointestinal (GI) system dataset was used. The images were preprocessed to a resolution of 224 × 224 pixels. Data augmentation techniques, such as RandomResizedCrop and RandomHorizontalFlip, were applied during training, while basic scaling operations were performed during testing and validation. The dataset was divided into 60% for training, 20% for validation, and 20% for testing purposes. Figure 6 presents example images from the dataset following the preprocessing steps.
Fig. 6Sample dataset ımages following preprocessing steps
Support vector machinesSupport Vector Machines (SVM) are an effective supervised learning method aimed at classifying data points by determining a hyperplane that separates different classes in classification problems. SVM is particularly successful with small to medium-sized, complex datasets [23]. The fundamental principle of the model is to create a decision boundary that maximizes the distance (margin) between classes. In Fig. 7, the classification process with SVM is visualized, where the black and white points represent different classes, and the pink area represents the margin region at a distance of ± 1 from the hyperplane. A wider margin leads to better model performance.
Fig. 7Support vector machines classification example
In the SVM shown in Fig. 7, classification is performed using the weight vector \(w\), input vector\(x\), and bias\(b\). If the computed value is less than 0, the sample is assigned to the green class, and if it is greater than or equal to 0, the sample is assigned to the blue class. This process is expressed by Eq. 17 [24].
$$y=\left\0\,if\, ^.x+b<0,\\ 1\, if\, ^.x+b\ge 0\end\right.$$
(17)
In the hybrid model proposed for classifying abnormalities in the GI system, the SVM method uses a Cubic kernel function, with the Kernel scale set to Auto. The Box constraint level is chosen as 1, and the Multiclass method is implemented using the One-vs-One approach. These parameters have been appropriately configured to optimize the performance of the model.
Feature selection approach: mRMR techniqueFeature selection techniques are employed to enhance the classification performance of deep learning models. These methods focus on improving the model’s accuracy by identifying the most relevant features. In this research, the Minimum Redundancy Maximum Relevance (mRMR) method was selected for binary classification [25]. mRMR is a filtering technique used to ensure the highest correlation and minimum redundancy among features. The algorithm ranks each feature and evaluates their relationships, defining less important features as “redundant” and significant ones as “relevant.” This process was carried out using MATLAB’s Feature Selection Methods tool.
Transformer modelsRecent advancements in image analysis have been driven by deep learning models, especially Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). ViT models, by processing long-range dependencies more effectively, can achieve higher performance in the classification of abnormalities in the human gastrointestinal (GI) system. In this study, Deit3, Maxvit, Swin, and ViT models were used, with all images processed at a resolution of 224 × 224 pixels. The models’ architecture is based on six main processing stages, which may include additional layers. A visual representation of these processing stages is presented in Fig. 8.
Fig. 8Processing stages of transformer models
In the ViT architecture shown in Fig. 8, the input images are processed by dividing them into patches. These patches, which are of size 16 × 16 or 24 × 24 pixels, are passed through an embedding layer to be converted into vector representations. The order of the patches is determined through positional embedding. The Transformer Encoder Blocks incorporate multi-head self-attention (MHSA) and feed-forward neural network (FFNN) components, enabling the model to capture more complex features. The output layer at the end of the model is responsible for performing the classification task. In this study, the following models with an input resolution of 224 × 224 pixels were utilized: 1 st DeiT3 Base Patch16-224, MaxViT Base-tf-224, Swin Base Patch4 Window7-224, and ViT Base Patch16-224. These models demonstrate notable performance in visual data analysis.
Proposed hybrid approach for classificationThe hybrid model proposed in this study aims for the fast and accurate analysis of human gastrointestinal (GI) images. Vision Transformer (ViT) models are central to this framework due to their efficiency in image processing and feature extraction. The model follows six primary stages: preprocessing, model training, feature extraction, feature fusion, feature selection, and classification. During the preprocessing phase, gastrointestinal (GI) images are enhanced, cropped, and resized to 224 × 224 pixels to provide ViT models with data that is both efficient and suitable for processing. In the training phase, models such as DeiT3/base patch16, MaxViT/base-tf, Swin/base patch4, and ViT base patch16 are employed to ensure the rapid and precise analysis of human GI images. These models can operate with low hardware requirements due to their simple architectures and low parameter counts. By using different ViT models, the feature sets obtained from GI images were diversified, and each model learned important features for classifying GI abnormalities with 224 × 224 pixel resolution input images. This approach ensures more efficient analysis of the data. During the feature extraction phase, features are extracted from the fully connected (FC) layer, prior to the classification layer, of each model. The DeiT3, MaxViT, and ViT models each produce a feature set of 768, whereas the Swin model generates 1024 features due to its unique architecture. The models’ performance was assessed using Support Vector Machine (SVM), and the feature sets from the top three models were identified. These sets were subsequently combined for further analysis, leading to the creation of four new feature sets derived from the union of the DeiT3, Swin, and ViT models. The performance of these combined feature sets was also evaluated using SVM. In the feature selection phase, the best-performing set among the combined sets was selected for subsequent steps. The selection was conducted using the Minimum Redundancy Maximum Relevance (mRMR) method, and new sets with 100, 300, 500, 700, and 1000 features were generated from the chosen set. These new sets were then classified using the SVM method. The overall architecture and processing steps of the proposed hybrid model are visually depicted in Fig. 9.
Fig. 9Overall architecture of the proposed hybrid model
Comments (0)