In this study, five pre-trained deep learning models obtained from the Hugging Face Transformers library were fine-tuned for the task of multi-label emotion classification. The models are RoBERTa-base (Facebook 2025), which was pre-trained on large-scale English texts; mental-RoBERTa-base (Mental 2025a) and mental-bert-base-uncased (Mental 2025b), which were pre-trained on mental health forums and medical texts such as PubMed and PMC; distilbert-base (DistilBERT 2025), which is a distilled version of the BERT architecture; and deberta-v3-base (Microsoft 2025), developed by Microsoft, which offers improved performance in terms of context representation. RoBERTa-base has a 12-layer architecture with 768 hidden units and 12 attention heads, containing a total of 125 million parameters (HuggingFace, 2019). Both mental-RoBERTa-base and mental-bert-base-uncased have 12 layers and 768 hidden units; mental-bert-base-uncased has approximately 110 million parameters, whereas mental-RoBERTa-base has about 125 million parameters (Ren et al. 2024; Yang et al. 2025). DistilBERT-base consists of 6 layers, 768 hidden units, 12 attention heads, and approximately 66 million parameters (Hugging Face 2024). The deberta-v3-base model has 12 layers and a hidden size of 768 (Hugging Face 2022). Its backbone contains 86 million parameters, and its vocabulary of 128,000 tokens adds a further 98 million parameters in the embedding layer. Each of these models was fitted with an eight-unit output layer with a sigmoid activation function to predict eight emotion labels (anger, cognitive dysfunction, emptiness, hopelessness, loneliness, sadness, suicide intent, worthlessness).
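The eight-unit sigmoid output described above can be illustrated with a minimal sketch of the decoding step. The label order, the 0.5 decision threshold, and the helper name `decode_labels` are assumptions made for illustration; the text does not state which threshold was used.

```python
import math

# The eight label names come from the text; their order here is an assumption.
EMOTIONS = [
    "anger", "cognitive dysfunction", "emptiness", "hopelessness",
    "loneliness", "sadness", "suicide intent", "worthlessness",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_labels(logits, threshold=0.5):
    """Return the emotions whose sigmoid probability reaches the threshold.

    Each output unit is decided independently, so a post can receive
    zero, one, or several labels. The 0.5 threshold is an assumption.
    """
    if len(logits) != len(EMOTIONS):
        raise ValueError("expected one logit per emotion label")
    return [name for name, z in zip(EMOTIONS, logits) if sigmoid(z) >= threshold]


# Hypothetical logits for a single post (not taken from the study).
example_logits = [-2.0, -1.5, 0.3, 2.1, -0.4, 1.7, -3.0, -0.2]
predicted = decode_labels(example_logits)
print(predicted)
```

Because each label is thresholded independently, the sigmoid head differs from a softmax head, which would force the eight probabilities to compete for a single winner.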
The hyper-parameters used during model training are as follows: mini-batch size 8, maximum input length 256 tokens, learning rate 2 × 10⁻⁵, and a maximum of 25 epochs. However, because of early stopping, training usually ended between 6 and 10 epochs. The early stopping criterion was triggered when the validation loss did not improve for three consecutive epochs. We applied a layer-wise, progressive dropout schedule across the transformer encoder layers, with the dropout rate increasing linearly from 0.1 (shallow layers) to 0.3 (deeper layers) during fine-tuning. The AdamW optimizer was used, and the weight decay was left at its default value of 0.01. The learning rate was scheduled with the get_linear_schedule_with_warmup method, with a 10% warm-up phase at the beginning of training. Gradient clipping with a maximum norm of 1.0 was applied to prevent excessively large gradients from degrading model performance. For reproducibility, we fixed the random seed to 42 for data splitting and model training.
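The three schedules named above (layer-wise dropout, linear warm-up and decay, and patience-based early stopping) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the training code of the study: the function and class names are hypothetical, the warm-up step count is assumed to be truncated to an integer, and the text does not say how the per-layer dropout rates were wired into each encoder.

```python
def layerwise_dropout(num_layers=12, start=0.1, end=0.3):
    """Dropout rate per encoder layer, rising linearly from start to end."""
    if num_layers == 1:
        return [start]
    step = (end - start) / (num_layers - 1)
    return [start + i * step for i in range(num_layers)]


def lr_multiplier(step, total_steps, warmup_ratio=0.10):
    """Factor applied to the base learning rate at a given optimizer step.

    Linear warm-up from 0 to 1 over the first 10% of steps, then linear
    decay back to 0, which is the shape of a linear warm-up schedule.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


BASE_LR = 2e-5
rates = layerwise_dropout()
# Hypothetical run of 1000 optimizer steps: peak learning rate is reached at step 100.
lr_at_peak = BASE_LR * lr_multiplier(100, 1000)

stopper = EarlyStopping(patience=3)
# Hypothetical validation losses: the best epoch is the second one.
stop_flags = [stopper.step(loss) for loss in [0.50, 0.40, 0.41, 0.42, 0.43]]
```

With the hypothetical losses above, the stop signal fires at the fifth epoch, three epochs after the best validation loss, which matches the patience of three described in the text.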
In order to increase the reliability of the model performance, we performed 5-fold cross-validation using the MultilabelStratifiedKFold method, which performs data splitting by preserving the class distribution in multi-label tasks. This ensured a balanced distribution of labels in each fold, and the model weights from the epoch that achieved the highest validation F1 Macro score in each fold were saved. The final evaluations were performed using these best weights.
Evaluation metrics
The experimental results are evaluated based on F1 Macro, Precision Macro, Recall Macro, F1 Micro, Precision Micro, and Recall Micro metrics. F1 Micro measures the overall performance of the model without considering class distinctions, whereas F1 Macro calculates the average of the F1 scores for each emotion class, offering a more balanced, class-specific evaluation. The equations for the F1 Micro and F1 Macro formulations calculated based on the precision and recall metrics are presented below.
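$$Precision_{micro}=\frac{\sum_{c=1}^{C}TP_{c}}{\sum_{c=1}^{C}\left(TP_{c}+FP_{c}\right)}$$
(7)
$$Precision_{macro}=\frac{1}{C}\sum_{c=1}^{C}\frac{TP_{c}}{TP_{c}+FP_{c}}$$
(8)
$$Recall_{micro}=\frac{\sum_{c=1}^{C}TP_{c}}{\sum_{c=1}^{C}\left(TP_{c}+FN_{c}\right)}$$
(9)
$$Recall_{macro}=\frac{1}{C}\sum_{c=1}^{C}\frac{TP_{c}}{TP_{c}+FN_{c}}$$
(10)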
In multi-label classification, \(C\) denotes the total number of labels and \(N\) represents the number of instances in the dataset. For each label \(c\in\{1,...,C\}\), the classification performance can be represented using standard confusion matrix components. \(TP_{c}\) denotes the number of true positives for label \(c\), i.e., instances correctly predicted as belonging to label \(c\). Similarly, \(FP_{c}\) refers to the number of false positives, representing instances incorrectly predicted as belonging to label \(c\), while \(FN_{c}\) denotes the number of false negatives, corresponding to instances that actually belong to label \(c\) but were not predicted as such.
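$$F1_{micro}=\frac{2\times Precision_{micro}\times Recall_{micro}}{Precision_{micro}+Recall_{micro}}$$
(11)
$$F1_{macro}=\frac{1}{C}\sum_{c=1}^{C}\frac{2\times Precision_{c}\times Recall_{c}}{Precision_{c}+Recall_{c}}$$
(12)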
F1 Micro and F1 Macro performance metrics are the most suitable evaluation metrics for multi-label classification, as presented in Eqs. 11 and 12 (Hinojosa Lee et al. 2024). F1 Macro assigns equal importance to each class by computing the F1 score separately for each class and then averaging the results. The F1 Macro score provides an advantage in cases of class imbalance by ensuring that each class has an equal impact on the final metric. In contrast, F1 Micro aggregates the true positives (TP), false positives (FP), and false negatives (FN) across all classes to obtain a unified score. This method is beneficial when the focus is on overall system performance, particularly emphasizing the majority class while still accounting for the minority classes.
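The difference between the two averaging schemes can be made concrete with a small worked sketch. The per-label counts below are hypothetical and chosen only to show how a rare, poorly predicted label pulls F1 Macro down while barely affecting F1 Micro. The helper names, and the convention of returning 0 when a denominator is zero, are assumptions.

```python
def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero (a convention, assumed here)."""
    return num / den if den else 0.0


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def micro_macro_f1(counts):
    """Compute (F1 Micro, F1 Macro) from per-label (TP, FP, FN) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Micro: pool the counts over all labels, then compute one score.
    f1_micro = f1_from(safe_div(tp, tp + fp), safe_div(tp, tp + fn))
    # Macro: compute one F1 per label, then take the unweighted mean.
    per_label = [
        f1_from(safe_div(t, t + p), safe_div(t, t + n)) for t, p, n in counts
    ]
    f1_macro = sum(per_label) / len(per_label)
    return f1_micro, f1_macro


# Hypothetical counts: one frequent label predicted well, one rare label predicted poorly.
toy_counts = [(90, 10, 10), (1, 3, 3)]
f1_micro, f1_macro = micro_macro_f1(toy_counts)
print(round(f1_micro, 3), round(f1_macro, 3))
```

In this toy case the frequent label has a per-label F1 of 0.90 and the rare label 0.25, so F1 Macro is 0.575, whereas the pooled counts give an F1 Micro of 0.875.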
To provide a more comprehensive evaluation of the experimental results, in addition to standard Macro-F1 and Micro-F1 metrics, Jaccard index (Jaccard accuracy) was also used as an evaluation metric. Each textual item in the DepressionEmo dataset can be labeled with a set of eight distinct emotions. The Jaccard index evaluates the partial prediction performance of the model by measuring the intersection between the predicted label set and the true label set for each textual example. The formulation for calculating Jaccard Accuracy is defined in Eq. 13.
$$JaccardAccuracy=\frac{1}{|T|}\sum_{t=1}^{|T|}\frac{|G_{t}\cap P_{t}|}{|G_{t}\cup P_{t}|}$$
(13)
Here, \(|T|\) represents the number of samples in the dataset, \(G_{t}\) represents the true label set, and \(P_{t}\) represents the predicted label set of sample \(t\).
A high Jaccard score indicates a higher overlap of predicted labels with the true labels, allowing for effective evaluation of model performance on multi-label datasets like DepressionEmo.
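The sample-averaged Jaccard computation can be sketched as follows. The label sets are hypothetical, the function name is illustrative, and scoring a sample as 1.0 when both sets are empty is an assumed convention that the text does not specify.

```python
def jaccard_accuracy(true_sets, pred_sets):
    """Mean over samples of |G ∩ P| / |G ∪ P| for label sets G (true) and P (predicted)."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true_sets and pred_sets must have the same length")
    scores = []
    for gold, pred in zip(true_sets, pred_sets):
        union = gold | pred
        # Assumed convention: two empty sets count as a perfect match.
        scores.append(len(gold & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical samples: an exact match, one missed label, and three extra labels.
true_sets = [
    {"sadness", "hopelessness"},
    {"anger", "hopelessness", "sadness", "worthlessness"},
    {"sadness"},
]
pred_sets = [
    {"sadness", "hopelessness"},
    {"hopelessness", "sadness", "worthlessness"},
    {"sadness", "emptiness", "hopelessness", "worthlessness"},
]
score = jaccard_accuracy(true_sets, pred_sets)
print(round(score, 4))
```

The three per-sample scores are 1.00, 0.75, and 0.25, so the averaged Jaccard accuracy is 2/3. Unlike exact-match accuracy, the metric gives partial credit to the second and third samples.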
Experimental Results
Table 4 summarizes the performance of the algorithms on multi-label depression emotion classification of user posts in terms of F1 Macro, Precision Macro, Recall Macro, F1 Micro, Precision Micro and Recall Micro.
Table 4 Performance comparison of text classification methods on the test set
Among the individual transformer-based models, RoBERTa achieved the highest F1-macro score of 0.7968, outperforming strong baselines such as DeBERTa-v3 with F1-macro score of 0.7895. Notably, this score also exceeds the reference values reported in the original dataset paper by Rahman et al. (2024), where the benchmarks were 0.76 F1-macro and 0.80 F1-micro.
Beyond single models, ensemble learning approaches demonstrated clear performance gains. In particular, the Stacking-FFNN method attained the best results, with an F1-macro of 0.8121 and an F1-micro of 0.8421, surpassing both the individual transformer models and simpler ensemble strategies such as hard and soft voting. Ensemble methods help reduce overfitting by leveraging the complementary strengths of different models, thereby reducing model-specific bias and variance. In an ensemble, the choice of base classifiers and the diversity of their decisions are as important as the way the models are combined. Hard voting combines the predicted classes of the individual models directly, which smooths out model-specific errors and yields more robust predictions. Stacking, a more flexible fusion strategy, instead learns how the base model predictions relate to the true labels, so it can capture nonlinear dependencies between the base models' output probabilities and yield higher performance on challenging tasks such as depression classification. When an FFNN serves as the meta-learner, the ensemble learns how to weight and integrate the predictions of the base models. Stacking-FFNN can therefore correct systematic biases of individual models while reducing the variance that contributes to overfitting. These advantages are particularly valuable for multi-label depression detection, where textual inputs often contain overlapping emotional cues, high inter-label correlation, and significant noise, which may explain why the Stacking-FFNN ensemble performed best on the DepressionEmo dataset.
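The three fusion strategies compared above can be contrasted with a minimal sketch. The probabilities, the 0.5 threshold, and the function names are hypothetical. The meta-learner itself is not shown, because this section does not describe the FFNN architecture; the sketch only shows the meta-feature vector that such a learner would receive, assuming that every base model contributes all of its label probabilities.

```python
def hard_vote(prob_matrix, threshold=0.5):
    """Majority vote per label over the thresholded decisions of each model.

    prob_matrix[m][k] is the probability that model m assigns to label k.
    """
    n_models = len(prob_matrix)
    n_labels = len(prob_matrix[0])
    votes = [
        sum(1 for model in prob_matrix if model[k] >= threshold)
        for k in range(n_labels)
    ]
    return [1 if 2 * v > n_models else 0 for v in votes]


def soft_vote(prob_matrix, threshold=0.5):
    """Average the probabilities per label, then threshold the mean."""
    n_models = len(prob_matrix)
    n_labels = len(prob_matrix[0])
    means = [sum(model[k] for model in prob_matrix) / n_models for k in range(n_labels)]
    return [1 if p >= threshold else 0 for p in means]


def stacking_features(prob_matrix):
    """Concatenate all base-model probabilities into one meta-feature vector.

    A trained meta-learner would map this vector to the final label
    probabilities instead of applying a fixed voting rule.
    """
    return [p for model in prob_matrix for p in model]


# Hypothetical outputs of three base models for two labels on a single post.
probs = [
    [0.9, 0.6],
    [0.4, 0.6],
    [0.4, 0.1],
]
hard = hard_vote(probs)
soft = soft_vote(probs)
features = stacking_features(probs)
print(hard, soft, len(features))
```

In this toy case the two voting rules disagree on both labels: one confident model outweighs two hesitant ones under soft voting but not under hard voting. A learned meta-learner can resolve such cases from data rather than by a fixed rule. Under the assumption stated above, five base models with eight labels would yield a 40-dimensional meta-feature vector.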
Overall, the findings indicate that ensemble methods built upon transformer-based encoders outperform individual models on multi-label emotion classification. Notably, combining diverse transformer outputs with a neural network meta-learner improves on the best single model (F1-macro of 0.8121 versus 0.7968). From a practical perspective, this is especially important in psychology, where recognizing and analyzing multiple co-occurring emotional states, rather than a single dominant emotion, is critical for more accurate and meaningful assessments.
To provide a more comprehensive evaluation of the experimental results, Table 5 presents the emotion predictions produced by the proposed model for selected text examples, in addition to the standard classification performance metrics, allowing for a qualitative discussion of the experimental outcomes.
Table 5 Qualitative analysis of model predictions showing instances of exact match, partial error, and failure cases
Table 5 presents representative test samples illustrating the proposed model’s prediction behavior. In the first example (ID: e38ntt), multiple negative emotions are expressed directly, and the model correctly predicts all ground-truth labels. It achieves a Jaccard score of 1.00. This indicates that the model can detect emotions that appear together when explicitly expressed. For the second post (ID: ryjtfw), the model correctly predicted most of the true labels but missed the anger label. As a result, the Jaccard score was calculated as 0.75. While feelings of hopelessness, sadness, and worthlessness are conveyed more directly in the text, anger is expressed more indirectly. This example shows that while some emotions are expressed more explicitly in the text, implicit emotions are more likely to be missed by the model. In the third example (ID: 179fkof), although the ground-truth label was only sadness, the model predicted emptiness, hopelessness, and worthlessness in addition to sadness; therefore, the Jaccard score was 0.25. Because the text conveyed a general negativity about life and the future rather than directly stating a single emotion, the model predicted semantically similar labels in addition to sadness. This suggests that the model may overpredict in texts where emotional cues are presented within a broader context. Finally, in the fourth example (ID: 17csiq0), although the ground-truth label was only hopelessness, the model also predicted cognitive dysfunction, emptiness, loneliness, sadness, and worthlessness, resulting in a low Jaccard score of 0.17. Similar to the third example, this suggests that broad and overlapping cues in the text can lead the model to assign multiple related emotions rather than a single specific label.
In the fourth post, expressions like difficulty focusing, reduced functioning, and emotional numbness can fit several labels. For this reason, the model predicts multiple related emotions instead of a single label. Overall, our proposed model produces more consistent results in texts where emotions are expressed explicitly and strongly, but it may assign additional emotion labels when cues are implicit.
Comparison with literature
In this section, the findings of our study are compared with results reported in the existing literature that used the DepressionEmo dataset; the similarities and differences are summarized in Table 6.
Table 6 Comparison with existing literature
Studies using the DepressionEmo dataset show that F1 Scores reported in the literature generally range from 0.61 to 0.81, depending on the method used. Rahman et al. (2024) compared classical ML methods with various transformer-based models and reported that Micro F1 Scores ranged from 0.61 to 0.80, with the highest performance achieved by the BERT-uncased and BART models. Khan et al. (2024) achieved an average F1 Score of 78.16% with the contrastive learning + RoBERTa approach, while Violides et al. (2024) achieved an average F1 Score of 81.00% with the fine-tuned RoBERTa model. Furthermore, Younas et al. (2025) reported an F1 Score of 0.70 using the Stacked LSTM model.
Compared to these studies, the proposed Stacking-FFNN approach, with a Macro F1 value of 0.8121, appears to match or outperform many single transformer-based models in the literature. This result suggests that the model offers balanced inter-class performance and strong generalization ability. Furthermore, the Micro F1 Score of 0.8421 indicates high overall classification performance and suggests that the stacking structure effectively integrates complementary information obtained from different models. Although direct comparison is challenging due to differences in evaluation metrics, the results suggest that the proposed approach is competitive with, and in some cases exceeds, the performance of existing state-of-the-art models reported in the literature.
Limitations
While BERT-based architectures demonstrate high performance in learning semantic representations, their black-box nature makes it difficult to clearly and interpretably explain the underlying reasons for their decisions. This presents a significant limitation, particularly in clinically critical applications. The performance and generalization capabilities of BERT and its derivative models depend heavily on the quality and scope of the data they are trained on. If the dataset does not adequately reflect the diversity of depressive symptoms and emotional expressions across different demographic and sociocultural groups, the generalization success of the developed classification models can be negatively impacted. While domain-specific models such as MentalBERT and MentalRoBERTa are fine-tuned for specific psychological tasks, they can sometimes fall short in capturing implicit meanings and subtle linguistic nuances (Garg et al. 2024). While evaluating model performance on unseen data is critical, such evaluations can also lead to results that do not fully reflect real-world conditions.
Another important limitation concerns the DepressionEmo dataset used in this study. Although the dataset has been used in academic work on the automatic classification of depression, it was not derived from real patient-physician interviews. It consists of user-generated comments, which limits its clinical validity. The depression emotion labels, especially suicide intent, are based on user statements and annotator interpretations rather than formal clinical diagnoses, which carries the risk of label noise and subjective interpretation. Therefore, models trained with DepressionEmo are limited in their ability to perform diagnoses on real clinical data.
Additionally, the transformer-based language models used in this study were fine-tuned on Reddit posts, which carries some significant risks. Texts produced on online platforms like Reddit can contain everyday language, exaggeration, irony, humor, and community-specific discourse patterns. This can lead the models to overfit to platform-specific language use and to generalize poorly to mental states expressed in clinical or other real-world contexts. Particularly with sensitive labels such as suicide intent, this type of bias increases the risk of misinterpretation and requires ethical caution.
Furthermore, the lack of universal representation of demographic variables such as age, gender, and cultural background in the data source used is one of the main limitations of the study (Huang et al. 2023). Platform-specific language use, irony, and lack of context can lead to misinterpretations of the texts (Das et al. 2024).
Finally, false positive or false negative predictions, particularly those related to suicide intent, pose significant ethical risks. False positive predictions can lead to unnecessary labeling and stigmatization of individuals, while false negative predictions can cause real risks to be overlooked. Therefore, the proposed model should be considered solely as a decision support tool and should not be used as a clinical or automated decision-making mechanism.