Comparative evaluation of imputation and batch-effect correction for proteomics/peptidomics differential-expression analysis

Abstract

Mass spectrometry (MS)-based proteomics offers powerful opportunities for biomarker discovery, but it comes with technical challenges, notably missing values and batch effects. Both can obscure biological signal and bias results. Although imputation and batch-correction methods are well established in transcriptomics, their impact on large-scale, real-world clinical proteomics datasets remains unclear. In this study, we systematically compared two popular imputation methods (½ LOD replacement and KNN) in combination with three batch-effect correction approaches (ComBat, ComBat with disease covariate, and MNN) and assessed their effect on differential expression analysis. We used a capillary electrophoresis-mass spectrometry (CE-MS) urine peptidomics dataset of 1,050 samples across 13 batches, collected for early detection of chronic kidney disease (CKD) and separated into discovery (n = 525) and validation (n = 525) sets. The choice of imputation method (½ LOD or KNN) had minimal impact on the final list of differentially expressed peptides (DEPs). In contrast, batch-effect correction had a much stronger influence on the results. ComBat without covariate adjustment removed most DEPs, suggesting loss of true biological signal, whereas incorporating disease status into the model preserved most of this signal. MNN yielded a moderate to low number of validated DEPs overall, especially when paired with KNN imputation. These findings show that imputation and batch correction are not entirely independent steps and that their combination can influence downstream results. Overall, preprocessing choices should be made based on the characteristics of each dataset, with particular attention to batch severity and biological covariates.
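As an illustration of two ideas named above, the sketch below shows ½ LOD replacement and a Jaccard similarity index for comparing two DEP lists. It is a minimal sketch under stated assumptions, not the authors' pipeline: the LOD of each peptide is approximated here by its minimum observed intensity, and the function names (`half_lod_impute`, `jaccard`), the toy intensity matrix, and the peptide labels are hypothetical.

```python
import numpy as np


def half_lod_impute(intensities):
    """Replace missing values (NaN) in each peptide column with half of that
    column's minimum observed value.

    Assumption: the limit of detection (LOD) is approximated by the smallest
    observed intensity of each peptide; the abstract does not define the LOD.
    """
    out = np.asarray(intensities, dtype=float).copy()
    for j in range(out.shape[1]):
        col = out[:, j]              # view into `out`, so edits are in place
        missing = np.isnan(col)
        if missing.all():
            continue                 # no observed value to derive an LOD from
        col[missing] = 0.5 * np.nanmin(col)
    return out


def jaccard(a, b):
    """Jaccard similarity index |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data: rows are samples, columns are peptides (hypothetical values).
intensities = np.array([
    [4.0, np.nan],
    [2.0, 6.0],
    [np.nan, 8.0],
])
imputed = half_lod_impute(intensities)

# Overlap between two hypothetical DEP lists from different pipelines.
overlap = jaccard({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

In this toy case the missing entry of the first peptide becomes 1.0 (half of 2.0) and that of the second becomes 3.0 (half of 6.0), and the two DEP lists share two of four distinct peptides, giving a Jaccard index of 0.5.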

Statement of significance of the study

Finding reliable biomarkers in clinical proteomics first requires addressing the technical noise that can hide true biological signals. In this work, we investigate how different imputation and batch correction methods influence the list of peptides that emerge as differentially expressed. Instead of relying on simulations or small datasets, we examine a large, real-world urine-peptidomics cohort of more than 1,000 samples screened for early-stage chronic kidney disease. The results show that no preprocessing pipeline is universally optimal and that the best choice depends on the characteristics of the dataset. This study offers practical guidance for improving reproducibility in urine-based peptide studies and supports more confident identification of disease-associated molecular signatures.

Competing Interest Statement

H.M. is the co-founder and co-owner of Mosaiques Diagnostics. A.L. is employed by Mosaiques Diagnostics. All other authors declare no conflict of interest.

Funding Statement

This article/publication is based upon work conducted within a Short-Term Scientific Action from COST Action CA21165 (PerMedik) supported by COST (European Cooperation in Science and Technology).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

All datasets were from previously published studies and fully anonymized. Ethical review and approval were not required for this study because all data are fully anonymized, based on the opinion of the ethics committee of the Hannover Medical School, Germany (no. 3116-2016).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data needed to reproduce the results of this study are provided in the accompanying Supporting Information files.

Abbreviations

CKD: Chronic Kidney Disease
dBP: Diastolic Blood Pressure
DE: Differential Expression
DEPs: Differentially Expressed Peptides
GFR: Glomerular Filtration Rate
J: Jaccard Similarity Index
KNN: k-Nearest Neighbors
MAR: Missing at Random
MCAR: Missing Completely at Random
MNAR: Missing Not at Random
MNN: Mutual Nearest Neighbors
sBP: Systolic Blood Pressure
