Radiation therapy is a core component of the treatment of cancer patients, with over 50% of patients receiving radiation therapy at some point during their treatment course and approximately 60% of these patients treated with curative intent.1 Personalization of radiation therapy is a critical goal for radiation oncologists, who aim to increase the chance of controlling disease while limiting the harmful side-effects and toxicities of treatment. Current strategies center on several frameworks: stratifying patients by clinical characteristics, using image guided radiation therapy to deliver higher doses to the tumor with improved sparing of normal tissue,2 and using biomarkers to guide treatment.3, 4, 5 The first 2 are widely adopted clinically, but development of radiation tumor biomarkers has proven more difficult. Prognostic tests can help guide decisions about treatment intensification in high-risk patients or de-intensification in low-risk patients, but they provide limited insight into the most appropriate type of intervention. The ideal radiation tumor biomarker for the radiation oncologist would be treatment predictive; that is, it would provide insight into whether a patient's specific cancer is radiosensitive or radioresistant, or into the expected benefit of radiation. Such predictive biomarkers could then guide the decision on whether to treat with radiation and help determine the appropriate dose and fractionation schedule. Predictive biomarker discovery can be more difficult than identifying prognostic biomarkers, but advances in high-throughput molecular profiling technologies have provided an opportunity to create clinically useful radiation-based predictive biomarkers, and many biomarkers of radiation sensitivity and resistance have been developed using various machine learning techniques6 to harness the power of these assays. However, no tumor biomarkers specifically designed for radiation therapy are in clinical use today. One of the challenges with modern high-throughput assays is the complex nature of the data produced and the special analytical considerations required. In this review, we set out to provide a broad overview of the computational machine learning methodologies used in the development of high-throughput radiation biomarkers, as well as to review the challenges and future possibilities.
Modern high-throughput assays, or “omics” technologies, have revolutionized the study of biological systems. These assays provide an opportunity to comprehensively characterize tumor DNA alterations (genomics), gene expression profiles (transcriptomics), protein abundance and modifications (proteomics), epigenetic modifications (epigenomics), metabolic profiles (metabolomics), etc. Herein, we will focus on the special analytical considerations needed for these molecular data. The common theme for all “omics” assays is that for every tumor sample (observation) there are hundreds to thousands to millions of variables measured (features, or dimensions, eg, genes, proteins, etc.). The number of observations is usually lower than the number of features due to sample availability and cost. Given that (1) most features are typically not informative for the specific question asked, thus contributing only “noise,” and (2) there is a level of background random variability for each feature, there are usually features that can distinguish between groups by pure chance. Thus, adding hundreds or thousands of variables, most of which contribute noise, does not necessarily make it easier to find the true biological signal in the data, a phenomenon sometimes called the “curse of dimensionality.” The key analytical challenge in biomarker development is to find the true signal in the noise.
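To illustrate how easily chance findings arise in high-dimensional data, the short Python sketch below simulates a pure-noise expression matrix with no true group differences; the sample sizes and significance threshold are illustrative assumptions only.

```python
# Illustrative sketch: with enough random features, some will appear to
# separate two arbitrary groups by chance alone.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 10_000
expr = rng.normal(size=(n_samples, n_genes))   # pure-noise "expression" matrix
labels = np.array([0] * 20 + [1] * 20)         # two arbitrary groups, no real difference

_, pvals = ttest_ind(expr[labels == 0], expr[labels == 1], axis=0)
print(f"Genes with P < .001 despite no true signal: {(pvals < 0.001).sum()}")
# Roughly 10 of the 10,000 genes are expected to appear "significant" by chance.
```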
This challenge is compounded by the power of modern machine learning techniques, which can be extremely efficient at finding subtle patterns in the data. Given the high-dimensional nature of “omics” data, it is easy for these methods to fit models on the noise in a specific dataset, and the more flexible a model is, the more easily it picks up that noise. When a model picks up noise instead of signal, its weights become more sensitive to changes in the input data, and the model will have high variance. Conversely, a less flexible model is biased toward its own constraints, which makes it more robust to noise but also less able to capture the true signal. Ideally, a good model is flexible enough to fit the true signal closely (low bias) while also having low variance. There is usually a trade-off between these properties, referred to as the bias-variance trade-off. In summary, the properties of high-dimensional data and the flexibility of models available in modern machine learning require a specific strategy and a set of steps for model development and validation, in which a model is trained, tuned, tested, and finally validated in independent data to ensure that it has indeed found true signal rather than noise.
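A minimal sketch of the bias-variance trade-off is shown below, using simulated data in which only 1 of 500 features carries signal: a nearly unconstrained model fits the training noise, while a more constrained (regularized) model generalizes better. The data, penalty values, and model choice are illustrative assumptions.

```python
# Bias-variance trade-off on simulated high-dimensional data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 500))            # 100 samples, 500 mostly-noise features
y = 2.0 * X[:, 0] + rng.normal(size=100)   # only the first feature carries signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for alpha in (1e-6, 100.0):                # weak vs strong constraint (bias term)
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:g}  train R^2={model.score(X_tr, y_tr):.2f}"
          f"  test R^2={model.score(X_te, y_te):.2f}")
# The flexible model (tiny alpha) fits the training data almost perfectly but
# performs poorly on the test split; the constrained model trades some
# training fit for better generalization.
```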
Before the machine learning procedures can start, the data need to be preprocessed for further analysis. The full details of these techniques are outside the scope of this review. Briefly, preprocessing typically includes normalization, batch-effect correction, and feature selection. Normalization is intended to reduce the technical variability between samples and can be done for a set of samples together (cohort) or on a single-sample basis. Single-sample techniques have the advantage of easier implementation in personalized medicine workflows, as they can be performed continuously on each new sample that is processed and independently of other cohorts. Batch-effects refer to systematic technical variability in which groups of samples share patterns arising from experimental or technical factors rather than from biological differences. This can be due to different assay platforms being used, samples being processed at different centers or at different timepoints, differences in sample handling, storage, processing, etc. It is important to reduce these potential sources of variability, and to identify and account for batch-effects when present. Feature selection is intended to reduce the number of noninformative, or noise, features. It can be done with unbiased data-driven methods (such as removing lowly expressed genes and selecting the most variable genes), by prior biological knowledge (eg, selecting genes in specific biological pathways), or at later analytical stages by how the features contribute to model performance. The overall goal of the preprocessing is to yield a dataset with as little noise and as much true signal as possible.
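The sketch below outlines how the normalization and data-driven feature-selection steps described above might look for a transcriptomic count matrix; the function name, thresholds, and log-counts-per-million transform are illustrative assumptions rather than a prescribed pipeline, and batch-effect correction is omitted for brevity.

```python
# Illustrative preprocessing of a samples x genes count matrix.
import numpy as np

def preprocess_counts(counts, min_mean_count=5, n_top_variable=2000):
    """counts: 2D array of raw expression counts (samples x genes)."""
    # 1. Normalization: library-size scaling followed by a log transform
    lib_size = counts.sum(axis=1, keepdims=True)
    logcpm = np.log2(counts / lib_size * 1e6 + 1.0)

    # 2. Remove lowly expressed genes (data-driven feature selection)
    keep = counts.mean(axis=0) >= min_mean_count
    logcpm = logcpm[:, keep]

    # 3. Keep only the most variable genes across samples
    top = np.argsort(logcpm.var(axis=0))[::-1][:n_top_variable]
    return logcpm[:, top]
```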
The terms artificial intelligence (AI) and machine learning (ML) are used colloquially and often interchangeably, but they hold separate meanings. AI refers to any technique utilized on a computer that seems to replicate a form of intelligence. ML represents a subset of AI in which computers learn directly from labeled data (ie, the patient outcome or radiosensitivity status of a sample is known) or unlabeled data. The goal of all ML methods is to take data as input and create a more structured and simplified output. There are 2 primary methodologies for performing this task: unsupervised learning and supervised learning. Supervised models train on labeled data for a specific outcome, either a continuous variable (regression), such as the survival fraction of cells after receiving 2 Gy of radiation (SF2), or a discrete variable (classification), such as whether a tumor is known to be radiation sensitive or resistant. Unsupervised models, in contrast, are used to identify the intrinsic structure in a dataset without utilizing any labels (eg, clustering).
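The toy sketch below contrasts these approaches on synthetic data; the simulated expression matrix, SF2 values, and model choices are illustrative assumptions only.

```python
# Supervised (regression, classification) vs unsupervised (clustering) learning.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 200))                 # 60 samples, 200 "gene" features
sf2 = rng.uniform(0.2, 0.8, size=60)           # continuous label (eg, SF2)
resistant = (sf2 > 0.5).astype(int)            # discrete label (sensitive vs resistant)

LinearRegression().fit(X, sf2)                          # supervised regression
LogisticRegression(max_iter=1000).fit(X, resistant)     # supervised classification
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # unsupervised: no labels used
```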
Because the number of observations is usually limited, how the available data are used is one of the most important aspects of ML for ensuring generalizability to future samples. Data spending refers to the task of splitting up the data at steps throughout the workflow so that the model reproducibly finds signal instead of noise. The data used to build a model are called training data, while we refer to the data that the final and locked model is run on to validate its performance as independent validation data. The training data can be further split for model tuning, which we refer to as a hold-out test set. A true validation dataset is meant to be entirely separate from the model training workflow and only used as a final independent confirmation of model performance. Various strategies have been utilized to train radiation signatures, such as Eschrich et al.7 utilizing the NCI-60 cell lines, Zhao et al.8 utilizing orthotopic glioblastoma patient-derived xenografts, and Sjöström et al.9 utilizing breast cancer patient samples. All 3 studies had independent clinical validation datasets that were used to confirm the clinical validity and generalizability of the models. A critical requirement for radiation biomarkers is a validation dataset that is independent from training,7, 8, 10, 11, 12, 13, 14, 15 such that no validation data are used to adjust the model. If the validation data are used for training, or inform the training (ie, information leakage), the model may pick up noise that is present in both the training and validation data, instead of true signal, and will not generalize to future datasets. Validation data may come from a different source than the training data. In that case, differences in platforms or experimental protocols may need to be accounted for via the normalization and/or batch correction strategies described above. However, this would not generally be considered a major component of information leakage, since the model is not being directly trained on the outcome of interest in the validation data.
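A minimal sketch of such a data-spending scheme is shown below; the synthetic cohort, split proportions, and variable names are illustrative assumptions.

```python
# Data spending: set aside validation data first, then split the training data
# into a model building set and a hold-out test set for tuning.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1000))      # synthetic cohort: 200 tumors x 1000 features
y = rng.integers(0, 2, size=200)      # synthetic sensitive/resistant labels

# Independent validation data: never used for training or tuning
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Within the training data, carve out a hold-out test set for model selection
X_build, X_test, y_build, y_test = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

# All feature selection, fitting, and tuning use X_build/X_test only;
# X_valid is reserved for one final evaluation of the locked model.
```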
As discussed above, ML models can be so flexible and effective in finding patterns that they find noise that does not represent true biological signal and is only specific to that dataset, which is referred to as overfitting.16 Thus, a model that is over-fit to a particular training dataset usually performs poorly on validation. A bias term that makes the model less flexible can be added so that the fit to the training data is loosened, increasing generalizability and ideally improving performance on validation (regularization, described in detail later) (Figure). The bias term represents a hyperparameter, a setting that can only be optimized by evaluating the model's performance, unlike a traditional parameter in a linear regression model, which has a best-fit solution. Both the choice of an appropriate model with the optimal level of flexibility and multiple hyperparameter values need to be tested to identify the optimal model and settings. However, selecting the best model and settings in the training data commonly leads to over-fitting, while selection based on the validation data would be information leakage. Thus, one strategy is to further split the training data into a model building set and a hold-out test set from which the optimal model and bias hyperparameters are selected (Figure), so that all of this occurs only within the training data, keeping the validation data separate.
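The sketch below illustrates this strategy with a single regularization hyperparameter: candidate penalty strengths are compared on the hold-out test set carved out of the training data, and the model is locked before any external validation. The data, penalty grid, and model choice are illustrative assumptions.

```python
# Tuning a regularization (bias) hyperparameter within the training data only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 800))
y = X[:, :5].sum(axis=1) + rng.normal(size=120)    # sparse true signal plus noise

X_build, X_test, y_build, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Compare candidate penalties on the hold-out test set (never on validation data)
scores = {alpha: Ridge(alpha=alpha).fit(X_build, y_build).score(X_test, y_test)
          for alpha in (0.01, 1.0, 100.0)}
best_alpha = max(scores, key=scores.get)
final_model = Ridge(alpha=best_alpha).fit(X_build, y_build)   # locked before external validation
```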
Selecting a hold-out test dataset within the training data can yield very different results depending on how the data are split. Resampling, in which the hold-out test set is generated repeatedly, is one way to minimize this variability. The 2 most common resampling techniques are k-fold cross validation (CV) and bootstrapped resampling (Figure). CV involves randomly splitting the data into k groups, with a model trained on k-1 groups and tested on the final held-out group (repeated k times). Leave-one-out CV is a special case where k equals the number of observations. In bootstrapping, observations are randomly sampled from the data with replacement; the model is then trained on that sampled group and tested on the remaining observations. This can be repeated n times and the results averaged. To use these techniques for hyperparameter optimization, one approach is to perform resampling within the model building portion of the training data. Once the best hyperparameters are selected, the model can be locked and evaluated on the hold-out test set. This is called double-loop resampling, and Sjöström et al.9 used this approach to develop a breast cancer radiation sensitivity signature. The training data they collected were split 50/50 into a model building set and a hold-out test set, and both gene selection and model hyperparameter optimization were performed using cross validation within the model building set. The model was first validated against their own hold-out test set and then on independent, publicly available datasets, demonstrating the clinical utility of the signature as prognostic for ipsilateral breast tumor recurrence and predictive of radiotherapy benefit in estrogen receptor positive patients. In summary, a strict data-spending schema and proper training using resampling minimize information leakage while maximizing the performance and generalizability of machine learning models.
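The sketch below gives a minimal version of this double-loop idea, using cross validation within the model building set to select a penalty strength and the hold-out test set to check the locked model; it is a simplified stand-in for the published workflow, with synthetic data and an assumed logistic regression model.

```python
# Double-loop resampling: inner CV selects hyperparameters within the model
# building set; the hold-out test set provides the outer check on the locked model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 500))
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

X_build, X_test, y_build, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

inner = GridSearchCV(                       # inner loop: 5-fold CV over penalty strength
    LogisticRegression(penalty="l2", max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5).fit(X_build, y_build)

print("chosen C:", inner.best_params_["C"])
print("hold-out test accuracy:", inner.score(X_test, y_test))  # outer check on locked model
```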