The multistep model from Armitage and Doll [6] describes a pathogenic process with the following characteristics [19]:
1.The pathogenic process is the result of several discrete changes with very low rates of occurrence;
2.Changes are stable (i.e. irreversible);
3.Changes must proceed in a unique order;
4.Changes occur independently (can occur in any order but only cause the disease if they occur in a specific order);
5.The probability of a specific change occurring in a given interval is only a function of the length of the interval and not when the interval occurs (i.e. independent increments).
6.The waiting time for the i-th step is exponentially distributed.
7.Almost surely, every person develops the disease (assuming an infinite lifespan).
If the pathogenic process has these characteristics, the relationship between incidence I and age t can be approximated (at least for a certain age range) as a power law
$$\:I\left(t\right)=_^$$
(1)
where \(\:_\) is a positive multiplicative constant that is proportional to the product of probabilities of each step and k is an integer that defines the number of steps (k > 1). Since
$$\:\text\text\left(I\right)=\text\text\left(_\right)+\left(k-1\right)\text\text\left(t\right)$$
(2)
the power law (Fig. 1A) is linear in log-log scale with respect to age (Fig. 1B) and sublinear in semi-log scale (Fig. 1C).
A reasonable alternative to the power law for describing age-specific incidence is the exponential function, which is typically used to describe age-specific all-cause mortality [21], and it is generally associated to processes that consist of continuous accumulation of damage [22,23,24,25]. The exponential function describes the relation between incidence I and age t as
$$\:I\left(t\right)=_^$$
(3)
where \(\:_\) is again a positive multiplicative constant and b is the degree of exponential acceleration (i.e. higher b implies faster increase with age). Since
$$\:\text\text\left(I\right)=\text\text\left(_\right)+bt$$
(4)
the exponential function (Fig. 1A) is linear in semi-log scale (Fig. 1C) and supralinear in log-log scale (Fig. 1B). The exponential function is by definition incompatible with the multistep pathogenic process defined by Armitage and Doll [6]. Note that the exponential function implies a faster increase of incidence with age compared to the power law, and that the difference between the two functions is conceptually profound but visually subtle, so that a true exponential function would be fitted pretty well by a power law and viceversa. Consequently, R2 values are expected to be high for both functions when fitting real age-incidence data, and small R2 values differences become crucial to determine whether the age-incidence curve is truly a power law (thus compatible with the Armitage-Doll multistep model) or an exponential function (thus incompatible with the Armitage-Doll multistep model). Importantly, for both the power law (with k > 2) and the exponential function (with b > 0) the first and second derivatives are always positive, i.e. incidence accelerates with age, so the fitting performance of the two functions can be fairly compared only for age ranges in which the acceleration assumption is met. Violations of the acceleration assumption erroneously bias the fitting performance toward the power law. Moreover, in the Armitage-Doll multistep model, the power law mathematically arises as an approximation for low ages only. This means that, if the Armitage-Doll multistep model is true, the power law applies for age-incidence data in younger adults, but not necessarily in older persons [6, 26]. At high ages, where the acceleration assumption is not met and the Armitage-Doll multistep model’s age-incidence relationship becomes undefined, the data cannot be used to support the multistep model.
DatasetsThree population-based datasets of age-specific incidence of ALS were analyzed in the present study:
(1)Dataset 1 was the dataset of the original study proposing that ALS is a multistep process, which comprises a total of 6274 cases from 5 registers [5]. Age-specific incidence data were extracted by graphical reading of the figures (WebPlotDigitizer Version 4.6; https://automeris.io/WebPlotDigitizer/) individually for each register. For the main analyses, the data were pooled across registers in order to obtain an overall estimate for the 5 registers combined (Fig. 2A), as in the original work [5]. The graphical reading was validated by verifying that the extracted data were consistent with the results reported in the original paper (Table 1).
(2)Dataset 2 included data from all studies suggesting that ALS is a multistep process published between November 2014 and April 2020 (total 19187 cases from 9 registers) [5, 7,8,9]. Age-specific incidence data were again extracted by graphical reading of the figures, but this time not pooled across registers due to the different age resolution (Fig. 2B). Again, the graphical reading was validated by verifying that the data were consistent with the results reported in the corresponding original papers (Table 2).
(3)Dataset 3 included data from a large number of studies published until July 2016, systematically and meta-analytically reviewed in a previous paper [2]. For consistency with the previous datasets, age-specific incidence data were again extracted by graphical reading of the figures, without pooling across registers (Fig. 2C). The first and the last value of each register (i.e. incidence at the lowest and highest ages) were excluded because the corresponding age ranges were unbounded. Only adult age ranges (i.e. >18 years) and only non-zero incidences were included (because zero values, which can occur due to finite sampling, become minus infinity outliers in logarithmic scale). For dataset 3 we also had access to the original data, so the graphical reading was validated by verifying that the results obtained on the extracted data and the original data were equivalent.
As a control, we also analyzed the data of age-specific incidence of cancer originally reported by Armitage and Doll [6] (note that they additionally assumed that mortality was a good indicator for incidence). From the republished paper [27], we specifically extracted with graphical reading only the data for cancers that were consistent with a multistep pathogenesis, i.e. oesophagus, stomach, colon, rectum and pancreas. Again, the graphical reading was validated by verifying that the extracted data were consistent with the results reported in the original paper (Table 3), and we averaged the data across cancers for the main analyses.
Fig. 2
Comparison between power law and exponential function for ALS. A–C Age-specific incidence from three datasets, plotted with the full age range available. D-F Same data considering only the age ranges analyzed to meet the acceleration assumption. G–I Same data as in (D-F) represented in log-log scale, with linear fitting (blue) and quadratic fitting (purple). Note that the quadratic function provides a better fitting than the linear function in all datasets. (L–N) Same data as in (D-F) represented in semi-log scale, with linear fitting (blue) and quadratic fitting (purple). Note that the quadratic fit overlaps almost perfectly with linear fit, suggesting that the age-specific incidence of ALS is linear in semi-log scale. R2 corresponds to the linear fitting and is higher in semi-log scale in all three datasets. The increase of age-specific incidence of ALS is thus better described by an exponential function than a power law, which is not consistent with the Armitage-Doll multistep model
Table 1 Validation of dataset 1Table 2 Validation of dataset 2Table 3 Validation of armitage and doll cancer dataMain analysesIn all datasets, as main analyses we first linearly fitted the expected logarithm of the age-incidence data in log-log scale (Eq. 2, i.e. assuming a power law, as in previous studies proposing the multistep pathogenic hypothesis) and in semi-log scale (Eq. 4, i.e. assuming an exponential function), and calculated the corresponding fitting performances in terms of R2 values. We also performed a quadratic fitting in order to visually test whether the age-incidence data were supralinear in log-log scale (supporting the exponential function) or sublinear in semi-log scale (supporting the power law).
To more formally test whether the age-specific incidence data \(\:_\) was linear in log-log scale (i.e. power law) or linear in semi-log scale (i.e. exponential), we performed the following log-linear multiple regression:
$$\:}\:\left( } } \right) = \beta \:_ t\: + \beta \:_ }\left( t \right) + \beta \:_ + \varepsilon$$
(5)
where \(\:\varepsilon \sim N(0,\sigma ^ )\) is a normally distributed error term. We thus estimated a Bayes factor (BF10) as posterior probability of the semi-log model (i.e. \(\:_=0\), favoring exponential, H1) compared to the log-log model (\(\:_=0,\) favoring power law, H0). This analysis was performed in JASP (version 0.14.0), which employs a Bayesian inference framework with non-informative or weakly informative priors, and utilizes Markov Chain Monte Carlo sampling for posterior estimation. We specifically used the default Jeffreys-Zellner-Siow (JZS) prior, which places a Cauchy distribution (scale = 0.354) for the standardized regression coefficients and an uninformative Jeffreys prior \(\:p\left(^\right)\propto\:1/^\) on the residual variance, and the beta binomial with a = 1 and b = 1 (i.e. uniform distribution on the number of predictors) for the model prior. In our setting, where only two mutually exclusive single-predictor models are compared, this ensures that the competing models have equal prior probability and thus posterior probability ratios and Bayes factors are equivalent.
Ancillary analysesTo support the robustness of the main results, we performed the following ancillary analyses: Bayes factor robustness check, fractional polynomials, hierarchical mixed-effects models, leave-one-out cross-validation, sensitivity analyses, and validation on real data (dataset 3). The Bayes factor robustness check with JZS prior was performed in JASP. The remaining analyses were performed in Matlab, using the functions ‘fitlm’ for linear models and ‘fitlme’ for hierarchical mixed-effects models.
We first tested the robustness of the Bayes factors estimated in the log-linear multiple regression of the main analyses to the scale of the JZS prior, from very narrow (scale = 0.01) to very broad (scale = 100,000). We also estimated the Bayes factor BF10 as the maximum likelihood ratio between the semi-log model (i.e., exponential, H1) against the log-log model (i.e., power law, H0). While this maximum likelihood ratio does not represent a true Bayes factor (since it does not integrate over the prior), it can approximate it under certain conditions – namely, large sample sizes, regular and sharply peaked likelihoods, and flat (uniform) priors – where the marginal likelihood becomes dominated by the peak of the likelihood function. Note that the uniform prior can be seen as the limit of the JZS prior when the scale goes to infinity, so the maximum likelihood ratio can be considered as an additional robustness check with the widest prior compared to the JZS prior.
As a complementary approach, we also used fractional polynomials [28] to fit the age-incidence data with the Eq.
$$\ln\!\left(I_}\right)=\tilde_\, t^ \;+\; \tilde_ \;+\; \varepsilon$$
(6)
with \(\:^\equiv\:^\) when \(\:x\ne\:0\) and \(\:^\equiv\:\text\text\left(t\right)\), and x ranging from − 2 to 2 with 0.1 intervals. We estimated the optimal exponent xopt that provided the best fitting of the data, as indicated by maximum likelihood. The expected optimal exponent is xopt=0 if incidence increases with age as a power law, whereas it is xopt=1 if incidence increases with age as an exponential function.
As an additional complementary approach, we implemented hierarchical models to account for the nested structure of the data, fitting mixed-effects linear models with random intercepts \(\:_\) and random slopes \(\:_\) for each register to capture the potential differences in the age-incidence relationship across registers. For the hierarchical power law model, we fitted the age-incidence data with the Eq.
$$\:}\left( }_}}} } \right) = \beta \:_ + \beta \:_ }\left( t \right) + (u_} + v_} }\left( } \right)) + \,\varepsilon$$
(7)
For the hierarchical exponential model, we fitted the age-incidence data with the Eq.
$$\:}\left( }_}}} } \right) = \beta \:_ + \beta \:_ t + (u_} + v_} t) + \,\varepsilon$$
(8)
We estimated a Bayes factor BF10 as the likelihood ratio between the hierarchical exponential model (H1) against the hierarchical power law model (H0). Note that for this analysis the data were never pooled across registers.
In order to ensure robustness, we performed leave-one-out cross-validation on the linear log-log and semi-log regressions, as well as the hierarchical models, so that the contribution of each data point to the overall R2 value always comes from the prediction of a model constructed without that data point.
In order to assess to what extent the results depended on the exact age range considered in the analyses, we performed a sensitivity analysis comparing fitting performance (i.e. R2) between the exponential function and the power law when reducing age range (i.e. respecting the acceleration assumption) and when increasing the age range (i.e. violating the acceleration assumption, which is expected to bias the comparison toward the power law).
In dataset 3, all results obtained with the extracted data were corroborated on the real data.
Bayesian evidence in favor to alternative hypothesis (BF10 > 1) or to the null hypothesis (BF01 > 1) is described according to standard levels: anecdotal (1 < BF < 3), moderate (> 3), strong (> 10), very strong (> 30), and extreme (> 100).
Comments (0)