Flexible Machine Learning Estimation of Conditional Average Treatment Effects: A Blessing and a Curse

The increasing availability of observational data has tremendously boosted the field of machine learning.1 Machine learning provides us with flexible, nonparametric methods to study the observed outcome Y given features X, which may include a treatment (or exposure) A, by statistical inference on the (conditional) distributions Y|X,A=a. Machine learning methods are therefore excellent at predicting future observations that arise from the same (factual) distribution.2 However, it is essential to realize that these models cannot automatically be used to answer “what if” questions about the treatment A, as the associations found are not necessarily causal.2–8 Statistical inference of associations is thus only one step in causal inference and, as such, in counterfactual prediction.9

The critical step for causal inference is linking the distribution of outcomes in a universe where everyone was treated with a, that is, the potential outcomes Ya,10 to the distribution of the observed data. Working with observational data, we have to make causal identification assumptions that cannot be verified with the data, so machine learning alone is insufficient; instead, we have to rely on expert knowledge.11 If the identification assumptions apply, the distributions of potential and observed outcomes can be linked, and observable analogs of the causal estimands can be derived. Accurate statistical inference on these analogs is also necessary for causal inference: when the identification assumptions apply but the statistical inference is inaccurate, the causal inference will also be invalid. The flexibility offered by machine learning methods can improve this statistical inference.6,12 More precisely, machine learning methods can be exploited to learn nuisance parameters of the data-generating distribution, such as conditional means and propensity scores, that can be used to estimate the causal estimand, as is done, for example, in targeted maximum likelihood estimation.13,14

The increasing availability of diverse data makes studying effect heterogeneity among individuals more feasible. The field of precision medicine aims to understand this heterogeneity to improve individual treatment decisions.15 The average treatment effect (ATE), E[Y1−Y0], might differ seriously from the individual treatment effect, that is, the actual change in outcome caused by the exposure for a particular individual i, Yi1−Yi0.16 However, it is well known that an individual treatment effect is not identifiable because of the fundamental problem of causal inference:17 it is impossible to observe the different potential outcomes for one individual jointly. On the other hand, marginalized effects like the ATE and the conditional ATE become identifiable in the absence of unmeasured confounding. In randomized experiments, this unconfoundedness assumption holds by design. Treatment effect heterogeneity studies thus focus on the estimation of individualized conditional ATEs, E[Y1−Y0|X], given measured features X but aggregated over the remaining unmeasured features, as a proxy for the individual treatment effects.18 The functional form of effect modification by different levels of the measured features might be very complex, so machine learning methods are promising tools for estimating conditional ATEs.19

In recent years, several meta-learning strategies for conditional ATE estimation have been proposed. These strategies decompose conditional ATE estimation into regression problems that can be solved with any suitable machine learning method (see Caron et al.20 for a detailed review). T-learners fit separate models for treated and controls and estimate conditional ATEs as the plug-in difference of the conditional mean estimates.21,22 The performance of T-learners depends on the levels of sparsity and smoothness of the conditional means for treated and controls and on the choice of the base learner, and is low when the treated and control samples differ in size.23 S-learners include treatment assignment as another covariate, and the conditional ATE is estimated as the difference of the estimated conditional means for treated and controls.24–27 Estimation with S-learners might suffer from serious finite-sample bias because they do not involve the conditional ATE directly, a problem also known for ATE estimation.28 The R-learner directly identifies the conditional ATE by regressing transformed outcomes on transformed treatment assignments, using estimates of nuisance parameters obtained in a first step (as we will elaborate on in the methods section).29 The R-learner is also called “double machine learning” and may give unbiased estimates of the average causal effect in finite samples, whereas a one-step approach (S-learner) would still be biased.28 Similarly, the DR-learner regresses an augmented inverse probability weighted transformation30 of the observations on the features after constructing estimates of the propensity score and conditional means in a first step.31,32 The cost of making weaker modeling assumptions using flexible machine learning methods is slower convergence rates for the estimators, known as the curse of dimensionality.33 Therefore, much of the ongoing research focuses on comparing the different methods for conditional ATE estimation to derive whether and when they are optimal.31,34–37
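To make these meta-learners concrete, the following sketch (added for illustration; the base learners, function names, and the omission of cross-fitting are our own simplifications, not prescriptions from the cited works) implements a T-learner and a DR-learner with generic scikit-learn regressors, assuming X, A, and Y are numpy arrays.

# Illustrative T-learner and DR-learner sketches; any base learner could be substituted.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def t_learner_cate(X, A, Y, X_new):
    # Fit separate outcome models for treated and controls and take the
    # plug-in difference of the estimated conditional means.
    mu1 = GradientBoostingRegressor().fit(X[A == 1], Y[A == 1])
    mu0 = GradientBoostingRegressor().fit(X[A == 0], Y[A == 0])
    return mu1.predict(X_new) - mu0.predict(X_new)

def dr_learner_cate(X, A, Y, X_new):
    # First step: estimate the nuisance parameters (propensity score and conditional means).
    e = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
    mu1 = GradientBoostingRegressor().fit(X[A == 1], Y[A == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[A == 0], Y[A == 0]).predict(X)
    # Second step: regress the augmented inverse probability weighted pseudo-outcome on X.
    pseudo = mu1 - mu0 + A * (Y - mu1) / e - (1 - A) * (Y - mu0) / (1 - e)
    return GradientBoostingRegressor().fit(X, pseudo).predict(X_new)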

The aim of this work is different, as we want to emphasize the difference between the conditional ATE and the individual treatment effect. The conditional ATE is much more personalized than the ATE and thus an important step towards precision medicine. However, it concerns us that the conditional ATE is sometimes interpreted as the individual treatment effect.38 Whether the conditional ATE can appropriately approximate the individual treatment effect depends on the remaining variability of causal effects given the considered modifiers; for example, a conditional ATE ≥0 given X=x39 does not imply that all individual treatment effects are ≥0 for those individuals.40 In this work, we investigate whether we can use a causal random forest41 to estimate the variance of the marginal individual treatment effect distribution. More specifically, we investigate the performance of the causal random forest in estimating var(Y1−Y0|X=x) and var(Y1−Y0) for data simulated from a causal system based on a real case study.
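For orientation, the sketch below shows how conditional ATEs are typically obtained from an off-the-shelf causal forest implementation; it is an illustration only, assuming the Python package econml is available (the analyses in this article do not rely on this code), and it makes explicit that such estimators target the conditional mean τ(x), not the remaining variability of the individual treatment effect around it.

# Illustrative causal forest fit on toy data (CausalForestDML from econml is one
# possible implementation); all variable names and parameter values are hypothetical.
import numpy as np
from econml.dml import CausalForestDML

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))                                    # measured features
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))                # confounded binary treatment
Y = X[:, 1] + (0.5 + 0.3 * X[:, 0]) * A + rng.normal(size=n)   # heterogeneous treatment effect

cf = CausalForestDML(discrete_treatment=True, random_state=0)
cf.fit(Y, A, X=X)
tau_hat = cf.effect(X)  # estimates of the conditional ATE E[Y1 - Y0 | X = x];
                        # the spread of Y1 - Y0 around tau(x) is not returned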

To open up the field of individual treatment effect distribution estimation, we derive identification assumptions, in addition to those necessary for marginal causal inference, that identify other characteristics of the conditional individual treatment effect distribution. To give an idea of how such assumptions on the joint distribution of potential outcomes can advance the field of treatment effect heterogeneity, we extend the causal random forest algorithm to estimate the variance of the individual treatment effect given the measured features. First, we introduce our notation and present the identification assumptions necessary for conditional ATE estimation. Subsequently, we present the results of fitting the causal random forest to datasets simulated under different settings. Thereafter, we introduce the causal assumption needed to identify the conditional variance of the individual treatment effect. Moreover, we extend the causal random forest to estimate the latter and present its performance on the simulated datasets. Finally, we present some concluding remarks.

NOTATION AND METHODS

Probability distributions of factual and counterfactual outcomes are defined in the potential outcome framework.42,43 Let Yi and Ai represent the (factual) stochastic outcome and the random treatment assignment level of individual i. Let Yia equal the potential outcome under an intervention setting the treatment to level a (which is counterfactual when Ai≠a). We thus rely on a deterministic potential outcome framework, where each level of treatment corresponds to only one outcome for each individual (but its value typically differs between individuals).44,45

We will consider only two treatment levels, with 0 indicating no treatment. The individual causal effect of an arbitrary individual i equals Yi1−Yi0. When we discuss the random variable describing the heterogeneity in the population, we omit the subscript. Identification assumptions are necessary to relate the distribution of potential outcomes to the distribution of observed outcomes. First of all, it is necessary to have access to a set of measured features X such that the treatment assignment is conditionally independent of the potential outcomes.

Assumption 1. Conditional Exchangeability

A⊥Y0,Y1|X

This independence is called conditional exchangeability (or unconfoundedness) and implies the absence of unmeasured confounding; this assumption cannot be verified with observational data.10 Then there are no features, other than X, that Y0 or Y1 depend on and that differ in distribution between individuals with A=1 and A=0. Because we are interested in causal effect heterogeneity, the set of features X will also contain modifiers Xm (i.e., ∃x1,x2: E[Y1−Y0|Xm=x1]≠E[Y1−Y0|Xm=x2]46) next to the confounders that are necessary to obtain independence. A feature can be only a modifier, only a confounder, or both, all on the additive scale. For a feature L that is only a confounder and not a modifier, ∀l: E[Y1−Y0|L=l,Xm=x]=E[Y1−Y0|Xm=x], where Xm represents the feature set X without L.

Furthermore, we need to assume that an observed outcome equals the potential outcome for the assigned treatment, referred to as causal consistency.47

Assumption 2. Causal Consistency

Yi=YiAi

Causal consistency is also referred to as the stable unit treatment value assumption.48 It implies that potential outcomes are independent of the treatment levels of other individuals (no interference) and that there are no different versions of the exposure levels. Causal consistency also cannot be verified with data.

Finally, the probability of receiving treatment should be bounded away from 0 and 1 for all levels of X, which is referred to as positivity10 and is also known as overlap.48

Assumption 3. Positivity

∀x:0<P(A=1|X=x)<1
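Positivity can be probed empirically, even though near-violations remain a practical concern whenever estimated treatment probabilities approach 0 or 1. The sketch below (added for illustration; it is not part of the original analysis, and the function and variable names are our own) estimates the propensity score with a logistic regression and inspects how extreme the estimates become.

# Illustrative overlap check: estimated propensity scores should stay away from 0 and 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def check_overlap(X, A):
    # X: numpy array of features, A: binary treatment indicator
    e_hat = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
    print(f"estimated propensity scores in [{e_hat.min():.3f}, {e_hat.max():.3f}]")
    # Estimates very close to 0 or 1 flag (near-)violations of positivity, which
    # destabilize weighting-based estimators of the (conditional) ATE.
    return e_hat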

As in Section 6 of Athey et al.,41 by causal consistency, we use the parameterization

Yi=Yi0+biAi, (1)

where bi is the individual treatment effect of individual i, so that Yi1=Yi0+bi. The conditional mean of bi given features Xi equals the conditional ATE τ(Xi), where τ(x)=E[Y1−Y0|X=x]. The individual treatment effect can thus be divided into τ(Xi) and the individual deviation from the conditional ATE, which is referred to as U1i. We rewrite Equation (1) as

Yi=θ0(Xi)+NYi+(τ(Xi)+U1i)Ai,

where θ0(x)=E[Yi0|Xi=x], NYi represents the deviation of Yi0 from θ0(Xi), τ(x)=E[bi|Xi=x], E[NYi|Xi=x]=0, and E[U1i|Xi=x]=0. In this parameterization, the individual Y0 and the individual effect b have each been rewritten as the sum of their conditional expectation and a zero-mean deviation from that expectation. Note that characteristics other than the mean of the NY|X=x and U1|X=x distributions can depend on the value of x. Furthermore, U1 and NY can be dependent.
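To make explicit what this parameterization implies, we note the following consequences (a short derivation added here for clarity; it follows directly from Assumptions 1 and 2 and the definitions above): the observed-data regression identifies θ0 and τ, while the remaining spread of the individual treatment effect around τ(x) is governed by U1,

E[Y|X=x,A=a] = θ0(x) + τ(x)a,
var(Y1−Y0|X=x) = var(U1|X=x),
var(Y1−Y0) = var(τ(X)) + E[var(U1|X)].

Any method that targets only this conditional mean structure can thus recover τ(x) but is silent about var(U1|X=x), the quantity addressed by the extension studied in this work.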

Case Study and Data Simulation

To illustrate how the random conditional expectation E[Y1−Y0|X] and the individual treatment effect Y1−Y0 may differ in distribution, we simulate data based on the Framingham Heart Study.49 We focus on heterogeneity in the effect of nonalcoholic fatty liver disease on a clinical precursor to heart failure, the left ventricular filling pressure.50 The association found in the original work was adjusted for several features; however, for this illustration, we assume that only gender (male = 0 and female = 1) and systolic blood pressure (SBP, mmHg) are confounders. We will simulate the following cause-effect relations:

Ai = 1{NAi < expit(α0 + αgenXgen,i + αSBPXSBP,i)}

Yi0=β0+βgenXgen,i+βSBPXSBP,i+NYi

Yi1=Yi0+(τ0+τgenXgen,i+τSBPXSBP,i+U1i),

where Xgen,i∼Ber(p), XSBP,i∼N(0,1), U1i∼N(0,σ1²), NYi∼N(0,σ0²), NAi∼Uni[0,1], and U1i⊥NYi. Moreover, there is no unmeasured confounding, that is, NAi⊥(NYi,U1i), so that Ai⊥(Yi1,Yi0)|Xgen,i,XSBP,i. By causal consistency, the observed outcome Yi=YiAi equals Yi1 when Ai=1 and Yi0 when Ai=0. The parameter values are obtained by fitting a linear mixed model for the relation between fatty liver disease and the left ventricular filling pressure, adjusted for standardized SBP and gender,

Yi = β0 + βgenXgen,i + βSBPXSBP,i + NYi + (τ0 + τgenXgen,i + τSBPXSBP,i + U1i)Ai,

to the subset of Framingham Heart Study participants (n = 2356) used by Chiu et al.50 The distribution of Y1−Y0, using the parameters obtained with PROC LOGISTIC and PROC MIXED in SAS, is shown in Figure 1, where E[Y1−Y0]=0.5, var(Y1−Y0)=1.41, and P(Y1−Y0>0)=0.64. Figure 1 also shows the distribution of the conditional expectation E[Y1−Y0|XSBP,Xgen], which has a standard deviation equal to 0.16 and P(E[Y1−Y0|XSBP,Xgen]>0)=1.00. The conditional expectation distribution differs seriously from that of the individual treatment effect due to the unmeasured (remaining) effect heterogeneity (U1). For completeness, the distributions of Y1 and Y0 are presented in eFigure 1 in eAppendix B; https://links.lww.com/EDE/C83. Moreover, we simulate X0, a measured variable associated with the level of the individual modifier U1, with (U1,X0)T∼N(0,Σ), where Σ has entries var(U1)=σ1², var(X0)=δ²σ1², and cov(U1,X0)=ρδσ1². For ρ>0,
