23  Model selection

23.1 Prediction model selection

The first stage of DML approaches involves training prediction models for \(E[Y|X, W]\) and \(E[T|X, W]\). Suppose you are considering multiple classes of estimators, each with various sets of hyper-parameter values. We can use GridSearchCVList() from the econml package to conduct cross-validation and find the model that predicts \(E[Y|X, W]\) and \(E[T|X, W]\) best in terms of CV MSE. GridSearchCVList() is an extension of GridSearchCV() from the sklearn package: while GridSearchCV() conducts CV for a single model class over various sets of hyper-parameter values, GridSearchCVList() conducts CV for multiple model classes, each with its own hyper-parameter grid.
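
As a point of contrast, here is a minimal sketch of plain GridSearchCV(), which tunes a single model class over a single parameter grid (the hyper-parameter values here are purely illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

#=== one model class, one parameter grid ===#
gcv = GridSearchCV(
  RandomForestRegressor(),
  param_grid = {"max_depth": [3, 5, 10]},
  scoring = "neg_mean_squared_error",
  cv = 4
)

GridSearchCVList() generalizes this by accepting a list of estimators and a matching list of parameter grids, as we demonstrate below.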

Let’s go through an example use of GridSearchCVList(). First, we import all the functions we will be using.

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold
from econml.dml import LinearDML
from econml.sklearn_extensions.model_selection import GridSearchCVList

Let’s also generate a synthetic dataset using make_regression().

from sklearn.datasets import make_regression
import numpy as np

#=== set parameters for data generation ===#
n_samples, n_features, n_informative, noise = 2000, 20, 15, 2
rng = np.random.RandomState(8934)

#=== generate synthetic data ===#
XT, y = make_regression(
  n_samples, 
  n_features, 
  n_informative = n_informative, 
  noise = noise, 
  random_state = rng
)

T = XT[:, 0] # first column as the treatment variable
X = XT[:, 1:] # the rest as X

Some of the key parameters for GridSearchCVList() are:

  • estimator_list: List of estimators. Each estimator needs to implement the scikit-learn estimator interface.
  • param_grid_list: List of parameter grids, one per estimator in estimator_list. Each grid maps the names of the parameters to tune to their candidate values.
  • cv: Specification of how CV is conducted.

Let’s define them one by one.

estimator_list

est_list = [
    Lasso(max_iter=1000), 
    GradientBoostingRegressor(),
    RandomForestRegressor(min_samples_leaf = 5)
]

So, the model classes we consider are lasso, gradient boosting, and random forest. Note that you can fix the value of parameters that are not varied in CV. For example, RandomForestRegressor(min_samples_leaf = 5) fixes min_samples_leaf at 5.
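
You can confirm that a fixed parameter is indeed carried along with the standard scikit-learn get_params() method (an illustrative check):

#=== check the fixed parameter of the random forest ===#
est_list[2].get_params()["min_samples_leaf"]
5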

param_grid_list

par_grid_list = [
    {"alpha": [0.001, 0.01, 0.1, 1, 10]},
    {"max_depth": [3, 5, None], "n_estimators": [50, 100, 200]},
    {"max_depth": [3, 5, 10], "max_features": [3, 5, 10, 20]},
]

The \(n\)th entry of param_grid_list is for the \(n\)th entry of estimator_list. For example, two hyper-parameters are tuned for RandomForestRegressor(): max_depth and max_features. All combinations of the candidate values (the Cartesian product) are evaluated. Here are all the sets of parameter values tried for RandomForestRegressor() (a programmatic way to enumerate them follows the table).

   max_depth max_features
1          3            3
2          5            3
3         10            3
4          3            5
5          5            5
6         10            5
7          3           10
8          5           10
9         10           10
10         3           20
11         5           20
12        10           20
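
If you want to enumerate such combinations programmatically, sklearn’s ParameterGrid utility expands a grid into the same 12 dictionaries (shown here purely for illustration):

from sklearn.model_selection import ParameterGrid

#=== expand the random forest grid ===#
rf_grid = ParameterGrid({"max_depth": [3, 5, 10], "max_features": [3, 5, 10, 20]})
len(rf_grid)
12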

cv

rk_cv = RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 123)

#=== check the number of splits ===#  
rk_cv.get_n_splits()
12

Cross-validation specification can be done using the sklearn.model_selection module. Here, we are using RepeatedKFold with 4-fold CV repeated 3 times, which yields \(4 \times 3 = 12\) train-test splits in total.
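
To see what these splits look like (an illustrative check), you can iterate over them directly. With \(n = 2000\) observations and 4 folds, each split has 1,500 training and 500 test observations:

#=== inspect the first of the 12 splits ===#
train_idx, test_idx = next(iter(rk_cv.split(X)))
print(len(train_idx), len(test_idx))
1500 500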

Let’s now specify GridSearchCVList() using the above parameters.

first_stage = GridSearchCVList(
    estimator_list = est_list,
    param_grid_list = par_grid_list,
    cv = rk_cv,
    scoring = "neg_mean_squared_error"
)

We now conduct CV with the fit() method and then access the best estimator via the best_estimator_ attribute.

model_y = first_stage.fit(X, y).best_estimator_
model_y
Lasso(alpha=0.1)
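
Assuming GridSearchCVList() exposes the familiar GridSearchCV()-style attributes (it mirrors that interface), you can also inspect the winning hyper-parameters and CV score:

#=== winning hyper-parameters and CV score ===#
first_stage.best_params_   # e.g., {'alpha': 0.1}
first_stage.best_score_    # CV score of the best model (negative MSE here)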

We do the same for \(T\).

model_t = first_stage.fit(X, T).best_estimator_
model_t
Lasso(alpha=0.1)

We can now run DML with the best first-stage models, as shown below.

#=== set up a linear DML ===#
est = LinearDML(
    model_y=model_y, 
    model_t=model_t
)

#=== train ===#
est.fit(y, T, X=X, W=X)
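
Once fitted, CATE estimates and their confidence intervals can be retrieved with LinearDML’s effect() and effect_interval() methods, as in this usage sketch:

#=== CATE estimates for each observation ===#
cate = est.effect(X)

#=== 95% confidence intervals for the CATE estimates ===#
lb, ub = est.effect_interval(X, alpha = 0.05)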

23.2 Causal model selection

Model selection is also needed at the causal (second) stage: given several fitted CATE estimators, which one captures treatment effect heterogeneity best? The econml package provides RScorer for this purpose, which scores fitted CATE models on a held-out validation sample using an R-loss criterion based on the R-learner framework (higher scores are better). We start by importing what we need.

from econml.metalearners import XLearner, SLearner
from econml.score import RScorer

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split

import numpy as np

# the metalearners and RScorer below assume a discrete treatment, so we
# binarize the continuous treatment from the previous section (illustrative)
T_bin = (T > np.median(T)).astype(int)

# split data into train and validation sets
X_train, X_val, T_train, T_val, Y_train, Y_val = train_test_split(
    X, T_bin, y, test_size = 0.4, random_state = 123
)

# define list of CATE estimators to select among
reg = lambda: RandomForestRegressor(min_samples_leaf=20)
clf = lambda: RandomForestClassifier(min_samples_leaf=20)

models = [('xlearner', XLearner(models=reg(), cate_models=reg(), propensity_model=clf())),
          ('slearner', SLearner(overall_model=reg()))
]

# fit cate models on train data
models = [(name, mdl.fit(Y_train, T_train, X=X_train)) for name, mdl in models]

# score cate models on validation data
scorer = RScorer(model_y=reg(), model_t=clf(),
                 discrete_treatment=True, cv=5, mc_iters=2, mc_agg='median')

scorer.fit(Y_val, T_val, X=X_val)

rscore = [scorer.score(mdl) for _, mdl in models]

# select the best model (returns the model and its score)
best_mdl, _ = scorer.best_model([mdl for _, mdl in models])

# create a weighted ensemble model based on score performance
ensemble_mdl, _ = scorer.ensemble([mdl for _, mdl in models])
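
To compare the candidates directly (a small usage sketch with the objects defined above), pair each score with its model name; RScorer returns higher scores for better-fitting CATE models:

#=== report the score of each candidate CATE model ===#
for (name, _), score in zip(models, rscore):
    print(f"{name}: {score:.4f}")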