from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold
from econml.dml import LinearDML
from econml.sklearn_extensions.model_selection import GridSearchCVList
23 Model selection
23.1 Prediction model selection
The first stage of DML approaches involves training prediction models to predict \(E[Y|X, W]\) and \(E[T|X, W]\). Suppose you are considering multiple classes of estimators with various sets of hyper-parameter values for each model class. We can use GridSearchCVList()
from the econml
pacakge to conduct cross-validation to see which model works the best in predicting \(E[Y|X, W]\) and \(E[T|X, W]\) in terms of CV MSE. GridSearchCVList()
is an extension of GridSearchCV()
from the sklearn
package. While GridSearchCV()
conducts CV for a single model class with various sets of hyper-parameter values, GridSearchCVList()
conducts CV for multiple classes of models with various sets of hyper-parameter values.
Let’s go through an example use of GridSearchCVList()
. First, we import all the functions we will be using.
Let’s also generate synthetic dataset using make_regression()
.
from sklearn.datasets import make_regression
import numpy as np
#=== set parameters for data generation ===#
= 2000, 20, 15, 2
n_samples, n_features, n_informative, noise = np.random.RandomState(8934)
rng
#=== generate synthetic data ===#
= make_regression(
XT, y
n_samples,
n_features, = n_informative,
n_informative = noise,
noise = rng
random_state
)
= XT[:, 0] # first column as the treatment variable
T = XT[:, 1:] # the rest as X X
Some of the key parameters for GridSearchCVList()
are:
estimator_list
: List of estimators. Each estimator needs to implement the scikit-learn estimator interface.param_grid
: List of the name of parameters to tune with their values for each model.cv
: Specification of how CV is conducted
Let’s define them one by one.
estimator_list
= [
est_list =1000),
Lasso(max_iter
GradientBoostingRegressor(),= 5)
RandomForestRegressor(min_samples_leaf ]
So, the model classes we consider are lasso, random forest, and boosted forest. Note that you can fix the value of parameters that you do not vary in CV. For example, RandomForestRegressor(min_samples_leaf = 5)
sets min_samples_leaf
at 5.
param_grid
= [
par_grid_list "alpha": [0.001, 0.01, 0.1, 1, 10]},
{"max_depth": [3, 5, None], "n_estimators": [50, 100, 200]},
{"max_depth": [3, 5, 10], "max_features": [3, 5, 10, 20]},
{ ]
The \(n\)th entry of param_grid
is for the \(n\)th entry of estimator_list
. For example, two hyper-parameters will be tried for RandomForestRegressor()
: max_depth
and max_features
. The complete combinations of the values from the parameters will be evaluated. For example, here are all the set of parameter values tried for RandomForestRegressor()
.
max_depth max_features
1 3 3
2 5 3
3 10 3
4 3 5
5 5 5
6 10 5
7 3 10
8 5 10
9 10 10
10 3 20
11 5 20
12 10 20
cv
= RepeatedKFold(n_splits = 4, n_repeats = 3, random_state = 123)
rk_cv
#=== check the number of splits ===#
rk_cv.get_n_splits()
12
Cross-validation specification can be done using the sklearn.model_selection
module. Here, we are using RepeatedKFold
, and it is a 4-fold CV repeated 2 times.
Let’s now specify GridSearchCVList()
using the above parameters.
= GridSearchCVList(
first_stage = est_list,
estimator_list = par_grid_list,
param_grid_list= rk_cv,
cv = "neg_mean_squared_error"
scoring )
We now conduct CV with the fit()
method and then access the best estimator by referring to best_estimator_
attribute.
= first_stage.fit(X, y).best_estimator_
model_y model_y
Lasso(alpha=0.1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.1)
We do the same for \(T\).
= first_stage.fit(X, T).best_estimator_
model_t model_t
Lasso(alpha=0.1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.1)
We can now run DML with the best first stage models like below.
#=== set up a linear DML ===#
= LinearDML(
est =model_y,
model_y=model_t
model_t
)
#=== train ===#
=X, W=X) est.fit(Y, T, X
23.2 Causal model selection
from econml.dml import DML, LinearDML, SparseLinearDML, NonParamDML
from econml.metalearners import XLearner, TLearner, SLearner, DomainAdaptationLearner
from econml.dr import DRLearner
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import numpy as np
# split data in train-validation
= train_test_split(X, T, y, test_size=.4)
X_train, X_val, T_train, T_val, Y_train, Y_val
# define list of CATE estimators to select among
= lambda: RandomForestRegressor(min_samples_leaf=20)
reg = lambda: RandomForestClassifier(min_samples_leaf=20)
clf
= [('xlearner', XLearner(models=reg(), cate_models=reg(), propensity_model=clf())),
models 'slearner', SLearner(overall_model=reg()))
(
]
# fit cate models on train data
= [(name, mdl.fit(Y_train, T_train, X=X_train)) for name, mdl in models]
models
# score cate models on validation data
= RScorer(model_y=reg(), model_t=clf(),
scorer =True, cv=5, mc_iters=2, mc_agg='median')
discrete_treatment
=X_val)
scorer.fit(Y_val, T_val, X
= [scorer.score(mdl) for _, mdl in models]
rscore # select the best model
= scorer.best_model([mdl for _, mdl in models])
mdl, _ # create weighted ensemble model based on score performance
= scorer.ensemble([mdl for _, mdl in models]) mdl, _