Causal inference is a central concern of economics. In a complex world with many interacting factors, it is extremely valuable to know what actually causes economic phenomena, and economists have developed many methods for answering such questions. One recent method is debiased machine learning, which leverages the strength of machine learning in prediction tasks to accurately estimate causal parameters.
The basic problem of causal inference is to find the average treatment effect across a population when all other factors are held constant. We are essentially interested in comparing two nearly identical worlds that differ only in whether the treatment is applied. One way to approach this is to use many covariates to control for confounding. With a sophisticated machine learning model, it is possible to predict an outcome variable from a large number of input variables. If we believe that this model accurately captures confounding, then we can estimate a causal treatment effect by comparing two inputs that are identical except for the treatment variable. One such model is the conditional expectation function \(\gamma_0(D,X) = \mathbb{E}[Y|D,X]\). Let our data consist of inputs \(W_i = (D_i,X_i)\) and outcomes \(Y_i\). Then we can use the equation
\(\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} m(W_i,\hat{\gamma})\), where \(m(W_i,\hat{\gamma}) = \hat{\gamma}(1, X_i)-\hat{\gamma}(0, X_i)\), to estimate \(\theta\). However, this plug-in approach is biased and does not return an accurate estimate of \(\theta\). This is where debiased machine learning comes in. Debiased machine learning adjusts the estimate above to remove the bias using the following equation:
\[\hat{\theta} = \frac{1}{n}\sum_{l=1}^{L}\sum_{i \in I_l} \left[ m(W_i,\hat{\gamma}_l)+\hat{\alpha}_l(W_i)\big(Y_i-\hat{\gamma}_l(W_i)\big) \right]\]
where \(\alpha\) is a learned function called the Riesz representer (discussed in Appendix C). Using this insight into how to remove bias, we can calculate causal parameters with debiased machine learning. The algorithm, implemented by the code later in this guide, is as follows:

1. Split the data into \(L\) folds \(I_1,\dots,I_L\).
2. For each fold \(I_l\), estimate \(\hat{\gamma}_l\) and \(\hat{\alpha}_l\) by machine learning on the observations outside \(I_l\).
3. Evaluate the debiased moment \(m(W_i,\hat{\gamma}_l)+\hat{\alpha}_l(W_i)(Y_i-\hat{\gamma}_l(W_i))\) for each \(i \in I_l\).
4. Average these values across all observations to obtain \(\hat{\theta}\), and use their variance to compute a standard error.
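A minimal schematic of this cross-fitting procedure in R is below. It is only a sketch: fit_gamma, fit_alpha, and m_hat are hypothetical stand-ins for the stage-one estimators and moment function implemented later in this guide (for example, lasso via RMD_stable and m.ATE).
#Schematic cross-fitted DML estimator; fit_gamma, fit_alpha, m_hat are hypothetical helpers
dml_sketch <- function(Y, W, fit_gamma, fit_alpha, m_hat, L = 5){
n <- nrow(W)
folds <- split(sample(n), rep(1:L, length.out = n)) #random partition into L folds
psi <- numeric(0)
for (l in 1:L){
test <- folds[[l]] #fold used for evaluation
gamma_l <- fit_gamma(Y[-test], W[-test, , drop = FALSE]) #outcome model learned off-fold
alpha_l <- fit_alpha(Y[-test], W[-test, , drop = FALSE]) #Riesz representer learned off-fold
psi_l <- m_hat(W[test, , drop = FALSE], gamma_l) +
alpha_l(W[test, , drop = FALSE]) * (Y[test] - gamma_l(W[test, , drop = FALSE]))
psi <- c(psi, psi_l) #debiased moment values for this fold
}
theta_hat <- mean(psi)
se <- sqrt(mean((psi - theta_hat)^2) / n)
c(theta = theta_hat, se = se)
}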
We aren't limited to calculating average treatment effects with debiased machine learning. By simply changing \(\gamma\) and \(m\) in the above algorithm, we can calculate other effects of interest. In this guide, we will also cover how to calculate elasticities. To do this, we redefine \(\gamma(p_i,X_i) = \mathbb{E}[G_i|p_i,X_i]\) to learn how much gas \(G_i\) someone will buy given the price \(p_i\) and other confounding factors \(X_i\). Using partial differences to estimate the price effect, we set \[m(p_i,X_i,\gamma) = \frac{\gamma(p_i+\delta/2,X_i)-\gamma(p_i-\delta/2,X_i)}{\delta},\] where \(\delta\) is a small price increment. Because both gas consumption and price are measured in logs in the data used below, this average derivative can be interpreted as an elasticity.
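As a quick illustration of the partial-difference formula, the snippet below applies it to a simple hypothetical function standing in for the learned \(\gamma\); the coefficients are made up purely for illustration.
#Central-difference approximation used for m in the elasticity case (toy example)
gamma_toy <- function(p, x){ 2 - 0.6*p + 0.1*x } #hypothetical fitted regression of log gas on log price
delta <- 0.05
(gamma_toy(1.2 + delta/2, 0.3) - gamma_toy(1.2 - delta/2, 0.3)) / delta #returns -0.6, the price coefficient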
This guide walks through R code implementing the debiased machine learning algorithm above for both the ATE and elasticities. The ATE example looks at the impact of 401(k) eligibility on net financial assets, and the elasticity example uses household gasoline price data. This guide has two goals:
The text above each code block provides an overview of what that code does; details are explained in comments within the code. The appendix walks through how to adapt the code for use with your own data.
This guide focuses very little on the math behind debiased machine learning. Interested readers will find some discussion in the appendix, but the best source is the original paper on debiased machine learning at https://arxiv.org/abs/1608.00060.
Below are the necessary packages and basic functions, with short descriptions above each function. For the code to work, change the working directory at the end of this code block to the directory containing the downloaded 401(k) or gas data (or your own dataset). One potential point of confusion is that many functions' first two arguments are "args" and "type". Because the basic algorithm is so similar for the ATE and for elasticities, it is more efficient to pass a generic argument list than to write separate functions for the two cases, which take different arguments. Written this way, the functions simply pass "args" through and only differentiate between ATE and elasticities in the lowest-level functions (such as get_MNG). When type = "ATE", the argument list is args = list(T); when type = "PD", it is args = list(X.up, X.down, delta), as shown in the sketch below.
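For concreteness, here is how the two argument lists are constructed in the driver code near the end of this guide (T is the treatment vector; X.up, X.down, and delta come from get_data in the elasticity case):
#Argument list for the ATE case: only the treatment vector is needed
args = list(T = T)
#Argument list for the elasticity (PD) case: shifted design matrices and the step size
args = list(X.up = X.up, X.down = X.down, delta = delta)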
rm(list=ls())
###########
#Libraries#
###########
library("foreign")
library("dplyr")
library("ggplot2")
library("quantreg") #used for rq.fit.sfn
library("nnet") #used for mulitnom
library("randomForest")
library("keras")
#################
#Basic Functions#
#################
#Euclidean Norm
two.norm <- function(x){
return(sqrt(x %*% x))
}
#dictionary that converts d and z to vector with intercept term
#used to calculate alpha and gamma
b<-function(d,z){
return(c(1,d,z))
}
#dictionary that converts d and z to vector with intercept term and interaction term
#used to calculate alpha and gamma
b2<-function(d,z){
return(c(1,d,z,d*z))
}
#identity dictionary
#used with elasticities
b.PD <- function(d,z){
return(d)
}
#m function mentioned in the introduction for ATE
m.ATE<-function(y,d,z,gamma){ #all data arguments to make interchangeable with m2
return(gamma(1,z)-gamma(0,z))
}
#m function for finding elasticities
m.PD<-function(y,x,x.up,x.down,delta,gamma){ #all data arguments to make interchangeable with m2
return((gamma(x.up,0)-gamma(x.down,0))/delta) #0s in gamma are placeholders because gamma requires two arguments
}
#used to calculate matrix N, which is used to calculate gamma_hat and perform regression
m2<-function(y,d,z,gamma){
return(y*gamma(d,z))
}
#returns the length of the dictionary b (for ATE) or the number of columns of X (for PD)
get.p <- function(args,type,X,b){
if (type == "ATE"){
return(length(b(args$T[1],X[1,])))
} else {
return(ncol(X))
}
}
#returns number of data points
get.n <- function(args,type,X){
if (type == "ATE"){
return(length(args$T))
} else {
return(nrow(X))
}
}
#psi_tilde for debiased machine learning (the debiased moment function evaluated at one observation)
psi_tilde<-function(args,type,y,z,alpha,gamma){
if (type=="ATE"){
d = args$T
#print(d)
return(m.ATE(y,d,z,gamma)+alpha(d,z)*(y-gamma(d,z)))
} else {
x.up = args$X.up
x.down = args$X.down
delta = args$delta
return(m.PD(y,z,x.up,x.down,delta,gamma)+alpha(z,0)*(y-gamma(z,0)))
}
}
#psi_tilde without the debiasing correction (plug-in estimator). Used to demonstrate the shortfall of not accounting for bias; does not achieve debiased machine learning
psi_tilde_bias<-function(args,type,y,z,alpha,gamma){
if (type=="ATE"){
d = args$T #use the treatment value passed in args (matches psi_tilde above)
return(m.ATE(y,d,z,gamma))
} else {
x.up = args$X.up
x.down = args$X.down
delta = args$delta
return(m.PD(y,z,x.up,x.down,delta,gamma))
}
}
#Gets the M, N and G matrices needed to calculate the Riesz representer (see Appendix C) and gamma. Outputs: M_hat (p x 1), N_hat (p x 1), G_hat (p x p), and the dictionary matrix B (n x p)
get_MNG<-function(args, type, Y,X,b){
if (type == "ATE"){
T = args$T
p=length(b(T[1],X[1,]))
n.nl=length(T)
B=matrix(0,n.nl,p)
M=matrix(0,p,n.nl)
N=matrix(0,p,n.nl)
#Takes average by collecting into a matrix and averaging the rows
for (i in 1:n.nl){
B[i,]=b(T[i],X[i,])
M[,i]=m.ATE(Y[i],T[i],X[i,],b)
N[,i]=m2(Y[i],T[i],X[i,],b)
}
M_hat=rowMeans(M)
N_hat=rowMeans(N)
G_hat=t(B)%*%B/n.nl
return(list(M_hat,N_hat,G_hat,B))
} else {
X.up = args$X.up
X.down = args$X.down
delta = args$delta
#print(X.up)
#print(X.down)
p=ncol(X)
n=nrow(X)
M=matrix(0,p,n)
N=matrix(0,p,n)
for (i in 1:n){ #simplifications since dictionary b is the identity
M[,i]=(X.up[i,]-X.down[i,])/delta #since m(w,b)=(x.up-x.down)/delta
N[,i]=Y[i]*X[i,] #since m2(w,b)=y*x
}
M_hat=rowMeans(M)
N_hat=rowMeans(N)
G_hat=t(X)%*%X/n
return(list(M_hat,N_hat,G_hat,X))
}
}
setwd("~/Documents/research/rrr_tutorial")
The following "get_data" function will prepare the data for debiased machine learning. This code splits the data into three components for ATE:
And into five components for elasticities:
get_data<-function(df,spec,quintile,type = "ATE"){
### Inputs
### df: dataset that is being used. In this case a .dta but could be another format
### spec: value of 1,2,or 3 for ATE or 1,2 for PD
### quintile: integer in [0,5]. 0 means all data, 1-5 indicates which quintile of income distribution to filter for
### type: what quantity you're trying to estimate (ATE or PD)
### Output
### List of (Y,D,X) or (Y,X,X.up,X.down,delta)
if (type == "ATE"){
Y <- df[,"net_tfa"]
D <- df[,"e401"]
# All three specifications incorporate the same variables, but differ in the number of higher order terms. X.L has no higher order terms. X.H has higher order terms up to the 8th degree for some variables. X.vH incorporates higher order and interaction terms.
## low dim specification
X.L <- cbind(df[,"age"], df[,"inc"], df[,"educ"], df[,"fsize"], df[,"marr"], df[,"twoearn"], df[,"db"], df[,"pira"], df[,"hown"])
## high dim specification.
X.H <- cbind(poly(df[,"age"], 6, raw=TRUE),
poly(df[,"inc"], 8, raw=TRUE),
poly(df[,"educ"], 4, raw=TRUE),
poly(df[,"fsize"], 2, raw=TRUE),
df[,"marr"], df[,"twoearn"], df[,"db"], df[,"pira"], df[,"hown"])
## very high dim specification
X.vH=model.matrix(~(poly(df[,"age"], 6, raw=TRUE) +
poly(df[,"inc"], 8, raw=TRUE) +
poly(df[,"educ"], 4, raw=TRUE) +
poly(df[,"fsize"], 2, raw=TRUE) +
df[,"marr"] +
df[,"twoearn"] +
df[,"db"] +
df[,"pira"] +
df[,"hown"])^2)
X.vH=X.vH[,-1]
#Selects the desired specification
if (spec==1){
X=X.L
} else if (spec==2) {
X=X.H
} else {
X=X.vH
}
X <- scale(X,center=TRUE,scale=TRUE) #Centers and scales X
n=nrow(X)
#Impose common support. More discussion in Appendix B.
p.1 <- multinom(D~X-1, trace=FALSE)$fitted.values
indexes.to.drop <- which(p.1 < min(p.1[D==1]) | max(p.1[D==1]) < p.1)
if (length(indexes.to.drop)==0) {indexes.to.drop <- n+1} #R throws a wobbly if [-indexes.to.drop] is negating an empty set.
n.per.treatment <- as.vector(table(D[-indexes.to.drop]))
n.trim <- n.per.treatment[1]+n.per.treatment[2]
Y.trimmed=Y[-indexes.to.drop]
D.trimmed=D[-indexes.to.drop]
X.trimmed=X[-indexes.to.drop,]
if (spec==1){
inc=X.trimmed[,2]
} else if (spec==2) {
inc=X.trimmed[,7]
} else {
inc=X.trimmed[,7]
}
#Filters for desired quintile
if (quintile>0){
q <- ntile(inc, 5)
Y.q=Y.trimmed[q==quintile]
D.q=D.trimmed[q==quintile]
X.q=X.trimmed[q==quintile,]
} else {
Y.q=Y.trimmed
D.q=D.trimmed
X.q=X.trimmed
}
return(list(Y.q,D.q,X.q))
} else {
names(df)
N=nrow(df)
# take logs of continuous vars
cols.to.log<-c("gas","price","income","age","distance")
df[,cols.to.log]<-lapply(df[,cols.to.log],log)
# output
Y=df$gas
# construct vars
df$age2<-((df$age)^2)
df$income2<-((df$income)^2)
# ensure diff memory location
df.up<-data.frame(df)
df.down<-data.frame(df)
# construct df.up and df.down
prices<-df$price
delta=sd(prices)/4
df.up$price<-prices+delta/2
df.down$price<-prices-delta/2
df$price2<-(prices)^2
df.up$price2<-(df.up$price)^2
df.down$price2<-(df.down$price)^2
# specification
if (spec==1){
formula<- ~ (factor(driver)+factor(hhsize)+factor(month)+factor(prov)+factor(year))+price+price2+
price:((factor(driver)+factor(hhsize)+factor(month)+factor(prov)+factor(year)))+
price2:((factor(driver)+factor(hhsize)+factor(month)+factor(prov)+factor(year)))+urban+youngsingle+distance+distance^2+
age+age2 + income +income2
} else {
formula<- ~ (factor(driver)+factor(hhsize)+factor(month)+factor(prov)+factor(year))+price+price2+
price:((factor(driver)+factor(hhsize)+factor(month)+factor(prov)+factor(year))+age+age2+income+income2)+
price2:((factor(driver)+factor(hhsize)+factor(month)+factor(prov)+factor(year))+age+age2+income+income2)+
urban+youngsingle+distance+distance^2+
age+age2 + income +income2
}
regressors<-model.matrix(formula,data=df)
regressors.up<-model.matrix(formula, data=df.up)
regressors.down<-model.matrix(formula, data=df.down)
# check
dim(regressors)
dim(regressors.up)
dim(regressors.down)
if (quintile>0){
q <- ntile(df$income, 5)
Y.q=Y[q==quintile]
regressors.q=regressors[q==quintile,]
regressors.up.q=regressors.up[q==quintile,]
regressors.down.q=regressors.down[q==quintile,]
} else {
Y.q=Y
regressors.q=regressors
regressors.up.q=regressors.up
regressors.down.q=regressors.down
}
return(list(Y.q,regressors.q,regressors.up.q,regressors.down.q,delta))
}
}
Once the data is in the correct form, we can begin stage one of the algorithm: estimating \(\alpha\) and \(\gamma\) using machine learning. The following code sets up the machine learning models that can be applied here. "RMD_dantzig" and "RMD_lasso" implement the Dantzig selector and lasso estimators, respectively. "RMD_stable" then implements the iterative algorithm that finds \(\hat{\rho}\), the coefficient vector of \(\hat{\alpha}\), and \(\hat{\beta}\), the coefficient vector of \(\hat{\gamma}\). Understanding "RMD_dantzig", "RMD_lasso", and "get_D" in depth is not essential; they are helper functions that estimate these coefficient vectors and rescale the regularization penalty.
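For readers who want the connection to Appendix C: RMD_lasso can be read as coordinate descent on a penalized version of the loss derived there. A sketch of the objective it appears to minimize, with the penalty weights \(\hat{D}_j\) produced by get_D (and a reduced weight on the intercept term), is
\[
\hat{\rho} \;=\; \arg\min_{\rho}\;\Big\{-2\rho^{\prime}\hat{M} + \rho^{\prime}\hat{G}\rho + 2\lambda\sum_{j}\hat{D}_j\,|\rho_j|\Big\},
\]
with the analogous problem for \(\hat{\beta}\) using \(\hat{N}\) in place of \(\hat{M}\).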
l=0.1
#Dantzig selector: estimates a coefficient vector subject to moment constraints scaled by lambda
RMD_dantzig <- function(M, G, D, lambda=0, sparse = TRUE) {
p <- ncol(G)
zp <- rep(0, p)
L <-c(l,rep(1,p-1)) #dictionary is ordered (constant,T,X,TX)
A <- solve(diag(D),G)
R <- rbind(A, -A)
a <- solve(diag(D),M)
r <- c(a - lambda*L, -a - lambda*L)
if(sparse) {
Ip <- as(p, "matrix.diag.csr")
R <- as.matrix.csr(R)
f <- rq.fit.sfnc(Ip, zp, R = R, r = r)
} else {
Ip <- diag(p)
f <- rq.fit.fnc(Ip, zp, R = R, r = r)
}
return(f)
}
#Lasso estimator: coordinate descent with soft-thresholding at penalty level lambda
RMD_lasso <- function(M, G, D, lambda=0, control = list(maxIter = 1000, optTol = 10^(-5),
zeroThreshold = 10^(-6)), beta.start = NULL) {
p <- ncol(G)
Gt<-G
Mt<-M
L <-c(l,rep(1,p-1)) #dictionary is ordered (constant,...)
lambda_vec=lambda*L*D #v3: insert D here
if (is.null(beta.start)) {
beta <- rep(0,p) #vs low-dimensional initialization
}
else {
beta <- beta.start
}
wp <- beta
mm <- 1
while (mm < control$maxIter) {
beta_old <- beta
for (j in 1:p) {
rho=Mt[j]-Gt[j,]%*%beta+Gt[j,j]*beta[j]
z=Gt[j,j]
if (sum(is.na(rho)) >= 1) {
beta[j] <- 0
next
}
if (rho < -1 * lambda_vec[j])
beta[j] <- (rho+lambda_vec[j])/z
if (abs(rho) <= lambda_vec[j])
beta[j] <- 0
if (rho > lambda_vec[j])
beta[j] <- (rho-lambda_vec[j])/z
}
wp <- cbind(wp, beta)
if (sum(abs(beta - beta_old), na.rm = TRUE) < control$optTol) {
break
}
mm <- mm + 1
}
w <- beta
w[abs(w) < control$zeroThreshold] <- 0
return(list(coefficients = w, coef.list = wp, num.it = mm))
}
#Computes the penalty weights D (one per dictionary term) used to rescale the regularization in RMD_lasso/RMD_dantzig
get_D <- function(args,type,Y,X,m,rho_hat,b){
#n=nrow(X)
#p=length(b(T[1],X[1,]))
n = get.n(args,type,X)
p = get.p(args,type,X,b)
df=matrix(0,p,n)
if (type== "ATE"){
T = args$T
for (i in 1:n){
df[,i]=b(T[i],X[i,])*as.vector(rho_hat %*% b(T[i],X[i,]))-m.ATE(Y[i],T[i],X[i,],b)
}
} else {
X.up = args$X.up
X.down = args$X.down
delta = args$delta
for (i in 1:n){
df[,i]=X[i,]*as.vector(rho_hat %*% X[i,])-m.PD(Y[i],X[i,],X.up[i,],X.down[i,],delta,b)
}
}
df=df^2
D2=rowMeans(df)
D=sqrt(D2)
return(D) #pass around D as vector
}
c=0.5 #constant scaling the penalty level lambda
alpha=0.1 #significance level used in the penalty formula (not the Riesz representer)
tol=1e-6 #convergence tolerance for the iterative estimation loop
#Calculates ML function for alpha and gamma
RMD_stable<-function(args,type,Y,X,p0,D_LB,D_add,max_iter,b,is_alpha = TRUE,is_lasso = TRUE){
###Inputs
###args: list of differing arguments between ATE and PD
### ATE -> args = list(T)
### PD -> args = list(X.up,X.down,delta)
###Y = outcome variable
###X = input variables
###type = ATE or PD
###p0: initial parameter for p
###D_LB: Lower bound on D
###D_add: value added to D
###max_iter: maximum iterations of chosen ML algorithm
###b: dictionary
###is_alpha: True = return alpha, False = return gamma
###is_lasso: True = use lasso, False = use dantzig
###Output
###rho_hat or beta_hat, a vector used to estimate alpha or gamma respectively by taking the dot product with b
k=1
#p=length(b(T[1],X[1,]))
#n=length(T)
p = get.p(args,type,X,b)
n = get.n(args,type,X)
# low-dimensional moments
X0=X[,1:p0]
if (type == "ATE"){
args0 = list(T = args$T)
} else {
args0 = list(X.up = args$X.up[,1:p0], X.down = args$X.down[,1:p0], delta = args$delta)
}
MNG0<-get_MNG(args0, type, Y,X0,b)
M_hat0=MNG0[[1]]
N_hat0=MNG0[[2]]
G_hat0=MNG0[[3]]
# initial estimate
rho_hat0=solve(G_hat0,M_hat0)
rho_hat=c(rho_hat0,rep(0,p-ncol(G_hat0)))
beta_hat0=solve(G_hat0,N_hat0)
beta_hat=c(beta_hat0,rep(0,p-ncol(G_hat0)))
# moments
MNG<-get_MNG(args,type,Y,X,b)
M_hat=MNG[[1]]
N_hat=MNG[[2]]
G_hat=MNG[[3]]
# penalty
lambda=c*qnorm(1-alpha/(2*p))/sqrt(n) #theoretically motivated penalty level
if(is_alpha){
###########
# alpha_hat
###########
diff_rho=1
#Loop through max_iter times or until the change in rho between iterations is less than tol
while(diff_rho>tol & k<=max_iter){
# previous values
rho_hat_old=rho_hat+0
# normalization
D_hat_rho=get_D(args,type,Y,X,m,rho_hat_old,b)
D_hat_rho=pmax(D_LB,D_hat_rho)
D_hat_rho=D_hat_rho+D_add
# RMD estimate
if(is_lasso){
rho_hat=RMD_lasso(M_hat, G_hat, D_hat_rho, lambda)$coefficients
}else{
rho_hat=RMD_dantzig(M_hat, G_hat, D_hat_rho, lambda)$coefficients
}
# difference
diff_rho=two.norm(rho_hat-rho_hat_old)
k=k+1
}
print(paste0('k: '))
print(paste0(k))
return(rho_hat)
} else {
###########
# gamma_hat
###########
diff_beta=1
#Loop through max_iter times or until the change in beta between iterations is less than tol
while(diff_beta>tol & k<=max_iter){
# previous values
beta_hat_old=beta_hat+0
# normalization
D_hat_beta=get_D(args,type,Y,X,m2,beta_hat_old,b)
D_hat_beta=pmax(D_LB,D_hat_beta)
D_hat_beta=D_hat_beta+D_add
# RMD estimate
if(is_lasso){
beta_hat=RMD_lasso(N_hat, G_hat, D_hat_beta, lambda)$coefficients
}else{
beta_hat=RMD_dantzig(N_hat, G_hat, D_hat_beta, lambda)$coefficients
}
# difference
diff_beta=two.norm(beta_hat-beta_hat_old)
k=k+1
}
print(paste0('k: '))
print(paste0(k))
return(beta_hat)
}
}
arg_Forest<- list(clas_nodesize=1, reg_nodesize=5, ntree=1000, na.action=na.omit, replace=TRUE)
arg_Nnet<- list(size=8, maxit=1000, decay=0.01, MaxNWts=10000, trace=FALSE)
Using the above functions, "get_stage1" returns \(\hat{\alpha}\) and \(\hat{\gamma}\) estimators for an outcome variable (Y), treatment variable (T), and confounding variables (X). The basic process for finding either \(\alpha\) or \(\gamma\) is to calculate the vector \(\hat{\rho}\) or \(\hat{\beta}\), respectively, and then return the function given by the dot product of that vector with the chosen dictionary \(b\), which maps \(W\) into a feature vector. (For \(\gamma\), random forest and neural network estimators are also available.)
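In symbols, for the lasso and Dantzig options the returned estimators take the form
\[
\hat{\alpha}(d,z) \;=\; \hat{\rho}^{\prime}\,b(d,z), \qquad \hat{\gamma}(d,z) \;=\; \hat{\beta}^{\prime}\,b(d,z),
\]
which is exactly what the anonymous functions below compute via b(d,z)%*%rho_hat and b(d,z)%*%beta_hat.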
get_stage1<-function(args,type,Y,X,p0,D_LB,D_add,max_iter,b,alpha_estimator = 1,gamma_estimator = 1){
###Inputs
###args: list of differing arguments between ATE and PD
### ATE -> args = list(T)
### PD -> args = list(X.up,X.down,delta)
###Y = outcome variable
###X = input variables
###type = ATE or PD
###p0: initial parameter for p
###D_LB: Lower bound on D
###D_add: value added to D
###max_iter: maximum iterations of chosen ML algorithm
###b: dictionary
###alpha_estimator: numerical value indicating which model to use for the alpha estimator
###0 dantzig, 1 lasso
###gamma_estimator: numerical value indicating which model to use for the gamma estimator
###0 dantzig, 1 lasso, 2 rf, 3 nn
###Output
###alpha_hat and gamma_hat estimators
#For ATE
#p=length(b(T[1],X[1,]))
#n=length(T)
#For PD
#p = dim(X)[2]
#n = dim(X)[1]
p = get.p(args,type,X,b)
n = get.n(args,type,X)
MNG<-get_MNG(args,type,Y,X,b)
B=MNG[[4]]
###########
# alpha hat
###########
if(alpha_estimator==0){ # dantzig
rho_hat=RMD_stable(args,type,Y,X,p0,D_LB,D_add,max_iter,b,1,0)
alpha_hat<-function(d,z){
return(b(d,z)%*%rho_hat)
}
} else if(alpha_estimator==1){ # lasso
rho_hat=RMD_stable(args,type,Y,X,p0,D_LB,D_add,max_iter,b,1,1)
alpha_hat<-function(d,z){
return(b(d,z)%*%rho_hat)
}
}
###########
# gamma hat
###########
if(gamma_estimator==0){ # dantzig
beta_hat=RMD_stable(args,type,Y,X,p0,D_LB,D_add,max_iter,b,0,0)
gamma_hat<-function(d,z){
return(b(d,z)%*%beta_hat)
}
} else if(gamma_estimator==1){ # lasso
beta_hat=RMD_stable(args,type,Y,X,p0,D_LB,D_add,max_iter,b,0,1)
gamma_hat<-function(d,z){
return(b(d,z)%*%beta_hat)
}
} else if(gamma_estimator==2){ # random forest
forest<- do.call(randomForest, append(list(x=B,y=Y), arg_Forest))
gamma_hat<-function(d,z){
return(predict(forest,newdata=b(d,z), type="response"))
}
} else if(gamma_estimator==3){ # neural net
# scale down, de-mean, run NN, scale up, remean so that NN works well
maxs_B <- apply(B, 2, max)
mins_B <- apply(B, 2, min)
maxs_Y<-max(Y)
mins_Y<-min(Y)
# hack to ensure that constant covariates do not become NA in the scaling
const=maxs_B==mins_B
keep=(1-const)*1:length(const)
NN_B<-B
NN_B[,keep]<-scale(NN_B[,keep], center = mins_B[keep], scale = maxs_B[keep] - mins_B[keep])
NN_Y<-scale(Y, center = mins_Y, scale = maxs_Y - mins_Y)
nn<- do.call(nnet, append(list(x=NN_B,y=NN_Y), arg_Nnet))
gamma_hat<-function(d,z){
test<-t(as.vector(b2(d,z)))
NN_b<-test
NN_b[,keep]<-scale(t(NN_b[,keep]),
center = mins_B[keep],
scale = maxs_B[keep] - mins_B[keep])
NN_Y_hat<-predict(nn,newdata=NN_b)
Y_hat=NN_Y_hat*(maxs_Y-mins_Y)+mins_Y
return(Y_hat)
}
} else if(gamma_estimator==4){ # 2 layer NN (keras)
# scale down, de-mean, run NN, scale up, remean so that NN works well
# hack to ensure that constant covariates do not become NA in the scaling
maxs_B <- apply(B, 2, max)
mins_B <- apply(B, 2, min)
maxs_Y<-max(Y)
mins_Y<-min(Y)
const=maxs_B==mins_B
keep=(1-const)*1:length(const)
NN_B<-B
NN_B[,keep]<-scale(NN_B[,keep], center = mins_B[keep], scale = maxs_B[keep] - mins_B[keep])
NN_Y<-scale(Y, center = mins_Y, scale = maxs_Y - mins_Y)
# choose architecture
build_model <- function() {
model <- keras_model_sequential() %>%
layer_dense(units = 8, activation = "relu",
input_shape = dim(NN_B)[[2]]) %>%
layer_dense(units = 8, activation = "relu") %>%
layer_dense(units = 1)
model %>% compile(
optimizer = "rmsprop",
loss = "mse",
metrics = c("mae")
)
}
# use package
model <- build_model()
num_epochs <- 100
model %>% fit(NN_B, NN_Y,
epochs = num_epochs, batch_size = 1, verbose = 0)
gamma_hat<-function(d,z){
NN_b<-t(as.vector(b(d,z))) # test point
NN_b[,keep]<-scale(t(NN_b[,keep]),
center = mins_B[keep],
scale = maxs_B[keep] - mins_B[keep])
# 2 layer NN (keras)
NN_Y_hat<-model %>% predict(NN_b, verbose = 0)
Y_hat=NN_Y_hat*(maxs_Y-mins_Y)+mins_Y
return(Y_hat)
}
}
return(list(alpha_hat,gamma_hat))
}
The final step of the algorithm is to use the learned \(\alpha\) and \(\gamma\) estimators to estimate the parameter of interest. At this point we have functions that can learn \(\alpha\) and \(\gamma\). The "rrr" function starts by splitting the data into L folds. For each fold l it learns \(\hat{\alpha}_l\) and \(\hat{\gamma}_l\) on the remaining data and evaluates the debiased moment on fold l; it then averages across all observations to obtain \(\hat{\theta}\) and its standard error.
L=5 #number of folds to split the data into for cross-fitting
rrr<-function(args,type,Y,X,p0,D_LB,D_add,max_iter,dict,alpha_estimator,gamma_estimator,bias){
if (type=='ATE'){
T = args$T
} else {
X.up = args$X.up
X.down = args$X.down
delta = args$delta
}
n=nrow(X)
#Split the data into folds I_1,...,I_L
folds <- split(sample(n, n,replace=FALSE), as.factor(1:L))
Psi_tilde=numeric(0) #Initialize Psi_tilde as an empty vector; it will become a running list of the debiased moment values
#Iterate through folds I_1,...,I_L
for (l in 1:L){
#Splits the data into two sets: variables with suffix .nl (not in fold l) are used for stage-1 training; variables with suffix .l (fold l) are used for calculating theta
Y.l=Y[folds[[l]]]
Y.nl=Y[-folds[[l]]]
X.l=X[folds[[l]],]
X.nl=X[-folds[[l]],]
if (type=="ATE"){
T.l=T[folds[[l]]]
T.nl=T[-folds[[l]]]
n.l=length(T.l)
n.nl=length(T.nl)
stage1.args = list(T = T.nl)
} else {
X.up.l=X.up[folds[[l]],]
X.up.nl=X.up[-folds[[l]],]
X.down.l=X.down[folds[[l]],]
X.down.nl=X.down[-folds[[l]],]
n.l=nrow(X.l)
n.nl=nrow(X.nl)
stage1.args = list(X.up = X.up.nl, X.down = X.down.nl, delta = delta)
}
# get stage 1 (on nl)
stage1_estimators<-get_stage1(stage1.args, type, Y.nl,X.nl,p0,D_LB,D_add,max_iter,dict,alpha_estimator,gamma_estimator)
alpha_hat=stage1_estimators[[1]]
gamma_hat=stage1_estimators[[2]]
print(paste0('fold: ',l))
#get stage 2 (on l)
#psi_star
Psi_tilde.l=rep(0,n.l)
for (i in 1:n.l){
if (type=="ATE"){psi.args = list(T = T.l[i])}
else {psi.args = list(X.up = X.up.l[i,],X.down = X.down.l[i,],delta = delta)}
if(bias){ #plug-in
Psi_tilde.l[i]=psi_tilde_bias(psi.args, type, Y.l[i],X.l[i,],alpha_hat,gamma_hat) # without subtracting theta_hat
}else{ #DML
Psi_tilde.l[i]=psi_tilde(psi.args, type, Y.l[i],X.l[i,],alpha_hat,gamma_hat) # without subtracting theta_hat
}
}
Psi_tilde=c(Psi_tilde,Psi_tilde.l)
}
#point estimation
ate=mean(Psi_tilde)
#influences
Psi=Psi_tilde-ate
var=mean(Psi^2)
se=sqrt(var/n)
#Returns ATE and SE
if (type == "ATE"){
out<-c(table(T)[[2]],table(T)[[1]],ate,se)
} else {
out<-c(n,ate,se)
}
return(out)
}
#Prints out the results of rrr in a way that is easy to read
printer<-function(spec1,type){
if (type == "ATE"){
print(paste(" treated: ",spec1[1], " untreated: ", spec1[2], " ATE: ",round(spec1[3],2), " SE: ", round(spec1[4],2), sep=""))
} else {
print(paste(" n: ",spec1[1]," AD: ",round(spec1[2],2), " SE: ", round(spec1[3],2), sep=""))
}
}
#Prints results of rrr in a format suitable for LaTeX
for_tex<-function(spec1,type){
if (type == "ATE"){
print(paste(" & ",spec1[1], " & ", spec1[2], " & ",round(spec1[3],2), " & ", round(spec1[4],2), sep=""))
} else {
print(paste(" & ",spec1[1]," & ",round(spec1[2],2), " & ", round(spec1[3],2), sep=""))
}
}
Now that all of the necessary parts are in place, the following code puts everything together and runs the debiased machine learning algorithm. It first prepares the data using "get_data", then runs "rrr" to estimate \(\hat{\theta}\), and finally prints the results. The code can also iterate through income quintiles if you are interested in \(\hat{\theta}\) within different income groups; to do so, change the loop range as indicated and comment out the line that fixes the quintile at 0. As a note on run times, the neural nets take considerably longer to run than lasso or Dantzig.
type = "PD" #either ATE for average treatment effect or PD for elasticity
for (quintile in 0:0){ #change "0:0" to "0:5" to run on the whole data set and then on each income quintile
quintile=0 #comment out this line to iterate through quintiles
print(paste0('quintile: '))
print(paste0(quintile))
#This if statement sets up data for either ATE or PD
#note: each clause has its own 'spec' and 'dict' variables, so
#if you want to change these, make sure you're changing it in
#the correct clause
#Conversely, all other arguments (such as alpha) are outside these clauses,
#so changing them for one type will change them for either type
if (type=="ATE"){
df <- read.dta("sipp1991.dta")
#df = read.csv("Voting.csv")
spec=1 #spec in (1-3)
#quintile=0 #quintile in (1-5). 0 means all quintiles
data<-get_data(df,spec,quintile) #trimming like Farrell; different than Chernozhukov et al.
Y=data[[1]]
T=data[[2]]
X=data[[3]] #no intercept
##################
# helper functions
##################
# dictionary for calculating alpha (discussion in appendix C)
dict=b # b for partially linear model, b2 for interacted model. note that b2 appears in stage1.R for NN
p=length(b(T[1],X[1,]))
args = list(T = T)
} else {
df <- read.dta("gasoline_final_tf1.dta")
spec = 1 #either 1 or 2
data<-get_data(df,spec,quintile,type = "PD") #like Chernozhukov and Semenova
Y=data[[1]]
X=data[[2]]
X.up=data[[3]]
X.down=data[[4]]
delta=data[[5]]
#test<-get_MNG(Y,X,X.up,X.down,delta)
#M_hat=test[[1]]
#N_hat=test[[2]]
#as.numeric(M_hat)
#as.numeric(N_hat)
n=nrow(X)
p=ncol(X)
dict = b.PD
args = list(X.up = X.up, X.down = X.down, delta = delta)
}
#p0: number of columns used for the low-dimensional dictionary in the stage 1 tuning procedure
p0=ceiling(p/4)
if (p>60){
p0=ceiling(p/40)
}
D_LB=0 #each diagonal entry of \hat{D} lower bounded by D_LB
D_add=.2 #each diagonal entry of \hat{D} increased by D_add (a small positive value, e.g. 0.1 or 0.2)
max_iter=10 #max number iterations in Dantzig selector iteration over estimation and weights
###########
# algorithm
###########
set.seed(1) # for sample splitting
alpha_estimator=1
gamma_estimator=1
bias=0 #0 for unbiased or 1 for biased
#alpha_estimator: 0 dantzig, 1 lasso
#gamma_estimator: 0 dantzig, 1 lasso, 2 rf, 3 nn, 4 2-layer nn (2-layer nn only works for ATE)
results<-rrr(args,type,Y,X,p0,D_LB,D_add,max_iter,dict,alpha_estimator,gamma_estimator,bias)
printer(results,type)
for_tex(results,type)
}
## [1] "quintile: "
## [1] "0"
## Warning in split.default(sample(n, n, replace = FALSE), as.factor(1:L)): data
## length is not a multiple of split variable
## [1] "k: "
## [1] "11"
## [1] "k: "
## [1] "8"
## [1] "fold: 1"
## [1] "k: "
## [1] "11"
## [1] "k: "
## [1] "8"
## [1] "fold: 2"
## [1] "k: "
## [1] "11"
## [1] "k: "
## [1] "8"
## [1] "fold: 3"
## [1] "k: "
## [1] "11"
## [1] "k: "
## [1] "8"
## [1] "fold: 4"
## [1] "k: "
## [1] "11"
## [1] "k: "
## [1] "8"
## [1] "fold: 5"
## [1] " n: 5001 AD: -0.64 SE: 0.12"
## [1] " & 5001 & -0.64 & 0.12"
The above framework is very flexible and can easily be adapted to other data sets. Doing so requires adjusting "get_data" to return Y, T, and X as a list (or Y, X, X.up, X.down, and delta for elasticities), where Y and T are vectors and X is a matrix (X.up and X.down should be matrices and delta a numeric scalar). It is important that none of your matrices contain NA values. The following steps outline this process:
#df = df %>%
#  mutate(T = ifelse(column %in% c("list of values for treatment"), 1, 0))
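As a sketch of what a custom data-preparation function might look like for the ATE case, the outline below assumes a hypothetical data frame with columns named "outcome" and "treat" plus covariates; these column names are placeholders to replace with your own.
#Hypothetical skeleton for adapting get_data to your own ATE dataset
get_data_custom <- function(df){
df <- na.omit(df) #the debiased ML code requires matrices with no NA values
Y <- df[,"outcome"] #outcome vector (placeholder column name)
T <- df[,"treat"] #binary treatment vector (placeholder column name)
X <- as.matrix(df[,setdiff(names(df), c("outcome","treat"))]) #all remaining columns as covariates
X <- scale(X, center=TRUE, scale=TRUE) #center and scale, as in the 401(k) example
return(list(Y,T,X))
}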
Common support requires that the propensity score \(P(X) = Pr(D=1|X)\) is strictly between zero and one for all X. This is a necessary condition for causal inference to work. To impose it, \(P(X_i) = Pr(D_i=1|X_i)\) is first learned with some learning algorithm, and then all observations where \(P(X_i)\) is too close to zero or one are thrown out. But what counts as too close?
We determine which values to throw out by looking at the treatment group (where \(D_i = 1\)) and finding the range of fitted values \(P(X_i)\) within it. The upper and lower limits of this range become the maximum and minimum allowable values of \(P(X_i)\). Some observations \(W_i\) with \(D_i = 0\) have \(P(X_i)\) outside this range, and these observations are dropped; the rule is written out formally below. The code below creates a histogram in which the dotted lines mark the limits of the common support interval. The second histogram is zoomed in to show more clearly which values fall outside the range. As the graphs show, the vast majority of values fall between the dotted lines, which is a good sign.
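Formally, with \(\hat{P}(X_i)\) denoting the fitted propensity score, the code below drops observation \(i\) whenever
\[
\hat{P}(X_i) \;<\; \min_{j:\,D_j=1}\hat{P}(X_j)
\quad\text{or}\quad
\hat{P}(X_i) \;>\; \max_{j:\,D_j=1}\hat{P}(X_j).
\]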
df <- read.dta("sipp1991.dta")
Y <- df[,"net_tfa"]
D <- df[,"e401"]
X.L <- cbind(df[,"age"], df[,"inc"], df[,"educ"], df[,"fsize"], df[,"marr"], df[,"twoearn"], df[,"db"], df[,"pira"], df[,"hown"])
X <- scale(X.L,center=TRUE,scale=TRUE) #Centers and scales X
#Impose common support. More discussion in Appendix B.
p.1 <- multinom(D~X-1, trace=FALSE)$fitted.values
indexes.to.drop <- which(p.1 < min(p.1[D==1]) | max(p.1[D==1]) < p.1)
probs = data.frame("treatment" = append(rep("Treated",length(p.1[D==1])),rep("Untreated",length(p.1[D==0]))),"value" = append(p.1[D==1],p.1[D==0]))
ggplot(probs,aes(x = value, fill = treatment))+
geom_histogram(alpha = 0.2, col = "black")+
geom_vline(xintercept = min(p.1[D==1]),linetype = "dotted")+
geom_vline(xintercept = max(p.1[D==1]),linetype = "dotted")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
probs %>%
filter(value<=0.25)%>%
ggplot(aes(x = value, fill = treatment))+
geom_histogram(alpha = 0.2, col = "black")+
geom_vline(xintercept = min(p.1[D==1]))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Alpha, known as the Riesz representer, can be calculated in two different ways. The first way, for the ATE, uses propensity scores: \[\alpha_0(D,X) = \frac{D}{Pr(D=1|X)} - \frac{1-D}{1-Pr(D=1|X)}.\]
However, this suffers from several practical issues. For instance, since the propensity scores are learned on a logistic curve, there are often values very close to 0 or very close to 1. These values produce denominators close to zero, and hence extreme values of alpha, making the calculation unstable. Fortunately, there is a more direct way to calculate the Riesz representer, and it is the one used in this guide. The approach leverages a powerful mathematical fact about the Riesz representer: \(\mathbb{E}[m(W,\gamma)] = \mathbb{E}[\alpha_0(W)\gamma(W)]\). We then try to find an \(\alpha\) that minimizes the loss function \[L(\alpha) = \mathbb{E}\big[(\alpha_0(W)-\alpha(W))^2\big].\]
Using the fact above, we can expand this loss to \[L(\alpha) = C-2\mathbb{E}[m(W,\alpha)]+\mathbb{E}[\alpha(W)^2],\] where \(C = \mathbb{E}[\alpha_0(W)^2]\) does not depend on \(\alpha\).
We then assume a functional form for the Riesz representer, \[\alpha(W) = \rho^{\prime} b(W),\] where \(\rho\) is a coefficient vector and \(b(W)\) maps \(W\) to a vector of dictionary terms. Since b is an arbitrary dictionary, it can include polynomials of W and interaction terms in W, so this class can represent a wide selection of functions. To find \(\alpha\), we now need to find \(\rho\). Plugging this form into the loss and replacing expectations with sample averages gives \[\hat{L}(\rho) = C-2\rho^{\prime}\hat{M}+\rho^{\prime}\hat{G}\rho,\] where \(\hat{M} = \frac{1}{n} \sum_{i=1}^{n}m(W_i,b)\) and \(\hat{G} = \frac{1}{n} \sum_{i=1}^{n}b(W_i)b(W_i)^{\prime}\). Minimizing \(\hat{L}\) gives \(\hat{\rho} = \hat{G}^{-1}\hat{M}\). Thus, starting from a simple fact about the Riesz representer and adopting a functional form for \(\alpha\), we can estimate it with a single linear solve involving two matrices that are easily computed from the chosen dictionary b and the moment function m.
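This closed-form solution is what the code uses to initialize \(\hat{\rho}\) in RMD_stable. A minimal sketch, reusing the get_MNG helper and the variables (args, type, Y, X, b) from the main script, is:
#Minimal sketch: unpenalized estimate of the Riesz representer coefficients
MNG <- get_MNG(args, type, Y, X, b) #returns M_hat, N_hat, G_hat, B
M_hat <- MNG[[1]]
G_hat <- MNG[[3]]
rho_hat <- solve(G_hat, M_hat) #rho_hat = G_hat^{-1} M_hat
alpha_hat <- function(d, z){ b(d, z) %*% rho_hat } #alpha_hat(W) = rho_hat' b(W)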
Now that we've solved the problem of estimating \(\alpha\), we can expand on this approach by using different loss functions. In the example above we used a squared loss, but we might want to add a regularization term and minimize a penalized loss such as \[L(\rho) = C-2\rho^{\prime}\hat{M}+\rho^{\prime}\hat{G}\rho+2\lambda\|\rho\|_1,\] which is essentially what RMD_lasso does (with data-dependent weights on the penalty). Thus, this approach gives us many options for estimating \(\alpha\), all of which rely on optimizing relatively simple loss functions.