🗓️ Week 7
Interactions & Beyond OLS

PB4A7- Quantitative Applications for Behavioural Science

13 Nov 2024

Interactions

  • For both polynomials and logarithms, the effect of a one-unit change in \(X\) differs depending on its current value (for logarithms, that's because a one-unit change in \(X\) is a different percentage change in \(X\) depending on its current value)
  • But why stop there? Maybe the effect of \(X\) differs depending on the current value of other variables! - Enter interaction terms!

\[ Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3X\times Z + \varepsilon \]

  • Interaction terms are a little tough but also extremely important.

Interactions

Expect to come back to these slides, as you’re almost certainly going to use interaction terms in both our assessment and the dissertation

Interactions

  • Change in the value of a control can shift a regression line up and down
  • Using the model \(Y = \beta_0 + \beta_1X + \beta_2Z\), estimated as \(Y = .01 + 1.2X + .95Z\):

Interactions

  • But an interaction can both shift the line up and down AND change its slope
  • Using the model \(Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3X\times Z\), estimated as \(Y = .035 + 1.14X + .94Z + 1.02X\times Z\):
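
Here's a rough sketch of both models in Python (simulated data, with the coefficient values from these slides baked in - the variable names are just illustrative):

```python
# A rough sketch with simulated data: without the interaction, Z only shifts
# the line; with the interaction, Z also changes the slope on X.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000
df = pd.DataFrame({"X": rng.normal(size=n), "Z": rng.integers(0, 2, size=n)})
df["Y"] = (0.035 + 1.14 * df["X"] + 0.94 * df["Z"]
           + 1.02 * df["X"] * df["Z"] + rng.normal(size=n))

shift_only = smf.ols("Y ~ X + Z", data=df).fit()   # Z moves the intercept only
interacted = smf.ols("Y ~ X * Z", data=df).fit()   # expands to X + Z + X:Z
print(shift_only.params)
print(interacted.params)  # the X:Z coefficient changes the slope on X
```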

Interactions

  • How can we interpret an interaction?
  • The idea is that the interaction shows how the effect of one variable changes as the value of the other changes
  • The derivative helps!

\[ Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3X\times Z \] \[ \partial Y/\partial X = \beta_1 + \beta_3 Z \]

  • The effect of \(X\) is \(\beta_1\) when \(Z = 0\), or \(\beta_1 + \beta_3\) when \(Z = 1\), or \(\beta_1 + 3\beta_3\) if \(Z = 3\)!
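
A tiny sketch of that arithmetic, plugging in the estimated coefficients from the previous slide (1.14 on \(X\), 1.02 on the interaction):

```python
# Evaluate the effect of X at a few values of Z using dY/dX = b1 + b3 * Z
b1, b3 = 1.14, 1.02   # estimated coefficients from the previous slide
for z in (0, 1, 3):
    print(f"Z = {z}: effect of X = {b1 + b3 * z:.2f}")
```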

Interactions

  • Often we are doing interactions with binary variables to see how an effect differs across groups
  • Now, instead of the intercept giving the baseline and the binary coefficient giving the difference, the coefficient on \(X\) is the baseline effect of \(X\) and the interaction is the difference in the effect of \(X\)
  • The interaction coefficient becomes “the difference in the effect of \(X\) between the \(Z =\) No group and the \(Z =\) Yes group”
  • (What if it’s continuous? Mathematically the same but the thinking changes - the interaction term is the difference in the effect of \(X\) you get when increasing \(Z\) by one unit)
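
A hedged sketch with made-up data, where \(Z\) is a “No”/“Yes” group indicator:

```python
# The coefficient on X is the effect of X in the baseline ("No") group, and
# the interaction coefficient is how much that effect differs in the "Yes"
# group. Data and effect sizes here are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({"X": rng.normal(size=n),
                   "Z": rng.choice(["No", "Yes"], size=n)})
true_effect = np.where(df["Z"] == "Yes", 1.5, 0.5)  # effect of X differs by group
df["Y"] = true_effect * df["X"] + rng.normal(size=n)

fit = smf.ols("Y ~ X * Z", data=df).fit()
print(fit.params)
# X          ≈ 0.5   (effect of X when Z = "No")
# X:Z[T.Yes] ≈ 1.0   (difference in the effect of X between the groups)
```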

Notes on Interactions

  • Like with polynomials, the coefficients on their own now have little meaning and must be evaluated alongside each other. \(\beta_1\) by itself is just “the effect of \(X\) when \(Z = 0\)”, not “the effect of \(X\)”
  • Yes, you do almost always want to include both variables in un-interacted form and interacted form. Otherwise the interpretation gets very thorny

Notes on Interactions

  • Interaction effects are poorly powered. You need a lot of data to be able to tell whether an effect is different in two groups. If \(N\) observations gives you adequate power to see whether the effect itself is different from zero, you need a sample of roughly \(16\times N\) to see whether the difference in effects is nonzero. Sixteen times!!
  • It’s tempting to try interacting your effect with everything to see if it’s bigger/smaller/nonzero in some groups, but because interactions are poorly powered and you’re running lots of tests, this is a bad idea! The “significant” interactions you turn up are likely to be false positives
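
A rough simulation sketch of the power problem (all the numbers here are made up): with the same sample size, an overall effect of a given size is detected far more often than an equally sized difference in effects between two groups.

```python
# Simulated power comparison: a 0.15 overall effect vs. a 0.15 difference in
# effects between two equally sized groups, at the same sample size.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, reps = 300, 500
found_main, found_inter = 0, 0
for _ in range(reps):
    x = rng.normal(size=n)
    z = rng.integers(0, 2, size=n)
    y_main = 0.15 * x + rng.normal(size=n)        # one overall effect of X
    y_inter = 0.15 * z * x + rng.normal(size=n)   # effect of X differs by 0.15
    found_main += sm.OLS(y_main, sm.add_constant(x)).fit().pvalues[1] < 0.05
    X_inter = sm.add_constant(np.column_stack([x, z, x * z]))
    found_inter += sm.OLS(y_inter, X_inter).fit().pvalues[3] < 0.05
print("power for the overall effect:  ", found_main / reps)
print("power for the interaction term:", found_inter / reps)
```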

OLS and the Dependent Variable

A typical OLS equation looks like:

\[ Y = \beta_0 + \beta_1X + \varepsilon \]

and assumes that the error term, \(\varepsilon\), is normal.

  • The normal distribution is continuous and smooth and has infinite range
  • And the linear form stretches off to infinity in either direction as \(X\) gets small or big
  • Both of these imply that the dependent variable, \(Y\), is continuous and can take any value (why is that?)!
  • If that’s not true, then our model will be misspecified in some way

Non-Continuous Dependent Variables

When might dependent variables not be continuous and have infinite range?

  • Years working at current job (can’t be negative)
  • Are you self-employed? (Binary)
  • Number of children (must be a round number, can’t be negative)
  • Which brand of soda did you buy? (categorical)
  • Did you recover from your disease? (binary)
  • How satisfied are you with your purchase on a 1-5 scale? (must be a round number from 1 to 5, and the difference between 1 and 2 isn’t necessarily the same as the difference between 2 and 3)

Binary Dependent Variables

  • In many cases, such as variables that must be round numbers or can’t be negative, there are ways of properly handling these issues, but people will usually ignore the problem and just use OLS, as long as the data is continuous-ish (i.e. it doesn’t have a LOT of observations piled up right at 0 next to the impossible negative values, or it has enough distinct values that the round numbers smooth out)
  • However, the problems of using OLS are a bit worse for binary data, and so they’re the most common case in which we do something special to account for it
  • Binary dependent variables are also really common! We’re often interested in whether a certain outcome happened or didn’t (if we want to know if a drug was effective, we are likely asking if you are cured or not!)

So, how can we deal with having a binary dependent variable, and why do they give OLS such problems?

The Linear Probability Model

  • First off, let’s ignore the completely unexplained warnings I’ve just given you and do it with OLS anyway, and see what happens
  • Running OLS with a binary dependent variable is called the “linear probability model” or LPM

\[ D = \beta_0 + \beta_1X + \varepsilon \]

Throughout these slides, let’s use \(D\) to refer to a binary variable

The Linear Probability Model

  • In terms of how we run it, the interpretation is exactly the same as regular OLS, so you can bring in all your intuition
  • The only difference is that our interpretation of the dependent variable is now in probability terms
  • If \(\hat{\beta}_1 = .03\), that means that a one-unit increase in \(X\) is associated with a three percentage point increase in the probability that \(D = 1\)
  • (percentage points! Not percentage - an increase from .1 to .13, say, not .1 to .103)
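
A minimal LPM sketch with simulated data (the heteroskedasticity-robust standard errors are explained a couple of slides ahead):

```python
# The LPM is just OLS with a 0/1 outcome; the coefficient is in probability
# (percentage-point) terms. HC1 gives heteroskedasticity-robust standard errors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 2000
df = pd.DataFrame({"X": rng.normal(size=n)})
df["D"] = (rng.uniform(size=n) < 0.5 + 0.03 * df["X"]).astype(int)

lpm = smf.ols("D ~ X", data=df).fit(cov_type="HC1")
print(lpm.params["X"])  # ≈ .03: one more unit of X -> about 3 percentage points
```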

The Linear Probability Model

So what’s the problem?

The linear probability model can lead to…

  • Terrible predictions
  • Incorrect slopes that don’t acknowledge the boundaries of the data

Terrible Predictions

  • OLS fits a straight line. So if you increase or decrease \(X\) enough, eventually you’ll predict that the probability of \(D = 1\) is bigger than 1, or lower than 0. Impossible!
  • We can address part of this by just not trying to predict outside the range of the data, but if \(X\) has a lot of variation in it, we might get those impossible predictions even for values in our data. And what do we do with that?
  • (Also, because errors tend to be small for certain ranges of \(X\) and large for others, we have to use heteroskedasticity-robust standard errors)

Terrible Predictions

Incorrect Slopes

  • Also, OLS requires that the slopes be constant
  • (Not necessarily if you use a polynomial or logarithm, but the following critique still applies)
  • This is not what we want for binary data!
  • As the prediction gets really close to 0 or 1, the slope should flatten out to nothing
  • If we predict there’s a .50 chance of \(D = 1\), a one-unit increase in \(X\) with \(\hat{\beta}_1 = .03\) would increase that to .53
  • If we predict there’s a .99 chance of \(D = 1\), a one-unit increase in \(X\) with \(\hat{\beta}_1 = .03\) would increase that to 1.02…
  • Uh oh! The slope should be flatter near the edges. We need the slope to vary along the range of \(X\)

Incorrect Slopes

  • We can see how much the OLS slopes are overstating changes in \(D\) as \(X\) changes near the edges by comparing an OLS fit to just regular ol’ local means, with no shape imposed at all
  • We’re not forcing the local means to flatten out - they do that naturally because the mean can’t possibly go any lower! OLS barrels on through, though
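
A rough sketch of that comparison in code (simulated data standing in for the figure): bin \(X\), take the mean of \(D\) within each bin, and set those “local means” against the straight LPM line.

```python
# Binned ("local") means of D flatten out near 0 and 1 on their own, while the
# straight OLS line keeps going past the boundaries.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 5000
df = pd.DataFrame({"X": rng.normal(scale=3, size=n)})
df["D"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-df["X"]))).astype(int)

local_means = df.groupby(pd.cut(df["X"], bins=20), observed=True)["D"].mean()
lpm = smf.ols("D ~ X", data=df).fit()

print(local_means)                                     # always inside [0, 1]
print(lpm.fittedvalues.min(), lpm.fittedvalues.max())  # can stray outside it
```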

Linear Probability Model

So what can we make of the LPM?

  • Bad if we want to make predictions
  • Bad at estimating slope if we’re looking near the edges of 0 and 1
  • (which means it’s especially bad if the average of \(D\) is near 0 or 1)

When might we use it anyway?

  • It behaves better in small samples than methods estimated by maximum likelihood (which many other methods are)
  • If we only care about slopes far away from the boundaries
  • If alternate methods (like we’re about to go into) put too many other statistical demands on the data (OLS is very “easy” from a computational standpoint)
  • If we’re using lots of fixed effects (OLS deals with these far more easily than nonlinear methods)
  • If our right-hand side is just binary variables (with such a limited range of \(X\) values, the predictions might never leave 0-1 anyway!)

Generalized Linear Models

  • So LPM has problems. What can we do instead?
  • Let’s introduce the concept of the Generalized Linear Model

Here’s an OLS equation:

\[ Y = \beta_0 + \beta_1X + \varepsilon \]

Here’s a GLM equation:

\[ E(Y | X) = F(\beta_0 + \beta_1X) \]

Where \(F()\) is some function.

Generalized Linear Models

\[ E(D | X) = F(\beta_0 + \beta_1X) \]

  • We can call the \(\beta_0 + \beta_1X\) part, which is the same as in OLS, the index function. It’s a linear function of our variable \(X\) (plus whatever other controls we have in there), same as before
  • But to get our prediction of what \(D\) will be conditional on what \(X\) is ( \(D|X\) ), we do one additional step of running it through a function \(F()\) first. We call this function a link function since it links the index function to the outcome
  • If \(F(z) = z\), then we’re basically back to OLS
  • But if \(F()\) is nonlinear, then we can account for all sorts of nonlinear dependent variables!

So in other words, our prediction of \(D\) is still based on the linear index, but we run it through some nonlinear function first to get our nonlinear output!
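
As a sketch, here’s what that looks like in code with a logistic \(F()\) and simulated data (statsmodels’ Binomial family uses the logistic link by default):

```python
# The linear index b0 + b1*X is pushed through the logistic function, so every
# prediction lands strictly between 0 and 1.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 2000
df = pd.DataFrame({"X": rng.normal(size=n)})
true_p = 1 / (1 + np.exp(-(0.5 + 1.0 * df["X"])))
df["D"] = (rng.uniform(size=n) < true_p).astype(int)

glm = smf.glm("D ~ X", data=df, family=sm.families.Binomial()).fit()
print(glm.params)                                      # the index coefficients
print(glm.fittedvalues.min(), glm.fittedvalues.max())  # predictions in (0, 1)
```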

Generalized Linear Models

We can also think of this in terms of the latent variable interpretation

\[ D^* = \beta_0 + \beta_1X \]

Where \(D^*\) is an unseen “latent” variable that can take any value, just like a regular OLS dependent variable (and roughly the same in concept as our index function)

And we convert that latent variable to a probability using some function

\[ E(D | X) = F(D^*) \]

and perhaps saying something like “if we estimate \(D^*\) is above the number \(c\), then we predict \(D = 1\)”
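
One common way to make this concrete (a sketch that goes slightly beyond the slide): give the latent variable an error term and a cutoff \(c\), and the function \(F()\) is just the distribution of that error

\[ D^* = \beta_0 + \beta_1X + \varepsilon, \qquad D = 1 \text{ if } D^* > c \]

\[ P(D = 1 | X) = P(\varepsilon > c - \beta_0 - \beta_1X) = F(\beta_0 + \beta_1X - c) \]

where the last step uses the symmetry of the error distribution - logistic errors give the logit model, normal errors give the probit, and \(c\) just gets absorbed into the intercept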

Probit and Logit

  • Let’s go back to our index-and-function interpretation. What function should we use?
  • (many many different options depending on your dependent variable - poisson for count data, log link for nonnegative skewed values, multinomial logit for categorical data…)
  • For binary dependent variables the two most common link functions are the probit and logistic links. We often call a regression with a logistic link a “logit regression”

\[ Probit(index) = \Phi(index) \]

where \(\Phi()\) is the standard normal cumulative distribution function (i.e. the probability that a random standard normal value is less than or equal to \(index\) )

\[ Logistic(index) = \frac{e^{index}}{1+e^{index}} \]

For most purposes it doesn’t matter whether you use probit or logit, but logit is getting much more popular recently (due to its common use in data science - it’s computationally easier) so we’ll focus on that, and just know that pretty much all of this is the same with probit
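
A quick numeric sketch of the two links side by side (the index values are picked arbitrarily):

```python
# Both links map any index into (0, 1); the logistic is a bit more spread out.
import numpy as np
from scipy.stats import norm

index = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probit = norm.cdf(index)                        # Phi(index)
logistic = np.exp(index) / (1 + np.exp(index))
print(np.round(probit, 3))
print(np.round(logistic, 3))
```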

Logit

  • Notice that we can’t possibly predict a value below 0 or above 1, no matter how wild \(X\) and our index get
  • As \(index\) goes to \(-\infty\),

\[ Logistic(index) \rightarrow \frac{0}{1+0} = 0 \]

  • And as \(index\) goes to \(\infty\),

\[ Logistic(index) \rightarrow \frac{\infty}{1+\infty } = 1 \]

Logit

  • Also notice that, like the local means did, its slope flattens out near the edges
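
A one-line derivation of why (added here, not on the original slide):

\[ \frac{\partial \, Logistic(index)}{\partial \, index} = \frac{e^{index}}{(1+e^{index})^2} = Logistic(index)\left(1 - Logistic(index)\right) \]

which is largest when the predicted probability is .5 and shrinks toward zero as the prediction approaches 0 or 1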