OLS and the Dependent Variable
A typical OLS equation looks like:
\[ Y = \beta_0 + \beta_1X + \varepsilon \]
and assumes that the error term, \(\varepsilon\), is normal.
- The normal distribution is continuous and smooth and has infinite range
- And the linear form stretches off to infinity in either direction as \(X\) gets small or big
- Both of these imply that the dependent variable, \(Y\), is continuous and can take any value (why is that?)!
- If that’s not true, then our model will be misspecified in some way
Non-Continuous Dependent Variables
When might dependent variables not be continuous and have infinite range?
- Years working at current job (can’t be negative)
- Are you self-employed? (Binary)
- Number of children (must be a round number, can’t be negative)
- Which brand of soda did you buy? (categorical)
- Did you recover from your disease? (binary)
- How satisfied are you with your purchase on a 1-5 scale? (must be a round number from 1 to 5, and the difference between 1 and 2 isn’t necessarily the same as the difference between 2 and 3)
Binary Dependent Variables
- In many cases, such as variables that must be round numbers or can’t be negative, there are ways of properly handling these issues, but people will usually ignore the problem and just use OLS, as long as the data is continuous-ish (i.e. doesn’t have a LOT of observations right at 0 next to the impossible negative values, or has lots of different values so the round numbers smooth out)
- However, the problems of using OLS are a bit worse for binary data, and so they’re the most common case in which we do something special to account for it
- Binary dependent variables are also really common! We’re often interested in whether a certain outcome happened or didn’t (if we want to know if a drug was effective, we are likely asking if you are cured or not!)
So, how can we deal with having a binary dependent variable, and why do they give OLS such problems?
The Linear Probability Model
- First off, let’s ignore the completely unexplained warnings I’ve just given you and do it with OLS anyway, and see what happens
- Running OLS with a binary dependent variable is called the “linear probability model” or LPM
\[ D = \beta_0 + \beta_1X + \varepsilon \]
Throughout these slides, let’s use \(D\) to refer to a binary variable
The Linear Probability Model
- In terms of how we do it, the estimation is the exact same as regular OLS, so you can bring in all your intuition
- The only difference is that our interpretation of the dependent variable is now in probability terms
- If \(\hat{\beta}_1 = .03\), that means that a one-unit increase in \(X\) is associated with a three percentage point increase in the probability that \(D = 1\)
- (percentage points! Not percentage - an increase from .1 to .13, say, not .1 to .103)
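To make that interpretation concrete, here is a minimal sketch in Python using statsmodels on simulated data (the sample size, coefficients, and variable names are all made up for illustration; any stats package would do the same thing):

```python
# A minimal sketch of the linear probability model on simulated data
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=n)
p = 0.5 + 0.03 * X                     # true P(D = 1) rises with X
D = rng.binomial(1, np.clip(p, 0, 1))  # binary outcome

# The LPM is just OLS with the binary D on the left-hand side
lpm = sm.OLS(D, sm.add_constant(X)).fit()
print(lpm.params)  # slope around .03: a one-unit increase in X is
                   # associated with a 3 percentage point higher P(D = 1)
```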
The Linear Probability Model
So what’s the problem?
The linear probability model can lead to…
- Terrible predictions
- Incorrect slopes that don’t acknowledge the boundaries of the data
Terrible Predictions
- OLS fits a straight line. So if you increase or decrease \(X\) enough, eventually you’ll predict that the probability of \(D = 1\) is bigger than 1, or lower than 0. Impossible!
- We can address part of this by just not trying to predict outside the range of the data, but if \(X\) has a lot of variation in it, we might get those impossible predictions even for values in our data. And what do we do with that?
- (Also, because errors tend to be small for certain ranges of \(X\) and large for others, we have to use heteroskedasticity-robust standard errors)
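Here is a sketch of both issues on made-up simulated data (the coefficients and the amount of variation in \(X\) are chosen just to make the problem show up), including the heteroskedasticity-robust standard errors just mentioned:

```python
# Sketch: with enough spread in X, LPM fitted values escape [0, 1]
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(scale=3, size=2000)            # lots of variation in X
p_true = 1 / (1 + np.exp(-(0.2 + 0.8 * X)))   # true P(D = 1 | X)
D = rng.binomial(1, p_true)

# Heteroskedasticity-robust standard errors, as mentioned above
lpm = sm.OLS(D, sm.add_constant(X)).fit(cov_type='HC1')
fitted = lpm.fittedvalues
print('Share of in-sample LPM predictions outside [0, 1]:',
      np.mean((fitted < 0) | (fitted > 1)))
```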
Terrible Predictions
Incorrect Slopes
- Also, OLS requires that the slopes be constant
- (Not necessarily if you use a polynomial or logarithm, but the following critique still applies)
- This is not what we want for binary data!
- As the prediction gets really close to 0 or 1, the slope should flatten out to nothing
- If we predict there’s a .50 chance of \(D = 1\), a one-unit increase in \(X\) with \(\hat{\beta}_1 = .03\) would increase that to .53
- If we predict there’s a .99 chance of \(D = 1\), a one-unit increase in \(X\) with \(\hat{\beta}_1 = .03\) would increase that to 1.02…
- Uh oh! The slope should be flatter near the edges. We need the slope to vary along the range of \(X\)
Incorrect Slopes
- We can see how much the OLS slopes are overstating changes in \(D\) as \(X\) changes near the edges by comparing an OLS fit to just regular ol’ local means, with no shape imposed at all
- We’re not forcing the red line to flatten out - it’s doing that naturally as the mean can’t possibly go any lower! OLS barrels on through though
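A rough numeric version of that comparison (simulated data, arbitrary round-number bins, all purely for illustration) might look like this:

```python
# Sketch: compare the OLS (LPM) line to simple local means of D
# within bins of X
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.uniform(-4, 4, size=5000)
D = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X)))

lpm = sm.OLS(D, sm.add_constant(X)).fit()
df = pd.DataFrame({'X': X, 'D': D, 'lpm_pred': lpm.fittedvalues})
df['bin'] = df['X'].round()   # crude local grouping of X
print(df.groupby('bin')[['D', 'lpm_pred']].mean())
# The local mean of D flattens out toward 0 and 1 at the edges,
# while the straight LPM line keeps going past the boundaries
```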
Linear Probability Model
So what can we make of the LPM?
- Bad if we want to make predictions
- Bad at estimating slope if we’re looking near the edges of 0 and 1
- (which means it’s especially bad if the average of \(D\) is near 0 or 1)
When might we use it anyway?
- It behaves better in small samples than methods estimated by maximum likelihood (which many other methods are)
- If we only care about slopes far away from the boundaries
- If alternate methods (like we’re about to go into) put too many other statistical demands on the data (OLS is very “easy” from a computational standpoint)
- If we’re using lots of fixed effects (OLS deals with these far more easily than nonlinear methods)
- If our right-hand side is just binary variables (if \(X\) has limited range, the predictions might never leave 0-1!)
Generalized Linear Models
- So LPM has problems. What can we do instead?
- Let’s introduce the concept of the Generalized Linear Model
Here’s an OLS equation:
\[ Y = \beta_0 + \beta_1X + \varepsilon \]
Here’s a GLM equation:
\[ E(Y | X) = F(\beta_0 + \beta_1X) \]
Where \(F()\) is some function.
Generalized Linear Models
\[ E(D | X) = F(\beta_0 + \beta_1X) \]
- We can call the \(\beta_0 + \beta_1X\) part, which is the same as in OLS, the index function. It’s a linear function of our variable \(X\) (plus whatever other controls we have in there), same as before
- But to get our prediction of what \(D\) will be conditional on what \(X\) is ( \(D|X\) ), we do one additional step of running it through a function \(F()\) first. We call this function a link function since it links the index function to the outcome
- If \(F(z) = z\), then we’re basically back to OLS
- But if \(F()\) is nonlinear, then we can account for all sorts of nonlinear dependent variables!
So in other words, our prediction of \(D\) is still based on the linear index, but we run it through some nonlinear function first to get our nonlinear output!
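As a sketch of that two-step logic (the coefficients, the values of \(X\), and the logistic choice of \(F()\) here are just for illustration):

```python
# Sketch of the GLM two-step: build the linear index, then run it
# through a link function F(). Here F is the logistic function.
import numpy as np

def logistic(index):
    return np.exp(index) / (1 + np.exp(index))

beta0, beta1 = -1.0, 0.5          # made-up coefficients
X = np.array([-8, -2, 0, 2, 8])
index = beta0 + beta1 * X         # the same linear index as in OLS
print(logistic(index))            # E(D | X): every prediction is inside (0, 1)
```

In practice a stats package does the estimating for us; in Python's statsmodels, for example, `sm.GLM(D, sm.add_constant(X), family=sm.families.Binomial()).fit()` fits this by maximum likelihood (the Binomial family's default link is the logit).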
Generalized Linear Models
We can also think of this in terms of the latent variable interpretation
\[ D^* = \beta_0 + \beta_1X \]
Where \(D^*\) is an unseen “latent” variable that can take any value, just like a regular OLS dependent variable (and roughly the same in concept as our index function)
And we convert that latent variable to a probability using some function
\[ E(D | X) = F(D^*) \]
and perhaps saying something like “if we estimate \(D^*\) is above the number \(c\), then we predict \(D = 1\)”
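A small simulation of that latent-variable story (cutoff at 0, logistic noise, made-up coefficients and sample size) shows that a logit fit on the observed binary \(D\) recovers the latent index:

```python
# Sketch: D* is a continuous index plus noise; we only observe
# whether D* crosses the cutoff (here, 0)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10000
X = rng.normal(size=n)
D_star = 0.5 + 1.0 * X + rng.logistic(size=n)   # latent, can take any value
D = (D_star > 0).astype(int)                    # all we actually observe

# A logit fit on the observed binary D recovers the latent coefficients
logit_fit = sm.Logit(D, sm.add_constant(X)).fit(disp=0)
print(logit_fit.params)   # roughly (0.5, 1.0)
```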
Probit and Logit
- Let’s go back to our index-and-function interpretation. What function should we use?
- (many many different options depending on your dependent variable - poisson for count data, log link for nonnegative skewed values, multinomial logit for categorical data…)
- For binary dependent variables the two most common link functions are the probit and logistic links. We often call a regression with a logistic link a “logit regression”
\[ Probit(index) = \Phi(index) \]
where \(\Phi()\) is the standard normal cumulative distribution function (i.e. the probability that a random standard normal value is less than or equal to \(index\) )
\[ Logistic(index) = \frac{e^{index}}{1+e^{index}} \]
For most purposes it doesn’t matter whether you use probit or logit, but logit is getting much more popular recently (due to its common use in data science - it’s computationally easier) so we’ll focus on that, and just know that pretty much all of this is the same with probit
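Here's a small sketch computing both links over a grid of index values (the grid itself is arbitrary):

```python
# Compute the probit and logistic links over a range of index values
import numpy as np
from scipy.stats import norm

index = np.linspace(-4, 4, 9)
probit_link = norm.cdf(index)                        # Phi(index)
logistic_link = np.exp(index) / (1 + np.exp(index))  # Logistic(index)
print(np.round(np.column_stack([index, probit_link, logistic_link]), 3))
# Both links squash any index into (0, 1) with the same S shape;
# in fitted models the coefficients just rescale (logit coefficients
# run roughly 1.6-1.8 times probit ones), so predictions end up close
```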
Logit
- Notice that we can’t possibly predict a value outside of 0 or 1, no matter how wild \(X\) and our index get
- As \(index\) goes to \(-\infty\),
\[ Logistic(index) \rightarrow \frac{0}{1+0} = 0 \]
- And as \(index\) goes to \(\infty\),
\[ Logistic(index) \rightarrow \frac{\infty}{1+\infty } = 1 \]
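A quick numeric check of those limits:

```python
# Logistic(index) for a few very negative, middling, and very positive indexes
import numpy as np

for index in [-20, -5, 0, 5, 20]:
    print(index, np.exp(index) / (1 + np.exp(index)))
# very negative indexes give values near 0, very positive ones give
# values near 1, and an index of 0 gives exactly .5
```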
Logit
- Also notice that, like the local means did, its slope flattens out near the edges
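One way to see that formally: the derivative of the logistic link with respect to the index is

\[ \frac{\partial Logistic(index)}{\partial index} = Logistic(index)\left(1 - Logistic(index)\right) \]

which is at its largest ( \(.25\) ) when the predicted probability is \(.5\), and shrinks toward 0 as the prediction approaches either 0 or 1.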