Lesson 17: Multiple linear regression
Monday, January 29, 2024
Basics of multiple linear regression
Multiple linear regression with R
Interpretation of multiple linear regression
Model specification
\[ Y = \beta_0 + \beta_1 X_1 + \epsilon \]
For example, we can predict the miles per gallon (mpg) of a car based on its weight (wt). The model is expressed as follows:
\[ \text{mpg} = \beta_0 + \beta_1 \text{wt} + \epsilon \]
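As a minimal sketch of the simple model, assuming the built-in mtcars data set (the same data that appears in the regression output later in this lesson), the fit could look like this. The object name fit_simple is only illustrative.

```r
# Simple linear regression of fuel economy on weight, using the built-in mtcars data.
fit_simple <- lm(mpg ~ wt, data = mtcars)

# coef() returns the estimates of beta_0 ("(Intercept)") and beta_1 ("wt").
print(coef(fit_simple))

# summary() adds standard errors, t values, p-values, and R-squared.
print(summary(fit_simple))
```

The formula mpg ~ wt reads as "mpg modeled by wt"; the intercept is included automatically.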
Why do we need multiple linear regression?
In the real world, an outcome is rarely affected by only one factor; it usually depends on several predictor variables. For example, the miles per gallon of a car may be driven mainly by its weight, but it is also affected by other factors, such as rear axle ratio and horsepower. Therefore, we need to include more than one predictor variable in the model.
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \]
For example, we can predict the miles per gallon (mpg) of a car based on its weight (wt), horsepower (hp), rear axle ratio (drat), and transmission type (automatic or manual, am). The model is expressed as follows:
\[ \text{mpg} = \beta_0 + \beta_1 \text{wt} + \beta_2 \text{hp} + \beta_3 \text{drat} + \beta_4 \text{am} + \epsilon \]
Use the lm() function to run multiple linear regression in R
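The summary that follows can be reproduced with a call along these lines. This is a sketch, not necessarily the exact code behind the slide: the Call line in the output shows mutate() from dplyr, while the version here uses base R's transform() so that it runs without extra packages. The names cars and fit are illustrative.

```r
# Convert the 0/1 transmission indicator to a factor (0 = automatic, 1 = manual).
# With dplyr loaded, mutate(mtcars, am = factor(am)) does the same thing.
cars <- transform(mtcars, am = factor(am))

# Multiple linear regression: predictors are joined with `+` in the formula.
fit <- lm(mpg ~ wt + hp + drat + am, data = cars)

# Coefficient table, residual summary, R-squared, and overall F-test.
print(summary(fit))
```

Because am is a factor, R reports its coefficient as am1: the difference between level 1 (manual) and the reference level 0 (automatic).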
Call:
lm(formula = mpg ~ wt + hp + drat + am, data = mutate(mtcars,
    am = factor(am)))

Residuals:
    Min      1Q  Median      3Q     Max
-3.2882 -1.7531 -0.6827  1.1691  5.5211

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.027077   6.185177   4.855  4.5e-05 ***
wt          -2.726092   0.937791  -2.907 0.007209 **
hp          -0.036373   0.009814  -3.706 0.000958 ***
drat         0.981018   1.377101   0.712 0.482341
am1          1.578521   1.559281   1.012 0.320363
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.56 on 27 degrees of freedom
Multiple R-squared:  0.8428,  Adjusted R-squared:  0.8196
F-statistic:  36.2 on 4 and 27 DF,  p-value: 1.75e-10
We can interpret the model as follows:
The coefficient of weight is -2.73, which means that the predicted fuel economy of a car decreases by 2.73 miles per gallon for each additional 1000 pounds of weight, holding the other predictors constant. The coefficient is statistically significant (p = 0.007, so p < 0.01), which means that weight is a statistically significant predictor of miles per gallon in this model.
The coefficient of horsepower is -0.036, which means that the predicted fuel economy of a car decreases by 0.036 miles per gallon for each additional horsepower, or about 3.6 miles per gallon for each additional 100 horsepower, holding the other predictors constant. The coefficient is statistically significant (p < 0.001), which means that horsepower is a statistically significant predictor of miles per gallon in this model.
The coefficient of rear axle ratio is 0.98, which means that the predicted fuel economy of a car increases by about 1 mile per gallon for each additional unit of rear axle ratio, holding the other predictors constant. However, this coefficient is not statistically significant (p = 0.48).
The coefficient of transmission type is 1.58, which means that the predicted fuel economy of a car with a manual transmission (am = 1) is 1.58 miles per gallon higher than that of a car with an automatic transmission (am = 0), holding the other predictors constant. This coefficient is not statistically significant either (p = 0.32).
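One way to see what "holding the other predictors constant" means is to predict mpg for two hypothetical cars that differ only in weight. The sketch below assumes the same mtcars model as in the summary above; the hp and drat values chosen for the two cars are arbitrary illustrations, and the object names are not from the original slides.

```r
# Refit the model from the summary above (base R transform() stands in for mutate()).
cars <- transform(mtcars, am = factor(am))
fit <- lm(mpg ~ wt + hp + drat + am, data = cars)

# Two hypothetical automatic cars, identical except for weight:
# 3.0 versus 4.0 (wt is measured in thousands of pounds).
new_cars <- data.frame(
  wt   = c(3.0, 4.0),
  hp   = c(110, 110),
  drat = c(3.9, 3.9),
  am   = factor(c(0, 0), levels = levels(cars$am))
)

pred <- predict(fit, newdata = new_cars)

# The gap between the two predictions equals the wt coefficient.
pred_gap <- unname(pred[2] - pred[1])
wt_coef  <- unname(coef(fit)["wt"])
print(c(pred_gap = pred_gap, wt_coef = wt_coef))

# 95% confidence intervals show a plausible range for each coefficient,
# which is more informative than a significance star alone.
print(confint(fit))
```

Changing the hp or drat values shifts both predictions by the same amount, so the gap stays equal to the wt coefficient. This holds because the model has no interaction terms.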
The model specification is very important in multiple linear regression
It is also a very difficult task. To some extent, it is easier to run multiple linear regression than to specify the model.
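We need to specify the model based on our research question, the theory, and the data. We should put forward a hypothesis about the relationship between the outcome and the predictors, rather than searching for a well-fitting model by trying every combination of predictors. For instance, replacing rear axle ratio (drat) with quarter-mile time (qsec) changes which coefficients appear statistically significant: horsepower no longer does (p = 0.22), while transmission type now does (p = 0.046):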
Call:
lm(formula = mpg ~ wt + hp + qsec + am, data = mutate(mtcars,
    am = factor(am)))

Residuals:
    Min      1Q  Median      3Q     Max
-3.4975 -1.5902 -0.1122  1.1795  4.5404

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.44019    9.31887   1.871  0.07215 .
wt          -3.23810    0.88990  -3.639  0.00114 **
hp          -0.01765    0.01415  -1.247  0.22309
qsec         0.81060    0.43887   1.847  0.07573 .
am1          2.92550    1.39715   2.094  0.04579 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.435 on 27 degrees of freedom
Multiple R-squared:  0.8579,  Adjusted R-squared:  0.8368
F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11
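To make the contrast between the two specifications easier to read, the shared coefficients can be placed side by side. This is only an illustrative sketch, assuming the mtcars data and base R; the names fit_drat, fit_qsec, and comparison are made up for this example. It is a way to inspect the two fits, not a recommended procedure for choosing between them.

```r
cars <- transform(mtcars, am = factor(am))

# The two specifications shown in this lesson.
fit_drat <- lm(mpg ~ wt + hp + drat + am, data = cars)
fit_qsec <- lm(mpg ~ wt + hp + qsec + am, data = cars)

# Coefficient tables (estimate, std. error, t value, p-value) as matrices.
coef_drat <- summary(fit_drat)$coefficients
coef_qsec <- summary(fit_qsec)$coefficients

# Compare the terms that appear in both models.
shared <- c("wt", "hp", "am1")
comparison <- data.frame(
  term     = shared,
  est_drat = coef_drat[shared, "Estimate"],
  p_drat   = coef_drat[shared, "Pr(>|t|)"],
  est_qsec = coef_qsec[shared, "Estimate"],
  p_qsec   = coef_qsec[shared, "Pr(>|t|)"]
)
print(comparison, digits = 3, row.names = FALSE)

# Adjusted R-squared for each model, as reported at the bottom of each summary.
adj_r2 <- c(
  drat_model = summary(fit_drat)$adj.r.squared,
  qsec_model = summary(fit_qsec)$adj.r.squared
)
print(round(adj_r2, 4))
```

The estimates and p-values for the same predictor shift depending on which other predictors are in the model, which is why the choice of predictors should rest on the research question and theory rather than on whichever combination yields the most stars.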
It is not the purpose of this course to teach you how to specify a model, which should be based on your research question and the theory.
However, we can use statistical techniques to assess whether a model fits the data well. We will learn how to do this in future courses.
Thank you!