Analyzing a Linear Regression Result
Linear regression is one of the most common modeling tools in statistics. As a data scientist, it is easy to get caught up in the technical improvements of a model: improving the R-squared, reducing the RMSE, and removing features with high p-values. However, it is important not to lose sight of the context of your analysis: what do the regression results actually mean? Here we will go through several steps of regression interpretation so that we can understand the results we produce and apply them to the business problem at hand!
To begin, let’s look at the snapshot below of a regression result. We will use the following example to explain how to interpret OLS results:
1. Interpreting the R-squared
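A data scientist will often try to improve the regression model so that the R-squared value moves closer to 1, although a higher R-squared on its own does not guarantee a better model. In technical terms, the R-squared measures the proportion of the variance in the dependent variable that is explained by the model. So, in the case above, an R-squared of 0.699 means that our model explains 69.9% of the variation in price. Note that this is not the same thing as predicting with 69.9% accuracy.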
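To make the definition concrete, here is a minimal sketch of how an R-squared value is computed from observed and predicted values. The prices and predictions are made-up numbers for illustration, not values from the regression above, and r_squared is a hypothetical helper name.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that is explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Total sum of squares: variation of the data around its own mean.
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    # Residual sum of squares: variation the model leaves unexplained.
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Made-up home prices and model predictions (illustrative only).
prices = [300_000, 450_000, 500_000, 650_000, 800_000]
predictions = [320_000, 430_000, 540_000, 620_000, 790_000]

print(f"R-squared: {r_squared(prices, predictions):.3f}")
```

A model that reproduced every price exactly would leave no residual variation and score 1; a model that always predicted the mean price would score 0.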
2. Interpreting the Intercept
The intercept, also known as the “constant”, tells you the value of Y when all of the X variables are equal to 0. In the case above, the Y value represents price. So the intercept answers the question: if all X values are zero, what is the price of a home? Well, in this scenario, the intercept has no practical meaning. An X value of 0 for “sqft_living” would describe a house with essentially no living space (strictly speaking, because this variable was log transformed, a value of 0 corresponds to 1 square foot). No real home has these dimensions. Since no home in the data can sit at that point, our intercept has no intrinsic meaning.
3. Interpret Linear Coefficients
Linear coefficients are the most straightforward to interpret. We will look at the variable “view” as an example. The coefficient here is 2.425e+04 (or 24,250). Since both view and price enter the model untransformed, a one-unit increase in view is associated with a $24,250 increase in predicted price, holding the other variables constant. Pretty straightforward!
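As a small sketch of that arithmetic, the snippet below multiplies the quoted coefficient 2.425e+04 by a change in view. The function name is hypothetical, and the calculation assumes every other variable in the model stays fixed.

```python
VIEW_COEF = 2.425e04  # coefficient on "view" as quoted in the text


def price_change_from_view(delta_view, coef=VIEW_COEF):
    """Predicted change in price when view changes by `delta_view` units,
    holding every other variable in the model fixed."""
    return coef * delta_view


for delta in (1, 2):
    print(f"view +{delta}: predicted price change ${price_change_from_view(delta):,.0f}")
```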
4. Log Coefficients
Log coefficients are slightly more complex to analyze. For example, the variable sqft_living was log transformed. As a result, its coefficient has to be interpreted differently: a one percent increase in sqft_living changes price by approximately the value of the coefficient divided by 100, holding the other variables constant. The coefficient is 7.924e+04 (or more easily read as 79,240), so a 1% increase in sqft_living is associated with a price increase of about 79,240/100 = $792.
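The divide-by-100 rule is an approximation that works well for small percentage changes. The sketch below compares it with the exact change implied by a log-transformed predictor, assuming the transformation used the natural log (the text does not say which base was used). The function names are hypothetical.

```python
import math

SQFT_LIVING_COEF = 7.924e04  # coefficient on log(sqft_living) as quoted in the text


def exact_price_change(pct, coef=SQFT_LIVING_COEF):
    """Exact predicted price change when sqft_living rises by `pct` percent.

    If price = ... + coef * ln(sqft_living), then scaling sqft_living by
    (1 + pct/100) adds coef * ln(1 + pct/100) to the predicted price.
    """
    return coef * math.log(1 + pct / 100)


def shortcut_price_change(pct, coef=SQFT_LIVING_COEF):
    """The 'coefficient divided by 100' rule, scaled to `pct` percent."""
    return coef * pct / 100


for pct in (1, 10, 50):
    exact = exact_price_change(pct)
    approx = shortcut_price_change(pct)
    print(f"{pct}% larger: exact ${exact:,.0f}, shortcut ${approx:,.0f}")
```

For a 1% change the two figures are within a few dollars of each other; the gap widens as the percentage change grows, so the shortcut is best reserved for small changes.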
5. Interpret Dummies
Interpreting dummies requires a different approach. Recall that, when a categorical variable is transformed into dummy variables, one dummy variable must be dropped in order to avoid the dummy variable trap (perfect multicollinearity). The dataset used to build the regression above was from King County, and each dummy represents a different city. In this model, “Seattle” was dropped to avoid the dummy variable trap. So, Seattle acts as the baseline, and the coefficients on the remaining dummies, and thus the remaining cities, measure the difference in price relative to Seattle.
For example, the coefficient on City_Bellevue (the area where Bill Gates lives!) is 4.732e+04, or $47,320. So, holding everything else constant, a home located in Bellevue is predicted to be priced $47,320 higher than a comparable home in Seattle.
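The baseline logic can be sketched in a few lines. Only the City_Bellevue coefficient comes from the text; the dictionary layout and function names are hypothetical, and a real model would hold one entry per remaining city.

```python
BASELINE_CITY = "Seattle"  # the dropped dummy, used as the reference category

# Coefficients on the remaining city dummies (only Bellevue is quoted in the text).
CITY_COEFS = {"City_Bellevue": 4.732e04}


def encode_city(city):
    """Return the 0/1 dummy values for every non-baseline city in the model."""
    return {name: int(name == f"City_{city}") for name in CITY_COEFS}


def city_effect(city):
    """Predicted price difference versus an otherwise identical Seattle home."""
    dummies = encode_city(city)
    return sum(CITY_COEFS[name] * flag for name, flag in dummies.items())


for city in (BASELINE_CITY, "Bellevue"):
    print(f"{city}: dummies={encode_city(city)}, effect=${city_effect(city):,.0f}")
```

A Seattle home has every dummy set to 0, so its city effect is zero and is absorbed by the intercept; every other city's coefficient is read as a difference from that baseline.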
6. Interpret Interactions
Lastly, we will learn how to interpret interaction terms. There are two interaction terms in this regression example: sqft_living*floors and sqft_living*bathrooms. These interactions show us that the impact of sqft_living on price differs for different values of floors and for different values of bathrooms.
If we look at sqft_living*floors, the coefficient is 7.38e+04, or 73,800. This means that, for each one-unit increase in floors, the coefficient on sqft_living (that is, its impact on price) increases by 73,800.
If we look at sqft_living*bathrooms, the coefficient is 1.476e+04, or 14,760. This means that, for each one-unit increase in bathrooms, the coefficient on sqft_living increases by 14,760.
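One way to see this is to compute the total slope on sqft_living for a given home. The sketch below uses the coefficients quoted in the text and assumes the interaction terms are built from the same sqft_living term as the main effect, which the text does not state explicitly. The function name is hypothetical.

```python
SQFT_COEF = 7.924e04         # main-effect coefficient on sqft_living (from the text)
SQFT_X_FLOORS = 7.38e04      # coefficient on sqft_living*floors
SQFT_X_BATHROOMS = 1.476e04  # coefficient on sqft_living*bathrooms


def sqft_living_slope(floors, bathrooms):
    """Total coefficient on sqft_living for a home with the given features.

    With interactions, the slope is no longer a single number: it is the
    main effect plus each interaction coefficient times the other variable.
    """
    return SQFT_COEF + SQFT_X_FLOORS * floors + SQFT_X_BATHROOMS * bathrooms


one_floor = sqft_living_slope(floors=1, bathrooms=1)
two_floors = sqft_living_slope(floors=2, bathrooms=1)
print(f"Slope with 1 floor:  {one_floor:,.0f}")
print(f"Slope with 2 floors: {two_floors:,.0f}")
print(f"Difference:          {two_floors - one_floor:,.0f}")
```

The difference between the two slopes equals the interaction coefficient, which is exactly what that coefficient measures.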
There are various types of coefficients that can be included in a regression model, and it is important to take care to interpret each of them correctly. Whether an X-variable is linear, log, interaction or dummied determines how its coefficient translates into an effect on the Y-value.
This post should have given you a solid foundation on regression analysis, but there are many more types of X-Y relationships to explore. For more detail on how to interpret further relationships, take a look at the following link.
Good luck on your future regression models — I hope this helped you on your road to becoming a Data Scientist!