Using Linear Regression to Gain Insight into Housing Sales

Overview of a data modeling project for Flatiron’s data science course

Oct 27, 2020

How much is a home’s price affected by:

a recent renovation?
10 additional square feet of living space?
other listings for sale in the area?

For my data science project for Flatiron School, I used a dataset covering home sales in King County, Washington, in 2014 and 2015 to see if I could determine how much of a home’s worth was represented by particular characteristics, such as the area of the living space, when it was built, or whether it is located on a waterfront.

The ultimate purpose of the assignment was to give me a hands-on way to learn linear regression, or in other words, how to model the relationship between independent variables (in this case, various features of a home) and a dependent variable (the home’s sale price). Towards the end of the project, I fit Ordinary Least Squares (OLS) models in Python.

As Wikipedia defines it, “OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function.”
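To make that definition concrete, here is a minimal sketch of OLS with a single explanatory variable, written in plain Python. It is not the code from my project: the function names and the toy numbers are made up for illustration. It also computes R-squared, the share of variation explained by the fit, which comes up later when evaluating the final model.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared gaps between observed y and the line's predictions."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

def r_squared(x, y, intercept, slope):
    """Share of the variation in y explained by the fitted line."""
    y_bar = mean(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sum_squared_residuals(x, y, intercept, slope) / ss_tot

# Made-up toy data (not from the King County dataset).
sqft = [1000, 1500, 2000, 2500, 3000]
price = [300_000, 410_000, 480_000, 620_000, 690_000]

intercept, slope = fit_simple_ols(sqft, price)
fit_quality = r_squared(sqft, price, intercept, slope)
```

Any other line through the same points would produce a larger sum of squared residuals, which is all "least squares" means.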

Exploring the Data

Before I could begin analyzing the data, I had to figure out what I was dealing with. I could use everything from a home’s latitude and longitude to the size of its basement. I ran visualizations of the data to help me determine which direction I might go in.

Price

Most houses sold for $2 million or less.

Grade

A majority of houses received a grade of 7 or 8. Grade is King County’s rating of a home’s construction and design quality.

Year Built vs. Price

There doesn’t seem to be a strong relationship between when a house was built and its price.

Lot square footage vs. price

Some homes on rather small lots can be worth several million, while the dataset’s house with the largest lot sold for less than $1 million.

I figured this finding had to do with homes in urban versus rural areas. It also led me to the next stage of dealing with outliers. For example, while most houses had fewer than five bedrooms, one in the dataset had 33, far more than any other home.
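As a rough illustration of that kind of outlier filter, the sketch below drops records above a bedroom cutoff. The records and the cutoff of 10 are assumptions for the example; only the 33-bedroom figure comes from the dataset.

```python
# Hypothetical records; only the 33-bedroom value comes from the article.
homes = [
    {"id": 1, "bedrooms": 3},
    {"id": 2, "bedrooms": 4},
    {"id": 3, "bedrooms": 33},
    {"id": 4, "bedrooms": 2},
]

MAX_BEDROOMS = 10  # assumed cutoff for illustration, not the value I used

# Keep plausible homes and set the extreme ones aside for inspection.
kept = [h for h in homes if h["bedrooms"] <= MAX_BEDROOMS]
dropped = [h for h in homes if h["bedrooms"] > MAX_BEDROOMS]
```

Setting the dropped rows aside rather than discarding them outright makes it easy to check whether a value is a data-entry error or a genuinely unusual home.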

Checking for Multicollinearity

In addition to removing outliers and categories I determined would not be useful for my analysis, I also looked to remove categories that were too closely associated with other categories. These multicollinear groupings can muddy the findings of the OLS model. I used Seaborn heatmaps and Variance Inflation Factors (VIFs) to help with my decisions.

In the correlation heatmap, I looked to eliminate categories with correlation scores above 0.70, and in the VIF chart, I wanted to eliminate those with VIFs above 5.0. For example, I found sqft_above is closely linked to sqft_living, so I chose to eliminate the former category from my model.
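Both checks can be sketched with NumPy alone. The code below is an illustration, not my project code: the data are synthetic (generated so that sqft_above tracks sqft_living), and the helper names are hypothetical. A column's VIF is computed from its definition, 1 / (1 - R²), where R² comes from regressing that column on all the others.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.70):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs

# Synthetic data for illustration only (not the King County values).
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 900, size=200)
sqft_above = 0.8 * sqft_living + rng.normal(0, 150, size=200)
sqft_lot = rng.normal(15000, 5000, size=200)

names = ["sqft_living", "sqft_above", "sqft_lot"]
X = np.column_stack([sqft_living, sqft_above, sqft_lot])

pairs = high_correlation_pairs(X, names)
vifs = dict(zip(names, variance_inflation_factors(X)))
```

With data built this way, the two living-area columns flag each other on both tests, while the independent lot column does not.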

I also was interested in how an area’s population affects housing prices. I found U.S. Census data for the same period as the housing dataset and merged the datasets together. I engineered a category I called “active market score,” which is the ratio of a zip code’s home sales to its population (which I then multiplied by 1,000 to make it easier to work with).
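The score itself is simple arithmetic: sales per 1,000 residents. A minimal sketch follows, using placeholder zip code labels and made-up counts rather than the real Census figures.

```python
def active_market_score(sales_count, population, scale=1000):
    """Home sales per `scale` residents of a zip code."""
    if population <= 0:
        raise ValueError("population must be positive")
    return sales_count * scale / population

# Hypothetical zip codes and counts, for illustration only.
zip_stats = {
    "zip_a": {"sales": 300, "population": 30_000},
    "zip_b": {"sales": 150, "population": 50_000},
}

scores = {
    zip_code: active_market_score(s["sales"], s["population"])
    for zip_code, s in zip_stats.items()
}
```

In this made-up example, zip_a has twice the sales of zip_b but more than three times the score, because its population is smaller.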

Running the models

While picking and choosing categories, I ran several versions of the OLS model to see how different characteristics affect a home’s price. Through the iterations, I eliminated categories with p-values larger than 0.05 and found it useful to normalize continuous categories using their standard deviations to reduce the likelihood of these variables introducing a bias into the model.
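One common way to normalize by standard deviation is a z-score transform, sketched below. I am assuming that form here for illustration; the values are made up. Under that assumption, each coefficient describes the expected price change for a one-standard-deviation change in the input.

```python
from statistics import mean, stdev

def standardize(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up living areas in square feet, for illustration only.
sqft_living = [1200, 1800, 2100, 2600, 3400]
sqft_living_z = standardize(sqft_living)
```

After the transform, variables measured in very different units (square feet of lot versus square feet of basement, say) sit on a comparable scale.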

I also converted certain variables into categorical ones. For example, instead of using the year of a home’s renovation, which could have been any year since 1900, I introduced a column into my dataset called “recently renovated,” which only accounted for homes renovated in the previous 25 years. I also disregarded the “year built” column in favor of assigning each home’s construction date into one of six blocks of approximately 20 years each.
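Those two transformations might look something like the sketch below. Several details are assumptions: that the 25-year window is measured back from the sale year, that a renovation year of 0 or None means "never renovated," and that the six blocks start in 1900 and run in 20-year steps with the last one open-ended.

```python
from bisect import bisect_right

def recently_renovated(yr_renovated, sale_year, window=25):
    """1 if renovated within `window` years before the sale, else 0."""
    if not yr_renovated:  # 0 or None treated as never renovated (assumption)
        return 0
    return int(0 <= sale_year - yr_renovated <= window)

# Assumed block boundaries; the article only names some of the blocks.
ERA_STARTS = [1900, 1920, 1940, 1960, 1980, 2000]
ERA_LABELS = ["1900-1919", "1920-1939", "1940-1959",
              "1960-1979", "1980-1999", "2000+"]

def construction_era(yr_built):
    """Map a construction year to its block label."""
    if yr_built < ERA_STARTS[0]:
        raise ValueError("year is before the first block")
    return ERA_LABELS[bisect_right(ERA_STARTS, yr_built) - 1]

# Hypothetical home: built 1915, renovated 2000, sold 2015.
flag = recently_renovated(2000, 2015)
era = construction_era(1915)
```

Turning a year into a flag or a block trades precision for a coefficient that is easier to interpret.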

Still, I found my model didn’t fit the dataset all that well, especially for the more expensive homes, according to a Q-Q plot of the residuals I generated. Ideally, the blue dots would follow the red line, but instead it looked like this:

When I limited my model to homes worth $900,000 or below, it still accounted for more than 90% of the homes, but my Q-Q plot looked a lot better:

Therefore, a real estate company relying on this model should not apply its findings to homes at the high end of the market.
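For readers unfamiliar with Q-Q plots, the sketch below computes the points such a plot would show, using only the standard library. It is an illustration with made-up residuals, not my plotting code, and the plotting position (i - 0.5) / n is one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each standardized residual with its standard-normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    mu, sigma = mean(ordered), stdev(ordered)
    std_normal = NormalDist()
    points = []
    for i, r in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position for the i-th smallest value
        points.append((std_normal.inv_cdf(p), (r - mu) / sigma))
    return points

# Made-up residuals with one large positive miss, like an underpriced mansion.
residuals = [-3, -2, -1, -1, 0, 0, 1, 1, 2, 30]
points = qq_points(residuals)
```

If the residuals were normally distributed, the two numbers in each pair would be close. Here the last point's sample value sits well above its theoretical quantile, which is the kind of upward peel at the high end that signals a poor fit for expensive homes.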

What I found

In my final model, the R-squared was 0.532, which means a little more than half of the observed variation in price can be explained by the model’s inputs. From the coefficients, I determined that if all other variables stay constant:

For each increase or decrease of 910.9 square feet in living space, we expect a home’s price to rise or fall by $61,380 on average.
For each increase or decrease of 39,842 square feet in lot size, we expect a home’s price to rise or fall by $2,565 on average.
For each increase or decrease of 437.5 square feet in basement space, we expect a home’s price to rise or fall by $722 on average.
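Because the model is linear, the figures above can be rescaled to a per-square-foot basis, which helps answer the question from the top of this post about 10 additional square feet. The sketch below does that arithmetic using only the numbers quoted in the list; the variable names are my own for this example.

```python
# (change in square feet, expected change in price in dollars),
# copied from the three findings listed above.
reported = {
    "living": (910.9, 61_380),
    "lot": (39_842, 2_565),
    "basement": (437.5, 722),
}

def dollars_per_sqft(sqft_step, price_step):
    """Rescale a reported effect to a single square foot."""
    return price_step / sqft_step

per_sqft = {name: dollars_per_sqft(*pair) for name, pair in reported.items()}

# Expected price change for 10 extra square feet of living space.
ten_sqft_living = 10 * per_sqft["living"]
```

This works out to roughly $67 per square foot of living space, or about $670 for 10 additional square feet, assuming all other variables stay constant and the home falls within the price range the model covers.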

In addition:

For every grade a house increases, its value is expected to rise $112,000.
Houses in an active market command higher prices than similar houses in less active markets.
If a house is located on a waterfront, it is expected to be worth $143,200 more.
Houses renovated within the last 25 years are expected to be worth $21,180 more.
Each additional floor added to a house is expected to increase its worth $29,520 (regardless of total sq. footage).
The model could not accurately incorporate the age of houses built since 1980. Houses built in older age groups, such as 1900–1919, tend to be worth more than newer ones, such as 1960–1979.

Most of these findings make sense. For example, one would expect houses to generally be worth more as their grade increases and for homes on waterfronts to command higher prices.

As for the active market score, I figured it could have gone either way: areas with a lot of listings could drive down prices because of a higher supply or they could increase prices because of buyers’ eagerness to make offers before a home slips away. It turns out the latter is true.

Other findings make less intuitive sense. For example, my model finds that older homes tend to be worth more. Also, even though increased square footage of a home’s living space and its lot typically translate into higher prices, the numbers seem a bit off. My guess is that these issues have to do with urban vs. rural areas. Older housing stock is more likely to be found in in-demand city centers, and an apartment in Seattle is much different than a farmhouse out to its east.

Clearly, I still have some unanswered questions and avenues for further exploration. While I’m sure continuing to use these techniques will help solidify many of the concepts in my mind, I found this assignment to be a useful and interesting introduction to linear regression and modeling.