Teaching Statistics with Real Estate Data

Statistics is the core of data science, which is an increasingly critical skill for many professions. It is, however, also an academic topic about which many people feel ambivalent. They are told it is important, and many are required to take a statistics course. The good students push themselves through a dreadful collection of Greek letters, memorize a bunch of formulas, and pass an exam. They may even earn an A. How much they actually remember a few years later, not to mention how much they can actually use statistics in their personal or professional lives, is a whole other story.

Is deviation the same as standard deviation 𝛔? How is standard deviation different from standard score Z? What about standard error? Is that a special type of error 𝜺? Isn’t error also called residual? And why does everything also have a Greek letter name? Most importantly, what’s the point of standard deviation? Can I eat it? Can I sell it? Can I make money with standard deviation?

Statistics as a branch of mathematics derives its beauty from elegant and concise mathematical equations. However, these abstract representations can be so dense that most people have trouble decoding such a “Da Vinci Code.” As a result, most of the effort in learning statistics is spent on decoding and applying mathematical equations and very little on its practical use. This is unfortunate because statistics is such a pragmatic tool embedded in so many real-world solutions.

To help students learn statistics more effectively, I decided to ask them to go house shopping. Well, they just needed to pretend to buy a house. They didn’t actually have to complete the purchase. With an unlimited budget, each student was asked to pick a real estate property in the Williamsburg area on the Zillow website, and enter data about the property using this short survey.

Then we would spend the rest of the semester calculating numbers and analyzing the dataset in different ways. The dataset and different analyses are available here.

Student names are included in the dataset, so they can personally identify their own records. When we compute deviations or residuals, I ask students to compute these scores for their own selected properties. I then ask them to write their scores on their own “data dots,” round sticky notes that we put up on a whiteboard to construct a histogram or a scatterplot.

This exercise is useful in many different ways. First, students get to clearly distinguish measures that belong to an individual case (i.e., a row), such as deviation, from measures that belong to an entire dataset, such as standard deviation. They also gain a deeper appreciation of the relationship between a case and a dataset.
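
To make the case-versus-dataset distinction concrete, here is a minimal Python sketch. The student names and prices are invented for illustration and are not taken from the class dataset. It computes one deviation and one standard score per student, next to a single mean, standard deviation, and standard error for the whole dataset.

```python
import math
import statistics

# Hypothetical prices (in thousands of dollars) picked by five students.
# These numbers are made up for illustration only.
prices = {"Ana": 250, "Ben": 300, "Cam": 350, "Dee": 400, "Eli": 700}
values = list(prices.values())

# Dataset-level measures: one number for the whole dataset.
mean_price = statistics.fmean(values)
sd_price = statistics.stdev(values)           # sample standard deviation (n - 1)
se_mean = sd_price / math.sqrt(len(values))   # standard error of the mean

# Case-level measures: one number per row (per student).
deviations = {name: p - mean_price for name, p in prices.items()}
z_scores = {name: d / sd_price for name, d in deviations.items()}

for name in prices:
    print(f"{name}: deviation = {deviations[name]:+.1f}, z = {z_scores[name]:+.2f}")
print(f"mean = {mean_price:.1f}, sd = {sd_price:.1f}, se = {se_mean:.1f}")
```

Each student can find their own row in the first part of the output, while the last line describes everyone at once. The deviations always sum to zero, which is why they have to be squared before they can be summarized as a standard deviation.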

Real estate datasets are a good choice for two reasons. (1) The business domain knowledge is easily accessible to the general population. For example, most people intuitively understand that bigger houses (i.e., those with a larger square footage) are probably going to be more expensive, and that adding a bathroom is probably going to increase the value of the property. (2) The relationships between the variables are usually very strong and predictable. For example, the number of bedrooms tends to be highly correlated with the number of bathrooms, and each additional bedroom usually adds a significant chunk of value. Therefore, as an instructor I don’t have to worry whether the statistical magic will work in a new semester or not.
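
As a rough illustration of the bedroom and bathroom relationship, the sketch below computes a Pearson correlation from scratch. The six houses are hypothetical, and `pearson_r` is a helper written for this example, not part of the course materials.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations (how the two variables move together).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (how much each variable varies on its own).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical houses: bedroom and bathroom counts, invented for illustration.
bedrooms = [2, 3, 3, 4, 4, 5]
bathrooms = [1, 2, 2, 2, 3, 4]

r = pearson_r(bedrooms, bathrooms)
print(f"r = {r:.2f}")
```

The same deviations students write on their data dots are the building blocks here: the correlation is large when houses that are above average on bedrooms are also above average on bathrooms.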

The survey questions include both categorical variables (e.g., Zipcode) and numeric variables (e.g., Price). Therefore, we can use the same dataset to run a wide range of analyses by combining different categorical and numeric variables. In fact, we can get through an entire semester using the same dataset and never run out of ideas. The analyses I teach, such as correlation, ANOVA, t tests, linear regression, and logistic regression, are organized by week on this Google Sheets file.

Most importantly, Zillow uses its own data to produce the Zestimate, which is a real-world application of machine learning predictions. Students have an intuitive understanding of the Zestimate, and when they realize that they can build their own Zestimate models after having learned regression, they often get a great sense of accomplishment. We then visit the Zillow Research site together to study additional explanations of how the Zestimate works, and again students feel a great sense of satisfaction when they realize that they can actually understand those explanations and jargon!

As we finish up the semester, we build our own Zestimate equations, make our own predictions, and calculate our residuals (i.e., how far off each prediction is). These models are very easy to interpret. For example, a beta coefficient tells you how much the predicted price would increase if the house had an extra bedroom, or how much it would drop if the house moved from Zipcode 23185 to 23188. These simple, vivid, and intuitive exercises make the rather abstract concepts and Greek letters (e.g., 𝛃) easier to digest, remember, and apply.
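
A home-made Zestimate of this kind can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the eight houses and their prices are invented, the model uses only bedrooms and a 0/1 indicator for Zipcode 23188, and `solve` and `fit_ols` are small helpers written for this example rather than the tools used in class.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical houses: (bedrooms, zipcode, price in thousands of dollars).
houses = [
    (2, "23185", 310), (3, "23185", 365), (4, "23185", 450), (5, "23185", 500),
    (2, "23188", 270), (3, "23188", 340), (4, "23188", 380), (5, "23188", 460),
]

# Design matrix: intercept, bedrooms, and a dummy that is 1 for Zipcode 23188.
X = [[1.0, beds, 1.0 if zipcode == "23188" else 0.0] for beds, zipcode, _ in houses]
y = [price for _, _, price in houses]

b_intercept, b_bedroom, b_zip = fit_ols(X, y)
predictions = [b_intercept + b_bedroom * row[1] + b_zip * row[2] for row in X]
residuals = [actual - predicted for actual, predicted in zip(y, predictions)]

print(f"each extra bedroom: {b_bedroom:+.2f} thousand dollars")
print(f"23188 instead of 23185: {b_zip:+.2f} thousand dollars")
for (beds, zipcode, price), res in zip(houses, residuals):
    print(f"{beds} bedrooms in {zipcode}: price {price}, residual {res:+.1f}")
```

With these made-up numbers the bedroom coefficient comes out positive and the Zipcode coefficient negative, which mirrors the interpretation described above. Each residual belongs to one house, so every student can again look up their own.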