Getting Started, Regression
from pathlib import Path

DATA_DIR = Path("/kaggle/input")
if (DATA_DIR / "ucfai-core-fa19-regression").exists():
    DATA_DIR /= "ucfai-core-fa19-regression"
elif DATA_DIR.exists():
    # no-op to keep the proper data path for Kaggle
    pass
else:
    # You'll need to download the data from Kaggle and place it in the `data/`
    # directory beside this notebook.
    # The data should be here: https://kaggle.com/c/ucfai-core-fa19-regression/data
    DATA_DIR = Path("data")
First things first, we need to import some packages:
- matplotlib allows us to graph
- numpy is a powerful Linear Algebra library
- pandas allows us to Extract, Transform, and Load (ETL) datasets
- sklearn is a great Machine Learning library
# import some important stuff
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as st
from sklearn import datasets, linear_model
Basic Example
The data for this example is arbitrary (we’ll use real data in a bit), but there is a clear linear relationship here.
Let’s also graph this to demonstrate that relationship:
# Get some data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# Let's plot the data to see what it looks like
plt.scatter(x, y, color = "black")
plt.show()
Now, let’s use “least squares estimation” to try minimizing the squared error of our function’s output and the data we plotted.
$SS_{xy}$ is the cross deviation about $x$, and $SS_{xx}$ is the deviation about $x$.
It’s basically some roundabout algebra methods to optimize a function.
The concept isn’t super complicated but it gets hairy when you do it by hand.
# calculating the coefficients
# number of observations/points
n = np.size(x)
# means of the x and y vectors
m_x, m_y = np.mean(x), np.mean(y)
# calculating cross-deviation and deviation about x
SS_xy = np.sum(y*x) - n*m_y*m_x
SS_xx = np.sum(x*x) - n*m_x*m_x
# calculating regression coefficients
b_1 = SS_xy / SS_xx
b_0 = m_y - b_1*m_x
# variable to hold the coefficients
b = (b_0, b_1)
# print out the estimated coefficients
print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
But, we don’t need to directly program all of the maths everytime we do linear regression.
sklearn
has built in functions that allows you to quickly do Linear Regression with just a few lines of code.
We’re going to use sklearn
to make a model and then plot it using matplotlib
.
# sklearn requires 2D arrays of shape (n_samples, n_features)
x = x.reshape(-1,1)
y = y.reshape(-1,1)
# making the model
regress = linear_model.LinearRegression()
regress.fit(x, y)
And now, let's see what the model looks like:
# plotting the actual points as scatter plot
plt.scatter(x, y, color = "black",
marker = "o", s = 30)
# predicted response vector from the fitted model
y_pred = regress.predict(x)
# plotting the regression line
plt.plot(x, y_pred, color = "blue")
# putting labels
plt.xlabel('x')
plt.ylabel('y')
# function to show plot
plt.show()
So now we can make predictions on new points based on our data:
# here we can try out any data point
print(regress.predict([[6]]))
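As a quick sanity check, the slope and intercept that sklearn found should match (up to floating-point noise) the $b_1$ and $b_0$ we computed by hand:

# the fitted coefficients live on the model object
print("sklearn intercept:", regress.intercept_)  # should match b_0
print("sklearn slope:", regress.coef_)           # should match b_1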
Applied Linear Regression
The Ames Housing Dataset
Ames is a city located in Iowa.
- This data set consists of all property sales collected by the Ames City Assessor’s Office between the years of 2006 and 2010.
- Originally contained 113 variables and 3970 property sales, covering stand-alone garages, condos, storage areas, and of course residential property.
- Distributed to the public as a means to replace the old Boston Housing 1970’s data set.
- Link to Original
- The “cleaned” version of this dataset contains 2930 observations along with 80 predictor variables and two identification variables.
What was the original purpose of this data set?
- Why did the City of Ames decide to collect this data?
- What do the prices of houses affect?
What’s inside?
- This "new" data set contains 2930 (n=2930) observations along with 80 predictor variables and two identification variables.
- You can read the paper that introduced this data and an exhaustive breakdown of the variables if you want.
Quick Summary
Of the 80 predictor variables we have:
- 20 continuous variables (mostly area dimensions)
- Garage Area, Wood Deck Area, Pool Area
- 14 discrete variables (counts of items and dates)
- Remodeling Dates, Month and Year Sold
- 23 nominal and 23 ordinal
- Nominal: Condition of the Sale, Type of Heating and Foundation
- Ordinal: Fireplace and Kitchen Quality, Overall Condition of the House
Question to Answer:
- What is the linear relationship between sale price and above-ground living area?
To try answering this, let’s visually investigate what we’re trying to predict. We’ll kick this off with getting some summary statistics.
housing_data = pd.read_csv(DATA_DIR / "house-prices-advanced-regression-techniques/train.csv", delimiter="\t")
# Mean Sales price
mean_price = np.mean(housing_data["SalePrice"])
print("Mean Price : " + str(mean_price))
# Variance of the Sales Price
var_price = np.var(housing_data["SalePrice"], ddof=1)
print("Variance of Sales Price : " + str(var_price))
# Median of Sales Price
median_price = np.median(housing_data["SalePrice"])
print("Median Sales Price : " + str(median_price))
# Skew of Sales Price
skew_price = st.skew(housing_data["SalePrice"])
print("Skew of Sales Price : " + str(skew_price))
housing_data["SalePrice"].describe()
Another way we can view our data is with a box and whisker plot.
plt.boxplot(housing_data["SalePrice"])
plt.ylabel("Sales Price")
plt.show()
Now let’s look at sales price on above ground living room area.
plt.scatter(housing_data["GrLivArea"], housing_data["SalePrice"])
plt.xlabel("Above Ground Living Area")
plt.ylabel("Sales Price")
plt.show()
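The scatter plot suggests a positive linear trend. Before fitting anything, we can put a number on it with the Pearson correlation between the two columns (a value near 1 means a strong positive linear relationship):

# correlation between living area and sale price
print(housing_data["GrLivArea"].corr(housing_data["SalePrice"]))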
Finally, let's generate our model and see how well it predicts sale price!
# we need to reshape the array to make the sklearn gods happy
area_reshape = housing_data["GrLivArea"].values.reshape(-1,1)
price_reshape = housing_data["SalePrice"].values.reshape(-1,1)
# Generate the Model
model = linear_model.LinearRegression(fit_intercept=True)
model.fit(area_reshape, price_reshape)
price_prediction = model.predict(area_reshape)
# plotting the actual points as scatter plot
plt.scatter(area_reshape, price_reshape)
# plotting the regression line
plt.plot(area_reshape, price_prediction, color = "red")
# putting labels
plt.xlabel('Above Ground Living Area')
plt.ylabel('Sales Price')
# function to show plot
plt.show()
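To quantify how well the line fits, LinearRegression.score returns the $R^2$ (coefficient of determination) on whatever data you hand it; 1.0 would be a perfect fit, while 0.0 means the line does no better than always predicting the mean price:

# R^2 of the fit on the training data
print("R^2:", model.score(area_reshape, price_reshape))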
Applied Logistic Regression
# we're going to need a different model, so let's import it
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
For Logistic Regression, we’re going to be using a real dataset.
You can find this data from UCI’s Machine Learning Repository, or on Kaggle alongside this notebook.
On Kaggle, we’ve packaged up the data for you, so now we can play around with it.
But before that, we need to read in the data. pandas has the functions we need to do this:
# read_csv allows us to easily import a whole dataset
data = pd.read_csv(
DATA_DIR / "adult-census-income/adult.csv",
names=["age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","income"]
)
# this tells us what's in it
print(data.info())
# data.head() gives us the first 5 rows of the data
data.head()
The code below will show us some information about the continuous parameters that our dataset contains.
- age is the person's age
- fnlwgt is the final weight, i.e., how many people in the overall population this row represents
- education-num is a numerical way of representing education level
- capital-gain is the money made from investments
- capital-loss is the loss from investments
- hours-per-week is the number of hours worked during a week
# this function gives us some quick info about the continuous data in the dataset
data.describe()
Now here is the question:
- Which one of these parameters is best for figuring out whether someone is going to make more than $50K a year?
- Make sure you choose a continuous parameter; the categorical ones won't work here.
# put the name of the parameter you want to test
### BEGIN SOLUTION
test = "capital-gain"
### END SOLUTION
# but before we make our model, we need to modify our data a bit

# little baby helper function
def incomeFixer(x):
    if x == " <=50K":
        return 0
    else:
        return 1
# change the income data into 0's and 1's
data["income"] = data.apply(lambda row: incomeFixer(row['income']), axis=1)
# get the data we are going to make the model with
x = np.array(data[test])
y = np.array(data["income"])
# again, let's make the sklearn gods happy
x = x.reshape(-1,1)
# making the train-test split
splits = train_test_split(x, y, test_size=0.25, random_state=42)
x_train, x_test, y_train, y_test = splits
# now let's make the model!
logreg = LogisticRegression(solver='liblinear')
logreg.fit(x_train, y_train)
# now we need to test the model's performance
print(logreg.score(x_test, y_test))
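Accuracy alone can be misleading on imbalanced data like this, so it's worth comparing the score against the trivial baseline of always predicting the majority class (income of 0, i.e. <=50K):

# accuracy of a "model" that always predicts <=50K
print("baseline accuracy:", np.mean(y_test == 0))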
Thank You
We hope that you enjoyed being here today.
Please fill out this questionnaire so we can get some feedback about tonight's meeting.
We hope to see you here again next week for our core meeting on Random Forests and Support Vector Machines.