Boston Dataset
Posted by Mr Notebook Roooky in Notebook
Linear Regressions on the famous Boston dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston
dataset = load_boston()
print(dataset['DESCR'])
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
df.head()
pd.isnull(df).sum()
Using various methods to compute the correlation
print('Pearson')
print(df.corr(method='pearson'))
print('Spearman')
print(df.corr(method='spearman'))
print('Kendall')
print(df.corr(method='kendall'))
%timeit df.corr(method='pearson')
%timeit df.corr(method='spearman')
%timeit df.corr(method='kendall')
pearson = df.corr(method='pearson')
pearson.iloc[-1][:-1]
# assuming the target attribute is the last column, we drop its correlation with itself
corr_with_target = pearson.iloc[-1][:-1]
# attributes sorted from the most predictive
predictivity = corr_with_target.sort_values(ascending=False)
Since we might also be interested in strong negative correlations, it would be better to sort the correlations by their absolute value.
corr_with_target.iloc[abs(corr_with_target).argsort()[::-1]]
predictivity
df.info()
The info method used above helps us understand the nature of our dataset: in this case we can see we have 506 entries and 14 columns, and they are all of numerical type (float).
df.describe()
Taking a closer view of our columns
df.columns
# pairplot manages its own figure, so a preceding plt.figure call would have no effect
sns.pairplot(df)
Let's find out what kind of distribution we're dealing with; the distribution can tell us what type of model we should proceed with.
sns.distplot(df['target'])
print(df['target'].std())
print(df['target'].mean())
df['target'].describe()
plt.figure(figsize=[10,10])
sns.heatmap(df.corr(), annot= True)
From that heatmap we can see some of the features that pique our interest. For example RM correlates well with our target while the rest do not do so well, so for now we can focus on RM (number of rooms), which was already shown to be a contender back when we were choosing which correlation method to use.
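As a quick check on that reading, here is a small sketch that pulls RM's correlation with the target straight out of the correlation matrix:
# Pearson correlation between the number of rooms and the target price
print(df.corr().loc['RM', 'target'])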
Now we come back to remind ourselves of the columns:
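df.columns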
Let's draw some more graphs. A scatter plot is always a good choice because most people are familiar with this kind of graph, so it will help us here.
sns.scatterplot(y='target', x='RM', data=df)
From the graph above it's difficult to build a model based on this one feature: even though there is a positive relationship, it is not enough on its own, and more work needs to be done before we can convince ourselves.
Let's introduce the scipy library just to check if it can improve the performance of our one-feature model.
from scipy import stats
# linregress expects (x, y): RM is the predictor, target is the response
slope, intercept, r_value, p_value, std_err = stats.linregress(df['RM'], df['target'])
print("R Value: ", r_value)
print("R Squared Value: ", r_value ** 2)
print("Intercept: ", intercept)
Again, not bad numbers considering the fact that we're dealing with real data; an R value around 0.7 is good. But the question is always whether we can do better, to reassure ourselves. We could try to fit a curve or a line, but those efforts would be futile: as we can clearly see from the scatter, neither of those options would work on its own.
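To see why a single straight line only gets us so far, here is a small sketch (reusing the slope and intercept returned by linregress above) that overlays the fitted line on the RM-versus-target scatter:
# overlay the one-feature least-squares line on the scatter plot
sns.scatterplot(y='target', x='RM', data=df)
plt.plot(df['RM'], intercept + slope * df['RM'], color='red')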
We do have one more trick up our sleeves to try and model this data without any training involved: we could use a scaler to fit all our features into a common range and then use that range to compute a coefficient for each feature.
For this we are going to need the statsmodels library and scikit-learn's StandardScaler.
df.columns
X= df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
y= df['target']
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
# StandardScaler rescales each feature to zero mean and unit variance (not to a -1 to 1 range)
X = df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
y = df['target']
X = pd.DataFrame(scale.fit_transform(X), columns=X.columns)
print(X)
est = sm.OLS(y, X).fit()
est.summary()
We can see in the coef column that the number of rooms is outperforming the rest even when we scale all features down to a small range of values. But then again we have a few other features with coefficients of 2.5 and above in magnitude, which makes us rethink our focus on a one-feature prediction model.
The new contender happens to be RAD. RAD performed badly in terms of correlation, but now we have new information: once all the features are scaled to a small range, the RAD feature is worth paying attention to.
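As a quick check on that reading, here is a small sketch that sorts the coefficients from the est fit above by absolute value:
# OLS coefficients on the scaled features, largest magnitude first
coefs = est.params
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))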
This means it's time to introduce our favourite library again; it might have an algorithm that does a better job with all the features than our scaled estimator.
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
print('Intercept: ',lm.intercept_)
print('Coef: ',lm.coef_)
The coefficients above are easier to read once we pair them with their feature names:
coefficients = pd.DataFrame(lm.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
Predictions using the coefficients
predictions = lm.predict(X_test)
y_test
predictions
plt.scatter(y_test,predictions)
sns.distplot((y_test-predictions))
y_test-predictions
from sklearn import metrics
print('MAE: ',metrics.mean_absolute_error(y_test,predictions))
print('MSE: ',metrics.mean_squared_error(y_test,predictions))
print('sqtMSE: ',np.sqrt(metrics.mean_squared_error(y_test,predictions)))