MrRutledge

'If at first you dont succeed just refresh and try again.' Aliyah

Fri 11 March 2016

Boston Dataset

Posted by Mr Notebook Roooky in Notebook   

Open In Colab

Linear Regressions on the famous Boston dataset

In [1]:
import pandas as pd 
import numpy as np
In [2]:
import seaborn as sns 
import matplotlib.pyplot as plt
In [3]:
%matplotlib inline
In [4]:
from sklearn.datasets import load_boston
In [5]:
bostondt = load_boston()
In [6]:
print(bostondt['DESCR'])
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [7]:
dataset = load_boston()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target
In [8]:
df.head()
Out[8]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [9]:
pd.isnull(df).sum()
Out[9]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
target     0
dtype: int64

Using various methods to compute the corrolation

In [10]:
#print('Pearson')
print(df.corr(method='pearson'))
#print('Spearman')
print(df.corr(method='spearman'))
#print('Kendall')
print(df.corr(method='kendall'))
             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734   
ZN      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537   
INDUS    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779   
CHAS    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518   
NOX      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470   
RM      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265   
AGE      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000   
DIS     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881   
RAD      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022   
TAX      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456   
PTRATIO  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515   
B       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534   
LSTAT    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339   
target  -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955   

              DIS       RAD       TAX   PTRATIO         B     LSTAT    target  
CRIM    -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305  
ZN       0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445  
INDUS   -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725  
CHAS    -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260  
NOX     -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321  
RM       0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360  
AGE     -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955  
DIS      1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929  
RAD     -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626  
TAX     -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536  
PTRATIO -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787  
B        0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461  
LSTAT   -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663  
target   0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000  
             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.571660  0.735524  0.041537  0.821465 -0.309116  0.704140   
ZN      -0.571660  1.000000 -0.642811 -0.041937 -0.634828  0.361074 -0.544423   
INDUS    0.735524 -0.642811  1.000000  0.089841  0.791189 -0.415301  0.679487   
CHAS     0.041537 -0.041937  0.089841  1.000000  0.068426  0.058813  0.067792   
NOX      0.821465 -0.634828  0.791189  0.068426  1.000000 -0.310344  0.795153   
RM      -0.309116  0.361074 -0.415301  0.058813 -0.310344  1.000000 -0.278082   
AGE      0.704140 -0.544423  0.679487  0.067792  0.795153 -0.278082  1.000000   
DIS     -0.744986  0.614627 -0.757080 -0.080248 -0.880015  0.263168 -0.801610   
RAD      0.727807 -0.278767  0.455507  0.024579  0.586429 -0.107492  0.417983   
TAX      0.729045 -0.371394  0.664361 -0.044486  0.649527 -0.271898  0.526366   
PTRATIO  0.465283 -0.448475  0.433710 -0.136065  0.391309 -0.312923  0.355384   
B       -0.360555  0.163135 -0.285840 -0.039810 -0.296662  0.053660 -0.228022   
LSTAT    0.634760 -0.490074  0.638747 -0.050575  0.636828 -0.640832  0.657071   
target  -0.558891  0.438179 -0.578255  0.140612 -0.562609  0.633576 -0.547562   

              DIS       RAD       TAX   PTRATIO         B     LSTAT    target  
CRIM    -0.744986  0.727807  0.729045  0.465283 -0.360555  0.634760 -0.558891  
ZN       0.614627 -0.278767 -0.371394 -0.448475  0.163135 -0.490074  0.438179  
INDUS   -0.757080  0.455507  0.664361  0.433710 -0.285840  0.638747 -0.578255  
CHAS    -0.080248  0.024579 -0.044486 -0.136065 -0.039810 -0.050575  0.140612  
NOX     -0.880015  0.586429  0.649527  0.391309 -0.296662  0.636828 -0.562609  
RM       0.263168 -0.107492 -0.271898 -0.312923  0.053660 -0.640832  0.633576  
AGE     -0.801610  0.417983  0.526366  0.355384 -0.228022  0.657071 -0.547562  
DIS      1.000000 -0.495806 -0.574336 -0.322041  0.249595 -0.564262  0.445857  
RAD     -0.495806  1.000000  0.704876  0.318330 -0.282533  0.394322 -0.346776  
TAX     -0.574336  0.704876  1.000000  0.453345 -0.329843  0.534423 -0.562411  
PTRATIO -0.322041  0.318330  0.453345  1.000000 -0.072027  0.467259 -0.555905  
B        0.249595 -0.282533 -0.329843 -0.072027  1.000000 -0.210562  0.185664  
LSTAT   -0.564262  0.394322  0.534423  0.467259 -0.210562  1.000000 -0.852914  
target   0.445857 -0.346776 -0.562411 -0.555905  0.185664 -0.852914  1.000000  
             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.462057  0.521014  0.033948  0.603361 -0.211718  0.497297   
ZN      -0.462057  1.000000 -0.535468 -0.039419 -0.511464  0.278134 -0.429389   
INDUS    0.521014 -0.535468  1.000000  0.075889  0.612030 -0.291318  0.489070   
CHAS     0.033948 -0.039419  0.075889  1.000000  0.056387  0.048080  0.055616   
NOX      0.603361 -0.511464  0.612030  0.056387  1.000000 -0.215633  0.589608   
RM      -0.211718  0.278134 -0.291318  0.048080 -0.215633  1.000000 -0.187611   
AGE      0.497297 -0.429389  0.489070  0.055616  0.589608 -0.187611  1.000000   
DIS     -0.539878  0.478524 -0.565137 -0.065619 -0.683930  0.179801 -0.609836   
RAD      0.563969 -0.234663  0.353967  0.021739  0.434828 -0.076569  0.306201   
TAX      0.544956 -0.289911  0.483228 -0.037655  0.453258 -0.190532  0.360311   
PTRATIO  0.312768 -0.361607  0.336612 -0.115694  0.278678 -0.223194  0.251857   
B       -0.264378  0.128177 -0.192017 -0.033277 -0.202430  0.032951 -0.154056   
LSTAT    0.454837 -0.386818  0.465980 -0.041344  0.452005 -0.468231  0.485359   
target  -0.403964  0.339989 -0.418430  0.115202 -0.394995  0.482829 -0.387758   

              DIS       RAD       TAX   PTRATIO         B     LSTAT    target  
CRIM    -0.539878  0.563969  0.544956  0.312768 -0.264378  0.454837 -0.403964  
ZN       0.478524 -0.234663 -0.289911 -0.361607  0.128177 -0.386818  0.339989  
INDUS   -0.565137  0.353967  0.483228  0.336612 -0.192017  0.465980 -0.418430  
CHAS    -0.065619  0.021739 -0.037655 -0.115694 -0.033277 -0.041344  0.115202  
NOX     -0.683930  0.434828  0.453258  0.278678 -0.202430  0.452005 -0.394995  
RM       0.179801 -0.076569 -0.190532 -0.223194  0.032951 -0.468231  0.482829  
AGE     -0.609836  0.306201  0.360311  0.251857 -0.154056  0.485359 -0.387758  
DIS      1.000000 -0.361892 -0.381988 -0.223486  0.168631 -0.409347  0.313115  
RAD     -0.361892  1.000000  0.558107  0.251913 -0.214364  0.287943 -0.248115  
TAX     -0.381988  0.558107  1.000000  0.287769 -0.241606  0.384191 -0.414650  
PTRATIO -0.223486  0.251913  0.287769  1.000000 -0.042152  0.330335 -0.398789  
B        0.168631 -0.214364 -0.241606 -0.042152  1.000000 -0.145430  0.126955  
LSTAT   -0.409347  0.287943  0.384191  0.330335 -0.145430  1.000000 -0.668656  
target   0.313115 -0.248115 -0.414650 -0.398789  0.126955 -0.668656  1.000000  
In [11]:
%timeit df.corr(method='pearson')
1 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [12]:
%timeit df.corr(method='spearman')
24.9 ms ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]:
%timeit df.corr(method='kendall')
75 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]:
pearson = df.corr(method= 'pearson')
pearson.iloc[-1][:-1]
Out[14]:
CRIM      -0.388305
ZN         0.360445
INDUS     -0.483725
CHAS       0.175260
NOX       -0.427321
RM         0.695360
AGE       -0.376955
DIS        0.249929
RAD       -0.381626
TAX       -0.468536
PTRATIO   -0.507787
B          0.333461
LSTAT     -0.737663
Name: target, dtype: float64
In [15]:
pearson = df.corr(method= 'pearson')
#assuming target attr is the last then we remove corr wit itself
corr_with_target = pearson.iloc[-1][:-1]
#attri sorted from the most predictive
#predictivity = corr_with_target.argsort(ascending= False)
predictivity = corr_with_target.sort_values(ascending= False)

Since we might be also interested in strong negative correlations it would be better to sort the correlations by the absolute value.

In [16]:
corr_with_target[abs(corr_with_target).argsort()[::-1]]
Out[16]:
LSTAT     -0.737663
RM         0.695360
PTRATIO   -0.507787
INDUS     -0.483725
TAX       -0.468536
NOX       -0.427321
CRIM      -0.388305
RAD       -0.381626
AGE       -0.376955
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
Name: target, dtype: float64
In [17]:
predictivity
Out[17]:
RM         0.695360
ZN         0.360445
B          0.333461
DIS        0.249929
CHAS       0.175260
AGE       -0.376955
RAD       -0.381626
CRIM      -0.388305
NOX       -0.427321
TAX       -0.468536
INDUS     -0.483725
PTRATIO   -0.507787
LSTAT     -0.737663
Name: target, dtype: float64
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
target     506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB

The info method used above helps us understand the nature of our dataset in this case we can see we have 506 entries 14 columns and and they are all numerical type(Float)

In [19]:
df.describe()
Out[19]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

Taking a closer view of our columns

In [20]:
df.columns
Out[20]:
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'target'],
      dtype='object')
In [21]:
plt.figure(figsize=[10,5])
sns.pairplot(df)
Out[21]:
<seaborn.axisgrid.PairGrid at 0x1d38e3b4710>
<Figure size 720x360 with 0 Axes>

Lets find out what kind of distribution our dealing with, the distribution can tell us what type of model we should proceed with.

In [58]:
sns.distplot(df['target'])
C:\Users\Shaki\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d39951b9b0>
In [59]:
print(df['target'].std())
print(df['target'].mean())
9.19710408737982
22.532806324110698
In [60]:
df['target'].describe()
Out[60]:
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: target, dtype: float64
In [61]:
plt.figure(figsize=[10,10])
sns.heatmap(df.corr(), annot= True)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d3995b5198>

From that heatmap we can see some of the features that will pick our interest, for example we got RM it did well with our target the rest of them not so well so for now we can focus on RM(Number of Rooms) which by the way was shown to be a conteder from when we were still choosing the type of correlation method were going to us.

Now we come back to remind ourselves of the columns

Lets draw some more graphs a scatter graph is always good because most people a familiar with this kind of graph so it will help us here.

In [62]:
sns.scatterplot(y='target', x='RM', data=df)
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d399967940>

From the graph above its difficult to make a model based on this one feature because even though there is a positive relationship it is not enough, more work to be done before we can convice our selves.

Lets introduce a the scipy library just to check if it can improve the performance of our one feature model.

In [27]:
from scipy import stats
In [28]:
slope, intercept, r_value, p_value, std_err = stats.linregress(df['target'], df['RM'])
In [29]:
print("R Value: " ,r_value)
print("RSquared Value: " ,r_value ** 2)
print("Intercept: " ,intercept)
R Value:  0.6953599470715394
RSquared Value:  0.483525455991334
Intercept:  5.087638671836054

Again not bad numbers considring the fact that were dealing with real data, 0.7 R value is good. But the question is always whether we can do better to reassure ourselves. We could try to fit a curve or line but those efforts would be futile as we can clearly see from the none those options would work.

we do have one more trick up our sleeves to try and model this data with out any trainning involved, we could use a scaler to fit all our data into a range and then use that range to compute a coef.

For this we are going to need the statsmodels library and scikitlearn's Standardscaler.

In [30]:
df.columns
Out[30]:
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'target'],
      dtype='object')
In [71]:
X= df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
y= df['target']
In [32]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

X= df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']]
y= df['target']

#scale.fit makes the feature scaled to 1--->-1

X= df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']] = scale.fit_transform(X= df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT']].as_matrix())

print (X)

est = sm.OLS(y, X).fit()

est.summary()
[[-0.41978194  0.28482986 -1.2879095  ... -1.45900038  0.44105193
  -1.0755623 ]
 [-0.41733926 -0.48772236 -0.59338101 ... -0.30309415  0.44105193
  -0.49243937]
 [-0.41734159 -0.48772236 -0.59338101 ... -0.30309415  0.39642699
  -1.2087274 ]
 ...
 [-0.41344658 -0.48772236  0.11573841 ...  1.17646583  0.44105193
  -0.98304761]
 [-0.40776407 -0.48772236  0.11573841 ...  1.17646583  0.4032249
  -0.86530163]
 [-0.41500016 -0.48772236  0.11573841 ...  1.17646583  0.44105193
  -0.66905833]]
C:\Users\Shaki\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  # Remove the CWD from sys.path while we load stuff.
Out[32]:
OLS Regression Results
Dep. Variable: target R-squared: 0.106
Model: OLS Adj. R-squared: 0.082
Method: Least Squares F-statistic: 4.477
Date: Wed, 21 Nov 2018 Prob (F-statistic): 3.14e-07
Time: 23:55:38 Log-Likelihood: -2304.8
No. Observations: 506 AIC: 4636.
Df Residuals: 493 BIC: 4691.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 -0.9281 1.388 -0.669 0.504 -3.654 1.798
x2 1.0816 1.571 0.688 0.492 -2.006 4.169
x3 0.1409 2.071 0.068 0.946 -3.928 4.210
x4 0.6817 1.074 0.635 0.526 -1.429 2.792
x5 -2.0567 2.173 -0.947 0.344 -6.325 2.212
x6 2.6742 1.441 1.855 0.064 -0.158 5.506
x7 0.0195 1.825 0.011 0.991 -3.567 3.605
x8 -3.1040 2.062 -1.506 0.133 -7.154 0.946
x9 2.6622 2.836 0.939 0.348 -2.909 8.234
x10 -2.0768 3.111 -0.668 0.505 -8.189 4.035
x11 -2.0606 1.390 -1.482 0.139 -4.792 0.671
x12 0.8493 1.204 0.706 0.481 -1.516 3.214
x13 -3.7436 1.778 -2.106 0.036 -7.236 -0.251
Omnibus: 178.041 Durbin-Watson: 0.045
Prob(Omnibus): 0.000 Jarque-Bera (JB): 783.126
Skew: 1.521 Prob(JB): 8.84e-171
Kurtosis: 8.281 Cond. No. 9.82


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see in the Coef column that room number is out perfoming the rest even when we scale down to a small range of values for all features. But then again we have a few other features that scored a 2.5 and above, this makes us rethink our focus on the one feature prediction model.

The new conteder happens to be Rad, Rad perfomed badly in terms of correlation but now we have new information that if we scaleld all the features to a small range, the rand feature is worth paying attention to.

This means its time to introduce our favuorite library again, it might have an algorithm that would do a better job with all the features than our scale estimator.

In [74]:
from sklearn.model_selection import train_test_split
In [75]:
X_train,X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
In [76]:
from sklearn.linear_model import LinearRegression
In [77]:
lm = LinearRegression()
In [78]:
lm.fit(X_train,y_train)
Out[78]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [79]:
print('Intercept: ',lm.intercept_)
print('Coef: ',lm.coef_)
Intercept:  22.225777226574095
Coef:  [-0.66646229  0.97929727  0.62472316  1.04873206 -2.31254853  2.02868973
  0.45424819 -2.6605586   2.26313467 -1.87315532 -1.90447001  0.64066261
 -4.59060744]

The coef above is

In [81]:
coeffecients = pd.DataFrame(lm.coef_,X.columns)
coeffecients.columns = ['Coeffecient']
coeffecients
Out[81]:
Coeffecient
CRIM -0.666462
ZN 0.979297
INDUS 0.624723
CHAS 1.048732
NOX -2.312549
RM 2.028690
AGE 0.454248
DIS -2.660559
RAD 2.263135
TAX -1.873155
PTRATIO -1.904470
B 0.640663
LSTAT -4.590607
In [82]:
coeffecients
Out[82]:
Coeffecient
CRIM -0.666462
ZN 0.979297
INDUS 0.624723
CHAS 1.048732
NOX -2.312549
RM 2.028690
AGE 0.454248
DIS -2.660559
RAD 2.263135
TAX -1.873155
PTRATIO -1.904470
B 0.640663
LSTAT -4.590607

Predictions using the coefficient

In [83]:
predictions = lm.predict(X_test)
In [84]:
y_test
Out[84]:
195    50.0
4      36.2
434    11.7
458    14.9
39     30.8
304    36.1
225    50.0
32     13.2
157    41.3
404     8.5
65     23.5
138    13.3
18     20.2
352    18.6
114    18.5
407    27.9
417    10.4
290    28.5
95     28.4
321    23.1
439    12.8
12     21.7
505    11.9
252    29.6
291    37.3
361    19.9
234    29.0
128    18.0
372    50.0
198    34.6
       ... 
16     23.1
412    17.9
430    14.5
41     26.6
431    14.1
295    28.6
325    24.6
216    23.3
335    21.1
502    20.6
267    50.0
169    22.3
268    43.5
218    21.5
454    14.9
45     19.3
153    19.4
70     24.2
277    33.1
148    17.8
112    18.8
409    27.5
90     22.6
375    15.0
469    20.1
78     21.2
160    27.0
124    18.8
167    23.8
272    24.4
Name: target, Length: 203, dtype: float64
In [85]:
predictions
Out[85]:
array([38.76995104, 27.39271318, 16.26805601, 16.64592872, 30.5945708 ,
       31.37975753, 37.68282481,  7.57986744, 33.62371472,  6.94206736,
       30.00015138, 13.74184077, 16.41357803, 17.5975484 , 24.92452314,
       20.61277162,  6.84027833, 32.74459645, 28.14176473, 24.87051184,
       12.01460369, 19.89597528, 22.93223855, 24.84808083, 33.41944923,
       18.2663553 , 32.40616206, 19.07263109, 27.85446156, 33.36724349,
       20.31071184, 18.71427039, 36.3942392 , 43.97914411, 28.53636198,
       22.23810379, 15.23341286, 18.4441601 ,  2.99896469, 30.75373687,
       23.98495287, 17.65233987, 33.49269972, 13.72450288, 17.45026475,
       25.3864821 , 29.9370352 , 16.43822597, 27.0157306 , 23.23886475,
       31.8958797 , 36.8917952 , 22.96758436, 18.06656811, 30.34602124,
       -0.30828515, 19.8446382 , 16.6131071 , 23.63902347, 21.26225918,
       29.69766593,  3.14282554, 16.86387632, 19.76329036,  9.71050797,
       24.21870511, 24.27695942, 19.87071765, 17.16247142, 19.85216234,
       23.74078001, 21.56791537, 23.14099313, 20.54638573, 27.77053085,
       21.2590119 , 36.87579928,  8.05035628, 28.9146871 , 16.70037511,
       15.70980238, 19.14484394, 29.65683713, 16.86617546, 10.15073018,
       21.34814159, 21.81482232, 32.18098353, 22.24314075, 21.75449868,
       12.50117018, 10.64264803, 22.59103858, 32.00987194,  5.75604165,
       34.05952126,  7.04112579, 31.53788515,  9.02176123, 21.19511453,
       32.37147301, 21.32823602, 27.19438339, 24.91207186, 23.08174295,
       24.76969659, 24.77145042, 30.14032582, 36.63344929, 32.59298802,
       23.27852444, 35.5111093 , 24.17973314, 22.05040637, 29.57566524,
       26.94598149, 28.86934886, 30.98598123, 26.77898549, 28.83037557,
       16.05739187, 20.89220193, 21.91047939, 36.88601261, 25.01402328,
       23.53157107, 15.12274061,  5.50883218, 14.14631563, 23.87422049,
       26.85906918, 33.17708597, 24.22078613, 19.60743115, 24.54377589,
       26.24871922, 30.8997013 , 26.2619873 , 33.44890707, 23.05544279,
       12.12838356, 35.44082938, 31.79591619, 16.5997814 , 25.17956469,
       19.77417177, 20.07188943, 24.67905941, 26.64881616, 29.50609111,
       16.87246772, 16.25039628, 40.96167542, 36.18058639, 22.00214486,
       21.47973172, 23.48638653, 12.67663095, 20.83340172, 24.99555373,
       19.27796673, 29.13806185, 40.15324017, 22.1316772 , 26.14454982,
       23.02029457, 18.61562996, 30.48499643, 17.42381182, 10.92515821,
       18.66294924, 33.26084439, 34.96275041, 20.74820685,  1.70547647,
       18.03065088, 27.34915728, 18.06414053, 28.56520062, 24.41093319,
       27.53096541, 20.55435421, 22.62919622, 37.78233999, 26.87713512,
       37.38740447, 25.79142163, 14.81336505, 22.11034091, 17.09095927,
       25.08768209, 35.57385009,  8.21251303, 20.29558413, 19.03028948,
       26.45168363, 24.24592238, 18.52485619, 21.43469229, 35.01450733,
       20.96970996, 23.6978562 , 28.08966447])
In [86]:
plt.scatter(y_test,predictions)
Out[86]:
<matplotlib.collections.PathCollection at 0x1d3996cb320>
In [87]:
plt.scatter(y_test,predictions)
Out[87]:
<matplotlib.collections.PathCollection at 0x1d399712e80>
In [88]:
sns.distplot((y_test-predictions))
C:\Users\Shaki\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x1d399739390>
In [89]:
y_test-predictions
Out[89]:
195    11.230049
4       8.807287
434    -4.568056
458    -1.745929
39      0.205429
304     4.720242
225    12.317175
32      5.620133
157     7.676285
404     1.557933
65     -6.500151
138    -0.441841
18      3.786422
352     1.002452
114    -6.424523
407     7.287228
417     3.559722
290    -4.244596
95      0.258235
321    -1.770512
439     0.785396
12      1.804025
505   -11.032239
252     4.751919
291     3.880551
361     1.633645
234    -3.406162
128    -1.072631
372    22.145538
198     1.232757
         ...    
16      2.351793
412    16.194524
430    -3.530651
41     -0.749157
431    -3.964141
295     0.034799
325     0.189067
216    -4.230965
335     0.545646
502    -2.029196
267    12.217660
169    -4.577135
268     6.112596
218    -4.291422
454     0.086635
45     -2.810341
153     2.309041
70     -0.887682
277    -2.473850
148     9.587487
112    -1.495584
409     8.469711
90     -3.851684
375    -9.245922
469     1.575144
78     -0.234692
160    -8.014507
124    -2.169710
167     0.102144
272    -3.689664
Name: target, Length: 203, dtype: float64
In [90]:
from sklearn import metrics
In [91]:
print('MAE:    ',metrics.mean_absolute_error(y_test,predictions))
print('MSE:    ',metrics.mean_squared_error(y_test,predictions))
print('sqtMSE: ',np.sqrt(metrics.mean_squared_error(y_test,predictions)))
MAE:     3.9051448026275075
MSE:     29.416365467452838
sqtMSE:  5.423685598138302
In [ ]:

In [ ]: