
Sun 12 March 2017

Analysis of bike journeys on the Fremont Bridge

Posted by Mr NotebookRoooky in Notebook   

Seattle bridge data workflow

Commuting cyclists crossing the bridge

Importing the libraries

Analysis libraries

In [32]:
import numpy as np
import pandas as pd 
from jupyterthemes import jtplot
Visualization libraries
In [45]:
%matplotlib inline
import matplotlib.pyplot as plt
jtplot.style(theme='onedork')
#plt.style.use('seaborn')
Scikit Learn imports
In [46]:
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

Importing the data

In [47]:
from jupyterworkflow.data import get_fremont_data
data = get_fremont_data()
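
The loading helper lives in a small local package rather than in the notebook itself. Since jupyterworkflow.data is not shown in this post, here is a minimal sketch of what get_fremont_data might do, assuming it downloads and caches the Fremont Bridge counter CSV from the Seattle open data portal and parses the Date column into a DatetimeIndex (the URL, filename and column handling are all assumptions):

# Hypothetical sketch of jupyterworkflow/data.py -- the real module is not
# shown in this post, so the URL and column handling are assumptions.
import os
from urllib.request import urlretrieve

import pandas as pd

FREMONT_URL = ('https://data.seattle.gov/api/views/'
               '65db-xm6k/rows.csv?accessType=DOWNLOAD')  # assumed source

def get_fremont_data(filename='Fremont.csv', url=FREMONT_URL):
    """Download (if needed) and parse the Fremont Bridge hourly counts."""
    if not os.path.exists(filename):
        urlretrieve(url, filename)
    df = pd.read_csv(filename, index_col='Date', parse_dates=True)
    df.columns = ['West', 'East']        # assumed order of the raw columns
    df['Total'] = df['West'] + df['East']
    return df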

Data inspection

In [48]:
data.head()
Out[48]:
West East Total
Date
2012-10-03 00:00:00 9.0 4.0 13.0
2012-10-03 01:00:00 6.0 4.0 10.0
2012-10-03 02:00:00 1.0 1.0 2.0
2012-10-03 03:00:00 3.0 2.0 5.0
2012-10-03 04:00:00 1.0 6.0 7.0
In [49]:
data.describe()
Out[49]:
West East Total
count 51063.000000 51063.000000 51063.000000
mean 57.126902 53.654329 110.781231
std 82.685731 70.067851 139.511157
min 0.000000 0.000000 0.000000
25% 7.000000 7.000000 15.000000
50% 29.000000 29.000000 60.000000
75% 70.000000 71.000000 145.000000
max 717.000000 698.000000 957.000000
In [50]:
data.count()
Out[50]:
West     51063
East     51063
Total    51063
dtype: int64

Using the pandas resample method to get an overall picture of the data.

Notice how, just by resampling, we can gain some insight into how usage of this bridge changes over time: hourly, daily, weekly, monthly and annually. (A sketch with a different aggregation follows the annual plot.)

In [51]:
data.resample('D').sum().plot()
plt.ylabel('Daily trips')
plt.title('Daily Trips VS Date')
Out[51]:
Text(0.5,1,'Daily Trips VS Date')
In [52]:
data.resample('W').sum().plot()
plt.ylabel('Weekly trips')
plt.title('Weekly Trips VS Date')
Out[52]:
Text(0.5,1,'Weekly Trips VS Date')
In [53]:
data.resample('M').sum().plot()
plt.ylabel('Monthly trips')
plt.title('Monthly Trips VS Date')
Out[53]:
Text(0.5,1,'Monthly Trips VS Date')
In [54]:
data.resample('Y').sum().plot()
plt.ylabel('Annual trips')
plt.title('Annual Trips VS Date')
Out[54]:
Text(0.5,1,'Annual Trips VS Date')
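
Resampling is not limited to sums. As a quick sketch (not part of the original notebook), the same call with mean() gives the average hourly count within each period instead of the period total:

# Sketch: weekly mean of the hourly counts rather than the weekly total
data.resample('W').mean().plot()
plt.ylabel('Mean hourly trips')
plt.title('Average Hourly Trips per Week VS Date')
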
In [55]:
plt.subplots()
plt.title('Weekly Trips VS Date\nEast & West')
plt.ylabel('Weekly trips')
data['East'].resample('W').sum().plot(legend=True)
data['West'].resample('W').sum().plot(legend=True)
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de8641ddd8>
In [56]:
plt.subplots()
plt.ylabel('Hourly trips')
plt.title('Hourly Trips VS Date')
data['East'].resample('H').sum().plot(legend=True)
data['West'].resample('H').sum().plot(legend=True)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de862cf4e0>
In [57]:
data['Total'] = data['West'] + data['East']
# 365-day rolling sum of the daily totals: the long-term trend in annual traffic
ax = data.resample('D').sum().rolling(365).sum().plot();
ax.set_ylim(0, None);
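
A variation worth noting (a sketch, not in the original cell): passing center=True centres the 365-day window on each date, so the long-term trend is not shifted by half a year.

# Sketch: centred 365-day rolling sum of the daily totals
ax = data.resample('D').sum().rolling(365, center=True).sum().plot()
ax.set_ylabel('Trips over a 365-day window')
ax.set_ylim(0, None)
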
In [58]:
data.groupby(data.index.time).mean().plot()
plt.ylabel('Number of trips at a given time')
plt.title('Number of trips VS Time of the day')
Out[58]:
Text(0.5,1,'Number of trips VS Time of the day')
In [59]:
data.head()
Out[59]:
West East Total
Date
2012-10-03 00:00:00 9.0 4.0 13.0
2012-10-03 01:00:00 6.0 4.0 10.0
2012-10-03 02:00:00 1.0 1.0 2.0
2012-10-03 03:00:00 3.0 2.0 5.0
2012-10-03 04:00:00 1.0 6.0 7.0
In [60]:
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.iloc[:10,:5]
Out[60]:
2012-10-03 2012-10-04 2012-10-05 2012-10-06 2012-10-07
00:00:00 13.0 18.0 11.0 15.0 11.0
01:00:00 10.0 3.0 8.0 15.0 17.0
02:00:00 2.0 9.0 7.0 9.0 3.0
03:00:00 5.0 3.0 4.0 3.0 6.0
04:00:00 7.0 8.0 9.0 5.0 3.0
05:00:00 31.0 26.0 25.0 5.0 9.0
06:00:00 155.0 142.0 105.0 27.0 17.0
07:00:00 352.0 319.0 319.0 33.0 26.0
08:00:00 437.0 418.0 370.0 105.0 69.0
09:00:00 276.0 241.0 212.0 114.0 103.0
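
Each column of the pivot is one calendar day and each row one hour of the day, which is the layout the clustering below expects once the table is transposed. A quick check (a sketch, not in the original):

# rows = 24 hours of the day, columns = one per calendar day
print(pivoted.shape)
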
In [61]:
pivoted.plot(legend=False)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de866aa5f8>
In [62]:
pivoted.index[:24]
Out[62]:
Index([00:00:00, 01:00:00, 02:00:00, 03:00:00, 04:00:00, 05:00:00, 06:00:00,
       07:00:00, 08:00:00, 09:00:00, 10:00:00, 11:00:00, 12:00:00, 13:00:00,
       14:00:00, 15:00:00, 16:00:00, 17:00:00, 18:00:00, 19:00:00, 20:00:00,
       21:00:00, 22:00:00, 23:00:00],
      dtype='object')
In [63]:
data.index
Out[63]:
DatetimeIndex(['2012-10-03 00:00:00', '2012-10-03 01:00:00',
               '2012-10-03 02:00:00', '2012-10-03 03:00:00',
               '2012-10-03 04:00:00', '2012-10-03 05:00:00',
               '2012-10-03 06:00:00', '2012-10-03 07:00:00',
               '2012-10-03 08:00:00', '2012-10-03 09:00:00',
               ...
               '2018-07-31 14:00:00', '2018-07-31 15:00:00',
               '2018-07-31 16:00:00', '2018-07-31 17:00:00',
               '2018-07-31 18:00:00', '2018-07-31 19:00:00',
               '2018-07-31 20:00:00', '2018-07-31 21:00:00',
               '2018-07-31 22:00:00', '2018-07-31 23:00:00'],
              dtype='datetime64[ns]', name='Date', length=51072, freq=None)

Using Scikit-Learn for further analysis of the data.

In [64]:
# rows = days, columns = the 24 hourly totals for that day
x = pivoted.fillna(0).T.values
x.shape
Out[64]:
(2128, 24)
In [65]:
# treating each day as a 24-dimensional point and projecting it to 2D with principal component analysis
x2 = PCA(2, svd_solver='full').fit_transform(x)
In [66]:
plt.scatter(x2[:,0],x2[:,1])
Out[66]:
<matplotlib.collections.PathCollection at 0x1de916f54a8>

Using a Gaussian mixture model to identify which group each day falls into.

In [67]:
gmm = GaussianMixture(2)  # two mixture components, one per expected type of day
gmm.fit(x)
labels = gmm.predict(x)
labels
Out[67]:
array([0, 0, 0, ..., 1, 0, 0], dtype=int64)
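
The 0/1 labels that GaussianMixture assigns are arbitrary, so it is worth checking which cluster carries the double-peaked commute profile. A minimal sketch (not in the original notebook):

# Sketch: average 24-hour profile of each GMM cluster; the commute-like
# cluster should show morning and evening peaks, the other a single bump.
for label in (0, 1):
    plt.plot(x[labels == label].mean(axis=0), label='cluster {}'.format(label))
plt.xlabel('Hour of day')
plt.ylabel('Mean trips')
plt.legend()
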
In [68]:
plt.scatter(x2[:,0],x2[:,1], c=labels, cmap='rainbow')
Out[68]:
<matplotlib.collections.PathCollection at 0x1de915e1748>

Commuting days vs. non-commuting days

In [69]:
pivoted.T[labels == 0].T.plot(legend=False, alpha=0.2)
pivoted.T[labels == 1].T.plot(legend=False, alpha=0.2)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de9242c710>
In [71]:
daysofweek = pd.DatetimeIndex(pivoted.columns).dayofweek
In [72]:
plt.scatter(x2[:,0],x2[:,1], c=daysofweek, cmap='rainbow')
cb = plt.colorbar(ticks=range(7))
cb.set_ticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
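
As a cross-check (a sketch, not in the original), tabulating the day of week against the cluster label shows how strongly the split lines up with weekdays versus weekends:

# Sketch: rows are day of week (0=Mon ... 6=Sun), columns are GMM labels
print(pd.crosstab(daysofweek, labels))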

Zooming in on the days of the week that happen to land on the non-commuter days to see what's happening.

In [73]:
dates = pd.DatetimeIndex(pivoted.columns)
dates[(labels == 0) & (daysofweek < 5)]
Out[73]:
DatetimeIndex(['2012-10-03', '2012-10-04', '2012-10-05', '2012-10-08',
               '2012-10-09', '2012-10-10', '2012-10-11', '2012-10-12',
               '2012-10-15', '2012-10-16',
               ...
               '2018-07-18', '2018-07-19', '2018-07-20', '2018-07-23',
               '2018-07-24', '2018-07-25', '2018-07-26', '2018-07-27',
               '2018-07-30', '2018-07-31'],
              dtype='datetime64[ns]', length=1469, freq=None)
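
Since GaussianMixture numbers its clusters arbitrarily, the complementary filter is worth checking as well: weekdays that fall in the other, weekend-like cluster. As a sketch (the relevant index may be 0 or 1 depending on the run), those dates tend to be public holidays:

# Sketch: weekdays assigned to the weekend-like cluster -- in this dataset
# these tend to be holidays such as New Year's Day, Thanksgiving and Christmas.
holiday_like = dates[(labels == 1) & (daysofweek < 5)]
print(holiday_like)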