
Sun 12 March 2017

Analysis of bike journeys on the Fremont Bridge

Posted by Mr NotebookRoooky in Notebook   

Seattle bridge data workflow

Commuting cyclists crossing the bridge

Importing the libraries

Analysis libraries

In [32]:
import numpy as np
import pandas as pd 
from jupyterthemes import jtplot
Visualization libraries
In [45]:
%matplotlib inline
import matplotlib.pyplot as plt
jtplot.style(theme='onedork')
#plt.style.use('seaborn')
Scikit Learn imports
In [46]:
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

Importing the data

In [47]:
from jupyterworkflow.data import get_fremont_data
data = get_fremont_data()
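
The loading helper lives in a small local package rather than in the notebook itself. Since jupyterworkflow.data is not shown in this post, here is a minimal sketch of what get_fremont_data might do, assuming it downloads and caches the Fremont Bridge counter CSV from the Seattle open data portal and parses the Date column into a DatetimeIndex (the URL, filename and column handling are all assumptions):

# Hypothetical sketch of jupyterworkflow/data.py -- the real module is not
# shown in this post, so the URL and column handling are assumptions.
import os
from urllib.request import urlretrieve

import pandas as pd

FREMONT_URL = ('https://data.seattle.gov/api/views/'
               '65db-xm6k/rows.csv?accessType=DOWNLOAD')  # assumed source

def get_fremont_data(filename='Fremont.csv', url=FREMONT_URL):
    """Download (if needed) and parse the Fremont Bridge hourly counts."""
    if not os.path.exists(filename):
        urlretrieve(url, filename)
    df = pd.read_csv(filename, index_col='Date', parse_dates=True)
    df.columns = ['West', 'East']        # assumed order of the raw columns
    df['Total'] = df['West'] + df['East']
    return df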

Data inspection

In [48]:
data.head()
Out[48]:
West East Total
Date
2012-10-03 00:00:00 9.0 4.0 13.0
2012-10-03 01:00:00 6.0 4.0 10.0
2012-10-03 02:00:00 1.0 1.0 2.0
2012-10-03 03:00:00 3.0 2.0 5.0
2012-10-03 04:00:00 1.0 6.0 7.0
In [49]:
data.describe()
Out[49]:
West East Total
count 51063.000000 51063.000000 51063.000000
mean 57.126902 53.654329 110.781231
std 82.685731 70.067851 139.511157
min 0.000000 0.000000 0.000000
25% 7.000000 7.000000 15.000000
50% 29.000000 29.000000 60.000000
75% 70.000000 71.000000 145.000000
max 717.000000 698.000000 957.000000
In [50]:
data.count()
Out[50]:
West     51063
East     51063
Total    51063
dtype: int64

Using the pandas resample method to get an overall picture of the data.

Notice how, just by resampling, we can gain some insight into how usage of this bridge changes over time: hourly, daily, weekly, monthly and annually. (A sketch with a different aggregation follows the annual plot.)

In [51]:
data.resample('D').sum().plot()
plt.ylabel('Daily trips')
plt.title('Daily Trips VS Date')
Out[51]:
Text(0.5,1,'Daily Trips VS Date')
In [52]:
data.resample('W').sum().plot()
plt.ylabel('Weekly trips')
plt.title('Weekly Trips VS Date')
Out[52]:
Text(0.5,1,'Weekly Trips VS Date')
In [53]:
data.resample('M').sum().plot()
plt.ylabel('Monthly trips')
plt.title('Monthly Trips VS Date')
Out[53]:
Text(0.5,1,'Monthly Trips VS Date')
In [54]:
data.resample('Y').sum().plot()
plt.ylabel('Annual trips')
plt.title('Annual Trips VS Date')
Out[54]:
Text(0.5,1,'Annual Trips VS Date')
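
Resampling is not limited to sums. As a quick sketch (not part of the original notebook), the same call with mean() gives the average hourly count within each period instead of the period total:

# Sketch: weekly mean of the hourly counts rather than the weekly total
data.resample('W').mean().plot()
plt.ylabel('Mean hourly trips')
plt.title('Average Hourly Trips per Week VS Date')
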
In [55]:
plt.subplots()
plt.title('Weekly Trips VS Date\nEast & West')
plt.ylabel('Weekly trips')
data['East'].resample('W').sum().plot(legend=True)
data['West'].resample('W').sum().plot(legend=True)
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de8641ddd8>
In [56]:
plt.subplots()
plt.ylabel('Hourly trips')
plt.title('Hourly Trips VS Date')
data['East'].resample('H').sum().plot(legend=True)
data['West'].resample('H').sum().plot(legend=True)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de862cf4e0>
In [57]:
data['Total'] = data['West'] + data['East']
# 365-day rolling sum of the daily totals: the long-term trend in annual traffic
ax = data.resample('D').sum().rolling(365).sum().plot();
ax.set_ylim(0, None);
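
A variation worth noting (a sketch, not in the original cell): passing center=True centres the 365-day window on each date, so the long-term trend is not shifted by half a year.

# Sketch: centred 365-day rolling sum of the daily totals
ax = data.resample('D').sum().rolling(365, center=True).sum().plot()
ax.set_ylabel('Trips over a 365-day window')
ax.set_ylim(0, None)
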
In [58]:
data.groupby(data.index.time).mean().plot()
plt.ylabel('Number of trips at a given time')
plt.title('Number of trips VS Time of the day')
Out[58]:
Text(0.5,1,'Number of trips VS Time of the day')
In [59]:
data.head()
Out[59]:
West East Total
Date
2012-10-03 00:00:00 9.0 4.0 13.0
2012-10-03 01:00:00 6.0 4.0 10.0
2012-10-03 02:00:00 1.0 1.0 2.0
2012-10-03 03:00:00 3.0 2.0 5.0
2012-10-03 04:00:00 1.0 6.0 7.0
In [60]:
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.iloc[:10,:5]
Out[60]:
2012-10-03 2012-10-04 2012-10-05 2012-10-06 2012-10-07
00:00:00 13.0 18.0 11.0 15.0 11.0
01:00:00 10.0 3.0 8.0 15.0 17.0
02:00:00 2.0 9.0 7.0 9.0 3.0
03:00:00 5.0 3.0 4.0 3.0 6.0
04:00:00 7.0 8.0 9.0 5.0 3.0
05:00:00 31.0 26.0 25.0 5.0 9.0
06:00:00 155.0 142.0 105.0 27.0 17.0
07:00:00 352.0 319.0 319.0 33.0 26.0
08:00:00 437.0 418.0 370.0 105.0 69.0
09:00:00 276.0 241.0 212.0 114.0 103.0
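
Each column of the pivot is one calendar day and each row one hour of the day, which is the layout the clustering below expects once the table is transposed. A quick check (a sketch, not in the original):

# rows = 24 hours of the day, columns = one per calendar day
print(pivoted.shape)
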
In [61]:
pivoted.plot(legend=False)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de866aa5f8>
In [62]:
pivoted.index[:24]
Out[62]:
Index([00:00:00, 01:00:00, 02:00:00, 03:00:00, 04:00:00, 05:00:00, 06:00:00,
       07:00:00, 08:00:00, 09:00:00, 10:00:00, 11:00:00, 12:00:00, 13:00:00,
       14:00:00, 15:00:00, 16:00:00, 17:00:00, 18:00:00, 19:00:00, 20:00:00,
       21:00:00, 22:00:00, 23:00:00],
      dtype='object')
In [63]:
data.index
Out[63]:
DatetimeIndex(['2012-10-03 00:00:00', '2012-10-03 01:00:00',
               '2012-10-03 02:00:00', '2012-10-03 03:00:00',
               '2012-10-03 04:00:00', '2012-10-03 05:00:00',
               '2012-10-03 06:00:00', '2012-10-03 07:00:00',
               '2012-10-03 08:00:00', '2012-10-03 09:00:00',
               ...
               '2018-07-31 14:00:00', '2018-07-31 15:00:00',
               '2018-07-31 16:00:00', '2018-07-31 17:00:00',
               '2018-07-31 18:00:00', '2018-07-31 19:00:00',
               '2018-07-31 20:00:00', '2018-07-31 21:00:00',
               '2018-07-31 22:00:00', '2018-07-31 23:00:00'],
              dtype='datetime64[ns]', name='Date', length=51072, freq=None)

Using Scikit-Learn for further analysis of the data.

In [64]:
# rows = days, columns = the 24 hourly totals for that day
x = pivoted.fillna(0).T.values
x.shape
Out[64]:
(2128, 24)
In [65]:
# treating each day as a 24-dimensional point and projecting it to 2D with principal component analysis
x2 = PCA(2, svd_solver='full').fit_transform(x)
In [66]:
plt.scatter(x2[:,0],x2[:,1])
Out[66]:
<matplotlib.collections.PathCollection at 0x1de916f54a8>

Using a Gaussian mixture model to identify which group each day falls into.

In [67]:
gmm = GaussianMixture(2)  # two mixture components, one per expected type of day
gmm.fit(x)
labels = gmm.predict(x)
labels
Out[67]:
array([0, 0, 0, ..., 1, 0, 0], dtype=int64)
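
The 0/1 labels that GaussianMixture assigns are arbitrary, so it is worth checking which cluster carries the double-peaked commute profile. A minimal sketch (not in the original notebook):

# Sketch: average 24-hour profile of each GMM cluster; the commute-like
# cluster should show morning and evening peaks, the other a single bump.
for label in (0, 1):
    plt.plot(x[labels == label].mean(axis=0), label='cluster {}'.format(label))
plt.xlabel('Hour of day')
plt.ylabel('Mean trips')
plt.legend()
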
In [68]:
plt.scatter(x2[:,0],x2[:,1], c=labels, cmap='rainbow')
Out[68]:
<matplotlib.collections.PathCollection at 0x1de915e1748>

Commuting days vs. non-commuting days

In [69]:
pivoted.T[labels == 0].T.plot(legend=False, alpha=0.2)
pivoted.T[labels == 1].T.plot(legend=False, alpha=0.2)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de9242c710>
In [71]:
daysofweek = pd.DatetimeIndex(pivoted.columns).dayofweek
In [72]:
plt.scatter(x2[:,0],x2[:,1], c=daysofweek, cmap='rainbow')
cb = plt.colorbar(ticks=range(7))
cb.set_ticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
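
As a cross-check (a sketch, not in the original), tabulating the day of week against the cluster label shows how strongly the split lines up with weekdays versus weekends:

# Sketch: rows are day of week (0=Mon ... 6=Sun), columns are GMM labels
print(pd.crosstab(daysofweek, labels))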

Zooming in on the days of the week that happen to land on the non-commuter days to see what's happening.

In [73]:
dates = pd.DatetimeIndex(pivoted.columns)
dates[(labels == 0) & (daysofweek < 5)]
Out[73]:
DatetimeIndex(['2012-10-03', '2012-10-04', '2012-10-05', '2012-10-08',
               '2012-10-09', '2012-10-10', '2012-10-11', '2012-10-12',
               '2012-10-15', '2012-10-16',
               ...
               '2018-07-18', '2018-07-19', '2018-07-20', '2018-07-23',
               '2018-07-24', '2018-07-25', '2018-07-26', '2018-07-27',
               '2018-07-30', '2018-07-31'],
              dtype='datetime64[ns]', length=1469, freq=None)
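
Since GaussianMixture numbers its clusters arbitrarily, the complementary filter is worth checking as well: weekdays that fall in the other, weekend-like cluster. As a sketch (the relevant index may be 0 or 1 depending on the run), those dates tend to be public holidays:

# Sketch: weekdays assigned to the weekend-like cluster -- in this dataset
# these tend to be holidays such as New Year's Day, Thanksgiving and Christmas.
holiday_like = dates[(labels == 1) & (daysofweek < 5)]
print(holiday_like)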