In many cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Kind of reminds me how we rent out cloud servers as I write this, haha. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.
Hadi Fanaee-T at the University of Porto has compiled this data into a CSV file. The file contains 17,380 rows, and each row represents the bike rentals in a single hour of a single day. Let's take a closer look at it.
import pandas as pd
import numpy as np
bikerentals = pd.read_csv("/Users/Guneet/BikeRentals/bike_rental_hour.csv")
bikerentals.head()
Some columns of particular interest in this dataset are:
- dteday - the date of the rentals
- hr - the hour the rentals occurred in
- temp - the air temperature
- hum - the humidity
- casual - the number of rentals by casual riders
- registered - the number of rentals by registered riders
- cnt - the total number of bike rentals (casual plus registered)
Let's take a closer look at the distribution of total rentals and make a normalized histogram of the cnt column.
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
sns.distplot(bikerentals["cnt"])
Correlations
Another interesting takeaway would be to see whether any of the columns in the dataset are correlated with the cnt column.
bikerentals.corr()["cnt"]
Correlations tell you which columns are closely related to the column you are interested in. The closer to 0 the correlation, the weaker the connection. The closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation.
The humidity (hum) column has a reasonably strong negative correlation with the total number of bikes rented, which makes sense. The air temperature (temp) and hr columns show a similarly strong, but positive, correlation with the number of rentals.
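To see these relationships at a glance, we can sort the full set of correlations with cnt. A minimal sketch (on newer pandas versions you may need to pass numeric_only=True to corr(), since dteday is not numeric):
# Sort correlations with cnt from strongest positive to strongest negative
bikerentals.corr()["cnt"].sort_values(ascending=False)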
Calculating Features
It is helpful to calculate features before applying machine learning models. Features enhance the accuracy of models by introducing new information, or distilling already existing information.
For example, the hr column in bikerentals contains the hour each rental occurred in, from 0 to 23. A machine learning model will treat each hour value as independent, and won't understand that certain hours are related. We can flip this around by creating a new column with labels for morning, afternoon, evening, and night. This will bundle similar times together, and enable our model to make better decisions.
def assign_label(hour):
    # 1 = morning, 2 = afternoon, 3 = evening, 4 = night
    if hour >= 0 and hour < 6:
        return 4
    elif hour >= 6 and hour < 12:
        return 1
    elif hour >= 12 and hour < 18:
        return 2
    elif hour >= 18 and hour <= 24:
        return 3
bikerentals["time_label"] = bikerentals["hr"].apply(assign_label)
bikerentals.head()
Train & Test Data
Before we apply machine learning algorithms, we will need to split the data into training and testing sets. This enables us to train an algorithm on the training set and evaluate its accuracy on the test set. If an algorithm is trained and evaluated on the same data, overfitting can produce an unrealistically low error value.
80% of the rows in bikerentals will be considered part of the training set. The rest will become part of the testing set.
Error Metric
Mean squared error seems like a good fit for evaluating our predictions, since it works well on continuous numeric data, which is what our dataset contains.
train = bikerentals.sample(frac=.8)
test = bikerentals.loc[~bikerentals.index.isin(train.index)]
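For clarity, the error metric we will compute after each model is simply the mean of the squared differences between the predicted and actual counts. Here is a minimal sketch of a helper (the name mse_score is just for illustration); scikit-learn's sklearn.metrics.mean_squared_error computes the same value:
def mse_score(predictions, actual):
    # Mean squared error: average of the squared prediction errors
    return np.mean((predictions - actual) ** 2)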
First Attempt to Model, Using Linear Regression
As a first pass, linear regression should work decently well on our data, given that many of the columns are correlated with cnt. Linear regression works well when predictors are independent, and don't change meaning when combined with each other. It is fairly resistant to overfitting because it is simple, but it can be prone to underfitting the data, and not building a powerful enough model.
We are ignoring the casual and registered columns because cnt is derived from them. If we are trying to predict the total number of bikes rented in a given hour, it doesn't make sense to use columns that already tell us how many casual or registered riders there were, since those two numbers are added together to get cnt.
from sklearn.linear_model import LinearRegression
predictors = list(train.columns)
predictors.remove("cnt")
predictors.remove("casual")
predictors.remove("registered")
predictors.remove("dteday")
lr = LinearRegression()
lr.fit(train[predictors], train["cnt"])
predictions = lr.predict(test[predictors])
mse = np.mean((predictions - test["cnt"]) ** 2)
print(mse)
print(test["cnt"])
Results
We have a large error, and this is probably because the data has a few extremely high rental counts, but mostly low counts otherwise. Mean squared error penalizes larger errors more heavily, which drives up the total error.
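One quick way to check this skew is to look at the spread of the cnt column; the quantiles below are just illustrative cut points:
# Summary statistics show the long right tail of hourly rental counts
bikerentals["cnt"].describe()
# A few upper quantiles make the handful of very busy hours explicit
bikerentals["cnt"].quantile([0.5, 0.9, 0.99, 1.0])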
Applying Decision Trees
Let's try our hand at applying decision trees. Decision trees are a fairly complex model, but they tend to predict outcomes much more reliably than linear regression. Owing to their complexity, they also tend to overfit, particularly when parameters such as maximum depth and minimum number of samples per leaf aren't tuned. Decision trees can also be sensitive to the input data: small changes can result in a very different output model. Let's try with min_samples_leaf = 5 first.
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(min_samples_leaf=5)
dt.fit(train[predictors], train["cnt"])
predictions = dt.predict(test[predictors])
np.mean((predictions - test["cnt"]) ** 2)
Now let's try with min_samples_leaf = 2.
dt = DecisionTreeRegressor(min_samples_leaf=2)
dt.fit(train[predictors], train["cnt"])
predictions = dt.predict(test[predictors])
np.mean((predictions - test["cnt"]) ** 2)
Results
The decision tree regressors achieve much lower error than our linear regression model.
Applying Random Forests
We now apply the random forests algorithm. Random forests tend to be more accurate than simple models like linear regression. They tend to overfit much less than decision trees because of how they are constructed. Nonetheless, they can still be prone to overfitting, and therefore tuning parameters such as maximum depth and minimum samples per leaf are important.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(min_samples_leaf=2)
rf.fit(train[predictors], train["cnt"])
predictions = rf.predict(test[predictors])
np.mean((predictions - test["cnt"]) ** 2)
Results
Upon computation we find that our random forests model with min_samples_leaf=2 gives an MSE of ~1824.41. We can play around and try other values of min_samples_leaf to see if we can minimise our error.
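One way to do that is to loop over a few candidate values of min_samples_leaf and compare the resulting test MSE; a minimal sketch (the candidate values are arbitrary):
# Compare test MSE across several leaf sizes
for leaf_size in [1, 2, 5, 10, 20]:
    rf = RandomForestRegressor(min_samples_leaf=leaf_size)
    rf.fit(train[predictors], train["cnt"])
    predictions = rf.predict(test[predictors])
    print(leaf_size, np.mean((predictions - test["cnt"]) ** 2))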
Next Steps
Some potential next steps could be:
- Calculating additional features, such as an index that combines temperature, humidity, and wind speed
- Trying other values for parameters like min_samples_leaf and max_depth
- Predicting casual and registered separately, instead of cnt