In many cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Kind of reminds me how we rent out cloud servers as I write this, haha. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.
Hadi Fanaee-T at the University of Porto has compiled this data into a CSV file. The file contains 17,380 rows, and each row represents the bike rentals in a single hour of a single day. Let's take a closer look at it.
import pandas as pd
import numpy as np
bikerentals = pd.read_csv("/Users/Guneet/BikeRentals/bike_rental_hour.csv")
bikerentals.head()
Some columns of particular interest in this dataset are:
- dteday - the date of the rentals
- hr - the hour the rentals occurred in
- temp - the air temperature
- hum - the humidity
- casual - the number of rentals by casual riders
- registered - the number of rentals by registered riders
- cnt - the total number of bike rentals (casual plus registered)
Let's take a closer look at the distribution of total rentals and make a normalized histogram of the cnt column.
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
sns.distplot(bikerentals["cnt"])
Correlations
Another interesting takeaway would be to see whether any of the columns in the dataset are correlated with the cnt column.
bikerentals.corr()["cnt"]
Correlations tell you which columns are closely related to the column you are interested in. The closer to 0 the correlation, the weaker the connection. The closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation.
The humidity (hum) column has a reasonably strong negative correlation with the total number of bikes rented, which makes sense. The air temperature (temp) and hr columns show a similarly strong, but positive, correlation with the number of rentals.
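To see these relationships at a glance, we can sort the full set of correlations with cnt. A minimal sketch (on newer pandas versions you may need to pass numeric_only=True to corr(), since dteday is not numeric):
# Sort correlations with cnt from strongest positive to strongest negative
bikerentals.corr()["cnt"].sort_values(ascending=False)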
Calculating Features
It is helpful to calculate features before applying machine learning models. Features enhance the accuracy of models by introducing new information, or distilling already existing information.
For example, the hr column in bikerentals contains the hour each rental occurred in, from 0 to 23. A machine learning model will treat each hour value as independent, and won't understand that certain hours are related. We can flip this around by creating a new column with labels for morning, afternoon, evening, and night. This will bundle similar times together, and enable our model to make better decisions.
def assign_label(hour):
    # 1 = morning, 2 = afternoon, 3 = evening, 4 = night
    if hour >= 0 and hour < 6:
        return 4
    elif hour >= 6 and hour < 12:
        return 1
    elif hour >= 12 and hour < 18:
        return 2
    elif hour >= 18 and hour <= 24:
        return 3
bikerentals["time_label"] = bikerentals["hr"].apply(assign_label)
bikerentals.head()
Train & Test Data
Before we apply machine learning algorithms, we will need to split the data into training and testing sets. This enables us to train an algorithm on the training set and evaluate its accuracy on the test set. If an algorithm is trained and evaluated on the same data, overfitting can produce an unrealistically low error value.
80% of the rows in bikerentals will be considered part of the training set. The rest will become part of the testing set.
Error Metric
Mean squared error seems like a good fit for evaluating our predictions, since it works well on continuous numeric data, which is what our dataset contains.
train = bikerentals.sample(frac=.8)
test = bikerentals.loc[~bikerentals.index.isin(train.index)]
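For clarity, the error metric we will compute after each model is simply the mean of the squared differences between the predicted and actual counts. Here is a minimal sketch of a helper (the name mse_score is just for illustration); scikit-learn's sklearn.metrics.mean_squared_error computes the same value:
def mse_score(predictions, actual):
    # Mean squared error: average of the squared prediction errors
    return np.mean((predictions - actual) ** 2)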
First Attempt to Model, Using Linear Regression
As a first pass, linear regression should work decently well on our data, given that many of the columns are correlated with cnt. Linear regression works well when predictors are independent, and don't change meaning when combined with each other. It is fairly resistant to overfitting because it is simple, but it can be prone to underfitting the data, and not building a powerful enough model.
We are ignoring the casual and registered columns because cnt is derived from them. If we are trying to predict the total number of bikes rented in a given hour, it doesn't make sense to use columns that already tell us how many casual or registered riders there were, since those two numbers are added together to get cnt.
from sklearn.linear_model import LinearRegression
predictors = list(train.columns)
predictors.remove("cnt")
predictors.remove("casual")
predictors.remove("registered")
predictors.remove("dteday")
lr = LinearRegression()
lr.fit(train[predictors], train["cnt"])
predictions = lr.predict(test[predictors])
mse = np.mean((predictions - test["cnt"]) ** 2)
print(mse)
print(test["cnt"])
Results
We have a large error, and this is probably because the data has a few extremely high rental counts, but mostly low counts otherwise. Mean squared error penalizes larger errors more heavily, which drives up the total error.
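One quick way to check this skew is to look at the spread of the cnt column; the quantiles below are just illustrative cut points:
# Summary statistics show the long right tail of hourly rental counts
bikerentals["cnt"].describe()
# A few upper quantiles make the handful of very busy hours explicit
bikerentals["cnt"].quantile([0.5, 0.9, 0.99, 1.0])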
Applying Decision Trees
Let's try our hand at applying decision trees. Decision trees are a fairly complex model, but they tend to predict outcomes much more reliably than linear regression. Owing to their complexity, they also tend to overfit, particularly when parameters such as maximum depth and minimum number of samples per leaf aren't tuned. Decision trees can also be sensitive to the input data: small changes can result in a very different output model. Let's try with min_samples_leaf = 5 first.
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(min_samples_leaf=5)
dt.fit(train[predictors], train["cnt"])
predictions = dt.predict(test[predictors])
np.mean((predictions - test["cnt"]) ** 2)
Now let's try with min_samples_leaf = 2.
dt = DecisionTreeRegressor(min_samples_leaf=2)
dt.fit(train[predictors], train["cnt"])
predictions = dt.predict(test[predictors])
np.mean((predictions - test["cnt"]) ** 2)
Results
The decision tree regressors achieve much lower error than our linear regression model.
Applying Random Forests
We now apply the random forests algorithm. Random forests tend to be more accurate than simple models like linear regression. They tend to overfit much less than decision trees because of how they are constructed. Nonetheless, they can still be prone to overfitting, and therefore tuning parameters such as maximum depth and minimum samples per leaf are important.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(min_samples_leaf=2)
rf.fit(train[predictors], train["cnt"])
predictions = rf.predict(test[predictors])
np.mean((predictions - test["cnt"]) ** 2)
Results
Upon computation we find that our random forests model with min_samples_leaf=2 gives an MSE of ~1824.41. We can play around and try other values of min_samples_leaf to see if we can minimise our error.
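One way to do that is to loop over a few candidate values of min_samples_leaf and compare the resulting test MSE; a minimal sketch (the candidate values are arbitrary):
# Compare test MSE across several leaf sizes
for leaf_size in [1, 2, 5, 10, 20]:
    rf = RandomForestRegressor(min_samples_leaf=leaf_size)
    rf.fit(train[predictors], train["cnt"])
    predictions = rf.predict(test[predictors])
    print(leaf_size, np.mean((predictions - test["cnt"]) ** 2))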
Next Steps
Some potential next steps could be:
- Calculating additional features, such as an index that combines temperature, humidity, and wind speed
- Trying other values for parameters like min_samples_leaf and max_depth
- Predicting casual and registered separately, instead of cnt