In many cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Kind of reminds me how we rent out cloud servers as I write this, haha. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.

The powerful Hadi Fanaee-T at the University of Porto has compiled this data into a CSV file The file contains 17380 rows, and each row represents the bike rentals in a single hour of a single day. Let's take a closer look at it.

In [116]:
import pandas as pd
import numpy as np

bikerentals = pd.read_csv("/Users/Guneet/BikeRentals/bike_rental_hour.csv")
bikerentals.head()
Out[116]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0 5 27 32
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0 3 10 13
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0 0 1 1

Some columns which come across with particular interest in this dataset are :

  • instant - a unique sequential id number for each row
  • dteday - the date the rentals occurred on
  • season - the season the rentals occurred in
  • yr - the year the rentals occurred in
  • mnth - the month the rentals occurred in
  • hr - the hour the rentals occurred in
  • holiday - whether or not the day was a holiday
  • weekday - whether or not the day was a weekday
  • workingday - whether or not the day was a working day
  • weathersit - the weather situation (categorical variable)
  • temp - the temperature on a 0-1 scale
  • atemp - the adjusted temperature
  • hum - the humidity on a 0-1 scale
  • windspeed - the wind speed on a 0-1 scale
  • casual - the number of casual riders (people who hadn't previously signed up with the bikesharing program) that rented bikes
  • registered - the number of registered riders (people who signed up previously) that rented bikes
  • cnt - the total number of bikes rented (casual + registered)
  • Let's take a closer look at the distribution of total rentals and make a normalized histogram of the cnt column.

    In [117]:
    import seaborn as sns
    %matplotlib inline
    
    sns.set(color_codes="True")
    sns.distplot(bikerentals["cnt"])
    
    Out[117]:

    Correlations

    Another interesting take away would be to see if any of the columns in the dataset is correlated with the cnt column.

    In [118]:
    bikerentals.corr()["cnt"]
    
    Out[118]:
    instant       0.278379
    season        0.178056
    yr            0.250495
    mnth          0.120638
    hr            0.394071
    holiday      -0.030927
    weekday       0.026900
    workingday    0.030284
    weathersit   -0.142426
    temp          0.404772
    atemp         0.400929
    hum          -0.322911
    windspeed     0.093234
    casual        0.694564
    registered    0.972151
    cnt           1.000000
    Name: cnt, dtype: float64

    Correlations tell you what columns are closely related to the column you are interested in. The closer to 0 the correlation, the weaker the connetion. The closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation.

    The humidity column seems to have a reasonably strong negative correlation to the total number of bikes rented, which makes sense. The air temperature and hr columns maintain a similar yet opposite possitive correlation to the number of rentals.

    Calculating Features

    It is helpful to calculate features before applying machine learning models. Features enhance the accuracy of models by introducing new information, or distilling already existing information.

    For example, the hr column in bike_rentals contains hours that the bikes are rented for, from 1 to 24. A machine will treat each hour differently, and not understand that certain hours are related. We can flip this around by creating a new column with labels for morning, afternoon, evening, and night. This will bundle up similar times together, and enable our model to make better decisions.

    In [119]:
    def assign_label(hour):
        if hour >=0 and hour < 6:
            return 4
        elif hour >=6 and hour < 12:
            return 1
        elif hour >= 12 and hour < 18:
            return 2
        elif hour >= 18 and hour <=24:
            return 3
    
    bikerentals["time_label"] = bikerentals["hr"].apply(assign_label)
    
    In [120]:
    bikerentals.head()
    
    Out[120]:
    instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt time_label
    0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16 4
    1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40 4
    2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0 5 27 32 4
    3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0 3 10 13 4
    4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0 0 1 1 4

    Train & Test Data

    Before we apply machine learning algorithms, we will need to split the data into training and testing sets. This enables to train an algorithm using the training data set and evaluate its accuracy on the test data set. An unrealistically low error value can arise due to overfitting if an algorithm is trained on the training data and evaluated for performance on the same data.

    80% of the rows in bikerentals will be considered as part of the trainign set. The rest will become part of the testing set.

    Error Metric

    Mean squared error seems like a good fit to evaluate our data against as it works well on continuous numeric data, which our dataset is.

    In [121]:
    train = bikerentals.sample(frac=.8)
    test = bikerentals.loc[~bikerentals.index.isin(train.index)]
    

    First Attempt to Model, using Linear Regression

    As a first pass, linear regression should work decently well on our data, given that many of the columns are correlated with cnt . Linear regression works well when predictors are independent, and don't change meaning when combined with each other. It is fairly resistant to overfitting because it is simple, but it can be prone to underfitting the data, and not building a powerful enough model.

    We are ignoring the casual and registered columns because cnt is derived from these columns. If we are trying to predict the number of people who rent bikes in a given hour, it doesn't make sense that we already know the casual or registered riders, because those numbers are added together to get cnt.

    In [122]:
    from sklearn.linear_model import LinearRegression
    predictors = list(train.columns)
    predictors.remove("cnt")
    predictors.remove("casual")
    predictors.remove("registered")
    predictors.remove("dteday")
    
    lr = LinearRegression()
    
    lr.fit(train[predictors], train["cnt"])
    
    Out[122]:
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
    In [123]:
    predictions = lr.predict(test[predictors])
    
    mse = np.mean((predictions - test["cnt"]) ** 2)
    print(mse)
    
    17126.94869869201
    
    In [124]:
    print(test["cnt"])
    
    6          2
    19        37
    20        36
    30         1
    34        70
    36        75
    37        59
    47         5
    48         2
    51        30
    67        20
    70         2
    72         2
    94         2
    100      115
    108      190
    111       89
    115       11
    123      122
    133      112
    134       69
    149       59
    154      187
    158       39
    165        1
    167        2
    171       61
    175       95
    187       11
    189        1
            ... 
    17199    124
    17205     26
    17208     23
    17210     12
    17215      1
    17216      3
    17236      7
    17240     11
    17243     31
    17245      8
    17253     43
    17259      3
    17268    133
    17269     75
    17272    118
    17280     63
    17296    222
    17298    225
    17308     37
    17310      6
    17314     18
    17326     97
    17334     15
    17341    122
    17345    160
    17352     47
    17354     49
    17358      1
    17369    247
    17371    214
    Name: cnt, dtype: int64
    

    Results

    We have a large error, and this is probably due to the fact that the data has some few extremely high rental counts, but mostly low counts otherwise. The larger the error the more it is penalized with mean squared error. This gives a net higher total error.

    Applying Decision Trees

    Let's try our hand at applying decision trees. Decision trees are a fairly complex model, but they tend to predict outcomes much more reliably than linear regression. Owing to their complexity they also tend to overfit, particularly when parameters such as maximum depth and minimum number of samples per leaf aren't tweaked. Decision trees can be sensitive, where small changes in input data can result in a very different output model. Let's try with min_samples_leaf = 5 first.

    In [125]:
    from sklearn.tree import DecisionTreeRegressor
    dt = DecisionTreeRegressor(min_samples_leaf=5)
    dt.fit(train[predictors], train["cnt"])
    
    Out[125]:
    DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
               max_leaf_nodes=None, min_samples_leaf=5, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=None,
               splitter='best')
    In [126]:
    predictions = dt.predict(test[predictors])
    np.mean((predictions - test["cnt"]) ** 2)
    
    Out[126]:
    2468.692859622464

    Now let's try with n_sampels_leaf = 2.

    In [127]:
    dt = DecisionTreeRegressor(min_samples_leaf=2)
    dt.fit(train[predictors], train["cnt"])
    
    Out[127]:
    DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
               max_leaf_nodes=None, min_samples_leaf=2, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=None,
               splitter='best')
    In [128]:
    predictions = reg.predict(test[predictors])
    np.mean((predictions - test["cnt"]) ** 2)
    
    Out[128]:
    881.6803957294475

    Results

    The decision tree regressors seem to have a much higher accuracy than our linear regression model.

    Applying Random Forests

    We now apply the random forests algorithm. Random forests tend to be more accurate than simple models like linear regression. They tend to overfit much less than decision trees because of how they are constructed. Nonetheless, they can still be prone to overfitting, and therefore tuning parameters such as maximum depth and minimum samples per leaf are important.

    In [129]:
    from sklearn.ensemble import RandomForestRegressor
    
    rf = RandomForestRegressor(min_samples_leaf=2)
    rf.fit(train[predictors], train["cnt"])
    
    Out[129]:
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
               max_features='auto', max_leaf_nodes=None, min_samples_leaf=2,
               min_samples_split=2, min_weight_fraction_leaf=0.0,
               n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
               verbose=0, warm_start=False)
    In [130]:
    predictions = rf.predict(test[predictors])
    np.mean((predictions - test["cnt"]) ** 2)
    
    Out[130]:
    1824.4148077364828

    Results

    Upon computation we find that our random forests model with min_samples_leaf=2 gives a MSE of ~1824.41. We can play around and try modelling with other values of min_samples_leaf to see if we can minimise our error.

    Next Steps

    Some of the potential next steps could be :

  • Calculating more features, e.g. combining temperature, humidity and wind speed.
  • Try predicting casual and registered riders instead of total numer of riders and see how our models fit


  • Comments

    comments powered by Disqus