Board games have been making a comeback lately, and deeper, more strategic board games like Settlers of Catan have become hugely popular. BoardGameGeek is a popular site where these types of board games are discussed and reviewed.

Here we have a dataset that contains 80,000 board games and their associated review scores. The data was scraped from BoardGameGeek and compiled into CSV format by Sean Beck.

Let's preview the data and look at some interesting columns.

In [73]:
import pandas as pd
board_games = pd.read_csv("board_games.csv")
board_games.head(5)
Out[73]:
id type name yearpublished minplayers maxplayers playingtime minplaytime maxplaytime minage users_rated average_rating bayes_average_rating total_owners total_traders total_wanters total_wishers total_comments total_weights average_weight
0 12333 boardgame Twilight Struggle 2005.0 2.0 2.0 180.0 180.0 180.0 13.0 20113 8.33774 8.22186 26647 372 1219 5865 5347 2562 3.4785
1 120677 boardgame Terra Mystica 2012.0 2.0 5.0 150.0 60.0 150.0 12.0 14383 8.28798 8.14232 16519 132 1586 6277 2526 1423 3.8939
2 102794 boardgame Caverna: The Cave Farmers 2013.0 1.0 7.0 210.0 30.0 210.0 12.0 9262 8.28994 8.06886 12230 99 1476 5600 1700 777 3.7761
3 25613 boardgame Through the Ages: A Story of Civilization 2006.0 2.0 4.0 240.0 240.0 240.0 12.0 13294 8.20407 8.05804 14343 362 1084 5075 3378 1642 4.1590
4 3076 boardgame Puerto Rico 2002.0 2.0 5.0 150.0 90.0 150.0 12.0 39883 8.14261 8.04524 44362 795 861 5414 9173 5213 3.2943

Each row represents a single board game and contains descriptive statistics as well as review information. Some of the interesting columns are:

  • name -- the name of the board game.
  • playingtime -- the playing time (given by the manufacturer).
  • minplaytime -- the minimum playing time (given by the manufacturer).
  • maxplaytime -- the maximum playing time (given by the manufacturer).
  • minage -- the minimum recommended age to play.
  • users_rated -- the number of users who rated the game.
  • average_rating -- the average rating given to the game by users (0-10).
  • total_weights -- the number of weights given by users. Weight is a subjective measure made up by BoardGameGeek that describes how "deep" or involved a game is.
  • average_weight -- the average of all the subjective weights (0-5).

One interesting machine learning task would be to predict average_rating using the other columns. The dataset contains quite a few missing values, as well as rows for games that have never been reviewed (where the rating is 0). Let's remove both.

In [74]:
board_games = board_games.dropna(axis=0)  # drop rows with missing values
board_games = board_games[board_games["users_rated"] > 0]  # keep only games with at least one rating

board_games.head()
Out[74]:
id type name yearpublished minplayers maxplayers playingtime minplaytime maxplaytime minage users_rated average_rating bayes_average_rating total_owners total_traders total_wanters total_wishers total_comments total_weights average_weight
0 12333 boardgame Twilight Struggle 2005.0 2.0 2.0 180.0 180.0 180.0 13.0 20113 8.33774 8.22186 26647 372 1219 5865 5347 2562 3.4785
1 120677 boardgame Terra Mystica 2012.0 2.0 5.0 150.0 60.0 150.0 12.0 14383 8.28798 8.14232 16519 132 1586 6277 2526 1423 3.8939
2 102794 boardgame Caverna: The Cave Farmers 2013.0 1.0 7.0 210.0 30.0 210.0 12.0 9262 8.28994 8.06886 12230 99 1476 5600 1700 777 3.7761
3 25613 boardgame Through the Ages: A Story of Civilization 2006.0 2.0 4.0 240.0 240.0 240.0 12.0 13294 8.20407 8.05804 14343 362 1084 5075 3378 1642 4.1590
4 3076 boardgame Puerto Rico 2002.0 2.0 5.0 150.0 90.0 150.0 12.0 39883 8.14261 8.04524 44362 795 861 5414 9173 5213 3.2943

Since we are trying to predict the average_rating column using the other columns, let's visualize and explore it a little.

In [75]:
import seaborn as sns
%matplotlib inline

sns.set(color_codes=True)
sns.distplot(board_games["average_rating"])  # histogram with a kernel density estimate
Out[75]:

As we can see in the histogram above, most average ratings lie around ~6. Let's confirm this and calculate the standard deviation and mean of the average_rating column.

In [76]:
sd = board_games["average_rating"].std()
mean = board_games["average_rating"].mean()

print(sd,mean)
1.5788299348332662 6.016112849333889

Clustering and Error Metrics

To look for patterns in the data, we'll use a clustering algorithm to create clusters and plot them out. A good first choice is the KMeans class from the scikit-learn library. KMeans only works with numeric columns, so we'll extract the numeric columns of our dataset first.

Since the target is continuous, mean squared error (MSE) is a sensible error metric. MSE penalizes larger errors more heavily than smaller ones.
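
As a quick illustration of that penalty (a toy sketch, unrelated to the dataset), consider two error vectors with the same total absolute error:

import numpy as np

# Both vectors have a total absolute error of 4, but squaring makes
# the single large miss four times as costly on average.
errors_spread = np.array([1.0, 1.0, 1.0, 1.0])  # four small errors
errors_spiked = np.array([4.0, 0.0, 0.0, 0.0])  # one large error

print(np.mean(errors_spread ** 2))  # 1.0
print(np.mean(errors_spiked ** 2))  # 4.0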

For the cluster assignments, we compute the mean and standard deviation of each row, make a scatterplot of mean vs. standard deviation, and shade the points according to their cluster assignment.

In [77]:
from sklearn.cluster import KMeans
import numpy as np
from pandas import DataFrame
%matplotlib inline

kmodel = KMeans(n_clusters=5, random_state=1)
board_games_numeric = board_games.drop(['name', 'type', 'id'], axis=1)  # keep only numeric columns
kmodel.fit(board_games_numeric)

# Summarize each game (row) by the mean and standard deviation of its attributes
games_mean = board_games_numeric.apply(np.mean, axis=1)
games_std = board_games_numeric.apply(np.std, axis=1)

labels = kmodel.labels_
df = DataFrame()
df["mean"] = games_mean
df["standard_deviation"] = games_std
sns.set_style(style="ticks")
# Shade each point by its cluster label
sns.jointplot(x="mean", y="standard_deviation", data=df, c=labels, color="g")
Out[77]:

Looking at the clusters above in different shades of green, we can see that most games are similar in how their attributes are distributed. But as the attribute values grow (a higher row mean), for example because a large number of users rated the game (the users_rated column), there are fewer and fewer games.

Since users_rated counts the number of users who rated a game, one possible takeaway is that a few games get rated by a lot of users, but most games don't get played much.
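
One quick way to check that skew (a sketch; the exact numbers depend on the data) is to compare the median number of ratings to a high quantile, where a large gap indicates a heavy tail:

print(board_games["users_rated"].quantile([0.5, 0.99]))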

Next, let's figure out which columns correlate well with average_rating. This will allow us to remove columns that don't add much predictive power to our model.

We will be using a linear regression model to predict average ratings, and columns that are uncorrelated with the target don't help such a model. Checking correlations also lets us spot columns that are derived from the target, or that could otherwise cause overfitting.

Correlations

In [78]:
correlations = board_games_numeric.corr()

correlations["average_rating"] #Shows us how each column in board game dataset is correlated with average_rating
Out[78]:
yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64

Correlations tell you which columns are closely related to the column you are interested in. The closer the correlation is to 0, the weaker the connection. The closer to 1, the stronger the positive correlation, and the closer to -1, the stronger the negative correlation.
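
As a toy illustration of those extremes (a sketch, unrelated to the dataset):

import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1000)
noisy_double = 2 * x + 0.1 * rng.rand(1000)  # almost a linear function of x
unrelated = rng.rand(1000)                   # generated independently of x

print(np.corrcoef(x, noisy_double)[0, 1])  # close to +1
print(np.corrcoef(x, unrelated)[0, 1])     # close to 0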

As we can see above, a couple of columns correlate fairly strongly with average_rating. The average_weight column has the strongest correlation (0.35), implying that the more "weight" a game has, the more highly it tends to be rated. (Recall that weight is BoardGameGeek's subjective measure of how "deep" or involved a game is.)

We can also note that games for older players (where minage is high) tend to have a higher average rating, and the yearpublished correlation suggests that newer games tend to be rated slightly higher.

The bayes_average_rating column appears to be calculated from the average_rating column itself, so we'll remove it: keeping a column derived from the target would leak the answer into the model and cause it to overfit.

Modelling and Validation

We are going to create a linear regression model to make predictions for newly created board games. Let's first see the performance of the model on just the training set. As discussed above, we'll drop the target column and the bayes_average_rating column (which is derived from the target), and use the remaining numeric columns as input predictors. The average_rating column will be our target output.

In [134]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

board_games_numeric2 = board_games_numeric.drop(['average_rating', 'bayes_average_rating'], axis=1)  # drop the target and the column derived from it

lr = LinearRegression()
lr.fit(board_games_numeric2, board_games["average_rating"])
predictions = lr.predict(board_games_numeric2)  # predictions on the training set itself

mse = mean_squared_error(board_games["average_rating"], predictions)
print(mse)
rmse = mse ** (1/2)
print(rmse)
2.09339697583
1.44685762113

We obtained a mean squared error of 2.09 (an RMSE of about 1.45). Let's see how our predictions plot over the actual average ratings.

In [94]:
sns.distplot(board_games["average_rating"])
sns.distplot(predictions)
Out[94]:

K-Folds Cross Validation

We now split our data into training and test sets, train the algorithm on the training data, and test its performance on the held-out data. For validation we will use k-fold cross-validation to get an honest estimate of how the model performs on unseen data.

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter. (Source)

In this case we use k = 10 folds. To see how good the model actually is, we can compare our root mean squared error (RMSE) to the standard deviation of the ratings.

In [135]:
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import cross_val_predict

x = board_games_numeric2
y = board_games["average_rating"]

kf = KFold(len(board_games_numeric2), 10, shuffle=True, random_state=8)  # 10 folds
mse = cross_val_score(lr, x, y, scoring='mean_squared_error', cv=kf)  # scikit-learn returns negative MSE scores
averagemse = np.mean(mse)
print(abs(averagemse))
rmse = abs(averagemse) ** (1/2)
print(rmse)

predicted = cross_val_predict(lr, x, y, cv=kf)  # out-of-fold prediction for every row
sns.distplot(board_games["average_rating"])
sns.distplot(predicted)
2.09490764514
1.4473795788
Out[135]:

The cross-validated error is essentially the same as the training error (an MSE of 2.09), which tells us the model isn't overfitting, but it also doesn't have much predictive power: our RMSE (~1.45) is close to the standard deviation of the game ratings (~1.58), so the model barely beats simply predicting the mean rating for every game. We would need to dig deeper into the dataset to do better.
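
One way to make that comparison concrete (a sketch): a baseline model that always predicts the mean rating has an RMSE equal to the standard deviation of the target, so the ratio of our cross-validated RMSE to the standard deviation shows how little we gain over that trivial baseline.

baseline_rmse = board_games["average_rating"].std()  # RMSE of always predicting the mean
print(rmse / baseline_rmse)  # close to 1.0 means barely better than the baseline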

Next Steps

Some of the potential next steps:

  • Calculate new predictors from existing columns, such as the player range (maxplayers - minplayers), as sketched below.
  • Scrape more data from BoardGameGeek to increase the size of the dataset.
  • Try algorithms other than linear regression.
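
For example, the player-range predictor from the first bullet could be derived like this (a sketch; player_range is a hypothetical new column, not part of the original dataset):

board_games_numeric2["player_range"] = (
    board_games_numeric2["maxplayers"] - board_games_numeric2["minplayers"]
)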

