A/B Testing the Bayesian Way

"When the facts change, I change my mind. What do you do, sir?" - John Maynard Keynes (maybe)

A/B testing, as the name suggests, is a method of statistically testing two different versions of a design pattern to determine the difference in effectiveness. Simply put, an A/B test compares two versions of a single variable by testing a subject's response to one version against the other and determining which of the two is more effective.

I work in the biopharma industry, and A/B tests come up all the time. Our aim is to measure the effectiveness of a certain drug A versus drug B. So we test drug A on some percentage of a patient group and drug B on the rest of the group. After sufficient trials are performed, the in-house statisticians (my SAS people!) measure the efficacy to ascertain which drug gave better results.
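To make the drug A versus drug B comparison concrete, here is a minimal sketch of one common Bayesian approach. It may not be the method used in the full post. It assumes each patient either responds or does not, puts a uniform Beta(1, 1) prior on each drug's response rate, and estimates the probability that drug A's rate exceeds drug B's by sampling from the two posteriors. The function name and the patient counts are hypothetical and only there for illustration.

```python
import random


def prob_a_beats_b(success_a, n_a, success_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_A > rate_B) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, each posterior is
        # Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + success_a, 1 + n_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + n_b - success_b)
        if rate_a > rate_b:
            wins += 1
    return wins / draws


# Hypothetical counts, made up for illustration only (not real trial data).
p = prob_a_beats_b(success_a=60, n_a=100, success_b=45, n_b=100)
print(f"Estimated P(drug A response rate > drug B response rate): {p:.3f}")
```

The output is a direct probability statement about which drug is better, which is often easier to act on than a p-value. As more patients are observed, the counts are simply updated and the estimate recomputed.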


A music recommender - making sense of 24 million user plays from 1.6 million unique artists using Apache Spark.

Data applications are like sausages. It is better not to see them being made - Otto von Bismarck

Data science: two words that even the initiated have trouble explaining to the layperson. There's an old saying that if you can't explain it to your grandmother, you probably do not understand it well enough. That holds very true, especially today in 2016, given the complexities that more often than not plague most intellectual endeavors. Personally, my favorite go-to line when I'm on the receiving end is - Please explain to me as you might to a child, or a ...


Predicting Board Game Reviews using KMeans Clustering & Linear Regression

Board games have been making a comeback lately, and deeper, more strategic board games, like Settlers of Catan, have become hugely popular. BoardGameGeek is a popular site where these types of board games are discussed and reviewed.

Here we have a dataset that contains 80000 board games and their associated review scores. The data was scraped from BoardGameGeek and compiled into CSV format by Sean Beck.

Let's preview the data and look at some interesting columns.
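As a rough sketch of what such a preview could look like using only the standard library (the full post may well use a different tool), the helper below returns the column names and the first few rows of a CSV. The helper name, the column names, and the two sample rows are placeholders invented for illustration. They are not taken from the actual BoardGameGeek file.

```python
import csv
import io
from itertools import islice


def preview_rows(handle, n=5):
    """Return the column names and the first n rows from a CSV file handle."""
    reader = csv.DictReader(handle)
    rows = list(islice(reader, n))
    return reader.fieldnames, rows


# Placeholder content standing in for the real file; names and values are invented.
sample = io.StringIO("name,average_rating\nGame X,7.5\nGame Y,6.1\n")
columns, rows = preview_rows(sample, n=1)
print(columns)
print(rows)
```

With the real dataset, the `io.StringIO` stand-in would be replaced by an open file handle, for example `open(path, newline="")`, where `path` points at the downloaded CSV.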


Police killings in the USA: where the police killed in 2015

Due in part to recent high-profile shootings of civilians by police in the US, the media and public have been scrutinizing police killings heavily. The team at FiveThirtyEight assembled a dataset using crowdsourced data from the Guardian and census data. It contains information on each police killing in the US.

Each of the 467 rows in the dataset contains information on a police killing of a civilian in the US from January 2015 to June 2015. Let's see what the data looks like.


Predicting Bike Rentals by Applying Decision Trees & Random Forests

In many cities, there are communal bicycle sharing stations where you can rent bicycles by the hour or by the day. Kind of reminds me of how we rent out cloud servers as I write this, haha. Washington, D.C. is one of these cities, and has detailed data available about how many bicycles were rented by hour and by day.

The powerful Hadi Fanaee-T at the University of Porto has compiled this data into a CSV file. The file contains 17380 rows, and each row represents the bike rentals in a single hour of a single day. Let's take a closer look at it.


Who are we really talking about? Trump Vs Clinton, a Twitter Sentiment Analysis

The race to the 2016 presidential election is going strong. There is a lot of talk, and the general public opinion is up for grabs. The data for the endeavor below is being pulled from the Twitter Streaming API, and I have scheduled new data to be pulled every 6 hours as we get closer to election day. The code runs periodically on a DigitalOcean cloud server running Ubuntu 16.04.1 x64 and Python 3.5.

The dataset at the time this article was published comprised 16,232 tweets about Hillary Clinton and Donald Trump. Even though the code walkthrough and the tables are a snapshot in time, the visualisations will keep updating automatically as we keep getting new tweets. Watch this space!
