The race to the 2016 presidential election is in full swing: there is a lot of talk, and public opinion is up for grabs. The data for the endeavor below is being pulled from the Twitter Streaming API, and I have scheduled new data to be pulled every 6 hours as we draw closer to election day. The code runs periodically on a DigitalOcean cloud server running Ubuntu 16.04.1 x64 and Python 3.5.
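For context, here is a minimal sketch of how tweets like these could be collected. The article doesn't specify which client it uses, so this assumes the tweepy library, and the credential strings and tracked fields are placeholders:

# a sketch of tweet collection with tweepy 3.x; credentials are placeholders
import csv
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class TweetListener(StreamListener):
    # append a few fields of each incoming tweet to tweets.csv
    def on_status(self, status):
        with open("tweets.csv", "a") as f:
            csv.writer(f).writerow([status.text, status.created_at,
                                    status.user.location, status.user.created_at])

auth = OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
Stream(auth, TweetListener()).filter(track=["hillary clinton", "donald trump"])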
The dataset at the time this article was published comprised 16,232 tweets about Hillary Clinton and Donald Trump. Even though the code walkthrough and the tables are a snapshot in time, the visualisations will keep updating automatically as new tweets come in. Watch this space!
Let's load up the data and do some exploration.
#importing from Twitter API/JSON dump
import pandas as pd
import numpy as np
tweets = pd.read_csv("/Users/Guneet/TwitterScrape/tweets.csv")
tweets.head(2)
Here are some of the columns of interest in the data: text (the content of the tweet), created (when the tweet was sent), user_location (the location the user lists in their Twitter bio), user_created (when the user's account was created), and polarity (a sentiment score we'll use later).
One of the interesting things we can do here is compare the contents of the tweets. Let's generate a column that tells us which candidates are mentioned in each tweet, so we can start comparing tweets about one candidate to tweets about the other.
# add a candidate column based on the contents of the text column
def get_candidate(row):
    candidates = []
    text = row["text"].lower()
    if "clinton" in text or "hillary" in text:
        candidates.append("clinton")
    if "trump" in text or "donald" in text:
        candidates.append("trump")
    return ",".join(candidates)
tweets["candidate"] = tweets.apply(get_candidate, axis=1)
The user_location column is another interesting one: it tells us the location the user lists in their Twitter bio. Let's extract the state from that location into a new column called code.
# add a code column with the state abbreviation parsed from the user's location
def get_location(row):
    code = []
    states = {
        'Alabama': 'AL',
        'Alaska': 'AK',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'}
    text = row["user_location"]
    # pd.isnull catches missing locations more reliably than "is np.nan"
    if pd.isnull(text):
        text = '-'
    # naive substring match on the state name or abbreviation;
    # stop at the first state that matches
    for key in states:
        if key in text or states[key] in text:
            code.append(states[key])
            break
    return ",".join(code)
tweets["code"] = tweets.apply(get_location, axis=1)
tweets.head(2)
One of the things we could look at is the age of the Twitter accounts, i.e. when they were created. This could give us a better understanding of the accounts of users who tweet about either candidate: a candidate with a large share of recently created accounts might be the target of some kind of manipulation via fake Twitter accounts.
# create a user_age column (in years) from the user_created column
from datetime import datetime
tweets["created"] = pd.to_datetime(tweets["created"])
tweets["user_created"] = pd.to_datetime(tweets["user_created"])
tweets["user_age"] = tweets["user_created"].apply(lambda x: (datetime.now() - x).total_seconds() / 3600 / 24 / 365)
# split account ages by the candidate mentioned
cl_tweets = tweets["user_age"][tweets["candidate"] == "clinton"]
tr_tweets = tweets["user_age"][tweets["candidate"] == "trump"]
This also gives us an opportunity to look at the number of tweets mentioning each candidate, broken down by the age of the accounts making them. Let's look at the two together.
# plot the number of tweets mentioning each candidate by account age
import plotly
plotly.tools.set_credentials_file(username='**********', api_key='**********')
import plotly.plotly as py
import plotly.graph_objs as go

x0 = cl_tweets
x1 = tr_tweets
trace1 = go.Histogram(
    x=x0,
    histnorm='count',
    name='Clinton',
    marker=dict(
        color='blue',
        line=dict(
            color='blue',
            width=0
        )
    ),
    opacity=0.75
)
trace2 = go.Histogram(
    x=x1,
    name='Trump',
    marker=dict(
        color='red'
    ),
    opacity=0.75
)
data = [trace1, trace2]
layout = go.Layout(
    title='Tweets mentioning each candidate',
    xaxis=dict(
        title='Twitter account age in years'
    ),
    yaxis=dict(
        title='Number of tweets'
    ),
    barmode='stack',
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='number-of-tweets')
We can take this a step further and break the data down by the state the tweets are coming from. This gives us an idea of how the tweet traffic is spread across different states. Let's see how this geographic layout looks for Hillary Clinton.
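The choropleths below rely on clcount and trcount, per-state tweet counts that aren't computed in the snippets shown here. A minimal sketch of how they could be derived from the code column, assuming tweets with no recognisable state are dropped:

# per-state tweet counts for each candidate, using the code column from above
clcount = tweets[tweets["candidate"] == "clinton"]["code"].value_counts()
trcount = tweets[tweets["candidate"] == "trump"]["code"].value_counts()
# drop the empty-string bucket of tweets with no recognisable state
clcount = clcount[clcount.index != ""]
trcount = trcount[trcount.index != ""]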
# Visualize tweets by location for Hillary Clinton.
scl = [[0.0, 'rgb(170,170,255)'], [0.2, 'rgb(130,130,255)'], [0.4, 'rgb(120,120,255)'],
       [0.6, 'rgb(100,100,255)'], [0.8, 'rgb(50,50,255)'], [1.0, 'rgb(0,0,255)']]
data = [dict(
    type='choropleth',
    colorscale=scl,
    autocolorscale=False,
    locations=clcount.index,
    z=clcount,
    locationmode='USA-states',
    text="Tweets about Clinton",
    marker=dict(
        line=dict(
            color='rgb(255,255,255)',
            width=2
        )),
    colorbar=dict(
        title="Number of tweets")
)]
layout = dict(
    title='Who is talking about Hillary Clinton?',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa')
    ))
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
And now let's do the same for Donald Trump.
# Visualize tweets by location for Donald Trump.
scl = [[0.0, 'rgb(255,170,170)'], [0.2, 'rgb(255,130,130)'], [0.4, 'rgb(255,120,120)'],
       [0.6, 'rgb(255,100,100)'], [0.8, 'rgb(255,50,50)'], [1.0, 'rgb(255,0,0)']]
data = [dict(
    type='choropleth',
    colorscale=scl,
    autocolorscale=False,
    locations=trcount.index,
    z=trcount,
    locationmode='USA-states',
    text="Tweets about Trump",
    marker=dict(
        line=dict(
            color='rgb(255,255,255)',
            width=2
        )),
    colorbar=dict(
        title="Number of tweets")
)]
layout = dict(
    title='Who is talking about Donald Trump?',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa')
    ))
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
And now we can put the data behind the two visualisations above together and see who gets more traffic across the country. This gives us some interesting insight into which candidate each state is more vocal about.
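The combined map uses a filteredpopularityindex series that isn't derived in the snippets shown. Judging from the blue-to-red colour scale, one plausible construction is the share of each state's tweets that mention Trump, filtered to states with a reasonable sample size (the threshold below is an assumption):

# share of each state's tweets mentioning Trump; 0 is all Clinton, 1 is all Trump
total = clcount.add(trcount, fill_value=0)
popularityindex = trcount.divide(total, fill_value=0)
# keep only states with enough tweets for the share to mean something
filteredpopularityindex = popularityindex[total > 10]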
# Visualize tweet content for Clinton and Trump on a state basis
scl = [[0.0, 'rgb(130,130,255)'], [0.2, 'rgb(255,170,170)'], [0.4, 'rgb(255,150,150)'],
       [0.6, 'rgb(255,100,100)'], [0.8, 'rgb(255,50,50)'], [1.0, 'rgb(255,0,0)']]
data = [dict(
    type='choropleth',
    colorscale=scl,
    autocolorscale=False,
    locations=filteredpopularityindex.index,
    z=filteredpopularityindex,
    locationmode='USA-states',
    text="Trump and Clinton together",
    marker=dict(
        line=dict(
            color='rgb(255,255,255)',
            width=2
        )),
    colorbar=dict(
        title="Tweet Majority")
)]
layout = dict(
    title='Who are we really talking about?',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa')
    ))
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Sentiment analysis is an area dedicated to extracting subjective emotions from text. It's the process of learning whether the writer feels positively or negatively about a topic.
We generated sentiment scores for each tweet using TextBlob; they are stored in the polarity column. We can plot the mean value for each candidate, along with the standard deviation. The standard deviation tells us how widely sentiment varies across all the tweets, whereas the mean tells us the sentiment of the average tweet.
TextBlob is a Python library for processing textual data. It provides a platform to dive into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.
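As a rough sketch, the polarity column could be produced like this (assuming it wasn't already present in the CSV):

from textblob import TextBlob

# polarity ranges from -1 (most negative) to 1 (most positive)
tweets["polarity"] = tweets["text"].apply(lambda text: TextBlob(text).sentiment.polarity)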
On the whole, the mean tweet sentiment for each candidate looks like this.
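The mean and std series used in the next two plots aren't derived in the snippets above. A plausible sketch, assuming tweets mentioning neither candidate are excluded, so the sorted group order becomes clinton, "clinton,trump", trump:

# group sentiment scores by candidate mentioned
grouped = tweets[tweets["candidate"] != ""].groupby("candidate")["polarity"]
mean = grouped.mean()
std = grouped.std()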
# mean[0] is Clinton and mean[2] is Trump; index 1 is tweets mentioning both
data = [go.Bar(
    x=['clinton', 'trump'],
    y=[mean[0], mean[2]]
)]
layout = go.Layout(
    title='Mean Tweet Sentiment',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar')
As we can see, Donald Trump is the more talked-about candidate, and the mean sentiment for him is higher than for Hillary Clinton (or it was at the time this sentence was published, on 14 August 2016). Below we plot the standard deviation of the sentiment, which tells us how widely sentiment varies across all the tweets.
data = [go.Bar(
    x=['clinton', 'trump'],
    y=[std[0], std[2]]
)]
layout = go.Layout(
    title='Standard Deviation of Tweet Sentiment',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar')
Next Steps
This has been a good start to pique some interest, and we can now branch off in a number of directions from here.
Keep watching this space as we draw closer to the elections. Twitter activity around the presidential debates should give us some pretty interesting insights.