The race to the 2016 presidential election is going strong: there is a lot of talk, and public opinion is up for grabs. The data for the analysis below is pulled from the Twitter Streaming API, and I have scheduled new data to be pulled every 6 hours as we get closer to election day. The code runs periodically on a DigitalOcean cloud server running Ubuntu 16.04.1 x64 and Python 3.5.

At the time this article was published, the dataset comprised 16,232 tweets about Hillary Clinton and Donald Trump. Even though the code walkthrough and the tables are snapshots in time, the visualisations will keep updating automatically as new tweets come in. Watch this space!

Let's load up the data and do some exploration.

In [2]:
#importing from Twitter API/JSON dump
import pandas as pd
import numpy as np
tweets = pd.read_csv("/Users/Guneet/TwitterScrape/tweets.csv")
tweets.head(2)
Out[2]:
id retweet_count text user_description id_str user_name user_location polarity geo subjectivity user_followers user_created coordinates created user_bg_color
0 1 0 I love watching the Clinton News Network actin... Conservative. Constitutionalist. Common sense.... 764025773069967360 Regula_Iuris United States 0.25 NaN 0.3 118 2015-12-06T03:11:29 NaN 2016-08-12T09:08:33 C0DEED
1 2 0 RT @DrJillStein: Clinton, Obama, Pelosi & ... Redstone Tutorial Let's plays | Making things ... 764025773644623872 Anikidomo NY 0.00 NaN 0.0 5499 2007-10-31T13:38:00 NaN 2016-08-12T09:08:34 000000

Here are some of the columns of interest in the data:

  • id_str – the id of the tweet on Twitter.
  • user_location – the location the tweeter specified in their Twitter bio.
  • user_name – the Twitter username of the tweeter.
  • polarity – the sentiment of the tweet, from -1 to 1. 1 indicates strong positivity, -1 strong negativity.
  • created – when the tweet was sent.
  • user_description – the description the tweeter specified in their bio.
  • user_created – when the tweeter created their account.
  • user_followers – the number of followers the tweeter has.
  • text – the text of the tweet.
  • subjectivity – the subjectivity or objectivity of the tweet. 0 is very objective, 1 is very subjective.

Some of the interesting stuff we can do here is to compare contents of the tweets. Let's generate a column which tells us what candidates are mentioned in each tweet so we can start comparing tweets about one candidate to another.

    In [3]:
    #adding candidates column to table based on contents in text column in tweets.csv
    
    def get_candidate(row):
        candidates=[]
        text = row["text"].lower()
        if "clinton" in text or "hillary" in text:
            candidates.append("clinton")
        if "trump" in text or "donald" in text:
            candidates.append("trump")
        return ",".join(candidates)
                   
    tweets["candidate"] = tweets.apply(get_candidate,axis=1)
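
To see what the new column holds, here is a minimal, self-contained sketch of the same labelling rule applied to a few made-up tweet texts. The sample strings and the helper name label_candidates are illustrative and not taken from the dataset. The sketch also guards against a missing text value, which the cell above assumes never occurs.

```python
from collections import Counter

def label_candidates(text):
    """Return a comma-separated label of the candidates mentioned in text."""
    # Hypothetical guard: treat missing or non-string text as mentioning nobody.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    candidates = []
    if "clinton" in text or "hillary" in text:
        candidates.append("clinton")
    if "trump" in text or "donald" in text:
        candidates.append("trump")
    return ",".join(candidates)

# Made-up sample texts, for illustration only.
samples = [
    "Hillary spoke in Ohio today",
    "Donald Trump rally tonight",
    "Clinton vs Trump: first debate",
    "Nothing about the election here",
    None,
]
labels = [label_candidates(s) for s in samples]
label_counts = Counter(labels)
print(labels)
print(label_counts)
```

A tweet can therefore end up with one of four labels: clinton, trump, clinton,trump or an empty string. A comparison such as tweets["candidate"]=="clinton" only picks up tweets that mention Clinton alone.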
    

    The user_location column is another interesting column: it holds the location the tweeter specified in their Twitter bio. Let's extract the state from that location into another column called code.

    In [4]:
    # adding code column specifying state abbreviation from the location of the user
    
    def get_location(row):
        
        code=[]
        states = {
        'Alabama': 'AL',
        'Alaska': 'AK',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'}
        
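        text = row["user_location"]
        # pd.isnull also catches missing values that are not the np.nan singleton
        if pd.isnull(text):
            text = '-'
        # simple substring match; the first state that matches wins
        for key in states:
            if key in text or states[key] in text:
                code.append(states[key])
                break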
                
        return ",".join(code)
    
    tweets["code"] = tweets.apply(get_location,axis=1)
    tweets.head(2)
    
    Out[4]:
    id retweet_count text user_description id_str user_name user_location polarity geo subjectivity user_followers user_created coordinates created user_bg_color candidate code
    0 1 0 I love watching the Clinton News Network actin... Conservative. Constitutionalist. Common sense.... 764025773069967360 Regula_Iuris United States 0.25 NaN 0.3 118 2015-12-06T03:11:29 NaN 2016-08-12T09:08:33 C0DEED clinton,trump
    1 2 0 RT @DrJillStein: Clinton, Obama, Pelosi & ... Redstone Tutorial Let's plays | Making things ... 764025773644623872 Anikidomo NY 0.00 NaN 0.0 5499 2007-10-31T13:38:00 NaN 2016-08-12T09:08:34 000000 clinton NY
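
The substring test in get_location is simple but can misfire: a location such as "INDIA" contains "IN", and "West Virginia" also contains "Virginia". A possible refinement, sketched here under the assumption that matching whole words is what we want, tries longer state names first and only accepts an abbreviation when it stands alone as an upper-case token. The helper name find_state and the shortened STATES table are illustrative; the full dictionary from the cell above would be used in practice.

```python
import re

# Shortened, illustrative subset of the states dictionary used above.
STATES = {
    "Indiana": "IN",
    "Kansas": "KS",
    "Arkansas": "AR",
    "New York": "NY",
    "Virginia": "VA",
    "West Virginia": "WV",
}

# Longest names first so "West Virginia" is tried before "Virginia".
_NAMES = sorted(STATES, key=len, reverse=True)
_NAME_RE = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in _NAMES) + r")\b", re.IGNORECASE
)
# Abbreviations are matched case-sensitively and only as whole tokens.
_ABBR_RE = re.compile(r"\b(" + "|".join(STATES.values()) + r")\b")
_BY_LOWER_NAME = {name.lower(): abbr for name, abbr in STATES.items()}

def find_state(location):
    """Return a state abbreviation for a free-text location, or "" if none."""
    if not isinstance(location, str):
        return ""  # missing location (e.g. NaN)
    match = _NAME_RE.search(location)
    if match:
        return _BY_LOWER_NAME[match.group(1).lower()]
    match = _ABBR_RE.search(location)
    return match.group(1) if match else ""

for loc in ["Charleston, West Virginia", "INDIA", "Brooklyn, NY", "new york city"]:
    print(repr(loc), "->", repr(find_state(loc)))
```

Applied to the data, this could replace the row-wise call with tweets["user_location"].apply(find_state). Ambiguous names such as "Washington, DC" would still need special handling.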

    One of the things we could look at is the age of the Twitter accounts: how old they are and when they were created. This could give us a better understanding of the accounts of users who tweet about either candidate. A candidate with more recently created accounts tweeting about them might imply some kind of manipulation through fake Twitter accounts.

    In [5]:
    # create new column called user_age (account age in years, measured from now) from the user_created column
    from datetime import datetime
    tweets["created"] = pd.to_datetime(tweets["created"])
    tweets["user_created"] = pd.to_datetime(tweets["user_created"])
    tweets["user_age"] = tweets["user_created"].apply(lambda x: (datetime.now() - x).total_seconds() / 3600 / 24 / 365)
    cl_tweets = tweets["user_age"][tweets["candidate"]=="clinton"]
    tr_tweets = tweets["user_age"][tweets["candidate"]=="trump"]
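
The cell above measures account age relative to the moment the notebook runs, so the numbers drift every time it is re-executed. An alternative, sketched here as an assumption and not as what the analysis did, is to measure the age at the time each tweet was sent, using the created and user_created timestamps. The function name and the 365.25-day year are choices made for this illustration.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # timestamp format seen in the sample rows
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # illustrative choice of year length

def account_age_years(user_created, created):
    """Age of the account, in years, at the moment the tweet was sent."""
    start = datetime.strptime(user_created, ISO_FORMAT)
    end = datetime.strptime(created, ISO_FORMAT)
    return (end - start).total_seconds() / SECONDS_PER_YEAR

# Timestamps taken from the two rows shown by tweets.head(2).
age_first = account_age_years("2015-12-06T03:11:29", "2016-08-12T09:08:33")
age_second = account_age_years("2007-10-31T13:38:00", "2016-08-12T09:08:34")
print(round(age_first, 2), round(age_second, 2))
```

Once both columns have been converted with pd.to_datetime, the same idea can be written in vectorised form as (tweets["created"] - tweets["user_created"]).dt.total_seconds() / SECONDS_PER_YEAR.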
    

    This also lets us see how many tweets about each candidate come from accounts of each age. Let's look at the two together.

    In [ ]:
    # plotting a histogram of Twitter account age for tweets mentioning each candidate
    import plotly
    plotly.tools.set_credentials_file(username='**********', api_key='**********')
    
    import plotly.plotly as py
    import plotly.graph_objs as go
    
    x0 = cl_tweets
    x1 = tr_tweets
    
    trace1 = go.Histogram(
        x=x0,
        histnorm='count',
        name='Clinton',
        
    
        marker=dict(
            color='blue',
            line=dict(
                color='blue',
                width=0
            )
        ),
        opacity=0.75
    )
    trace2 = go.Histogram(
        x=x1,
        name='Trump',
        
    
        marker=dict(
            color='red'
        ),
        opacity=0.75
    )
    data = [trace1, trace2]
    layout = go.Layout(
        title='Tweets mentioning each candidate',
        xaxis=dict(
            title='Twitter account age in years'
        ),
        yaxis=dict(
            title='Number of tweets'
        ),
        barmode='stack',
        bargap=0.2,
        bargroupgap=0.1
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='number-of-tweets')
    
    Tweets mentioning each candidate

    We can take this a step further and break down the data by the state the tweets are coming from. This gives us an idea of how the tweet traffic is spread across different states. Let's see what this geographic layout looks like for Hillary Clinton.
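
The map cells that follow use two series, clcount and trcount, whose definitions are not shown. A plausible reconstruction, offered as an assumption, counts tweets per state code for tweets that mention only one candidate and drops rows where no state was found. The small DataFrame sample_tweets and the helper count_by_state are made up so the sketch runs on its own.

```python
import pandas as pd

# Made-up stand-in for the tweets DataFrame built earlier.
sample_tweets = pd.DataFrame({
    "candidate": ["clinton", "trump", "clinton", "clinton,trump", "trump", "clinton"],
    "code":      ["NY",      "TX",    "NY",      "CA",            "",      "CA"],
})

def count_by_state(df, candidate):
    """Number of tweets per state code for tweets labelled exactly `candidate`."""
    # Keep only rows for this candidate that have a non-empty state code.
    subset = df[(df["candidate"] == candidate) & (df["code"] != "")]
    return subset["code"].value_counts()

clcount = count_by_state(sample_tweets, "clinton")
trcount = count_by_state(sample_tweets, "trump")
print(clcount)
print(trcount)
```

The result is a Series indexed by state abbreviation, which matches how the cells use clcount.index for locations and clcount itself for z.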

    In [ ]:
    #Visualize tweets by location for Hillary Clinton.
    
    scl = [[0.0, 'rgb(170,170,255)'],[0.2, 'rgb(130,130,255)'],[0.4, 'rgb(120,120,255)'],\
                [0.6, 'rgb(100,100,255)'],[0.8, 'rgb(50,50,255)'],[1.0, 'rgb(0,0,255)']]
    
    data = [ dict(
            type='choropleth',
            colorscale = scl,
            autocolorscale = False,
            locations = clcount.index, 
            z = clcount,
            locationmode = 'USA-states',
            text = "Tweets about Clinton",
            marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 2
                ) ),
            colorbar = dict(
                title = "Location by the numbers")
            ) ]
    
    layout = dict(
            title = 'Who is talking about Hillary Clinton?',
            geo = dict(
                scope='usa',
                projection=dict( type='albers usa' )
                 ))
        
    fig = dict( data=data, layout=layout )
    py.iplot( fig, filename='d3-cloropleth-map' )
    
    Who is talking about Hillary Clinton?

    And now let's do the same for Donald Trump.

    In [ ]:
    # Visualize tweets by location for Donald Trump. 
    
    scl = [[0.0, 'rgb(255,170,170)'],[0.2, 'rgb(255,130,130)'],[0.4, 'rgb(255,120,120)'],\
                [0.6, 'rgb(255,100,100)'],[0.8, 'rgb(255,50,50)'],[1.0, 'rgb(255,0,0)']]
    
    data = [ dict(
            type='choropleth',
            colorscale = scl,
            autocolorscale = False,
            locations = trcount.index, 
            z = trcount,
            locationmode = 'USA-states',
            text = "Tweets about Trump",
            marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 2
                ) ),
            colorbar = dict(
                title = "Tweet Locations")
            ) ]
    
    layout = dict(
            title = 'Who is talking about Donald Trump?',
            geo = dict(
                scope='usa',
                projection=dict( type='albers usa' )
                 ))
        
    fig = dict( data=data, layout=layout )
    py.iplot( fig, filename='d3-cloropleth-map' )
    
    Who is talking about Donald Trump?

    And now we can put the data behind the two visualisations above together and see who gets more traffic across the country. This gives us some interesting insights into which candidate certain states are more vocal about.

    In [ ]:
    # Visualize tweet content for Clinton and Trump on state basis
    
    scl = [[0.0, 'rgb(130,130,255)'],[0.2, 'rgb(255,170,170)'],[0.4, 'rgb(255,150,150)'],\
                [0.6, 'rgb(255,100,100)'],[0.8, 'rgb(255,50,50)'],[1.0, 'rgb(255,0,0)']]
    
    
    data = [ dict(
            type='choropleth',
            colorscale = scl,
            autocolorscale = False,
            locations = filteredpopularityindex.index, 
            z = filteredpopularityindex,
            locationmode = 'USA-states',
            text = "Trump and Clinton together",
            marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 2
                ) ),
            colorbar = dict(
                title = "Tweet Majority")
            ) ]
    
    layout = dict(
            title = 'Who are we really talking about?',
            geo = dict(
                scope='usa',
                projection=dict( type='albers usa' )
                 ))
        
    fig = dict( data=data, layout=layout )
    py.iplot( fig, filename='d3-cloropleth-map' )
    
    Who are we really talking about?

    Sentiment analysis is an area dedicated to extracting subjective emotions from text. It's the process of learning whether the writer feels positively or negatively about a topic.

    We generated a sentiment score for each tweet using TextBlob and stored it in the polarity column. We can plot the mean value for each candidate, along with the standard deviation. The standard deviation tells us how widely sentiment varies across the tweets, whereas the mean tells us the sentiment of the average tweet.

    TextBlob is a Python library for processing textual data. It provides a platform to dive into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.
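
The bar charts that follow read from two objects, mean and std, that are not defined in the cells shown. One way they could be produced, stated here as an assumption, is a groupby on the candidate column. The polarity values in sample_tweets are invented for illustration.

```python
import pandas as pd

# Invented polarity scores, for illustration only.
sample_tweets = pd.DataFrame({
    "candidate": ["clinton", "clinton", "trump", "trump", "clinton,trump"],
    "polarity":  [0.5,       -0.1,      0.4,     0.0,     0.1],
})

grouped = sample_tweets.groupby("candidate")["polarity"]
mean = grouped.mean()  # average sentiment per candidate label
std = grouped.std()    # sample standard deviation (ddof=1) per label

# Selecting by label is less fragile than by position such as mean[0], mean[2].
print(mean.loc[["clinton", "trump"]])
print(std.loc[["clinton", "trump"]])
```

Because groupby sorts the labels, positions 0 and 2 correspond to clinton and trump only as long as no tweet has an empty candidate label, which is why the sketch looks values up by label.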

    Overall, the mean tweet sentiment for each candidate looks like this.

    In [ ]:
    data = [go.Bar(
                x=['clinton','trump'],
                y=[mean[0],mean[2]]
        )]
    layout = go.Layout(
        title='Mean Tweet Sentiment',
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='basic-bar')
    
    Mean Tweet Sentiment

    As we can see, Donald Trump is the more talked about candidate and the mean sentiment for him is higher than for Hillary Clinton (or it was at the time this was published, on 14 August 2016). Below we plot the standard deviation of the sentiment, which tells us how wide the variation between all the tweets is.

    In [ ]:
    data = [go.Bar(
                x=['clinton','trump'],
                y=[std[0],std[2]]
        )]
    layout = go.Layout(
        title='Standard Deviation of Tweet Sentiment',
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='basic-bar')
    
    Standard Deviation of Tweet Sentiment

    Next Steps

    This has been a good start to pique some interest, and we can now branch off in a number of directions. Some things we could check out further:

  • Analyze user descriptions, and see how description length varies by candidate
  • See what kinds of usernames tweet more about what kinds of candidates
  • Identify potential swing states by taking into account sentiment across known Republican and Democratic states
  • Keep watching this space as we draw closer to the elections. Twitter activity around the presidential debates should give us some pretty interesting insights.


