Due in part to recent high-profile shootings of civilians by police in the US, the media and public have been scrutinizing police killings heavily. The team at FiveThirtyEight assembled a dataset combining the Guardian's crowdsourced data on police killings with US census data. It contains information on each police killing in the US.
Each of the 467 rows in the dataset describes a police killing of a civilian in the US between January 2015 and June 2015. Let's see what the data looks like.
# Read the data into a pandas DataFrame and look at the first five rows
import pandas as pd
police_killings = pd.read_csv("Police killings/police_killings.csv",encoding="ISO-8859-1")
police_killings.head(5)
Since our dataset is very wide, it's useful to see all of the columns for a single row of data.
print(police_killings.iloc[0,:])
Some of the interesting columns in the dataset are:
- name -- the name of the civilian.
- age -- the age of the civilian.
- gender -- the gender of the civilian.
- raceethnicity -- the race and ethnicity of the civilian.
- month, day, and year -- when the shooting occurred.
- streetaddress, city, state -- where the shooting occurred.
- lawenforcementagency -- the agency that was involved.
- cause -- the cause of death.
- armed -- whether or not the civilian was armed.
- pop -- population of the census area where the incident occurred.
- county_income -- median household income in the county.
Let us explore the incidents by race.
police_killings["raceethnicity"].value_counts()
# Visualise the above incidents by race
%matplotlib inline
import matplotlib.pyplot as plt
numbers = police_killings["raceethnicity"].value_counts()
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_ylabel('Deaths')
fig.suptitle('Incidents by Race', fontsize=14, fontweight='bold')
# Use the number of categories rather than hard-coding it
plt.xticks(range(len(numbers)), numbers.index, rotation="vertical")
plt.bar(range(len(numbers)), numbers)
Let's see how this breakdown compares to the racial breakdown of the overall US population.
#Explore deaths as a percentage
percentage = numbers/sum(numbers) * 100
percentage
People identified as Black are overrepresented in these shootings relative to their share of the total US population: 28% of the victims are Black, while Black people make up only ~12% of the US population. (Source)
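To make this comparison explicit, we can put the two distributions side by side. The census shares below are approximate 2015 estimates, hard-coded here purely for illustration (exact figures should be pulled from census.gov), and the category labels are assumed to match the raceethnicity values shown above; any that don't match will simply show as missing.
# Approximate US population shares by race (2015 estimates) -- hard-coded
# here for illustration only; exact figures should come from census.gov
us_pop_share = pd.Series({"White": 62.0, "Black": 13.0, "Hispanic/Latino": 18.0,
                          "Asian/Pacific Islander": 6.0, "Native American": 1.0})
# Align the shooting shares computed above with the population shares;
# categories without a census counterpart (e.g. Unknown) will show as NaN
comparison = pd.DataFrame({"share_of_killings": percentage,
                           "share_of_population": us_pop_share})
comparison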
The p_income column contains the median personal income by census area. Looking at it will tell us whether more of the shootings happened in less affluent or more affluent areas.
# Drop rows where p_income is unknown ('-') and convert to float
incomedf = police_killings[police_killings["p_income"] != '-']
income = incomedf["p_income"].astype(float)
income.hist(bins=20)
plt.suptitle("Histogram of median personal income by census area", fontsize=14, fontweight='bold')
plt.xlabel("Personal Income")
plt.ylabel("Frequency")
#Calculate the median income in the dataset
income.median()
The median personal income across the census areas where shootings occurred is $22,348, which supports the idea that shootings tend to happen in less affluent parts of the United States.
For comparison, the per capita income for the overall US population in 2008 was $26,984. (source)
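As a quick visual check, we could redraw the histogram and mark both figures on it. This is a minimal sketch; $26,984 is simply the per capita figure quoted above.
# Redraw the histogram with the dataset median and the 2008 US per capita
# income marked for comparison
income.hist(bins=20)
plt.axvline(income.median(), color="black", linestyle="--", label="Dataset median")
plt.axvline(26984, color="red", linestyle="--", label="US per capita income (2008)")
plt.xlabel("Personal Income")
plt.ylabel("Frequency")
plt.legend()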
Let's pull in some population data so we can look at a population-adjusted rate of shootings per state. If more people were shot in Texas than in Rhode Island, that doesn't automatically mean cops are more likely to shoot people in Texas; Texas simply has a much larger population than Rhode Island.
state_pop = pd.read_csv("Police killings/state_population.csv")
state_pop.head(6)
The Census data was obtained from here. Note that the 'state' column in the police_killings dataset only has state abbreviations, whereas the 'NAME' column in the census data has the full name of the state. We can work around this by using the 'state_fp' column in police_killings to match the 'STATE' column in the census data. The code below does that.
Let's create a series called counts which holds the number of occurrences of each state_fp value in police_killings.
counts = police_killings["state_fp"].value_counts()
counts
Let's turn counts into a dataframe with two columns: the index of counts becomes the 'STATE' column (since those are the state_fp values), and the values of counts become the 'shootings' column (the number of occurrences of each state_fp value).
states = pd.DataFrame({"STATE": counts.index, "shootings" : counts})
states
Now merge the two datasets (states and state_pop) on the 'STATE' column, which is common to both tables. To recap: states was built from counts, a series derived from the police_killings table, and state_pop is our census data.
states = states.merge(state_pop, on="STATE")
Next, we add a pop_millions column with the population divided by a million, and a rate column with the number of shootings divided by the population in millions, giving police killings per one million people in each state.
states["pop_millions"] = states["POPESTIMATE2015"] / 1000000
states["rate"] = states["shootings"] / states["pop_millions"]
states.sort("rate") #sorting by rate, states lowest to highest in police killings per million
We can also see the raw number of incidents in each state:
police_killings["state"].value_counts() #Number of incidents in each state
Here we create a new dataframe called g in which rows containing '-' (unknown) values in the racial-share columns are removed and those columns are converted to floats. Note that removing the unknown rows does add some bias to the data.
# Copy the filtered frame to avoid pandas' SettingWithCopyWarning
g = police_killings[(police_killings["share_white"] != '-') &
                    (police_killings["share_black"] != '-') &
                    (police_killings["share_hispanic"] != '-')].copy()
g["share_white"] = g["share_white"].astype(float)
g["share_black"] = g["share_black"].astype(float)
g["share_hispanic"] = g["share_hispanic"].astype(float)
With our merged table we can now look at the states with the highest and lowest rates, and at some interesting columns that tell us more about them. The two lists below were read off the sorted rate table above.
lowest_states = ["CT", "PA", "IA", "NY", "MA", "NH", "ME", "IL", "OH", "WI"]
highest_states = ["OK", "AZ", "NE", "HI", "AK", "ID", "NM", "LA", "CO", "DE"]
ls = g[g["state"].isin(lowest_states)] #states from g where the state is in list of 10 states with lowest rates
hs = g[g["state"].isin(highest_states)] #states from g where the state in in list of 10 states with highest rates
#Looking at some interesting columns for these states like pop, county_income, share_white, share_black,
# share_hispanic
columnlist = ["pop","county_income","share_white","share_black","share_hispanic"]
ls[columnlist].mean() #mean values across states with lowest killing rates
hs[columnlist].mean() #mean values across states with highest killing rates
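To compare the two groups directly, the means can be placed side by side in a single table (a small convenience step using the ls and hs frames defined above):
# Side-by-side comparison of mean values for the two groups of states
pd.DataFrame({"lowest_rate_states": ls[columnlist].mean(),
              "highest_rate_states": hs[columnlist].mean()})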
Inferences
If we look at the data above, we see that states with lower shooting rates tend to have a higher proportion of the population identifying as black, while states with higher shooting rates have a higher proportion of people identifying as hispanic. States with higher shooting rates also have lower median county incomes than states with lower shooting rates. An important caveat is that these columns contain county-level data for the places where the shootings occurred, which might differ from a true state-by-state comparison of the same measures. This may bias the data and lead to different observations.
Next Steps
For further analysis we could explore some of the columns that were not used in the analysis above, and integrate more external data sources:
- Data.gov - http://www.data.gov/
- Socrata - https://opendata.socrata.com/
- Github - https://github.com/caesar0301/awesome-public-datasets
- Census Data - https://www.census.gov
It would be interesting to map out the state-level data in a choropleth map with matplotlib. A choropleth map is a kind of thematic map used to display data that varies across geographic regions. Data values are usually mapped to different color saturations for numerical variables, or color hues for categorical variables; different patterns can also be used, but that is less common. Typical examples are maps showing election results.
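A sketch along the following lines could serve as a starting point. It assumes geopandas is installed and a US states boundary shapefile is available locally; the file name below refers to the census cartographic boundary files, and the 'NAME' merge key is an assumption about that file's schema.
import geopandas as gpd
# Load US state boundaries (the path is an assumption; the census
# cartographic boundary shapefiles from census.gov would work here)
us_states = gpd.read_file("cb_2015_us_state_20m.shp")
# Join the per-state shooting rates computed earlier on the full state name
merged = us_states.merge(states, on="NAME")
# Map the rate column to a sequential color scale
merged.plot(column="rate", cmap="OrRd", legend=True, figsize=(12, 6))
plt.title("Police killings per million people, Jan-Jun 2015")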
It would also be useful to look more into the cause column and see if there are any patterns. Looking more broadly at crime rates in the areas where the shootings occurred could be a good point of investigation as well.
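For example, a first pass at the cause column could simply tabulate the values:
police_killings["cause"].value_counts() # Distribution of causes of death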