Due in part to recent high-profile shootings of civilians by police in the US, the media and public have been scrutinizing police killings heavily. The team at FiveThirtyEight assembled a dataset combining crowdsourced data from the Guardian with census data. It contains information on each police killing in the US.

Each of the 467 rows in the dataset describes a police killing of a civilian in the US between January 2015 and June 2015. Let's see what the data looks like.

In [28]:
#read into pandas object and see first five rows
import pandas as pd
police_killings = pd.read_csv("Police killings/police_killings.csv",encoding="ISO-8859-1")
police_killings.head(5)
Out[28]:
name age gender raceethnicity month day year streetaddress city state ... share_hispanic p_income h_income county_income comp_income county_bucket nat_bucket pov urate college
0 A'donte Washington 16 Male Black February 23 2015 Clearview Ln Millbrook AL ... 5.6 28375 51367.0 54766 0.937936 3.0 3.0 14.1 0.097686 0.168510
1 Aaron Rutledge 27 Male White April 2 2015 300 block Iris Park Dr Pineville LA ... 0.5 14678 27972.0 40930 0.683411 2.0 1.0 28.8 0.065724 0.111402
2 Aaron Siler 26 Male White March 14 2015 22nd Ave and 56th St Kenosha WI ... 16.8 25286 45365.0 54930 0.825869 2.0 3.0 14.6 0.166293 0.147312
3 Aaron Valdez 25 Male Hispanic/Latino March 11 2015 3000 Seminole Ave South Gate CA ... 98.8 17194 48295.0 55909 0.863814 3.0 3.0 11.7 0.124827 0.050133
4 Adam Jovicic 29 Male White March 19 2015 364 Hiwood Ave Munroe Falls OH ... 1.7 33954 68785.0 49669 1.384868 5.0 4.0 1.9 0.063550 0.403954

5 rows × 34 columns

Since our dataset is very wide, it's useful to just see all columns for a single row of data.

In [29]:
print(police_killings.iloc[0,:]) 
name                             A'donte Washington
age                                              16
gender                                         Male
raceethnicity                                 Black
month                                      February
day                                              23
year                                           2015
streetaddress                          Clearview Ln
city                                      Millbrook
state                                            AL
latitude                                    32.5296
longitude                                  -86.3628
state_fp                                          1
county_fp                                        51
tract_ce                                      30902
geo_id                                   1051030902
county_id                                      1051
namelsad                        Census Tract 309.02
lawenforcementagency    Millbrook Police Department
cause                                       Gunshot
armed                                            No
pop                                            3779
share_white                                    60.5
share_black                                    30.5
share_hispanic                                  5.6
p_income                                      28375
h_income                                      51367
county_income                                 54766
comp_income                                0.937936
county_bucket                                     3
nat_bucket                                        3
pov                                            14.1
urate                                     0.0976864
college                                     0.16851
Name: 0, dtype: object

Some of the interesting columns in the dataset are:

  • name -- the name of the civilian.
  • age -- the age of the civilian.
  • gender -- the gender of the civilian.
  • raceethnicity -- the race and ethnicity of the civilian.
  • month, day, and year -- when the shooting occurred.
  • streetaddress, city, state -- where the shooting occurred.
  • lawenforcementagency -- the agency that was involved.
  • cause -- the cause of death.
  • armed -- whether or not the civilian was armed.
  • pop -- population of the census area where the incident occurred.
  • county_income -- median household income in the county.

Let us explore the incidents by race.

In [30]:
police_killings["raceethnicity"].value_counts()
Out[30]:
White                     236
Black                     135
Hispanic/Latino            67
Unknown                    15
Asian/Pacific Islander     10
Native American             4
Name: raceethnicity, dtype: int64
In [31]:
#Visualise the above incidents

%matplotlib inline
import matplotlib.pyplot as plt
numbers = police_killings["raceethnicity"].value_counts()
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.set_ylabel('Deaths')
fig.suptitle('Incidents by Race', fontsize=14, fontweight='bold')
plt.xticks(range(6),numbers.index, rotation="vertical")

plt.bar(range(6), numbers)
Out[31]:

Let's see how this breakdown compares to the population breakdown of the US.

In [32]:
#Explore deaths as a percentage

percentage = numbers/sum(numbers) * 100
percentage
Out[32]:
White                     50.535332
Black                     28.907923
Hispanic/Latino           14.346895
Unknown                    3.211991
Asian/Pacific Islander     2.141328
Native American            0.856531
Name: raceethnicity, dtype: float64

People identified as Black are far overrepresented in these shootings relative to their share of the total US population: about 29% of the victims are Black, while Black people represent only ~12% of the total US population. (Source)
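To make the comparison concrete, we can put the victim shares next to approximate population shares. The population percentages below are rough illustrative figures, not values from this dataset; verify them against census.gov before citing.

```python
import pandas as pd

# Victim counts by race from the value_counts above (Unknown excluded)
deaths = pd.Series({"White": 236, "Black": 135, "Hispanic/Latino": 67,
                    "Asian/Pacific Islander": 10, "Native American": 4})
death_pct = deaths / deaths.sum() * 100

# Approximate 2015 US population shares in percent -- rough figures used
# for illustration only
pop_pct = pd.Series({"White": 62.0, "Black": 13.0, "Hispanic/Latino": 17.0,
                     "Asian/Pacific Islander": 5.5, "Native American": 1.0})

comparison = pd.DataFrame({"death_pct": death_pct.round(1), "pop_pct": pop_pct})
comparison["ratio"] = (comparison["death_pct"] / comparison["pop_pct"]).round(2)
print(comparison)
```

A ratio above 1 means the group is overrepresented among victims relative to its population share.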

The p_income column is an interesting column in our dataset and contains the median personal income by census area. Looking at this will tell us if more of the shootings happened in less affluent areas or more affluent areas.

In [33]:
incomedf = police_killings[police_killings["p_income"]!='-']
income = incomedf["p_income"]
income.astype(float).hist(bins=20)
plt.suptitle("Histogram of median personal income by census area", fontsize=14, fontweight='bold')
plt.xlabel("Personal Income")
plt.ylabel("Frequency")
Out[33]:
In [34]:
#Calculate the median income in the dataset
income.median()
Out[34]:
22348.0

This confirms the idea that the shootings tend to happen in less affluent areas of the United States: the census areas where the shootings occurred have a median personal income of $22,348.

The per capita income for the overall population in 2008 was $26,984. (source)

Let's pull in some population data to look at a population-adjusted rate of shootings per state. If more people were shot in Texas than in Rhode Island, it doesn't automatically mean that cops are more likely to shoot people in Texas, since Texas has a much larger population than Rhode Island.

In [35]:
state_pop = pd.read_csv("Police killings/state_population.csv")
state_pop.head(6)
Out[35]:
SUMLEV REGION DIVISION STATE NAME POPESTIMATE2015 POPEST18PLUS2015 PCNT_POPEST18PLUS
0 10 0 0 0 United States 321418820 247773709 77.1
1 40 3 6 1 Alabama 4858979 3755483 77.3
2 40 4 9 2 Alaska 738432 552166 74.8
3 40 4 8 4 Arizona 6828065 5205215 76.2
4 40 3 7 5 Arkansas 2978204 2272904 76.3
5 40 4 9 6 California 39144818 30023902 76.7

The census data was obtained from here. Note that the 'state' column in the police_killings dataset only has state abbreviations, whereas the 'NAME' column in this census data has the full name of each state. We can work around this by using the 'state_fp' column in police_killings to match the 'STATE' column in the census data. The code below does that.

Let's create a series called counts which holds the number of occurrences of each state_fp value in police_killings.

In [36]:
counts = police_killings["state_fp"].value_counts()
counts
Out[36]:
6     74
48    47
12    29
4     25
40    22
13    16
36    13
8     12
34    11
53    11
22    11
17    11
39    10
29    10
24    10
37    10
26     9
45     9
51     9
41     8
18     8
1      8
42     7
21     7
20     6
27     6
28     6
31     6
47     6
25     5
49     5
55     5
35     5
16     4
15     4
5      4
32     3
30     2
19     2
2      2
54     2
10     2
33     1
9      1
11     1
23     1
56     1
Name: state_fp, dtype: int64

Let's put counts into a dataframe with two columns: the index of counts becomes the 'STATE' column (since it holds the state_fp values), and the values of counts become the 'shootings' column (since they count the occurrences of each state_fp).

In [37]:
states = pd.DataFrame({"STATE": counts.index, "shootings" : counts})
states
Out[37]:
STATE shootings
6 6 74
48 48 47
12 12 29
4 4 25
40 40 22
13 13 16
36 36 13
8 8 12
34 34 11
53 53 11
22 22 11
17 17 11
39 39 10
29 29 10
24 24 10
37 37 10
26 26 9
45 45 9
51 51 9
41 41 8
18 18 8
1 1 8
42 42 7
21 21 7
20 20 6
27 27 6
28 28 6
31 31 6
47 47 6
25 25 5
49 49 5
55 55 5
35 35 5
16 16 4
15 15 4
5 5 4
32 32 3
30 30 2
19 19 2
2 2 2
54 54 2
10 10 2
33 33 1
9 9 1
11 11 1
23 23 1
56 56 1
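An equivalent, more compact way to build this dataframe is to name the index of the value-counts series and reset it into a column. A small sketch with stand-in values (in the notebook, counts is police_killings["state_fp"].value_counts()):

```python
import pandas as pd

# Small stand-in for counts = police_killings["state_fp"].value_counts()
counts = pd.Series({6: 74, 48: 47, 12: 29})

# rename_axis names the index; reset_index turns it into a regular column
states = counts.rename_axis("STATE").reset_index(name="shootings")
print(states)
```

This also gives states a clean 0..n integer index instead of the duplicated state_fp index shown above.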

Now merge the two datasets (states and state_pop) on the 'STATE' column, since it is common to both tables. To recapitulate: states comes from counts, a series derived from the police_killings table, and state_pop is our census data.

In [38]:
states = states.merge(state_pop, on="STATE")

Next, we add a pop_millions column holding the population divided by a million, and a rate column, the number of shootings divided by the population in millions, which gives police killings per one million people in each state.

In [39]:
states["pop_millions"] = states["POPESTIMATE2015"] / 1000000
states["rate"] = states["shootings"] / states["pop_millions"]
states.sort_values("rate") #sorting by rate, states lowest to highest in police killings per million
Out[39]:
STATE shootings SUMLEV REGION DIVISION NAME POPESTIMATE2015 POPEST18PLUS2015 PCNT_POPEST18PLUS pop_millions rate
43 9 1 40 1 1 Connecticut 3590886 2826827 78.7 3.590886 0.278483
22 42 7 40 1 2 Pennsylvania 12802503 10112229 79.0 12.802503 0.546768
38 19 2 40 2 4 Iowa 3123899 2395103 76.7 3.123899 0.640226
6 36 13 40 1 2 New York 19795791 15584974 78.7 19.795791 0.656705
29 25 5 40 1 1 Massachusetts 6794422 5407335 79.6 6.794422 0.735898
42 33 1 40 1 1 New Hampshire 1330608 1066610 80.2 1.330608 0.751536
45 23 1 40 1 1 Maine 1329328 1072948 80.7 1.329328 0.752260
11 17 11 40 2 3 Illinois 12859995 9901322 77.0 12.859995 0.855366
12 39 10 40 2 3 Ohio 11613423 8984946 77.4 11.613423 0.861073
31 55 5 40 2 3 Wisconsin 5771337 4476711 77.6 5.771337 0.866350
16 26 9 40 2 3 Michigan 9922576 7715272 77.8 9.922576 0.907023
28 47 6 40 3 6 Tennessee 6600299 5102688 77.3 6.600299 0.909050
15 37 10 40 3 5 North Carolina 10042802 7752234 77.2 10.042802 0.995738
36 32 3 40 4 8 Nevada 2890845 2221681 76.9 2.890845 1.037759
18 51 9 40 3 5 Virginia 8382993 6512571 77.7 8.382993 1.073602
40 54 2 40 3 5 West Virginia 1844128 1464532 79.4 1.844128 1.084523
25 27 6 40 2 4 Minnesota 5489594 4205207 76.6 5.489594 1.092977
20 18 8 40 2 3 Indiana 6619680 5040224 76.1 6.619680 1.208518
8 34 11 40 1 2 New Jersey 8958013 6959192 77.7 8.958013 1.227951
35 5 4 40 3 7 Arkansas 2978204 2272904 76.3 2.978204 1.343091
2 12 29 40 3 5 Florida 20271272 16166143 79.7 20.271272 1.430596
44 11 1 40 3 5 District of Columbia 672228 554121 82.4 0.672228 1.487591
9 53 11 40 4 9 Washington 7170351 5558509 77.5 7.170351 1.534095
5 13 16 40 3 5 Georgia 10214860 7710688 75.5 10.214860 1.566346
23 21 7 40 3 6 Kentucky 4425092 3413425 77.1 4.425092 1.581888
13 29 10 40 2 4 Missouri 6083672 4692196 77.1 6.083672 1.643744
21 1 8 40 3 6 Alabama 4858979 3755483 77.3 4.858979 1.646436
14 24 10 40 3 5 Maryland 6006401 4658175 77.6 6.006401 1.664891
30 49 5 40 4 8 Utah 2995919 2083423 69.5 2.995919 1.668937
46 56 1 40 4 8 Wyoming 586107 447212 76.3 0.586107 1.706173
1 48 47 40 3 7 Texas 27469114 20257343 73.7 27.469114 1.711013
17 45 9 40 3 5 South Carolina 4896146 3804558 77.7 4.896146 1.838180
0 6 74 40 4 9 California 39144818 30023902 76.7 39.144818 1.890416
37 30 2 40 4 8 Montana 1032949 806529 78.1 1.032949 1.936204
19 41 8 40 4 9 Oregon 4028977 3166121 78.6 4.028977 1.985616
26 28 6 40 3 6 Mississippi 2992333 2265485 75.7 2.992333 2.005124
24 20 6 40 2 4 Kansas 2911641 2192084 75.3 2.911641 2.060694
41 10 2 40 3 5 Delaware 945934 741548 78.4 0.945934 2.114312
7 8 12 40 4 8 Colorado 5456574 4199509 77.0 5.456574 2.199182
10 22 11 40 3 7 Louisiana 4670724 3555911 76.1 4.670724 2.355095
32 35 5 40 4 8 New Mexico 2085109 1588201 76.2 2.085109 2.397956
33 16 4 40 4 8 Idaho 1654930 1222093 73.8 1.654930 2.417021
39 2 2 40 4 9 Alaska 738432 552166 74.8 0.738432 2.708442
34 15 4 40 4 9 Hawaii 1431603 1120770 78.3 1.431603 2.794071
27 31 6 40 2 4 Nebraska 1896190 1425853 75.2 1.896190 3.164240
3 4 25 40 4 8 Arizona 6828065 5205215 76.2 6.828065 3.661359
4 40 22 40 3 7 Oklahoma 3911338 2950017 75.4 3.911338 5.624674

We can also see the number of incidents in each state:

In [40]:
police_killings["state"].value_counts() #Number of incidents in each state
Out[40]:
CA    74
TX    46
FL    29
AZ    25
OK    22
GA    16
NY    14
CO    12
IL    11
LA    11
NJ    11
WA    11
NC    10
MO    10
OH    10
MD    10
VA     9
SC     9
MI     9
OR     8
IN     8
AL     8
PA     7
KY     7
TN     6
NE     6
MS     6
KS     6
MN     6
UT     5
WI     5
NM     5
MA     5
ID     4
HI     4
AR     4
NV     3
AK     2
WV     2
IA     2
DE     2
MT     2
ME     1
CT     1
NH     1
DC     1
WY     1
Name: state, dtype: int64

Here we create a new dataframe called g where rows with '-' values in certain columns are removed and those columns are converted to floats. Note that removing rows with '-' (unknown) values does add bias to the data.

In [41]:
g = police_killings[(police_killings["share_white"]!='-') & (police_killings["share_black"]!='-') &
                    (police_killings["share_hispanic"]!='-')].copy() #.copy() gives an independent frame and avoids SettingWithCopyWarning

g["share_white"] = g["share_white"].astype(float)
g["share_black"] = g["share_black"].astype(float)
g["share_hispanic"] = g["share_hispanic"].astype(float)
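Older pandas versions emit a SettingWithCopyWarning for this filter-then-assign pattern. A sketch of an alternative using pd.to_numeric, which handles the '-' placeholders without manual filtering (toy data for illustration; in the notebook the columns come from police_killings):

```python
import pandas as pd

# Toy frame mimicking the share_* columns, with '-' as a missing-value marker
df = pd.DataFrame({"share_white": ["60.5", "-", "16.8"],
                   "share_black": ["30.5", "0.5", "-"]})

# .copy() gives an independent frame (no SettingWithCopyWarning), and
# errors="coerce" turns '-' into NaN instead of raising a ValueError
g = df.copy()
for col in ["share_white", "share_black"]:
    g[col] = pd.to_numeric(g[col], errors="coerce")
g = g.dropna()  # drop rows where any share is unknown
print(g)
```

Coercing to NaN first and dropping afterwards keeps the cleaning logic in one place and scales to any number of columns.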

With our merged table we can now look at the states with the highest and lowest rates, along with some columns that tell us more about them. Here are some inferences:

In [42]:
lowest_states = ["CT", "PA", "IA", "NY", "MA", "NH", "ME", "IL", "OH", "WI"] 
highest_states = ["OK", "AZ", "NE", "HI", "AK", "ID", "NM", "LA", "CO", "DE"]

ls = g[g["state"].isin(lowest_states)] #rows from g where the state is in the list of 10 states with lowest rates
hs = g[g["state"].isin(highest_states)] #rows from g where the state is in the list of 10 states with highest rates
In [43]:
#Looking at some interesting columns for these states like pop, county_income, share_white, share_black,
# share_hispanic

columnlist = ["pop","county_income","share_white","share_black","share_hispanic"]

ls[columnlist].mean() #mean values across states with lowest killing rates
Out[43]:
pop                4201.660714
county_income     54830.839286
share_white          60.616071
share_black          21.257143
share_hispanic       12.948214
dtype: float64
In [44]:
hs[columnlist].mean() #mean values across states with highest killing rates
Out[44]:
pop                4315.750000
county_income     48706.967391
share_white          55.652174
share_black          11.532609
share_hispanic       20.693478
dtype: float64

Inferences

If we look at the data above, we see that states with lower shooting rates tend to have a higher proportion of the population identifying as Black, while states with higher shooting rates have a larger proportion of people identifying as Hispanic. States with higher shooting rates also have lower median county incomes than states with lower shooting rates.

An interesting thing to note is that these columns contain county-level data for where each shooting occurred, which can differ from a state-by-state comparison of the same variables. This may bias the data and lead to different observations.

Next Steps

For further analysis, we could explore some of the columns that were not covered above and integrate more external data sources:

  • Data.gov - http://www.data.gov/
  • Socrata - https://opendata.socrata.com/
  • Github - https://github.com/caesar0301/awesome-public-datasets
  • Census Data - https://www.census.gov

It would be interesting to map out state-level data in a choropleth map with matplotlib. A choropleth map is a thematic map used to display data that varies across geographic regions. Data values are usually mapped to different color saturations for numerical variables or color hues for categorical variables. Different patterns can also be used, but that is less common. Typical examples are maps that show election results.
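matplotlib has no built-in choropleth: drawing the state polygons requires a mapping library (e.g. basemap or geopandas, not shown here). As a sketch of just the rate-to-color mapping step, with hypothetical per-state rates and an illustrative colormap choice:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

# Hypothetical per-state rates (killings per million), as in the table above
rates = {"CT": 0.28, "CA": 1.89, "OK": 5.62}

# Normalize rates to [0, 1] and map them onto a sequential colormap
norm = mcolors.Normalize(vmin=min(rates.values()), vmax=max(rates.values()))
cmap = plt.get_cmap("Reds")
state_colors = {state: cmap(norm(rate)) for state, rate in rates.items()}

# Each value is an RGBA tuple that a mapping library could use to fill
# the corresponding state's polygon
print(state_colors["OK"])
```

A sequential colormap like "Reds" suits a numerical variable such as rate; a categorical variable would instead map to distinct hues.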

It would also be useful to look more into the cause column and see if there are any patterns. Looking more broadly at crime rates where the shootings occurred could be a good point of investigation as well.
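As a starting point, the same value_counts approach used for raceethnicity applies to the cause column. A minimal sketch with illustrative values (the real categories come from police_killings["cause"]; only "Gunshot" is confirmed by the row shown earlier):

```python
import pandas as pd

# Illustrative stand-in for police_killings["cause"]
cause = pd.Series(["Gunshot", "Gunshot", "Taser", "Death in custody", "Gunshot"])
print(cause.value_counts())
```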


