Within weeks of the first SARS-CoV-2 cases being identified in the United States, it became apparent that states and counties were being affected to very different degrees. The USNS hospital ship Comfort arrived at Pier 88 in New York Harbor on Monday, March 30, while life in much of rural America remained largely unaffected. This disparity continued through the summer and into the fall, as state and local governments took an array of approaches to balancing the public health impact of COVID-19 against the economic impact.
In the following analysis, I will attempt to show the varying degrees of impact COVID-19 has had on six states: North Carolina, Maryland, California, New York, Texas, and Michigan.
These states represent a variety of demographics and geographies. By analyzing their death tolls alongside attributes such as geography and population density, I hope to identify causal relationships that will be useful in building a predictive model.
Pandas and GeoPandas will be used for data manipulation, Bokeh for interactive plotting, statsmodels for correlation calculations, and scikit-learn for building the machine learning models.
#!pip install jupyterthemes folium geopandas
"""""
# widen the notebook container for easier viewing
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:60% !important; }</style>"))
import requests
import pandas as pd, geopandas as gpd
import geojson, json
import numpy as np
import censusdata
from datetime import datetime
import bokeh.plotting as bkplt
from bokeh.io import output_notebook, show, curdoc
from bokeh.resources import INLINE
from bokeh.layouts import gridplot as grid, row, column
from bokeh.models import LinearAxis, Range1d, Div
from bokeh.themes import built_in_themes, Theme
from bokeh import palettes
from sklearn import preprocessing
from statsmodels.stats.weightstats import DescrStatsW as dwStats
from sklearn.tree import DecisionTreeRegressor as dTree
from sklearn.model_selection import cross_val_score, ShuffleSplit, KFold
from sklearn.linear_model import LinearRegression
""""
bkplt.output_notebook(hide_banner = True)
# Defining state color palette using bokeh set3 palette
colors = palettes.Set3[6]
states = ['North Carolina', 'Maryland', 'California', 'New York', 'Texas', 'Michigan']
# applying custom bokeh theme to match jupyter theme
with open('customTheme.json') as f:
    d = json.load(f)
curdoc().theme = Theme(json = d)
COVID-19 data is sourced from the Novel Corona Virus 2019 Dataset on Kaggle, GeoJSON county-level files come from Opendatasoft, and US census data is pulled from the CensusData API.
First, using pandas, we read the COVID-19 dataset into a DataFrame, pandas' own two-dimensional data structure.
cov19df = pd.read_csv("time_series_covid_19_deaths_US.csv")
#reading df and displaying last two columns as a sample
cov19df.iloc[-2:]
Next, using GeoPandas, we read the geoJSON files downloaded from ODS into a GeoDataFrame. This is essentially a pandas DataFrame with defined columns for geometry, and the GeoPandas library makes the conversion from geoJSON to DataFrame easy. You may notice the line:
mdGdf.at[15, 'name'] = 'Baltimore City'
This is an example of data pre-processing: because Baltimore City and Baltimore County were both marked as 'Baltimore' in the name column of our GeoDataFrame, later merges with the COVID-19 dataset (which contains both a 'Baltimore' and a 'Baltimore City') would be unsuccessful or incomplete if the duplicate were not resolved. Manually fixing data can be a painstaking yet necessary part of data science.
#reading in geojson files using geopandas
nc_geo = "ncCounties.geojson"
tx_geo = "txCounties.geojson"
ny_geo = "nyCounties.geojson"
ca_geo = "caCounties.geojson"
md_geo = 'mdCounties.geojson'
mi_geo = 'miCounties.geojson'
ncGdf = gpd.read_file(nc_geo)
txGdf = gpd.read_file(tx_geo)
nyGdf = gpd.read_file(ny_geo)
caGdf = gpd.read_file(ca_geo)
mdGdf = gpd.read_file(md_geo)
miGdf = gpd.read_file(mi_geo)
#showing duplicated county name 'Baltimore' which needs to be resolved
mdGdf[mdGdf['name'] == 'Baltimore']
#Changing the county name to Baltimore City to match other data sources, and showing that the issue has been resolved
mdGdf.at[15, 'name'] = 'Baltimore City'
mdGdf[mdGdf['name'] == 'Baltimore']
Last, we will use the CensusData API to collect population figures for each county in our selected states; these numbers will also be stored in a DataFrame. To use the CensusData API we need the ANSI FIPS codes for our states, which are conveniently located in the GeoDataFrames we just produced.
#function to get county data using CensusData api
def censusData(stateCode):
    censusDf = censusdata.download('acs5', 2015, censusdata.censusgeo(
        [('state', stateCode), ('county', '*')]), ['B01001_001E'])
    censusDf = censusDf.reset_index()
    censusDf['name'] = censusDf.apply(lambda x: str(x['index']).split(" Count")[0], axis = 1)
    censusDf = censusDf.rename(columns = {'B01001_001E': 'pop'}).drop(['index'], axis = 1)
    return censusDf
#using census api to get population for all state counties
ncCensus = censusData(ncGdf.iloc[0]['statefp'])
mdCensus = censusData(mdGdf.iloc[0]['statefp'])
txCensus = censusData(txGdf.iloc[0]['statefp'])
nyCensus = censusData(nyGdf.iloc[0]['statefp'])
caCensus = censusData(caGdf.iloc[0]['statefp'])
miCensus = censusData(miGdf.iloc[0]['statefp'])
ncCensus.iloc[-2:]
To make things easier, let's gather our DataFrames in a single data structure.
#defining dictionary of dataframes
frames = {"Maryland":{"census": mdCensus, "gdf": mdGdf},
"North Carolina":{"census": ncCensus, "gdf": ncGdf},
"Michigan":{"census": miCensus, "gdf": miGdf},
"Texas":{"census": txCensus, "gdf": txGdf},
"New York":{"census": nyCensus, "gdf": nyGdf},
"California":{"census": caCensus, "gdf": caGdf}}
Now we have our geographic and census DataFrames stored together, but our COVID-19 dataset is still in an unusable form. Before we separate it into our different states, we need to make it tidy. Currently the DataFrame has one row per county, a column for every day, and other attributes in additional columns; to make this data tidy, we need to turn every observation (every day, for every county) into its own row. To do this we use melt, a pandas function that un-pivots a DataFrame. We do this for every state and then add these frames to our dictionary of frames.
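As a quick, hedged illustration of what melt does before applying it to the real data (the county names and values below are made up, not taken from the dataset):
# Toy illustration of melt: a wide frame with one column per date
wide = pd.DataFrame({'name': ['A', 'B'], '3/1/20': [0, 1], '3/2/20': [2, 3]})
tidy = pd.melt(wide, id_vars = 'name', value_vars = ['3/1/20', '3/2/20'],
               var_name = 'date', value_name = 'deaths')
# tidy now has one row per (county, date) observation:
#   name    date  deaths
# 0    A  3/1/20       0
# 1    B  3/1/20       1
# 2    A  3/2/20       2
# 3    B  3/2/20       3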
for f in frames:
    #select the state's covid data
    state_covid_data = cov19df[cov19df['Province_State'] == f]
    # using melt
    state_covid_data = state_covid_data.rename(columns = {"Admin2": "name"})
    state_covid_data = pd.melt(state_covid_data, id_vars = 'name', value_vars = list(state_covid_data.columns[12:332].values),
                               var_name = 'date', value_name = 'deaths')
    #saving melted covid dataframes to the dictionary
    frames[f]['covid'] = state_covid_data
Now that we have all of our data, we could go straight to plotting and analysis, but to make our lives easier we should join these frames. We will use merge, the pandas function analogous to a SQL join. Since every row will now contain the state data as well, we can drop the dictionary holding our frames and append them together into a single DataFrame.
df = pd.DataFrame()
#combining all state frames into a single dataframe (state_name, name) is a working primary key
for f in frames:
    df = df.append(frames[f]['covid'].merge(frames[f]['census'], on = 'name', how = 'inner').merge(frames[f]['gdf'], on = 'name', how = 'inner'))
df.iloc[-2:]
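One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the same combination can be built with pd.concat. A sketch of the equivalent construction:
#equivalent construction with pd.concat, for pandas versions without DataFrame.append
merged_frames = [frames[f]['covid']
                 .merge(frames[f]['census'], on = 'name', how = 'inner')
                 .merge(frames[f]['gdf'], on = 'name', how = 'inner')
                 for f in frames]
df = pd.concat(merged_frames, ignore_index = True)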
The date is currently represented as a string, which isn't optimal, so let's convert it to a datetime now.
# lambda func application to add datetime column
df['datetime'] = df.apply(lambda x: datetime.strptime(x['date'], '%m/%d/%y'), axis = 1)
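A row-wise apply works, but the same conversion can be done in one vectorized call with pd.to_datetime, which is usually faster on large frames. A sketch of the alternative (it yields pandas Timestamps, which behave the same for everything we do below):
# vectorized alternative to the row-wise apply above
df['datetime'] = pd.to_datetime(df['date'], format = '%m/%d/%y')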
Since population density will be of use to us later, let's go ahead and calculate it now. The column aland, originally obtained from our GeoJSON files, is the land area of each county in $m^{2}$. I'll be working in $km^{2}$, so you'll notice a factor of $1000^{2} = 1000000$ coming into play. We also want state death totals for each day, so I'll compute those as well using a groupby operation, which, like merge, is analogous to the SQL statement of the same name.
#groupby operation to make a separate state dataframe
dfTotals = df.groupby(by = ['datetime', 'state_name']).sum().reset_index()
df = df.reset_index(drop = True)
#lambda application for population density for both dfs
df['pDensity'] = df.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
dfTotals['pDensity'] = dfTotals.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
#per capita lambda application
dfTotals['perCap'] = dfTotals.apply(lambda x: (x['deaths']/x['pop'])*1000, axis = 1)
df['perCap'] = df.apply(lambda x: (x['deaths']/x['pop'])*1000, axis = 1)
#sample rows displayed
display(df.iloc[-2:], dfTotals.iloc[-6:])
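As with the date conversion, these row-wise lambdas can also be written as plain column arithmetic, which pandas vectorizes. A sketch of the equivalent calculations:
# vectorized equivalents of the lambda applications above
df['pDensity'] = df['pop'] / df['aland'] * 1000000
df['perCap'] = df['deaths'] / df['pop'] * 1000
dfTotals['pDensity'] = dfTotals['pop'] / dfTotals['aland'] * 1000000
dfTotals['perCap'] = dfTotals['deaths'] / dfTotals['pop'] * 1000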
Now that we have county and state level DataFrames of tidy data, we can move onto analysis.
I'll primarily be using Bokeh for visualization, since Bokeh plots remain interactive once exported to HTML, which matplotlib and seaborn do not, at least not in an easy-to-apply fashion. To get a visual feel for some trends, I'll plot three things for each state over time: cumulative deaths, cumulative deaths per capita, and cumulative deaths controlling for population density.
To plot efficiently, I'll zip two lists: the names of my states, and my chosen colors from the palette I defined during the environment setup.
# zip object holding state names and colors
states_colors = zip(states, colors)
# making plots using bokeh
f1 = bkplt.figure(plot_height = 550, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths',
title ="Cumulative Deaths for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
f2 = bkplt.figure(plot_height = 550, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
title ="Cumulative Deaths Per Capita for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
f3 = bkplt.figure(plot_height = 550, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Population Density',
title ="Cumulative Deaths Per Population Density for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
for state, color in states_colors:
    df1 = dfTotals[dfTotals['state_name'] == state].copy()
    df1['var3'] = df1.apply(lambda x: x['deaths']/x['pDensity'], axis = 1)
    f1.line(df1['datetime'], df1['deaths'], line_width = 6, alpha = .8, color = color, legend_label = state)
    f2.line(df1['datetime'], df1['perCap'], line_width = 6, alpha = .8, color = color, legend_label = state)
    f3.line(df1['datetime'], df1['var3'], line_width = 6, alpha = .8, color = color, legend_label = state)
for f in [f1, f2, f3]:
    f.legend.location = "top_left"
    f.legend.title = "Click To Hide"
    f.legend.title_text_color = "White"
    f.legend.click_policy = "hide"
show(column(f1,f2,f3, sizing_mode = 'stretch_width'))
The extremely fast rise of deaths in New York immediately pops out, both on an absolute and a per capita basis. Viewing certain pairs of states also makes for interesting comparisons. Try hiding every state besides California, Texas, and Michigan in the per capita plot: we see an initial surge from Michigan, while Texas and California remain tightly paired. As the weather heats up, though, deaths in Texas rise dramatically. I would guess this is due to more people congregating inside during the summer months, compared to the more comfortable summers of California and Michigan. The trend reverses as the weather cools off and Michigan deaths begin to rise dramatically; I would assume this is a similar phenomenon, as gatherings move indoors once the weather becomes unbearable and more people catch COVID-19 inside. When controlling for population density in the third graph, the problem with comparing states to each other becomes apparent. Texas deaths skyrocket, but this is almost certainly because its many largely uninhabited counties with extremely large land areas skew the state's population density far lower than the density the average Texan actually lives at. It is therefore clearly necessary to do further exploratory data analysis at the county level before attempting to build and train a model.
To get a better understanding of what's going on at the county level, I'll make two plots for each state: one displaying deaths per capita for each county, and one displaying deaths while controlling for population density, as in the last two statewide plots.
plots = []
states_colors = zip(states, colors)
for state, color in states_colors:
    f2 = bkplt.figure(plot_height = 500, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
                      title = "Deaths Per Capita for All Counties in " + state + ", Top 5 Deadliest Counties Selected", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
    # temporary frame of just a single state
    df1 = df[df['state_name'] == state].copy()
    # sorting observations by time
    df1 = df1.sort_values(by = 'datetime')
    #finding number of counties
    county_count = len(df1['name'].unique())
    #finding top 5 deadliest counties by sorting by deaths on the last day
    top5 = list(df1[-county_count:].sort_values(by = 'deaths')['name'][-5:])
    #sorting top5 descending
    top5.reverse()
    #for each county
    for s in df1['name'].unique():
        df2 = df1[df1['name'] == s].copy()
        #if county is top 5 deadliest, make line slightly more pronounced and add to legend, giving option to highlight the county
        if s in top5:
            f2.line(df2['datetime'], df2['perCap'], alpha = .2, color = color, line_width = 3, muted_alpha = 1, legend_label = s, muted_line_width = 6)
            continue
        #else plot line normally
        f2.line(df2['datetime'], df2['perCap'], alpha = .05, color = color, line_width = 2)
    #style code and legend click policy
    for f in [f2]:
        f.legend.location = "top_left"
        f.legend.title = "Click To Hide"
        f.legend.title_text_color = "White"
        f.legend.click_policy = "mute"
        f.legend.title = "Top 5 " + state + " Counties (by deaths): Click To Highlight"
    #add state's plot to list of plots
    plots.append(f2)
#display a column of state plots
show(column(plots, sizing_mode = 'stretch_width'))
#display a column of state plots
show(column(plots, sizing_mode = 'stretch_width'))
Now that we can see all the counties together, it is clear that a one-size-fits-all model for each state is insufficient. While counties with more deaths may follow similar patterns, smaller or less densely populated counties heavily skew the statewide results. For example, looking at the Texas counties, four of the top five deadliest counties followed very similar trends, but Hidalgo County behaved quite differently. This suggests there are differences at the county level that we must account for if we attempt to model the spread of COVID-19. Similarly, the top five counties in New York have much higher death rates than most of the other counties, likely because of the high density of people living in and around New York City, while those living in upstate New York are much more spread out.
Let's try to determine whether there is a relationship between per capita death rate and population density by plotting the final per capita death rate of every county on one axis and its population density on the other.
plots = []
states_colors = zip(states, colors)
covar = []
names = []
for state, color in states_colors:
    f = bkplt.figure(plot_height = 450, x_axis_label = 'Deaths per Capita', y_axis_label = 'Population Density',
                     title = 'Pop. Density vs Deaths per Capita on 12/6/2020 for ' + state + ' Counties',)
    df1 = df[df['state_name'] == state]
    final_date = df1['datetime'].max()
    df1 = df1[df1['datetime'] == final_date]
    #finding average deaths in state to normalize dot size later
    avg_deaths = df1['deaths'].mean()
    #plotting points, using 1 + deaths/avg_deaths as the dot size
    f.scatter(df1['perCap'], df1['pDensity'], fill_color = color, alpha = .6, size = 1 + (df1['deaths']/avg_deaths))
    stats = dwStats(df1[['perCap', 'pDensity']], df1['deaths'])
    names.append(state)
    #calculating correlation coefficient for the state and adding it to the list
    covar.append(stats.corrcoef[0][1])
    plots.append(f)
#making bar chart of states and their correlation coefficient
f2 = bkplt.figure(plot_height = 300, x_range = names, x_axis_label = 'State', y_axis_label = 'Correlation Coefficient',
                  title = 'Correlation Between Per Capita Deaths and Population Density',)
f2.vbar(x = names, top = covar, width = .8, color = colors)
plots.append(f2)
show(column(plots, sizing_mode = 'stretch_width'))
Since the correlation coefficient varies by state, if we want to use this relationship in our model it would be best either to include an interaction term between the state and the county population when using a linear model, or to use a more complex model like a decision tree. We still have a rough view of which factors determine how a county is affected by COVID-19, but let's proceed and attempt to build a model.
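As a minimal sketch of what such an interaction term could look like for a linear model (here interacting state with population density as the county-level variable; the column names produced are illustrative and not part of the original analysis):
#hypothetical sketch: one-hot encode the state and interact it with population density,
#so a linear model can learn a separate density coefficient per state
state_dummies = pd.get_dummies(df['state_name'], prefix = 'state')
interactions = state_dummies.mul(df['pDensity'], axis = 0).add_suffix('_x_pDensity')
X_interact = pd.concat([df[['pDensity']], state_dummies, interactions], axis = 1)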
A decision tree regressor seems to make sense for this problem for several reasons. We have a mix of numeric data such as dates, deaths, latitude, longitude, population, and land area, and we don't know exactly how these variables relate to each other or to the problem of modeling COVID-19 deaths. Other models would require engineering many interaction features, and the resulting model would likely be hard to interpret even if it were accurate. A decision tree lets us build a model that can actually be interpreted visually, without a large amount of data preprocessing.
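As a hedged illustration of that visual interpretability, scikit-learn can draw a fitted tree directly. A sketch using the shallow regressor tree that is defined and fit a few cells below (matplotlib is assumed to be available):
#sketch: drawing the top levels of the fitted shallow tree
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize = (20, 8))
plot_tree(tree, feature_names = ['numDate', 'lat', 'lon', 'aland', 'pop'],
          max_depth = 2, filled = True, ax = ax)
plt.show()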
To avoid overfitting, we will use cross-validation to train our model, selecting hyperparameters to achieve the best accuracy score.
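A sketch of what that hyperparameter search could look like with scikit-learn's GridSearchCV (the grid values here are illustrative, not the ones ultimately used):
#illustrative cross-validated hyperparameter search for the tree regressor
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [4, 6, 10, 20], 'min_samples_leaf': [5, 10, 20]}
search = GridSearchCV(dTree(), param_grid, cv = 5)
#after defining X and y below, search.fit(X, y) would fill in
#search.best_params_ and search.best_score_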
#changing date, latitude, and longitude to numeric values to use for modeling
df['numDate'] = df.apply(lambda x: (x['datetime'] - datetime(2020,1,27)).days, axis = 1)
df['lat'] = df.apply(lambda x: float(x['intptlat']), axis = 1)
df['lon'] = df.apply(lambda x: float(x['intptlon']), axis = 1)
#trimming first thirty days of data since deaths had not started to accumulate
df1 = df[df['numDate'] > 30]
df1 = df1.reset_index()
#defining X attributes to use as well as y output (deaths)
X = df1[['numDate','lat','lon','aland','pop']].to_numpy()
y = df1['deaths'].values
#making two decision trees, one short and one tall to test method before proceeding
tree = dTree(max_depth = 6, min_samples_split = 20, min_samples_leaf = 10)
tree2 = dTree(max_depth = 50, min_samples_split = 20, min_samples_leaf = 10)
#fitting model
estimator = tree.fit(X,y)
estimator2 = tree2.fit(X,y)
#adding predictions to dataframe
df1['prediction1'] = estimator.predict(X)
df1['prediction2'] = estimator2.predict(X)
f = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
title ='Decision Tree Regressor Test for Wake County', x_axis_type = 'datetime', sizing_mode = 'stretch_width')
wake = df1[df1['name'] == 'Wake']
#plotting predictions for Wake county to get some initial understanding of their performance
f.line(wake['datetime'], wake['deaths'], color = 'red', legend_label = 'Actual Data')
f.line(wake['datetime'], wake['prediction1'], color = 'blue', legend_label = 'Tree of Depth = 6')
f.line(wake['datetime'], wake['prediction2'], color = 'lightblue', legend_label = 'Tree of Depth = 50')
show(f, sizing_mode = 'stretch_width')
Immediately you can see that there will be problems with a decision tree regression. When using a large depth the model badly overfits, as you can observe in how tightly it tracks Wake County, NC above. On the other hand, limiting the number of levels restricts the resolution on the date axis, and in either case the tree cannot predict values beyond those seen at the end of our training data, making the model essentially worthless as a prediction tool.
Instead we will train a linear regression model, but since the relationship is non-linear and our inputs have very different scales, we first need to scale the input data and generate polynomial features. To do this we will use scikit-learn's StandardScaler, which lets us scale the data and apply the same transformation to later test data. Then we will generate polynomial features.
#defining standard scaler
scaler = preprocessing.StandardScaler()
#polynomial features to generate interaction features
poly = preprocessing.PolynomialFeatures(5, interaction_only = True)
#generates quadratic features, applies scaling transformation, and then polynomial interaction features
def pre_process(X, scaler = scaler, poly = poly):
    return poly.fit_transform(scaler.fit_transform(np.hstack((X, X**2))))
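Note that pre_process refits both transformers each time it is called, which is fine here because we only ever transform the full training matrix. For genuinely held-out data you would fit on the training split and reuse the learned transformation, roughly like this (X_train and X_test are hypothetical placeholders for an actual split):
#sketch: fit the scaler and polynomial expander on training data only, then reuse them
def add_squares(X):
    return np.hstack((X, X**2))
X_train_t = poly.fit_transform(scaler.fit_transform(add_squares(X_train)))
X_test_t = poly.transform(scaler.transform(add_squares(X_test)))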
Since cases seem to rise in a roughly quadratic fashion, we add a new set of features based on the squares of the original features and append them to our matrix using np.hstack. This is a way to model a higher-order relationship while still using linear least squares to minimize the loss function. We then add interaction features up to degree 5, since we have 5 base features, and wrap this process in pre_process so we can quickly reuse it when training and testing.
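To get a feel for how large the expanded feature space is, a quick sanity check of the shapes (on a small slice, just to inspect dimensions):
#inspecting how many columns the preprocessing produces
X_small = X[:100]
print(np.hstack((X_small, X_small**2)).shape)  # 5 base features become 10 after squaring
print(pre_process(X_small).shape)              # interaction terms up to degree 5 of those 10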
#training model and adding predictions to DF
reg = LinearRegression().fit(pre_process(X), y)
df1['LinearPrediction'] = reg.predict(pre_process(X))
Since we've now fit our model to the data, and added the predictions to a column in our DataFrame, let's do a quick visual analysis of our result to see if we're on the right track. I'll plot the prediction and the actual death count for the most populous county in each state.
plots = []
states_colors = zip(states, colors)
for state, color in states_colors:
    df2 = df1[df1['state_name'] == state].reset_index(drop = True)
    #finding largest county in each state
    largest_county = df2['name'].iloc[df2[['pop']].idxmax()].iloc[0]
    #temporary dataframe of just largest county
    df2 = df2[df2['name'] == largest_county]
    #plotting both actual values and prediction
    f = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
                     title = 'Linear Regression vs Actual Data for ' + largest_county + ' County, ' + state,
                     x_axis_type = 'datetime', sizing_mode = 'stretch_width')
    f.line(df2['datetime'], df2['deaths'], color = 'red', legend_label = 'Actual')
    f.line(df2['datetime'], df2['LinearPrediction'], color = 'lightblue', legend_label = 'Prediction')
    f.legend.location = "top_left"
    plots.append(f)
show(column(plots, sizing_mode = 'stretch_width'))
From a visual perspective the model looks great, but let's get a more formal measure of its accuracy. To do so we will reset the model and run cross-validation on shuffled splits of our training data, using scikit-learn's built-in cross_val_score, which performs the validation and returns an accuracy score for each split. For linear regression this score is the coefficient of determination, $R^{2} = 1 - \frac{u}{v}$, where $u$ is the residual sum of squares and $v$ is the total sum of squares. The best possible score is $1.0$.
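For a concrete sense of the metric, here is a tiny worked example with made-up numbers, computing $R^{2}$ both from the formula and with scikit-learn's r2_score:
#tiny made-up example of the R^2 score
from sklearn.metrics import r2_score
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
u = ((y_true - y_pred)**2).sum()          # residual sum of squares = 0.10
v = ((y_true - y_true.mean())**2).sum()   # total sum of squares = 5.0
print(1 - u/v, r2_score(y_true, y_pred))  # both give 0.98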
#reset the model then run cross validation test, using shuffled input data
reg = LinearRegression()
cross_val_score(reg, pre_process(X), y, cv = ShuffleSplit())
These are great scores, especially since with the ten shuffled splits used here (scikit-learn's ShuffleSplit defaults), each model was trained on a randomly selected 90% of the data and still scored above 0.99 when tested against the held-out 10%. To visualize the success, let's reconstruct the state numbers using only our county-level predictions and compare them to the true state totals.
#using groupby operation to construct state totals and state total predictions
df1Totals = df1.groupby(by = ['datetime', 'state_name']).sum().reset_index()
f = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
title ='Linear Regression vs Actual Data for Reconstructed State Data',
x_axis_type = 'datetime', sizing_mode = 'stretch_width')
states_colors = zip(states, colors)
plots = []
for state, color in states_colors:
    f = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
                     title = 'Sum of Linear Regression County Predictions vs Actual Data for ' + state,
                     x_axis_type = 'datetime', sizing_mode = 'stretch_width')
    #temporary dataframe of a single state
    df2 = df1Totals[df1Totals['state_name'] == state]
    #plotting state data as well as constructed prediction
    f.line(df2['datetime'], df2['deaths'], line_width = 3, alpha = .5, color = color, legend_label = 'Deaths')
    f.line(df2['datetime'], df2['LinearPrediction'], line_width = 3, alpha = .8, color = color, legend_label = 'Sum of County Models', line_dash = 'dotted')
    f.legend.location = "top_left"
    f.legend.title = "Click To Hide"
    f.legend.title_text_color = "White"
    f.legend.click_policy = "hide"
    plots.append(f)
show(column(plots, sizing_mode = 'stretch_width'))
By gathering data on COVID-19 deaths, population, and geography, I was able to successfully create a model that tracks deaths at the county level. Given the accuracy of the model, it appears that, at least in the limited context of the United States, those factors alone are enough to predict how a given county would have been affected by the pandemic over the past year. How well this model, or more complex models such as the CDC's death forecasts, can see into the future remains to be seen.
To learn more about data management, exploratory data analysis, statistics, and machine learning, consider exploring the following links:
All source files can be found at Github - Miguel Chavez