COVID-19: Modeling The Relative Impact on US States

By Miguel Chavez

Introduction

Just days after the first SARS-CoV-2 cases were identified in the United States, it became apparent that different states and counties were being affected to very different degrees. The hospital ship USNS Comfort arrived at Pier 88 of New York Harbor on Monday, March 30, while life in much of rural America remained largely unaffected. This difference in impact continued throughout the summer and into the fall, as state and local governments took an array of approaches to balancing the public health impact of COVID-19 with the economic impact.

In the following analysis, I will attempt to show the varying degrees of impact COVID-19 has had on six states:

  • North Carolina
  • Maryland
  • California
  • New York
  • Texas
  • Michigan

These states represent a variety of demographics and geographies. By analyzing the death toll of these states, as well as attributes such as geography and population density, I hope to identify causal relationships that will be useful in building a predictive model.

Environment Setup

Pandas and GeoPandas will be used for data manipulation, Bokeh for interactive plotting, statsmodels for correlation calculations, and scikit-learn for machine learning model creation.

In [46]:
#!pip install jupyterthemes folium geopandas
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:60% !important; }</style>"))
import requests
import pandas as pd, geopandas as gpd
import geojson, json
import numpy as np
import censusdata
from datetime import datetime
import bokeh.plotting as bkplt
from bokeh.io import output_notebook, show, curdoc
from bokeh.resources import INLINE
from bokeh.layouts import gridplot as grid, row, column
from bokeh.models import LinearAxis, Range1d, Div
from bokeh.themes import built_in_themes, Theme
from bokeh import palettes
from sklearn import preprocessing
from statsmodels.stats.weightstats import DescrStatsW as dwStats
from sklearn.tree import DecisionTreeRegressor as dTree
from sklearn.model_selection import cross_val_score, ShuffleSplit, KFold
from sklearn.linear_model import LinearRegression
bkplt.output_notebook(hide_banner = True)
# Defining state color palette using bokeh set3 palette
colors = palettes.Set3[6]
states = ['North Carolina', 'Maryland', 'California', 'New York', 'Texas', 'Michigan']
# applying custom bokeh theme to match jupyter theme
with open('customTheme.json') as f:
    d = json.load(f)
    curdoc().theme = Theme(json = d)

Data Collection and Management

Data Sources

COVID-19 data is sourced from the Novel Corona Virus 2019 Dataset on Kaggle, GeoJSON county-level files come from Opendatasoft, and US census data is pulled from the CensusData API.

Reading and Storing Data

First, using pandas, we read the COVID-19 dataset into a DataFrame, pandas' own two-dimensional data structure.

In [47]:
cov19df = pd.read_csv("time_series_covid_19_deaths_US.csv")
#reading df and displaying last two columns as a sample
cov19df.iloc[-2:]
Out[47]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 11/27/20 11/28/20 11/29/20 11/30/20 12/1/20 12/2/20 12/3/20 12/4/20 12/5/20 12/6/20
3338 84056043 US USA 840 56043.0 Washakie Wyoming US 43.904516 -107.680187 ... 8 8 8 8 8 8 8 8 8 8
3339 84056045 US USA 840 56045.0 Weston Wyoming US 43.839612 -104.567488 ... 1 1 1 1 2 2 2 2 2 2

2 rows × 332 columns

Next, using GeoPandas, we read the GeoJSON files downloaded from ODS into a GeoDataFrame. This is nothing but a pandas DataFrame with a defined geometry column, but the GeoPandas library makes converting a GeoJSON file into a DataFrame easy. You may notice the line:

mdGdf.at[15, 'name'] = 'Baltimore City'

This is an example of data pre-processing. Since Baltimore City as well as Baltimore County were both marked as 'Baltimore' in the name column of our GeoDataFrame, later merges with the COVID-19 dataset (which contains both 'Baltimore' and 'Baltimore City') would be unsuccessful or incomplete if the conflict were not resolved. Manually fixing data can be a painstaking yet necessary part of data science.

In [48]:
#reading in geojson files using geopandas
nc_geo = "ncCounties.geojson"
tx_geo = "txCounties.geojson"
ny_geo = "nyCounties.geojson"
ca_geo = "caCounties.geojson"
md_geo = 'mdCounties.geojson'
mi_geo = 'miCounties.geojson' 
ncGdf = gpd.read_file(nc_geo)
txGdf = gpd.read_file(tx_geo)
nyGdf = gpd.read_file(ny_geo)
caGdf = gpd.read_file(ca_geo)
mdGdf = gpd.read_file(md_geo)
miGdf = gpd.read_file(mi_geo)
#showing duplicated county name 'Baltimore' which needs to be resolved
mdGdf[mdGdf['name'] == 'Baltimore']
Out[48]:
intptlat countyfp_nozero name cbsafp funcstat intptlon lsad stusab classfp awater ... geoid aland countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry
7 +39.4431666 5 Baltimore 12580 A -076.6165693 06 MD H1 215959023 ... 24005 1549745106 005 01695314 G4020 Baltimore County 24 Maryland None POLYGON ((-76.88730 39.44050, -76.88732 39.440...
15 +39.3000324 510 Baltimore 12580 F -076.6104761 25 MD C7 28758714 ... 24510 209650970 510 01702381 G4020 Baltimore city 24 Maryland None POLYGON ((-76.71151 39.36621, -76.71151 39.366...

2 rows × 21 columns

In [49]:
#Changing county name to baltimore city to match other data sources, and showing that the issue has been resolved
mdGdf.at[15, 'name'] = 'Baltimore City'
mdGdf[mdGdf['name'] == 'Baltimore']
Out[49]:
intptlat countyfp_nozero name cbsafp funcstat intptlon lsad stusab classfp awater ... geoid aland countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry
7 +39.4431666 5 Baltimore 12580 A -076.6165693 06 MD H1 215959023 ... 24005 1549745106 005 01695314 G4020 Baltimore County 24 Maryland None POLYGON ((-76.88730 39.44050, -76.88732 39.440...

1 rows × 21 columns

Last, we will use the CensusData API to collect population figures for each county in our selected states; these numbers will also be stored in a DataFrame. To use the CensusData API we need the ANSI FIPS codes for our states, which are conveniently located in the GeoDataFrames we just produced.

In [50]:
#function to get county data using CensusData api
def censusData(stateCode):
    censusDf = censusdata.download('acs5', 2015, censusdata.censusgeo(
        [('state', stateCode),('county', '*')]), ['B01001_001E'])
    censusDf = censusDf.reset_index()
    censusDf['name'] = censusDf.apply(lambda x: str(x['index']).split(" Count")[0] , axis = 1)
    censusDf = censusDf.rename(columns = {'B01001_001E': 'pop'}).drop(['index'], axis = 1)
    return censusDf 
#using census api to get population for all state counties
ncCensus = censusData(ncGdf.iloc[0]['statefp'])
mdCensus = censusData(mdGdf.iloc[0]['statefp'])
txCensus = censusData(txGdf.iloc[0]['statefp'])
nyCensus = censusData(nyGdf.iloc[0]['statefp'])
caCensus = censusData(caGdf.iloc[0]['statefp'])
miCensus = censusData(miGdf.iloc[0]['statefp'])
ncCensus.iloc[-2:]
Out[50]:
pop name
98 37971 Yadkin
99 17604 Yancey

To make things easier, let's gather our DataFrames into a single data structure.

In [51]:
#defining dictionary of dataframes
frames = {"Maryland":{"census": mdCensus, "gdf": mdGdf},
         "North Carolina":{"census": ncCensus, "gdf": ncGdf},
         "Michigan":{"census": miCensus, "gdf": miGdf},
         "Texas":{"census": txCensus, "gdf": txGdf},
         "New York":{"census": nyCensus, "gdf": nyGdf},
         "California":{"census": caCensus, "gdf": caGdf}}

Now we have our geographic and census DataFrames stored together, but our COVID-19 dataset is still in an unusable form. Before we separate it by state, we need to make it tidy. Currently the DataFrame has one row per county, one column per day, and additional attribute columns; to make this data tidy, every observation (each day, for each county) needs its own row. To do this we use melt, a pandas function that un-pivots a DataFrame. We do this for every state and then add these frames to our dictionary of frames.

In [52]:
for f in frames:
    #select the state's covid data
    state_covid_data = cov19df[cov19df['Province_State'] == f]
    # using melt
    state_covid_data = state_covid_data.rename(columns = {"Admin2": "name"})
    state_covid_data = pd.melt(state_covid_data,id_vars='name', value_vars = list(state_covid_data.columns[12:332].values),
        var_name = 'date', value_name = 'deaths')
    #saving melted covid dataframes to the dictionary
    frames[f]['covid'] = state_covid_data

Now that we have all of our data, we could get straight to plotting and analysis, but to make our lives easier we should join these frames. We will use merge, the pandas function analogous to a SQL JOIN. Since every row will now contain the state name as well, we can drop the dictionary holding our frames and append them together.

In [53]:
df = pd.DataFrame()
#combining all state frames into a single dataframe (state_name, name) is a working primary key
for f in frames:
    df = df.append(frames[f]['covid'].merge(frames[f]['census'], on = 'name', how = 'inner').merge(frames[f]['gdf'], on = 'name', how = 'inner'))
df.iloc[-2:]
Out[53]:
name date deaths pop intptlat countyfp_nozero cbsafp funcstat intptlon lsad ... geoid aland countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry
18558 Yuba 12/5/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 06115 1636913845 115 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1...
18559 Yuba 12/6/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 06115 1636913845 115 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1...

2 rows × 24 columns

The date is currently represented as a string, which isn't optimal, so let's convert it to datetime.

In [54]:
# lambda func application to add datetime column
df['datetime'] = df.apply(lambda x: datetime.strptime(x['date'], '%m/%d/%y'), axis = 1)

Since population density will be of use to us later, let's calculate it now. The column aland is the land area in $m^{2}$ for each county, originally obtained from our GeoJSON files. I'll be using $km^{2}$, so you'll notice a factor of $1000^{2} = 1000000$ coming into play. We also want state death totals for each day, so I'll compute those as well using a groupby operation, which, like merge, is analogous to the SQL statement of the same name.

In [55]:
#groupby operation to make a separate statewide dataframe
dfTotals = df.groupby(by = ['datetime', 'state_name']).sum().reset_index()
df = df.reset_index(drop = True)
#lambda application for population density for both dfs
df['pDensity'] = df.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
dfTotals['pDensity'] = dfTotals.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
#per capita (deaths per 1,000 residents) lambda application
dfTotals['perCap'] = dfTotals.apply(lambda x: (x['deaths']/x['pop'])*1000, axis = 1)
df['perCap'] = df.apply(lambda x: (x['deaths']/x['pop'])*1000, axis = 1)
#sample rows displayed
display(df.iloc[-2:], dfTotals.iloc[-6:])
name date deaths pop intptlat countyfp_nozero cbsafp funcstat intptlon lsad ... countyns mtfcc namelsad statefp state_name metdivfp geometry datetime pDensity perCap
185598 Yuba 12/5/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1... 2020-12-05 44.863082 0.149788
185599 Yuba 12/6/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1... 2020-12-06 44.863082 0.149788

2 rows × 27 columns

datetime state_name deaths pop awater aland pDensity perCap
1914 2020-12-06 California 19928 38421464 20305454540 403660088482 95.182717 0.518668
1915 2020-12-06 Maryland 4208 5308084 6950582256 24942075326 212.816453 0.792753
1916 2020-12-06 Michigan 10211 9900571 103878116983 146608689865 67.530588 1.031355
1917 2020-12-06 New York 34789 19673174 19246143659 122050000805 161.189462 1.768347
1918 2020-12-06 North Carolina 5543 9845333 13463401534 125925929633 78.183524 0.563008
1919 2020-12-06 Texas 23137 26538614 18991880422 676668210823 39.219537 0.871824

Now that we have county and state level DataFrames of tidy data, we can move onto analysis.

Exploratory Data Analysis

State Level Visualization

I'll primarily be using Bokeh for visualization, since Bokeh plots remain interactive once exported to HTML, which matplotlib/seaborn plots do not, at least not in an easy-to-apply fashion. To get a visual feel for some trends, I'll plot three things:

  1. $$\textrm{Cumulative Deaths: } \sum_{i = 3/02/20}^{12/6/20} deaths_{i}$$
  2. $$\textrm{Cumulative Deaths Per Capita: } \sum_{i = 3/02/20}^{12/6/20} \frac{deaths_{i}}{pop}$$
  3. $$\textrm{Cumulative Deaths Per Population Density } (km^{2})\textrm{: } \sum_{i = 3/02/20}^{12/6/20} \frac{deaths_{i}}{pop/km^{2}} \equiv \sum_{i = 3/02/20}^{12/6/20} \frac{deaths_{i} \cdot km^{2}}{pop}$$

To plot efficiently, I'll zip two lists, one of the names of my states, and another of my chosen colors from the palette I defined during my environment setup.

In [56]:
# zip object holding state names and colors
states_colors = zip(states, colors)
# making plots using bokeh
f1 = bkplt.figure(plot_height = 550, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths',
    title ="Cumulative Deaths for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
f2 = bkplt.figure(plot_height = 550, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
    title ="Cumulative Deaths Per Capita for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
f3 = bkplt.figure(plot_height = 550, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Population Density',
    title ="Cumulative Deaths Per Population Density for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
for state,color in states_colors:
    df1 = dfTotals[dfTotals['state_name'] == state].copy()
    df1['var3'] = df1.apply(lambda x: x['deaths']/x['pDensity'], axis = 1)
    f1.line(df1['datetime'], df1['deaths'], line_width = 6, alpha = .8, color = color, legend_label= state)
    f2.line(df1['datetime'], df1['perCap'], line_width = 6, alpha = .8, color = color, legend_label= state)
    f3.line(df1['datetime'], df1['var3'], line_width = 6, alpha = .8, color = color,legend_label= state)
for f in [f1,f2,f3]:
    f.legend.location = "top_left"
    f.legend.title = "Click To Hide"
    f.legend.title_text_color = "White"
    f.legend.click_policy="hide"
show(column(f1,f2,f3, sizing_mode = 'stretch_width'))

Initial Observations

The extremely fast rise of deaths in New York immediately pops out, both on an absolute and a per capita basis. Viewing certain pairs of states also makes for interesting comparisons. Try hiding every state besides California, Texas, and Michigan in the per capita plot. We see an initial surge from Michigan, while Texas and California remain tightly paired. As the weather heated up, though, deaths in Texas rose dramatically. I would guess this is due to more people congregating inside during the summer months, compared to the more comfortable summers of CA and MI. The trend reverses as the weather cools off, and Michigan deaths begin to rise dramatically. I would assume this is a similar phenomenon: as the weather becomes unbearable, outside gatherings wane, and more people are bound to catch COVID-19 indoors. When controlling for population density in the third graph, the problem with comparing states to each other becomes apparent. Texas deaths skyrocket, but this is almost certainly due to the many largely uninhabited counties with extremely large land areas, which skew the state's population density far lower than the density at which the average Texan actually lives. It is therefore clearly necessary to do further exploratory data analysis at the county level before attempting to build and train a model.

County Level Visualization

To get a better understanding of what's going on at the county level, I'll plot deaths per capita for every county in each state, highlighting the five deadliest counties in each.

In [57]:
plots = []
states_colors = zip(states, colors)
for state, color in states_colors:
    f2 = bkplt.figure( plot_height = 500, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
        title ="Deaths Per Capita for All Counties in "+ state+", Top 5 Deadliest Counties Selected", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
    # temporary frame of just a single state
    df1 = df[df['state_name'] == state].copy()
    # sorting observations by time
    df1 = df1.sort_values(by = 'datetime')
    #finding number of counties
    county_count = len(df1['name'].unique())
    #finding top 5 deadliest counties by sorting by deaths on the last day
    top5 = list(df1[-county_count:].sort_values(by = 'deaths')['name'][-5:])
    #sorting top5 descending
    top5.reverse()
    #for each county
    for s in df1['name'].unique():
        df2 = df1[df1['name'] == s].copy()
        #if county is top 5 deadliest, make line slightly more pronounced and add to legend, giving option to highlight the county
        if( s in top5 ):
            f2.line(df2['datetime'], df2['perCap'], alpha = .2, color = color, line_width = 3, muted_alpha=1, legend_label= s, muted_line_width = 6)
            continue
        #else plot line normally
        f2.line(df2['datetime'], df2['perCap'], alpha = .05, color = color, line_width = 2)
        #style code and legend click policy
    for f in [f2]:
        f.legend.location = "top_left"
        f.legend.title = "Click To Hide"
        f.legend.title_text_color = "White"
        f.legend.click_policy="mute"
        f.legend.title = "Top 5 "+state+" Counties (by deaths): Click To Highlight"
    #add states plot to list of plots
    plots.append(f2)
#display a column of state plots
show(column(plots, sizing_mode = 'stretch_width'))
    

Now that we can see all the counties together, it is clear that a one-size-fits-all model for each state is insufficient. While counties with more deaths may follow similar patterns, smaller or less densely populated counties heavily skew the statewide results. For example, looking at the Texas counties, we see that four of the five deadliest counties followed very similar trends, but Hidalgo County behaved quite differently. This suggests that there are differences at the county level that we must account for if we attempt to model the spread of COVID-19. If we look at the top five counties in New York, we see that they have much higher death rates than most of the other counties; this is likely due to the high density of people living in and around New York City, while those living in upstate New York are much more spread out.

Exploring the Possibility of a Death Rate / Population Density Correlation

Let's try to determine whether there is a relationship between the per capita death rate and population density by plotting the final per capita death rate of all counties on one axis and their population density on the other.

In [58]:
plots = []
states_colors = zip(states, colors)
covar = []
names = []
for state, color in states_colors:
    f = bkplt.figure(plot_height = 450, x_axis_label = 'Deaths per Capita', y_axis_label = 'Population Density',
        title ='Pop. Density vs Deaths per Capita on 12/6/2020 for '+ state+ ' Counties',)
    df1 = df[df['state_name'] == state]
    final_date = df1['datetime'].max()
    df1 = df1[df1['datetime'] == final_date]
    #finding average deaths in state to normalize dot size later
    avg_deaths = df1['deaths'].mean()
    #plotting points, using 1 + deaths/avg_deaths as the dot size
    f.scatter(df1['perCap'], df1['pDensity'], fill_color = color, alpha = .6, size = 1 + (df1['deaths']/avg_deaths))
    stats = dwStats(df1[['perCap','pDensity']], df1['deaths'])
    names.append(state)
    #calculating correlation coefficient for the state adding to list
    covar.append(stats.corrcoef[0][1])
    plots.append(f)
#making bar chart of states and their correlation coefficient
f2 = bkplt.figure(plot_height = 300, x_range = names, x_axis_label = 'State', y_axis_label = 'Correlation Coefficient',
        title ='Correlation Between Per Capita Deaths and Population Density',)
f2.vbar(x = names, top = covar, width = .8, color = colors )    
plots.append(f2)
show(column(plots, sizing_mode = 'stretch_width'))

Since the correlation coefficient varies from state to state, if we want to use population density in our model it is best to either include an interaction term between the state and the county's population density (if using a linear model) or use a more complex model such as a decision tree. We still have only a rough view of which factors determine how a county will be affected by COVID-19, but let's proceed and attempt to build a model; a sketch of such an interaction term is shown below.
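
As a hedged illustration, and not something done in this notebook, here is one way such a state/density interaction could be built with pandas: one-hot encode state_name and multiply each indicator column by pDensity, so a linear model can learn a separate density slope per state. The helper name add_state_density_interactions is hypothetical.

# A minimal sketch (assumes the combined df built above, with 'state_name' and 'pDensity' columns)
import pandas as pd

def add_state_density_interactions(frame):
    # one-hot encode the state name
    dummies = pd.get_dummies(frame['state_name'], prefix = 'state')
    # multiply each state indicator by population density -> one density slope per state
    interactions = dummies.mul(frame['pDensity'], axis = 0)
    interactions.columns = [c + '_x_pDensity' for c in interactions.columns]
    return pd.concat([frame, interactions], axis = 1)

# example usage (hypothetical): df_model = add_state_density_interactions(df)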

Machine Learning

Building, Training, and Testing a Decision Tree Regression

Picking a decision tree regressor seems to make sense for this problem for a variety of reasons. First, we have a variety of numeric data such as dates, deaths, latitude, longitude, population, and land area. Since we don't know exactly how these variables relate to each other and to our goal of modeling COVID-19 deaths, other models would require building many interaction features, and the resulting model would likely be hard to interpret even if it were accurate. Using a decision tree lets us build a model that can actually be interpreted visually, without a large amount of data preprocessing.

To avoid overfitting, we will use cross-validation to train our model, selecting hyperparameters to achieve the best accuracy score.
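
The next two cells fit trees with hand-picked depths rather than tuned ones. As a hedged sketch of what cross-validated hyperparameter selection could look like (it is not part of this notebook's pipeline), scikit-learn's GridSearchCV can score a small grid of depths and leaf sizes against the X and y defined in the next cell; the grid values here are arbitrary.

# A minimal sketch (assumes X and y as defined in the next cell)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {'max_depth': [6, 10, 20, 50],
              'min_samples_leaf': [5, 10, 20]}
search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv = 5)
# search.fit(X, y)                                  # X, y are built in the following cell
# print(search.best_params_, search.best_score_)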

In [59]:
#changing date, latitude, and longitude to numeric values to use for modeling
df['numDate'] = df.apply(lambda x: (x['datetime'] - datetime(2020,1,27)).days, axis = 1)
df['lat'] = df.apply(lambda x: float(x['intptlat']), axis = 1)
df['lon'] = df.apply(lambda x: float(x['intptlon']), axis = 1)
#trimming first thirty days of data since deaths had not started to accumulate
df1 = df[df['numDate'] > 30]
df1 = df1.reset_index()
#defining X attributes to use as well as y output (deaths)
X = df1[['numDate','lat','lon','aland','pop']].to_numpy()
y = df1['deaths'].values
In [60]:
#making two decision trees, one short and one tall to test method before proceeding
tree = dTree(max_depth = 6, min_samples_split = 20, min_samples_leaf = 10)
tree2 = dTree(max_depth = 50, min_samples_split = 20, min_samples_leaf = 10)
#fitting model
estimator = tree.fit(X,y)
estimator2 = tree2.fit(X,y)
#adding predictions to dataframe
df1['prediction1'] = estimator.predict(X)
df1['prediction2'] = estimator2.predict(X)
In [61]:
f = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
        title ='Decision Tree Regressor Test for Wake County', x_axis_type = 'datetime', sizing_mode = 'stretch_width')
wake = df1[df1['name'] == 'Wake']
#plotting predictions for Wake county to get some initial understanding of their performance
f.line(wake['datetime'], wake['deaths'], color = 'red', legend_label = 'Actual Data')
f.line(wake['datetime'], wake['prediction1'], color = 'blue', legend_label = 'Tree of Depth = 6')
f.line(wake['datetime'], wake['prediction2'], color = 'lightblue', legend_label = 'Tree of Depth = 50')
show(f, sizing_mode = 'stretch_width')

Immediately you can see that there will be problems implementing a decision tree regression. When using a large depth the model badly overfits, as you can observe it tightly tracking Wake County, NC above. On the other hand, limiting the number of levels of the tree restricts the resolution on the date axis, and the tree will not predict anything beyond the end of our training dataset, making the model essentially worthless as a prediction tool.
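
To make the extrapolation problem concrete, here is a hedged sketch on synthetic one-dimensional data (not drawn from the notebook's dataset): once fitted, a decision tree returns the value of its rightmost leaf for any input past the training range, so every future date would get the same prediction.

# A minimal sketch on synthetic data: trees cannot extrapolate past the training range
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_demo = np.arange(100).reshape(-1, 1)             # e.g. days 0..99
y_demo = X_demo.ravel() ** 2                       # quadratic growth
demo_tree = DecisionTreeRegressor(max_depth = 6).fit(X_demo, y_demo)
print(demo_tree.predict([[99], [150], [500]]))     # the last two predictions are identical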

Ditching the Decision Tree in Favor of a Linear Regression

Instead, we will train a linear regression model. Since our target is non-linear in the inputs and the inputs have very different scales, we first need to scale the input data and generate polynomial features. To do this we will use scikit-learn's StandardScaler, which lets us scale the data and apply the same transformation to later test data, and PolynomialFeatures to generate the interaction terms.

In [62]:
#defining standard scaler
scaler = preprocessing.StandardScaler()
#polynomial features to generate interaction features
poly = preprocessing.PolynomialFeatures(5, interaction_only = True)
#generates quadratic features, applies the scaling transformation, then polynomial interaction features
#note: fit_transform re-fits the scaler and feature generator on whatever X is passed in
def pre_process(X, scaler = scaler, poly = poly):
    return poly.fit_transform(scaler.fit_transform(np.hstack((X, X**2))))

Since cases seem to rise in a quadratic fashion, we add a new set of features based on the original features squared and append them to our matrix using np.hstack. This is a way to model a higher-order relationship while still using linear least squares to minimize the loss function. We then add interaction features up to degree 5, since we have 5 original features, and we wrap this process in pre_process so we can quickly reuse it when training and testing.
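
As a quick hedged check, and not part of the original pipeline: 5 raw features plus their squares give 10 scaled columns, and interaction-only features up to degree 5 expand those to $\sum_{k=0}^{5}\binom{10}{k} = 638$ columns, which can be confirmed on dummy data.

# A minimal sketch with dummy data: count the columns pre_process would produce
import numpy as np
from sklearn import preprocessing

X_demo = np.random.rand(3, 5)                      # 3 rows, 5 raw features
X_aug = np.hstack((X_demo, X_demo**2))             # 10 columns after adding squares
poly_demo = preprocessing.PolynomialFeatures(5, interaction_only = True)
print(poly_demo.fit_transform(X_aug).shape)        # (3, 638)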

In [63]:
#training model and adding predictions to DF
reg = LinearRegression().fit(pre_process(X), y)
df1['LinearPrediction'] = reg.predict(pre_process(X))

Since we've now fit our model to the data, and added the predictions to a column in our DataFrame, let's do a quick visual analysis of our result to see if we're on the right track. I'll plot the prediction and the actual death count for the most populous county in each state.

In [64]:
plots = []
states_colors = zip(states, colors)
for state, color in states_colors:
    df2 = df1[df1['state_name'] == state].reset_index(drop = True)
    #finding largest county in each state
    largest_county = df2['name'].iloc[df2[['pop']].idxmax()].iloc[0]
    #temporary dataframe of just largest county
    df2 = df2[df2['name'] == largest_county]
    #plotting both actual values and prediction
    f = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
        title ='Linear Regression vs Actual Data for '+ largest_county + ' County, ' + state,
            x_axis_type = 'datetime', sizing_mode = 'stretch_width')
    f.line(df2['datetime'], df2['deaths'], color = 'red', legend_label = 'Actual')
    f.line(df2['datetime'], df2['LinearPrediction'], color = 'lightblue', legend_label = 'Prediction')
    f.legend.location = "top_left"
    plots.append(f)
show(column(plots, sizing_mode = 'stretch_width'))

From a visual perspective the model looks great, but let's get a more formal measure of its accuracy. To do so we will reset the model and run cross-validation on shuffled folds of our training data, using scikit-learn's built-in cross_val_score, which performs the validation and returns an accuracy score for each fold. For a linear regression model this score is the coefficient of determination, $R^{2} \equiv 1 - \frac{u}{v}$, where $u$ is the residual sum of squares and $v$ is the total sum of squares. The best possible score is $1.0$.

In [65]:
#reset the model then run cross validation test, using shuffled input data
reg = LinearRegression()
cross_val_score(reg, pre_process(X), y, cv = ShuffleSplit())
Out[65]:
array([0.98888817, 0.99248691, 0.99272637, 0.99360786, 0.99413932,
       0.9937445 , 0.99047001, 0.99515735, 0.99363819, 0.99144629])
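
Before interpreting these scores, a hedged aside that is not from the original notebook: ShuffleSplit's defaults can be checked directly; with no arguments it makes 10 random splits, each holding out 10% of the samples for testing and training on the other 90%.

# A minimal sketch: inspect ShuffleSplit's default train/test split sizes
import numpy as np
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit()                                # n_splits = 10, test_size defaults to 0.1
train_idx, test_idx = next(ss.split(np.zeros((100, 1))))
print(len(train_idx), len(test_idx))               # 90 10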

These are great scores, especially since with the 10 shuffled splits used here, each model was trained on a randomly selected 90% of the data and still scored above 0.99 when tested against the held-out 10%. To visualize this success, let's reconstruct the state numbers using only our county-level predictions and compare them to the true state totals.

Constructing a State Total Prediction by Summing County Level Predictions

In [66]:
#using groupby operation to construct state totals and state total predictions
df1Totals = df1.groupby(by = ['datetime', 'state_name']).sum().reset_index()
states_colors = zip(states, colors)
plots = []
for state, color in states_colors:
    f  = bkplt.figure(plot_height = 450, x_axis_label = 'Date', y_axis_label = 'Deaths',
        title ='Sum of Linear Regression County Predictions vs Actual Data for ' + state,
            x_axis_type = 'datetime', sizing_mode = 'stretch_width')
    #temporary dataframe of a single state
    df2 = df1Totals[df1Totals['state_name'] == state]
    #plotting state data as well as constructed prediction
    f.line(df2['datetime'], df2['deaths'], line_width = 3, alpha = .5, color = color, legend_label= 'Deaths')
    f.line(df2['datetime'], df2['LinearPrediction'], line_width = 3, alpha = .8, color = color, legend_label= 'Sum of County Models', line_dash = 'dotted' )
    f.legend.location = "top_left"
    f.legend.title = "Click To Hide"
    f.legend.title_text_color = "White"
    f.legend.click_policy="hide"
    plots.append(f)
show(column(plots, sizing_mode = 'stretch_width'))

Conclusion

By gathering data on COVID-19 deaths, population, and geography, I was able to successfully create a model that tracks deaths at the county level. Given the model's accuracy, it appears that, at least in the limited context of the United States, those factors alone are enough to describe how a given county has been affected by the pandemic over the past year. How well this model, or more complex models such as the CDC's Death Forecast, can see into the future remains to be seen.

To learn more about data management, exploratory data analysis, statistics, and machine learning, consider exploring the following links:

  1. NumPy Quickstart Tutorial
  2. Pandas Cheat Sheet
  3. Pandas User Guide
  4. Scikit-Learn: Getting Started
  5. Wikipedia: Machine Learning
  6. Tidy Data - Hadley Wickham

All source files can be found at Github - Miguel Chavez