In our last article, we looked at how to run a simple query using the Covid NYT dataset to find out what states had the most total cases. Unsurprisingly, we found that the largest states by population tended to have the highest number of total cases.
We’ll now do a bit more complex analysis and add in a demographic dataset to get population adjusted infection and mortality rates by counties and states.
You can find a YouTube video of this article below —
We already know where to find Covid data from our last article. Let’s navigate over to the American Community Survey by the US Census. This data is indexed in Fidap thanks to Google Cloud’s Public Dataset Program. Let’s look at the details —
There are a whopping 219 tables in here! For our purposes, the best table is the county_2018_5yr table —
This table gives us detailed demographic details by county. We’ll be most interested in population. Let’s look at some of the columns available, there are over 250 available —
Let’s hop into Jupyter (again, via Colab). As before, we’ll install Fidap via pip and grab our API key so we’re ready to go. Here’s a ready version of this Notebook — just hit the “Open in Colab” button to run it.
We’ve got two queries to write. First, let’s query the Covid dataset —
SELECT *
FROM bigquery-public-data.covid19_nyt.us_counties
WHERE date = CAST('2021-06-28' AS DATE)
ORDER BY confirmed_cases DESC
LIMIT 20;
Next, let’s get the population data —
SELECT total_pop, geo_id
FROM bigquery-public-data.census_bureau_acs.county_2018_5yr
LIMIT 20;
Finally, we can join these together to get a population adjusted infection rate —
WITH covid_cases AS (
SELECT *
FROM bigquery-public-data.covid19_nyt.us_counties
WHERE date = CAST('2021-06-28' AS DATE))
SELECT acs.total_pop, c.county, c.state_name, c.confirmed_cases, c.deaths, (ROUND(10000*c.confirmed_cases/acs.total_pop, 2)) AS per_capita_county_infection_rate_10k, (ROUND(10000*c.deaths/acs.total_pop,2)) AS per_capita_county_death_rate_10k
FROM covid_cases AS c
INNER JOIN bigquery-public-data.census_bureau_acs.county_2018_5yr AS acs
ON acs.geo_id = c.county_fips_code
ORDER BY per_capita_county_infection_rate_10k;
This gives us the following result —
Now, since we already have a state column here, let’s just use pandas to do the group by instead of SQL —
Let’s take a deeper look at the results. The bottom ten states (eg lowest infection rates) are —
Hawaii, which was pretty famous for some really strict Covid policies, leads the way. The states above are all fairly liberal states and concentrated in the northeast for the most part.
The states with the highest infection rates, led by North Dakota, South Dakota and Utah are listed above. This is somewhat counterintuitive, as I would have suspected that higher density areas may have higher infection rates.
Let’s try to visualize how this looks using a histogram. The vast majority of states are bunched in the middle.
Logically, we would expect states with high levels of infections per capita to also have higher death rates. But which are the ones that are above average? We can create a scatterplot of infections against deaths, and divide the plot into four quarters delineated by the mean infection and death rates.
I find it very fascinating that the epicenters of the outbreak in the US, such as Los Angeles and New York City, are in states that have not notched up particularly high per-capita infection and death rates. They are close to the average infection and death rates. On the other hand, we have states like North and South Dakota, Arizona, and New Jersey with astronomical death and infection rates. Interestingly, states in New England like Maine, Vermont, and New Hampshire saw below average death and infection rates.
It might be easier to see this data using a map, although it is static. For the purposes of a quick map, let us just look at the lower 48 states. Hawaii and Alaska are so far from the mainland that plotting them in directly gives rise to all sorts of issues that require a much longer fix.
It might be interesting to extend this analysis by looking at predictive factors that may help explain the differential in infection rates. Would things like voting patterns, population density, state average income, or education rates be correlated with COVID rates?
We can also group states into regions to track divergence by region here.
We have some of this data on Fidap and invite you to see whether you can find anything.
Find our company news, product announcements, and in depth data analysis on our blog.