Hi, my name is Ash and I’m the Founder of Fidap. We provide clean data for data scientists. Today, we’re going to get Covid data into a Jupyter Notebook. There’s also a YouTube video that walks through this post.
We’ll answer the following question:
Which states have the most total infections?
To do this, we’ll use a combination of tools. On the data side, that’s Fidap’s data platform, BigQuery, BigQuery Public Datasets, and the New York Times’ Covid dataset. We’ll also use Jupyter, Python, pandas, and SQL.
First, let’s navigate over to Fidap’s data catalog and search for Covid.
After we select the Covid NYT dataset, we can see descriptive stats.
Let’s navigate to the Tables tab to see the tables in this dataset.
Once we select us_states, we can see details about this table.
We recently added the Explore tab, which gives us detailed exploratory stats on every column (thanks to pandas profiling). For example, we see the distribution of the confirmed_cases column. This can give us a great sense of our data and may alert us to any data quality issues as well.
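If you’d like a rough version of these column stats in your own notebook, here’s a minimal sketch using plain pandas. The rows below are made-up placeholder values (not real case counts), and the column names state_name and confirmed_cases are assumptions about how the table looks once it’s loaded into a DataFrame.

```python
import pandas as pd

# Placeholder rows for illustration only; these are NOT real Covid figures.
df = pd.DataFrame(
    {
        "state_name": ["A", "B", "C", "D"],
        "confirmed_cases": [100, 250, 250, None],
    }
)

# Summary statistics (count, mean, quartiles, min, max) for one numeric column.
summary = df["confirmed_cases"].describe()

# Missing values per column, a quick first data quality check.
missing = df.isna().sum()

print(summary)
print(missing)
```

This is far less detailed than a full profiling report, but it’s often enough to spot missing values or suspicious ranges early.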
Let’s move on to actually querying this table. We can do that using Fidap’s query tool.
We’ve saved the query here. The query generates the following results:
That’s great, but most data scientists would prefer to do this in a Jupyter Notebook rather than a web interface. For this reason, Fidap has built a Python package.
Open up a Jupyter Notebook or alternatively, use this Google Colab Notebook. First, we install Fidap via pip:
pip install fidap
Next, let’s instantiate the Fidap client with our API key (you can get it from the Account section in Fidap).
import fidap
fc = fidap.fidap_client(api_key='xxx')
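If you’d rather not paste the key directly into a notebook you might share, one option is to read it from an environment variable. This is just a sketch: the variable name FIDAP_API_KEY and the helper load_api_key are our own hypothetical names, not something the Fidap package defines.

```python
import os


def load_api_key(env_var="FIDAP_API_KEY"):
    """Return the API key stored in an environment variable, or fail loudly."""
    # FIDAP_API_KEY is a hypothetical name chosen for this example.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage in a notebook (same client call as above, only the key source changes):
# fc = fidap.fidap_client(api_key=load_api_key())
```

The key then stays out of the notebook file itself, which helps if the notebook ends up in version control.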
Finally, let’s run the query:
fc.sql("""
select *
from bigquery-public-data.covid19_nyt.us_states
where date = CAST('2021-06-28' AS DATE)
order by confirmed_cases desc
limit 10
""")
We get back a pandas DataFrame with our result.
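To make the query logic concrete, here’s roughly the same filter, sort, and limit expressed in pandas. It’s a sketch on a tiny made-up DataFrame (the values are placeholders, not real case counts), and the state_name column is an assumption about the table’s schema.

```python
import pandas as pd

# Made-up placeholder rows; these are NOT real Covid figures.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-06-27", "2021-06-28", "2021-06-28", "2021-06-28"]
        ),
        "state_name": ["A", "A", "B", "C"],
        "confirmed_cases": [90, 100, 300, 200],
    }
)

# Equivalent of: where date = CAST('2021-06-28' AS DATE)
on_date = df[df["date"] == pd.Timestamp("2021-06-28")]

# Equivalent of: order by confirmed_cases desc limit 10
top = on_date.sort_values("confirmed_cases", ascending=False).head(10)

print(top[["state_name", "confirmed_cases"]])
```

Since the real result is already a DataFrame, the same kinds of pandas operations can be applied to it directly for follow-up analysis.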
As expected, the total number of Covid cases mostly tracks population: the states with the largest populations, like California and Texas, come out on top. This isn’t tremendously interesting.
Next time, we’ll do some more complex stuff.