I am a data person and if you are reading this you probably are too. I like to be informed and maintain control over situations. Data helps me with the former and at least gives the illusion of reaching the latter. Whilecrisis is impossible to control and the numbers around it can be stress-relieving, analyzing the data and understanding them a little better provides a certain level of understanding that can help.
Johns Hopkins provides coronavirus data
With that in mind, you may be aware that Johns Hopkins University’s (JHU) Center for Systems Science and Engineering (CSSE) maintains an ongoing update dashboard for the viruses’ spread, based on officially reported case counts from around the world, and depicted in the figure above. Although the dashboard itself is useful, the data is pulled down from Hopkins’ GitHub repo and it can be a little tricky to do your own analysis on it.
With this in mind, Tableau looked at this data, performed his own computer engineering / ETL work on it, and has published the output of this effort as his COVID-19 Data Resource Hub. On the Resource Hub website, the consolidated data is available in Tableau’s own Hyper and .tde formats, and also available in Google sheets and CSV formats, making it compatible with BI tools beyond Tableau.
Tableau makes it more accessible
On Friday I spoke Steve Schwartz, Director of Public Affairs at Tableau, and Sarah Goehri. Tableau Corporate Communications Manager. When each of us works from home and talked to each other via headsets and web cams, Schwartz explained to me that with the company’s establishment of Tableau Foundation, it now has a track record of working with data sets around the global health topic (more info on that here). Because of that, and because it found JHU data difficult to work with in its raw form, Tableau decided to perform computer engineering work and make the data available to everyone.
Tableau has released a fairly simple “opening panel“based on the data and made it available on Tableau Public. It has also made the .hyper, .tde, and CSV versions of the dataset available through the online data directory platform data.world. While this creates an “extra hop” to get to the data, access is still relatively straightforward, improving its accessibility across all platforms – not just Tableau. It also facilitates collaboration and discussions around the use of the dataset itself, something data.world focuses explicitly on.
Directory trading with data.world
When you visit the data.world storage site for Tableau’s version of JHU Coronoavirus data, you must log in. Provided you do not already have a data.world account, you can connect via Google, Facebook, GitHub or Twitter credentials and then change the password on your data.world account to something unique and secure.
Once done, you can enable integration with your BI platform through the “Open in app” option and then use the data.world connector for one of the supported platforms (including Microsoft Power BI. Excel. MicroStrategy. Google Data Studio. Plot.ly. Jupyter notebooks, rstudio and many others) to access the data. This connection process requires you to provide the owner, dataset ID, and query information that data.world will present to you for easy copying and pasting into the connection dialog invoked.
Since my own skill set in Microsoft Power BI is much stronger than for other tools, I chose to analyze the data there. Power BI Desktop is a free download and includes a beta data.world connector. Power BI Desktop runs on Windows only, although it may host other operating systems through certain virtualization platforms. (For example, Mac users can run Power BI Desktop using Parallels).
Dimensions, dimensions and granularity
Once connected to the data, you can see that their schedule is relatively straightforward. The granularity of the data is at the data level, per. Geography – either country_region or, for some countries, province_state (in some cases the granularity can drill down even slightly – see details below).
The data set’s measurements are “cases” and “difference”, which gives a cumulative and daily case total, respectively. Day for each tracked geography. Case_type is an important dimension as it allows you to filter or break down after active cases, deaths, recoveries and total confirmed cases. To avoid double counting, either look only in confirmed cases or otherwise only look at one or all of the other types (since rows of “Confirmed” cases essentially represent the sum of corresponding rows with case_type of “Active”, “Death” and “Recovered”).
The names country_region and province_stat allow you to map the data, but if you prefer not to use location names, “lat” and “long” columns provide latitude and longitude data that can also be used. Another column, simply called “location”, provides lat / long data together in POINT (long, lat) format. The latest date column reports data on the data set’s latest data and prep_flow_runtime appears to provide the date and time that Tableau’s data repair process was run.
Surprisingly, I found that using lat / long sometimes gives more data points per province state, apparently including a generic data point and for the whole area as well as some more location specific, the latter often having zero or a relatively small number of cases. I would recommend being careful about this feature in the granularity of the dataset.
Visualization of the data
Once you connect to and load the data, visualizing it is relatively straightforward. While it is not clear whether it is a blessing or a curse, it does help certain things to become evident. I filtered the data to exclude rows with case_type = Confirmed so I could perform cross-analysis of data with case type values for Active, Death and Recovered.
I’ll share some of these analyzes with you below. please note Tableau delivers the data as it is, and I do the same with my analyzes. None of the analyzes presented below have been closely studied, and they are only current until Saturday, 3/21/2020, based on values in the most recent date column in the dataset.
That said, depending on your tolerance for news, you can keep reading, otherwise stop here.
Let’s start with the obvious. Here is a bubble card that collects all new covid-19 active cases in the United States (including Alaska, Hawaii, Puerto Rico and Guam):
A simple combination chart of cases and difference columns by the data column shows how much we need to flatten the proverbial curve. Here is the chart for the state of New York, where the total active cases are drawn as a line and new active cases as columns:
We are used to seeing maps and plots like these in news coverage. They can be scary, overwhelming and frankly, they are not very detailed. What I found most helpful about the Tableau-supplied dataset was that it gave me the opportunity to go deeper and perform more specific analyzes.
For example, you can also look at the countries that have the largest number of cases, broken down by case_type. I did this by creating a stacked bar graph, sorted by number of cases in descending order, showing the seven most affected countries, exclusive China (since its case statement dominates the others). When I ran this on Saturday, the United States was in fifth place (again except China). When I updated the data this morning, unfortunately, the United States had moved into second place, with only Italy having more cases:
You can also drill down to the individual country level. Here is a pie chart of active cases by province_state, filtered by country that reveals the relationship of cases in Hubei Province with others in China:
And here’s the corresponding chart for the United States with New York State, Washington State and California dominating in that order:
Adding a timeline filter and setting the effective date back 10 days to March 11 can show how Washington State had the most number of cases at that time. Ten days makes a big difference, as we have learned and as this visualization shows.
We can also analyze the data using a more free form approach. Power BI’s Q&A feature makes it possible to query the data in relatively plain English. To test this, I asked Q&A to show me the number of newly reported deaths per year. US Day starting March 1 by entering “difference by date of death and US as area map on or after March 1 (date).” Here’s what I got back:
I should point out that the data I get back from the Hopkins dataset does not seem to match the reported data New York Governor Andrew Cuomo’s news conference this morning (Cuomo’s numbers were probably newer and therefore unfortunately higher). Although nothing seems available yet, I keep an eye on the open data pages for both town and state from New York, to see if I could add their data to my analysis in the future.
I will do my best to end on a positive note. Here’s what I came back to when I asked questions and answers to show me cumulative global “Recovered cases by date”:
We are getting close to 100,000 recoveries. Let’s hope this curve gets steeper, not flatter until the crisis is over.
The column chart visualization showing the seven countries with the highest number of cases (excluding China) was updated at 22.46 ET on Monday, March 23 to correct that the original version was not filtered on the current date. As such, on the Y axis, the original displayed a false aggregation of each day’s total number, producing a meaningless figure, exaggerated by an order of magnitude. The text describing visualization was also updated.
The map and three pie chart visualizations were updated at. 23.11 ET to correct the same error.