Navigating the COVID-19 data jungle

Marcus Russi

2021/09/08

A collage of graphs and tables showing problems with COVID-19 data

Above: Whether you’re analyzing how frequently states report 0s by day of the week, or finding that some moving averages work better than others depending on reporting patterns, COVID-19 data issues are seemingly everywhere.

Problems with case and death data constitute the most exasperating class of issues we deal with at Covidestim. Naturally, any sufficiently large information-processing system will encounter data quality issues - and we are no exception, with >1.6M rows of input data from 3100+ entities flowing into instances (runs) of the model every day. Case and death data issues will remain a vexing problem for us: they reduce the accuracy of our estimates, erode user confidence, and, like weeds, sprout new challenges faster than we can resolve them. Today, I’ll discuss the primary issues we encounter with “bad” data, and talk about our current and future approach to each. But before we get to that, let’s briefly talk about how cases and deaths data travels from a state or county website to an instance of our model.

Data pipeline

The data pipeline for the Covidestim project

Every day, we retrieve timeseries data for each state and county, and pass this data to instances of our model (I say “instances” because the model runs independently for each state and county). To do this, we look to Johns Hopkins' Center for Systems Science and Engineering (CSSE). They have undertaken a mammoth effort to cobble together a national COVID-19 dataset by scraping dozens of state and county websites for the latest case and death numbers every day and posting the results to GitHub. Thanks to their consistent reporting since early 2020, they have earned their place as the most credible US COVID-19 data provider. Additionally, they have become an important voice in the ongoing conversation about improving the quality of COVID-19 data, domestically and internationally.

So, at 2:00am every day, after CSSE has completed their scraping of county and state websites, our pipeline clones CSSE’s repository and performs some basic data cleaning.
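For a concrete flavor, here is a minimal sketch in R of the kind of cleaning involved. This is illustrative only - not our actual pipeline code - and the data frame and column names (csse, fips, cum_cases, cum_deaths) are assumptions, as is the clamping of negative counts:

library(dplyr)

# `csse` is assumed to be a long-format data frame with one row per
# county per day: fips, date, cum_cases, cum_deaths (cumulative counts)
clean <- csse %>%
  group_by(fips) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    # CSSE publishes cumulative counts; the model wants daily counts
    cases  = cum_cases  - lag(cum_cases,  default = 0),
    deaths = cum_deaths - lag(cum_deaths, default = 0),
    # Occasional downward revisions yield negative daily counts;
    # clamp them to zero (an assumption about how to handle this)
    cases  = pmax(cases, 0),
    deaths = pmax(deaths, 0)
  ) %>%
  ungroup()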

That’s about it! The data is then split by county/state and passed to instances of the model, and the results become available on the website every day. Being deliberately lightweight, this process makes no attempt to deal with most of the data issues we will discuss below. There’s a 7-day moving average inside the model, which helps smooth over aberrations in the case/death data, but as we will see, there are limits to what a moving average can do with highly flawed data. Now, let’s move on to the issues at hand:


Data-level issues

  1. “Dumps” - big spikes in the cases or deaths data
  2. Nonreporting - 0s for one or more days of cases or deaths data

System-level issues

  1. Ownership - Who is responsible for the accuracy of COVID-19 timeseries data?
  2. Authority - Who should have the authority to edit errant data?
  3. Lack of context - What are we not seeing when we look at cases and deaths data?

Data-level issues

“Dumps”

An example of data dumps with a 7-day moving average

Above: Los Angeles County’s cases and deaths data

A data “dump” is when the cases or deaths reported on a particular day are far higher than usual for reasons beyond the normal case/death reporting process. Here are a few kinds of “dumps” that happen:

  1. A backlog of test results is processed and reported all at once
  2. Deaths are reclassified as COVID-19 deaths after an audit of death certificates
  3. Probable cases or deaths are added in bulk after a change in case definitions

There are many more variations on this, but you get the idea. Sometimes the county or state involved will announce what they have done, but sometimes they won’t, or they won’t until somebody starts asking questions.

What does this do to model results?

These dumps go straight into the model, where they meet a 7-day moving average. The result is an averaged timeseries with an abrupt shift upwards (as the dump enters the 7-day window) followed by an equally abrupt shift downwards (as the dump exits the window). Look for the trellis shapes in the image above of Los Angeles' cases and deaths data.
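Here’s a toy illustration of the effect, with made-up numbers rather than real Los Angeles data:

# A steady ~100 cases/day, with a one-day dump of 2,000 on day 8
daily <- c(100, 110, 105, 95, 100, 108, 102, 2000,
           99, 104, 101, 97, 103, 100, 98)

# Trailing 7-day moving average
ma7 <- stats::filter(daily, rep(1 / 7, 7), sides = 1)
round(as.numeric(ma7))

# The average jumps by roughly (2000 - 100) / 7, about 270, on the day
# the dump enters the window, stays elevated for seven days, then falls
# just as abruptly when it exits - the "trellis" shape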

These trellis shapes make the model less accurate. For example, if there is a large “dump” of death data, the model will generally:

  1. Infer a rise in infections in the weeks before the dump that never actually happened, and then
  2. Infer an equally artificial decline as the dump exits the moving-average window

Neither of these outcomes is good! However, Covidestim’s options to intervene are limited: excising the “dump” entirely would throw out valuable information about the cumulative number of cases or deaths, which itself makes the model less accurate. Furthermore, “back-distributing” the “dump” – spreading it out across the previous month, for instance – may introduce yet more problems if our “back-distribution” isn’t in line with when these dumped cases or deaths actually occurred; see the sketch below. Remember, we generally have little to no information to use to “back-distribute” appropriately.
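To make that risk concrete, here is what a naive “back-distribution” might look like - a sketch that assumes the dump should be spread evenly over the prior 30 days, which is exactly the kind of guess that can go wrong:

# Spread the count reported on day `dump_idx` evenly over the `window`
# days ending on that day. Purely illustrative: we rarely know the true
# window, and the whole day's count may not all be backlog.
back_distribute <- function(cases, dump_idx, window = 30) {
  span  <- max(1, dump_idx - window + 1):dump_idx
  extra <- cases[dump_idx]
  cases[dump_idx] <- 0
  cases[span] <- cases[span] + extra / length(span)
  cases  # the cumulative total is preserved, but the timing is a guess
}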

What will Covidestim do in the future about this?

We’re working on systems to flag these dumps, both to make users aware when recent estimates may be unreliable, and to enhance our own awareness of the scope of the issue.
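For a flavor of what such flagging could look like, here is a minimal sketch - a rule of thumb invented for illustration, not our eventual production logic:

# Flag any day whose count exceeds k times the trailing `window`-day
# median - a crude but serviceable screen for one-day dumps
flag_dumps <- function(cases, k = 5, window = 14) {
  flags <- logical(length(cases))
  for (i in seq_along(cases)) {
    if (i > window) {
      med <- median(cases[(i - window):(i - 1)])
      flags[i] <- med > 0 && cases[i] > k * med
    }
  }
  flags
}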

Nonreporting

New Haven, CT case data, showing weekend and holiday nonreporting

Above: New Haven, CT’s case data. Orange dots are weekends where no data was reported, and green dots are dumps of cases which accumulated during the nonreporting period. Note the extra days of nonreporting around Thanksgiving and Christmas.

Nonreporting is when a county or state doesn’t report case or death data for one or more days. This happens because:

  1. It’s the weekend, and the state/county in question doesn’t report on weekends
  2. It’s Christmas/Thanksgiving/some other holiday
  3. Their system broke, or there was some other problem which prevented reporting
  4. They’ve decided to report only once a week
  5. They’ve decided to stop reporting, period

We’ve seen all of these - including #5, sadly. There are two primary problems with nonreporting:

First, nonreporting represents a delay of information. When we have to wait a week to see the latest cases/deaths data (as in Florida), our near-present-day estimates suffer because we are working from (in the worst case) week-old information.

Second, nonreporting is generally followed by some kind of spike, as cases or deaths which accumulated during the nonreporting period are added to the timeseries all at once. This creates weird jumps in the 7-day moving average: it declines on the days where 0s are reported, then spikes when the accumulated counts land. This reduces the accuracy of near-present-day estimates, but is less of a problem further back in the timeseries, where the phenomenon has little effect on gross infection trends.

What will Covidestim do in the future about this?

Soon, we plan to deal with mild nonreporting (weekends, for example) by distributing Monday’s cases across Saturday, Sunday, and Monday, as sketched below. With extreme cases of nonreporting, such as Florida’s counties no longer reporting deaths, we tell the model to ignore the death timeseries past a certain date. Obviously, this reduces model accuracy; in particular, it muddles the relationship between cases and deaths, which degrades the quality of inference the model can make with respect to case ascertainment.
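Here is a minimal sketch of that weekend fix; the function and its trigger condition (a Monday preceded by two zero-reporting days) are our own illustration, not finalized pipeline code:

# `dates` is a Date vector, `cases` the matching daily counts
redistribute_weekends <- function(dates, cases) {
  wday <- as.POSIXlt(dates)$wday  # 0 = Sunday, 1 = Monday, 6 = Saturday
  for (i in seq_along(cases)) {
    # A Monday following two days of zeros: split its count evenly
    # across Saturday, Sunday, and Monday
    if (i > 2 && wday[i] == 1 && cases[i - 1] == 0 && cases[i - 2] == 0) {
      cases[(i - 2):i] <- cases[i] / 3
    }
  }
  cases
}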

System-level issues - ownership, authority, context

So far, we’ve discussed specific problems in the data – problems that manifest to the eye and have easily-describable effects on our model estimates. However, it’s also worth delving into some of the broader phenomena at play here, which apply to nearly every entity that currently participates in the COVID-19 data ecosystem.

Ownership

A bunch of GitHub issues, seen through an API call

Above: A sample of GitHub issues filed on CSSE’s GitHub Issues page

Key question: Who is responsible for the accuracy of COVID-19 timeseries data?

The stewardship of case and death information, from its origin to its dissemination to the public, is unclear. If there are issues with the data, who is supposed to fix them? Usually, it’s an open question. The result is that issues originating at one level of the system bleed through every level above it. This leaves organizations like ours, which sit high on the “food chain” when it comes to COVID-19 data, in a bind when we see bad data affecting our estimates. Here’s a typical situation to illustrate:


Step 1: A private citizen creates a GitHub Issue, recognizing that a state has dumped a bunch of cases into the most recent day of timeseries data.

A few observations: the error was caught not by the state that published the data, and not by CSSE, but by a private citizen with no formal role in the data pipeline - nothing in the reporting process itself surfaced the problem.

Step 2: CSSE acknowledges the issue. In this case, CSSE chooses to reach out to the state, and ask for a revised timeseries.

This is an important step towards resolution, but we again see muddled responsibility: CSSE has no authority to compel a correction - it can only ask, and the state is free to respond on its own schedule, or not at all.

Step 3: The state publishes a new timeseries with the dumped cases spread out correctly.

The issue at hand is now solved - likely several days later - but we still see that the resolution depended on a volunteer’s vigilance and CSSE’s goodwill rather than on any formal process, and that every downstream consumer of the data was working from a flawed timeseries in the meantime.


In this example, the problem eventually got solved - and I’m happy to say that many similar situations have had similar outcomes in real life. However, look up nearly any state or large county’s cases/deaths timeseries and you will spot obvious imperfections. Without a mandate to “own” their data, in the form of reporting standards or legal agreements, states and counties may feel no obligation whatsoever to correct errors - and in fact, they may not see data dumps as errors at all.

Authority

Key question: Who should have the authority to edit errant data?

We are frequently asked, “if you know that there are data dumps or other data problems, why don’t you just try to fix them yourself?” There are two observations to take into account with respect to this question. First, as discussed above, we usually lack the information needed to make a correction confidently - we rarely know when dumped cases or deaths actually occurred. Second, any correction we make would cause our input data to diverge from the official, government-sourced record.

Perhaps the more important question is, “If Covidestim starts modifying input data to get rid of dumps and fill in nonreporting days, are they asserting that their state and county cases/deaths data are a more authoritative national record than the totality of government-sourced cases/deaths data?” Currently, we are not interested in taking this stance.

Lack of context

Key question: What are we not seeing when we look at cases and deaths data?

One of the consequences of a lack of standardization when it comes to COVID-19 cases/deaths data is that there is no consensus on how to report information that could best be described as metadata. To take a very simple example, most (all?) scientific programming languages have numeric types which allow for a concept of missingness. In R, the Fibonacci sequence with a missing first element would be expressed as:

c(NA, 1, 2, 3, 5, 8, ...)

But, consider a hypothetical Monday when a state fails to report because their data manager calls in sick. This will likely manifest as a 0 in the timeseries - a value indistinguishable from a day on which zero cases were genuinely recorded.
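In R, the difference between what gets reported and what arguably happened looks like this (hypothetical numbers):

# What we receive - the sick day is indistinguishable from a true zero
reported <- c(120, 135, 0, 142, 138)
mean(reported)              # 107: dragged down by a 0 that isn't real

# What the state presumably meant
actual <- c(120, 135, NA, 142, 138)
mean(actual, na.rm = TRUE)  # 133.75: the nonreporting day is excluded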

We see here that a lack of standardization leads to an environment where there is essentially no way to programmatically communicate information relevant to the interpretation of cases/deaths data - which further erodes our ability to confidently modify any input data we see as problematic. Notably, the healthcare industry has many standards focused on tackling exactly this problem of effectively communicating metadata. After all, a blood test may not be very useful if you can’t identify when the sample was taken, when the patient last ate, or when they last took any relevant medications.


In conclusion, COVID-19 data is messy - both in terms of problems within the datasets themselves, and in terms of how the entire ecosystem is structured. We currently feel that our most important duties are to inform users about how issues with the input data affect the model estimates, to make small tweaks in situations where there is an obvious nonreporting pattern, and to participate in the ongoing discussion about how this ecosystem may be improved going forward. If you have ideas or questions, don’t hesitate to reach out!