A collage of graphs and tables showing problems with COVID-19 data

Above: Whether you’re analyzing how frequently states report 0s by day of the week, or finding that some moving averages work better than others depending on reporting patterns, COVID-19 data issues are seemingly everywhere.

Problems with case and death data constitute the most exasperating class of issues we deal with at Covidestim. Naturally, any sufficiently large information-processing system will encounter data quality issues - and we aren’t an exception, with >1.6M rows of input data from 3100+ entities flowing into instances (runs) of the model every day. That being said, case and death data issues will continue to be a vexing problem for us. These issues lessen the accuracy of our estimates, lower user confidence, and present a weed-like array of challenges related to issue resolution. Today, I’ll discuss what the primary issues are when it comes to “bad” data, and talk about our current and future approach to each issue. But before we get to that, let’s briefly talk about how cases and deaths data travels from a state or county website to an instance of our model.

Data pipeline

The data pipeline for the Covidestim project

Every day, we retrieve timeseries data for each state and county, and pass this data to instances of our model (I say “instances” because the model runs independently for each state and county). To do this, we look to Johns Hopkins' Center for Systems Science and Engineering (CSSE). They have undertaken a mammoth effort to cobble together a national COVID-19 dataset through scraping dozens of state and county websites for the latest case and death numbers every day, and posting the results to GitHub. Thanks to their consistent reporting since early 2020, they have earned their place as the most credible US COVID-19 data provider. Additionally, they have become an important voice in the ongoing conversation about improving the quality of COVID-19 data, domestically and internationally.

So, at 2:00am every day, after CSSE has completed their scraping of county and state websites, our pipeline clones CSSE’s repository and performs some basic data cleaning, which involves little more than:

Making sure none of the case/death numbers are negative
Filtering out counties and territories we don’t want to run (usually because they only have a few weeks of data)
Supplanting early-2020 CSSE data with the Atlantic’s Covid Tracking Project data, when appropriate

That’s about it! The data is then split by county/state and passed to instances of the model, and then the results become available on the website every day. Being deliberately lightweight, this process makes no attempt to deal with most of the data issues we will discuss below. There’s a 7-day moving average inside the model, which helps smooth over aberrations in the case/death data, but as we will see, there are limits to what a moving average can do to highly flawed data. Now, let’s move on to the issues at hand:

Data-level issues

“Dumps” - big spikes in the cases or deaths data
Nonreporting - 0s for one or more days of cases or deaths data

System-level issues

Ownership - Who is responsible for the accuracy of COVID-19 timeseries data?
Authority - Who should have the authority to edit errant data?
Lack of context - What are we not seeing when we look at cases and deaths data?

Data-level issues

“Dumps”

An example of data dumps with a 7-day moving average

Above: Los Angeles County’s cases and deaths data

A data “dump” is when the cases or deaths on a particular day are really high due to reasons beyond the normal case/death reporting process. Here are a few kinds of “dumps” that happen:

Death reclassification: A state reviews its death records and reclassifies a bunch of, say, pneumonia deaths as being caused by COVID-19.
Test reclassification: A county adopts a broader definition of a case (for instance, the county starts accepting the results of at-home tests), and revises their cumulative case number to include all of the now-qualifying cases.
“Other”: For example, a state or county loses a bunch of tests, then later finds them. The state or county then updates their entire timeseries to reflect the days these tests were administered. They post the new timeseries to their website. But, CSSE’s scraper for that state or county may be programmed to only read the latest cumulative cases number off of that state or county’s dashboard, and append it to the end of CSSE’s timeseries file! Thus, the cumulative number of cases scraped by CSSE will register as a large increase from the number CSSE’s scraper saw 24 hours prior – even though the state or county correctly revised their data! Unfortunately, events like these actually happen.

There are many more variations on this, but you get the idea. Sometimes the county or state involved will announce what they have done, but sometimes they won’t, or they won’t until somebody starts asking questions.

What does this do to model results?

These dumps go straight into the model, where they meet a 7-day moving average, which will result in an averaged timeseries which has an abrupt shift upwards (as the dump enters the 7-day window) and then an abrupt shift downwards (as the dump exits the window). Look for the trellis shapes in the image above of Los Angeles' cases and deaths data:

These trellis shapes make the model less accurate. For example, if there is a large “dump” of death data, the model will generally:

Estimate an infections timeseries which includes a sufficient number of infections at the right points in time to result in the observed number of “dump”-ed deaths. So, the number of infections may be somewhat accurate, but the timing will be off - those infections likely occurred earlier in time. Of course, this will also affect estimates of $R_t$, likely biasing them to be high if the “dump” is only a few days old.
Modify assumptions about how fatal COVID-19 is, and how likely it is for infected individuals to get tested (generally, the degree of modification is small, but it depends on the size and timing of the “dump”)

Neither of these outcomes is good! However, Covidestim’s options to intervene are limited: excising the “dump” entirely would throw out valuable information about the cumulative number of cases or deaths, which then makes the model less accurate. Furthermore, “back-distributing” the “dump” – spreading it out across the previous month, for instance – may introduce yet more problems if we perform a “back-distribution” which isn’t in line with when these “dump”-ed cases or deaths actually occurred. Remember, we generally have little to no information to use to “back-distribute” appropriately.

What will Covidestim do in the future about this?

We’re working on systems to flag these dumps, both to make users aware when recent estimates may be unreliable, and to enhance our own awareness of the scope of the issue.

Nonreporting

New Haven, CT case data, showing weekend and holiday nonreporting

Above: New Haven, CT’s case data. Orange dots are weekends where no data was reported, and green dots are dumps of cases which accumulated during the nonreporting period. Note the extra days of nonreporting around Thanksgiving and Christmas.

Nonreporting is when a county or state doesn’t report case or death data for one or more days. This happens because:

It’s the weekend, and the state/county in question doesn’t report on weekends
It’s Christmas/Thanksgiving/some other holiday
Their system broke, or there was some other problem which prevented reporting
They’ve decided to report only once a week
They’ve decided to stop reporting, period

We’ve seen all of these - including #5, sadly. There are two primary problems with nonreporting:

First, nonreporting represents a delay of information. When we have to wait a week to see the latest cases/deaths data (Florida), we can’t perform near-present-day estimation as well because we are seeing (in the worst case) week-old information.

Second, nonreporting is generally followed by some kind of spike, as cases or deaths which accumulated during the nonreporting period are added to the timeseries. This creates weird jumps in the 7-day moving average, because the moving average declines on the days where 0s are reported, then generally increases. This reduces the accuracy of near-present-day estimates, but isn’t an issue further back in the timeseries where this phenomenon has less effect on gross infection trends.

What will Covidestim do in the future about this?

Soon, we plan to deal with mild nonreporting (weekends, for example) by distributing Monday’s cases between Saturday, Sunday and Monday. With extreme cases of nonreporting, such as Florida’s counties no longer reporting deaths, we tell the model to ignore their death timeseries past a certain date. Obviously, this reduces model accuracy, but it specifically muddles the relationship between cases and deaths. This decreases the quality of inference the model can make with respect to case ascertainment.

System-level issues - ownership, authority, context

So far, we’ve discussed specific problems in the data – problems that manifest to the eye and have easily-describable effects on our model estimates. However, it’s also worth delving into some of the broader phenomena at play here, which apply to nearly every entity that currently participates in the COVID-19 data ecosystem.

Ownership

A bunch of GitHub issues, seen through an API call

Above: A sample of GitHub issues filed on CSSE’s GitHub Issues page

Key question: Who is responsible for the accuracy of COVID-19 timeseries data?

The stewardship of case and death information from its origin to its dissemination to the public is unclear. If there are issues with the data, who is supposed to fix them? Usually, it’s an open question. The result of this is that issues originating at one level of the system bleed through multiple levels. This leaves organizations like ours, which sit high on the “food chain” when it comes to COVID-19 data, in a bind when we see bad data affecting our estimates. Here’s a typical situation to illustrate:

Step 1: A private citizen creates a GitHub Issue, recognizing that a state has dumped a bunch of cases into the most recent day of timeseries data.

A few observations:

CSSE’s GitHub Issues page has become the de facto clearinghouse for unresolved U.S. COVID-19 data issues. However, CSSE doesn’t make COVID-19 data - they report it! Already, we are “knocking on the wrong door,” so to speak.
The state has found it appropriate to dump the cases onto one day, instead of revise their timeseries. This is because there exists no nationally-standardized way for how to report COVID-19 cases, so the state is technically doing nothing wrong here.
CSSE may not have been aware, up until this point, that there was a problem. So, are they just reporting whatever they scrape? Do they do any data cleaning? Do they have an alerting system? Because their system is closed-source and they have not spoken publicly on this, we cannot know.

Step 2: CSSE acknowledges the issue. In this case, CSSE chooses to reach out to the state, and ask for a revised timeseries.

This is an important step towards resolution, but we again see muddled responsibility:

CSSE was under no obligation to acknowledge the issue, or take any action - indeed, they make no guarantees about the quality of any of the information they report. It’s unfair to expect them to chase every lead down. There are several GitHub issues like this filed every week.
If nobody had reached out, would the state ever correct the dump? The dump might be cemented into the timeseries forever. Does a state’s responsibility to curate a day of timeseries data end as soon as that day of data first appears on their website?

Step 3: The state publishes a new timeseries with the dumped cases spread out correctly.

The issue at hand is now solved - likely several days later - but we still see that:

The state might never have done this if CSSE hadn’t reached out. What if only private citizens had reached out? Has CSSE, for better or worse, become the “enforcer” for the domestic modeling community?
If the state had fixed the dump on their own, would CSSE’s scrapers have ingested the entire revised timeseries, or just the most recent day?

We see here that this example problem eventually got solved. I’m happy to say that many similar situations have had similar outcomes in real life. However, look up nearly any state or large county’s cases/deaths timeseries and you will spot obvious imperfections. Without a mandate to “own” their data, in the form of reporting standards or legal agreements, states and counties may feel no obligation whatsoever to correct errors - and in fact, they may not see data dumps as an error at all.

Authority

Key question: Who should have the authority to edit errant data?

We are frequently asked, “if you know that there are data dumps or other data problems, why don’t you just try to fix them yourself?” There are two observations to take into account with respect to this question:

At 1.6M+ rows in the input dataset, there are too many issues with cases/deaths data to manually investigate and address every issue. Thus, choosing to correct specific issues amounts to a shotgun-style approach which leaves all the other areas with data issues hanging. And of course, there’s a slippery slope argument in there too.
Every day, input data changes - and not always to simply add the latest day of data to the tail of the dataset. So, making manual (or even automated) edits amounts to creating a fork of the dataset, which then needs to be reconciled with CSSE’s latest dataset every night (automatically). There are a number of pitfalls here which would leave you with incorrect or corrupt data.

Perhaps the more important question is, “If Covidestim starts modifying input data to get rid of dumps and fill in nonreporting days, are they asserting that their state and county cases/deaths data are a more authoritative national record than the totality of government-sourced cases/deaths data?” Currently, we are not interested in taking this stance.

Lack of context

Key question: What are we not seeing when we look at cases and deaths data?

One of the consequences of a lack of standardization when it comes to COVID-19 cases/deaths data is that there is no consensus on how to report information that could best be described as metadata. To take a very simple example, most (all?) scientific programming languages have numeric types which allow for a concept of missingness. In R, the Fibonacci sequence with a missing first element would be expressed as:

c(NA, 1, 2, 3, 5, 8, ...)

But, consider a hypothetical Monday when a state fails to report because their data manager calls in sick. This will likely manifest as:

The cumulative cases number on their dashboard not changing (unhelpful)
A 0 being reported for that day (some NA-like identifier would be preferable, especially when there are other 0s nearby which actually represent days with zero cases or deaths)
A special announcement on their dashboard (which an automated scraper will almost definitely fail to notice)

We see here that a lack of standardization leads to an environment where there is essentially no way to programatically communicate information relevant to the interpretation of cases/deaths data - which further erodes our ability to confidently modify any input data we see as problematic. Notably, in the healthcare industry, there are many standards that are focused on tackling exactly this problem of effectively communicating metadata. After all, a blood test may not be very useful if you can’t identify when the sample was taken, when the patient ate, or when they last took any relevant medications.

In conclusion, COVID-19 data is messy - both in terms of problems in the datasets, and in terms of how the entire ecosystem is structured. We currently feel that our most important duty is to inform users about how issues with the input data are affecting the model estimates, make small tweaks in situations where there is an obvious nonreporting pattern, and participate in the ongoing discussion about how this ecosystem may be improved going forward. If you have ideas or questions, don’t hesitate to reach out!

Navigating the COVID-19 data jungle

Marcus Russi

2021/09/08

Data pipeline

Data-level issues

System-level issues

Data-level issues

“Dumps”

Nonreporting

System-level issues - ownership, authority, context

Ownership

Authority

Lack of context