Published by Weisser Zwerg Blog on
Some data analysis in python around the covid-19 data (including survival analysis with Kaplan-Meier).
I’ve started to follow the development of the corona virus numbers in Europe and especially in Germany some time ago and created a few jupyter notebooks for my own use. I am in no way a medical expert and there may be mistakes in my analysis done, but I publish the notebooks on github and if you spot mistakes then please submit a pull-requests or a bug-report.
It seems that there are currently data-changes in progress that cause issues. This is mostly an issue for the US data. Once these data-changes are finalized I’ll adapt the code, but until then the US data as used in this notebook is currently wrong.
My analysis predates the article Zahlen, bitte! 3,4 % Coronavirus-Fallsterblichkeit, eine “false Number”? Etwas Pandemie-Statistik, but the author summarizes (in german) the issues with the numbers that are floating around very well. The same reasoning that the author of that article outlines lead me to create the analysis notebooks based on raw numbers. In this blog-post I use the terminology that the author uses:
|Confirmed Case Fatility Rate or Ratio||CFR||death / number of cases with confirmed laboratory results|
|Symptomatic CFR||death / estimation of symptomatic diagnosed disease cases|
|Infection Fatility Rate or Ratio||IFR||death / estimation of infected subjects|
|Crude CFR||(CFR)||Confirmed CFR estimated from the primitive: death / confirmed cases|
|Alternative CFR||CFRb||Confirmed CFR estimated alternatively from: death / (recovered + death)|
|Kaplan-Meier-CFR||CFRkm||CFR estimated via Kaplan-Meier; this estimate is taking the dynamics of the epidemic into account|
|Mortality Rate||death / overall population|
|Overall CFR or IFR||averaged over all aspects like age, sex, prior health history, …|
A good picture representation of these numbers is given in 2019-nCoV: preliminary estimates of the confirmed-case-fatality-ratio and infection-fatality-ratio, and initial pandemic risk assessment
The crude CFR and the alternative CFR (CFRb) are easy to estimate and you can find numbers at the bottom of covid-19-data-analysis.ipynb.
In the table the crude CFR is called
death_rate and the CFRb is called
death_rate_. The crude CFR is under estimating the real fatality rate,
because some of the currently infected will die. The CFRb is over-estimating the real fatality rate, because after let’s say 7 days some of the subjects died, but most of them will
recover, but are not yet counted as recovered.
For that purpose survival analysis was invented. In order to properly use survival analysis we’d need the “life-lines” of the subjects, which we don’t have. We only have aggregate numbers and we don’t know which subjects exactly died when in aggregate new death are reported. In addition we don’t know the exact date when each of the new confirmed subjects actually got infected. We only know the date when the laboratory test confirmed the infection.
For that purpose we have to artificially create the life-lines with reasonable assumptions. In case we recognize that some of our assumptions are incorrect we can modify them in the notebook to adapt as needed. You can follow the process in the covid-19-survival.ipynb notebook.
First of all I consider the data only as valid from a certain date onwards. The authorities need some time to get up-to-speed with the developing situation.
That date is called
first_date. I typically use the day of the first dead person. I guess that from that day on the authorities are in full alarm state.
In case that at
first_date the number of known infected persons is less than 1000 I don’t think that this number is correct (mainly to the same reasons as mentioned above,
that the authorities are not yet fully up-to-speed) and I simply add to all known confirmed cases the difference between 1000 and the number of known infected persons on that day.
This parameter is called
I also have the parameter
mult available to simulate what would be the result of the analysis in case that a certain number of infected persons remain undiscovered.
This is the multiplier I multiply the confirmed cases with after adding the
init_add number. In code this is:
in_df.loc[:,'confirmed'] = (in_df['confirmed'] + init_add) * mult
Once this pre-processing is done I prepend the data-frame with a constant number of daily cases for 11 days. And then I prepend this data-frame again with another 11 days of
daily constant new cases. This is meant to simulate the ramp up from zero to
Based on this data I then artificially create the life-lines for the daily cases. I simply ignore the fact that subjects got infected before the day they get reported.
If it should turn out to be an important aspect of the problem then this can be changed. I use the day when subjects get reported as
new_confirmed as the start date of
new_death I select the life-lines from the day before until roughly 4 weeks in the past to select randomly some life-lines that will end (simulated death). I use
a time window of 4 weeks, because I suspect that once you’re past the 4 week period you’re recovered. When selecting randomly life-lines that will end I use a
gamma distribution on the age of the life-line with parameters:
I use that mean, because I read somewhere that the mean (or median) of the patients time until death after hospitalization is 7 days and I guess that they are not hospitalized
already on day 1. The
gamma_loc of 2 days (shifts the gamme distribution right by 2 days; causes the probability of death to be 0 for that time period) also makes sure that
nobody dies in the first two days after being diagnosed. The notebook
contains a plot of the distribution (here only a static picture).
In code this is:
ds_age = (dt - ldf.start_date).dt.days
distribution = stats.gamma(gamma_k, loc=2.0, scale=gamme_theta).pdf(ds_age)
distribution = distribution / distribution.sum()
death_indices = np.random.choice(len(ldf), new_death, replace=False, p=distribution)
If you do that then you get a CFRkm for China of 3.95%, which matches very well the 3.9% of the crude CFR. Once you have large enough numbers Kaplan-Meier does not change the result much more.
South Korea is the other country where the situation has already a longer history. For South Korea I took as
first_date 2020-02-20 and as
init_add 900. I did not use a
mult parameter, because I read that South Korea is extensively testing and therefore I guess that the number of reported cases is quite close to the number of real cases.
If you do that then I get a CFRkm for South Korea of 1.16%, which is higher than the 0.7 as estimated by the crude CFR. If you would not use the
init_add parameter the
CFRkm would be 1.47%.
The results in the notebook are typically reported like
(1.47, 0.97, 2.24) (1.16, 0.86, 1.56) 8413
The first tripple belongs to the CFRkm estimate with
mult=1.0. The first number is the mean value estimate and the other two numbers are the lower and upper
confidence interval values.
The second tripple belongs to the CFRkm estimate with
mult in use.
The value on the third line is the number of confirmed cases that would result with
mult in use.
Let’s see how that works for Italy. For Italy I used
mult=5.0. I came up with these values by playing around with the numbers
and seeing when the resulting CFRkm seems to become “reasonable”. The results are as follows:
(27.25, 24.57, 30.16) (3.1, 2.85, 3.38) 60745
The “raw” CFRkm without these
mult adaptations would be 27.25%. As this is the same disease as in China and Korea and the fatality rate should be in the same
ballpark it is more likely that the discovery of cases is hopelessly behind the real development in Italy. If I use
mult the estimate gets into the same
ballpark as for China (a bit better as I guess that the Italian health care system is a bit better than China). The consequence of this is that most likely the current number for
confirmed cases in Italy is off by a factor of 5.0! And we more likely have 61’000 cases rather than the 10’149 cases as reported by the official numbers.
For France we have a similar picture, but not as extreme as in Italy:
(12.53, 8.31, 18.66) (2.11, 1.48, 3.01) 4568
That would suggest that we have something like 4’500 cases rather than the officially reported 1’784.
In 2019-nCoV: preliminary estimates of the confirmed-case-fatality-ratio and infection-fatality-ratio, and initial pandemic risk assessment the estimate of the infection-fatality-ratio (IFR) is 0.94 (0.37, 2.9) percent. The data they used can be found here.
In Coronavirus (COVID-19) Mortality Rate the reported number is 3.4%, but I cannot tell for sure if it is crude CFR or IFR or … The section How to calculate the mortality rate during an outbreak references Methods for Estimating the Case Fatality Ratio for a Novel, Emerging Infectious Disease.
The MIDAS 2019 Novel Coronavirus Repository repository on github contains more references to data and estimates.