How we predicted COVID-19 cases with 99% accuracy

October 7, 2020

Tech

This is the second article in our series about COVID-19 case predictions using time series and machine learning models. The first article is linked here. The goal is to find a model that forecasts total COVID-19 cases for the next 30 days. We do this for the United States, comparing the predictive performance of ETS and ARIMA models built in Alteryx. The ARIMA model achieved 99% prediction accuracy, well above the roughly 95% we previously obtained with the Facebook Prophet model.

A time series dataset has the following characteristics:

The data is continuous over a long period of time
The data is in sequential order
Every consecutive pair of points is one day apart
There is at most one value per date
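As a minimal sketch, the ordering, spacing, and uniqueness conditions above can be checked in a few lines of Python. The function name and the sample dates are hypothetical and are not part of our Alteryx workflow.

```python
from datetime import date, timedelta


def is_daily_series(dates):
    """Return True if every date is exactly one day after the previous one.

    This single check covers sequential order, one-day spacing,
    and at most one value per date.
    """
    one_day = timedelta(days=1)
    return all(later - earlier == one_day for earlier, later in zip(dates, dates[1:]))


# Hypothetical sample dates, not the article's data.
good = [date(2020, 3, 1) + timedelta(days=i) for i in range(5)]
gap = good[:2] + good[3:]          # one day is missing
duplicate = good[:2] + good[1:]    # one date is listed twice

print(is_daily_series(good), is_daily_series(gap), is_daily_series(duplicate))
```

A series that fails this check would need its gaps filled or its duplicates resolved before fitting ETS or ARIMA models.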

Because we want a forecast for the following 30 days, we hold out the most recent samples to validate the models.
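A holdout split of this kind might look like the sketch below. The holdout length of 30 is an assumption chosen to match the forecast horizon, and the function name and sample values are hypothetical.

```python
def split_holdout(values, horizon=30):
    """Split a series into a training part and the final `horizon` observations."""
    if not 0 < horizon < len(values):
        raise ValueError("horizon must be positive and shorter than the series")
    return values[:-horizon], values[-horizon:]


# Hypothetical cumulative counts, not the article's data.
cumulative_cases = list(range(100, 300))
train, holdout = split_holdout(cumulative_cases, horizon=30)
print(len(train), len(holdout))
```

The split is chronological rather than random, because shuffling would leak future values into the training data.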

Figure 1 shows a clear upward trend. We cannot tell from the plot below whether there is a seasonal pattern, so we examine this further in the decomposition plot. There does not appear to be any cyclical pattern in the data.

Figure 1: General Time Series Plot (COVID-19 Cases vs. Date)

Figure 2, the decomposition plot, confirms the upward trend and also reveals a seasonal pattern. Because the series is seasonal, the ARIMA model should include a seasonal difference. For the ETS model, the magnitude of the seasonal component changes over time, so we consider a multiplicative seasonal method while still comparing it against the additive one.

Finally, the error component does not stay consistent across the time series. A multiplicative error method would therefore seem the best fit for the ETS model, but we still compare it against the additive method.
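For reference, the additive and multiplicative choices discussed above correspond to two ways of combining the components of the series. The symbols are our own notation for this sketch: y_t is the observed value, T_t the trend, S_t the seasonal component, and E_t the error.

```latex
\begin{align*}
y_t &= T_t + S_t + E_t            && \text{(additive)} \\
y_t &= T_t \times S_t \times E_t  && \text{(multiplicative)}
\end{align*}
```

In the additive form the seasonal swings and errors keep the same size as the level grows. In the multiplicative form they scale with the level, which is why a changing magnitude in the decomposition plot points toward the multiplicative option.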

ETS MODEL

As noted earlier, we first considered multiplicative methods for the error and seasonality with an additive method for the trend. That model produced an extremely high error, so we compared it against a model that uses additive methods for all three components.

This results in an ETS(A, A, A) model.
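The point forecasts of an ETS(A, A, A) model follow the additive Holt-Winters recursions, which can be sketched as below. This is an illustration only: the smoothing parameters, initial states, and data are hypothetical, whereas Alteryx estimates the parameters from the data.

```python
def holt_winters_additive(y, level, trend, seasonals, alpha, beta, gamma, horizon):
    """Run the additive Holt-Winters recursions and return point forecasts.

    `seasonals` holds one seasonal term per position in the cycle, aligned so
    that seasonals[t % m] applies to observation y[t].
    """
    m = len(seasonals)
    seasonals = list(seasonals)  # work on a copy
    for t, obs in enumerate(y):
        prev_level, prev_trend, prev_season = level, trend, seasonals[t % m]
        # Level: deseasonalised observation blended with the one-step projection.
        level = alpha * (obs - prev_season) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: latest change in level blended with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Seasonal: detrended observation blended with last cycle's seasonal term.
        seasonals[t % m] = gamma * (obs - prev_level - prev_trend) + (1 - gamma) * prev_season
    n = len(y)
    return [level + h * trend + seasonals[(n + h - 1) % m] for h in range(1, horizon + 1)]


# Hypothetical series: linear trend plus a weekly pattern that sums to zero.
pattern = [3.0, -1.0, -2.0, 0.0, 1.0, -3.0, 2.0]
observed = [100.0 + 5.0 * (t + 1) + pattern[t % 7] for t in range(28)]
forecasts = holt_winters_additive(
    observed, level=100.0, trend=5.0, seasonals=pattern,
    alpha=0.5, beta=0.1, gamma=0.1, horizon=7,
)
print(forecasts)
```

Because the sample series is exactly additive and the initial states match it, the forecasts simply continue the trend and weekly pattern. Real data would leave residual error at every step.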

ARIMA MODEL

Based on the seasonality found in our previous analysis, we will use a seasonal ARIMA(p, d, q)(P, D, Q)S model to forecast, where S is the seasonal period.

Time Series ACF and PACF:

The ACF shows the autocorrelation decreasing at a slow, steady pace, which indicates that the series is not stationary. It would be wise to take the seasonal difference of the series.
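To make the slow decay concrete, here is a minimal sketch of the sample autocorrelation that an ACF plot displays. The function name and the sample series are hypothetical. A pure upward trend stands in for the case counts.

```python
def autocorrelation(values, max_lag):
    """Sample autocorrelation of `values` for lags 0 through max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical trending series, not the article's data.
trend_series = [float(t) for t in range(100)]
acf = autocorrelation(trend_series, max_lag=14)
print([round(r, 3) for r in acf])
```

For a trending series like this one, the values start at 1 and fall off only gradually, which is the pattern that suggests differencing.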

Figure 3: Autocorrelation Plots (without the seasonal difference)

Seasonal Difference ACF and PACF:

The ACF and PACF look similar to the initial plots without differencing, except that the correlation has decreased. We will take another difference to remove the remaining correlation.
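The seasonal difference and the follow-up difference can be sketched as follows. The seasonal period of 7 days is an assumption at this point in the analysis, and the sample data is hypothetical.

```python
def difference(values, lag=1):
    """Return values[t] - values[t - lag] for every t where both exist."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# Hypothetical series: linear growth plus a weekend bump, not the article's data.
weekly = [0, 0, 0, 0, 0, 40, 60]
daily_counts = [10 * t + weekly[t % 7] for t in range(28)]

seasonal_diff = difference(daily_counts, lag=7)    # removes the weekly pattern
first_of_seasonal = difference(seasonal_diff, lag=1)  # removes the remaining trend
print(seasonal_diff[:3], first_of_seasonal[:3])
```

Each difference shortens the series by its lag, so repeated differencing leaves fewer observations to fit the model on.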

Figure 4: Autocorrelation Plots (Seasonal Difference)

Seasonal First Difference ACF and PACF:

The ACF and PACF have now started decaying towards 0. We will take another difference.

Figure 5: Autocorrelation Plots (Seasonal First Difference)

Seasonal Second Difference ACF and PACF:

The correlation continues to decay, so taking another difference would be wise.

Figure 6: Autocorrelation Plots (Seasonal Second Difference)

Seasonal Third Difference ACF and PACF:

The correlation continues to decay; hence we will consider taking one more difference.

Figure 7: Autocorrelation Plots (Seasonal Third Difference)

Seasonal Fourth Difference ACF and PACF:

Although the correlation was decreasing, we can see how it also started increasing again towards the center of the ACF plot.

Figure 8: Autocorrelation Plots (Seasonal Fourth Difference)

Given that we could not settle on the terms for the ARIMA model from these plots, we instead allowed the program to choose the parameter values. This resulted in: ARIMA(0, 2, 1)(0, 0, 4)[7]
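Written out with the backshift operator B (where B y_t = y_{t-1}), an ARIMA(0, 2, 1)(0, 0, 4)[7] model has the form below. This sketch assumes no constant term and uses the sign convention in which moving-average terms are added. The coefficient symbols are our own notation, and their fitted values are not listed in this post.

```latex
(1 - B)^2 \, y_t
  = (1 + \theta_1 B)\,
    \left(1 + \Theta_1 B^{7} + \Theta_2 B^{14} + \Theta_3 B^{21} + \Theta_4 B^{28}\right)
    \varepsilon_t
```

In words: the series is differenced twice (d = 2), with no seasonal differencing (D = 0), and the result is modeled with one non-seasonal moving-average term and four seasonal moving-average terms at multiples of 7 days.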

Now, we will look at the in-sample errors to provide a closer look at the model accuracy.

The model has an in-sample RMSE of 4,081 cases and an MAE of 2,541 cases. The AIC and BIC are 3676 and 3696, respectively.
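For readers less familiar with these error measures, here is a minimal sketch of how RMSE and MAE are computed from actual and fitted values. The numbers are hypothetical and are not our model output.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, not the article's data.
actual_cases = [100, 200, 300, 400]
fitted_cases = [110, 190, 300, 420]
print(rmse(actual_cases, fitted_cases), mae(actual_cases, fitted_cases))
```

RMSE is never smaller than MAE, and a wide gap between the two suggests that a few large errors dominate.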

MODEL COMPARISON

The in-sample RMSE, MAE, AIC, and BIC values are all smaller for the ARIMA model than for the ETS model. The comparison below gives the same result: the error is smaller for the ARIMA model.

Therefore, we will use the ARIMA model for forecasting.

Figure 9: 30-Day Forecast Graph with 80% Confidence Band (Shaded Light Blue Area) and 95% Confidence Band (Dotted Blue Lines)

Our model accuracy turned out to be 99.6606%, which is much higher than the accuracy we achieved with our best Facebook Prophet model (around 95%). The ARIMA model appears to be more powerful than the Facebook Prophet model in this study, but we have to be cautious because we are using less than a year’s worth of data to predict 30 days’ worth of COVID-19 cases. Thus, our confidence intervals get wider the further we go into our forecast.
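This post does not spell out how the accuracy percentage is calculated. One common convention, assumed here purely for illustration, is 100% minus the mean absolute percentage error (MAPE) on the held-out days. The function names and numbers below are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def accuracy_from_mape(actual, predicted):
    """Accuracy under the assumed convention: 100% minus MAPE."""
    return 100 - mape(actual, predicted)


# Hypothetical values, not the article's data.
held_out = [1000, 2000, 4000]
forecast = [990, 2020, 4000]
print(accuracy_from_mape(held_out, forecast))
```

Under this convention, very high accuracy is easier to reach on cumulative totals, because each day's total is large relative to the daily change.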

Feel free to connect with us on LinkedIn, and stay tuned for the next article in our series, where we will explore predicting COVID-19 cases with LSTM.

Yukon Peng https://www.linkedin.com/in/yukpeng/

Mario Gonzalez https://www.linkedin.com/in/mag93/

Bhanu Garg https://www.linkedin.com/in/bhanu-garg-084bb5102/

Nathan Blackmon https://www.linkedin.com/in/nathan-blackmon-3b917219b/
