When the first COVID-19 case was discovered in the United States of America (USA) in early march, many believed that the disease would never spread in this country. However, the number of cases have spiked exponentially to 6.5 million and continue to grow each and every day. Despite the existence of many other Machine Learning models out there, Facebook Prophet seems to be one of the most effective models in predicting/forecasting future COVID-19 cases. Most importantly, it automates many of the parameters that require tuning in a time series model. By the end of the article, you will be able to understand what the Facebook Prophet does, and use it to build your own Machine Learning models.
In normal data analysis, the order of observations that are collected is usually irrelevant. However, most of the data collected in the real world has a certain order built into it. This structure can be exploited for better prediction ability by employing a Time Series Model. The first step in analyzing a time series model is identifying and understanding the built-in structure of the data over time. These underlying patterns can be classified into four distinct categories:
$$\text{Figure 1: Time Series Components}$$
The second step is to check the underlying assumptions:
Once these assumptions are met (or not too badly violated), we can start building our Time Series Model. The model description is as follows:
Let's look at a First Order Autoregressive Model (AR(1)) Model:
Let's look at some Forecasting Terminology and Notation:
The Facebook Prophet model is an accurate, fast modular regression model that often performs really well with just the default parameters. The python library, fbprophet, comes with mulitiple functionalities. Analysts, often, have to select the components that are relevant to their particular problem. We are going to modify a few of the existing components to fit our needs for this COVID-19 study.
$$\text{Figure 2: fbprophet functions in Python}$$
We shall give a brief introduction to each and every component in this graphic, before we apply the model to our analysis.
The main model equation (broken into individual components) is:
Using time as a regressor, Prophet is trying to fit several linear and non-linear functions of time as components. Modeling seasonality as an additive component is the same approach taken by exponential smoothing in Holt-Winters Technique. Prophet is framing the forecasting problem as a curve-fitting exercise rather than looking explicitly at the time based dependence of each observation within a time series.
Referring back to Figure 2, we can see that the Facebook Prophet model can utilize Markov Chain Monte Carlo (MCMC) to do the uncertainty interval estimation (error terms). This method is used for a Bayesian Statistical approach, but we are not going to apply that to our study. The default uncertainty interval is $80\%$. However, we chose to change that to $95\%$, because that is a commonly used significance level. We used a linear trend (default) for the United States Data because logistic trends are usually reserved for binary data.
Additionally, one advantage of Facebook Prophet is that it automatically discovers changes in trends, and tries to account for it in your analysis. We are not going to make any changes to the default functionality in this part. Facebook Prophet also has built in holidays, and you can input your own holidays, but we chose not to use that for our study. We also didn't use the parameter in estimation to control the impact of the holidays.
!ls
!pwd
import numpy as np
import pandas as pd
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, plot_components_plotly
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
from fbprophet.plot import plot_cross_validation_metric
stay_at_home_df = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
stay_at_home_df.head()
stay_at_home_df.columns
cleaned_stay_at_home_df = stay_at_home_df[['location','date','total_cases','total_tests','stringency_index','positive_rate','total_deaths']]
cleaned_stay_at_home_df
cleaned_stay_at_home_df['cases_per_tests_ratio'] = cleaned_stay_at_home_df['total_cases']/cleaned_stay_at_home_df['total_tests']
cleaned_stay_at_home_df
cleaned_stay_at_home_df.dtypes
groupby_location_df = cleaned_stay_at_home_df.groupby('location',as_index=False)
separated_location_df = dict(iter(groupby_location_df))
us_covid_df = separated_location_df['United States']
def plot_metrics(forecast_cv,location, imgname):
mse_plot = plot_cross_validation_metric(forecast_cv, metric = 'mse')
mse_plot.suptitle(f'{location}_mse_{imgname}',y=0.95)
mse_plot.savefig(f'./Seasonality_Comparison/{location}_mse_{imgname}')
rmse_plot =plot_cross_validation_metric(forecast_cv, metric = 'rmse')
rmse_plot.suptitle(f'{location}_rmse_{imgname}',y=0.95)
rmse_plot.savefig(f'./Seasonality_Comparison/{location}_rmse_{imgname}')
mae_plot = plot_cross_validation_metric(forecast_cv, metric = 'mae')
mae_plot.suptitle(f'{location}_mae_{imgname}',y=0.95)
mae_plot.savefig(f'./Seasonality_Comparison/{location}_mae_{imgname}')
mape_plot = plot_cross_validation_metric(forecast_cv, metric = 'mape')
mae_plot.suptitle(f'{location}_mape_{imgname}',y=0.95)
mae_plot.savefig(f'./Seasonality_Comparison/{location}_mape_{imgname}')
def plot_forecast(forecast,m,location,img_name):
forecastplot = m.plot(forecast)
forecastplot.suptitle(f'{location}_forecast_{img_name}',y=0.95)
forecastplot.savefig(f'./Seasonality_Comparison/{location}_forecast_{img_name}')
plot_components = m.plot_components(forecast)
plot_components.suptitle(f'{location}_components_{img_name}',y=0.95)
plot_components.savefig(f'./Seasonality_Comparison/{location}_components_{img_name}')
def prophet_predicts(location_df,location, yearly_seasonality, weekly_seasonality):
print(location_df.columns)
num_rows = str(location_df.shape[0] - 31) + ' days'
prophet_df = location_df[['date','total_cases']]
prophet_df = prophet_df.rename(columns={'date':'ds','total_cases': 'y'})
m = Prophet(interval_width = 0.95, yearly_seasonality = yearly_seasonality, weekly_seasonality = weekly_seasonality)
m.fit(prophet_df)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
forecast_cv = cross_validation(m, initial = num_rows, period = '30 days', horizon = '30 days' ) # of rows - 31 days
img_name = 'daily.png'
if(yearly_seasonality):
img_name = 'yearly_' + img_name
if(weekly_seasonality):
img_name = 'weekly_' + img_name
plot_forecast(forecast,m, location, img_name)
plot_metrics(forecast_cv, location, img_name)
prophet_predicts(us_covid_df, 'United States', yearly_seasonality = True, weekly_seasonality = False)
prophet_predicts(us_covid_df, 'United States', yearly_seasonality = False, weekly_seasonality = False)
prophet_predicts(us_covid_df, 'United States', yearly_seasonality = False, weekly_seasonality = True)
prophet_predicts(us_covid_df, 'United States', yearly_seasonality = True, weekly_seasonality = True)
In our study, we tested four different model types within Facebook Prophet. Each model type had a different seasonality parameter. Facebook Prophet always has daily seasonality by default, but we added on yearly, and weekly seasonlity from our end. We shall discuss the forecast graphs, and the model accuracies given by varying seasonality types.
$$\text{Figure 3: a) Yearly & Daily Seasonality. b) Daily Seasonality. c) Weekly & Daily Seasonality. d) Yearly, Weekly & Daily Seasonality.}$$
Model Type | Model Accuracy |
---|---|
Yearly & Daily Seasonality | $95.5585\%$ |
Daily Seasonality | $93.5955\%$ |
Weekly & Daily Seasonality | $93.5983\%$ |
Yearly, Weekly & Daily Seasonality | $94.5314\%$ |
From these results, we can clearly see that Yearly and Daily Seasonality performs marginally better than all the other models. However, even a 4-5% error rate is a lot when you consider the scale of our data. We are overestimating the number of COVID-19 cases in all the models. Our models are predicting 7.8 million cases by the beginning of October. We are currenlty at 7.4 million cases. This phenomenon might be occuring because the model hasn't detected the change in trend because it is a recent event. We believe that more training values will cause the models to detect the change and give more accurate predictions. Our next articles will cover the ARIMA Time Series Model, and the LSTM Neural Network Model.