8  Time series analysis

Time series analysis (TSA) is a set of statistical methods for describing data recorded in successive order over a period of time and for forecasting its future values from its history. The statsmodels.tsa package contains model classes and functions that are useful for time series analysis.

8.1 Models

  • AutoRegressive model (AR): a linear model in which the current value is a constant plus a weighted sum of past values plus a random error. We denote it as AR(p), where "p" is the order of the model, i.e. the number of lagged values to include. p can be determined from the PACF plot. For p=1: \[ X_{t} = C + \phi_{1}X_{t-1} + \varepsilon_{t}, \] where \(C\) is a constant and \(\varepsilon_{t}\) is the random error (white noise). For a stationary series the coefficient \(\phi_{1}\) lies strictly between -1 and 1; multiplied by the previous value, it gives the part of that value that carries over into the current period. Choose an AR model if you believe that previous observations have a direct effect on the time series.

  • Moving Average (MA): a linear model in which the current value is the series mean plus a weighted sum of past forecast errors. It should not be confused with moving-average smoothing, which averages the observations inside a sliding window. We denote it as MA(q), where "q" is the order of the model, i.e. the number of past forecast errors to include. q can be determined from the ACF plot. Choose an MA model if you believe that past errors have a direct effect on the time series.

  • AutoRegressive Moving Average (ARMA): combines AR(p) and MA(q) terms; parameters p, q

  • AutoRegressive Integrated Moving Average (ARIMA): parameters p, d, q, where d is the order of differencing

  • AutoRegressive Moving Average with eXogenous factors (ARMAX): exogenous variables are external data (external effects) used as additional predictors in the forecast

  • Seasonal AutoRegressive Integrated Moving Average (SARIMA): parameters p, d, q, P, D, Q, m, where m is the number of time steps in a single seasonal period, p, d, q are the trend elements and P, D, Q are the seasonal elements

  • Seasonal AutoRegressive Integrated Moving Average with eXogenous factors (SARIMAX): SARIMA extended with exogenous variables
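
To make the AR(1) equation above concrete, here is a minimal sketch in plain Python (no statsmodels). It simulates \(X_{t} = C + \phi_{1}X_{t-1} + \varepsilon_{t}\) and then recovers \(C\) and \(\phi_{1}\) by least squares. The function names simulate_ar1 and fit_ar1, the parameter values and the Gaussian noise are assumptions made for illustration only.

```python
import random


def simulate_ar1(c, phi, n, sigma=1.0, seed=0):
    """Simulate X_t = c + phi * X_{t-1} + eps_t with Gaussian noise eps_t."""
    rng = random.Random(seed)
    # Start at the stationary mean c / (1 - phi); this requires |phi| < 1.
    x = [c / (1.0 - phi)]
    for _ in range(n - 1):
        x.append(c + phi * x[-1] + rng.gauss(0.0, sigma))
    return x


def fit_ar1(x):
    """Estimate (c, phi) by ordinary least squares of X_t on X_{t-1}."""
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    var = sum((a - mean_prev) ** 2 for a in prev)
    phi_hat = cov / var
    c_hat = mean_curr - phi_hat * mean_prev
    return c_hat, phi_hat


series = simulate_ar1(c=2.0, phi=0.6, n=5000)
c_hat, phi_hat = fit_ar1(series)
print(f"estimated C = {c_hat:.2f}, estimated phi_1 = {phi_hat:.2f}")
```

With a long enough series the estimates should land close to the values used in the simulation. In practice the model classes in statsmodels.tsa do this estimation, including the higher orders and the MA, seasonal and exogenous terms listed above.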

8.1.1 Steps for building a model

  1. Check the time series for stationarity (e.g. with the augmented Dickey-Fuller test) and apply differencing if needed. This matters because an autoregressive model is a linear regression that uses the series' own lags as predictors, and its estimates are only reliable when the mean, variance and autocorrelation of the series do not change over time
  2. Determine the parameters, e.g. by inspecting the ACF/PACF plots
  3. Fit the model. Inspect the coefficients and their p-values (the P>|z| column) with the .summary() function and decide whether the parameters need further tuning
  4. Check the residuals to make sure the model has captured adequate information from the data: they should look like white noise. If their density also looks normally distributed, the model is ready
  5. Make predictions (using the .forecast() or .predict() function)
  6. Evaluate the model predictions using common metrics (MAE, RMSE, ...)
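
The six steps above can be sketched end to end with statsmodels. This is a minimal illustration under stated assumptions, not a recommended recipe: the data is synthetic (the cumulative sum of an AR(1) process), the 0.05 threshold for the ADF test is a common convention, and the orders p=1, q=0 are assumed rather than read from ACF/PACF plots.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, adfuller

# Synthetic data (an assumption for illustration): the cumulative sum of an
# AR(1) process, so a model with p=1, d=1, q=0 is a reasonable candidate.
rng = np.random.default_rng(0)
increments = np.zeros(300)
for t in range(1, 300):
    increments[t] = 0.5 * increments[t - 1] + rng.normal()
y = np.cumsum(increments)
train, test = y[:280], y[280:]

# Step 1: difference until the ADF test rejects a unit root (p-value <= 0.05).
d, series = 0, train
while d < 2 and adfuller(series)[1] > 0.05:
    series = np.diff(series)
    d += 1

# Step 2: normally read p and q from the PACF/ACF plots; assumed here.
p, q = 1, 0

# Step 3: fit and inspect the coefficients and their P>|z| values.
result = ARIMA(train, order=(p, d, q)).fit()
print(result.summary())

# Step 4: residuals should resemble white noise (autocorrelations near zero).
# The first d residuals are typically dominated by the initialisation of the
# differenced model, so they are skipped here.
resid = result.resid[d:]
resid_acf = acf(resid, nlags=10)
print("residual autocorrelations, lags 1-10:", np.round(resid_acf[1:], 2))

# Step 5: forecast as many steps as the held-out test set contains.
forecast = result.forecast(steps=len(test))

# Step 6: evaluate with MAE and RMSE.
mae = np.mean(np.abs(test - forecast))
rmse = np.sqrt(np.mean((test - forecast) ** 2))
print(f"d = {d}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```

The model is fitted on the original (undifferenced) training data with the chosen d, so the forecasts come back on the original scale and can be compared with the test set directly.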

Suggestions

  • Alternatively, use the auto_arima function from the pmdarima package to automate steps 1 to 3. Be aware that a manually fitted model is sometimes closer to the actual test set
  • Alternatively, use the plot_diagnostics method of the fitted model to automate step 4. Signs of a good fit:
    1. Standardized residual: there are no obvious patterns in the residuals, and their mean is zero
    2. Histogram and KDE: the KDE curve should be very similar to the normal distribution
    3. Normal Q-Q: most of the data points should lie on the straight line
    4. Correlogram: about 95% of the correlations at lags greater than zero should not be significant
  • Conduct time series cross-validation to select the best model, i.e. repeat the model assessment on different train/test sets that preserve the time order
  • If the data shows an exponential trend, apply a log transform before fitting the model, then apply the inverse transformation (the exponential function) to the predictions
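
The cross-validation suggestion can be sketched as an expanding-window scheme in plain Python: every fold trains on all observations before the test window, so the model never sees the future. The helper names, the toy series and the naive "repeat the last value" forecaster are assumptions for illustration; in practice the forecaster would be a fitted ARIMA-type model.

```python
def expanding_window_splits(n_obs, n_splits, test_size):
    """Yield (train_indices, test_indices); training data always precedes test data."""
    first_test_start = n_obs - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        yield range(0, test_start), range(test_start, test_start + test_size)


def naive_forecast(train, horizon):
    """Placeholder model: repeat the last observed value."""
    return [train[-1]] * horizon


def rmse(actual, predicted):
    """Root mean squared error of two equally long sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


series = [10, 12, 13, 12, 15, 16, 18, 17, 19, 21, 22, 24]
scores = []
for train_idx, test_idx in expanding_window_splits(len(series), n_splits=3, test_size=2):
    train = [series[i] for i in train_idx]
    test = [series[i] for i in test_idx]
    scores.append(rmse(test, naive_forecast(train, len(test))))

mean_rmse = sum(scores) / len(scores)
print(f"RMSE per fold: {scores}, mean: {mean_rmse:.3f}")
```

Comparing the mean score of several candidate models over the same folds gives a more stable ranking than a single train/test split. scikit-learn's TimeSeriesSplit offers a similar splitting scheme if that library is already in use.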

Useful tips/functions

  • Date increment used for a date range: pandas.tseries.offsets.DateOffset
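
As a small illustration of DateOffset, the sketch below extends the index of a monthly series by three future dates, which is a typical preparation step before attaching forecasts to dates. The series values and dates are hypothetical.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical monthly series whose index we want to extend for forecasting.
history = pd.Series(
    [112.0, 118.0, 132.0],
    index=pd.date_range("2021-01-01", periods=3, freq="MS"),
)

# Build the next three dates by adding 1, 2 and 3 months to the last observed date.
last_date = history.index[-1]
future_dates = [last_date + DateOffset(months=k) for k in range(1, 4)]
future_index = pd.DatetimeIndex(future_dates)
print(future_index)
```

DateOffset also accepts other calendar units such as days, weeks or years, so the same pattern works for other sampling frequencies.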