8 Time series analysis
Time series analysis (TSA) is a mathematical approach for predicting or forecasting the future pattern of data using historical data arranged in a successive order for a particular time period. statsmodels.tsa package contains model classes and functions that are useful for time series analysis.
Prediction vs forecasting: prediction is concerned with estimating the outcomes of unseen data. Forecasting is a sub-discipline of prediction in which we are making predictions about the future on the basis of time series data, so the only difference is that we consider the temporal dimension
Trend vs season vs cyclic: A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency. A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency.
Rolling average/ moving average is a metric that calculates trends over short periods of time using a set of data. Is uses smaller parts of the data and then rolls or moves for each new period. Calculating next rolling period involves leaving off your earliest unit and adding in your next unit. For egz., if you want to track down monthly data, take 12-months rolling period. After calculating average of 12 months, leave first month and add new month, then calculate average again for new rolling period. In that way, rolling period keeps moving.
Augmented Dickey-Fuller test: tests the null hypothesis that a unit root is present in a time series sample. It is a negative number, and the more negative it is, the stronger the rejection of the hypothesis that there is a unit root at some level of confidence.
There are 3 main versions of the test (Dickey-Fuller test is presented for simplicity):
- Test for a unit root: \(\Delta y_{t} = \delta y_{t-1} + u_{t} \quad(u_{t} \text{ is error term})\)
- Test for a unit root with constant: \(\Delta y_{t} = a_{0} + \delta y_{t-1} + u_{t}\)
- Test for a unit root with constant and deterministic time trend: \(\Delta y_{t} = a_{0} + a_{1}t + \delta y_{t-1} + u_{t}\)
-> Hypothesis: H0: δ = 0 (process is not stationary) H1: δ < 0 (process is stationary)
-> from statsmodels.tsa.stattools import adfuller. For additional parameters, it is the best practice to put autolag=‘AIC’. regression parameter has 4 parameters: ‘c’ for only constant (default), ‘ct’ for constant and trend, ‘ctt’ for constant and linear and quadratic trend, ‘n’ for no constant and no trend.
-> Which version of test to choose? δ needs to be <= 0, so one way to find out is to see if it fits in the right interval. Other way is to compare AIC values and choose lowest. Also by inspecting data we can assume which to choose, but the best way is to perform all 3 types and inspect results.
Stationary time series: the only assumption in TSA is that the data is stationary. Data is stationary when the variance and mean of the series are constant with time, with no periodic component (independent of time influence).
- Check it with Augmented Dickey-Fuller test
- Trend can result in a varying mean over time, wheras seasonality can result in a changing variance over time, both which define a time series as being non-stationary. (stationary datasets are much easier to model).
- Differencing is a widely used data transform for making time series data stationary. Notice that some temporal structures may still exist after performing a differencing operation, such as in the case of a nonlinear trend. The number of times that differencing is performed is called the difference order. DataFrame diff() function can be used.
There are two popular types of non-stationary time series:
- Trend-stationarity time series are those whose mean trend is deterministic. In other words, the mean of the time series changes over time but at a constant rate. The time series is not stationary in the strict sense, but it is stationary in the sense that the trend is stable and predictable
- Difference-stationarity time series have a mean trend that is stochastic. In other words, the mean of the time series changes over time in a random pattern.
Log transform: time series with an exponential distribution can be made linear by taking the logarithm of the values. Log transforms are popular with time series data as they are effective at removing exponential variance
Autocorrelation analysis is used in detecting patterns and checking for randomness. The analysis involves looking at the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots.
- Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. ACF function measures and plots the average correlation between data points in time series and previous values of the series measured for different lag lenghts.
- Partial autocorrelation is similar to autocorrelation except that each partial correlation controls for any correlation between observations of a shorter lag length. For egz., at second lag, the PACF measures the correlation between data points at time „t“ with data points at time „t-2“, while the ACF measures the same correlation but after controlling for the correlation between data points at time „t“ with those at time „t-1“.
- from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
- Stationarity of time series can be inspected with ACF plot (along with ADF test). In case the autocorrelations are positive for multiple lags, the series requires further differencing; but if lag 1 autocorrelated itself pretty negatively, then the series is possibly over-differenced
8.1 Models
AutoRegressive model (AR): it is a linear model where current period values are a sum of past outcomes multiplied by a numeric factor. We denote it as AR(p), where „p“ is called the order of the model and represents the number of lagged values we want to include. p can be determined from PACF plot. For p=1: \[ X_{t} = C + \phi_{1}X_{t-1} + \varepsilon_{t}, \] The coefficient \(\phi_{1}\) is a numeric constant with value between -1 and 1. When multiplied with past value it represents a part which remains in the future. You would choose an AR model if you believe that previous observations have a direct effect on the time series.
Moving Average (MA): it’s a statistic that captures the average change in data series over time. We denote it as MA(q), where „q“ is called the order of the model and represents the number of past forecast errors (or the size of the moving average window). q can be determined from ACF plot. You would choose an MA model if you believe that the errors have a direct effect on the time series.
AutoRegressive Moving Average (ARMA): p,q
AutoRegressive Integrated Moving Average (ARIMA): p,d,q.. where d is the difference order
AutoRegressive Moving Average with eXogeneous factors (ARMAX): exogeneous variables are external data used in forecast (external effects)
Seasonal AutoRegressive Integrated Moving Average (SARIMA): p,d,q,P,D,Q,m.. where m is the number of time steps for a single seasonal period, p,d,q are trend elements and P,D,Q are seasonal elements
Seasonal AutoRegressive Integrated Moving Average with eXogeneous factors (SARIMAX)
Moving Average (MA): it’s a statistic that captures the average change in data series over time. We denote it as MA(q), where „q“ is called the order of the model and represents the number of past forecast errors (or the size of the moving average window). q can be determined from ACF plot. You would choose an MA model if you believe that the errors have a direct effect on the time series.
AutoRegressive Moving Average (ARMA): p,q
AutoRegressive Integrated Moving Average (ARIMA): p,d,q.. where d is the difference order
AutoRegressive Moving Average with eXogeneous factors (ARMAX): exogeneous variables are external data used in forecast (external effects)
Seasonal AutoRegressive Integrated Moving Average (SARIMA): p,d,q,P,D,Q,m.. where m is the number of time steps for a single seasonal period, p,d,q are trend elements and P,D,Q are seasonal elements
Seasonal AutoRegressive Integrated Moving Average with eXogeneous factors (SARIMAX)
8.1.1 Steps for building a model
- Check for stationarity of time series and perform differencing if needed. This is because the term „autoregressive“ implies Linear Regression model (using its lags as predictors) and it works well for independent and non-correlated predictors
- Determine parameters. It can be done with inspecting acf/pacf plots
- Fit the model. Inspect coefficients and P(>|z|) with .summary() function and decide if it is needed for further tuning of parameters
- Check residuals for making sure model has captured adequte information from the data (they should look like white noise). If density looks normally distirbuted, model is ready.
- Make predictions (using .forecast() or .predict() function)
- Evaluate model predictions using common metrics (MAE, RMSE,..)
Suggestions
- Alternatively, use pmdarima package and auto_arima function to automate steps 1 to 3. Be aware that sometimes the manually fitted model is closer to the actual test set
- Alternatively, use plot_diagnostics to automate step 4. Values of good fit:
- Standardized residual: there are no obvious patterns in residuals, with values having a mean of zero
- The KDE curve should be very similar to the normal distribution
- Normal Q-Q: most of the data points should lie on the straight line
- Correlogram: 95% of correlations for lag greater than zero should not be significant
- Suggestion: conduct time series cross-validation to select the best model, i.e. repeat model assessment for different train / test sets
- Suggestion: if data shows exponential trend you can do a log transform before applying a model, then later apply inverse transformation (exponential function)
Useful tips/functions
- Date increment used for a date range: pandas.tseries.offsets.DateOffset