Information Criteria - Examples

Example created by Wilson Rocha Lacerda Junior

Comparing different information criteria methods

Here we import the NARMAX model, the metric for model evaluation and the methods to generate sample data for tests. We also import pandas, which is used to tabulate the results.

pip install sysidentpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sysidentpy.model_structure_selection import FROLS
from sysidentpy.basis_function._basis_function import Polynomial
from sysidentpy.metrics import root_relative_squared_error
from sysidentpy.utils.generate_data import get_siso_data
from sysidentpy.utils.display_results import results
from sysidentpy.utils.plotting import plot_residues_correlation, plot_results
from sysidentpy.residues.residues_correlation import (
    compute_residues_autocorrelation,
    compute_cross_correlation,
)

Generating sample data

The data is generated by simulating the following model: \(y_k = 0.2y_{k-1} + 0.1y_{k-1}x_{k-1} + 0.9x_{k-2} + e_{k}\)

If colored_noise is set to True:

\(e_{k} = 0.8\nu_{k-1} + \nu_{k}\)

where \(x\) is a uniformly distributed random variable and \(\nu\) is a Gaussian-distributed variable with \(\mu=0\) and a standard deviation \(\sigma\) set by the sigma argument (\(\sigma=0.2\) in the call below)

In the next example we generate 1000 samples with white noise and use 90% of the data to train the model.

x_train, x_valid, y_train, y_valid = get_siso_data(n=1000,
                                                   colored_noise=False,
                                                   sigma=0.2,
                                                   train_percentage=90)
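
For readers who want to see the model equation in code, here is a minimal NumPy sketch of how such data could be simulated. It is not the code behind get_siso_data: the function name simulate_siso, the input range of [-1, 1], the seed and the split logic are assumptions made for illustration. The 0.9 term is applied to the input at lag k-2, matching the x1(k-2) regressor reported in the result tables further down.

```python
import numpy as np


def simulate_siso(n=1000, sigma=0.2, colored_noise=False, seed=0):
    """Illustrative simulation of the example system (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)   # input; the range is an assumption
    nu = rng.normal(0.0, sigma, size=n)  # Gaussian noise with mean 0
    e = nu.copy()
    if colored_noise:
        e[1:] += 0.8 * nu[:-1]           # e_k = 0.8 nu_{k-1} + nu_k
    y = np.zeros(n)
    for k in range(2, n):
        y[k] = 0.2 * y[k - 1] + 0.1 * y[k - 1] * x[k - 1] + 0.9 * x[k - 2] + e[k]
    return x, y


x, y = simulate_siso()
split = len(x) * 90 // 100               # 90% of the samples for training
x_train, x_valid = x[:split], x[split:]
y_train, y_valid = y[:split], y[split:]
print(x_train.shape, x_valid.shape)
```

With n=1000 this leaves 900 samples for training and 100 for validation.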

The idea is to show how the information criterion affects the number of terms selected for the final model. You will see why it is an auxiliary tool, and why letting the algorithm select the number of terms from the minimum value alone is not a good idea when the data is highly corrupted by noise (even white noise).

Note: You may find different results when running the examples, because no fixed random seed is set for the sample data. However, the main analysis remains valid.
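
As a general illustration of why a fixed seed removes this run-to-run variation, the sketch below draws the same kind of random inputs twice with a seeded NumPy generator. The helper draw_sample is hypothetical, and how get_siso_data draws its random numbers is not shown in this example, so whether a seed set in user code reaches it is an assumption that should be checked against the library.

```python
import numpy as np


def draw_sample(seed, n=5):
    """Draw a uniform input and Gaussian noise from a seeded generator."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    nu = rng.normal(0.0, 0.2, size=n)
    return x, nu


x_a, nu_a = draw_sample(seed=42)
x_b, nu_b = draw_sample(seed=42)
x_c, nu_c = draw_sample(seed=7)

# Same seed: identical draws. Different seed: different draws.
print(np.array_equal(x_a, x_b), np.array_equal(nu_a, nu_b))
print(np.array_equal(x_a, x_c))
```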

AIC

basis_function = Polynomial(degree=2)

model = FROLS(
    order_selection=True,
    n_info_values=15,
    extended_least_squares=False,
    ylag=2, xlag=2,
    info_criteria='aic',
    estimator='least_squares',
    basis_function=basis_function
)
model.fit(X=x_train, y=y_train)
yhat = model.predict(X=x_valid, y=y_valid)
rrse = root_relative_squared_error(y_valid, yhat)
print(rrse)

r = pd.DataFrame(
    results(
        model.final_model, model.theta, model.err,
        model.n_terms, err_precision=8, dtype='sci'
        ),
    columns=['Regressors', 'Parameters', 'ERR'])
print(r)
plot_results(y=y_valid, yhat = yhat, n=1000)
ee = compute_residues_autocorrelation(y_valid, yhat)
plot_residues_correlation(data=ee, title="Residues", ylabel="$e^2$")
x1e = compute_cross_correlation(y_valid, yhat, x_valid)
plot_residues_correlation(data=x1e, title="Residues", ylabel="$x_1e$")

xaxis = np.arange(1, model.n_info_values + 1)
plt.plot(xaxis, model.info_values)
plt.xlabel('n_terms')
plt.ylabel('Information Criteria')
0.38926724002665514
      Regressors   Parameters             ERR
0        x1(k-2)   8.9197E-01  8.40239164E-01
1         y(k-1)   1.9624E-01  3.92056438E-02
2  x1(k-1)y(k-1)   8.3169E-02  2.25835069E-03
3   y(k-2)y(k-1)  -6.4115E-02  8.46298991E-04
4       y(k-2)^2   2.8534E-02  4.95098470E-04
[Figures: predicted vs. validation output; residue autocorrelation; input-residue cross-correlation; information criterion vs. number of terms]
model.info_values
array([-2726.72814962, -2977.48458651, -2992.4495674 , -2997.62741908,
       -2998.44371294, -2997.94195357, -2997.42660478, -2996.86187433,
       -2996.33235496, -2995.15967968, -2993.49434895, -2991.8239322 ,
       -2990.00537262, -2988.06004053, -2986.09131614])

As can be seen above, the minimum value makes the algorithm choose a model with 5 terms. However, the plot shows that 3 terms is the best choice: going from 3 terms upwards does not lead to a better model, since the change in the criterion is very small.

In this case, you should run the model again with the parameter n_terms=3! The ERR algorithm ordered the terms correctly, so you will recover the exact model structure!
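
The visual reading of the plot can be mimicked with a simple rule: stop adding terms once the next term improves the criterion by less than some threshold. The sketch below applies this to the AIC values printed above. The function first_small_gain and the threshold of 10 are hypothetical choices made for illustration, not part of sysidentpy, and a different threshold would give a different answer.

```python
import numpy as np

# AIC values copied from the model.info_values output above
aic = np.array([
    -2726.72814962, -2977.48458651, -2992.4495674, -2997.62741908,
    -2998.44371294, -2997.94195357, -2997.42660478, -2996.86187433,
    -2996.33235496, -2995.15967968, -2993.49434895, -2991.8239322,
    -2990.00537262, -2988.06004053, -2986.09131614,
])

# Number of terms at the minimum, which is what the automatic selection uses
n_min = int(np.argmin(aic)) + 1


def first_small_gain(values, min_gain):
    """Return the first model size whose next term improves the
    criterion by less than min_gain (hypothetical heuristic)."""
    gains = values[:-1] - values[1:]  # decrease obtained by adding one more term
    for n_terms, gain in enumerate(gains, start=1):
        if gain < min_gain:
            return n_terms
    return len(values)


n_elbow = first_small_gain(aic, min_gain=10.0)
print(n_min, n_elbow)  # 5 3
```

The second term lowers the criterion by about 250 and the third by about 15, while the fourth and fifth lower it by about 5 and 0.8, which is why the curve looks flat after 3 terms.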

BIC

basis_function = Polynomial(degree=2)

model = FROLS(
    order_selection=True,
    n_info_values=15,
    extended_least_squares=False,
    ylag=2, xlag=2,
    info_criteria='bic',
    estimator='least_squares',
    basis_function=basis_function
)
model.fit(X=x_train, y=y_train)
yhat = model.predict(X=x_valid, y=y_valid)
rrse = root_relative_squared_error(y_valid, yhat)
print(rrse)

r = pd.DataFrame(
    results(
        model.final_model, model.theta, model.err,
        model.n_terms, err_precision=8, dtype='sci'
        ),
    columns=['Regressors', 'Parameters', 'ERR'])
print(r)
plot_results(y=y_valid, yhat = yhat, n=1000)
ee = compute_residues_autocorrelation(y_valid, yhat)
plot_residues_correlation(data=ee, title="Residues", ylabel="$e^2$")
x1e = compute_cross_correlation(y_valid, yhat, x_valid)
plot_residues_correlation(data=x1e, title="Residues", ylabel="$x_1e$")

xaxis = np.arange(1, model.n_info_values + 1)
plt.plot(xaxis, model.info_values)
plt.xlabel('n_terms')
plt.ylabel('Information Criteria')
0.3887513802020464
      Regressors   Parameters             ERR
0        x1(k-2)   8.9223E-01  8.40239164E-01
1         y(k-1)   1.9664E-01  3.92056438E-02
2  x1(k-1)y(k-1)   8.2952E-02  2.25835069E-03
3   y(k-2)y(k-1)  -5.2039E-02  8.46298991E-04
[Figures: predicted vs. validation output; residue autocorrelation; input-residue cross-correlation; information criterion vs. number of terms]
model.info_values
array([-2721.92797955, -2967.88424637, -2978.0490572 , -2978.4267388 ,
       -2974.4428626 , -2969.14093316, -2963.8254143 , -2958.46051379,
       -2953.13082434, -2947.157979  , -2940.6924782 , -2934.22189138,
       -2927.60316173, -2920.85765958, -2914.08876512])

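BIC did a better job in this case! Because it penalizes each additional term more heavily than AIC, its minimum sits closer to the expected number of terms: in the output shown above it selects 4 terms instead of 5, which is still one more than the 3 terms of the simulated model (results vary between runs, as noted earlier). Good, but not always the best method!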
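
To make the difference in penalties concrete, the sketch below assumes the textbook forms of the four criteria, each written as N ln(residual variance) plus a penalty that grows with the number of terms k. The library's implementation is not shown in this example, so these formulas are an assumption, as is N = 898 (900 training samples minus the maximum lag of 2). Under those assumptions, swapping the AIC penalty for the BIC penalty reproduces the printed BIC values from the printed AIC values.

```python
import numpy as np

N = 898  # assumed effective sample size: 900 training samples minus max lag 2


def penalty(criterion, k, n=N):
    """Penalty added to n * ln(residual variance) for a model with k terms."""
    if criterion == "aic":
        return 2.0 * k
    if criterion == "bic":
        return k * np.log(n)
    if criterion == "lilc":
        return 2.0 * k * np.log(np.log(n))
    if criterion == "fpe":
        return n * np.log((n + k) / (n - k))
    raise ValueError(f"unknown criterion: {criterion}")


# First five values copied from the AIC and BIC outputs above
aic = np.array([-2726.72814962, -2977.48458651, -2992.4495674,
                -2997.62741908, -2998.44371294])
bic = np.array([-2721.92797955, -2967.88424637, -2978.0490572,
                -2978.4267388, -2974.4428626])

k = np.arange(1, 6)
bic_from_aic = aic - penalty("aic", k) + penalty("bic", k)
print(np.allclose(bic_from_aic, bic, atol=1e-3))

# Cost of one extra term under each criterion
for name in ("aic", "bic", "lilc", "fpe"):
    print(name, round(float(penalty(name, 1)), 3))
```

With N = 898, one extra term costs 2 under AIC, about 6.8 under BIC, about 3.8 under LILC and almost exactly 2 under FPE, which is consistent with AIC and FPE producing nearly identical curves in these examples.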

LILC

basis_function = Polynomial(degree=2)

model = FROLS(
    order_selection=True,
    n_info_values=15,
    extended_least_squares=False,
    ylag=2, xlag=2,
    info_criteria='lilc',
    estimator='least_squares',
    basis_function=basis_function
)
model.fit(X=x_train, y=y_train)
yhat = model.predict(X=x_valid, y=y_valid)
rrse = root_relative_squared_error(y_valid, yhat)
print(rrse)

r = pd.DataFrame(
    results(
        model.final_model, model.theta, model.err,
        model.n_terms, err_precision=8, dtype='sci'
        ),
    columns=['Regressors', 'Parameters', 'ERR'])
print(r)
plot_results(y=y_valid, yhat = yhat, n=1000)
ee = compute_residues_autocorrelation(y_valid, yhat)
plot_residues_correlation(data=ee, title="Residues", ylabel="$e^2$")
x1e = compute_cross_correlation(y_valid, yhat, x_valid)
plot_residues_correlation(data=x1e, title="Residues", ylabel="$x_1e$")

xaxis = np.arange(1, model.n_info_values + 1)
plt.plot(xaxis, model.info_values)
plt.xlabel('n_terms')
plt.ylabel('Information Criteria')
0.3887513802020464
      Regressors   Parameters             ERR
0        x1(k-2)   8.9223E-01  8.40239164E-01
1         y(k-1)   1.9664E-01  3.92056438E-02
2  x1(k-1)y(k-1)   8.2952E-02  2.25835069E-03
3   y(k-2)y(k-1)  -5.2039E-02  8.46298991E-04
[Figures: predicted vs. validation output; residue autocorrelation; input-residue cross-correlation; information criterion vs. number of terms]
model.info_values
array([-2724.89425438, -2973.81679602, -2986.94788167, -2990.2918381 ,
       -2989.27423672, -2986.9385821 , -2984.58933807, -2982.19071238,
       -2979.82729776, -2976.82072724, -2973.32150127, -2969.81718927,
       -2966.16473445, -2962.38550712, -2958.58288748])

LILC also includes a spurious term. Like AIC, it fails to automatically select the correct terms, but you can still pick the right number from the plot above!

FPE

basis_function = Polynomial(degree=2)

model = FROLS(
    order_selection=True,
    n_info_values=15,
    extended_least_squares=False,
    ylag=2, xlag=2,
    info_criteria='fpe',
    estimator='least_squares',
    basis_function=basis_function
)
model.fit(X=x_train, y=y_train)
yhat = model.predict(X=x_valid, y=y_valid)
rrse = root_relative_squared_error(y_valid, yhat)
print(rrse)

r = pd.DataFrame(
    results(
        model.final_model, model.theta, model.err,
        model.n_terms, err_precision=8, dtype='sci'
        ),
    columns=['Regressors', 'Parameters', 'ERR'])
print(r)
plot_results(y=y_valid, yhat = yhat, n=1000)
ee = compute_residues_autocorrelation(y_valid, yhat)
plot_residues_correlation(data=ee, title="Residues", ylabel="$e^2$")
x1e = compute_cross_correlation(y_valid, yhat, x_valid)
plot_residues_correlation(data=x1e, title="Residues", ylabel="$x_1e$")

xaxis = np.arange(1, model.n_info_values + 1)
plt.plot(xaxis, model.info_values)
plt.xlabel('n_terms')
plt.ylabel('Information Criteria')
0.38926724002665514
      Regressors   Parameters             ERR
0        x1(k-2)   8.9197E-01  8.40239164E-01
1         y(k-1)   1.9624E-01  3.92056438E-02
2  x1(k-1)y(k-1)   8.3169E-02  2.25835069E-03
3   y(k-2)y(k-1)  -6.4115E-02  8.46298991E-04
4       y(k-2)^2   2.8534E-02  4.95098470E-04
[Figures: predicted vs. validation output; residue autocorrelation; input-residue cross-correlation; information criterion vs. number of terms]
model.info_values
array([-2726.72814879, -2977.48457989, -2992.44954508, -2997.62736617,
       -2998.4436096 , -2997.94177499, -2997.4263212 , -2996.86145104,
       -2996.33175224, -2995.1588529 , -2993.4932485 , -2991.82250348,
       -2990.0035561 , -2988.0577717 , -2986.08852551])

FPE also failed to automatically select the right number of terms! But, as we pointed out before, an information criterion is an auxiliary tool! If you look at the plots, all the methods allow you to choose the right number of terms!
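
The four runs can be compared side by side using the printed model.info_values arrays. The sketch below copies the first six values of each array from the outputs above (the minimum of every array falls within them) and reports the number of terms at the minimum, together with how much the fourth term improves each criterion. The dictionary layout and variable names are only for illustration.

```python
import numpy as np

# First six values of model.info_values from each run above
info_values = {
    "aic": [-2726.72814962, -2977.48458651, -2992.4495674,
            -2997.62741908, -2998.44371294, -2997.94195357],
    "bic": [-2721.92797955, -2967.88424637, -2978.0490572,
            -2978.4267388, -2974.4428626, -2969.14093316],
    "lilc": [-2724.89425438, -2973.81679602, -2986.94788167,
             -2990.2918381, -2989.27423672, -2986.9385821],
    "fpe": [-2726.72814879, -2977.48457989, -2992.44954508,
            -2997.62736617, -2998.4436096, -2997.94177499],
}

# Number of terms at the minimum of each criterion
selected = {name: int(np.argmin(v)) + 1 for name, v in info_values.items()}

# Improvement obtained by going from 3 terms to 4 terms
gain_4th = {name: v[2] - v[3] for name, v in info_values.items()}

print(selected)  # {'aic': 5, 'bic': 4, 'lilc': 4, 'fpe': 5}
for name, gain in gain_4th.items():
    print(name, round(gain, 2))
```

In every case the fourth term improves the criterion by less than 6, against roughly 250 for the second term, so the curves all flatten after 3 terms even though none of the minima lands there.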

Important Note

Here we are dealing with a known model structure! With real data we do not know the right number of terms, so the methods above stand as excellent tools to help you out!

If you check the metrics above, you will see excellent values even for the models with more terms! But System Identification always searches for the best model structure! Model Structure Selection is the core of NARMAX methods! In this respect, the examples are meant to show the basic concepts and how the algorithms work!