Nowcasting Jamaica GDP¶
Last edit: 2026-04-20
What this script does¶
This notebook runs the full nowcasting pipeline for Bahamas GDP:
- Uses LARS (Least Angle Regression) to identify a strong set of candidate predictors
- Iterates over plausible covariate sets (feature combinations) derived from the candidate pool
- Fits each model on the training sample, generates out-of-sample predictions on the test sample, and produces GDP nowcasts
- Stores outputs in structured results tables (model performance + predictions by vintage/specification)
Models included¶
The toolkit currently estimates and compares:
- OLS
- Ridge (OLS + L2 regularization)
- LASSO (L1 regularization)
- Elastic Net
- Decision Tree
- Random Forest
- Gradient Boosting Tree
- LSTM (Long Short-Term Memory neural network)
Additional packkages¶
Be sure to install:
pip install tqdm-joblib
Import libraries¶
In [1]:
# =========================================================
# User settings (edit here as necessary)
# =========================================================
import pandas as pd
from datetime import datetime
# ---------------------------------------------------------
# 0) DATA/METADATA FILE NAMES
# ---------------------------------------------------------
data_file = "data.csv"
meta_file = "meta.csv"
# ---------------------------------------------------------
# 1) DEFINE TRAIN / TEST WINDOW
# ---------------------------------------------------------
# These dates define the sample splits used for model training and evaluation.
# - Training sample: [train_start_date, test_start_date)
# - Test sample: [test_start_date, test_end_date]
train_start_date = "2015-03-01"
test_start_date = "2023-06-01"
# Convert to Timestamp for safer comparisons later
train_start_date = pd.to_datetime(train_start_date)
test_start_date = pd.to_datetime(test_start_date)
# ---------------------------------------------------------
# 2) DEFINE OUT-OF-SAMPLE FORECAST TARGET (LAST DATE / VINTAGE)
# ---------------------------------------------------------
# This is the date for which you want to produce the final nowcast/forecast.
# Use the first day of the month (or the appropriate GDP reference month).
desired_date = pd.to_datetime("2026-03-01")
# ---------------------------------------------------------
# 3) DEFINE OUTPUT FILENAMES (timestamped)
# ---------------------------------------------------------
# Timestamp output files so each run creates a unique set of results.
today = datetime.today()
date_str = today.strftime("%Y%m%d")
forecast_file = f"{date_str} 3_nwcst Tab fcst.xlsx" # model forecasts / nowcasts
performance_file = f"{date_str} 3_nwcst Tab perf.xlsx" # model performance metrics
outofsample_file = f"{date_str} 3_nwcst Tab oos.xlsx" # full OOS prediction panel
# (Optional) If you want to resume / compare against a previous run,
# set last_performance to the filename of the prior performance workbook.
# Example: last_performance = "perf_20251207.xlsx"
last_performance = None
# ---------------------------------------------------------
# 4) MODEL SETTINGS
# ---------------------------------------------------------
# Target GDP series ID
target_variable = "RGDP0000"
# Prediction horizon (vintages) to evaluate:
# - negative values: backcasts (e.g., -1 = one period back)
# - 0: nowcast (current period)
# - positive values: forecasts (e.g., +1 = one period ahead)
lags = list(range(-2, 3))
# ---------------------------------------------------------
# 5) TARGET MODELS
# ---------------------------------------------------------
# Define models of interest
run = {
'ols' : 1 ,
'olsr' : 1 ,
'enet' : 1 ,
'lasso' : 1 ,
'gbt' : 1 ,
'dt' : 1 ,
'rf' : 1 ,
'lstm' : 1 ,
}
# Parameter: iterations in GBT, DT, RF
_iter_ = 10
# Parameter: LSTM n_models and train_episodes
_iter_lstm = 10
_epoch_lstm = 50
# =========================================================
# =========================================================
# =========================================================
# Nowcasting pipeline: imports, paths, data loading, and preprocessing
# =========================================================
# This section:
# 1) Imports all required libraries for the nowcasting toolkit (ML + time-series + LSTM)
# 2) Defines local folder paths for raw inputs, processed data, and outputs
# 3) Loads the full dataset and metadata
# 4) Deflates nominal variables using CPI (base year = 2018 average)
# 5) Seasonally adjusts series using trend extraction from seasonal decomposition
# 6) Converts all series to growth rates (percent change)
# 7) Cleans the final modeling dataset and prints library versions for reproducibility
# =========================================================
import os, json, time, warnings, re
from copy import deepcopy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
# -----------------------------
# Models / ML utilities
# -----------------------------
from sklearn import linear_model, tree
from sklearn.linear_model import (
LinearRegression, RidgeCV, ElasticNet, Lasso, Lars
)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
# Optional time-series tools (if used downstream)
import pmdarima as pm
# Deep learning (LSTM nowcasting)
import torch
from nowcast_lstm.LSTM import LSTM
# Statsmodels seasonal adjustment
from statsmodels.tsa.seasonal import seasonal_decompose
# Reduce noisy warnings from some estimators / decomposition
warnings.filterwarnings("ignore", category=UserWarning)
# =========================================================
# Paths
# =========================================================
# Define local project directories (relative to current working directory)
d = os.getcwd() + "/"
PATH_RAW = os.path.join(d, "..", "raw") # raw inputs (downloads / intermediate CSVs)
PATH_DATA = os.path.join(d, "..", "data") # processed modeling tables
# Output folder for model results (forecasts, performance tables)
PATH_OUTPUT = os.path.join(d, "..", "output/", f"{date_str}")
# In case it is run several times in a day:
# PATH_OUTPUT = os.path.join(d, "..", "output/", f"{date_str}_2")
# Create folder if it does not exist
os.makedirs(PATH_OUTPUT, exist_ok=True)
print("# =========================================================")
print("Key Directories:")
print(f"Raw inputs: {PATH_RAW}")
print(f"Modeling data: {PATH_DATA}")
print(f"Model outputs: {PATH_OUTPUT}")
print("# =========================================================")
# =========================================================
# Load data and metadata
# =========================================================
# Main modeling dataset (indexed by date)
data = pd.read_csv(
os.path.join(PATH_DATA, data_file ),
parse_dates=["date"],
index_col="date"
)
test_end_date = data[target_variable].dropna().index.max()
# Metadata describing each variable (frequency, nominal/real flags, etc.)
# NOTE: this uses a Windows-style path separator; works on Windows.
# For portability, consider: os.path.join(PATH_DATA, "meta_full.csv")
metadata = pd.read_csv(os.path.join(PATH_DATA, meta_file))
# Key column names and CPI series ID used for deflation
meta_col = "ticket" # variable identifier column in metadata
cpi = "RCPI0000" # CPI series used to deflate nominal variables
# =========================================================
# Deflate nominal variables using CPI (base year: 2019 average, STATINJA 2019=100)
# =========================================================
# Many series are nominal; deflating them puts everything in real terms.
# We compute a CPI base as the average CPI level in 2019 (STATINJA 2019=100 → base = 100).
base = data[cpi][data.index.year == 2019].mean() # beta11: base year 2019 (Jamaica official)
# Identify variables flagged as nominal in metadata
nominal = metadata.loc[metadata["nominal"] == 1][meta_col].values
# Deflate: Real_t = Nominal_t / CPI_t * CPI_base (corrected: was multiplying)
for column in data.columns:
if column in nominal:
data[column] = data[column] / data[cpi] * base # beta11: divide (deflate) not multiply
# Drop CPI
data.drop(columns=[cpi], inplace=True)
# Drop GDP components
data = data.drop(columns=[col for col in data.columns if 'RGDP' in col and col != 'RGDP0000'])
# ========================================================= Key Directories: Raw inputs: c:\Users\guerr\Dropbox\02_Work\Consulting\2026_NWCST_CAR_Research-Workshop\JAM Workshop\workshop_code/..\raw Modeling data: c:\Users\guerr\Dropbox\02_Work\Consulting\2026_NWCST_CAR_Research-Workshop\JAM Workshop\workshop_code/..\data Model outputs: c:\Users\guerr\Dropbox\02_Work\Consulting\2026_NWCST_CAR_Research-Workshop\JAM Workshop\workshop_code/..\output/20260511 # =========================================================
In [2]:
# =========================================================
# Seasonal adjustment (trend extraction)
# =========================================================
# We apply seasonal decomposition and keep the trend component.
# Test for seasonality in data
# Period depends on data frequency:
# quarterly: 4, monthly: 12, daily: 31 (approx monthly seasonality)
quarterly = metadata.loc[metadata["freq"].str.lower().str.contains("q")][meta_col].values
monthly = metadata.loc[metadata["freq"].str.lower().str.contains("m")][meta_col].values
already_sa = metadata.loc[metadata["sa"]==1][meta_col].values
# Drop columns with fewer than 24 non-NaN observations
data = data.loc[:, data.notna().sum() >= 24]
for column in data.columns:
if column in already_sa:
continue
series = data[column].dropna()
if column in quarterly:
decomp = seasonal_decompose(series, model='additive', extrapolate_trend='freq', period=4)
data[column] = data[column] - decomp.seasonal
else :
decomp = seasonal_decompose(series, model='additive', extrapolate_trend='freq', period=12)
data[column] = data[column] - decomp.seasonal
# =========================================================
# Convert levels to growth rates (percent change)
# =========================================================
# Quarterly series: quarter-over-quarter growth rate
# All other series: default percent change at their observed frequency
for column in data.columns:
if column in quarterly:
# Get non-NaN values with their original index
non_nan = data[column].dropna()
# Calculate percent change on non-NaN values only
pct_change_values = non_nan.pct_change(periods=1, fill_method=None) * 100
# Create a new series filled with NaN, then update with calculated values
result = pd.Series(np.nan, index=data.index)
result.loc[pct_change_values.index] = pct_change_values
data[column] = result
else:
# Store original NaN mask
original_was_nan = data[column].isna()
# For monthly/weekly/daily, calculate normally
pct_change_values = data[column].pct_change(periods=1, fill_method=None) * 100
# Set to NaN where previous value was NaN OR current original value was NaN
prev_was_nan = data[column].shift(1).isna()
pct_change_values[prev_was_nan | original_was_nan] = np.nan
data[column] = pct_change_values
data.replace([np.inf, -np.inf], np.nan, inplace=True)
# =========================================================
# Drop columns in quarterly with more than 75% missing values
# =========================================================
for column in [ col for col in data.columns if col not in quarterly ] :
missing_pct = data[data.index<=test_start_date][column].isna().sum() / len(data[data.index<=test_start_date][column])
if missing_pct > 0.75:
data.drop(column, axis=1, inplace=True)
# =========================================================
# Final cleaning
# =========================================================
# Replace infinities created by pct_change with NaN
data.replace([np.inf, -np.inf], np.nan, inplace=True)
# Drop CPI from predictor set (used only for deflation)
#data.drop(cpi, axis=1, inplace=True)
# Keep data from 2015 onward and remove fully-missing rows
data = data[~(data.index < train_start_date)].reset_index(drop=False)
data = data.set_index("date").dropna(how="all", axis=0)
metadata["months_lag"] = (
pd.to_numeric(metadata["months_lag"], errors="coerce")
.fillna(1)
.astype(int)
)
# Drop those columns
data = data.drop(columns=[col for col in data.columns if 'RGDP' in col and col != 'RGDP0000'])
for col in data.columns :
last = data[col].dropna().index.max().year
if last < 2024 :
data.drop(col, axis=1, inplace=True)
'''
# drop low hierarcy series
# Pattern: ends with 000x OR has x00x in last 4 digits
pattern = r'.*(\d00\d|000\d)$'
hierarchy = [col for col in data.columns if re.match(pattern, col)]
data = data[hierarchy]
'''
# Manage high-corr series from that proxy similar underlying DGP
def deduplicate_correlated(df, threshold=0.85, target=target_variable,
metadata=metadata, meta_col=meta_col):
"""Drop one variable from each highly-correlated pair.
Priority: (1) keep aggregate tickets (numeric suffix % 100 == 0);
(2) keep the variable with more non-null observations.
"""
import re
def _is_agg(ticket):
nums = re.findall(r'\d+', str(ticket))
return bool(nums) and int(nums[-1]) % 100 == 0
agg_set = set(metadata.loc[metadata[meta_col].apply(_is_agg), meta_col])
X = df.drop(columns=[target])
obs = X.notna().sum()
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = set()
for col in upper.columns:
if col in to_drop:
continue
peers = upper.index[upper[col] > threshold].tolist()
for peer in peers:
if peer in to_drop:
continue
col_agg = col in agg_set
peer_agg = peer in agg_set
if col_agg and not peer_agg:
to_drop.add(peer)
elif peer_agg and not col_agg:
to_drop.add(col)
break
elif obs[col] >= obs[peer]:
to_drop.add(peer)
else:
to_drop.add(col)
break
kept = [c for c in df.columns if c not in to_drop]
print(f"Removed {len(df.columns) - len(kept)} correlated variables "
f"(|r| > {threshold}). {len(kept)} remain.")
return df[kept]
data = deduplicate_correlated(data, threshold=0.98)
# Drop high-publication lag series
tickets_to_keep = metadata[metadata.months_lag <= 5]['ticket'].tolist()
# Keep only these columns (that exist in data)
data = data[[ col for col in data.columns if col in tickets_to_keep or "RGDP" in col ]]
data.to_csv(os.path.join(PATH_OUTPUT, "data_processed.csv"), index=True)
metadata = metadata[metadata.ticket.isin(data.columns)]
metadata.to_csv(os.path.join(PATH_OUTPUT, "meta_processed.csv"), index=True)
# =========================================================
# Reproducibility: print core library versions
# =========================================================
print("================== Library Versions =================")
import sklearn, numpy, pandas
print("scikit-learn:", sklearn.__version__)
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
# Data dimensions (observations and variables)
print("Data dimensions:", data.shape)
print(f"Training dates: {train_start_date.strftime('%Y-%m-%d')} to {(test_start_date - pd.DateOffset(days=1)).strftime('%Y-%m-%d')}")
T1 = data[data.index < test_start_date].shape[0]
T2 = data[(data.index >= test_start_date) & (data.index <= test_end_date)].shape[0]
print(f"Training periods: {T1} ({T1/(T1+T2) :.2%})")
print(f"Testing dates: {test_start_date.strftime('%Y-%m-%d')} to {test_end_date.strftime('%Y-%m-%d')}")
print(f"Testing periods: {T2} ({T2/(T1+T2) :.2%})")
Removed 173 correlated variables (|r| > 0.98). 687 remain. ================== Library Versions ================= scikit-learn: 1.8.0 numpy: 2.4.1 pandas: 2.3.3 Data dimensions: (134, 245) Training dates: 2015-03-01 to 2023-05-31 Training periods: 99 (76.15%) Testing dates: 2023-06-01 to 2025-12-01 Testing periods: 31 (23.85%)
Visualize variables
In [3]:
fig, axes = plt.subplots(4, 1, figsize=(13, 13), sharex=False)
# GDP (quarterly) — bar chart
ax = axes[0]
gdp = data['RGDP0000'].dropna()
ax.bar(gdp.index, gdp, width=60, color='#003865', alpha=0.8, label='GDP')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.set_title('Real GDP Growth (QoQ %)', fontsize=11, weight='bold')
ax.set_ylabel('% change')
ax.legend(frameon=False)
ax.grid(axis='y', linestyle=':', alpha=0.5)
# Payroll (MPAY2000)
ax = axes[1]
pay = data['MPAY2000'].dropna()
ax.plot(pay.index, pay, color='#009CA7', linewidth=1.8, label='Payroll (MPAY2000)')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.set_title('Payroll Growth (MoM %)', fontsize=11, weight='bold')
ax.set_ylabel('% change')
ax.legend(frameon=False)
ax.grid(axis='y', linestyle=':', alpha=0.5)
# Google Trends (UGTR0000)
ax = axes[2]
gtr = data['UGTR0000'].dropna()
ax.plot(gtr.index, gtr, color='#F5A623', linewidth=1.8, label='Google Trends (UGTR0000)')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.set_title('Google Trends Growth (MoM %)', fontsize=11, weight='bold')
ax.set_ylabel('% change')
ax.legend(frameon=False)
ax.grid(axis='y', linestyle=':', alpha=0.5)
# Nighttime Lights (UNTL0000)
ax = axes[3]
ntl = data['UNTL0000'].dropna()
ax.plot(ntl.index, ntl, color='#8B5CF6', linewidth=1.8, label='Nighttime Lights (UNTL0000)')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.set_title('Nighttime Lights Growth (MoM %)', fontsize=11, weight='bold')
ax.set_ylabel('% change')
ax.legend(frameon=False)
ax.grid(axis='y', linestyle=':', alpha=0.5)
fig.suptitle('Data Glimpse — Selected Indicators', fontsize=13, weight='bold')
plt.tight_layout()
plt.savefig(f"{PATH_OUTPUT}/{date_str} 3_nwcst Fig vars_trans.png", dpi=300, bbox_inches='tight')
plt.show()
Functions¶
In [4]:
# ALL HELPER FUNCTIONS AND MODELS ARE DEFINED HERE
# +++++++++++++++++++++++++++++++++++++++++++++++
# +++++++++++++++++++++++++++++++++++++++++++++++
scaler = StandardScaler()
def estimate_factors(X, r_max=10, criterion='IC1'):
'''
Information criteria to PCA.
Bai and Ng (2002)
inputs:
- X: df
- r_max: max number of factors (should be less than T or N)
- criterion: conservative criteria is IC3
'''
T, N = X.shape
r_max = min(r_max, T - 1) # enforce theoretical max
IC_vals = []
for r in range(r_max + 1):
if r == 0:
sigma2_r = np.mean(X.values ** 2)
else:
pca = PCA(n_components=r)
X_scaled = scaler.fit_transform(X)
F_hat = pca.fit_transform(X_scaled)
Lambda_hat = X_scaled.T @ F_hat / T
X_hat = F_hat @ Lambda_hat.T
sigma2_r = np.mean((X_scaled - X_hat).values ** 2)
if criterion == 'IC1':
penalty = r * (N + T) / (N * T) * np.log(N * T / (N + T))
elif criterion == 'IC2':
penalty = r * (N + T) / (N * T) * np.log(min(N, T))
elif criterion == 'IC3':
penalty = r * np.log(min(N, T)) / min(N, T)
else:
raise ValueError("Invalid criterion")
IC_vals.append(np.log(sigma2_r) + penalty)
return np.argmin(IC_vals)
def lars_variable_ranking(df, target, max_vars=None, test=None, factors=5):
"""
Stepwise LARS variable selection: rank variables by their entry into the model.
- df: DataFrame with predictors and target (train period only)
- target: name of the target variable
- max_vars: number of variables to select; if None, continue until all are ranked
- test: DataFrame with test period rows (same columns as df)
Returns:
- selected_vars: ordered list of variable names by entry into the LARS model
- rmse: out-of-sample RMSE on the test period (or None)
Chinn, M. D., Meunier, B., & Stumpner, S. (2023). Nowcasting world trade with machine learning: a three-step approach (No. w31419). National Bureau of Economic Research.
"""
df0 = deepcopy(df)
selected_order = []
count = 0
max_vars = max_vars or (df0.shape[1] - 1)
while count < max_vars:
X = df0.drop(columns=[target])
y = df0[target]
# Stop if no variables remain
if X.shape[1] == 0:
break
# Fit LARS model (drop NaN rows for fitting)
mask = y.notna()
model = Lars(n_nonzero_coefs=X.shape[1], jitter=1e-10, random_state=0)
model.fit(X[mask], y[mask])
# coef_path_ has shape (n_features, n_steps)
coef_path = np.array(model.coef_path_)
# Count how many times each variable is zero
zero_counts = (coef_path == 0).sum(axis=1)
# Summary table
out = pd.DataFrame({'Variable': X.columns, 'ZeroCount': zero_counts})
out = out.sort_values('ZeroCount')
# Take variables that are uniquely at this ZeroCount level
grouped = out.groupby('ZeroCount')
selected = grouped.filter(lambda g: len(g) == 1)
# If nothing uniquely enters, dump all remaining vars
if selected.empty:
remaining = X.columns.tolist()
selected = pd.DataFrame({'Variable': remaining, 'ZeroCount': [count]*len(remaining)})
selected_order.extend(remaining)
print("Warning: Using fallback procedure for LARS")
break
else:
selected_vars = selected['Variable'].tolist()
selected_order.extend(selected_vars)
df0 = df0.drop(columns=selected_vars)
count += len(selected_vars)
top_vars = selected_order[:max_vars]
if test is None:
return top_vars, None
# --- Evaluation: mirror the benchmark OLS setup exactly ---
df_train = df[ [target] + top_vars]
df_test = test[[target] + top_vars]
X_train_raw = df_train.drop(target, axis=1)
X_test_raw = df_test.drop(target, axis=1)
y_train_all = df_train[target]
y_test_all = df_test[target]
# Fit scaler on train only, transform both (no data leakage)
X_train_scaled = pd.DataFrame(
scaler.fit_transform(X_train_raw), index=X_train_raw.index
).dropna(axis=1)
X_test_scaled = pd.DataFrame(
scaler.transform(X_test_raw), index=X_test_raw.index
)[X_train_scaled.columns]
# Fit PCA on train only, transform both
pca_local = PCA(n_components=factors)
X_train_pca = pd.DataFrame(
pca_local.fit_transform(X_train_scaled), index=X_train_raw.index,
columns=[f"PC_{i}" for i in range(factors)]
)
X_test_pca = pd.DataFrame(
pca_local.transform(X_test_scaled), index=X_test_raw.index,
columns=[f"PC_{i}" for i in range(factors)]
)
# Drop NaN GDP rows from training (same as benchmark obs_mask)
obs_mask = y_train_all.notna()
X_train_pca = X_train_pca[obs_mask]
y_train_fit = y_train_all[obs_mask]
# Drop NaN GDP rows from test when computing RMSE
obs_mask_test = y_test_all.notna()
X_test_eval = X_test_pca[obs_mask_test]
y_test_eval = y_test_all[obs_mask_test]
final_model = LinearRegression()
final_model.fit(X_train_pca, y_train_fit)
y_pred = final_model.predict(X_test_eval)
rmse = float(np.sqrt(np.mean((y_test_eval.values - y_pred) ** 2)))
return top_vars, rmse
# +++++++++++++++++++++++++++++++++++++++++++++++
# +++++++++++++++++++++++++++++++++++++++++++++++
# helper function, generate lagged datasets for testing on vintages
def gen_lagged_data(metadata, data, last_date, lag):
# only go up to the last date
lagged_data = data.loc[data.date <= last_date, :].reset_index(drop=True)
for col in lagged_data.columns[1:]:
pub_lag = metadata.loc[metadata.ticket == col, "months_lag"].values[0] # publication lag of this particular variable
# go back as far as needed for the pub_lag of the variable, then + the lag (so -2 for 2 months back), also -1 because 0 lag means in month, last month data available, not current month in
#lagged_data.loc[(len(lagged_data) - pub_lag + lag - 1) :, col] = np.nan
lagged_data.loc[(len(lagged_data) - pub_lag + lag) :, col] = np.nan
return lagged_data
# helper function, flatten a dataset for methods that don't do timeseries, extra columns for each lag
def flatten_data(data, target_variable, n_lags):
flattened_data = data.loc[~pd.isna(data[target_variable]), :]
orig_index = flattened_data.index
for i in range(1, n_lags):
lagged_indices = orig_index - i
lagged_indices = lagged_indices[lagged_indices >= 0]
tmp = data.loc[lagged_indices, :]
tmp.date = tmp.date + pd.DateOffset(months=i)
tmp = tmp.drop([target_variable], axis=1)
tmp.columns = [j + "_" + str(i) if j != "date" else j for j in tmp.columns]
flattened_data = flattened_data.merge(tmp, how="left", on="date")
return flattened_data
# helper function fill missings in a dataset with the mean from the training set
def mean_fill_dataset(training, test):
mean_dict = {}
for col in training.columns[1:]:
mean_dict[col] = np.nanmean(training[col])
filled = test.copy()
for col in training.columns[1:]:
filled.loc[pd.isna(filled[col]), col] = mean_dict[col]
return filled
def lagged_target(data, total_lags, metadata=metadata, target_variable=target_variable) :
if total_lags == 0 :
return data
else :
for l in range(1,total_lags+1) :
data[f"{target_variable}_l{l}"] = data[target_variable]
data.loc[data.index[-metadata[metadata['ticket']==target_variable]['months_lag'].values[0]:], f"{target_variable}_l{l}"] = np.nan
data[f"{target_variable}_l{l}"] = data[f"{target_variable}_l{l}"].ffill()
data[f"{target_variable}_l{l}"] = data[f"{target_variable}_l{l}"].shift(1)
data[f"{target_variable}_l{l}"] = data[f"{target_variable}_l{l}"].fillna(data[f"{target_variable}_l{l}"].mean())
return data
def perform_tab( pred_values , model_name , specification , values , lags=lags) :
# table of RMSE by vintage
table = pd.DataFrame(columns=[ "Vintage", "RMSE", "estimator", "spec" ])
for lag in lags:
tmp = pd.DataFrame({
"Vintage" : lag,
"RMSE" : np.sqrt(np.mean((np.array(values) - np.array(pred_values[lag])) ** 2)) ,
"estimator": model_name ,
"spec" : specification ,
}, index=[0])
table = pd.concat([table, tmp]).reset_index(drop=True)
table.round(4)
table.set_index('Vintage' , inplace=True)
return table
def forecast_table( pred_values , model_name , specification, values , dates ) :
# plot of predictions vs actuals
result = pd.DataFrame(
{
"actuals": values ,
"two_back": pred_values[-2] ,
"one_back": pred_values[-1] ,
"zero_back": pred_values[0] ,
"one_ahead": pred_values[1] ,
"two_ahead": pred_values[2] ,
"estimator": model_name ,
"spec": specification ,
}
)
result.index = pd.to_datetime(dates)
return result
'''
DEFINE ALL OLS FUNCTIONS :
fit: model training
predict: test fit
oos: out of sample prediction
'''
def fit_ols(
ytrain ,
xtrain ,
target_variable = target_variable,
) :
model = LinearRegression()
return model.fit(xtrain, ytrain) , xtrain.columns
def predict_ols(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable ,
) :
X = X[train_vars]
pred = model.predict(X)[0]
pred_dict[l].append(pred)
return pred_dict
def oos_ols(
model_fit ,
new_data ,
train_vars ,
table ,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
oos_forecast = model_fit.predict( new_data )
oos_forecast = pd.DataFrame( {
"date": new_dates,
"nowcast": oos_forecast
}).set_index('date')
oos_forecast['estimator'] = 'OLS'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE ALL OLS RIDGE FUNCTIONS
'''
def fit_olsr(
ytrain ,
xtrain ,
target_variable = target_variable,
alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10, 20],
) :
model = RidgeCV( alphas = alphas )
return model.fit(xtrain, ytrain) , xtrain.columns
def predict_olsr(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable,
) :
X = X[train_vars]
pred = model.predict(X)[0]
pred_dict[l].append(pred)
return pred_dict
def oos_olsr(
model_fit ,
new_data ,
train_vars ,
table,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
oos_forecast = model_fit.predict(new_data)
oos_forecast = pd.DataFrame( {
"date": new_dates,
"nowcast": oos_forecast
}).set_index('date')
oos_forecast['estimator'] = 'OLSR'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE ALL ENET FUNCTIONS
'''
def fit_enet(
ytrain,
xtrain ,
target_variable = target_variable,
params = {
'alpha' : 1e-5 ,
'l1_ratio' : 0.25 ,
}
) :
model = ElasticNet(alpha = params['alpha'] , l1_ratio = params['l1_ratio'] )
return model.fit(xtrain, ytrain) , xtrain.columns
def predict_enet(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable,
) :
X = X[train_vars]
pred = model.predict(X)[0]
pred_dict[l].append(pred)
return pred_dict
def oos_enet(
model_fit,
new_data ,
train_vars ,
table,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
oos_forecast = model_fit.predict(new_data)
oos_forecast = pd.DataFrame( {
"date": new_dates,
"nowcast": oos_forecast
}).set_index('date')
oos_forecast['estimator'] = 'ENET'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE LASSO FUNCTIONS
'''
def fit_lasso(
ytrain,
xtrain ,
target_variable = target_variable,
alpha = 1e-5,
) :
model = Lasso( alpha = alpha )
return model.fit(xtrain, ytrain) , xtrain.columns
def predict_lasso(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable,
) :
X = X[train_vars]
pred = model.predict(X)[0]
pred_dict[l].append(pred)
return pred_dict
def oos_lasso(
model_fit,
new_data ,
train_vars ,
table,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
oos_forecast = model_fit.predict(new_data)
oos_forecast = pd.DataFrame( {
"date": new_dates,
"nowcast": oos_forecast
}).set_index('date')
oos_forecast['estimator'] = 'LASSO'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE DT FUNCTIONS
'''
def fit_dt(
ytrain ,
xtrain ,
target_variable = target_variable,
ModelN = 200 ,
) :
models = []
for i in range(ModelN):
model = DecisionTreeRegressor(criterion = "absolute_error",
min_samples_split = 6,
min_samples_leaf = 2)
model.fit(xtrain, ytrain)
models.append(model)
return models , xtrain.columns
def predict_dt(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable,
) :
X = X[train_vars]
preds = []
for mod in model:
prediction = mod.predict(X)[0]
preds.append(prediction)
pred_dict[l].append(np.nanmean(preds))
return pred_dict
def oos_dt(
model_fit,
new_data ,
train_vars ,
table,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
preds = []
for mod in model_fit :
prediction = mod.predict(new_data)
preds.append(prediction)
oos_forecast = pd.DataFrame(preds).T.mean(axis=1).to_frame(name="nowcast")
oos_forecast.index = new_dates
oos_forecast['estimator'] = 'DT'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE RF FUNCTIONS
'''
def fit_rf(
ytrain ,
xtrain ,
target_variable = target_variable,
ModelN = 200 ,
n_estimators = 130 ,
) :
models = []
for i in range(ModelN):
model = RandomForestRegressor(
n_estimators=n_estimators,
criterion = "absolute_error",
max_features=18,
min_samples_split=4,
min_samples_leaf=2
)
model.fit(xtrain, ytrain)
models.append(model)
return models , xtrain.columns
def predict_rf(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable,
) :
X = X[train_vars]
preds = []
for mod in model:
prediction = mod.predict(X)[0]
preds.append(prediction)
pred_dict[l].append(np.nanmean(preds))
return pred_dict
def oos_rf(
model_fit,
new_data ,
train_vars ,
table,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
preds = []
for mod in model_fit :
prediction = mod.predict(new_data)
preds.append(prediction)
oos_forecast = pd.DataFrame(preds).T.mean(axis=1).to_frame(name="nowcast")
oos_forecast.index = new_dates
oos_forecast['estimator'] = 'RF'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE GBT FUNCTIONS
'''
def fit_gbt(
ytrain,
xtrain ,
target_variable = target_variable,
ModelN = 200 ,
n_estimators = 100 ,
learning = 0.15 ,
) :
models = []
for i in range(ModelN):
model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=learning,
loss='absolute_error',
min_samples_split=6,
min_samples_leaf=3
)
model.fit(xtrain, ytrain)
models.append(model)
return models , xtrain.columns
def predict_gbt(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
target_variable = target_variable,
) :
X = X[train_vars]
preds = []
for mod in model:
prediction = mod.predict(X)[0]
preds.append(prediction)
pred_dict[l].append(np.nanmean(preds))
return pred_dict
def oos_gbt(
model_fit,
new_data ,
train_vars ,
table ,
new_dates,
spec = "model" ,
desired_date = desired_date,
):
new_data = new_data[train_vars]
preds = []
for mod in model_fit:
prediction = mod.predict(new_data)
preds.append(prediction)
oos_forecast = pd.DataFrame(preds).T.mean(axis=1).to_frame(name="nowcast")
oos_forecast.index = new_dates
oos_forecast['estimator'] = 'GBT'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
'''
DEFINE LSTM FUNCTIONS
'''
def fit_lstm(
ttrain ,
gdp_lags = 0,
target_variable = target_variable,
params = {
"n_timesteps": 12,
"fill_na_func": np.nanmean,
"fill_ragged_edges_func": np.nanmean,
"n_models": _iter_lstm,
"train_episodes": _epoch_lstm,
"batch_size": 50,
"decay": 0.98,
"n_hidden": 10,
"n_layers": 1,
"dropout": 0.0,
"criterion": torch.nn.MSELoss(),
"optimizer": torch.optim.Adam,
"optimizer_parameters": {"lr": 1e-2, "weight_decay": 0.0}
}
) :
#train = lagged_target(ttrain, gdp_lags)
model = LSTM(
data = ttrain,
target_variable = target_variable ,
**params
)
model.train(quiet=True)
return model , ttrain.drop(["date", target_variable], axis=1).columns
def predict_lstm(
model ,
X ,
train_vars ,
date ,
pred_dict ,
l ,
gdp_lags = 0 ,
target_variable = target_variable,
) :
#X = lagged_target(X, gdp_lags)
#X = X[train_vars]
pred = model.predict(X).loc[lambda x: x.date == date, "predictions"].values[0]
pred_dict[l].append(pred)
return pred_dict
def oos_lstm(
model_fit,
new_data ,
train_vars ,
table,
new_dates,
gdp_lags = 0 ,
spec = "model" ,
desired_date = desired_date,
):
#new_data = lagged_target(new_data, gdp_lags)
new_data = new_data[new_data['date'] <= desired_date ]
new_dates = new_data['date']
oos_forecast = model_fit.predict(new_data)[['date', 'predictions']].set_index('date')
oos_forecast = oos_forecast[oos_forecast.index <= desired_date]
oos_forecast = oos_forecast[oos_forecast.index > new_data[['date',target_variable]].dropna().date.max()]
oos_forecast = oos_forecast.rename(columns = {'predictions' : 'nowcast'})
oos_forecast = oos_forecast[oos_forecast.index.month.astype(int).isin([3,6,9,12])]
oos_forecast['estimator'] = 'LSTM'
oos_forecast['spec'] = spec
table = pd.concat([ table , oos_forecast ], axis=0)
return table
Pre-Selection¶
In [5]:
import warnings
warnings.filterwarnings('ignore')
# ---------------------------------------------
# LARS-PRESELECTION
# Steps: 1) Prepare df; 2) Estimate benchmark; 3) LARS Estimation; 4) Best-variables
# ---------------------------------------------
# 1. Preparation: LARS cannot handle Missing Values or Mixed-Data
factors = 1
# Aggregate according to methods described in metadata
try :
agg_map = metadata.set_index('ticket')['agg_method'].to_dict()
agg_funcs = {
col: (agg_map[col] if agg_map[col] in ['mean', 'last', 'sum'] else 'mean')
for col in data.columns if col in agg_map
}
pre_selected = data.resample('QS').agg(agg_funcs)
except :
pre_selected = data.resample('QS').mean()
pre_selected.index = pd.to_datetime(pre_selected.index)
pre_selected = pre_selected.dropna(how="all", axis=1)
pre_selected = pre_selected.loc[:, pre_selected.isnull().mean() <= 0.75]
# Mean-fill predictors only — leave target_variable as NaN so we can
# identify quarters without observed GDP and drop them from training.
predictors = [c for c in pre_selected.columns if c != target_variable]
pre_selected[predictors] = pre_selected[predictors].fillna(
pre_selected[predictors].mean(numeric_only=True)
)
# data cannot include periods beyond the GDP we observe
last_gdp = data[target_variable].dropna()
last_gdp = last_gdp.index.max()
pre_selected = pre_selected[pre_selected.index <= last_gdp]
pre_selected.dropna(subset=[target_variable], inplace=True)
del last_gdp
# ---------------------------------------------
# 2. Benchmark RMSE
# ---------------------------------------------
# Split raw data first
X_raw = pre_selected.drop(target_variable, axis=1)
y_all = pre_selected[[target_variable]]
X_train_raw = X_raw[X_raw.index < pd.to_datetime(test_start_date)]
X_test_raw = X_raw[X_raw.index >= pd.to_datetime(test_start_date)]
y_train = y_all[y_all.index < pd.to_datetime(test_start_date)]
y_test = y_all[y_all.index >= pd.to_datetime(test_start_date)]
# Drop quarters where GDP was not observed (NaN target = no real observation)
obs_mask = y_train[target_variable].notna()
y_train = y_train[obs_mask]
X_train_raw = X_train_raw[obs_mask]
# Fit scaler on train only, transform both
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_raw), index=X_train_raw.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test_raw), index=X_test_raw.index)
# Fit PCA on train only, transform both
pca = PCA(n_components=factors)
X_train = pd.DataFrame(pca.fit_transform(X_train_scaled), index=X_train_raw.index)
X_test = pd.DataFrame(pca.transform(X_test_scaled), index=X_test_raw.index)
X_train.columns = [f"PC_{i}" for i in range(factors)]
X_test.columns = [f"PC_{i}" for i in range(factors)]
benchmark_model = LinearRegression()
benchmark_model.fit(X_train, y_train)
print(f"Training on {len(y_train)} observed GDP quarters")
y_pred = benchmark_model.predict(X_test).ravel()
y_true = y_test.values.ravel()
# Naive baseline: always predict the training-period mean
naive_pred = float(y_train[target_variable].mean())
naive_rmse = float(np.sqrt(np.mean((y_true - naive_pred) ** 2)))
benchmark = float(np.sqrt(np.mean((y_true - y_pred) ** 2))) # RMSE
print(f"OLS benchmark RMSE : {benchmark:.4f}")
# ---------------------------------------------
# 3. LARS Estimation
# ---------------------------------------------
pre_train = pre_selected[pre_selected.index < test_start_date]
pre_test = pre_selected[pre_selected.index >= test_start_date]
# Estimate LARS over a range of variables included in the estimation
# Returns RMSE and variable choice
d = []
for size in range(10,50) :
selection, rmse = lars_variable_ranking(pre_train, target=target_variable, max_vars=size+1, test = pre_test, factors = factors)
b = {
"size" : size+1 ,
"selected_vars" : selection ,
"rmse" : rmse
}
d.append(b)
d = pd.DataFrame(d)
d.set_index('size', inplace=True)
d['ratio'] = d['rmse']/benchmark
# Minimum RMSE:
# d.loc[d['rmse'].idxmin()]
# ---------------------------------------------
# 4. Best models: define top models and add them to specification list
# ---------------------------------------------
# ---------------------------------------------
# Plot LARS results
# ---------------------------------------------
#threshold = d['ratio'].mean()
# Best threshold is 1: where models outperform non-selected OLS
threshold = 1
d['ratio'].plot(color='black')
plt.axhline(y=threshold, color='black', linestyle=':', linewidth=1)
plt.ylabel("Ratio of RMSE")
plt.xlabel("Number of variables")
plt.legend().set_visible(False)
plt.savefig(f"{PATH_OUTPUT}/{date_str} 3_nwcst Fig lars_ratio.png", dpi=300, bbox_inches='tight')
plt.show()
print(f"OLS benchmark RMSE : {benchmark:.4f}")
print(f"LARS min RMSE : {d['rmse'].min():.4f}")
# ---------------------------------------------
# ---------------------------------------------
# Keep specifications where rmse is below the threshold
# These are the top models
# ---------------------------------------------
spec_list = d[d['ratio']<threshold]
spec_list['selected_vars'] = spec_list['selected_vars'].apply(lambda x: ' '.join(x))
# ---------------------------------------------
# Variables included in best-RMSE LARS:
top_vars = spec_list.loc[spec_list['rmse'].idxmin(), 'selected_vars'].split(" ")
# Variable selection in best-RMSE LARS:
print(f"Total specifications selected: {len(spec_list)}. \n Best-RMSE selected variables {len(top_vars)} out of {len(data.columns)-1}")
# Now define main df using only variables selected
# data = data[[target_variable] + top_vars]
# ---------------------------------------------
# What variables are included in the top selection?
# ---------------------------------------------
del pre_selected
# ---------------------------------------------
# Save estimation parameters for reproducibility
# ---------------------------------------------
parameters = {
"train_start": train_start_date.strftime("%Y-%m-%d"),
"test_start": test_start_date.strftime("%Y-%m-%d"),
"test_end": test_end_date.strftime("%Y-%m-%d"),
"lars_pca": factors,
"model_iterations": int(_iter_),
'lstm n_models' : _iter_lstm ,
'lstm train_episodes' :_epoch_lstm,
**run
}
with open(os.path.join(PATH_OUTPUT, "estimation_parameters.json"), "w") as f:
json.dump(parameters, f, indent=2)
# ---------------------------------------------
# ---------------------------------------------
print("Distribution of frequencies in LARS selection:")
print(spec_list.rmse.describe())
metadata[metadata['ticket'].isin(top_vars)]
Training on 33 observed GDP quarters OLS benchmark RMSE : 2.7344
OLS benchmark RMSE : 2.7344 LARS min RMSE : 2.5824 Total specifications selected: 38. Best-RMSE selected variables 18 out of 244 Distribution of frequencies in LARS selection: count 38.000000 mean 2.674249 std 0.032037 min 2.582414 25% 2.659850 50% 2.674920 75% 2.701012 max 2.732994 Name: rmse, dtype: float64
Out[5]:
| series | ticket | description | description 2 | freq | first_date | last_date | No Obs | months_lag | nominal | sa | unstructured | type | agg_method | file_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | bank_loans_food_and_beverage_services_excludin... | MLOA9803 | Commercial Bank Loans (by sector): Food and Be... | NaN | MS | 4/1/1986 | 2/1/2026 | 479 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_loans_bysector.csv |
| 63 | bank_loans_government_services_local_government | MLOAB002 | Commercial Bank Loans (by sector): Government ... | NaN | MS | 4/1/1986 | 2/1/2026 | 479 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_loans_bysector.csv |
| 64 | bank_loans_government_services_other_public_en... | MLOAB004 | Commercial Bank Loans (by sector): Government ... | NaN | MS | 4/1/1986 | 2/1/2026 | 479 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_loans_bysector.csv |
| 72 | bank_loans_manufacturing_metal_products | MLOA9305 | Commercial Bank Loans (by sector): Manufacturi... | NaN | MS | 4/1/1986 | 2/1/2026 | 479 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_loans_bysector.csv |
| 75 | bank_loans_manufacturing_sugar_rum_molasses | MLOA9306 | Commercial Bank Loans (by sector): Manufacturi... | NaN | MS | 4/1/1986 | 2/1/2026 | 479 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_loans_bysector.csv |
| 89 | bank_nonperf_loans_electricity_gas_and_water | MNPL0004 | Non-Performing Loans (by customer): Electricit... | NaN | MS | 3/1/2017 | 2/1/2026 | 108 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_nonperf.csv |
| 94 | bank_nonperf_loans_mining | MNPL0008 | Non-Performing Loans (by customer): Mining | NaN | MS | 3/1/2017 | 2/1/2026 | 108 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_nonperf.csv |
| 95 | bank_nonperf_loans_nonbank_financial_institutions | MNPL0009 | Non-Performing Loans (by customer): Non-Bank F... | NaN | MS | 3/1/2017 | 2/1/2026 | 108 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_nonperf.csv |
| 99 | bank_nonperf_loans_public_sector | MNPL0012 | Non-Performing Loans (by customer): Public Sector | NaN | MS | 3/1/2017 | 2/1/2026 | 108 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_nonperf.csv |
| 100 | bank_nonperf_loans_tourism | MNPL0013 | Non-Performing Loans (by customer): Tourism | NaN | MS | 3/1/2017 | 2/1/2026 | 108 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_nonperf.csv |
| 101 | bank_nonperf_loans_transport_storage_and_commu... | MNPL0014 | Non-Performing Loans (by customer): Transport,... | NaN | MS | 3/1/2017 | 2/1/2026 | 108 | 3 | NaN | 0.0 | 0.0 | M | mean | boj_bank_nonperf.csv |
| 677 | EDNA_USD_XDC_RATE | MXRT0101 | Nominal Exchange Rate (USD per XDC) | NaN | MS | 1/1/2010 | 3/1/2026 | 195 | 2 | 0.0 | 0.0 | 0.0 | M | mean | boj_forex.csv |
| 863 | foreign_liabilitiesother | MRES0302 | International Reserves: Foreign Liabilities - ... | NaN | MS | 9/1/1992 | 3/1/2026 | 403 | 2 | NaN | 0.0 | 0.0 | M | mean | boj_iirr.csv |
| 959 | exp_eu | XEMP0002 | Exports EU | NaN | MS | 1/1/1988 | 1/1/2026 | 457 | 4 | 0.0 | 0.0 | 0.0 | X | sum | eu_trade.csv |
| 960 | imp_eu | XIMP0002 | Imports EU | NaN | MS | 1/1/1988 | 1/1/2026 | 457 | 4 | 0.0 | 0.0 | 0.0 | X | sum | eu_trade.csv |
| 1006 | gtrends_treasure_beach_jm | UGTR0039 | Google Trends: Treasure Beach (JM) | NaN | MS | 3/1/2008 | 4/1/2026 | 187 | 1 | 0.0 | 0.0 | 1.0 | U | mean | gtrends.csv |
| 1009 | jse_volume | REXC0002 | Jamaica Exchange Volume | NaN | MS | 4/1/2021 | 4/1/2026 | 61 | 1 | 0.0 | 0.0 | 0.0 | R | sum | jamstockex_jse.csv |
| 1011 | imp_uk | XIMP0003 | Imports UK | NaN | MS | 1/1/2018 | 12/1/2025 | 96 | 5 | 0.0 | 0.0 | 0.0 | X | sum | uk_trade.csv |
Specs¶
Generate specification list
In [7]:
import warnings
warnings.filterwarnings('ignore')
# --- Base / benchmark naming ---
spec_list = spec_list.copy()
spec_list["spec"] = "lars" + spec_list.index.astype(str)
bench_idx = spec_list["rmse"].idxmin()
spec_list.loc[bench_idx, "spec"] = "bench" + str(bench_idx)
# --- Base settings (data_flat=3 only) ---
base = spec_list.assign(gdp_lags=0, PCA=0, data_flat=3)
variants = [base]
# --- Flats branch (data_flat=4 only, no PCA/lag variants) ---
flats = base.assign(data_flat=4)
flats["spec"] = "flats" + flats.index.astype(str)
variants.append(flats)
# --- PCA variants ONLY on base (data_flat=3) ---
p5 = base.assign(PCA=5)
p5["spec"] = p5["spec"] + "p5"
variants.append(p5)
p3 = base.assign(PCA=3)
p3["spec"] = p3["spec"] + "p3"
variants.append(p3)
# --- Lags variant ONLY for p5 specs on base (data_flat=3) ---
p5_l2 = p5.copy()
p5_l2["gdp_lags"] = 2
p5_l2["spec"] = p5_l2["spec"] + "l2"
variants.append(p5_l2)
# --- Final spec_list ---
spec_list = pd.concat(variants, axis=0)
print(f"Total iterations: {len(spec_list)}")
spec_list
Total iterations: 95
Out[7]:
| selected_vars | rmse | ratio | spec | gdp_lags | PCA | data_flat | |
|---|---|---|---|---|---|---|---|
| size | |||||||
| 46 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.704928 | 0.989205 | lars46 | 0 | 0 | 3 |
| 49 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.707980 | 0.990320 | lars49 | 0 | 0 | 3 |
| 15 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.633649 | 0.963137 | lars15 | 0 | 0 | 3 |
| 26 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.669792 | 0.976355 | lars26 | 0 | 0 | 3 |
| 43 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.697095 | 0.986340 | lars43 | 0 | 0 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 32 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.664359 | 0.974368 | lars32p5l2 | 2 | 5 | 3 |
| 21 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.731081 | 0.998769 | lars21p5l2 | 2 | 5 | 3 |
| 45 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.706650 | 0.989834 | lars45p5l2 | 2 | 5 | 3 |
| 11 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.682169 | 0.980881 | lars11p5l2 | 2 | 5 | 3 |
| 38 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.675263 | 0.978356 | lars38p5l2 | 2 | 5 | 3 |
95 rows × 7 columns
Human models¶
Define any additional model of interest here:
In [8]:
N = [
{
'size' : 9 ,
'selected_vars' : 'UGTR0000 MPAY0001 UBML0000 XEMP0004 XFED0005 XOIL0000 MLOA1000 XMIM0000' ,
'spec' : 'human1' ,
'gdp_lags' : 2,
'PCA' : 3,
'data_flat' : 3,
} ,
]
'''
for d in N :
spec_list = pd.concat(
[
spec_list,
pd.DataFrame( [ d ] ).set_index('size') ] ,
axis=0
)
spec_list
'''
Out[8]:
"\nfor d in N :\n spec_list = pd.concat(\n [\n spec_list,\n pd.DataFrame( [ d ] ).set_index('size') ] ,\n axis=0\n )\nspec_list\n"
Spec Trim¶
Keep only specifications that have shown good past performance.
Con: May exclude specifications that were good in some models but bad overall.
In [9]:
# Read last_performance file into a dataframe
if last_performance is not None :
last_perf = pd.read_excel(f"{PATH_OUTPUT}/{last_performance}")
# group last_performance by spec with the min, max, and mean RMSE
last_perf = last_perf.groupby('spec')['RMSE'].agg(['min', 'max', 'median'])
# Generate a flag for whether the median RMSE is above or below the overall median
last_perf['flag'] = last_perf['median']>last_perf['median'].median()
last_perf = last_perf[['flag']]
# merge spec_list and last_perf on 'spec' column
spec_list = spec_list.merge(last_perf, left_on='spec', right_on='spec', how='left')
# drop selected specifications
spec_list = spec_list[spec_list['flag'] != True]
spec_list
Nowcasting¶
WORKSHOP EDIT:
Reduce specifications due to time constraint
In [10]:
spec_list = spec_list[:2]
spec_list
Out[10]:
| selected_vars | rmse | ratio | spec | gdp_lags | PCA | data_flat | |
|---|---|---|---|---|---|---|---|
| size | |||||||
| 46 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.704928 | 0.989205 | lars46 | 0 | 0 | 3 |
| 49 | MNPL0009 XIMP0002 MXRT0101 MNPL0014 XEMP0002 R... | 2.707980 | 0.990320 | lars49 | 0 | 0 | 3 |
In [12]:
# =========================================================
# FULL JUPYTER-SAFE PARALLEL VERSION (Windows + Jupyter Lab)
# Parallelize across spec_list rows using joblib (loky)
# Save results EVERY 25 specs and at the end
# WITH A REAL PROGRESS BAR (completed specs + ETA)
# =========================================================
import traceback, timeit, os, warnings
warnings.filterwarnings("ignore")
import pandas as pd
from joblib import Parallel, delayed
from tqdm.auto import tqdm
# --- progress bar integration for joblib ---
# If you don't have it yet: pip install tqdm-joblib
from tqdm_joblib import tqdm_joblib
# ---------------------------------------------------------
# 0) Thread caps (IMPORTANT) — set BEFORE heavy numeric work
# ---------------------------------------------------------
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
torch.set_num_threads(2) # lets each worker use 2 CPU threads for PyTorch ops
# 8 cores -> typically best is 7 workers (leave 1 core free)
# If run['lstm']==1 frequently, set this to 4–6 instead.
MAX_WORKERS = 6 if run.get('lstm', 0) == 1 else max(1, os.cpu_count() - 1)
print(f"Detected {os.cpu_count()} logical cores. Using {MAX_WORKERS} workers")
# ---------------------------------------------------------
# 1) Load existing output tables once
# ---------------------------------------------------------
start = timeit.default_timer()
try:
stop
result_table = pd.read_excel(f"{PATH_OUTPUT}/{forecast_file}", index_col="date")
performance_table = pd.read_excel(f"{PATH_OUTPUT}/{performance_file}")
outofsample_table = pd.read_excel(f"{PATH_OUTPUT}/{outofsample_file}", index_col="date")
print("Existing files loaded.")
except Exception:
print("No former files exist. Starting from spec 1.")
performance_table = pd.DataFrame()
result_table = pd.DataFrame()
outofsample_table = pd.DataFrame()
print(f"Running and saving in: {performance_file}")
# ---------------------------------------------------------
# 2) Build list of specs to run (skip already-done)
# ---------------------------------------------------------
done_specs = set(performance_table["spec"].unique()) if "spec" in performance_table.columns else set()
rows = []
for _, row in spec_list.iterrows():
spec = row["spec"]
if spec in done_specs:
continue
rows.append(row.to_dict())
print(f"Total specs queued: {len(rows)} (skipping {len(done_specs)} already done)")
# ---------------------------------------------------------
# 3) Worker: runs ONE spec and returns its outputs
# ---------------------------------------------------------
def _run_one_spec_worker(row_dict,
data, metadata,
target_variable, lags, run,
train_start_date, test_start_date, test_end_date,
_iter_, desired_date):
"""
Returns:
(spec, perf_df, fcst_df, oos_df, err_str)
where err_str is None if success, else traceback string.
"""
try:
spec = row_dict["spec"]
XVars = row_dict["selected_vars"].split(" ")
# params
gdp_lags = row_dict.get("gdp_lag", 0)
param_pca = row_dict.get("PCA", 3)
data_flat = row_dict.get("data_flat", 3)
dt_n = row_dict.get("dt_n", _iter_)
rf_n = row_dict.get("rf_n", dt_n)
gbt_n = row_dict.get("gbt_n", dt_n)
# local imports inside worker (safe in spawned processes)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
# slice data for this spec
df = data[[c for c in data.columns if c in XVars or c == target_variable]].copy()
# GENERATE TRAIN/TEST DATASET
df = df.reset_index(drop=False)
df.dropna(axis=1, thresh=23, inplace=True)
test = df.loc[(df.date >= train_start_date) & (df.date <= test_end_date), :].reset_index(drop=True)
# DATES AND ACTUAL VALUES
dates = pd.date_range(test_start_date, test_end_date, freq="3MS").strftime("%Y-%m-%d").tolist()
actuals = list(test.loc[test.date.astype(str).isin(dates), target_variable].values)
pred_ols = {k: [] for k in lags}
pred_olsr = {k: [] for k in lags}
pred_enet = {k: [] for k in lags}
pred_lasso= {k: [] for k in lags}
pred_dt = {k: [] for k in lags}
pred_rf = {k: [] for k in lags}
pred_gbt = {k: [] for k in lags}
pred_lstm = {k: [] for k in lags}
for date in dates:
# define rolling-window training set
train = test.loc[
test.date <= str(pd.to_datetime(date) - pd.tseries.offsets.DateOffset(months=3))[:10],
:
]
transformed_train = mean_fill_dataset(train, train)
# -----------------------------------------------------------------
# PCA scaling (if enabled)
if param_pca not in (False, 0, None):
ncomp = 1 if param_pca is True else int(param_pca)
X = transformed_train.set_index("date").drop(target_variable, axis=1)
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns).dropna(axis=1)
Z = PCA(n_components=ncomp).fit_transform(X)
Z = pd.DataFrame(Z).rename(columns=lambda x: f"PC_{x}")
transformed_train = pd.concat([transformed_train[["date", target_variable]], Z], axis=1)
ttrain = flatten_data(transformed_train, target_variable, data_flat)
# keep quarterly observations and drop early observations
ttrain = (
ttrain.loc[ttrain.date.dt.month.astype(int).isin([3, 6, 9, 12]), :]
.fillna(ttrain.mean(numeric_only=True))
.dropna(axis=1)
.reset_index(drop=True)
)
ttrain = lagged_target(ttrain, gdp_lags)
xt = ttrain.drop(["date", target_variable], axis=1)
yt = ttrain[target_variable]
# -----------------------------------------------------------------
# Apply model in train
if run.get("ols", 0) == 1:
ols_fit, ols_vars = fit_ols(yt, xt)
if run.get("olsr", 0) == 1:
olsr_fit, olsr_vars = fit_olsr(yt, xt)
if run.get("enet", 0) == 1:
enet_fit, enet_vars = fit_enet(yt, xt)
if run.get("lasso", 0) == 1:
lasso_fit, lasso_vars = fit_lasso(yt, xt)
if run.get("dt", 0) == 1:
dt_fit, dt_vars = fit_dt(yt, xt, ModelN=dt_n)
if run.get("rf", 0) == 1:
rf_fit, rf_vars = fit_rf(yt, xt, ModelN=rf_n)
if run.get("gbt", 0) == 1:
gbt_fit, gbt_vars = fit_gbt(yt, xt, ModelN=gbt_n)
if run.get("lstm", 0) == 1:
# If you use torch heavily and see instability, consider:
# import torch; torch.set_num_threads(1); torch.set_num_interop_threads(1)
lstm_fit, lstm_vars = fit_lstm(transformed_train, gdp_lags=gdp_lags)
# -----------------------------------------------------------------
# Publication Lags in Test data
for lag in lags:
tmp_data = gen_lagged_data(metadata, test, date, lag)
tmp_data = mean_fill_dataset(train, tmp_data)
if param_pca not in (False, 0, None):
ncomp = 1 if param_pca is True else int(param_pca)
X = tmp_data.set_index("date").drop(target_variable, axis=1)
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns).dropna(axis=1)
Z = PCA(n_components=ncomp).fit_transform(X)
Z = pd.DataFrame(Z).rename(columns=lambda x: f"PC_{x}")
tmp_data = pd.concat([tmp_data[["date", target_variable]], Z], axis=1)
test_data = flatten_data(tmp_data, target_variable, data_flat)
test_data = lagged_target(test_data, gdp_lags)
test_data = test_data.loc[test_data["date"] == date, :].drop(["date", target_variable], axis=1)
# Predicted values
if run.get("ols", 0) == 1:
pred_ols = predict_ols(ols_fit, test_data, ols_vars, date, pred_ols, lag)
if run.get("olsr", 0) == 1:
pred_olsr = predict_olsr(olsr_fit, test_data, olsr_vars, date, pred_olsr, lag)
if run.get("enet", 0) == 1:
pred_enet = predict_enet(enet_fit, test_data, enet_vars, date, pred_enet, lag)
if run.get("lasso", 0) == 1:
pred_lasso = predict_lasso(lasso_fit, test_data, lasso_vars, date, pred_lasso, lag)
if run.get("dt", 0) == 1:
pred_dt = predict_dt(dt_fit, test_data, dt_vars, date, pred_dt, lag)
if run.get("rf", 0) == 1:
pred_rf = predict_rf(rf_fit, test_data, rf_vars, date, pred_rf, lag)
if run.get("gbt", 0) == 1:
pred_gbt = predict_gbt(gbt_fit, test_data, gbt_vars, date, pred_gbt, lag)
if run.get("lstm", 0) == 1:
pred_lstm = predict_lstm(lstm_fit, tmp_data, lstm_vars, date, pred_lstm, lag, gdp_lags=gdp_lags)
# -----------------------------------------------------------------
# Prepare Tables (ONLY this spec)
perf_parts = []
fcst_parts = []
if run.get("ols", 0) == 1:
perf_parts.append(perform_tab(pred_ols, "OLS", spec, actuals))
fcst_parts.append(forecast_table(pred_ols, "OLS", spec, actuals, dates))
if run.get("olsr", 0) == 1:
perf_parts.append(perform_tab(pred_olsr, "OLSR", spec, actuals))
fcst_parts.append(forecast_table(pred_olsr, "OLSR", spec, actuals, dates))
if run.get("enet", 0) == 1:
perf_parts.append(perform_tab(pred_enet, "ENET", spec, actuals))
fcst_parts.append(forecast_table(pred_enet, "ENET", spec, actuals, dates))
if run.get("lasso", 0) == 1:
perf_parts.append(perform_tab(pred_lasso, "LASSO", spec, actuals))
fcst_parts.append(forecast_table(pred_lasso, "LASSO", spec, actuals, dates))
if run.get("dt", 0) == 1:
perf_parts.append(perform_tab(pred_dt, "DT", spec, actuals))
fcst_parts.append(forecast_table(pred_dt, "DT", spec, actuals, dates))
if run.get("rf", 0) == 1:
perf_parts.append(perform_tab(pred_rf, "RF", spec, actuals))
fcst_parts.append(forecast_table(pred_rf, "RF", spec, actuals, dates))
if run.get("gbt", 0) == 1:
perf_parts.append(perform_tab(pred_gbt, "GBT", spec, actuals))
fcst_parts.append(forecast_table(pred_gbt, "GBT", spec, actuals, dates))
if run.get("lstm", 0) == 1:
perf_parts.append(perform_tab(pred_lstm, "LSTM", spec, actuals))
fcst_parts.append(forecast_table(pred_lstm, "LSTM", spec, actuals, dates))
perf_df = pd.concat(perf_parts, axis=0) if perf_parts else pd.DataFrame()
fcst_df = pd.concat(fcst_parts, axis=0) if fcst_parts else pd.DataFrame()
# OOS block removed here for stability/speed in parallel
oos_parts = []
# -----------------------------------------------------------------
# OUT OF SAMPLE ESTIMATION
# -----------------------------------------------------------------
if desired_date is not None:
# Extend dataframe to desired_date
while desired_date > df['date'].max():
df.loc[len(df), "date"] = df['date'].max() + pd.DateOffset(months=1)
transformed_new_data = mean_fill_dataset(df, df)
# Apply PCA if specified
if param_pca not in (False, 0, None):
X = pd.DataFrame(scaler.fit_transform(transformed_new_data.set_index('date').drop(target_variable, axis=1)))
X.columns = transformed_new_data.set_index('date').drop(target_variable, axis=1).columns
X = X.dropna(axis=1)
pca = PCA(n_components=param_pca)
X = pca.fit_transform(X)
X = pd.DataFrame(X)
X.rename(columns=lambda x: f'PC_{x}', inplace=True)
transformed_new_data = pd.concat([transformed_new_data[['date', target_variable]], X], axis=1)
# Prepare OOS data
transformed_data = flatten_data(transformed_new_data, target_variable, data_flat)
transformed_data = lagged_target(transformed_data, gdp_lags)
transformed_data = transformed_data[transformed_data['date'] > date]
transformed_data = transformed_data[transformed_data['date'] <= desired_date]
transformed_data = transformed_data.loc[transformed_data.date.dt.month.astype(int).isin([3,6,9,12]),:].fillna(transformed_train.mean()).dropna(axis=1)
new_dates = transformed_data['date']
transformed_data = transformed_data.drop(["date", target_variable], axis=1).reset_index(drop=True)
# Initialize empty outofsample_table (assuming it's a list or DataFrame)
outofsample_table = pd.DataFrame()
# Run OOS forecasts
if run.get('ols', 0) == 1:
outofsample_table = oos_ols(ols_fit, transformed_data, train_vars=ols_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date)
if run.get('olsr', 0) == 1:
outofsample_table = oos_olsr(olsr_fit, transformed_data, train_vars=olsr_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date)
if run.get('enet', 0) == 1:
outofsample_table = oos_enet(enet_fit, transformed_data, train_vars=enet_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date)
if run.get('lasso', 0) == 1:
outofsample_table = oos_lasso(lasso_fit, transformed_data, train_vars=lasso_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date )
if run.get('dt', 0) == 1:
outofsample_table = oos_dt(dt_fit, transformed_data, train_vars=dt_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date )
if run.get('rf', 0) == 1:
outofsample_table = oos_rf(rf_fit, transformed_data, train_vars=rf_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date )
if run.get('gbt', 0) == 1:
outofsample_table = oos_gbt(gbt_fit, transformed_data, train_vars=gbt_vars,
table=outofsample_table, new_dates=new_dates , spec=spec, desired_date=desired_date)
if run.get('lstm', 0) == 1:
outofsample_table = oos_lstm(lstm_fit, transformed_new_data, train_vars=lstm_vars,
table=outofsample_table, new_dates=new_dates, spec=spec, desired_date=desired_date)
# Convert to DataFrame if needed
oos_df = pd.DataFrame(outofsample_table) if isinstance(outofsample_table, list) else outofsample_table
return spec, perf_df, fcst_df, oos_df, None
except Exception:
return row_dict.get("spec", "UNKNOWN"), None, None, None, traceback.format_exc()
# ---------------------------------------------------------
# 4) Run in parallel (joblib loky) and collect results
# ---------------------------------------------------------
perf_out, fcst_out, oos_out = [], [], []
failed = []
# NEW: Set batch size for incremental saves
SAVE_BATCH_SIZE = 25
specs_processed = 0
if len(rows) > 0:
'''
results = Parallel(n_jobs=MAX_WORKERS, backend="loky", verbose=10)(
delayed(_run_one_spec_worker)(
row_dict,
data, metadata,
target_variable, lags, run,
train_start_date, test_start_date, test_end_date,
_iter_, desired_date
)
for row_dict in tqdm(rows, desc="Queueing specs")
)
'''
results = []
with tqdm(total=len(rows), desc="Processing specs") as pbar:
for result in Parallel(n_jobs=MAX_WORKERS, backend="loky", verbose=5, return_as="generator")(
delayed(_run_one_spec_worker)(
row_dict, data, metadata, target_variable, lags, run,
train_start_date, test_start_date, test_end_date, _iter_, desired_date
) for row_dict in rows
):
results.append(result)
pbar.update(1)
for spec, perf_df, fcst_df, oos_df, err in results:
if err is not None:
failed.append(spec)
print(f"FAILED spec={spec}\n{err}\n")
continue
if perf_df is not None and not perf_df.empty:
perf_out.append(perf_df)
if fcst_df is not None and not fcst_df.empty:
fcst_out.append(fcst_df)
if oos_df is not None and not oos_df.empty:
oos_out.append(oos_df)
specs_processed += 1
# NEW: Save every SAVE_BATCH_SIZE specs
if specs_processed % SAVE_BATCH_SIZE == 0:
print(f"\n{'='*60}")
print(f"SAVING BACKUP after {specs_processed} specs")
print(f"{'='*60}")
# Merge with existing tables
if perf_out:
temp_perf = pd.concat([performance_table] + perf_out, axis=0)
temp_perf.to_excel(f"{PATH_OUTPUT}/{performance_file}")
print(f"Saved {len(temp_perf)} rows to {performance_file}")
if fcst_out:
temp_fcst = pd.concat([result_table] + fcst_out, axis=0)
temp_fcst.index = pd.to_datetime(temp_fcst.index).strftime("%Y-%m-%d")
temp_fcst.to_excel(f"{PATH_OUTPUT}/{forecast_file}", index_label="date")
print(f"Saved {len(temp_fcst)} rows to {forecast_file}")
if oos_out:
temp_oos = pd.concat([outofsample_table] + oos_out, axis=0)
temp_oos.index = pd.to_datetime(temp_oos.index).strftime("%Y-%m-%d")
temp_oos.to_excel(f"{PATH_OUTPUT}/{outofsample_file}", index_label="date")
print(f"Saved {len(temp_oos)} rows to {outofsample_file}")
print(f"{'='*60}\n")
pbar.update(1)
# ---------------------------------------------------------
# 5) Final merge and save (FAST)
# ---------------------------------------------------------
if perf_out:
performance_table = pd.concat([performance_table] + perf_out, axis=0, ignore_index=False)
if fcst_out:
result_table = pd.concat([result_table] + fcst_out, axis=0)
if oos_out:
outofsample_table = pd.concat([outofsample_table] + oos_out, axis=0)
# Final save
performance_table.to_excel(f"{PATH_OUTPUT}/{performance_file}")
result_table.index = pd.to_datetime(result_table.index).strftime("%Y-%m-%d")
result_table.index = pd.to_datetime(result_table.index)
result_table.to_excel(f"{PATH_OUTPUT}/{forecast_file}", index_label="date")
outofsample_table.index = pd.to_datetime(outofsample_table.index).strftime("%Y-%m-%d")
outofsample_table.index = pd.to_datetime(outofsample_table.index)
outofsample_table.to_excel(f"{PATH_OUTPUT}/{outofsample_file}", index_label="date")
run_time = timeit.default_timer() - start
print(f"\nTotal runtime: {run_time:.2f} seconds")
print(f"Saved in {performance_file}")
if failed:
print(f"Failed specs ({len(failed)}): {failed}")
print(performance_table['RMSE'].describe())
print(f"Original Benchmark: {benchmark}")
performance_table
Detected 16 logical cores. Using 6 workers No former files exist. Starting from spec 1. Running and saving in: 20260511 3_nwcst Tab perf.xlsx Total specs queued: 2 (skipping 0 already done)
Processing specs: 0%| | 0/2 [00:00<?, ?it/s][Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers. Processing specs: 50%|█████ | 1/2 [02:36<02:36, 156.59s/it][Parallel(n_jobs=6)]: Done 2 out of 2 | elapsed: 2.6min finished Processing specs: 100%|██████████| 2/2 [02:37<00:00, 78.79s/it]
Total runtime: 158.05 seconds Saved in 20260511 3_nwcst Tab perf.xlsx count 80.000000 mean 4.976394 std 6.038380 min 2.251663 25% 2.509261 50% 2.561589 75% 3.149853 max 30.802763 Name: RMSE, dtype: float64 Original Benchmark: 2.734448003356505
Out[12]:
| RMSE | estimator | spec | |
|---|---|---|---|
| Vintage | |||
| -2 | 2.298953 | OLS | lars46 |
| -1 | 2.291649 | OLS | lars46 |
| 0 | 2.624861 | OLS | lars46 |
| 1 | 12.721925 | OLS | lars46 |
| 2 | 13.272481 | OLS | lars46 |
| ... | ... | ... | ... |
| -2 | 2.600459 | LSTM | lars49 |
| -1 | 2.520544 | LSTM | lars49 |
| 0 | 2.603486 | LSTM | lars49 |
| 1 | 2.549008 | LSTM | lars49 |
| 2 | 2.649067 | LSTM | lars49 |
80 rows × 3 columns
In [13]:
result_table
Out[13]:
| actuals | two_back | one_back | zero_back | one_ahead | two_ahead | estimator | spec | |
|---|---|---|---|---|---|---|---|---|
| 2023-06-01 | -0.066758 | -0.419722 | -0.332563 | -1.330659 | -35.239000 | -36.206200 | OLS | lars46 |
| 2023-09-01 | 1.988200 | -0.308410 | -0.357085 | -0.358990 | -7.257542 | -1.693524 | OLS | lars46 |
| 2023-12-01 | -0.778219 | -0.228725 | 0.150708 | 0.873922 | 2.111830 | 1.235066 | OLS | lars46 |
| 2024-03-01 | -0.054704 | -0.075191 | 0.255990 | 0.315120 | -1.867632 | 0.578832 | OLS | lars46 |
| 2024-06-01 | -0.997453 | -0.277602 | -0.212376 | 0.435814 | -4.414146 | -5.344480 | OLS | lars46 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-12-01 | 1.691583 | -0.187005 | 0.239369 | 0.297889 | 0.241903 | 0.444719 | LSTM | lars49 |
| 2025-03-01 | 1.086629 | -0.026154 | -0.705539 | -1.088655 | -0.908419 | -1.005068 | LSTM | lars49 |
| 2025-06-01 | 0.255034 | 0.926614 | 0.676982 | 0.648158 | 0.395492 | 0.603667 | LSTM | lars49 |
| 2025-09-01 | 0.350465 | 0.315954 | 0.193476 | -0.264647 | -0.701200 | -0.409720 | LSTM | lars49 |
| 2025-12-01 | -7.272064 | 0.345186 | 0.046084 | 0.148406 | -0.016349 | 0.528473 | LSTM | lars49 |
176 rows × 8 columns
In [14]:
outofsample_table
Out[14]:
| nowcast | estimator | spec | |
|---|---|---|---|
| date | |||
| 2026-03-01 | 1.655487 | OLS | lars46 |
| 2026-03-01 | 1.655481 | OLSR | lars46 |
| 2026-03-01 | 2.907320 | ENET | lars46 |
| 2026-03-01 | 2.907275 | LASSO | lars46 |
| 2026-03-01 | 2.794481 | DT | lars46 |
| 2026-03-01 | 0.399659 | RF | lars46 |
| 2026-03-01 | 0.655385 | GBT | lars46 |
| 2026-03-01 | 1.614919 | OLS | lars49 |
| 2026-03-01 | 1.614931 | OLSR | lars49 |
| 2026-03-01 | 1.628272 | ENET | lars49 |
| 2026-03-01 | 1.628290 | LASSO | lars49 |
| 2026-03-01 | 2.794481 | DT | lars49 |
| 2026-03-01 | 0.673723 | RF | lars49 |
| 2026-03-01 | 0.598823 | GBT | lars49 |
In [ ]: