Random Splits vs Temporal Validation
====================================

``sklearn.model_selection.train_test_split`` is useful when observations can be
treated as approximately independent and identically distributed. That is not
the question Jano is designed to answer.

For time-correlated data, the question is usually operational:

   How would the model have behaved if it had only seen the past and then had
   to predict the future?

A random split can hide that question because it mixes dates across train and
test.

The first snippet assumes scikit-learn is installed only to illustrate the
common baseline. Jano itself does not require scikit-learn.

The scikit-learn Way
--------------------

Imagine a daily dataset where the target distribution changes near the end of
the period:

.. code-block:: python

   import pandas as pd
   from sklearn.model_selection import train_test_split

   frame = pd.DataFrame(
       {
           "timestamp": pd.date_range("2025-01-01", periods=120, freq="D"),
           "feature": range(120),
           "target": [0] * 80 + [1] * 40,
       }
   )

   train_random, test_random = train_test_split(
       frame,
       test_size=0.2,
       shuffle=True,
       random_state=7,
   )

   temporal_leakage = (
       train_random["timestamp"].max() > test_random["timestamp"].min()
   )

   print(temporal_leakage)  # True

The problem is not that scikit-learn is wrong. ``train_test_split`` is doing
what it is designed to do: random sampling. The problem is that random sampling
is the wrong abstraction for production-like temporal validation.

In this setup, train can contain observations from dates that are later than
some test observations. If the target changes over time, the evaluation can
become too optimistic because the model has already seen part of the future
regime.

The Jano Version
----------------

With Jano, the split is not defined as a random share of rows. It is defined as
a temporal policy:

.. code-block:: python

   import pandas as pd
   from jano import TemporalPartitionSpec, WalkForwardPolicy

   frame = pd.DataFrame(
       {
           "timestamp": pd.date_range("2025-01-01", periods=120, freq="D"),
           "feature": range(120),
           "target": [0] * 80 + [1] * 40,
       }
   )

   policy = WalkForwardPolicy(
       time_col="timestamp",
       partition=TemporalPartitionSpec(
           layout="train_test",
           train_size="60D",
           test_size="14D",
           gap_before_test="1D",
       ),
       step="14D",
       strategy="rolling",
   )

   plan = policy.plan(frame, title="Production-like temporal validation")

   print(
       plan.to_frame()[
           [
               "iteration",
               "train_start",
               "train_end",
               "train_rows",
               "test_start",
               "test_end",
               "test_rows",
           ]
       ].head()
   )

The plan makes the temporal contract explicit before any model is trained:

.. code-block:: text

   iteration  train_start  train_end   train_rows  test_start  test_end    test_rows
   0          2025-01-01   2025-03-02  60          2025-03-03  2025-03-17  14
   1          2025-01-15   2025-03-16  60          2025-03-17  2025-03-31  14
   2          2025-01-29   2025-03-30  60          2025-03-31  2025-04-14  14
   3          2025-02-12   2025-04-13  60          2025-04-14  2025-04-28  14

What Changes
------------

The difference is the evaluation contract:

- ``train_test_split`` answers: can this model generalize to a random sample
  from the same mixed period?
- Jano answers: how would this model behave as time advances under a specific
  training and evaluation policy?

That gives you:

- ordered train and test windows,
- explicit train/test duration,
- explicit gaps to model label or data availability latency,
- repeated folds instead of one static estimate,
- a ``plan()`` object that can be inspected, filtered and audited before
  slicing the dataset.

This is the point where Jano enters: not as a replacement for scikit-learn, but
as the temporal validation layer that sits before model training.
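
To make the arithmetic behind a rolling walk-forward plan concrete, the window
boundaries can be reproduced with plain pandas. This is a minimal sketch, not
Jano's implementation: the helper ``rolling_windows`` and its parameter names
are hypothetical, and it assumes half-open windows ``[start, end)``, which is an
inference from the plan table rather than documented behavior.

.. code-block:: python

   import pandas as pd

   frame = pd.DataFrame(
       {
           "timestamp": pd.date_range("2025-01-01", periods=120, freq="D"),
           "feature": range(120),
           "target": [0] * 80 + [1] * 40,
       }
   )


   def rolling_windows(
       frame,
       time_col="timestamp",
       train_size="60D",
       test_size="14D",
       gap="1D",
       step="14D",
   ):
       """Hypothetical helper: enumerate rolling train/test windows.

       Assumes half-open windows [start, end) and daily data.
       """
       ts = frame[time_col]
       train_delta = pd.Timedelta(train_size)
       test_delta = pd.Timedelta(test_size)
       gap_delta = pd.Timedelta(gap)
       step_delta = pd.Timedelta(step)

       # Exclusive end of the data, assuming one observation per day.
       data_end = ts.max() + pd.Timedelta("1D")
       train_start = ts.min()

       rows = []
       while True:
           train_end = train_start + train_delta
           test_start = train_end + gap_delta
           test_end = test_start + test_delta
           # Stop once the test window would run past the available data.
           if test_end > data_end:
               break
           train_mask = (ts >= train_start) & (ts < train_end)
           test_mask = (ts >= test_start) & (ts < test_end)
           rows.append(
               {
                   "iteration": len(rows),
                   "train_start": train_start,
                   "train_end": train_end,
                   "train_rows": int(train_mask.sum()),
                   "test_start": test_start,
                   "test_end": test_end,
                   "test_rows": int(test_mask.sum()),
               }
           )
           # Rolling strategy: the whole window moves forward by one step.
           train_start = train_start + step_delta
       return pd.DataFrame(rows)


   windows = rolling_windows(frame)
   print(windows)

In every fold produced by this sketch, ``train_end`` is strictly earlier than
``test_start``, so no training row can come from a date later than a test row.
That ordering is the property a random split cannot guarantee.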