External datasets

Jano examples should be reproducible without committing large datasets to Git. The repository version-controls dataset metadata and download code, while all downloaded files stay local under data/raw/.

The data/ directory is intentionally ignored by Git.

Registry

Dataset metadata lives in datasets/registry.json. Each entry records the source URL, source page, license or terms note, expected local path, task type, time column and suggested target column.

The current registry includes:

  • bike_sharing_hourly for small regression and walk-forward examples.

  • bts_airline_2024_01 for ordinal delay-cost and retraining examples.

  • nyc_tlc_yellow_2024_01 for larger Parquet-based performance examples.

  • household_power for minute-level time-series examples.
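
As an illustration of the fields listed above, the sketch below checks that a registry entry defines all of them. The key names (source_url, source_page, license_note, local_path, task_type, time_column, target_column) and the sample values are assumptions made for this example, not the actual schema of datasets/registry.json.

```python
import json

# Hypothetical key names; the real schema in datasets/registry.json may differ.
REQUIRED_KEYS = {
    "source_url",
    "source_page",
    "license_note",
    "local_path",
    "task_type",
    "time_column",
    "target_column",
}


def missing_keys(entry: dict) -> set:
    """Return the required keys that an entry does not define."""
    return REQUIRED_KEYS - set(entry)


# Illustrative registry content with placeholder values, not real metadata.
SAMPLE_REGISTRY = json.loads("""
{
  "example_dataset": {
    "source_url": "https://example.com/data.zip",
    "source_page": "https://example.com/dataset",
    "license_note": "placeholder terms note",
    "local_path": "data/raw/example_dataset/data.csv",
    "task_type": "regression",
    "time_column": "timestamp",
    "target_column": "y"
  }
}
""")

for name, entry in SAMPLE_REGISTRY.items():
    # An empty list means the entry defines every required key.
    print(name, sorted(missing_keys(entry)))
```

A check of this kind could run in the automated tests, since it only reads the version-controlled metadata and needs no network access.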

Download locally

List available datasets:

python scripts/download_dataset.py --list

Download a dataset without storing it in Git:

python scripts/download_dataset.py bike_sharing_hourly --extract

By default the file is saved under data/raw/. You can override that location with --data-root:

python scripts/download_dataset.py nyc_tlc_yellow_2024_01 --data-root /tmp/jano-data
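
Code that reads a downloaded file, for example in a notebook, can fail early with a pointer to the download command when the file is absent. The sketch below shows one way to do that; the helper name require_local_file is hypothetical and not part of the repository.

```python
from pathlib import Path


def require_local_file(path, dataset_name):
    """Return the path as a Path, or raise with a hint to run the download script.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found. "
            f"Run: python scripts/download_dataset.py {dataset_name}"
        )
    return path
```

The path passed in would normally be the expected local path recorded for the dataset in the registry, adjusted if a different data root was used for the download.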

Policy

  • Commit metadata, examples and download scripts.

  • Do not commit downloaded CSV, ZIP, Parquet or cache files.

  • Keep notebooks executable: each notebook should either download its dataset or read local files from data/raw/.

  • Keep automated tests independent from network access; use synthetic fixtures or mocked local downloads.

  • Mark any future real-data checks as optional or external-data tests.
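
The testing points above can be sketched with the standard library alone. In this minimal example, fetch is a hypothetical stand-in for a download helper (not the project's real function), the download call is replaced by a mock that writes a synthetic CSV, and the environment variable RUN_EXTERNAL_DATA_TESTS is an assumed name for opting in to real-data checks.

```python
import csv
import os
import tempfile
import unittest
import urllib.request
from pathlib import Path
from unittest import mock


def fetch(url, destination):
    """Hypothetical download helper; stands in for the real script logic."""
    urllib.request.urlretrieve(url, destination)
    return Path(destination)


def write_synthetic_csv(path):
    """Create a tiny synthetic dataset so tests need no real data."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["timestamp", "y"])
        writer.writerow(["2024-01-01T00:00:00", "1.0"])
        writer.writerow(["2024-01-01T01:00:00", "2.0"])


def fake_urlretrieve(url, filename):
    """Replacement for urlretrieve that writes the synthetic file locally."""
    write_synthetic_csv(filename)
    return filename, None


class FetchWithoutNetworkTest(unittest.TestCase):
    def test_fetch_uses_mocked_download(self):
        with tempfile.TemporaryDirectory() as tmp:
            target = Path(tmp) / "sample.csv"
            # Patch the network call so the test never leaves the machine.
            with mock.patch(
                "urllib.request.urlretrieve", side_effect=fake_urlretrieve
            ) as patched:
                result = fetch("https://example.invalid/sample.csv", target)
            patched.assert_called_once()
            with open(result, newline="") as handle:
                rows = list(csv.reader(handle))
        self.assertEqual(rows[0], ["timestamp", "y"])
        self.assertEqual(len(rows), 3)


# Assumed opt-in switch: real-data checks are skipped unless it is set to "1".
@unittest.skipUnless(
    os.environ.get("RUN_EXTERNAL_DATA_TESTS") == "1",
    "external-data tests are opt-in",
)
class ExternalDataTest(unittest.TestCase):
    def test_raw_data_directory_exists(self):
        self.assertTrue(Path("data/raw").is_dir())
```

Saved as a test module, this runs with python -m unittest. The external-data class is reported as skipped unless the opt-in variable is set, so the default test run stays independent of both the network and any downloaded files.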