Reproducible Coffee Market Analytics Pipeline
Design, implementation rationale, and workflow structure for a modular time-series analytics pipeline orchestrated with Snakemake, with an LLM used strictly as a reporting layer.
Goals
- Provide a clean, automated, reproducible analytical workflow
- Support iterative experimentation and extension
- Separate processing, modeling, and reporting concerns
- Integrate a local LLM for interpretation and reporting (not prediction)
High-level pipeline
The workflow is linear but modular, with explicit artifacts at each stage:
fetch → store → features → model → report
Each step produces versioned outputs consumed by downstream steps, improving transparency, reproducibility, and debugging.
Data ingestion
The ingestion step retrieves Coffee (Arabica) price data from the FRED database and produces a clean, validated time series.
- Fetches CSV data via HTTP from the FRED endpoint
- Enforces a strict schema: `date` (timestamp) and `value` (numeric price)
- Removes missing or non-numeric observations
- Sorts chronologically
This step is intentionally minimal: no transformations beyond validation and cleaning to keep raw data auditable.
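A minimal sketch of the cleaning step, assuming pandas and a FRED-style CSV with `DATE`/`VALUE` columns (the endpoint URL and column names are assumptions about the actual source, not confirmed details of this pipeline):

```python
import pandas as pd

# Assumed FRED CSV endpoint; the concrete series id is an illustration, not the pipeline's.
FRED_CSV_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series}"

def clean_series(raw: pd.DataFrame, date_col: str = "date", value_col: str = "value") -> pd.DataFrame:
    """Validate and clean a raw price series: strict schema, drop bad rows, sort by date."""
    df = raw.rename(columns=str.lower)[[date_col, value_col]].copy()
    # Coerce to the expected types; anything unparseable becomes NaT/NaN and is dropped.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    df[value_col] = pd.to_numeric(df[value_col], errors="coerce")
    return df.dropna().sort_values(date_col).reset_index(drop=True)
```

Fetching reduces to `clean_series(pd.read_csv(FRED_CSV_URL.format(series=...)))`; keeping validation in a pure function makes it testable without network access.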
Storage (SQLite checkpoint)
Cleaned data is persisted to a SQLite table, replaced on each run to guarantee reproducibility.
SQLite provides a realistic production-style boundary between ingestion and feature engineering and supports future extensions (multiple assets, metadata tables).
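The checkpoint can be sketched in a few lines with pandas and the standard-library `sqlite3` module (the table name `prices` is an assumption):

```python
import sqlite3
import pandas as pd

def store_prices(df: pd.DataFrame, db_path: str, table: str = "prices") -> int:
    """Persist the cleaned series, replacing the table on each run for reproducibility."""
    with sqlite3.connect(db_path) as conn:
        # if_exists="replace" guarantees the checkpoint always reflects the latest ingest.
        df.to_sql(table, conn, if_exists="replace", index=False)
        n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return n
```

Downstream steps read from this table rather than from the raw CSV, which is what makes the ingestion/feature boundary explicit.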
Feature engineering
The time series is transformed into a supervised learning dataset for next-step return prediction.
- Log price transformation (positive values enforced)
- Log returns (first differences of log prices)
- Lagged return features
- Rolling statistics (mean and standard deviation)
- One-step-ahead log return as the prediction target
Rows with missing values introduced by lags/rolling windows are removed. The feature set is intentionally simple and interpretable while capturing short- and medium-term structure.
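The transformations above can be sketched as follows (the lag count and rolling window are illustrative defaults, not the pipeline's actual hyperparameters):

```python
import numpy as np
import pandas as pd

def make_features(df: pd.DataFrame, lags: int = 3, window: int = 5) -> pd.DataFrame:
    """Turn a price series into a supervised dataset for next-step return prediction."""
    out = df.copy()
    out = out[out["value"] > 0]                # log requires strictly positive prices
    out["log_price"] = np.log(out["value"])
    out["ret"] = out["log_price"].diff()       # log returns
    for k in range(1, lags + 1):
        out[f"ret_lag{k}"] = out["ret"].shift(k)
    out["roll_mean"] = out["ret"].rolling(window).mean()
    out["roll_std"] = out["ret"].rolling(window).std()
    out["target"] = out["ret"].shift(-1)       # one-step-ahead log return
    # Lags, rolling windows, and the shifted target all introduce NaNs at the edges.
    return out.dropna().reset_index(drop=True)
```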
Modeling & evaluation
A Ridge regression baseline is trained to predict next-step log returns. The model is kept simple for interpretability, stability, and as a clear reference point for future iterations.
- Time-ordered train/test split to avoid leakage
- Metrics: MAE and RMSE
- Artifacts saved: metrics + pointwise predictions
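A NumPy-only sketch of the evaluation logic, using the closed-form ridge solution in place of a library estimator (the pipeline presumably uses a standard implementation such as scikit-learn's `Ridge`; the regularization strength and test fraction here are assumptions):

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Closed-form ridge: w = (X'X + alpha*I)^-1 X'y (intercept omitted for brevity).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def evaluate(X: np.ndarray, y: np.ndarray, alpha: float = 1.0, test_frac: float = 0.2):
    # Time-ordered split: the final test_frac of rows is held out, so no future
    # observation ever influences training (no leakage).
    n_test = max(1, int(len(y) * test_frac))
    Xtr, Xte, ytr, yte = X[:-n_test], X[-n_test:], y[:-n_test], y[-n_test:]
    pred = Xte @ ridge_fit(Xtr, ytr, alpha)
    err = pred - yte
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return {"mae": mae, "rmse": rmse}, pred
```

Both the metrics dict and the pointwise predictions are returned so each can be saved as a separate artifact.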
LLM-assisted reporting
The LLM is used strictly for interpretation and reporting — not for prediction. It never receives raw data directly. Instead, it is given a compact summary bundle containing:
- Model metadata and performance metrics
- Recent prediction error statistics
- A small window of recent predictions
This design keeps reporting grounded and reproducible, and reduces the risk of speculation: the LLM functions as a constrained narrative layer over validated outputs.
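Assembling the summary bundle might look like this (the field names and window size are hypothetical; only the general shape — metadata, metrics, error stats, recent predictions — comes from the design above):

```python
import json

def build_report_bundle(metrics: dict, predictions: list, window: int = 10) -> str:
    """Produce the compact, validated summary the LLM sees instead of raw data."""
    recent = predictions[-window:]
    errors = [p["pred"] - p["actual"] for p in recent]
    bundle = {
        "model": {"type": "ridge", "target": "next-step log return"},
        "metrics": metrics,
        "recent_error_stats": {
            "mean_error": sum(errors) / len(errors),
            "max_abs_error": max(abs(e) for e in errors),
        },
        "recent_predictions": recent,
    }
    return json.dumps(bundle, indent=2)
```

Because the bundle is plain JSON built from saved artifacts, the exact prompt context given to the LLM is itself reproducible.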
Workflow orchestration
All steps are orchestrated with Snakemake. Each rule declares explicit inputs and outputs, enabling:
- Automatic dependency resolution
- Incremental recomputation
- End-to-end execution with a single command
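A representative rule might look like the following sketch (file paths and script names are hypothetical, chosen only to illustrate the explicit input/output declarations):

```python
rule features:
    input:
        "data/coffee.sqlite"
    output:
        "data/features.parquet"
    script:
        "scripts/features.py"
```

Because each rule names its inputs and outputs, Snakemake re-runs only the stages whose upstream artifacts have changed.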
Run: `snakemake -s workflow/Snakefile --cores 1 --latency-wait 30`
Limitations & extensions
The baseline is deliberately minimal; natural extensions include:
- Richer features or alternative targets
- Probabilistic or non-linear models
- Uncertainty estimation
- Multi-asset support
- More structured LLM outputs (e.g., JSON summaries)
- Evaluation of LLM-generated text quality
Conclusion
This project serves as a reusable template for reproducible time-series analytics. The design prioritizes clarity, explicit artifacts, and extensibility. The LLM integration demonstrates a practical, constrained approach to generating grounded analytical notes from validated outputs.