Reproducible Coffee Market Analytics Pipeline
Design, implementation rationale, and workflow structure for a modular time-series analytics pipeline orchestrated with Snakemake, with an LLM used strictly as a reporting layer.
Goals
- Provide a clean, automated, reproducible analytical workflow
- Support iterative experimentation and extension
- Separate processing, modeling, and reporting concerns
- Integrate a local LLM for interpretation and reporting (not prediction)
High-level pipeline
The workflow is linear but modular, with explicit artifacts at each stage:
fetch → store → features → model → report
Each step produces versioned outputs consumed by downstream steps, improving transparency, reproducibility, and debugging.
Data ingestion
The ingestion step retrieves Coffee (Arabica) price data from the FRED database and produces a clean, validated time series.
- Fetches CSV data via HTTP from the FRED endpoint
- Enforces a strict schema: `date` (timestamp) and `value` (numeric price)
- Removes missing or non-numeric observations
- Sorts chronologically
This step is intentionally minimal: no transformations beyond validation and cleaning to keep raw data auditable.
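A minimal sketch of the cleaning step, assuming pandas and a FRED-style CSV with `DATE`/`VALUE` columns (the endpoint URL and column names are assumptions about the actual source, not confirmed details of this pipeline):

```python
import pandas as pd

# Assumed FRED CSV endpoint; the concrete series id is an illustration, not the pipeline's.
FRED_CSV_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series}"

def clean_series(raw: pd.DataFrame, date_col: str = "date", value_col: str = "value") -> pd.DataFrame:
    """Validate and clean a raw price series: strict schema, drop bad rows, sort by date."""
    df = raw.rename(columns=str.lower)[[date_col, value_col]].copy()
    # Coerce to the expected types; anything unparseable becomes NaT/NaN and is dropped.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    df[value_col] = pd.to_numeric(df[value_col], errors="coerce")
    return df.dropna().sort_values(date_col).reset_index(drop=True)
```

Fetching reduces to `clean_series(pd.read_csv(FRED_CSV_URL.format(series=...)))`; keeping validation in a pure function makes it testable without network access.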
Storage (SQLite checkpoint)
Cleaned data is persisted to a SQLite table, replaced on each run to guarantee reproducibility.
SQLite provides a realistic production-style boundary between ingestion and feature engineering and supports future extensions (multiple assets, metadata tables).
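The checkpoint can be sketched in a few lines with pandas and the standard-library `sqlite3` module (the table name `prices` is an assumption):

```python
import sqlite3
import pandas as pd

def store_prices(df: pd.DataFrame, db_path: str, table: str = "prices") -> int:
    """Persist the cleaned series, replacing the table on each run for reproducibility."""
    with sqlite3.connect(db_path) as conn:
        # if_exists="replace" guarantees the checkpoint always reflects the latest ingest.
        df.to_sql(table, conn, if_exists="replace", index=False)
        n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return n
```

Downstream steps read from this table rather than from the raw CSV, which is what makes the ingestion/feature boundary explicit.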
Feature engineering
The time series is transformed into a supervised learning dataset for next-step return prediction.
- Log price transformation (positive values enforced)
- Log returns (first differences of log prices)
- Lagged return features
- Rolling statistics (mean and standard deviation)
- One-step-ahead log return as the prediction target
Rows with missing values introduced by lags/rolling windows are removed. The feature set is intentionally simple and interpretable while capturing short- and medium-term structure.
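The transformations above can be sketched as follows (the lag count and rolling window are illustrative defaults, not the pipeline's actual hyperparameters):

```python
import numpy as np
import pandas as pd

def make_features(df: pd.DataFrame, lags: int = 3, window: int = 5) -> pd.DataFrame:
    """Turn a price series into a supervised dataset for next-step return prediction."""
    out = df.copy()
    out = out[out["value"] > 0]                # log requires strictly positive prices
    out["log_price"] = np.log(out["value"])
    out["ret"] = out["log_price"].diff()       # log returns
    for k in range(1, lags + 1):
        out[f"ret_lag{k}"] = out["ret"].shift(k)
    out["roll_mean"] = out["ret"].rolling(window).mean()
    out["roll_std"] = out["ret"].rolling(window).std()
    out["target"] = out["ret"].shift(-1)       # one-step-ahead log return
    # Lags, rolling windows, and the shifted target all introduce NaNs at the edges.
    return out.dropna().reset_index(drop=True)
```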
Modeling & evaluation
A Ridge regression baseline is trained to predict next-step log returns. The model is kept simple for interpretability, stability, and as a clear reference point for future iterations.
- Time-ordered train/test split to avoid leakage
- Metrics: MAE and RMSE
- Artifacts saved: metrics + pointwise predictions
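A NumPy-only sketch of the evaluation logic, using the closed-form ridge solution in place of a library estimator (the pipeline presumably uses a standard implementation such as scikit-learn's `Ridge`; the regularization strength and test fraction here are assumptions):

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Closed-form ridge: w = (X'X + alpha*I)^-1 X'y (intercept omitted for brevity).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def evaluate(X: np.ndarray, y: np.ndarray, alpha: float = 1.0, test_frac: float = 0.2):
    # Time-ordered split: the final test_frac of rows is held out, so no future
    # observation ever influences training (no leakage).
    n_test = max(1, int(len(y) * test_frac))
    Xtr, Xte, ytr, yte = X[:-n_test], X[-n_test:], y[:-n_test], y[-n_test:]
    pred = Xte @ ridge_fit(Xtr, ytr, alpha)
    err = pred - yte
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return {"mae": mae, "rmse": rmse}, pred
```

Both the metrics dict and the pointwise predictions are returned so each can be saved as a separate artifact.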
LLM-assisted reporting
The LLM is used strictly for interpretation and reporting — not for prediction. It never receives raw data directly. Instead, it is given a compact summary bundle containing:
- Model metadata and performance metrics
- Recent prediction error statistics
- A small window of recent predictions
This design keeps reporting grounded and reproducible, and reduces the risk of speculation: the LLM functions as a constrained narrative layer over validated outputs.
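Assembling the summary bundle might look like this (the field names and window size are hypothetical; only the general shape — metadata, metrics, error stats, recent predictions — comes from the design above):

```python
import json

def build_report_bundle(metrics: dict, predictions: list, window: int = 10) -> str:
    """Produce the compact, validated summary the LLM sees instead of raw data."""
    recent = predictions[-window:]
    errors = [p["pred"] - p["actual"] for p in recent]
    bundle = {
        "model": {"type": "ridge", "target": "next-step log return"},
        "metrics": metrics,
        "recent_error_stats": {
            "mean_error": sum(errors) / len(errors),
            "max_abs_error": max(abs(e) for e in errors),
        },
        "recent_predictions": recent,
    }
    return json.dumps(bundle, indent=2)
```

Because the bundle is plain JSON built from saved artifacts, the exact prompt context given to the LLM is itself reproducible.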
Workflow orchestration
All steps are orchestrated with Snakemake. Each rule declares explicit inputs and outputs, enabling:
- Automatic dependency resolution
- Incremental recomputation
- End-to-end execution with a single command
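A representative rule might look like the following sketch (file paths and script names are hypothetical, chosen only to illustrate the explicit input/output declarations):

```python
rule features:
    input:
        "data/coffee.sqlite"
    output:
        "data/features.parquet"
    script:
        "scripts/features.py"
```

Because each rule names its inputs and outputs, Snakemake re-runs only the stages whose upstream artifacts have changed.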
Run: `snakemake -s workflow/Snakefile --cores 1 --latency-wait 30`
Limitations & extensions
The baseline is deliberately minimal; natural extensions include:
- Richer features or alternative targets
- Probabilistic or non-linear models
- Uncertainty estimation
- Multi-asset support
- More structured LLM outputs (e.g., JSON summaries)
- Evaluation of LLM-generated text quality
Conclusion
This project serves as a reusable template for reproducible time-series analytics. The design prioritizes clarity, explicit artifacts, and extensibility. The LLM integration demonstrates a practical, constrained approach to generating grounded analytical notes from validated outputs.