# Lineage & Reconciliation
Note: Automated lineage capture, run logging, and SLO scoring are available in the OSS release. Full orchestration is still on the roadmap.
In a mission-critical data lakehouse, you must be able to prove that nothing was lost and everything came from somewhere. LakeLogic provides built-in tools for Data Reconciliation and System-Level Lineage.
## 1. Automated metadata capture
LakeLogic can inject lineage columns into every record as it moves from Bronze to Silver (and beyond). This happens for both Good Data and Quarantined Data.
```yaml
lineage:
  enabled: true
  capture_source_path: true
  capture_timestamp: true
  source_column_name: "_bronze_file_name"
```
If you only want a run id stamped into table rows, enable that capture explicitly, disable the others, and let the run logs carry the remaining context:
```yaml
lineage:
  enabled: true
  capture_run_id: true
  capture_timestamp: false
  capture_source_path: false
  capture_domain: false
  capture_system: false
```
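As a sketch of what this stamping produces on a record, assuming the capture flags from the config above (the `stamp_lineage` helper and the `_lakelogic_ingested_at` column name are hypothetical illustrations, not LakeLogic API; only the run-id and timestamp flags are handled here):

```python
import uuid
from datetime import datetime, timezone

def stamp_lineage(record: dict, run_id: str, config: dict) -> dict:
    """Hypothetical helper: copy a record and add lineage columns
    according to the capture flags (run id and timestamp only)."""
    out = dict(record)
    if config.get("capture_run_id"):
        out["_lakelogic_run_id"] = run_id
    if config.get("capture_timestamp"):
        out["_lakelogic_ingested_at"] = datetime.now(timezone.utc).isoformat()
    return out

config = {"enabled": True, "capture_run_id": True, "capture_timestamp": False}
run_id = str(uuid.uuid4())
row = stamp_lineage({"sale_id": 1, "amount": 9.99}, run_id, config)
# Only the run id is stamped; timestamp capture is disabled in this config.
```

The original business columns pass through untouched; lineage columns are purely additive.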
You can also preserve upstream lineage columns before stamping the current run, so downstream rows keep their original provenance. And for Gold, you can use a pipeline-wide id for `_lakelogic_run_id`, so every Gold table written by one orchestration shares a single run id.
## 2. Reconciliation: the count rule
To ensure that Bronze = Silver + Quarantine, use the counts LakeLogic logs on every run.
| Layer | Records | Status |
|---|---|---|
| Bronze | 1,000 | Ingested |
| Silver | 995 | Cleaned |
| Quarantine | 5 | Isolated |
| Total | 1,000 | Reconciled |
Run logs include:
- `counts_source` (source rows before transforms)
- `counts_total` (post-transform total)
- `counts_good` (rows written to Silver)
- `counts_quarantined` (rows isolated in Quarantine)
- `counts_pre_transform_dropped` (rows dropped before transforms)
Reconciliation is simply checking that `counts_total == counts_good + counts_quarantined`.
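That check can be run directly against a run-log entry; a minimal sketch using the counter names above (the dict shape of the run log here is illustrative):

```python
def reconcile(run_log: dict) -> bool:
    """Verify the count rule: every post-transform row ended up
    either in Silver (good) or in Quarantine."""
    return run_log["counts_total"] == (
        run_log["counts_good"] + run_log["counts_quarantined"]
    )

# Matches the example table: 1,000 ingested = 995 cleaned + 5 isolated.
log = {"counts_total": 1000, "counts_good": 995, "counts_quarantined": 5}
assert reconcile(log)
```

A failing check means rows went missing between layers, which is exactly the condition you want to alert on.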
## 3. Incremental manifests and watermarks
When you ingest from file globs, the run log captures:
- source_files_json: list of files processed in that batch
- max_source_mtime: max file timestamp in that batch
This supports lightweight watermarks for incremental processing and auditability.
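A watermark-driven incremental pass might look like the following sketch. The manifest fields mirror `source_files_json` and `max_source_mtime` from the run log; the selection helper itself is illustrative, not LakeLogic internals:

```python
import json

def select_new_files(files: list[dict], watermark: float) -> list[dict]:
    """Pick files modified strictly after the previous run's
    max_source_mtime (the watermark)."""
    return [f for f in files if f["mtime"] > watermark]

files = [
    {"path": "bronze/a.csv", "mtime": 100.0},
    {"path": "bronze/b.csv", "mtime": 250.0},
]

# Only files newer than the last recorded watermark are processed.
batch = select_new_files(files, watermark=150.0)

manifest = {
    "source_files_json": json.dumps([f["path"] for f in batch]),
    "max_source_mtime": max(f["mtime"] for f in batch),  # next run's watermark
}
```

Persisting the manifest alongside counts gives you both incremental efficiency and an audit trail of exactly which files fed each batch.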
## 4. Gold key roll-up (traceability for aggregates)
When you aggregate data in the Gold layer, you lose direct row-to-row traceability. Use a rollup transform to keep source keys.
```yaml
transformations:
  - rollup:
      group_by: ["sale_date"]
      aggregations:
        total_sales: "SUM(amount)"
      keys: "sale_id"
      rollup_keys_column: "_lakelogic_rollup_keys"
      rollup_keys_count_column: "_lakelogic_rollup_keys_count"  # optional
      upstream_run_id_column: "_upstream_run_id"                # optional
      upstream_run_ids_column: "_upstream_lakelogic_run_ids"    # optional
```
Why do this? By keeping the rollup keys in Gold:

1. Drill-down: a business user can trace a total back to its component rows.
2. Audit: you can prove each Gold row is backed by specific Silver records.
3. Trust: aggregates become transparent and explainable.
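The roll-up semantics can be sketched in plain Python. This is a pure-Python equivalent of the transform configured above (group by `sale_date`, sum `amount`, collect `sale_id` keys), not LakeLogic's actual implementation:

```python
from collections import defaultdict

def rollup(rows: list[dict]) -> list[dict]:
    """Group rows by sale_date, sum amount into total_sales, and keep
    the contributing sale_ids so each Gold row stays traceable."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for r in rows:
        groups[r["sale_date"]].append(r)
    out = []
    for sale_date, members in groups.items():
        keys = sorted(r["sale_id"] for r in members)
        out.append({
            "sale_date": sale_date,
            "total_sales": sum(r["amount"] for r in members),
            "_lakelogic_rollup_keys": keys,
            "_lakelogic_rollup_keys_count": len(keys),
        })
    return out

rows = [
    {"sale_id": 1, "sale_date": "2024-05-01", "amount": 10.0},
    {"sale_id": 2, "sale_date": "2024-05-01", "amount": 5.0},
]
gold = rollup(rows)
# → one Gold row: total_sales 15.0, backed by sale_ids [1, 2]
```

Each aggregate row carries the exact Silver keys behind it, which is what makes drill-down and audit possible.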
## Summary: business value
- Auditability: verifiable evidence for regulators and internal governance.
- Traceability: fast root-cause analysis when a metric looks wrong.
- Operational confidence: clear reconciliation across medallion layers.
- Transparent aggregates: Gold tables remain explainable and defensible.