
Pipelines, Dependencies, and Parallelism

Note: The OSS release focuses on contract execution. Use your orchestrator to schedule DAGs. The upstream field defines dependencies.

Data engineering is a network of tables. LakeLogic helps you define the network, while your orchestrator runs it safely.

1. Contract Registry (Source of Truth)

Store all contracts in a single registry so dependencies are discoverable.

Example structure:

contracts/
  bronze/
    bronze_crm_customers.yaml
    bronze_erp_products.yaml
    bronze_pos_orders.yaml
  silver/
    silver_crm_customers.yaml
    silver_erp_products.yaml
    silver_pos_orders.yaml
  gold/
    gold_dim_customers.yaml
    gold_dim_products.yaml
    gold_fact_sales.yaml
    gold_sales_mart.yaml

Each contract should declare a unique dataset and optional layer metadata:

info:
  title: Silver Customers (CRM)
  target_layer: silver

dataset: silver_crm_customers
upstream:
  - bronze_crm_customers
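
Given contracts shaped like the YAML above, building the registry index amounts to mapping each dataset name to its upstream list and validating that every declared upstream actually exists. A minimal sketch (the `build_registry` helper is illustrative, not part of the LakeLogic API):

```python
# Sketch: index parsed contracts by dataset name and verify that every
# declared upstream is itself a registered dataset.
def build_registry(contracts: list[dict]) -> dict[str, list[str]]:
    """Map dataset -> list of upstream dataset names."""
    registry: dict[str, list[str]] = {}
    for c in contracts:
        name = c["dataset"]
        if name in registry:
            raise ValueError(f"duplicate dataset: {name}")
        registry[name] = list(c.get("upstream", []))
    # Every upstream must be declared somewhere in the registry.
    missing = {u for ups in registry.values() for u in ups} - registry.keys()
    if missing:
        raise ValueError(f"undeclared upstreams: {sorted(missing)}")
    return registry
```

Failing fast on duplicate or undeclared datasets keeps the registry a reliable source of truth before any DAG is built from it.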

2. Layered Pipeline Strategy (Bronze -> Silver -> Gold)

Recommended pattern:

  • Bronze: raw ingestion, schema gate, quarantine bad rows.
  • Silver: cleansed, standardized entities (layer_system_entity naming like silver_crm_customers, silver_pos_orders).
  • Gold: shared dimensions/facts across systems (e.g., gold_dim_customers, gold_fact_sales), plus optional marts/aggregates.

Example dependency graph (explicit layers):

graph TD
  subgraph Bronze
    B1[bronze_crm_customers]
    B2[bronze_erp_products]
    B3[bronze_pos_orders]
  end
  subgraph Silver
    S1[silver_crm_customers]
    S2[silver_erp_products]
    S3[silver_pos_orders]
  end
  subgraph Gold
    D1[gold_dim_customers]
    D2[gold_dim_products]
    F1[gold_fact_sales]
    G1[gold_sales_mart]
  end

  B1 --> S1
  B2 --> S2
  B3 --> S3
  S1 --> D1
  S2 --> D2
  S3 --> F1
  D1 --> F1
  D2 --> F1
  F1 --> G1

In practice, this lets you run all Bronze contracts in parallel, then all Silver entity cleanses in parallel, then Gold dimensions/facts, and finally Gold marts/aggregates.
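
The wave-by-wave schedule above can be derived mechanically from the `upstream` declarations. A sketch using the example graph from this section (the `waves` function is illustrative, not a LakeLogic feature):

```python
# Sketch: group datasets into "waves" that can run in parallel.
# Each wave holds every node whose upstreams are all in earlier waves.
def waves(upstreams: dict[str, list[str]]) -> list[list[str]]:
    done: set[str] = set()
    result: list[list[str]] = []
    pending = set(upstreams)
    while pending:
        ready = sorted(n for n in pending if set(upstreams[n]) <= done)
        if not ready:
            raise ValueError("cycle detected in upstream declarations")
        result.append(ready)
        done.update(ready)
        pending.difference_update(ready)
    return result

# The dependency graph from the diagram above.
graph = {
    "bronze_crm_customers": [], "bronze_erp_products": [], "bronze_pos_orders": [],
    "silver_crm_customers": ["bronze_crm_customers"],
    "silver_erp_products": ["bronze_erp_products"],
    "silver_pos_orders": ["bronze_pos_orders"],
    "gold_dim_customers": ["silver_crm_customers"],
    "gold_dim_products": ["silver_erp_products"],
    "gold_fact_sales": ["silver_pos_orders", "gold_dim_customers", "gold_dim_products"],
    "gold_sales_mart": ["gold_fact_sales"],
}
```

For this graph the result is five waves: all Bronze, all Silver, the two Gold dimensions, then the fact, then the mart.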

3. Parallelism

  • Orchestrators handle task-level parallelism.
  • LakeLogic's engines (Polars and DuckDB) are multi-threaded by default, so each individual task already uses available cores efficiently.
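
If your orchestrator does not already fan out independent tasks, a group of contracts with no dependencies between them can be dispatched concurrently with the standard library. A sketch where `run_contract` stands in for however you invoke LakeLogic (CLI or API); it is hypothetical here:

```python
# Sketch: run one group of independent contracts concurrently.
from concurrent.futures import ThreadPoolExecutor

def run_wave(datasets, run_contract, max_workers=4):
    """Run independent contracts in parallel; return dataset -> result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run_contract, name) for name in datasets}
        return {name: f.result() for name, f in futures.items()}
```

Threads are adequate here because each worker mostly waits on an external engine; keep `max_workers` modest so concurrent Polars/DuckDB tasks do not oversubscribe the machine.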

4. Missing Upstreams and Log Tables

When --window last_success is used, LakeLogic checks the run log table for upstream freshness.

  • Missing log table/entry: The driver logs a warning and falls back to a full load.
  • Missing upstream: The driver skips the downstream contract and records the reason in the run summary.

This keeps pipelines safe by default without silently producing stale Gold data.
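
The same decision logic can be expressed as a small planning function in your orchestrator. The run-log shape below (one entry per dataset with a `last_success` watermark) is illustrative; the real log table columns may differ:

```python
# Sketch: decide how to run a contract from the run log.
def plan_run(dataset: str, upstreams: list[str], run_log: dict[str, dict]) -> str:
    """Return 'skip', 'full', or 'incremental' for this contract."""
    # An upstream with no successful run yet: skip rather than build stale data.
    for up in upstreams:
        if up not in run_log or run_log[up].get("last_success") is None:
            return "skip"
    # No watermark for this dataset itself: fall back to a full load.
    me = run_log.get(dataset)
    if me is None or me.get("last_success") is None:
        return "full"
    return "incremental"
```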

5. Gating Rules (When to Block Downstream)

Common policy:

  • Block downstream if dataset rules fail.
  • Block downstream if quarantine ratio exceeds a threshold.
  • Continue if only row-level quarantines exist and the quarantine ratio stays below the threshold.

Use the run log table to enforce this logic in your orchestrator.
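
The policy above reduces to a single predicate over the run report. The field names here (`dataset_rules_failed`, `total_rows`, `quarantined_rows`) are illustrative, not the exact LakeLogic run-log columns:

```python
# Sketch: the gating policy as one predicate your orchestrator can evaluate.
def should_block_downstream(report: dict, max_quarantine_ratio: float = 0.05) -> bool:
    # Dataset-level rule failures always block.
    if report["dataset_rules_failed"]:
        return True
    # Row-level quarantines block only past the configured ratio.
    total = report["total_rows"] or 1  # guard against division by zero
    return report["quarantined_rows"] / total > max_quarantine_ratio
```

Tuning `max_quarantine_ratio` per layer is common: stricter for Gold, more tolerant for Bronze where quarantining is expected.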

6. Example Orchestrator Flow (Pseudo)

registry = load_all_contracts("contracts/")
dag = build_dag_from_upstream(registry)

# Run contracts in dependency order; cut a branch when its dataset rules fail.
for node in topo_sort(dag):
    report = run_contract(node)
    if report.dataset_rules_failed:
        stop_downstream(node)

You can implement this flow in Airflow, Dagster, Prefect, or any workflow engine you already use.
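
The one non-trivial piece of that flow is `stop_downstream`: when a contract fails its dataset rules, every transitive descendant must be skipped, not just direct children. A sketch, where `downstreams` is the inverted form of the `upstream` declarations:

```python
# Sketch: collect every transitive descendant of a failed node via BFS.
from collections import deque

def descendants(failed: str, downstreams: dict[str, list[str]]) -> set[str]:
    """Return all datasets that depend, directly or indirectly, on `failed`."""
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in downstreams.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Marking those nodes as skipped (rather than silently omitting them) keeps the run summary honest about what did not run and why.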

See Job Templates for concrete examples across Databricks, Synapse, Fabric, AWS, and more.

7. Run Summaries

Use --summary-path to write a JSON file with per-run metrics and per-contract statuses. This is useful for dashboards, alerting, and audit trails.
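
A typical consumer scans the summary file for anything that did not succeed and feeds the result to alerting. The layout assumed below (a top-level "contracts" list with "dataset" and "status" fields) is illustrative; inspect your actual summary file for the real structure:

```python
# Sketch: extract non-successful contracts from a --summary-path JSON file.
import json

def failed_contracts(summary_path: str) -> list[str]:
    """Return dataset names whose status is anything other than 'success'."""
    with open(summary_path) as f:
        summary = json.load(f)
    return [c["dataset"] for c in summary.get("contracts", [])
            if c.get("status") != "success"]
```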