Pipelines & Parallelism

Think of your data pipeline like a car assembly line. You can't install seats before the chassis is built. LakeLogic defines the dependency order (chassis → body → seats), while your orchestrator runs each station as fast as possible.

LakeLogic helps you define the network of table dependencies. Your orchestrator (Airflow, Dagster, Prefect, Databricks Workflows) handles the scheduling and execution.


Contract Dependencies

Each contract declares its upstream dependencies, creating a clear lineage graph:

info:
  title: Silver Customers (CRM)
  target_layer: silver

dataset: silver_crm_customers
upstream:
  - bronze_crm_customers

Why this matters: When bronze_crm_customers fails, your orchestrator automatically holds back silver_crm_customers — no stale downstream data, no manual intervention.
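In an orchestrator, those upstream declarations translate directly into DAG edges. A minimal sketch of that translation, assuming contracts have been loaded into plain dicts (the dict shape and the dependency_edges helper are illustrative, not LakeLogic's actual API):

```python
# Illustrative contract registry: name -> parsed contract fields.
contracts = {
    "bronze_crm_customers": {"upstream": []},
    "silver_crm_customers": {"upstream": ["bronze_crm_customers"]},
}

def dependency_edges(contracts):
    """Yield (upstream, downstream) pairs for the orchestrator DAG."""
    for name, contract in contracts.items():
        for dep in contract.get("upstream", []):
            yield (dep, name)

edges = list(dependency_edges(contracts))
```

Each edge becomes a task dependency in Airflow, Dagster, Prefect, or Databricks Workflows, which is what lets the orchestrator hold back downstreams on failure.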


Layered Pipeline Strategy

Organize your pipeline by medallion layers:

  BRONZE                   SILVER                   GOLD
  ──────                   ──────                   ────
  bronze_crm_customers ──→ silver_crm_customers ──→ gold_dim_customers ─┐
  bronze_erp_products  ──→ silver_erp_products  ──→ gold_dim_products  ─┼─→ gold_fact_sales
  bronze_pos_orders    ──→ silver_pos_orders    ────────────────────────┘
  Layer    What happens                   Parallelism
  ──────   ────────────                   ───────────
  Bronze   Raw ingestion, schema gate     ✅ All Bronze contracts run in parallel
  Silver   Clean, validate, standardize   ✅ All Silver contracts run in parallel
  Gold     Aggregate, join, enrich        Dimensions first, then facts

Why this matters: Bronze and Silver contracts within the same layer have no dependencies on each other — run them all at once and cut your pipeline time dramatically.
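The layer-by-layer strategy can be sketched with a thread pool: run every contract in one layer concurrently, then move to the next. Here `run_contract` is a stand-in for however your orchestrator executes a single LakeLogic contract, and the layer lists mirror the diagram above:

```python
from concurrent.futures import ThreadPoolExecutor

# Contracts grouped by layer; each group has no internal dependencies.
layers = {
    "bronze": ["bronze_crm_customers", "bronze_erp_products", "bronze_pos_orders"],
    "silver": ["silver_crm_customers", "silver_erp_products", "silver_pos_orders"],
}

def run_contract(name):
    # Placeholder for the real contract execution.
    return f"{name}: ok"

results = []
for layer in ("bronze", "silver"):
    # All contracts within a layer run at once; layers run in order.
    with ThreadPoolExecutor() as pool:
        results.extend(pool.map(run_contract, layers[layer]))
```

In a real orchestrator you would express this as task groups or layer-level dependencies rather than an explicit loop, but the execution shape is the same.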


Naming Convention

Use a consistent {layer}_{system}_{entity} naming pattern:

  Contract               Meaning
  ────────               ───────
  bronze_crm_customers   Raw CRM customer data
  silver_crm_customers   Validated, cleansed CRM customers
  gold_dim_customers     Analytics-ready customer dimension
  gold_fact_sales        Aggregated sales facts

Why this matters: Anyone can read a contract name and instantly know its layer, source system, and entity — no documentation lookup needed.
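Because the pattern is mechanical, tooling can decode it too. A small sketch of a parser (the helper is illustrative; note that Gold contracts put dim or fact in the middle slot rather than a source system):

```python
def parse_contract_name(name):
    """Split a {layer}_{system}_{entity} contract name into its parts."""
    layer, system, entity = name.split("_", 2)
    return {"layer": layer, "system": system, "entity": entity}

parsed = parse_contract_name("silver_crm_customers")
```

This kind of helper is handy for grouping contracts by layer when building the DAG, or for tagging tasks in your orchestrator's UI.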


Parallelism

  • Orchestrators handle task-level parallelism (running multiple contracts at once)
  • LakeLogic engines are multi-threaded by default — Polars and DuckDB maximize your hardware on each task

You get parallelism at two levels: across contracts (orchestrator) and within each contract (engine).
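When both levels are active, it can pay to cap engine threads per task so concurrent contracts don't oversubscribe the machine. For example, Polars reads the POLARS_MAX_THREADS environment variable at import time; the value here is illustrative:

```python
import os

# Cap Polars' thread pool for this process. Must be set before
# `import polars`, since Polars sizes its pool at import time.
os.environ["POLARS_MAX_THREADS"] = "4"
```

A rough rule of thumb: threads per task × concurrent tasks should not greatly exceed your core count.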


Gating Rules

Use quality results to decide whether downstream contracts should run:

  Condition                      Action
  ─────────                      ──────
  Dataset rules fail             ❌ Block downstream: something is fundamentally wrong
  Quarantine ratio > threshold   ❌ Block downstream: too much bad data
  Only a few row quarantines     ✅ Continue: within acceptable tolerance

registry = load_all_contracts("contracts/")
dag = build_dag_from_upstream(registry)

# Run contracts in dependency order; gate downstreams on quality results.
for node in topo_sort(dag):
    report = run_contract(node)
    if report.dataset_rules_failed or report.quarantine_ratio > THRESHOLD:
        stop_downstream(node)  # block everything that depends on this node

Why this matters: Your pipeline self-heals by stopping before bad data cascades into Gold tables and dashboards.


Missing Upstreams

When using --window last_success, LakeLogic checks the run log for upstream freshness:

  Scenario                 Behaviour
  ────────                 ─────────
  Missing log table        Warning + falls back to a full load
  Missing upstream entry   Skips the downstream contract, records the reason

This keeps pipelines safe by default — no silently stale Gold data.
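The freshness check amounts to looking each upstream up in the run log before executing the downstream. A minimal sketch, assuming the run log is a mapping from contract name to last successful run (the log shape and the upstream_ready helper are illustrative, not LakeLogic's actual run-log schema):

```python
from datetime import datetime, timezone

# Illustrative run log: contract name -> last successful run timestamp.
run_log = {
    "bronze_crm_customers": datetime(2024, 5, 1, tzinfo=timezone.utc),
}

def upstream_ready(upstreams, run_log):
    """Return (ok, reason); missing upstreams mean skip, with the reason recorded."""
    for name in upstreams:
        if name not in run_log:
            return False, f"no successful run recorded for {name}"
    return True, "all upstreams fresh"

ok, reason = upstream_ready(
    ["bronze_crm_customers", "bronze_erp_products"], run_log
)
```

Here the downstream contract would be skipped because bronze_erp_products has no recorded run, which is exactly the safe-by-default behaviour described above.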


What's Next?