LakeLogic Data Contracts. Reliable Pipelines.

Open-source data contract framework for reliable data pipelines.
Validate at runtime. Block bad merges in CI/CD. Ship trusted data faster.

Define schema, data quality rules, lineage, SLOs, PII handling, and materialization once in YAML.

Write once. Run on Spark, Polars, or DuckDB.
Data contracts as code for Delta Lake, Iceberg, and modern lakehouse pipelines.

contract.yaml
# 1. Read incrementally from cloud storage
source:
  path: s3://landing/customers/*.json
  load_mode: incremental
  watermark_strategy: pipeline_log  # Only process files newer than last run

# 2. Enforce schema & PII masking
model:
  fields:
    - name: cus_id
      type: string
      required: true
    - name: email
      required: true
      pii: true
      masking: "encrypt"            # AES-256 via LAKELOGIC_PII_KEY env var

# 3. Apply SQL transformations
transformations:
  - sql: "LOWER(TRIM(email)) AS email"

# 4. Enforce quality & SLO guarantees
quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
service_levels:
  freshness_hours: 24

# 5. Write 100% clean data directly to Catalog
materialization:
  strategy: merge
  primary_key: [cus_id]
  target_path: catalog.silver.customers
  format: iceberg  # natively supports iceberg, delta, parquet, csv
Standard CLI
lakelogic run contract.yaml
Python / Databricks
from lakelogic import DataProcessor

proc = DataProcessor("contract.yaml")

# Executes the contract end-to-end
result = proc.run()
Execution Logs
LakeLogic Alert: 2 records quarantined in 'customers'. Total: 4
[2026-03-28 12:00:01] INFO  | Wrote 2 quarantined rows to catalog.quarantine.silver_customers
[2026-03-28 12:00:02] INFO  | Wrote 2 valid rows to catalog.silver.customers
[2026-03-28 12:00:03] INFO  | Run complete [layer=silver] | Total: 4 | Good: 2 | Quarantine: 2 | Ratio: 50.0%

✅ result.good (Passed Quality Gate & PII Masked)

| cus_id | email |
| --- | --- |
| C100 | enc:a1F3bG9nZ2VkQGV4... |
| C101 | enc:dXNlcjEwMUBjb3Jw... |

🚨 result.bad (Quarantined by LakeLogic)

| cus_id | email | _lakelogic_categories | _lakelogic_errors |
| --- | --- | --- | --- |
| C102 | not_an_email | ["correctness"] | ["Rule failed: email LIKE '%@%.%'"] |
| C103 | null | ["completeness"] | ["Rule failed: email is required"] |
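
The transformation and row rule from contract.yaml can be tried locally without LakeLogic. The following is a minimal sketch, not LakeLogic's implementation: it uses Python's built-in sqlite3 module to run the same two SQL expressions over four rows and split them into good and quarantined sets. The two valid email addresses are hypothetical stand-ins (the masked values shown above do not reveal the originals), and the category labels are copied from the sample output.

```python
import sqlite3

# Four input rows. The C100/C101 addresses are hypothetical examples.
rows = [
    ("C100", "  Alice@Example.com "),
    ("C101", "user101@corp.io"),
    ("C102", "not_an_email"),
    ("C103", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cus_id TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Step 3 of the contract (transformation), then step 4 (row rule).
query = """
WITH transformed AS (
    SELECT cus_id, LOWER(TRIM(email)) AS email FROM customers
)
SELECT cus_id, email, COALESCE(email LIKE '%@%.%', 0) AS rule_ok
FROM transformed
ORDER BY cus_id
"""

good, bad = [], []
for cus_id, email, rule_ok in conn.execute(query):
    if email is None:
        bad.append((cus_id, email, "completeness"))  # required field missing
    elif not rule_ok:
        bad.append((cus_id, email, "correctness"))   # SQL rule failed
    else:
        good.append((cus_id, email))

print("good:", good)
print("bad: ", bad)
```

A real run would also mask the email column and write both sets to their target tables; this sketch stops at the split.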
Pre-deployment validation (no data needed)
lakelogic validate \
  --contract contract.yaml \
  --gates breaking_change,pii_classification,lineage_break
.github/workflows/contracts.yml
- name: Validate Contracts
  run: |
    lakelogic validate \
      --contract contracts/customers.yaml \
      --gates breaking_change,pii_classification

Block PRs that introduce schema breaks, unmasked PII, or broken lineage before they reach production.
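
As a rough illustration of what a static gate checks, the sketch below compares two versions of a contract's field list in plain Python. It is not how lakelogic validate is implemented: the function name check_gates, the dict layout, and the type given to email are assumptions made for the example. Only the gate names breaking_change and pii_classification come from the command above.

```python
def check_gates(old_fields, new_fields):
    """Return a list of violation messages between two field lists (hypothetical helper)."""
    violations = []
    old = {f["name"]: f for f in old_fields}
    new = {f["name"]: f for f in new_fields}

    # breaking_change: a field consumers rely on was removed or retyped.
    for name, field in old.items():
        if name not in new:
            violations.append(f"breaking_change: field '{name}' was removed")
        elif new[name].get("type") != field.get("type"):
            violations.append(f"breaking_change: field '{name}' changed type")

    # pii_classification: a field flagged as PII must declare a masking mode.
    for name, field in new.items():
        if field.get("pii") and not field.get("masking"):
            violations.append(f"pii_classification: field '{name}' is PII but has no masking")

    return violations


old_fields = [
    {"name": "cus_id", "type": "string", "required": True},
    {"name": "email", "type": "string", "pii": True, "masking": "encrypt"},
]
new_fields = [
    {"name": "cus_id", "type": "int", "required": True},  # type changed
    {"name": "email", "type": "string", "pii": True},     # masking dropped
]

violations = check_gates(old_fields, new_fields)
for message in violations:
    print(message)
```

In a CI job, a non-empty list would translate into a non-zero exit code so the pull request check fails.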


Quick Start

pip install "lakelogic"

Next step: run the 5-Minute Quickstart in Google Colab and have your first pipeline running in 5 minutes. No local files are required; it downloads sample data automatically.


Data Contracts That Ship Business Outcomes

LakeLogic turns data contracts into executable controls for quality, governance, cost, and delivery speed. Instead of rewriting validation and materialization logic in every pipeline, teams declare the rules once and enforce them everywhere.

| Outcome | LakeLogic Capability | Business Value |
| --- | --- | --- |
| Trusted dashboards | Row-level quality rules, schema checks, and quarantine | Bad records are isolated before they reach reports, ML features, or downstream teams |
| Faster delivery | YAML contracts, SQL-first transformations, and CI/CD gates | Engineers ship reliable pipelines without repeating boilerplate validation code |
| Lower platform cost | Spark, Polars, and DuckDB execution from the same contract | Use lightweight engines where they fit instead of defaulting every job to a cluster |
| Audit-ready data | Lineage stamps, ownership metadata, SLOs, and PII handling | Every dataset carries the evidence needed for governance, compliance, and incident review |
| Portable lakehouse pipelines | Delta Lake, Iceberg, Parquet, CSV, and catalog-aware materialization | Contracts move across tools and storage formats without rewriting business rules |

Data Quality & Trust

  • Prevent bad data from reaching production: enforce schema, required fields, SQL rules, SLOs, anomaly checks, and quarantine behavior from the contract.
  • Account for every row: reconcile source = good + bad so records are validated, accepted, or quarantined without silent loss.
  • Catch broken contracts early: strict model validation and CI/CD gates block unsafe schema, PII, and lineage changes before deployment.
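
The reconciliation rule source = good + bad is simple enough to state as a check. The sketch below is an illustration in plain Python with a hypothetical function name, not LakeLogic code; the counts are the ones from the execution log earlier on this page.

```python
def reconcile(source_count, good_count, bad_count):
    """Fail loudly if any row is unaccounted for; return the quarantine ratio in percent."""
    accounted = good_count + bad_count
    if source_count != accounted:
        raise ValueError(
            f"{source_count - accounted} rows unaccounted for "
            f"(source={source_count}, good={good_count}, bad={bad_count})"
        )
    return 100.0 * bad_count / source_count if source_count else 0.0


ratio = reconcile(source_count=4, good_count=2, bad_count=2)
print(f"Total: 4 | Good: 2 | Quarantine: 2 | Ratio: {ratio:.1f}%")
```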

Run the Data Quality & Trust Guide in Google Colab

Compliance & Governance

  • Make ownership explicit: domain, system, and contract files define owners, contacts, SLOs, lineage, and governance rules.
  • Protect sensitive data by default: declarative PII handling, masking, encryption, and subject-forgetting workflows are part of the pipeline.
  • Track operating cost: attribute compute cost by entity and domain so governance includes spend, not just policy.
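
To make the idea of declarative masking concrete, here is a small sketch that masks every field flagged as PII. Note the substitution: it uses a keyed hash (HMAC-SHA256 from the Python standard library), which is one-way, and not the reversible AES-256 encryption that the "encrypt" mode in the sample contract refers to. The function name, the "hmac:" prefix, and the fallback demo key are assumptions for illustration; only the LAKELOGIC_PII_KEY variable name comes from the contract comment.

```python
import hashlib
import hmac
import os


def pseudonymize(value, key):
    """Replace a value with a truncated keyed hash (hypothetical helper, not LakeLogic's masking)."""
    if value is None:
        return None
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "hmac:" + digest[:16]


# Falls back to a demo key so the sketch runs anywhere; never do this in production.
key = os.environ.get("LAKELOGIC_PII_KEY", "demo-key-not-for-production").encode("utf-8")

record = {"cus_id": "C100", "email": "alice@example.com"}  # hypothetical row
pii_fields = {"email"}  # in a contract this would come from `pii: true`

masked = {
    name: pseudonymize(value, key) if name in pii_fields else value
    for name, value in record.items()
}
print(masked)
```

Because the hash is deterministic for a given key, masked values can still be joined or deduplicated, but the original email cannot be recovered from them.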

Run the Compliance & Governance Guide in Google Colab

Engine & Scale

  • Avoid engine lock-in: run the same contract on Spark, Polars, or DuckDB.
  • Materialize without hand-coded merge logic: declare append, merge/upsert, SCD2, snapshots, partitioning, Delta Lake, and Iceberg behavior in YAML.
  • Operate production pipelines: use incremental loads, CDC watermarks, backfills, retries, timeouts, circuit breakers, and dependency-aware execution.
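
Incremental loading with a watermark boils down to "only process what is newer than the last successful run, then move the marker forward". The sketch below shows that logic in plain Python. It is an assumption-laden illustration, not LakeLogic's pipeline_log strategy: the function name, the file names, and the use of modification timestamps are all hypothetical.

```python
from datetime import datetime, timezone


def select_new_files(files, watermark):
    """Return paths modified after the watermark, plus the advanced watermark."""
    new_files = sorted((f for f in files if f[1] > watermark), key=lambda f: f[1])
    new_watermark = new_files[-1][1] if new_files else watermark
    return [path for path, _ in new_files], new_watermark


# (path, modified_at) pairs; file names are made up for the example.
files = [
    ("s3://landing/customers/a.json", datetime(2026, 3, 26, 9, 0, tzinfo=timezone.utc)),
    ("s3://landing/customers/b.json", datetime(2026, 3, 27, 9, 0, tzinfo=timezone.utc)),
    ("s3://landing/customers/c.json", datetime(2026, 3, 28, 9, 0, tzinfo=timezone.utc)),
]
last_run = datetime(2026, 3, 27, 0, 0, tzinfo=timezone.utc)

paths, watermark = select_new_files(files, last_run)
print(paths)      # files b.json and c.json
print(watermark)  # timestamp of c.json

# A second run with the advanced watermark finds nothing new.
paths_again, watermark_again = select_new_files(files, watermark)
```

Persisting the watermark only after a successful write is what makes a rerun after failure safe: the same files are simply selected again.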

Run the Engine & Scale Guide in Google Colab

Developer Experience

  • Debug faster: structured logs, dry runs, execution previews, DAG views, and reset/reload commands reduce pipeline triage time.
  • Automate schema operations: generate and apply DDL from contracts without running the full pipeline.
  • Route alerts to the right owners: send contract-aware notifications through Slack, email, Teams, webhooks, and other Apprise targets.

Run the Developer Experience Guide in Google Colab

Data Generation & AI

  • Test quality rules before production: generate realistic synthetic records, edge cases, and referentially valid datasets from the contract.
  • Bootstrap contracts faster: infer contracts from sample data with AI-assisted descriptions, PII detection, and quality rule suggestions.
  • Apply contracts to unstructured data: validate LLM extraction from PDFs, images, and audio with the same lineage and quality controls.

Run the Data Generation & AI Guide in Google Colab

Integrations

  • Reuse existing definitions: import dbt schema.yml models and sources as LakeLogic contracts.
  • Ingest from operational systems: connect to dlt sources, REST APIs, databases, Kafka, webhooks, cloud storage, and streaming feeds.
  • Extract only what the contract declares: apply projection pushdown, chunked reads, CDC filters, and cloud credential resolution automatically.

Run the Integrations Guide in Google Colab


Define Once. Enforce Everywhere.

LakeLogic makes your Data Contract the Single Source of Truth. One YAML file replaces hundreds of lines of ingestion, validation, and materialization code, and it runs on any engine.

Contracts enforce on two surfaces:

  • Runtime: processor.run() validates rows, quarantines bad data, emits a quality_score, and writes lineage stamps automatically.
  • CI/CD: lakelogic validate --gates runs static analysis on contract changes and blocks PRs that break the schema, expose PII, or sever upstream lineage.

Think of a contract like a building code. The architect (data engineer) writes the spec once. Every builder (Spark, Polars, DuckDB) follows the same code, no matter which team or tool runs the pipeline.

| What the Contract Defines | Why It Matters |
| --- | --- |
| Schema (fields, types, PII flags) | Catches type mismatches and schema drift before they hit your dashboard |
| Source (where to read, how to load) | Declarative ingestion with no boilerplate code |
| Transformations (SQL-first) | Business logic lives in the contract, not scattered across notebooks |
| Quality rules (row + dataset) | Bad data quarantined automatically, never silently dropped |
| Materialization (merge, append, SCD2) | Write strategy declared, not coded |
| SLOs (freshness, completeness, anomalies, schedule) | Data reliability promises enforced and tracked |
| Lineage (source, run_id, timestamps) | Every row stamped automatically for audit trails |
| Compliance (GDPR, EU AI Act) | Regulatory metadata baked into the data layer |
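
Row-level lineage stamping, as described in the table, can be pictured as adding the same few metadata columns to every row of a run. The following is a sketch under assumptions: the column names _source, _run_id, and _loaded_at and the helper stamp_rows are hypothetical and are not taken from LakeLogic's actual output schema.

```python
import uuid
from datetime import datetime, timezone


def stamp_rows(rows, source, run_id=None, now=None):
    """Return copies of the rows with lineage columns added (hypothetical column names)."""
    run_id = run_id or uuid.uuid4().hex
    loaded_at = (now or datetime.now(timezone.utc)).isoformat()
    return [
        {**row, "_source": source, "_run_id": run_id, "_loaded_at": loaded_at}
        for row in rows
    ]


rows = [{"cus_id": "C100"}, {"cus_id": "C101"}]
stamped = stamp_rows(
    rows,
    source="s3://landing/customers/*.json",
    run_id="run-001",
    now=datetime(2026, 3, 28, 12, 0, tzinfo=timezone.utc),
)
for row in stamped:
    print(row)
```

Every row from one run shares the same run identifier, so an auditor can trace any record back to the run and source that produced it.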

See the full contract reference · Complete annotated template


Delta Lake & Catalog Support: Lightweight Mode

LakeLogic automatically resolves catalog table names and uses Delta-RS for fast, lightweight Delta Lake operations, with no Spark cluster required.

from lakelogic import DataProcessor

# Use Unity Catalog table names directly (lightweight mode)
processor = DataProcessor(
    engine="polars",
    contract="contracts/customers.yaml"
)

good_df, bad_df = processor.run_source(
    "main.default.customers"
)

# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

from lakelogic import DataProcessor

# Use Fabric table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "myworkspace.sales_lakehouse.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

from lakelogic import DataProcessor

# Use Synapse table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "salesdb.dbo.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Why LakeLogic?

Stop the "Fragmented Truth" Problem

In a traditional data stack, moving from a Warehouse (SQL) to a Lakehouse (PySpark) means rewriting your validation rules. This duplication creates Logic Drift, where your data quality standards differ depending on which tool is running the code.

With LakeLogic, your Data Contract is the Source of Truth.

  • SQL-First Simplicity: Define your constraints and business logic in standard SQL, the language your team already speaks.
  • Zero-Friction Portability: Move your pipelines from dbt/Snowflake to Databricks/Spark to Local/Polars with zero changes to your contract.
  • True Ownership: Your business logic is a portable asset, independent of your cloud provider or execution engine.

Business Impact: Trust, Speed, and ROI

Slash Compute Costs

Not every job needs a massive Spark cluster. Reduce compute spend by up to 80% for maintenance tasks and small-to-medium datasets by using high-performance engines like Polars or DuckDB.

Guaranteed Integrity

LakeLogic routes records that fail contract rules into a quarantine zone and tags each one with the rule it failed. Bad rows stay out of downstream dashboards, which maintains stakeholder trust.

Full Pipeline Transparency

Eliminate the "Black Box" problem. LakeLogic provides visual drill-downs from board-level KPIs back to the raw source records, ensuring every number is auditable and explainable.


Data Mesh Without the Slideware

LakeLogic makes data mesh practical by turning ownership, quality, SLOs, and governance into files that run in production. The domain → system → contract hierarchy keeps responsibility close to the teams that know the data.

| Data Mesh Principle | LakeLogic Implementation |
| --- | --- |
| Domain Ownership | _domain.yaml declares owners, contacts, alerts, budgets, and SLOs |
| Data as a Product | Each contract defines schema, quality rules, lineage, and service guarantees |
| Self-Serve Platform | Teams write YAML and run pipelines without waiting on central platform tickets |
| Federated Governance | Shared rules inherit across domains and systems without creating a bottleneck |
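
One way to picture rule inheritance across the domain → system → contract hierarchy is a layered merge in which the more specific file wins. The sketch below is only an illustration of that idea: LakeLogic's actual precedence rules are not described here, and the helper merge_layers, the owner key, and the completeness_pct key are hypothetical. Only service_levels and freshness_hours appear in the sample contract.

```python
def merge_layers(*layers):
    """Deep-merge dicts from least to most specific; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = value
    return merged


domain = {"owner": "sales-data@example.com", "service_levels": {"freshness_hours": 48}}
system = {"service_levels": {"freshness_hours": 24}}     # tightens the domain default
contract = {"service_levels": {"completeness_pct": 99}}  # adds a contract-level SLO

effective = merge_layers(domain, system, contract)
print(effective)
```

The contract ends up with the domain's owner, the system's tighter freshness target, and its own completeness target, without any of the three files repeating the others.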

Learn how to organize contracts at scale · Read the data mesh guide

Go Further with LakeLogic

LakeLogic is the open-source engine that enforces your data contracts. Here's how to get the most out of it:

Bootstrap contracts from raw data with --ai: descriptions, PII detection, and SQL rules are generated in seconds.

Learn how to organize your contracts for 1,000s of tables using Domain-First ownership and Registries.

Explore the complete template showing every available configuration option for Bronze, Silver, and Gold layers.

Explore how LakeLogic enforces Quality Gates across the Medallion Architecture including Quarantine logic.

Generate realistic edge-case data from your contracts to stress-test quarantine rules before production.

Visit lakelogic.org for the latest guides, blog posts, and community resources.

