Trust Your Data. Scale Your Logic.

Write Once. Run Anywhere. — SQL-first quality gates from Polars to petabytes.


One Contract. Four Engines. Zero Rewrites.

Polars

from lakelogic import DataProcessor

# Blazing-fast local processing
processor = DataProcessor(
    contract="contract.yaml",
    engine="polars"
)

result = processor.run_source("data.csv")

print(f"✅ {len(result.good)} validated")
print(f"❌ {len(result.bad)} quarantined")

Spark

from lakelogic import DataProcessor

# Petabyte-scale distributed processing
processor = DataProcessor(
    contract="contract.yaml",
    engine="spark"
)

# Works with Delta Lake, Unity Catalog
result = processor.run_source("catalog.schema.table")

processor.materialize(result.good, result.bad)

DuckDB

from lakelogic import DataProcessor

# Fast analytical SQL engine
processor = DataProcessor(
    contract="contract.yaml",
    engine="duckdb"
)

result = processor.run_source("data.parquet")

# 100% reconciliation guaranteed
assert len(result.raw) == len(result.good) + len(result.bad)

Snowflake

from lakelogic import DataProcessor

# Direct warehouse execution
processor = DataProcessor(
    engine="snowflake",
    contract="contract.yaml"
)

result = processor.run_source("ANALYTICS.SILVER.CUSTOMERS")

Interactive Examples

Jump straight into the executable Jupyter notebooks that demonstrate LakeLogic's capabilities.


Delta Lake & Catalog Support (Spark-Free!)

LakeLogic automatically resolves catalog table names and uses Delta-RS for fast, Spark-free Delta Lake operations.

Unity Catalog

from lakelogic import DataProcessor

# Use Unity Catalog table names directly (no Spark required!)
processor = DataProcessor(
    engine="polars",
    contract="contracts/customers.yaml"
)

good_df, bad_df = processor.run_source(
    "main.default.customers"
)

# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Microsoft Fabric

from lakelogic import DataProcessor

# Use Fabric table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "myworkspace.sales_lakehouse.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Synapse

from lakelogic import DataProcessor

# Use Synapse table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "salesdb.dbo.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

How It Works (In a Nutshell)

LakeLogic enforces Data Contracts as Quality Gates at every layer of your medallion architecture:

┌──────────────────────────────────────────────────┐
│  📂 DATA SOURCE                                  │
│  CSV · Parquet · Delta · JSON · XML · Excel      │
│  APIs · URLs · Databases · Cloud Storage         │
└───────────────────────┬──────────────────────────┘
┌──────────────────────────────────────────────────┐
│  📜 CONTRACT.YAML                                │
│  Schema · Types · Nullability · Quality Rules    │
└───────────────────────┬──────────────────────────┘
              ┌─────────┴─────────┐
              │  DataProcessor    │
              │  .run_source()    │
              └─────────┬─────────┘
          ┌─────────────┼─────────────┐
          ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │  Polars   │ │  Spark    │ │  DuckDB   │  Same contract,
  │  (local)  │ │  (cluster)│ │  (in-proc)│  any engine
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        └──────────────┼──────────────┘
          ┌────────────┼────────────┐
          ▼                         ▼
┌──────────────────┐     ┌──────────────────┐
│  ✅ good_df      │     │  ❌ bad_df       │
│  ────────────    │     │  ────────────    │
│  Schema valid    │     │  🛑 QUARANTINE   │
│  Rules passed    │     │  Every failed    │
│  Types correct   │     │  row saved with  │
│  Ready for next  │     │  failure reason  │
│  layer           │     │  ↻ Fix & replay  │
└────────┬─────────┘     └──────────────────┘
┌──────────────────────────────────────────────────┐
│  📊 PIPELINE ENRICHMENT                          │
│  ✓ Lineage injection (run_id, timestamps)        │
│  ✓ SLO checks (freshness, completeness)          │
│  ✓ Schema drift detection                        │
│  ✓ External logic (Python scripts / notebooks)   │
│  ✓ Materialization (Delta, Parquet, DB)          │
│  ✓ Run log (DuckDB audit trail)                  │
│  ✓ Notifications (alerts on quarantine/failure)  │
└──────────────────────────────────────────────────┘

Each layer in the medallion uses its own contract:

  🟤 BRONZE → Capture everything, catch obvious junk
  ⚪ SILVER → Full validation, business rules, dedup
  🟡 GOLD   → Aggregations, KPIs, analytics-ready

✨ Key Guarantees:
  • 100% Reconciliation: source_count = good_count + bad_count
  • Engine Agnostic: Same contract on Polars, Spark, DuckDB
  • No Silent Failures: Every bad row quarantined with reasons
  • Full Lineage: Source → Bronze → Silver → Gold, all traced
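
The quarantine-with-reasons behavior behind these guarantees can be sketched in plain Python, independent of any engine. The rule names and row format below are illustrative only, not the LakeLogic API:

```python
# Minimal sketch of the quarantine pattern (illustrative, not the real API).

def run_rules(rows, rules):
    """Split rows into good/bad; every bad row keeps its failure reasons."""
    good, bad = [], []
    for row in rows:
        reasons = [name for name, check in rules.items() if not check(row)]
        if reasons:
            bad.append({**row, "_reasons": reasons})
        else:
            good.append(row)
    return good, bad

# Hypothetical quality rules for a tiny dataset
rules = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_positive": lambda r: (r.get("amount") or 0) > 0,
}

raw = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]

good, bad = run_rules(raw, rules)

# 100% reconciliation: nothing is silently dropped
assert len(raw) == len(good) + len(bad)
print(bad[0]["_reasons"])  # ['id_not_null']
```

Every failed row carries its reason codes, so it can be inspected, fixed, and replayed instead of vanishing mid-pipeline.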

See detailed architecture


Meet the engines

  • Polars


    Blazing-fast local engine for single-node processing. Best for development, testing, and production workloads under 100GB.

    Learn more

  • Spark


    Distributed processing for petabyte-scale data. Native support for Delta Lake, Iceberg, and Unity Catalog.

    Learn more

  • DuckDB


    Fast analytical SQL engine with native Iceberg and Delta support. Perfect for local development and CI/CD.

    Learn more

  • Snowflake & BigQuery


    Direct warehouse execution with SQL pushdown. Table-only adapters for cloud data warehouses.

    Learn more


Why LakeLogic?

Write Once. Run Anywhere.

Stop paying the "Re-adaptation Tax." In a traditional stack, moving from a Warehouse (SQL) to a Lakehouse (PySpark) means rewriting your validation rules. With LakeLogic, your Data Contract is the Source of Truth.

  • SQL-First: Define your constraints, rules, and logic in standard SQL—the language your team already speaks.
  • Zero Adaptation: Move your pipelines from dbt/Snowflake to Databricks/Spark to Local/Polars with zero changes to your contract.
  • No Vendor Lock-in: Your business logic is a portable asset, independent of your cloud provider or execution engine.

Business ROI: Cost, Risk, & Trust

Eliminate the Spark Tax

Cut compute spend by up to 80% for maintenance and small-to-medium datasets by using Polars or DuckDB instead of Spark.

100% Reconciliation

Mathematically provable data integrity. Bad data is detoured into a Safe Quarantine area, ensuring production dashboards are never poisoned.

Visual Traceability

Gold-layer metrics should never be "Black Boxes." LakeLogic supports aggregate roll-ups that preserve source keys, providing business users with a visual drill-down from board-level KPIs back to the raw source records.
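
The roll-up idea can be sketched in plain Python: aggregate while carrying the contributing source keys, so a KPI can always be traced back to its raw rows. The row structure here is illustrative, not LakeLogic's actual output schema:

```python
from collections import defaultdict

# Illustrative silver-layer rows; "source_key" points back to the raw record
silver = [
    {"source_key": "r1", "region": "EU", "revenue": 100},
    {"source_key": "r2", "region": "EU", "revenue": 50},
    {"source_key": "r3", "region": "US", "revenue": 75},
]

# Gold-layer roll-up that preserves the contributing source keys
gold = defaultdict(lambda: {"revenue": 0, "source_keys": []})
for row in silver:
    agg = gold[row["region"]]
    agg["revenue"] += row["revenue"]
    agg["source_keys"].append(row["source_key"])

# Drill down from a board-level KPI to the raw records behind it
print(gold["EU"])  # {'revenue': 150, 'source_keys': ['r1', 'r2']}
```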


Technical Capabilities

Feature                  Description
Declarative Contracts    Human-readable YAML defines schema, rules, and transforms.
Engine Agnostic          Auto-discovers and optimizes for Spark, Polars, DuckDB, or Pandas.
SQL-First Rules          Use standard SQL for Completeness, Correctness, and Consistency checks.
Safe Quarantine          Isolate bad rows without crashing the pipeline, with built-in reason codes.
Lineage Injection        Automatically audit every record with Run IDs, Timestamps, and Source paths.
Registry Orchestration   A generic driver to run Bronze → Silver → Gold layers with parallel execution.
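
To make the "Declarative Contracts" row concrete, a contract might look roughly like this. The key names below are hypothetical, shown only to illustrate the schema/types/nullability/SQL-rule structure described above; consult the actual contract reference for the real field names:

```yaml
# Hypothetical shape of a contract.yaml — key names are illustrative only
name: customers
schema:
  - name: id
    type: integer
    nullable: false
  - name: email
    type: string
    nullable: false
rules:
  - name: email_format
    sql: "email LIKE '%@%'"
```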

Quick Start

The fastest way to get started is with uv:

# Install with all engines
uv pip install "lakelogic[all]"

# Run your first contract (auto-discovers the best engine)
lakelogic run --contract my_contract.yaml --source raw_data.parquet

Go Further with LakeLogic

LakeLogic is the open-source engine that enforces your data contracts. Here's how to get the most out of it:

  • AI-Powered Contract Generation: Bootstrap contracts from raw data with --ai — field descriptions, PII detection, and SQL quality rules generated in seconds. Works with OpenAI, Anthropic, Azure OpenAI, or local Ollama.
  • Synthetic Test Data: Generate realistic edge-case data from your contracts to stress-test quarantine rules before production.
  • Project Hub: Visit lakelogic.org for the latest guides, blog posts, and community resources.

