
Your Data Estate. Under Contract.

A declarative, contract-driven medallion
pipeline engine for data mesh architectures.

Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.

Write once. Run on Spark, Polars, or DuckDB.
The vendor-neutral alternative to Databricks Lakeflow Pipelines.

contract.yaml
# 1. Read incrementally from cloud storage
source:
  path: s3://landing/customers/*.json
  load_mode: incremental
  watermark_strategy: pipeline_log  # Only process files newer than last run

# 2. Enforce schema & PII masking
model:
  fields:
    - name: cus_id
      type: string
      required: true
    - name: email
      required: true
      pii: true
      masking: "encrypt"            # AES-256 via LAKELOGIC_PII_KEY env var

# 3. Apply SQL transformations
transformations:
  - sql: "LOWER(TRIM(email)) AS email"

# 4. Enforce quality & SLO guarantees
quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
service_levels:
  freshness_hours: 24

# 5. Write 100% clean data directly to Catalog
materialization:
  strategy: merge
  primary_key: [cus_id]
  target_path: catalog.silver.customers
Standard CLI
lakelogic run contract.yaml
Python / Databricks
from lakelogic import DataProcessor

proc = DataProcessor("contract.yaml")

# Executes the contract end-to-end
result = proc.run()
Execution Logs
LakeLogic Alert: 2 records quarantined in 'customers'. Total: 4
[2026-03-28 12:00:01] INFO  | Wrote 2 quarantined rows to catalog.quarantine.silver_customers
[2026-03-28 12:00:02] INFO  | Wrote 2 valid rows to catalog.silver.customers
[2026-03-28 12:00:03] INFO  | Run complete [layer=silver] | Total: 4 | Good: 2 | Quarantine: 2 | Ratio: 50.0%

result.good (Passed Quality Gate & PII Masked)

cus_id | email
C100   | enc:a1F3bG9nZ2VkQGV4...
C101   | enc:dXNlcjEwMUBjb3Jw...

🚨 result.bad (Quarantined by LakeLogic)

cus_id | email        | _lakelogic_categories | _lakelogic_errors
C102   | not_an_email | ["correctness"]       | ["Rule failed: email LIKE '%@%.%'"]
C103   | null         | ["completeness"]      | ["Rule failed: email is required"]

Quick Start

pip install "lakelogic"

Next step: Jump straight into the 5-Minute Quickstart in Google Colab and run your first pipeline in five minutes — no local files required; it downloads sample data automatically.


Data Mesh Is Structural — Not Just a Principle

Data mesh isn't a buzzword in LakeLogic — it's the architecture. The domain → system → contract hierarchy enforces ownership boundaries at every level:

🏢 Domain (Marketing, Sales, Finance)
│   "Who owns this data?"
│   → _domain.yaml — ownership, SLOs, contacts, alerts
├── 🏗️ System (Google Analytics, Salesforce, SAP)
│   "Where does this data come from?"
│   → _system.yaml — storage, environments, settings
└── 📄 Data Product (events, customers, orders)
    "What does this specific table look like?"
    → entity_v1.0.yaml — schema, quality rules, transforms

Analogy: A domain is like a department (Marketing). A system is like a tool that department uses (Google Analytics). A data product is like a specific report from that tool (website sessions).

Data Mesh Principle  | What It Means (Plain English)                              | How LakeLogic Enables This
Domain Ownership     | The people closest to the data own it                      | _domain.yaml names the owner, their contacts, and cost centre
Data as a Product    | Treat each dataset like a product with quality guarantees  | Each contract declares schema, quality rules, and SLOs
Self-Serve Platform  | Give teams tools so they don't wait on a central team      | Write YAML → run pipeline. No tickets, no handoffs
Federated Governance | Consistent rules without a bottleneck                      | Domain-level SLOs inherited automatically by every table
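
For illustration, here is a minimal sketch of what a _domain.yaml might contain. The individual keys (owner, contacts, cost_centre, alerts) are illustrative assumptions; only the kinds of information they hold (ownership, SLOs, contacts, alerts) come from the hierarchy above.

_domain.yaml (illustrative sketch)
domain: marketing
owner: marketing-data-team           # Assumption: who owns this data
contacts:
  - data-marketing@example.com       # Assumption: where alerts and questions are routed
cost_centre: MKT-001                 # Assumption: cost attribution
service_levels:
  freshness_hours: 24                # Domain-level SLO, inherited by every table in the domain
alerts:
  channels: [slack, email]           # Assumption: multi-channel alert routing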

Define Once. Enforce Everywhere.

LakeLogic makes your Data Contract the Single Source of Truth. One YAML file replaces hundreds of lines of validation code, and it runs on any engine.

Think of a contract like a building code. The architect (data engineer) writes the spec once. Every builder (Spark, Polars, DuckDB) follows the same code — no matter which team or tool runs the pipeline.

What the Contract Defines                            | Why It Matters
Schema (fields, types, PII flags)                    | Catches type mismatches and schema drift before they hit your dashboard
Source (where to read, how to load)                  | Declarative ingestion — no boilerplate code
Transformations (SQL-first)                          | Business logic lives in the contract, not scattered across notebooks
Quality rules (row + dataset)                        | Bad data quarantined automatically, never silently dropped
Materialization (merge, append, SCD2)                | Write strategy declared, not coded
SLOs (freshness, completeness, anomalies, schedule)  | Data reliability promises enforced and tracked
Lineage (source, run_id, timestamps)                 | Every row stamped automatically for audit trails
Compliance (GDPR, EU AI Act)                         | Regulatory metadata baked into the data layer

See the full contract reference · Complete annotated template


Technical Capabilities

Data Quality & Trust

  • 100% Reconciliation — Mathematically guaranteed: source = good + bad. Every row is accounted for — nothing silently dropped
  • Pydantic-Powered Validation — Every contract, system, and domain config is parsed through strict Pydantic models with Literal type enforcement — invalid YAML is caught at load time, not at runtime
  • SQL-First Rules — Define business logic in the language your team already speaks — no SDK, no custom DSL
  • SLO Monitoring & Anomaly Detection — Native freshness, row count, and statistical anomaly detection with automatic multi-channel alerting when thresholds are breached
  • Schema Drift Protection — Configurable schema_policy controls how the pipeline reacts to unknown columns and schema evolution — default "allow" for frictionless prototyping, opt in to "strict" / "quarantine" to lock down production contracts
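
As a sketch of how these controls might sit in a contract: the quality and service_levels blocks mirror the contract example at the top of the page, while treating schema_policy as a sibling key is an assumption about placement.

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"   # Failing rows are quarantined, never silently dropped
service_levels:
  freshness_hours: 24             # An SLO breach triggers multi-channel alerts
schema_policy: strict             # Assumption on placement; values per the bullet above: "allow" (default), "strict", "quarantine"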

Run the Data Quality & Trust Guide in Google Colab

Compliance & Governance

  • GDPR & HIPAA Compliance — Contract-driven forget_subjects() with nullify, hash, or redact strategies and an immutable audit trail (see the sketch after this list)
  • Automatic Lineage — Every row stamped with Run IDs and source paths — traceable from landing zone to Gold layer
  • Pipeline Cost Intelligence — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
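
A minimal sketch of a right-to-be-forgotten call, assuming forget_subjects() is exposed on DataProcessor. Only the method name and the nullify/hash/redact strategies come from the bullet above; the argument names are assumptions.

from lakelogic import DataProcessor

proc = DataProcessor("contracts/customers.yaml")

# Hypothetical signature: erase (or hash/redact) every row belonging to these subjects.
proc.forget_subjects(
    subjects=["C100", "C101"],   # Assumption: primary-key values identifying the data subjects
    strategy="hash",             # One of: nullify, hash, redact
)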

Run the Compliance & Governance Guide in Google Colab

Engine & Scale

  • Engine Agnostic — Write once, run on Spark, Polars, or DuckDB — same contract, zero code changes
  • Dimensional Modeling — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual MERGE INTO SQL required (see the sketch after this list)
  • Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
  • Parallel Processing — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
  • Backfill & Reprocessing — Targeted late-arriving data reprocessing with partition-aware filters — no full reload required
  • External Logic — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
  • Production Resilience — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (max_consecutive_failures) — pipelines self-heal transient failures without operator intervention
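
As an example of the dimensional-modeling bullet, a hedged sketch of an SCD Type 2 materialization block that extends the merge example from the contract at the top of the page. The scd2 strategy value is an assumption about spelling, not the definitive syntax.

materialization:
  strategy: scd2                     # Assumption: SCD Type 2, keeps full history with validity ranges per key
  primary_key: [cus_id]
  target_path: catalog.silver.customers_history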

Run the Engine & Scale Guide in Google Colab

Developer Experience

  • Structured Diagnostics & Observability — Deep contextual logging out of the box (powered by loguru), featuring precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
  • Dry Run Mode — Validate contracts, resolve dependencies, and preview execution plans without touching any data (see the sketch after this list)
  • DDL-Only Mode — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
  • DAG Dependency Viewer — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
  • Data Reset & Reload — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
  • Multi-Channel Alerts — Powered by Apprise for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications with ownership-based auto-routing and full Jinja2 templating support for custom formatting
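
A minimal sketch of a dry run, assuming it is exposed as an argument on DataProcessor.run(). Only DataProcessor and run() appear elsewhere on this page; the dry_run parameter (and whether the same option exists as a CLI flag) is an assumption.

from lakelogic import DataProcessor

proc = DataProcessor("contract.yaml")

# Assumption: validate the contract, resolve dependencies, and preview the
# execution plan without reading or writing any data.
plan = proc.run(dry_run=True)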

Run the Developer Experience Guide in Google Colab

Data Generation & AI

  • Synthetic Data — Built-in DataGenerator (powered by Faker) with streaming simulation, time-windowed output, referential integrity, and edge case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation (see the sketch after this list)
  • Descriptive AI Test Data — Steer synthetic data generation with natural language prompts (e.g. "Generate users who are French or Japanese only, enterprise-tier, over 60 years old with SQL injection attempts in email fields") — output strictly adheres to the YAML contract schema
  • AI Contract Onboarding — lakelogic infer auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality rule suggestions
  • Unstructured Processing — LLM extraction from PDFs, images, and audio, with the same contract validation and lineage
  • Automated Run Logs — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
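
A minimal sketch of contract-driven synthetic data, assuming DataGenerator is constructed from a contract and exposes a generate-style method. The class name comes from the bullet above, but the constructor and method signatures are assumptions.

from lakelogic import DataGenerator

# Hypothetical usage: generate contract-conformant rows plus injected edge cases.
gen = DataGenerator("contract.yaml")
rows = gen.generate(
    num_rows=1_000,    # Assumption: row count parameter
    error_rate=0.05,   # Assumption: share of rows with injected edge cases for quarantine testing
)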

Run the Data Generation & AI Guide in Google Colab

Integrations

  • dbt Adapter — Import dbt schema.yml models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
  • dlt (Data Load Tool) — Native DltAdapter supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival
  • Native Streaming Connectors — Built-in WebSocketConnector, SSEConnector, KafkaConnector, WebhookConnector (plus Azure Event Grid, Service Bus, AWS SQS, GCP Pub/Sub) for real-time data feeds piped directly into contract validation with pre-validation rename transformations
  • Native Database Ingestion — High-performance SQL extraction via Polars/ConnectorX and DuckDB — supports PostgreSQL, MySQL, SQL Server, SQLite, and more with automatic dialect detection
  • Incremental CDC — Watermark-based change data capture with automatic state tracking — pushes WHERE updated_at > last_watermark down to the source SQL engine, so filtering happens before data leaves the database (see the sketch after this list)
  • Batch Processing — Memory-safe chunked ingestion via fetch_size for massive initial loads — processes 100GB+ tables without OOM errors
  • Column Projection Pushdown — Automatically constructs precise SELECT "col1", "col2" queries from your contract's model.fields — only extracts what the contract declares, zero configuration
  • Cloud Data Sources — Native abfss://, s3://, gs:// URI support with automatic credential resolution via CloudCredentialResolver — Azure AD, AWS IAM roles, GCP ADC, service principals, and Databricks secret scopes all work out of the box
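
A hedged sketch of what a database source with incremental CDC might look like in a contract. The load_mode: incremental and fetch_size settings appear elsewhere on this page, while the connection, table, and watermark_column keys are assumptions.

source:
  connection: postgresql://analytics_ro@db.internal:5432/crm    # Assumption: connection URI key
  table: public.customers                                       # Assumption: source table key
  load_mode: incremental
  watermark_column: updated_at      # Assumption: drives the pushed-down WHERE updated_at > last_watermark
  fetch_size: 50000                 # Chunked extraction for very large initial loads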

Run the Integrations Guide in Google Colab


Delta Lake & Catalog Support — Lightweight Mode

LakeLogic automatically resolves catalog table names and uses Delta-RS for fast, lightweight Delta Lake operations — no Spark cluster required.

from lakelogic import DataProcessor

# Use Unity Catalog table names directly — lightweight mode
processor = DataProcessor(
    engine="polars", 
    contract="contracts/customers.yaml"
)

good_df, bad_df = processor.run_source(
    "main.default.customers"
)

# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")
from lakelogic import DataProcessor

# Use Fabric table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "myworkspace.sales_lakehouse.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")
from lakelogic import DataProcessor

# Use Synapse table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "salesdb.dbo.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Why LakeLogic?

Stop the "Fragmented Truth" Problem

In a traditional data stack, moving from a Warehouse (SQL) to a Lakehouse (PySpark) means rewriting your validation rules. This duplication creates Logic Drift — where your data quality standards differ depending on which tool is running the code.

With LakeLogic, your Data Contract is the Source of Truth.

  • SQL-First Simplicity: Define your constraints and business logic in standard SQL — the language your team already speaks.
  • Zero-Friction Portability: Move your pipelines from dbt/Snowflake to Databricks/Spark to Local/Polars with zero changes to your contract (see the sketch after this list).
  • True Ownership: Your business logic is a portable asset, independent of your cloud provider or execution engine.
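
A sketch of what that portability looks like in code. The engine parameter and the "polars" value appear in the lightweight-mode examples above; the "spark" identifier shown here is an assumption.

from lakelogic import DataProcessor

# Same contract, different engines: only the engine argument changes.
laptop  = DataProcessor(engine="polars", contract="contracts/customers.yaml")
cluster = DataProcessor(engine="spark",  contract="contracts/customers.yaml")   # Assumption: engine identifier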

Business Impact: Trust, Speed, and ROI

Slash Compute Costs

Not every job needs a massive Spark cluster. Reduce compute spend by up to 80% for maintenance tasks and small-to-medium datasets by using high-performance engines like Polars or DuckDB.

Guaranteed Integrity

LakeLogic diverts bad data into a Safe Quarantine zone with row-level precision. This ensures downstream dashboards are never poisoned by "dirty" data, maintaining stakeholder trust.

Full Pipeline Transparency

Eliminate the "Black Box" problem. LakeLogic provides visual drill-downs from board-level KPIs back to the raw source records, ensuring every number is auditable and explainable.

Go Further with LakeLogic

LakeLogic is the open-source engine that enforces your data contracts. Here's how to get the most out of it:

Bootstrap contracts from raw data with --ai — descriptions, PII detection, and SQL rules generated in seconds.

Learn how to organize your contracts for 1,000s of tables using Domain-First ownership and Registries.

Explore the complete template showing every available configuration option for Bronze, Silver, and Gold layers.

Explore how LakeLogic enforces Quality Gates across the Medallion Architecture including Quarantine logic.

Generate realistic edge-case data from your contracts to stress-test quarantine rules before production.

Visit lakelogic.org for the latest guides, blog posts, and community resources.



