Trust Your Data. Scale Your Logic.

Write Once. Run Anywhere. — SQL-first quality gates from Polars to petabytes.


One Contract. Four Engines. Zero Rewrites.

Polars

from lakelogic import DataProcessor

# Blazing-fast local processing
processor = DataProcessor(
    contract="contract.yaml",
    engine="polars"
)

result = processor.run_source("data.csv")

print(f"✅ {len(result.good)} validated")
print(f"❌ {len(result.bad)} quarantined")

Spark

from lakelogic import DataProcessor

# Petabyte-scale distributed processing
processor = DataProcessor(
    contract="contract.yaml",
    engine="spark"
)

# Works with Delta Lake, Unity Catalog
result = processor.run_source("catalog.schema.table")

processor.materialize(result.good, result.bad)

DuckDB

from lakelogic import DataProcessor

# Fast analytical SQL engine
processor = DataProcessor(
    contract="contract.yaml",
    engine="duckdb"
)

result = processor.run_source("data.parquet")

# 100% reconciliation guaranteed
assert len(result.raw) == len(result.good) + len(result.bad)

Snowflake

from lakelogic import DataProcessor

# Direct warehouse execution
processor = DataProcessor(
    engine="snowflake",
    contract="contract.yaml"
)

result = processor.run_source("ANALYTICS.SILVER.CUSTOMERS")

Interactive Examples

Jump straight into the executable Jupyter notebooks that demonstrate LakeLogic's capabilities.


Delta Lake & Catalog Support (Spark-Free!)

LakeLogic automatically resolves catalog table names and uses Delta-RS for fast, Spark-free Delta Lake operations.

Unity Catalog

from lakelogic import DataProcessor

# Use Unity Catalog table names directly (no Spark required!)
processor = DataProcessor(
    engine="polars",
    contract="contracts/customers.yaml"
)

good_df, bad_df = processor.run_source(
    "main.default.customers"
)

# LakeLogic automatically:
# 1. Resolves table name to storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data with your contract rules

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Microsoft Fabric

from lakelogic import DataProcessor

# Use Fabric table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "myworkspace.sales_lakehouse.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

Synapse

from lakelogic import DataProcessor

# Use Synapse table names directly
processor = DataProcessor(
    engine="polars",
    contract="contracts/sales.yaml"
)

good_df, bad_df = processor.run_source(
    "salesdb.dbo.customers"
)

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")

How It Works (In a Nutshell)

LakeLogic enforces Data Contracts as Quality Gates at every layer of your medallion architecture:

┌──────────────────────────────────────────────────┐
│  📂 DATA SOURCE                                  │
│  CSV · Parquet · Delta · JSON · XML · Excel      │
│  APIs · URLs · Databases · Cloud Storage         │
└───────────────────────┬──────────────────────────┘
┌──────────────────────────────────────────────────┐
│  📜 CONTRACT.YAML                                │
│  Schema · Types · Nullability · Quality Rules    │
└───────────────────────┬──────────────────────────┘
              ┌─────────┴─────────┐
              │  DataProcessor    │
              │  .run_source()    │
              └─────────┬─────────┘
          ┌─────────────┼─────────────┐
          ▼             ▼             ▼
  ┌───────────┐ ┌───────────┐ ┌───────────┐
  │  Polars   │ │  Spark    │ │  DuckDB   │  Same contract,
  │  (local)  │ │  (cluster)│ │  (in-proc)│  any engine
  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
        └──────────────┼──────────────┘
          ┌────────────┼────────────┐
          ▼                         ▼
┌──────────────────┐     ┌──────────────────┐
│  ✅ good_df      │     │  ❌ bad_df       │
│  ────────────    │     │  ────────────    │
│  Schema valid    │     │  🛑 QUARANTINE   │
│  Rules passed    │     │  Every failed    │
│  Types correct   │     │  row saved with  │
│  Ready for next  │     │  failure reason  │
│  layer           │     │  ↻ Fix & replay  │
└────────┬─────────┘     └──────────────────┘
┌──────────────────────────────────────────────────┐
│  📊 PIPELINE ENRICHMENT                          │
│  ✓ Lineage injection (run_id, timestamps)        │
│  ✓ SLO checks (freshness, completeness)          │
│  ✓ Schema drift detection                        │
│  ✓ External logic (Python scripts / notebooks)   │
│  ✓ Materialization (Delta, Parquet, DB)          │
│  ✓ Run log (DuckDB audit trail)                  │
│  ✓ Notifications (alerts on quarantine/failure)  │
└──────────────────────────────────────────────────┘

Each layer in the medallion uses its own contract:

  🟤 BRONZE → Capture everything, catch obvious junk
  ⚪ SILVER → Full validation, business rules, dedup
  🟡 GOLD   → Aggregations, KPIs, analytics-ready

✨ Key Guarantees:
  • 100% Reconciliation: source_count = good_count + bad_count
  • Engine Agnostic: Same contract on Polars, Spark, DuckDB
  • No Silent Failures: Every bad row quarantined with reasons
  • Full Lineage: Source → Bronze → Silver → Gold, all traced
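
The quarantine-with-reasons behavior behind these guarantees can be sketched in plain Python, independent of any engine. The rule names and row format below are illustrative only, not the LakeLogic API:

```python
# Minimal sketch of the quarantine pattern (illustrative, not the real API).

def run_rules(rows, rules):
    """Split rows into good/bad; every bad row keeps its failure reasons."""
    good, bad = [], []
    for row in rows:
        reasons = [name for name, check in rules.items() if not check(row)]
        if reasons:
            bad.append({**row, "_reasons": reasons})
        else:
            good.append(row)
    return good, bad

# Hypothetical quality rules for a tiny dataset
rules = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_positive": lambda r: (r.get("amount") or 0) > 0,
}

raw = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]

good, bad = run_rules(raw, rules)

# 100% reconciliation: nothing is silently dropped
assert len(raw) == len(good) + len(bad)
print(bad[0]["_reasons"])  # ['id_not_null']
```

Every failed row carries its reason codes, so it can be inspected, fixed, and replayed instead of vanishing mid-pipeline.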

See detailed architecture


Meet the engines

  • Polars


    Blazing-fast local engine for single-node processing. Best for development, testing, and production workloads under 100GB.

    Learn more

  • Spark


    Distributed processing for petabyte-scale data. Native support for Delta Lake, Iceberg, and Unity Catalog.

    Learn more

  • DuckDB


    Fast analytical SQL engine with native Iceberg and Delta support. Perfect for local development and CI/CD.

    Learn more

  • Snowflake & BigQuery


    Direct warehouse execution with SQL pushdown. Table-only adapters for cloud data warehouses.

    Learn more


Why LakeLogic?

Write Once. Run Anywhere.

Stop paying the "Re-adaptation Tax." In a traditional stack, moving from a Warehouse (SQL) to a Lakehouse (PySpark) means rewriting your validation rules. With LakeLogic, your Data Contract is the Source of Truth.

  • SQL-First: Define your constraints, rules, and logic in standard SQL—the language your team already speaks.
  • Zero Adaptation: Move your pipelines from dbt/Snowflake to Databricks/Spark to Local/Polars with zero changes to your contract.
  • No Vendor Lock-in: Your business logic is a portable asset, independent of your cloud provider or execution engine.

Business ROI: Cost, Risk, & Trust

Eliminate the Spark Tax

Cut compute spend by up to 80% for maintenance and small-to-medium datasets by using Polars or DuckDB instead of Spark.

100% Reconciliation

Mathematically provable data integrity. Bad data is detoured into a Safe Quarantine area, ensuring production dashboards are never poisoned.

Visual Traceability

Gold-layer metrics should never be "Black Boxes." LakeLogic supports aggregate roll-ups that preserve source keys, providing business users with a visual drill-down from board-level KPIs back to the raw source records.
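
The roll-up idea can be sketched in plain Python: aggregate while carrying the contributing source keys, so a KPI can always be traced back to its raw rows. The row structure here is illustrative, not LakeLogic's actual output schema:

```python
from collections import defaultdict

# Illustrative silver-layer rows; "source_key" points back to the raw record
silver = [
    {"source_key": "r1", "region": "EU", "revenue": 100},
    {"source_key": "r2", "region": "EU", "revenue": 50},
    {"source_key": "r3", "region": "US", "revenue": 75},
]

# Gold-layer roll-up that preserves the contributing source keys
gold = defaultdict(lambda: {"revenue": 0, "source_keys": []})
for row in silver:
    agg = gold[row["region"]]
    agg["revenue"] += row["revenue"]
    agg["source_keys"].append(row["source_key"])

# Drill down from a board-level KPI to the raw records behind it
print(gold["EU"])  # {'revenue': 150, 'source_keys': ['r1', 'r2']}
```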


Technical Capabilities

Feature                  Description
Declarative Contracts    Human-readable YAML defines schema, rules, and transforms.
Engine Agnostic          Auto-discovers and optimizes for Spark, Polars, DuckDB, or Pandas.
SQL-First Rules          Use standard SQL for Completeness, Correctness, and Consistency checks.
Safe Quarantine          Isolate bad rows without crashing the pipeline, with built-in reason codes.
Lineage Injection        Automatically audit every record with Run IDs, Timestamps, and Source paths.
Registry Orchestration   A generic driver to run Bronze → Silver → Gold layers with parallel execution.
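
To make the "Declarative Contracts" row concrete, a contract might look roughly like this. The key names below are hypothetical, shown only to illustrate the schema/types/nullability/SQL-rule structure described above; consult the actual contract reference for the real field names:

```yaml
# Hypothetical shape of a contract.yaml — key names are illustrative only
name: customers
schema:
  - name: id
    type: integer
    nullable: false
  - name: email
    type: string
    nullable: false
rules:
  - name: email_format
    sql: "email LIKE '%@%'"
```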

Quick Start

The fastest way to get started is with uv:

# Install with all engines
uv pip install "lakelogic[all]"

# Run your first contract (auto-discovers the best engine)
lakelogic run --contract my_contract.yaml --source raw_data.parquet

Go Further with LakeLogic

LakeLogic is the open-source engine that enforces your data contracts. Here's how to get the most out of it:

  • AI-Powered Contract Generation: Bootstrap contracts from raw data with --ai — field descriptions, PII detection, and SQL quality rules generated in seconds. Works with OpenAI, Anthropic, Azure OpenAI, or local Ollama.
  • Synthetic Test Data: Generate realistic edge-case data from your contracts to stress-test quarantine rules before production.
  • Project Hub: Visit lakelogic.org for the latest guides, blog posts, and community resources.

