Trust Your Data. Scale Your Logic.
Write Once. Run Anywhere. — SQL-first quality gates from Polars to petabytes.
One Contract. Four Engines. Zero Rewrites.
Interactive Examples
Jump straight into executable Jupyter notebooks that demonstrate LakeLogic's capabilities:
- Hello World - Remote data ingestion in 60 seconds
- Database Governance - Quarantine dirty data
- HIPAA & GDPR Compliance - PII masking, consent tracking, and multi-regulation governance
- AI Contract Enrichment - Generate field descriptions, PII flags, and quality rules with AI
Delta Lake & Catalog Support (Spark-Free!)
LakeLogic automatically resolves catalog table names and uses Delta-RS for fast, Spark-free Delta Lake operations.
```python
from lakelogic import DataProcessor

# Use Unity Catalog table names directly (no Spark required!)
processor = DataProcessor(
    engine="polars",
    contract="contracts/customers.yaml",
)

good_df, bad_df = processor.run_source("main.default.customers")

# LakeLogic automatically:
# 1. Resolves the table name to its storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates the data against your contract rules
print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")
```
How It Works (In a Nutshell)
LakeLogic enforces Data Contracts as Quality Gates at every layer of your medallion architecture:
```
┌──────────────────────────────────────────────────┐
│ 📂 DATA SOURCE │
│ CSV · Parquet · Delta · JSON · XML · Excel │
│ APIs · URLs · Databases · Cloud Storage │
└───────────────────────┬──────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ 📜 CONTRACT.YAML │
│ Schema · Types · Nullability · Quality Rules │
└───────────────────────┬──────────────────────────┘
│
┌─────────┴─────────┐
│ DataProcessor │
│ .run_source() │
└─────────┬─────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Polars │ │ Spark │ │ DuckDB │ Same contract,
│ (local) │ │ (cluster)│ │ (in-proc)│ any engine
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
└──────────────┼──────────────┘
│
┌────────────┼────────────┐
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ ✅ good_df │ │ ❌ bad_df │
│ ──────────── │ │ ──────────── │
│ Schema valid │ │ 🛑 QUARANTINE │
│ Rules passed │ │ Every failed │
│ Types correct │ │ row saved with │
│ Ready for next │ │ failure reason │
│ layer │ │ ↻ Fix & replay │
└────────┬─────────┘ └──────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ 📊 PIPELINE ENRICHMENT │
│ ✓ Lineage injection (run_id, timestamps) │
│ ✓ SLO checks (freshness, completeness) │
│ ✓ Schema drift detection │
│ ✓ External logic (Python scripts / notebooks) │
│ ✓ Materialization (Delta, Parquet, DB) │
│ ✓ Run log (DuckDB audit trail) │
│ ✓ Notifications (alerts on quarantine/failure) │
└──────────────────────────────────────────────────┘
```
Each layer in the medallion uses its own contract:
- 🟤 BRONZE → Capture everything, catch obvious junk
- ⚪ SILVER → Full validation, business rules, dedup
- 🟡 GOLD → Aggregations, KPIs, analytics-ready
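The layered gates can be sketched in plain Python. This is a minimal illustration of the idea, not LakeLogic's API; the rule sets and field names are hypothetical:

```python
# Sketch of per-layer quality gates (hypothetical rules, not LakeLogic's API).
# Each "contract" is just a dict of named predicates; each layer applies its own.

bronze_rules = {
    # Bronze: capture everything, reject only obvious junk
    "has_id": lambda row: row.get("id") is not None,
}
silver_rules = {
    # Silver: full validation and business rules
    "has_id": lambda row: row.get("id") is not None,
    "amount_positive": lambda row: isinstance(row.get("amount"), (int, float)) and row["amount"] > 0,
}

def apply_gate(rows, rules):
    """Split rows into (good, bad) according to a layer's rule set."""
    good, bad = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            bad.append({**row, "_failed_rules": failed})  # quarantined with reasons
        else:
            good.append(row)
    return good, bad

raw = [
    {"id": 1, "amount": 9.5},
    {"id": 2, "amount": -3.0},    # passes bronze, quarantined at silver
    {"id": None, "amount": 1.0},  # obvious junk: quarantined at bronze
]

bronze_good, bronze_bad = apply_gate(raw, bronze_rules)
silver_good, silver_bad = apply_gate(bronze_good, silver_rules)
print(len(bronze_bad), len(silver_bad), len(silver_good))  # 1 1 1
```

Each layer tightens the rules while earlier layers guarantee nothing valid is lost along the way.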
✨ Key Guarantees:
- 100% Reconciliation: source_count = good_count + bad_count
- Engine Agnostic: Same contract on Polars, Spark, DuckDB
- No Silent Failures: Every bad row quarantined with reasons
- Full Lineage: Source → Bronze → Silver → Gold, all traced
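The reconciliation guarantee is easy to state in code: validation partitions the source rows, so no row is ever silently dropped. A stdlib-only sketch of the invariant (not LakeLogic internals):

```python
# Sketch of the reconciliation invariant: source_count = good_count + bad_count.
# Validation partitions the input; every row lands in exactly one bucket.

def validate(rows, predicate):
    good = [r for r in rows if predicate(r)]
    bad = [r for r in rows if not predicate(r)]
    # The invariant promised at every layer:
    assert len(rows) == len(good) + len(bad)
    return good, bad

source = [{"qty": q} for q in (3, 0, -2, 7, -1)]
good, bad = validate(source, lambda r: r["qty"] > 0)
print(len(source), len(good), len(bad))  # 5 2 3
```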
Meet the engines
- Polars: Blazing-fast local engine for single-node processing. Best for development, testing, and production workloads under 100 GB.
- Spark: Distributed processing for petabyte-scale data. Native support for Delta Lake, Iceberg, and Unity Catalog.
- DuckDB: Fast analytical SQL engine with native Iceberg and Delta support. Perfect for local development and CI/CD.
- Snowflake & BigQuery: Direct warehouse execution with SQL pushdown. Table-only adapters for cloud data warehouses.
Why LakeLogic?
Write Once. Run Anywhere.
Stop paying the "Re-adaptation Tax." In a traditional stack, moving from a Warehouse (SQL) to a Lakehouse (PySpark) means rewriting your validation rules. With LakeLogic, your Data Contract is the Source of Truth.
- SQL-First: Define your constraints, rules, and logic in standard SQL—the language your team already speaks.
- Zero Adaptation: Move your pipelines from dbt/Snowflake to Databricks/Spark to Local/Polars with zero changes to your contract.
- No Vendor Lock-in: Your business logic is a portable asset, independent of your cloud provider or execution engine.
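Because a rule is just SQL text, the same rule can be handed to any engine that speaks SQL. As an illustration only, using stdlib sqlite3 as a stand-in engine (the rule, table, and columns are made up):

```python
import sqlite3

# The portable asset: one SQL quality rule, expressed once.
RULE = "email IS NOT NULL AND amount >= 0"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (email TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a@x.com", 10.0), (None, 5.0), ("b@x.com", -1.0)],
)

# Any SQL engine can evaluate the same rule text to split good from bad.
good = conn.execute(f"SELECT * FROM orders WHERE {RULE}").fetchall()
bad = conn.execute(f"SELECT * FROM orders WHERE NOT ({RULE})").fetchall()
print(len(good), len(bad))  # 1 2
```

Swapping sqlite3 for DuckDB, Spark SQL, or Snowflake changes the connection, not the rule.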
Business ROI: Cost, Risk, & Trust
Eliminate the Spark Tax
Cut compute spend by up to 80% for maintenance and small-to-medium datasets by using Polars or DuckDB instead of Spark.
100% Reconciliation
Mathematically provable data integrity. Bad data is detoured into a Safe Quarantine area, ensuring production dashboards are never poisoned.
Visual Traceability
Gold-layer metrics should never be "Black Boxes." LakeLogic supports aggregate roll-ups that preserve source keys, providing business users with a visual drill-down from board-level KPIs back to the raw source records.
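A drill-down-friendly roll-up simply keeps the contributing keys next to each aggregate. A pure-Python sketch of the idea (field names are illustrative, not LakeLogic's schema):

```python
from collections import defaultdict

# Gold-layer roll-up that preserves source keys, so each KPI can be
# traced back to the raw records that produced it.

silver_rows = [
    {"order_id": "o-1", "region": "EU", "revenue": 120.0},
    {"order_id": "o-2", "region": "EU", "revenue": 80.0},
    {"order_id": "o-3", "region": "US", "revenue": 200.0},
]

gold = defaultdict(lambda: {"revenue": 0.0, "source_keys": []})
for row in silver_rows:
    bucket = gold[row["region"]]
    bucket["revenue"] += row["revenue"]
    bucket["source_keys"].append(row["order_id"])  # the drill-down trail

for region, kpi in sorted(gold.items()):
    print(region, kpi["revenue"], kpi["source_keys"])
# EU 200.0 ['o-1', 'o-2']
# US 200.0 ['o-3']
```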
Technical Capabilities
| Feature | Description |
|---|---|
| Declarative Contracts | Human-readable YAML defines schema, rules, and transforms. |
| Engine Agnostic | Auto-discovers and optimizes for Spark, Polars, DuckDB, or Pandas. |
| SQL-First Rules | Use standard SQL for Completeness, Correctness, and Consistency checks. |
| Safe Quarantine | Isolate bad rows without crashing the pipeline, with built-in reason codes. |
| Lineage Injection | Automatically audit every record with Run IDs, Timestamps, and Source paths. |
| Registry Orchestration | A generic driver to run Bronze → Silver → Gold layers with parallel execution. |
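Lineage injection amounts to stamping each record with run metadata before it moves downstream. A stdlib-only sketch (the column names are illustrative, not LakeLogic's actual schema):

```python
import uuid
from datetime import datetime, timezone

def inject_lineage(rows, source_path):
    """Stamp every record with a run id, timestamp, and source path."""
    run_id = str(uuid.uuid4())  # one id per pipeline run
    ts = datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_run_id": run_id, "_ingested_at": ts, "_source": source_path}
        for row in rows
    ]

rows = inject_lineage([{"id": 1}, {"id": 2}], "s3://raw/orders.parquet")
# Every record in a run shares the same run id, making audits a simple filter.
assert all(r["_run_id"] == rows[0]["_run_id"] for r in rows)
print(rows[0]["_source"])  # s3://raw/orders.parquet
```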
Quick Start
The fastest way to get started is with uv:
```shell
# Install with all engines
uv pip install "lakelogic[all]"

# Run your first contract (auto-discovers the best engine)
lakelogic run --contract my_contract.yaml --source raw_data.parquet
```
Go Further with LakeLogic
LakeLogic is the open-source engine that enforces your data contracts. Here's how to get the most out of it:
- AI-Powered Contract Generation: Bootstrap contracts from raw data with `--ai`: field descriptions, PII detection, and SQL quality rules generated in seconds. Works with OpenAI, Anthropic, Azure OpenAI, or local Ollama.
- Synthetic Test Data: Generate realistic edge-case data from your contracts to stress-test quarantine rules before production.
- Project Hub: Visit lakelogic.org for the latest guides, blog posts, and community resources.
From the Blog
Latest Posts
- Data Quality Management Without the Platform Tax — Why YAML contracts beat enterprise DQM platforms on cost, flexibility, and version control.
- Row-Level Data Quality in Polars — Without Writing Validation Code — One YAML file replaces 200 lines of Polars validation boilerplate.
- Data Mesh Without the Chaos — How data contracts make domain ownership work at enterprise scale.
- Stop the Spark Tax — One data contract, any engine — eliminate logic drift between Spark, Polars, and DuckDB.