Your Data Estate. Under Contract.
A declarative, contract-driven medallion
pipeline engine for data mesh architectures.
Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.
Write once. Run on Spark, Polars, or DuckDB.
The vendor-neutral alternative to Databricks Lakeflow Pipelines.
```yaml
# 1. Read incrementally from cloud storage
source:
  path: s3://landing/customers/*.json
  load_mode: incremental
  watermark_strategy: pipeline_log  # Only process files newer than last run

# 2. Enforce schema & PII masking
model:
  fields:
    - name: cus_id
      type: string
      required: true
    - name: email
      required: true
      pii: true
      masking: "encrypt"  # AES-256 via LAKELOGIC_PII_KEY env var

# 3. Apply SQL transformations
transformations:
  - sql: "LOWER(TRIM(email)) AS email"

# 4. Enforce quality & SLO guarantees
quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
  service_levels:
    freshness_hours: 24

# 5. Write 100% clean data directly to the catalog
materialization:
  strategy: merge
  primary_key: [cus_id]
  target_path: catalog.silver.customers
```
```text
LakeLogic Alert: 2 records quarantined in 'customers'. Total: 4
[2026-03-28 12:00:01] INFO | Wrote 2 quarantined rows to catalog.quarantine.silver_customers
[2026-03-28 12:00:02] INFO | Wrote 2 valid rows to catalog.silver.customers
[2026-03-28 12:00:03] INFO | Run complete [layer=silver] | Total: 4 | Good: 2 | Quarantine: 2 | Ratio: 50.0%
```
✅ result.good (Passed Quality Gate & PII Masked)
| cus_id | email |
|---|---|
| C100 | enc:a1F3bG9nZ2VkQGV4... |
| C101 | enc:dXNlcjEwMUBjb3Jw... |
🚨 result.bad (Quarantined by LakeLogic)
| cus_id | email | _lakelogic_categories | _lakelogic_errors |
|---|---|---|---|
| C102 | not_an_email | ["correctness"] | ["Rule failed: email LIKE '%@%.%'"] |
| C103 | null | ["completeness"] | ["Rule failed: email is required"] |
Quick Start
Next step: jump straight into the 5-Minute Quickstart in Google Colab and run your first pipeline in five minutes — no local files required; it downloads sample data automatically.
Data Mesh Is Structural — Not Just a Principle
Data mesh isn't a buzzword in LakeLogic — it's the architecture. The domain → system → contract hierarchy enforces ownership boundaries at every level:
```text
🏢 Domain (Marketing, Sales, Finance)
│   "Who owns this data?"
│   → _domain.yaml — ownership, SLOs, contacts, alerts
│
├── 🏗️ System (Google Analytics, Salesforce, SAP)
│   "Where does this data come from?"
│   → _system.yaml — storage, environments, settings
│
└── 📄 Data Product (events, customers, orders)
    "What does this specific table look like?"
    → entity_v1.0.yaml — schema, quality rules, transforms
```
Analogy: A domain is like a department (Marketing). A system is like a tool that department uses (Google Analytics). A data product is like a specific report from that tool (website sessions).
| Data Mesh Principle | What It Means (Plain English) | How LakeLogic Enables This |
|---|---|---|
| Domain Ownership | The people closest to the data own it | _domain.yaml names the owner, their contacts, and cost centre |
| Data as a Product | Treat each dataset like a product with quality guarantees | Each contract declares schema, quality rules, and SLOs |
| Self-Serve Platform | Give teams tools so they don't wait on a central team | Write YAML → run pipeline. No tickets, no handoffs |
| Federated Governance | Consistent rules without a bottleneck | Domain-level SLOs inherited automatically by every table |
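The ownership hierarchy above maps to plain files on disk. As a hedged sketch (field names here are illustrative, not the authoritative schema — consult the contract reference for exact keys), a marketing `_domain.yaml` could look like:

```yaml
# _domain.yaml — illustrative sketch, not the authoritative schema
domain: marketing
owner:
  team: marketing-analytics
  contact: marketing-data@acme.example
cost_centre: CC-4201
service_levels:            # inherited by every table in this domain
  freshness_hours: 24
alerts:
  channels: [slack, email]
```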
Define Once. Enforce Everywhere.
LakeLogic makes your Data Contract the Single Source of Truth. One YAML file replaces hundreds of lines of validation code, and it runs on any engine.
Think of a contract like a building code. The architect (data engineer) writes the spec once. Every builder (Spark, Polars, DuckDB) follows the same code — no matter which team or tool runs the pipeline.
| What the Contract Defines | Why It Matters |
|---|---|
| Schema (fields, types, PII flags) | Catches type mismatches and schema drift before they hit your dashboard |
| Source (where to read, how to load) | Declarative ingestion — no boilerplate code |
| Transformations (SQL-first) | Business logic lives in the contract, not scattered across notebooks |
| Quality rules (row + dataset) | Bad data quarantined automatically, never silently dropped |
| Materialization (merge, append, SCD2) | Write strategy declared, not coded |
| SLOs (freshness, completeness, anomalies, schedule) | Data reliability promises enforced and tracked |
| Lineage (source, run_id, timestamps) | Every row stamped automatically for audit trails |
| Compliance (GDPR, EU AI Act) | Regulatory metadata baked into the data layer |
See the full contract reference · Complete annotated template
Technical Capabilities
Data Quality & Trust
- 100% Reconciliation — Mathematically guaranteed: `source = good + bad`. Every row is accounted for — nothing silently dropped
- Pydantic-Powered Validation — Every contract, system, and domain config is parsed through strict Pydantic models with `Literal` type enforcement — invalid YAML is caught at load time, not at runtime
- SQL-First Rules — Define business logic in the language your team already speaks — no SDK, no custom DSL
- SLO Monitoring & Anomaly Detection — Native freshness, row-count, and statistical anomaly detection with automatic multi-channel alerting when thresholds are breached
- Schema Drift Protection — Configurable `schema_policy` controls how the pipeline reacts to unknown columns and schema evolution — the default `"allow"` enables frictionless prototyping; opt in to `"strict"` or `"quarantine"` to lock down production contracts
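For example, locking down a production contract might look like the following sketch (assuming `schema_policy` sits alongside the model definition — check the contract reference for exact placement):

```yaml
model:
  schema_policy: "strict"          # reject runs that introduce unknown columns
  # schema_policy: "quarantine"    # or: route rows with unexpected columns to quarantine
  # schema_policy: "allow"         # default: accept schema evolution while prototyping
  fields:
    - name: cus_id
      type: string
      required: true
```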
Run the Data Quality & Trust Guide in Google Colab
Compliance & Governance
- GDPR & HIPAA Compliance — Contract-driven `forget_subjects()` with nullify, hash, or redact strategies and an immutable audit trail
- Automatic Lineage — Every row stamped with run IDs and source paths — traceable from landing zone to Gold layer
- Pipeline Cost Intelligence — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
Run the Compliance & Governance Guide in Google Colab
Engine & Scale
- Engine Agnostic — Write once, run on Spark, Polars, or DuckDB — same contract, zero code changes
- Dimensional Modeling — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual `MERGE INTO` SQL required
- Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
- Parallel Processing — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
- Backfill & Reprocessing — Targeted reprocessing of late-arriving data with partition-aware filters — no full reload required
- External Logic — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
- Production Resilience — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (`max_consecutive_failures`) — pipelines self-heal from transient failures without operator intervention
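As a hedged sketch of the declarative style (key names below are illustrative, not the authoritative schema), an SCD Type 2 dimension might be declared like this:

```yaml
materialization:
  strategy: scd2                   # keep full history instead of overwriting
  primary_key: [cus_id]
  # illustrative options — exact key names may differ in the contract reference:
  track_columns: [email, tier]     # a change here closes the old row and opens a new version
  effective_from: valid_from
  effective_to: valid_to
```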
Run the Engine & Scale Guide in Google Colab
Developer Experience
- Structured Diagnostics & Observability — Deep contextual logging out of the box (powered by `loguru`) with precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
- Dry Run Mode — Validate contracts, resolve dependencies, and preview execution plans without touching any data
- DDL-Only Mode — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
- DAG Dependency Viewer — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
- Data Reset & Reload — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
- Multi-Channel Alerts — Powered by Apprise for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications, with ownership-based auto-routing and full Jinja2 templating for custom formatting
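An alert block might look like the following sketch — the LakeLogic keys here are hypothetical, wrapping real Apprise URL formats such as `slack://` and `mailto://`:

```yaml
alerts:
  channels:
    - slack://TokenA/TokenB/TokenC/#data-alerts  # Apprise Slack URL format
    - mailto://alerts@acme.example
  route_by_owner: true      # hypothetical key: route to the owning team from _domain.yaml
  template: |               # Jinja2-templated message body
    {{ entity }}: {{ quarantined }} rows quarantined ({{ ratio }}%)
```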
Run the Developer Experience Guide in Google Colab
Data Generation & AI
- Synthetic Data — Built-in `DataGenerator` (powered by Faker) with streaming simulation, time-windowed output, referential integrity, and edge-case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation
- Descriptive AI Test Data — Steer synthetic data generation with natural-language prompts (e.g. "Generate users who are French or Japanese only, enterprise-tier, over 60 years old, with SQL injection attempts in email fields") — output strictly adheres to the YAML contract schema
- AI Contract Onboarding — `lakelogic infer` auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality-rule suggestions
- Unstructured Processing — LLM extraction from PDFs, images, and audio with the same contract validation and lineage
- Automated Run Logs — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
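The edge-case injection idea is easy to picture in plain Python — this sketch is not the `DataGenerator` API, just the concept of mixing a controlled fraction of deliberately bad rows into contract-shaped output:

```python
import random
import string

random.seed(0)  # deterministic for the example

EDGE_CASES = [
    "'; DROP TABLE customers; --",  # SQL injection attempt
    "not_an_email",                 # correctness failure
    "",                             # completeness failure
]

def plausible_email() -> str:
    """A throwaway valid-looking email for the 'good' rows."""
    user = "".join(random.choices(string.ascii_lowercase, k=8))
    return f"{user}@example.com"

def generate_rows(n: int, bad_ratio: float = 0.2) -> list[dict]:
    """Contract-shaped rows with a controlled fraction of quarantine bait."""
    rows = []
    for i in range(n):
        is_bad = random.random() < bad_ratio
        email = random.choice(EDGE_CASES) if is_bad else plausible_email()
        rows.append({"cus_id": f"C{100 + i}", "email": email})
    return rows

rows = generate_rows(20)
bad = [r for r in rows if r["email"] in EDGE_CASES]
print(f"{len(rows)} rows, {len(bad)} deliberately bad")
```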
Run the Data Generation & AI Guide in Google Colab
Integrations
- dbt Adapter — Import dbt `schema.yml` models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
- dlt (Data Load Tool) — Native `DltAdapter` supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival
- Native Streaming Connectors — Built-in `WebSocketConnector`, `SSEConnector`, `KafkaConnector`, and `WebhookConnector` (plus Azure Event Grid, Service Bus, AWS SQS, and GCP Pub/Sub) for real-time data feeds piped directly into contract validation, with pre-validation rename transformations
- Native Database Ingestion — High-performance SQL extraction via Polars/ConnectorX and DuckDB — supports PostgreSQL, MySQL, SQL Server, SQLite, and more with automatic dialect detection
- Incremental CDC — Watermark-based change data capture with automatic state tracking — injects `WHERE updated_at > last_watermark` into the SQL query before data leaves the database
- Batch Processing — Memory-safe chunked ingestion via `fetch_size` for massive initial loads — processes 100 GB+ tables without OOM errors
- Column Projection Pushdown — Automatically constructs precise `SELECT "col1", "col2"` queries from your contract's `model.fields` — only extracts what the contract declares, zero configuration
- Cloud Data Sources — Native `abfss://`, `s3://`, and `gs://` URI support with automatic credential resolution via `CloudCredentialResolver` — Azure AD, AWS IAM roles, GCP ADC, service principals, and Databricks secret scopes all work out of the box
Run the Integrations Guide in Google Colab
Delta Lake & Catalog Support — Lightweight Mode
LakeLogic automatically resolves catalog table names and uses Delta-RS for fast, lightweight Delta Lake operations — no Spark cluster required.
```python
from lakelogic import DataProcessor

# Use Unity Catalog table names directly — lightweight mode
processor = DataProcessor(
    engine="polars",
    contract="contracts/customers.yaml",
)

good_df, bad_df = processor.run_source("main.default.customers")

# LakeLogic automatically:
# 1. Resolves the table name to a storage path
# 2. Uses Delta-RS for fast, Spark-free operations
# 3. Validates data against your contract rules

print(f"Valid: {len(good_df)} | Invalid: {len(bad_df)}")
```
Why LakeLogic?
Stop the "Fragmented Truth" Problem
In a traditional data stack, moving from a Warehouse (SQL) to a Lakehouse (PySpark) means rewriting your validation rules. This duplication creates Logic Drift — where your data quality standards differ depending on which tool is running the code.
With LakeLogic, your Data Contract is the Source of Truth.
- SQL-First Simplicity: Define your constraints and business logic in standard SQL—the language your team already speaks.
- Zero-Friction Portability: Move your pipelines from dbt/Snowflake to Databricks/Spark to Local/Polars with zero changes to your contract.
- True Ownership: Your business logic is a portable asset, independent of your cloud provider or execution engine.
Business Impact: Trust, Speed, and ROI
Slash Compute Costs
Not every job needs a massive Spark cluster. Reduce compute spend by up to 80% for maintenance tasks and small-to-medium datasets by using high-performance engines like Polars or DuckDB.
Guaranteed Integrity
LakeLogic diverts bad data into a safe quarantine zone with absolute precision. This ensures downstream dashboards are never poisoned by "dirty" data, maintaining stakeholder trust.
Full Pipeline Transparency
Eliminate the "Black Box" problem. LakeLogic provides visual drill-downs from board-level KPIs back to the raw source records, ensuring every number is auditable and explainable.
Go Further with LakeLogic
LakeLogic is the open-source engine that enforces your data contracts. Here's how to get the most out of it:
Bootstrap contracts from raw data with --ai — descriptions, PII detection, and SQL rules generated in seconds.
Learn how to organize your contracts for 1,000s of tables using Domain-First ownership and Registries.
Explore the complete template showing every available configuration option for Bronze, Silver, and Gold layers.
Explore how LakeLogic enforces Quality Gates across the Medallion Architecture including Quarantine logic.
Generate realistic edge-case data from your contracts to stress-test quarantine rules before production.
Visit lakelogic.org for the latest guides, blog posts, and community resources.
From the Blog
Latest Posts
- Data Quality Management Without the Platform Tax — Why YAML contracts beat enterprise DQM platforms on cost, flexibility, and version control.
- Row-Level Data Quality in Polars — Without Writing Validation Code — One YAML file replaces 200 lines of Polars validation boilerplate.
- Data Mesh Without the Chaos — How data contracts make domain ownership work at enterprise scale.
- Stop the Spark Tax — One data contract, any engine — eliminate logic drift between Spark, Polars, and DuckDB.