# Deployment Patterns
Note: These patterns describe how LakeLogic is used in production architectures. Local materialization is available; full orchestration remains on the roadmap.
LakeLogic is flexible. You can choose to process your data in discrete batches (Layer by Layer) or flow it through the entire architecture in a single pass (End-to-End).
## 1. Pattern A: The Decoupled Medallion (Recommended)
In this pattern, each layer is a separate "Job" with its own Data Contract. This is the most common approach for large-scale Lakehouses.
The Workflow:
1. Job 1: Ingest Raw -> Bronze. (Focus: Schema Protection)
2. Job 2: Process Bronze -> Silver. (Focus: Quality Rules & PII)
3. Job 3: Aggregate Silver -> Gold. (Focus: Business Logic & Fact Strategies)
Why use this?
- Isolation: If the Gold job fails, your Silver data is still safe and available.
- Independent Scaling: You can run Ingestion every 5 minutes, but only run Gold aggregates once an hour.
- Easier Debugging: You can see exactly which layer failed.
- Multi-Platform Orchestration: You can run Bronze in Azure Data Factory, Silver in Databricks, and Gold in Spark-based lakehouses (Fabric/Synapse/Databricks) or Snowflake/BigQuery (table-only).
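The isolation benefit above can be sketched in a few lines of plain Python. This is a hypothetical orchestration skeleton, not the LakeLogic API: the job functions and results dictionary are illustrative, standing in for whatever your orchestrator (ADF, Airflow, Databricks Workflows) actually runs.

```python
# Hypothetical sketch: three decoupled layer jobs run independently,
# so a failure in one layer never rolls back the layers before it.
def run_layer(name, job, results):
    """Run one layer job; record success or failure without aborting the rest."""
    try:
        results[name] = job()
    except Exception as exc:
        results[name] = f"failed: {exc}"

def ingest_bronze():
    return "bronze table written"       # Job 1: Raw -> Bronze

def process_silver():
    return "silver table written"       # Job 2: Bronze -> Silver

def aggregate_gold():
    raise RuntimeError("bad business rule")  # Job 3: Silver -> Gold fails

results = {}
run_layer("bronze", ingest_bronze, results)
run_layer("silver", process_silver, results)
run_layer("gold", aggregate_gold, results)

# Bronze and Silver stay intact even though Gold failed,
# so recovery means re-running only the Gold job.
```

Because each layer is a separate job with its own contract, the failed Gold run can be retried on its own schedule while Silver keeps serving consumers.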
## 2. Pattern B: The End-to-End Pipe (Low Latency)
For smaller datasets or real-time requirements, you can flow data from Raw all the way to Gold in one single LakeLogic execution.
The Workflow: You define a single "Pipeline Contract" that includes both the Ingestion settings and the final Gold materialization logic.
```yaml
# crm_full_pipeline.yaml
info:
  title: CRM End-to-End
  target_layer: gold  # The final destination
server:
  type: s3
mode: ingest
# In-memory transformation from Bronze to Silver logic
transformations:
  - sql: |
      SELECT *, LOWER(email) AS cleaner_email
      FROM source
    phase: post
# Final Materialization into Gold
primary_key: [user_id]
materialization:
  strategy: merge
```
Why use this?
- Speed: No "rest stops" at Bronze or Silver. Data is ready for business faster.
- Simplicity: Only one YAML file and one CLI command to manage.
## 3. Pattern C: Streaming & Event-Driven (Real-Time)
For modern data stacks, data doesn't wait for a batch window. It flows continuously. LakeLogic can be integrated into streaming pipelines to provide real-time quality gating.
The Workflow:
1. Event Trigger: A file lands in S3 or a message arrives in Kafka.
2. Serverless execution: An AWS Lambda or Azure Function triggers a LakeLogic execution on that specific record or micro-batch.
3. Spark Streaming: LakeLogic is used inside a `foreachBatch` sink in Spark Structured Streaming to validate every micro-batch before it is committed to the Delta/Iceberg table.
Why use this?
- Immediate Alerts: Get a Slack notification for bad data seconds after it is generated.
- Incremental Cost: Only process the new data, keeping compute costs low.
- Clean Live Dashboards: Ensures that live "Streaming Gold" tables are never poisoned by malformed events.
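The gating step can be approximated in plain Python. This is a minimal sketch of the idea behind a `foreachBatch` quality gate, not LakeLogic's actual API: the rule set, function names, and record shape are all assumptions for illustration.

```python
# Hypothetical micro-batch quality gate, in the spirit of a foreachBatch sink:
# validate each record, commit the clean rows, route the rest to an alert path.
def validate(record):
    """Return True when the record satisfies the (illustrative) quality rules."""
    return (
        record.get("user_id") is not None
        and "@" in record.get("email", "")
    )

def gate_micro_batch(batch):
    """Split a micro-batch into rows to commit and rows to quarantine."""
    good = [r for r in batch if validate(r)]
    bad = [r for r in batch if not validate(r)]
    return good, bad

batch = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": None, "email": "b@example.com"},  # malformed: missing key
    {"user_id": 3, "email": "not-an-email"},      # malformed: bad email
]
good, bad = gate_micro_batch(batch)
# Only the clean row reaches the "Streaming Gold" table; the two bad rows
# would trigger the alerting path (e.g. the Slack notification) instead.
```

In a real pipeline this split happens inside the `foreachBatch` function before the write to the Delta/Iceberg table, so malformed events never land in the live table at all.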
## 4. Gold Best Practices (Aggregates and Traceability)
Gold tables are where business decisions happen. A few defaults help keep them trustworthy and explainable:
- Use `merge` with a primary key when Gold is updated incrementally.
- Keep lineage lean: often store only `_lakelogic_run_id` in Gold and rely on run logs for the rest.
- Preserve upstream run ids when needed: capture `_upstream_lakelogic_run_ids` for full traceability.
- Roll up source keys when aggregating to enable drill-down.
- Add rollup key counts so you can validate scale without scanning arrays.
Example (Gold rollup with optional traceability):
```yaml
lineage:
  enabled: true
  capture_run_id: true
  capture_timestamp: false
  capture_source_path: false
  capture_domain: false
  capture_system: false
  run_id_source: pipeline_run_id
  preserve_upstream: ["_lakelogic_run_id"]
  upstream_prefix: "_upstream"
primary_key: ["sale_date"]
materialization:
  strategy: merge
transformations:
  - rollup:
      group_by: ["sale_date"]
      aggregations:
        total_sales: "SUM(amount)"
      keys: "sale_id"
      rollup_keys_column: "_lakelogic_rollup_keys"
      rollup_keys_count_column: "_lakelogic_rollup_keys_count"
      upstream_run_id_column: "_upstream_run_id"
      upstream_run_ids_column: "_upstream_lakelogic_run_ids"
```
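To make the rollup output concrete, here is a plain-Python sketch of what the resulting Gold rows contain. The column names follow the contract above; the input rows are made-up sample data, and this is an illustration of the rollup semantics, not LakeLogic's implementation.

```python
# Illustrative rollup: group sales by date, sum amounts, and carry the
# source keys plus a key count for drill-down and scale validation.
from collections import defaultdict

rows = [
    {"sale_date": "2024-01-01", "sale_id": "s1", "amount": 10.0},
    {"sale_date": "2024-01-01", "sale_id": "s2", "amount": 5.0},
    {"sale_date": "2024-01-02", "sale_id": "s3", "amount": 7.5},
]

groups = defaultdict(list)
for row in rows:
    groups[row["sale_date"]].append(row)

gold = [
    {
        "sale_date": date,
        "total_sales": sum(r["amount"] for r in grp),           # SUM(amount)
        "_lakelogic_rollup_keys": [r["sale_id"] for r in grp],  # drill-down keys
        "_lakelogic_rollup_keys_count": len(grp),               # scale check
    }
    for date, grp in sorted(groups.items())
]
# gold[0] -> total_sales 15.0 with keys ["s1", "s2"] and count 2
```

The count column lets you sanity-check aggregate scale without scanning the key arrays, which is exactly why the best practices above recommend it.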
## Comparison: Which one is right for you?
| Feature | Decoupled (Layered) | End-to-End (Single Pass) | Streaming (Micro-Batch) |
|---|---|---|---|
| Recovery | Easy: Re-run failed layer. | Harder: Re-run whole pipe. | Automatic (Checkpointing). |
| Complexity | Medium (Needs Orchestrator). | Low (Standalone). | High (Needs Kafka/Lambda). |
| Latency | High (Minutes/Hours). | Low (Minutes). | Near-Instant (Seconds). |
| Use Case | Financial Reporting. | Simple ETL/ELT. | Fraud Detection / IoT. |
## Summary
Most companies start with Pattern B for their first project and grow into Pattern A as their Lakehouse matures into a "Data Mesh." Real-time environments leverage Pattern C to ensure that streaming tables remain business-ready. LakeLogic provides the building blocks for all three, with full orchestration support planned.
Scaling Up: LakeLogic.org provides visual lineage and AI-powered contract generation on top of LakeLogic.