
Deployment Patterns

Note: These patterns describe how LakeLogic is used in production architectures. Local materialization is available; full orchestration remains on the roadmap.

LakeLogic is flexible. You can process your data in discrete batches (Layer by Layer), flow it through the entire architecture in a single pass (End-to-End), or validate it continuously as it streams in (Event-Driven).

1. Pattern A: The Decoupled Layers (Batch)

In this pattern, each layer is a separate "Job" with its own Data Contract. This is the most common approach for large-scale Lakehouses.

The Workflow:

  1. Job 1: Ingest Raw -> Bronze. (Focus: Schema Protection)
  2. Job 2: Process Bronze -> Silver. (Focus: Quality Rules & PII)
  3. Job 3: Aggregate Silver -> Gold. (Focus: Business Logic & Fact Strategies)
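Each job gets its own small contract. As a minimal sketch of Job 2 (Bronze -> Silver), reusing the field names from the end-to-end contract shown under Pattern B — the exact shape of a per-layer contract is an assumption here:

```yaml
# silver_crm.yaml — hypothetical contract for Job 2 only
info:
  title: CRM Silver
  target_layer: silver   # this job stops at Silver; Gold is a separate contract

server:
  type: delta            # assumed: reads the Bronze table written by Job 1
  mode: ingest

transformations:
  - sql: |
      SELECT *, LOWER(email) AS clean_email
      FROM source
    phase: post

primary_key: [user_id]
materialization:
  strategy: merge
```

Because each contract names its own target layer, each job can run on its own schedule and fail independently of the others.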

Why use this?

  • Isolation: If the Gold job fails, your Silver data is still safe and available.
  • Independent Scaling: You can run Ingestion every 5 minutes, but only run Gold aggregates once an hour.
  • Easier Debugging: You can see exactly which layer failed.
  • Multi-Platform Orchestration: You can run Bronze in Azure Data Factory, Silver in Databricks, and Gold in Spark-based lakehouses (Fabric/Synapse/Databricks) or Snowflake/BigQuery (table-only).
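The isolation property above can be sketched with a minimal Python driver. The `run_layers` helper and the three stand-in jobs are illustrative only — in practice each callable would be whatever triggers the LakeLogic job (a CLI call, a notebook, an ADF activity):

```python
# Minimal sketch of a layered driver: each layer is an independent job,
# so a failure downstream never touches data already landed upstream.
from typing import Callable, Dict

def run_layers(jobs: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run layer jobs in order; stop at the first failure."""
    status = {}
    for layer, job in jobs.items():
        try:
            job()
            status[layer] = "succeeded"
        except Exception as exc:
            status[layer] = f"failed: {exc}"
            break  # downstream layers are skipped; upstream tables stay intact
    return status

# Illustrative stand-ins for the three jobs in the workflow above.
def ingest_bronze():  pass  # Job 1: Raw -> Bronze
def process_silver(): pass  # Job 2: Bronze -> Silver
def aggregate_gold():       # Job 3 fails this run
    raise RuntimeError("bad business rule")

result = run_layers({"bronze": ingest_bronze,
                     "silver": process_silver,
                     "gold": aggregate_gold})
```

Here Bronze and Silver complete and their tables remain available; only the Gold job needs a re-run.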

2. Pattern B: The End-to-End Pipe (Low Latency)

For smaller datasets or real-time requirements, you can flow data from Raw all the way to Gold in a single LakeLogic execution.

The Workflow: You define a single "Pipeline Contract" that includes both the Ingestion settings and the final Gold materialization logic.

```yaml
# crm_full_pipeline.yaml
info:
  title: CRM End-to-End
  target_layer: gold  # The final destination

server:
  type: s3
  mode: ingest

# In-memory transformation from Bronze to Silver logic
transformations:
  - sql: |
      SELECT *, LOWER(email) AS cleaner_email
      FROM source
    phase: post

# Final Materialization into Gold
primary_key: [user_id]
materialization:
  strategy: merge
```

Why use this?

  • Speed: No "rest stops" at Bronze or Silver. Data is ready for business faster.
  • Simplicity: Only one YAML file and one CLI command to manage.

3. Pattern C: Streaming & Event-Driven (Real-Time)

For modern data stacks, data doesn't wait for a batch window. It flows continuously. LakeLogic can be integrated into streaming pipelines to provide real-time quality gating.

The Workflow:

  1. Event Trigger: A file lands in S3 or a message arrives in Kafka.
  2. Serverless Execution: An AWS Lambda or Azure Function triggers a LakeLogic execution on that specific record or micro-batch.
  3. Spark Streaming: LakeLogic runs inside a foreachBatch sink in Spark Structured Streaming to validate every micro-batch before it is committed to the Delta/Iceberg table.
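The quality gate in step 3 can be sketched without Spark. The function below mimics what a foreachBatch sink does with each micro-batch — apply rules, commit the clean rows, quarantine the rest. The rules and function names are illustrative, not LakeLogic's actual API:

```python
# Pure-Python sketch of micro-batch quality gating (illustrative only).
from typing import Dict, List, Tuple

def gate_batch(records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split a micro-batch into rows safe to commit and rows to quarantine."""
    def is_valid(row: Dict) -> bool:
        # Stand-in quality rules; a real contract would declare these in YAML.
        return bool(row.get("user_id")) and "@" in str(row.get("email", ""))
    good = [r for r in records if is_valid(r)]
    bad = [r for r in records if not is_valid(r)]
    return good, bad

batch = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": None, "email": "b@example.com"},  # fails: missing key
    {"user_id": 3, "email": "not-an-email"},      # fails: malformed email
]
good, bad = gate_batch(batch)
```

In Spark Structured Streaming, the equivalent logic would run inside `df.writeStream.foreachBatch(...)`, committing `good` to the Delta/Iceberg table and routing `bad` to a quarantine sink.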

Why use this?

  • Immediate Alerts: Get a Slack notification for bad data seconds after it is generated.
  • Incremental Cost: Only process the new data, keeping compute costs low.
  • Clean Live Dashboards: Ensures that live "Streaming Gold" tables are never poisoned by malformed events.
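Step 2 of the workflow (serverless execution) is often a thin S3-triggered handler. Only the S3 event shape below is standard; the `run_contract` stand-in is hypothetical, since LakeLogic's actual invocation entry point is not shown here:

```python
# Hypothetical AWS Lambda handler: fire a LakeLogic run for the exact
# object that just landed in the bucket.
def run_contract(contract: str, source_path: str) -> dict:
    # Stand-in for invoking LakeLogic (CLI, container, or library call).
    return {"contract": contract, "source": source_path, "status": "passed"}

def handler(event, context=None):
    record = event["Records"][0]            # standard S3 notification shape
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return run_contract("crm_full_pipeline.yaml", f"s3://{bucket}/{key}")

# Example event, trimmed to the fields the handler reads:
event = {"Records": [{"s3": {"bucket": {"name": "landing"},
                             "object": {"key": "crm/2024/01/batch.json"}}}]}
result = handler(event)
```

Because the handler processes only the object that triggered it, compute cost tracks the volume of new data, which is the incremental-cost point above.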

4. Gold Best Practices (Aggregates and Traceability)

Gold tables are where business decisions happen. A few defaults help keep them trustworthy and explainable:

  • Use merge with a primary key when Gold is updated incrementally.
  • Keep lineage lean: often store only _lakelogic_run_id in Gold and rely on run logs for the rest.
  • Preserve upstream run ids when needed: capture _upstream_lakelogic_run_ids for full traceability.
  • Roll up source keys when aggregating to enable drill-down.
  • Add rollup key counts so you can validate scale without scanning arrays.

Example (Gold rollup with optional traceability):

```yaml
lineage:
  enabled: true
  capture_run_id: true
  capture_timestamp: false
  capture_source_path: false
  capture_domain: false
  capture_system: false
  run_id_source: pipeline_run_id
  preserve_upstream: ["_lakelogic_run_id"]
  upstream_prefix: "_upstream"

primary_key: ["sale_date"]
materialization:
  strategy: merge

transformations:
  - rollup:
      group_by: ["sale_date"]
      aggregations:
        total_sales: "SUM(amount)"
      keys: "sale_id"
      rollup_keys_column: "_lakelogic_rollup_keys"
      rollup_keys_count_column: "_lakelogic_rollup_keys_count"
      upstream_run_id_column: "_upstream_run_id"
      upstream_run_ids_column: "_upstream_lakelogic_run_ids"
```
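The rollup semantics can be illustrated in plain Python: group Silver rows by sale_date, sum the amounts, collect the source keys for drill-down, and store the count so scale checks never scan the array. This is a sketch of the behavior, not LakeLogic's implementation:

```python
# Plain-Python sketch of the Gold rollup semantics above (illustrative).
from collections import defaultdict

def rollup(rows):
    groups = defaultdict(lambda: {"total_sales": 0.0, "keys": []})
    for r in rows:
        g = groups[r["sale_date"]]
        g["total_sales"] += r["amount"]
        g["keys"].append(r["sale_id"])  # preserved for drill-down
    return [
        {"sale_date": d,
         "total_sales": g["total_sales"],
         "_lakelogic_rollup_keys": g["keys"],
         "_lakelogic_rollup_keys_count": len(g["keys"])}  # cheap scale check
        for d, g in groups.items()
    ]

silver = [
    {"sale_date": "2024-01-01", "sale_id": "s1", "amount": 10.0},
    {"sale_date": "2024-01-01", "sale_id": "s2", "amount": 5.0},
    {"sale_date": "2024-01-02", "sale_id": "s3", "amount": 7.5},
]
gold = rollup(silver)
```

A validation query can then compare `_lakelogic_rollup_keys_count` against expected volumes without touching the key arrays themselves.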

Comparison: Which one is right for you?

| Feature    | Decoupled (Layered)           | End-to-End (Single Pass)    | Streaming (Micro-Batch)    |
| ---------- | ----------------------------- | --------------------------- | -------------------------- |
| Recovery   | Easy: re-run the failed layer | Harder: re-run whole pipe   | Automatic (checkpointing)  |
| Complexity | Medium (needs orchestrator)   | Low (standalone)            | High (needs Kafka/Lambda)  |
| Latency    | High (minutes/hours)          | Low (minutes)               | Near-instant (seconds)     |
| Use Case   | Financial reporting           | Simple ETL/ELT              | Fraud detection / IoT      |

Summary

Most companies start with Pattern B for their first project and grow into Pattern A as their Lakehouse matures into a "Data Mesh." Real-time environments leverage Pattern C to ensure that streaming tables remain business-ready. LakeLogic provides the building blocks for all three, with full orchestration support planned.


Scaling Up: LakeLogic.org provides visual lineage and AI-powered contract generation on top of LakeLogic.