Deployment Patterns

Think of deployment like a restaurant kitchen. You can run one chef who does everything (end-to-end), a brigade where each station handles one course (decoupled layers), or a conveyor belt that never stops (streaming). LakeLogic supports all three.


Which Pattern Is Right for You?

Feature | Decoupled (Layered) | End-to-End (Single Pass) | Streaming (Micro-Batch)
Recovery | Re-run the failed layer only | Re-run the entire pipeline | Automatic (checkpointing)
Complexity | Medium — needs an orchestrator | Low — single YAML | High — needs Kafka/Lambda
Latency | Minutes to hours | Minutes | Seconds
Best for | Enterprise lakehouses | Small projects, POCs | Fraud detection, IoT

Most teams start with End-to-End for their first project, then graduate to Decoupled as their data mesh grows.


Pattern A: Decoupled Layers (Production Standard)

Each layer is a separate job with its own contract. This is the standard pattern for large-scale lakehouses.

Job 1: Raw → Bronze   (schema protection, lineage stamping)
Job 2: Bronze → Silver (quality rules, PII masking, dedup)
Job 3: Silver → Gold   (aggregations, business logic)

Why this works

  • Isolation — if Gold fails, Bronze and Silver data is safe
  • Independent scaling — ingest every 5 minutes, aggregate hourly
  • Easier debugging — pinpoint exactly which layer failed
  • Multi-platform — mix engines per layer to optimise cost and performance
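The recovery benefit can be sketched with a tiny driver loop. The job functions below are stubs standing in for the three layer jobs; no LakeLogic API is assumed:

```python
# Sketch: after a mid-pipeline failure, resume from the failed layer only.
# The job functions are stubs standing in for the three layer jobs.

def run_layers(jobs, start_at=0):
    """Run jobs in order from start_at; return index of first failure, else None."""
    for i, job in enumerate(jobs):
        if i < start_at:
            continue  # earlier layers already succeeded; their data is safe
        try:
            job()
        except RuntimeError:
            return i
    return None

log = []
def bronze(): log.append("bronze")
def silver(): log.append("silver")
def gold_broken(): raise RuntimeError("gold failed")
def gold_fixed(): log.append("gold")

jobs = [bronze, silver, gold_broken]
failed = run_layers(jobs)            # Bronze and Silver succeed; Gold fails
jobs[failed] = gold_fixed            # deploy the fix...
run_layers(jobs, start_at=failed)    # ...and re-run only the Gold layer
```

Bronze and Silver run exactly once; only the failed Gold layer is retried. An orchestrator's retry-from-failed-task feature gives you the same behaviour for free.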

Deployment Topologies

Topology | Bronze | Silver | Gold | Best For
All-in-one | Databricks (Spark) | Databricks (Spark) | Databricks (Spark) | Enterprises already on Databricks
Serverless (Azure) | Azure Functions (Polars) | Azure Functions (Polars) | Databricks / Spark | Cost-effective — serverless for ingestion, Spark only for heavy aggregations
Serverless (AWS) | Lambda (Polars) | Lambda (Polars) | EMR / Databricks | Same pattern on AWS — pay-per-invocation for Bronze & Silver
Microsoft Fabric | Fabric Notebooks (Polars) | Fabric Notebooks (Polars) | Fabric / Spark | Native Fabric integration

The key insight: Bronze and Silver are often lightweight (schema check, dedup, quality rules) — they don't need a Spark cluster. Run them on Polars in serverless functions for pennies, and reserve Spark for Gold aggregations where you actually need distributed compute.
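To see how lightweight that work really is, here is a sketch of a Bronze/Silver-style schema check and dedup in plain Python — small enough for any serverless function. Field names are illustrative, and this stands in for (rather than reproduces) what the engine does:

```python
# Sketch of the lightweight checks Bronze/Silver typically need: a
# required-field schema check plus key-based dedup. No Spark cluster required.

REQUIRED = {"customer_id", "email"}

def schema_check(rows):
    """Split rows into those with all required fields and those without."""
    good, bad = [], []
    for row in rows:
        (good if REQUIRED <= row.keys() else bad).append(row)
    return good, bad

def dedup(rows, key="customer_id"):
    """Keep the last-seen row per key."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())

rows = [
    {"customer_id": "c1", "email": "a@x.com"},
    {"customer_id": "c1", "email": "a@x.org"},   # duplicate key, later row wins
    {"customer_id": "c2"},                        # missing required field
]
good, bad = schema_check(rows)
clean = dedup(good)   # one row survives for c1; c2 is quarantined
```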

Example contracts

# bronze_customers.yaml — Job 1
dataset: bronze_crm_customers
source:
  path: "abfss://landing@acct.dfs.core.windows.net/crm/customers/"
  load_mode: incremental
  watermark_strategy: pipeline_log

model:
  fields:
    - name: customer_id
      type: string
      required: true
    - name: email
      type: string
    - name: updated_at
      type: timestamp

materialization:
  strategy: append
  target_path: "abfss://lake@acct.dfs.core.windows.net/bronze/crm_customers"

# silver_customers.yaml — Job 2
dataset: silver_crm_customers
source:
  type: table
  path: "abfss://lake@acct.dfs.core.windows.net/bronze/crm_customers"
  load_mode: incremental
  watermark_strategy: max_target
  watermark_field: "_lakelogic_processed_at"

transformations:
  - deduplicate: [customer_id]
  - sql: "SELECT *, LOWER(TRIM(email)) AS email FROM source"
    phase: pre

model:
  fields:
    - name: customer_id
      type: string
      required: true
    - name: email
      type: string
      required: true
      pii: true
      masking: "hash"

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"

materialization:
  strategy: merge
  primary_key: [customer_id]
  target_path: "abfss://lake@acct.dfs.core.windows.net/silver/crm_customers"
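What the Silver contract asks the engine to do can be sketched in plain Python. The hash masking and the email rule below are simplified stand-ins for LakeLogic's actual behaviour, using only the stdlib:

```python
import hashlib
import re

def to_silver(bronze_rows):
    """Sketch of the Silver contract: dedup by customer_id, normalise the
    email, hash-mask it (pii: true), and quarantine rows failing the rule."""
    latest = {}
    for row in bronze_rows:                            # deduplicate: [customer_id]
        latest[row["customer_id"]] = row
    good, quarantined = [], []
    for row in latest.values():
        email = row["email"].strip().lower()           # LOWER(TRIM(email))
        if not re.fullmatch(r"[^@]+@[^@]+\.[^@]+", email):  # ~ email LIKE '%@%.%'
            quarantined.append(row)
            continue
        good.append({
            "customer_id": row["customer_id"],
            "email": hashlib.sha256(email.encode()).hexdigest(),  # masking: "hash"
        })
    return good, quarantined

good, bad = to_silver([
    {"customer_id": "c1", "email": " A@X.com "},
    {"customer_id": "c2", "email": "not-an-email"},
])
```

Note that normalisation happens before masking, so `A@X.com` and `a@x.com` hash to the same value — that is why the `sql` transformation runs in the `pre` phase.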

Pattern B: End-to-End (Quick Start)

Single contract, single execution. Data flows from source through validation to materialization in one pass.

# crm_pipeline.yaml — everything in one contract
dataset: silver_crm_customers

source:
  path: "data/customers.csv"

transformations:
  - deduplicate: [customer_id]

model:
  fields:
    - name: customer_id
      type: string
      required: true
    - name: email
      type: string
      pii: true
      masking: "hash"

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"

materialization:
  strategy: merge
  primary_key: [customer_id]
  target_path: output/silver_customers

Why this works

  • Simplicity — one YAML file, one CLI command
  • Speed — no intermediate writes, data flows in memory
  • Great for POCs — prove value before building the full architecture

Pattern C: Streaming & Event-Driven (Real-Time)

For continuous data flows, LakeLogic integrates into streaming pipelines to provide real-time quality gating.

Spark Structured Streaming

from lakelogic import DataProcessor

proc = DataProcessor(contract="events.yaml")

# Illustrative target paths
silver_path = "/lake/silver/events"
quarantine_path = "/lake/quarantine/events"

# Use LakeLogic inside foreachBatch: validate each micro-batch, then route
# passing rows to Silver and failing rows to quarantine
def validate_batch(batch_df, batch_id):
    result = proc.run_dataframe(batch_df)
    result.good.write.format("delta").mode("append").save(silver_path)
    result.bad.write.format("delta").mode("append").save(quarantine_path)

# `stream` is an existing Spark Structured Streaming DataFrame
stream.writeStream.foreachBatch(validate_batch).start()
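The good/bad split inside each micro-batch can be sketched without Spark. The rule here is a hypothetical stand-in, and lists stand in for the Delta tables:

```python
# Sketch of the per-batch quality gate: partition rows by a rule, append
# passes to Silver and failures to quarantine. Lists stand in for tables.

silver, quarantine = [], []

def validate_batch(batch, batch_id):
    good = [r for r in batch if "@" in r.get("email", "")]
    bad = [r for r in batch if r not in good]
    silver.extend(good)        # result.good -> Silver table
    quarantine.extend(bad)     # result.bad  -> quarantine table

validate_batch([{"email": "a@x.com"}, {"email": "bad"}], batch_id=0)
validate_batch([{"email": "b@y.com"}], batch_id=1)   # each micro-batch appends
```

Because both writes are appends keyed by batch, a restarted stream replays cleanly from its checkpoint — this is what makes recovery "automatic" in the comparison table above.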

Serverless (Lambda / Azure Functions)

# Triggered by S3/ADLS file arrival (AWS Lambda signature shown;
# Azure Functions uses a different entry-point signature)
def handler(event, context):
    from lakelogic import DataProcessor
    file_path = event["Records"][0]["s3"]["object"]["key"]
    result = DataProcessor(contract="events.yaml").run_source(file_path)
    # Bad data triggers Slack alert automatically (via contract notifications)
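The `event["Records"][0]["s3"]["object"]["key"]` lookup follows AWS's S3 event notification format; a minimal sketch with a sample payload:

```python
# Sketch: pulling the object key out of an S3 put-event payload, as the
# handler above does. The event shape follows AWS's S3 notification format.

def object_key(event):
    return event["Records"][0]["s3"]["object"]["key"]

sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "landing"},
                "object": {"key": "crm/customers/2024-06-01.csv"}}}
    ]
}
key = object_key(sample_event)   # "crm/customers/2024-06-01.csv"
```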

Why this works

  • Immediate alerts — bad data flagged within seconds
  • Incremental cost — only process new data
  • Clean live dashboards — streaming Gold tables are never poisoned

Orchestrator Integration

LakeLogic works with any orchestrator. Use the CLI or Python API:

lakelogic run contracts/crm/customers_silver.yaml \
  --window last_success \
  --summary-table run_logs.pipeline_summaries

Airflow:

from airflow.operators.bash import BashOperator

validate_customers = BashOperator(
    task_id="validate_customers",
    bash_command="lakelogic run contracts/crm/customers_silver.yaml",
)

Databricks:

# Databricks notebook cell
from lakelogic import DataProcessor

result = DataProcessor(
    contract="/Workspace/contracts/customers_silver.yaml",
    engine="spark"
).run()

Prefect:

from prefect import flow
from lakelogic import DataProcessor

@flow
def validate_customers():
    return DataProcessor("contracts/customers.yaml").run()

CLI Flags for Production

Flag | Purpose
--window last_success | Resume from last successful watermark
--summary-path ./run.json | Emit per-run JSON summary
--summary-table db.run_logs | Write summaries to a table for dashboards
--continue-on-error | Best-effort run, report all failures in one pass
--metrics-backend prometheus | Expose /metrics endpoint for monitoring
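A common follow-up is to gate downstream tasks on the per-run summary. A sketch, assuming a hypothetical summary shape — the actual JSON fields emitted by `--summary-path` may differ:

```python
import json

# Hypothetical summary shape; check your actual --summary-path output.
# The point: fail the orchestrator task when too many rows were quarantined.
summary = json.loads(
    '{"dataset": "silver_crm_customers", "rows_good": 980, "rows_bad": 20}'
)

def gate(summary, max_bad_ratio=0.05):
    """Return True when the bad-row ratio is within tolerance."""
    total = summary["rows_good"] + summary["rows_bad"]
    return summary["rows_bad"] / total <= max_bad_ratio

ok = gate(summary)   # 20 / 1000 = 2% bad rows, within the 5% tolerance
```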

Gold Layer Best Practices

Gold tables are where business decisions happen. A few defaults keep them trustworthy:

  • Use merge with a primary key when Gold is updated incrementally
  • Keep lineage lean — often store only _lakelogic_run_id in Gold; rely on run logs for the full trail
  • Roll up source keys when aggregating, to enable drill-down
  • Capture upstream run IDs via preserve_upstream for full cross-layer traceability

# Gold rollup example
materialization:
  strategy: merge
  primary_key: [sale_date]

lineage:
  enabled: true
  capture_run_id: true
  preserve_upstream: ["_lakelogic_run_id"]

transformations:
  - rollup:
      group_by: [sale_date]
      aggregations:
        total_sales: "SUM(amount)"       # → output column: total_sales
        order_count: "COUNT(*)"          # → output column: order_count
        avg_order: "AVG(amount)"         # → output column: avg_order
      keys: "sale_id"
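The rollup amounts to a group-by with three aggregates plus a collected key list for drill-down. A stdlib sketch (column names from the contract above, data invented):

```python
from collections import defaultdict

def gold_rollup(sales):
    """Group by sale_date; compute SUM/COUNT/AVG of amount and collect the
    source sale_ids for drill-down (the `keys:` option in the contract)."""
    groups = defaultdict(list)
    for row in sales:
        groups[row["sale_date"]].append(row)
    out = []
    for sale_date, rows in groups.items():
        amounts = [r["amount"] for r in rows]
        out.append({
            "sale_date": sale_date,
            "total_sales": sum(amounts),               # SUM(amount)
            "order_count": len(rows),                  # COUNT(*)
            "avg_order": sum(amounts) / len(rows),     # AVG(amount)
            "sale_ids": [r["sale_id"] for r in rows],  # rolled-up source keys
        })
    return out

gold = gold_rollup([
    {"sale_date": "2024-06-01", "sale_id": "s1", "amount": 10.0},
    {"sale_date": "2024-06-01", "sale_id": "s2", "amount": 30.0},
    {"sale_date": "2024-06-02", "sale_id": "s3", "amount": 5.0},
])
```

Keeping the `sale_ids` list in Gold is what lets an analyst jump from an anomalous daily total straight back to the Silver rows that produced it.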