System Configuration (_system.yaml)

The system registry defines storage, environments, and contract listings for a single source system within a domain.

Think of it like a building's utility plan. The domain config says "this is the Marketing building," and the system config says "here's where the electricity, water, and data pipes connect for the Google Analytics floor."

domains_retail/marketing/
└── google_analytics/
    ├── _system.yaml          ← This file
    ├── bronze/
    │   └── events_v1.0.yaml
    └── silver/
        └── sessions_v1.0.yaml

Core Structure

Every _system.yaml follows this pattern. Customise the values to match your environment:

Example: Minimal system registry

domain: marketing
system: google_analytics

# ── Metadata & Run Logs ──────────────────────────────────────
metadata:
  run_log_table: "{log_path}"
  run_log_backend: "delta"

# ── Server (per layer, contract overrides) ───────────────────
# Applied to all contracts in this system unless overridden.
server:
  bronze:
    mode: "ingest"
    format: "delta"
    schema_policy:
      evolution: "append"
      unknown_fields: "allow"
    cast_to_string: true
  silver:
    mode: "validate"
    format: "delta"
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"
  gold:
    mode: "validate"
    format: "delta"
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"

# ── Materialization Defaults ─────────────────────────────────
materialization:
  bronze:
    strategy: append
    format: delta
  silver:
    strategy: merge
    format: delta
    merge_dedup_guard: true
  gold:
    strategy: merge
    format: delta

# ── Lineage ──────────────────────────────────────────────────
lineage:
  enabled: true
  source_column_name: "_lakelogic_source"
  timestamp_column_name: "_lakelogic_processed_at"

# ── Quarantine ───────────────────────────────────────────────
quarantine:
  enabled: true
  fail_on_quarantine: false
  target: "{quarantine_root}/{domain}/{system}"

# ── Extraction Defaults ─────────────────────────────────────
# System-level LLM extraction defaults (override domain defaults).
extraction_defaults:
  provider: "azure_openai"
  model: "gpt-4o"
  temperature: 0.0
  max_cost_per_run: 25.00
  redact_pii_before_llm: true

# ── Notifications ───────────────────────────────────────────
# System-specific channels (concatenated with domain channels).
notifications_enabled: true   # Global switch to disable all system, domain, and contract notifications
notifications:
  - target: "https://hooks.slack.com/services/YOUR/SYSTEM-SPECIFIC/WEBHOOK"
    on_events: ["failure", "quarantine"]

# ── Cost Observability ──────────────────────────────────
# Track estimated compute cost per pipeline run.
# Budget limits and currency are inherited from _domain.yaml
cost:
  provider: "manual"
  attribution: "duration_proportional"
  currency: "USD"

  rates:
    dbu_per_hour: 0.22
    storage_per_gb_month: 0.023

  # Optional: account for autoscaling clusters
  cluster:
    min_nodes: 2
    max_nodes: 8
    scaling_assumption: "avg"

# ── Storage ──────────────────────────────────────────────────
storage:
  # ── Table Resolution ──────────────────────────────────────
  # UC mode (Databricks):  tables resolve via domain_catalog
  # Direct mode (DuckDB/Polars/Fabric/Synapse/EMR): tables resolve via external_location_root
  domain_catalog: "`{catalog}`.{domain}"
  external_location_root: "{data_root}"

  # ── Operational Paths (UC mode — Databricks Volumes) ──────
  # Only used when storage_mode="uc". Direct mode ignores them.
  contract_root: "/Workspace/Shared/data_platform/domains/{domain}/{system}"
  landing_root: "/Volumes/{catalog}/nondelta/landing_{domain}/{system}"
  log_root: "/Volumes/{catalog}/nondelta/_logs"

  # ── Storage Paths (direct mode — cloud/local) ─────────────
  # Driven by {storage_root}, {data_root}, {quarantine_root}
  # which are defined per-environment below.
  landing_path: "{storage_root}/_data/{domain}/{system}"
  contract_path: "{storage_root}/_contracts/{domain}/{system}"
  log_path: "{storage_root}/_logs/{domain}"
  quarantine_path: "{quarantine_root}/{domain}/{system}"

# ── Cloud Storage Anchors (DRY) ─────────────────────────────
x-azure-storage: &azure_storage
  storage_root: "abfss://nondelta@{storage_account}.dfs.core.windows.net"
  data_root: "abfss://{domain}@{storage_account}.dfs.core.windows.net"
  quarantine_root: "abfss://quarantine@{storage_account}.dfs.core.windows.net"

# ── Environments ────────────────────────────────────────────
environments:
  dev:
    catalog: "lakelogic-lakehouse-dev-001"
    storage_account: "salakelogicdevadls001"
    <<: *azure_storage
  local:
    catalog: "local"
    storage_root: "./lakehouse"
    data_root: "./lakehouse/{domain}"
    quarantine_root: "./lakehouse/_quarantine"
  colab:
    catalog: "colab"
    storage_root: "/content/lake"
    data_root: "/content/lake/{domain}"
    quarantine_root: "/content/lake/_quarantine"

# ── Contracts ───────────────────────────────────────────────
contracts:
  # ── Bronze Layer ────────────────────────────────────────────
  - layer: bronze
    entity: events
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_events_v1.0.yaml"
    enabled: true

  - layer: bronze
    entity: sessions
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_sessions_v1.0.yaml"
    enabled: true

  # ── Silver Layer ────────────────────────────────────────────
  - layer: silver
    entity: sessions_cleaned
    path: "contracts/{silver_layer}/{silver_layer}_{system}_sessions_v1.0.yaml"
    enabled: true

  - layer: silver
    entity: events_cleaned
    path: "contracts/{silver_layer}/{silver_layer}_{system}_events_v1.0.yaml"
    depends_on: [sessions]
    enabled: true

  # ── Gold Layer ──────────────────────────────────────────────
  - layer: gold
    entity: fact_aggregate_channel_performance
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_aggregate_channel_performance_v1.0.yaml"
    enabled: true

  - layer: gold
    entity: dim_events_scd2
    path: "contracts/{gold_layer}/{gold_layer}_{system}_dim_events_scd2_v1.0.yaml"
    depends_on: [fact_aggregate_channel_performance]
    enabled: true

  - layer: gold
    entity: fact_accumulating_snapshot_session_funnel
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_accumulating_snapshot_session_funnel_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: true

  - layer: gold
    entity: fact_periodic_snapshot_user_daily
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_periodic_snapshot_user_daily_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: true

  - layer: gold
    entity: fact_factless_user_conversions
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_factless_user_conversions_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: false
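The depends_on keys above imply an execution order within the system. Conceptually this resolves to a topological sort; a minimal sketch using Python's standard graphlib (illustrative only, LakeLogic's scheduler may work differently):

```python
from graphlib import TopologicalSorter

# depends_on relationships from the gold layer entries above
deps = {
    "fact_aggregate_channel_performance": [],
    "dim_events_scd2": ["fact_aggregate_channel_performance"],
    "fact_accumulating_snapshot_session_funnel": ["dim_events_scd2"],
    "fact_periodic_snapshot_user_daily": ["dim_events_scd2"],
}

# static_order() yields each contract after everything it depends on
order = list(TopologicalSorter(deps).static_order())
print(order[0])  # fact_aggregate_channel_performance
```

Any valid ordering runs a contract only after all of its dependencies have materialized.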

Global Defaults vs. Local Overrides

LakeLogic uses a powerful inheritance model to keep your data contracts clean. It explicitly separates Global Defaults from Local Overrides:

1. server in _system.yaml (The Global Template)

This block lives in your _system.yaml registry. It acts as the blanket rule for the entire system layer. If you define evolution: strict here under bronze:, you are telling the engine: "Unless told otherwise, treat every single Bronze contract in this system as strictly locked down."

This saves you from copying and pasting the exact same server: boilerplate into 50 different contract YAML files!

2. server in a Contract (The Local Override)

This block lives inside a specific individual Data Contract file (e.g., bronze_google_analytics_events_v1.0.yaml). Any setting you define here overrides whatever was set in the global system defaults.

Example Use Case: Imagine you have 50 Bronze tables. You want 49 of them to be highly regulated and locked down, but one specific API is notoriously messy and you just want to let it drift safely.

You would set your _system.yaml to:

server:
  bronze:
    schema_policy:
      evolution: strict   # 49 tables inherit this

And then, strictly inside that one messy contract file, you would define:

server:
  schema_policy:
    evolution: allow      # This specific contract overrides the default

LakeLogic automatically merges them at runtime: inheriting the broad infrastructure defaults from the system registry, while respecting the fine-grained custom behaviors of an individual contract.
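The runtime merge behaves like a recursive dictionary merge: contract values win, everything else is inherited. A minimal Python sketch of the semantics (hypothetical illustration, not LakeLogic's actual implementation):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively layer overrides on top of defaults (illustrative sketch)."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# System-level bronze defaults from _system.yaml
system_server = {"schema_policy": {"evolution": "strict", "unknown_fields": "quarantine"}}
# Override from the one messy contract file
contract_server = {"schema_policy": {"evolution": "allow"}}

effective = deep_merge(system_server, contract_server)
# evolution is overridden; unknown_fields is still inherited
print(effective)  # {'schema_policy': {'evolution': 'allow', 'unknown_fields': 'quarantine'}}
```

Note that only the keys a contract actually sets are overridden; sibling keys under the same block keep their system-level values.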


Cost Observability

The cost: block enables automatic compute cost estimation for every pipeline run in this system. This system-level configuration handles how cost is measured, while the budget and authoritative reporting currency are defined centrally in _domain.yaml.

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| provider | No | "none" | "none" (disabled), "manual" (duration × rate), "databricks_uc" (billing API) |
| attribution | No | "duration_proportional" | "duration_proportional", "row_proportional", or "direct" |
| currency | No | Inherited | Must match cost.currency in _domain.yaml. Mismatches log a warning. |
| rates.dbu_per_hour | No | 0.22 | Databricks Jobs Compute DBU rate per hour |
| rates.storage_per_gb_month | No | 0.023 | Delta storage cost per GB per month |
| cluster.min_nodes | No | 1 | Minimum nodes in the cluster |
| cluster.max_nodes | No | min_nodes | Maximum nodes in the cluster |
| cluster.scaling_assumption | No | "avg" | How to estimate node count during the run: "avg", "peak", "min", "p75" |

Provider Options

  • "none": Cost tracking disabled.
  • "manual": Estimates cost using the formula: run_duration_seconds × dbu_per_hour × avg_nodes / 3600.
  • "databricks_uc": Queries system.billing.usage by run_id tag. Falls back to manual if no billing row is found.
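The manual formula is easy to sanity-check with a few lines of Python (a worked illustration, not LakeLogic code):

```python
def estimate_cost(run_duration_seconds: float, dbu_per_hour: float, avg_nodes: float) -> float:
    """Manual-provider estimate: duration x rate x nodes, normalised from seconds to hours."""
    return run_duration_seconds * dbu_per_hour * avg_nodes / 3600

# A 30-minute run averaging 5 nodes at the default Jobs Compute rate of 0.22 DBU/hour
cost = estimate_cost(1800, 0.22, 5)
print(round(cost, 2))  # 0.55
```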

Cluster Scaling Assumptions

When using autoscaling clusters (min_nodes < max_nodes), the manual provider uses a scaling assumption to blend the hourly rate.

  • "avg" – Uses (min + max) / 2. Most common default for varied workloads.
  • "peak" – Uses max. Conservative estimation, assumes worst-case.
  • "min" – Uses min. Optimistic estimation for steady-state workloads.
  • "p75" – Uses min + 0.75 × (max - min). Good for near-peak, spiky workloads.
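The blending rules above reduce to a small lookup. An illustrative Python sketch (not the engine's implementation):

```python
def blended_nodes(min_nodes: int, max_nodes: int, assumption: str) -> float:
    """Estimate effective node count for an autoscaling cluster."""
    if assumption == "avg":
        return (min_nodes + max_nodes) / 2
    if assumption == "peak":
        return float(max_nodes)
    if assumption == "min":
        return float(min_nodes)
    if assumption == "p75":
        return min_nodes + 0.75 * (max_nodes - min_nodes)
    raise ValueError(f"unknown scaling assumption: {assumption}")

# With min_nodes: 2 and max_nodes: 8, as in the cost example above
print(blended_nodes(2, 8, "avg"))  # 5.0
print(blended_nodes(2, 8, "p75"))  # 6.5
```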

Tip: Start with provider: "manual" to get immediate cost visibility using duration-based estimates. Upgrade to "databricks_uc" when you need exact cost attribution from the Unity Catalog billing tables.

Cost data is recorded in the run log as estimated_cost, cost_currency, and cost_confidence columns. See the Observability docs for analytical queries and SaaS integration.


Platform-Portable Storage

Use YAML anchors to define storage patterns once and reference them across environments: write your cloud connection details in one place and reuse them everywhere.

Example: Multi-cloud storage with YAML anchors

# ── Storage Anchors (define once) ────────────────────────────
storage:
  azure: &azure_storage
    storage_account: "youraccount"
    container: "lakehouse"
    storage_root: "abfss://lakehouse@youraccount.dfs.core.windows.net"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

  aws: &aws_storage
    bucket: "your-data-lake"
    storage_root: "s3://your-data-lake"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

  gcp: &gcp_storage
    bucket: "your-data-lake"
    storage_root: "gs://your-data-lake"
    data_root: "{storage_root}/{domain}"

  local: &local_storage
    storage_root: "./lakehouse"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

# ── Environments ─────────────────────────────────────────────
environments:
  dev:
    <<: *azure_storage
    catalog: "dev_catalog"
  prod:
    <<: *azure_storage
    catalog: "prod_catalog"
  aws:
    <<: *aws_storage
    catalog: "glue_catalog"
  local:
    <<: *local_storage
    catalog: "local"

Supported Platforms

| Environment | Platform | URI Scheme |
| --- | --- | --- |
| dev/staging/prod | Databricks UC (Azure) | abfss://...dfs.core.windows.net |
| fabric | Microsoft Fabric OneLake | abfss://...onelake.dfs.fabric.microsoft.com |
| synapse | Microsoft Synapse Spark | abfss://...dfs.core.windows.net |
| aws | Amazon EMR / S3 | s3:// |
| gcp | Google Cloud / GCS | gs:// |
| local | Local filesystem | ./lakehouse |
| colab | Google Colab | /content/lake |

Placeholder Variables

Contracts use {placeholder} syntax that resolves from the system registry. This keeps your contracts portable — change the storage path in one place, all contracts update automatically.

| Placeholder | Source | Example Value |
| --- | --- | --- |
| {domain} | domain: | marketing |
| {system} | system: | google_analytics |
| {bronze_layer} | bronze_layer: (or inherited from domain) | bronze |
| {silver_layer} | silver_layer: | silver |
| {gold_layer} | gold_layer: | gold |
| {domain_catalog} | Environment-specific catalog: | retail_marketing |
| {storage_root} | Environment-specific | abfss://... |
| {data_root} | Computed | {storage_root}/{domain} |
| {log_path} | Computed | {data_root}/_run_logs |

Usage in Contracts

Example: Placeholder usage in a contract

source:
  path: "{data_root}/{bronze_layer}_{system}_events"

materialization:
  target_path: "{data_root}/{silver_layer}_{system}_sessions"

metadata:
  run_log_table: "{log_path}"
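Because computed values such as {data_root} can themselves contain placeholders, resolution behaves like repeated string substitution. A rough Python sketch of the idea (the engine's actual resolver may differ), using example values from the local environment:

```python
def resolve(template: str, variables: dict, max_passes: int = 5) -> str:
    """Expand {placeholder} tokens, repeating so nested placeholders also resolve."""
    for _ in range(max_passes):
        expanded = template.format(**variables)
        if expanded == template:  # nothing left to expand
            break
        template = expanded
    return template

# Example values as they might resolve under the `local` environment
variables = {
    "domain": "marketing",
    "system": "google_analytics",
    "bronze_layer": "bronze",
    "storage_root": "./lakehouse",
    "data_root": "{storage_root}/{domain}",
}

print(resolve("{data_root}/{bronze_layer}_{system}_events", variables))
# → ./lakehouse/marketing/bronze_google_analytics_events
```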

Contract Listing

List which contracts belong to this system. Set enabled: false to disable a contract without deleting it.

Example: Contract registry

contracts:
  - path: bronze/events_v1.0.yaml
    entity: events
    layer: bronze
    enabled: true

  - path: silver/sessions_v1.0.yaml
    entity: sessions
    layer: silver
    enabled: true

  - path: gold/dim_users_v1.0.yaml
    entity: dim_users
    layer: gold
    enabled: false    # Disabled — won't run in pipeline
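A pipeline runner would then simply skip disabled entries. A hypothetical sketch, assuming the boolean flag is named enabled as in the minimal registry example earlier:

```python
# Contract listing as it might look after parsing the YAML above
contracts = [
    {"path": "bronze/events_v1.0.yaml", "entity": "events", "layer": "bronze", "enabled": True},
    {"path": "silver/sessions_v1.0.yaml", "entity": "sessions", "layer": "silver", "enabled": True},
    {"path": "gold/dim_users_v1.0.yaml", "entity": "dim_users", "layer": "gold", "enabled": False},
]

# Only enabled contracts take part in the run
runnable = [c["entity"] for c in contracts if c["enabled"]]
print(runnable)  # ['events', 'sessions']
```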

External Sources (Cross-Domain Lineage)

When your domain consumes tables from another domain's pipeline, declare them here so the DAG shows the full data lineage across teams.

Why this matters: Without this, your data lineage stops at your domain boundary. With it, you can trace data from its origin all the way through to your final dashboard — even across team boundaries.

Example: Cross-domain source declaration

external_sources:
  - name: "silver_crm_customers"
    catalog_path: "catalog.silver.crm_customers"
    source_domain: "sales/crm"
    consumed_by: ["gold_customer_360"]

External nodes appear in the DAG with dashed borders. LakeLogic does not orchestrate the external pipeline — this is metadata-only for lineage tracking.
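Conceptually, each declaration contributes extra edges from the external node to its local consumers. A minimal illustration (hypothetical, not LakeLogic's DAG builder):

```python
# external_sources entry as it might look after parsing the YAML above
external_sources = [
    {
        "name": "silver_crm_customers",
        "source_domain": "sales/crm",
        "consumed_by": ["gold_customer_360"],
    },
]

# One lineage edge per (external node, consumer) pair
edges = [
    (src["name"], consumer)
    for src in external_sources
    for consumer in src["consumed_by"]
]
print(edges)  # [('silver_crm_customers', 'gold_customer_360')]
```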