System Configuration (`_system.yaml`)

The system registry defines storage, environments, and contract listings for a single source system within a domain.

Think of it like a building's utility plan. The domain config says "this is the Marketing building," and the system config says "here's where the electricity, water, and data pipes connect for the Google Analytics floor."

domains_retail/marketing/
└── google_analytics/
    ├── _system.yaml          ← This file
    ├── bronze/
    │   └── events_v1.0.yaml
    └── silver/
        └── sessions_v1.0.yaml

Core Structure

Every _system.yaml follows this pattern. Customise the values to match your environment:

Example: Minimal system registry

domain: marketing
system: google_analytics

# ── Metadata & Run Logs ──────────────────────────────────────
metadata:
  run_log_table: "{log_path}"
  run_log_backend: "delta"

# ── Server (per layer, contract overrides) ───────────────────
# Applied to all contracts in this system unless overridden.
server:
  bronze:
    mode: "ingest"
    format: "delta"
    schema_policy:
      evolution: "append"
      unknown_fields: "allow"
    cast_to_string: true
  silver:
    mode: "validate"
    format: "delta"
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"
  gold:
    mode: "validate"
    format: "delta"
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"

# ── Materialization Defaults ─────────────────────────────────
materialization:
  bronze:
    strategy: append
    format: delta
  silver:
    strategy: merge
    format: delta
    merge_dedup_guard: true
  gold:
    strategy: merge
    format: delta

# ── Lineage ──────────────────────────────────────────────────
lineage:
  enabled: true
  source_column_name: "_lakelogic_source"
  timestamp_column_name: "_lakelogic_processed_at"

# ── Quarantine ───────────────────────────────────────────────
quarantine:
  enabled: true
  fail_on_quarantine: false
  target: "{quarantine_root}/{domain}/{system}"

# ── Extraction Defaults ─────────────────────────────────────
# System-level LLM extraction defaults (override domain defaults).
extraction_defaults:
  provider: "azure_openai"
  model: "gpt-4o"
  temperature: 0.0
  max_cost_per_run: 25.00
  redact_pii_before_llm: true

# ── Notifications ───────────────────────────────────────────
# System-specific channels (concatenated with domain channels).
notifications_enabled: true   # Global switch to disable all system, domain, and contract notifications
notifications:
  - target: "https://hooks.slack.com/services/YOUR/SYSTEM-SPECIFIC/WEBHOOK"
    on_events: ["failure", "quarantine"]

# ── Cost Observability ──────────────────────────────────
# Track estimated compute cost per pipeline run.
# Budget limits and currency are inherited from _domain.yaml
cost:
  provider: "manual"
  attribution: "duration_proportional"
  currency: "USD"

  rates:
    dbu_per_hour: 0.22
    storage_per_gb_month: 0.023

  # Optional: account for autoscaling clusters
  cluster:
    min_nodes: 2
    max_nodes: 8
    scaling_assumption: "avg"

# ── Storage ──────────────────────────────────────────────────
storage:
  # ── Table Resolution ──────────────────────────────────────
  # UC mode (Databricks):  tables resolve via domain_catalog
  # Direct mode (DuckDB/Polars/Fabric/Synapse/EMR): tables resolve via external_location_root
  domain_catalog: "`{catalog}`.{domain}"
  external_location_root: "{data_root}"

  # ── Operational Paths (UC mode — Databricks Volumes) ──────
  # Only used when storage_mode="uc". Direct mode ignores them.
  contract_root: "/Workspace/Shared/data_platform/domains/{domain}/{system}"
  landing_root: "/Volumes/{catalog}/nondelta/landing_{domain}/{system}"
  log_root: "/Volumes/{catalog}/nondelta/_logs"

  # ── Storage Paths (direct mode — cloud/local) ─────────────
  # Driven by {storage_root}, {data_root}, {quarantine_root}
  # which are defined per-environment below.
  landing_path: "{storage_root}/_data/{domain}/{system}"
  contract_path: "{storage_root}/_contracts/{domain}/{system}"
  log_path: "{storage_root}/_logs/{domain}"
  quarantine_path: "{quarantine_root}/{domain}/{system}"

# ── Cloud Storage Anchors (DRY) ─────────────────────────────
x-azure-storage: &azure_storage
  storage_root: "abfss://nondelta@{storage_account}.dfs.core.windows.net"
  data_root: "abfss://{domain}@{storage_account}.dfs.core.windows.net"
  quarantine_root: "abfss://quarantine@{storage_account}.dfs.core.windows.net"

# ── Environments ────────────────────────────────────────────
# Use ${ENV_VAR} to resolve secrets from environment variables.
# This keeps infrastructure names out of source control.
environments:
  dev:
    catalog: "${LAKELOGIC_DEV_CATALOG}"
    storage_account: "${LAKELOGIC_DEV_STORAGE_ACCOUNT}"
    <<: *azure_storage
  staging:
    catalog: "${LAKELOGIC_STG_CATALOG}"
    storage_account: "${LAKELOGIC_STG_STORAGE_ACCOUNT}"
    <<: *azure_storage
  prod:
    catalog: "${LAKELOGIC_PROD_CATALOG}"
    storage_account: "${LAKELOGIC_PROD_STORAGE_ACCOUNT}"
    <<: *azure_storage
  local:
    catalog: "local"
    storage_root: "./lakehouse"
    data_root: "./lakehouse/{domain}"
    quarantine_root: "./lakehouse/_quarantine"
  colab:
    catalog: "colab"
    storage_root: "/content/lake"
    data_root: "/content/lake/{domain}"
    quarantine_root: "/content/lake/_quarantine"

# ── Contracts ───────────────────────────────────────────────
contracts:
  # ── Bronze Layer ────────────────────────────────────────────
  - layer: bronze
    entity: events
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_events_v1.0.yaml"
    enabled: true

  - layer: bronze
    entity: sessions
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_sessions_v1.0.yaml"
    enabled: true

  # ── Silver Layer ────────────────────────────────────────────
  - layer: silver
    entity: sessions_cleaned
    path: "contracts/{silver_layer}/{silver_layer}_{system}_sessions_v1.0.yaml"
    enabled: true

  - layer: silver
    entity: events_cleaned
    path: "contracts/{silver_layer}/{silver_layer}_{system}_events_v1.0.yaml"
    depends_on: [sessions_cleaned]
    enabled: true

  # ── Gold Layer ──────────────────────────────────────────────
  - layer: gold
    entity: fact_aggregate_channel_performance
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_aggregate_channel_performance_v1.0.yaml"
    enabled: true

  - layer: gold
    entity: dim_events_scd2
    path: "contracts/{gold_layer}/{gold_layer}_{system}_dim_events_scd2_v1.0.yaml"
    depends_on: [fact_aggregate_channel_performance]
    enabled: true

  - layer: gold
    entity: fact_accumulating_snapshot_session_funnel
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_accumulating_snapshot_session_funnel_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: true

  - layer: gold
    entity: fact_periodic_snapshot_user_daily
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_periodic_snapshot_user_daily_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: true

  - layer: gold
    entity: fact_factless_user_conversions
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_factless_user_conversions_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: false

Explicit vs. Implicit Dependencies (`depends_on`)

LakeLogic pipelines execute sequentially by layer by default (Landing → Bronze → Silver → Gold). Because of this inherent layer-by-layer progression, a Silver contract will always process after its upstream Bronze contract has finished.

Rule of Thumb:

* Cross-layer edges are always inferred automatically. The DAG generator reads each contract's source.path and links: block to draw the Bronze→Silver and Silver→Gold data-flow lines. You never need to manually declare these.
* depends_on is for intra-layer ordering only. Use it to ensure dimensional tables (e.g., silver_rideflow_driver_profiles) are processed before fact tables (e.g., silver_rideflow_trips) within the same Silver layer. The DAG will draw these as blue dependency arrows alongside the automatically inferred cross-layer edges.

Therefore, depends_on should primarily be reserved for orchestrating dependencies within a specific layer. You do not need to list upstream Bronze tables in a Silver contract's depends_on array — LakeLogic handles inter-layer sequencing and DAG visualization automatically.
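
The intra-layer ordering described above amounts to a topological sort over the depends_on edges of one layer. The following is a minimal sketch of that idea, not LakeLogic's actual scheduler: the function name intra_layer_order and the in-memory list of dicts are hypothetical, and the entity names are taken from the Gold contracts in the example registry.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of the Gold entries under `contracts:`.
gold_contracts = [
    {"entity": "fact_aggregate_channel_performance"},
    {"entity": "dim_events_scd2",
     "depends_on": ["fact_aggregate_channel_performance"]},
    {"entity": "fact_accumulating_snapshot_session_funnel",
     "depends_on": ["dim_events_scd2"]},
    {"entity": "fact_periodic_snapshot_user_daily",
     "depends_on": ["dim_events_scd2"]},
]


def intra_layer_order(contracts):
    """Return entity names so that every depends_on target comes first."""
    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {c["entity"]: set(c.get("depends_on", [])) for c in contracts}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = intra_layer_order(gold_contracts)
```

Entities that share the same dependency (the two snapshot facts here) have no guaranteed order relative to each other in this sketch. Only the declared edges are respected.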

Global Defaults vs. Local Overrides

LakeLogic uses a powerful inheritance model to keep your data contracts clean. It explicitly separates Global Defaults from Local Overrides:

1. `server` in `_system.yaml` (The Global Template)

This block lives in your _system.yaml registry. It acts as the blanket rule for every contract in the system, one layer at a time. If you define evolution: strict here under bronze:, you are telling the engine: "Unless told otherwise, treat every single Bronze contract in this system as strictly locked down."

This saves you from copying and pasting the exact same server: boilerplate into 50 different contract YAML files!

2. `server` in a Contract (The Local Override)

This block lives inside a specific individual Data Contract file (e.g., bronze_google_analytics_events_v1.0.yaml). Any setting you define here overrides whatever was set in the global system defaults.

Example Use Case: Imagine you have 50 Bronze tables. You want 49 of them to be highly regulated and locked down, but one specific API is notoriously messy and you just want to let it drift safely.

You would set your _system.yaml to:

server:
  bronze:
    schema_policy:
      evolution: strict   # 49 tables inherit this

And then, strictly inside that one messy contract file, you would define:

server:
  schema_policy:
    evolution: allow      # This specific contract overrides the default

LakeLogic automatically merges them at runtime: inheriting the broad infrastructure defaults from the system registry, while respecting the fine-grained custom behaviors of an individual contract.
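
The runtime merge can be pictured as a recursive dictionary merge in which contract values win key by key. This is a sketch under that assumption only. The function name merge_server is hypothetical, and the real engine may resolve nested keys differently.

```python
from copy import deepcopy


def merge_server(system_defaults, contract_override):
    """Overlay a contract's `server:` block on the system layer defaults."""
    merged = deepcopy(system_defaults)
    for key, value in contract_override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested blocks such as schema_policy are merged key by key.
            merged[key] = merge_server(merged[key], value)
        else:
            # Scalars (and new keys) from the contract replace the default.
            merged[key] = deepcopy(value)
    return merged


# server.bronze from _system.yaml (values are illustrative).
system_bronze = {
    "mode": "ingest",
    "format": "delta",
    "schema_policy": {"evolution": "strict", "unknown_fields": "allow"},
}
# server: block from the one messy contract.
contract_server = {"schema_policy": {"evolution": "allow"}}

effective = merge_server(system_bronze, contract_server)
```

In this sketch the contract changes only schema_policy.evolution. The mode, format, and unknown_fields values still come from the system registry.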

Property Reference

`domain` / `system`

Field	Type	Required	Description
`domain`	`string`	Yes	Domain name (inherited from `_domain.yaml` if not set)
`system`	`string`	Yes	Source system identifier (e.g. `"google_analytics"`, `"rideflow"`)

`metadata`

Controls run logging and pipeline observability.

Field	Type	Default	Description
`run_log_table`	`string`	`null`	Path or table name for the run log (e.g. `"{log_path}"`)
`run_log_backend`	`string`	`"delta"`	Storage backend for run logs (`"delta"`, `"json"`)

`server`

Per-layer server configuration inherited by all contracts. Each layer (bronze, silver, gold) can have its own settings.

Field	Type	Default	Description
`mode`	`string`	`"validate"`	`"ingest"` (raw-to-Bronze) or `"validate"` (Quality Gate)
`format`	`string`	`"delta"`	Output format (`"delta"`, `"parquet"`, `"iceberg"`, `"csv"`, `"json"`)
`cast_to_string`	`bool`	`false`	Cast all fields to string (Bronze raw ingestion)

`server.<layer>.schema_policy`

Field	Type	Default	Description
`evolution`	`string`	`"allow"`	`"strict"`, `"append"`, `"merge"`, `"overwrite"`, `"compatible"`, `"allow"`
`unknown_fields`	`string`	`"allow"`	`"quarantine"` (route to quarantine), `"drop"` (silently discard), `"allow"` (keep)

`server.bronze.post_ingestion`

Landing zone lifecycle policy — what to do with source files after a successful Bronze commit. See Zero-Retention Architecture for full details.

Field	Type	Default	Description
`action`	`string`	`"retain"`	`"delete"` (zero-retention), `"archive"` (move to archive), `"retain"` (no-op)
`cleanup_is_blocking`	`bool`	`false`	If `true`, cleanup failure fails the pipeline
`retry_orphaned_files`	`bool`	`true`	Retry cleanup of previously failed files on next run

Example: GDPR zero-retention for all Bronze contracts

server:
  bronze:
    cast_to_string: true
    schema_policy:
      evolution: append
      unknown_fields: allow
    post_ingestion:
      action: delete
      cleanup_is_blocking: false
      retry_orphaned_files: true

`materialization`

Per-layer materialization defaults controlling how data is written to the target.

Field	Type	Default	Description
`strategy`	`string`	—	`"append"`, `"merge"`, `"overwrite"`, `"snapshot"`
`format`	`string`	`"delta"`	Output format
`merge_dedup_guard`	`bool`	`false`	Deduplicate rows before merge (Silver/Gold)

`lineage`

System-level lineage tracking defaults. Adds metadata columns to every processed table.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable lineage column injection
`source_column_name`	`string`	`"_lakelogic_source"`	Column name for source file/table path
`timestamp_column_name`	`string`	`"_lakelogic_processed_at"`	Column name for processing timestamp

`quarantine`

System-level quarantine defaults for rows that fail validation rules.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable quarantine for failed rows
`fail_on_quarantine`	`bool`	`false`	If `true`, pipeline fails when any rows are quarantined
`include_error_reason`	`bool`	`false`	Include the validation error in quarantine records
`target`	`string`	—	Quarantine table/path (e.g. `"{quarantine_path}"`)
`format`	`string`	`"delta"`	Output format for quarantine table
`mode`	`string`	`"append"`	Write mode for quarantine table

`compliance`

System-level compliance defaults (GDPR, data residency). Inherited by all contracts.

Field	Type	Default	Description
`data_residency`	`string`	`null`	Required data residency region (e.g. `"EU"`, `"US"`)
`gdpr.enabled`	`bool`	`false`	Enable GDPR right-to-erasure support
`gdpr.erasure_strategy`	`string`	`"nullify"`	`"nullify"`, `"hash"`, `"delete"`

Residency enforcement: If a contract requires data_residency: EU but the target environment's region is US, the engine logs a compliance violation warning at load time.

Cost Observability

The cost: block enables automatic compute cost estimation for every pipeline run in this system. This system-level configuration handles how cost is measured, while the budget and authoritative reporting currency are defined centrally in _domain.yaml.

Field	Required	Default	Description
`provider`	No	`"none"`	`"none"` (disabled), `"manual"` (duration × rate), `"databricks_uc"` (billing API)
`attribution`	No	`"duration_proportional"`	`"duration_proportional"`, `"row_proportional"`, or `"direct"`
`currency`	No	Inherited	Must match `cost.currency` in `_domain.yaml`. Mismatches log a warning.
`rates.dbu_per_hour`	No	`0.22`	Databricks Jobs Compute DBU rate per hour
`rates.storage_per_gb_month`	No	`0.023`	Delta storage cost per GB per month
`cluster.min_nodes`	No	`1`	Minimum nodes in the cluster
`cluster.max_nodes`	No	`min_nodes`	Maximum nodes in the cluster
`cluster.scaling_assumption`	No	`"avg"`	How to estimate node count during run. Options: `"avg"`, `"peak"`, `"min"`, `"p75"`

Provider Options

"none": Cost tracking disabled.
"manual": Estimates cost using the formula: run_duration_seconds × dbu_per_hour × avg_nodes / 3600.
"databricks_uc": Queries system.billing.usage by run_id tag. Falls back to manual if no billing row is found.

Cluster Scaling Assumptions

When using autoscaling clusters (min_nodes < max_nodes), the manual provider uses a scaling assumption to estimate the node count (avg_nodes in the manual formula above).

"avg" – Uses (min + max) / 2. Most common default for varied workloads.
"peak" – Uses max. Conservative estimation, assumes worst-case.
"min" – Uses min. Optimistic estimation for steady-state workloads.
"p75" – Uses min + 0.75 × (max - min). Good for near-peak, spiky workloads.

Tip: Start with provider: "manual" to get immediate cost visibility using duration-based estimates. Upgrade to "databricks_uc" when you need exact cost attribution from the Unity Catalog billing tables.

Cost data is recorded in the run log as estimated_cost, cost_currency, and cost_confidence columns. See the Observability docs for analytical queries and SaaS integration.
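
The manual formula and the four scaling assumptions above can be written out directly. This is an illustrative sketch: the function names effective_nodes and manual_cost are hypothetical, and only the arithmetic is taken from this page.

```python
def effective_nodes(min_nodes=1, max_nodes=None, scaling_assumption="avg"):
    """Estimate the node count for a run from the cluster settings."""
    if max_nodes is None:
        max_nodes = min_nodes  # documented default: max_nodes = min_nodes
    if max_nodes < min_nodes:
        raise ValueError("max_nodes must be >= min_nodes")
    if scaling_assumption == "avg":
        return (min_nodes + max_nodes) / 2
    if scaling_assumption == "peak":
        return float(max_nodes)
    if scaling_assumption == "min":
        return float(min_nodes)
    if scaling_assumption == "p75":
        return min_nodes + 0.75 * (max_nodes - min_nodes)
    raise ValueError(f"unknown scaling_assumption: {scaling_assumption!r}")


def manual_cost(run_duration_seconds, dbu_per_hour=0.22, **cluster):
    """run_duration_seconds x dbu_per_hour x nodes / 3600."""
    nodes = effective_nodes(**cluster)
    return run_duration_seconds * dbu_per_hour * nodes / 3600


# A 30-minute run on the example cluster (2 to 8 nodes, "avg" gives 5 nodes).
cost = manual_cost(1800, dbu_per_hour=0.22,
                   min_nodes=2, max_nodes=8, scaling_assumption="avg")
```

With these inputs the estimate is roughly 0.55 in the configured currency. Switching to "peak" scales it by 8/5.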

Platform-Portable Storage

Use YAML anchors to define each storage pattern once and reference it from every environment, so cloud connection details live in a single place.

Example: Multi-cloud storage with YAML anchors

# ── Storage Anchors (define once) ────────────────────────────
storage:
  azure: &azure_storage
    storage_account: "youraccount"
    container: "lakehouse"
    storage_root: "abfss://lakehouse@{storage_account}.dfs.core.windows.net"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

  aws: &aws_storage
    bucket: "your-data-lake"
    storage_root: "s3://your-data-lake"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

  gcp: &gcp_storage
    bucket: "your-data-lake"
    storage_root: "gs://your-data-lake"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

  local: &local_storage
    storage_root: "./lakehouse"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

# ── Environments ─────────────────────────────────────────────
# Use ${ENV_VAR} syntax to keep secrets out of source control.
environments:
  dev:
    <<: *azure_storage
    catalog: "${DEV_CATALOG}"
    storage_account: "${DEV_STORAGE_ACCOUNT}"
  prod:
    <<: *azure_storage
    catalog: "${PROD_CATALOG}"
    storage_account: "${PROD_STORAGE_ACCOUNT}"
  aws:
    <<: *aws_storage
    catalog: "${AWS_GLUE_CATALOG}"
  local:
    <<: *local_storage
    catalog: "local"
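
For readers less familiar with YAML merge keys, the effect of `<<: *anchor` plus explicit keys can be mimicked with plain dictionaries: the anchor supplies the base keys, and any key written directly in the environment wins. The sketch below illustrates only that precedence rule. The helper merge_anchor and the values "dev_catalog" and "devaccount" are hypothetical.

```python
# Stand-in for the mapping behind the &azure_storage anchor.
azure_storage = {
    "storage_account": "youraccount",
    "data_root": "{storage_root}/{domain}",
    "quarantine_root": "{storage_root}/_quarantine",
}


def merge_anchor(anchor, explicit):
    """Mimic `<<: *anchor`: start from the anchor, then let explicit keys win."""
    return {**anchor, **explicit}


# Equivalent of:
#   dev:
#     <<: *azure_storage
#     catalog: ...
#     storage_account: ...
dev = merge_anchor(
    azure_storage,
    {"catalog": "dev_catalog", "storage_account": "devaccount"},
)
```

Here dev keeps data_root and quarantine_root from the anchor while storage_account is replaced. The position of the `<<` line inside the mapping does not change this outcome.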

Supported Platforms

Environment	Platform	URI Scheme
`dev/staging/prod`	Databricks UC (Azure)	`abfss://...dfs.core.windows.net`
`fabric`	Microsoft Fabric OneLake	`abfss://...onelake.dfs.fabric.microsoft.com`
`synapse`	Microsoft Synapse Spark	`abfss://...dfs.core.windows.net`
`aws`	Amazon EMR / S3	`s3://`
`gcp`	Google Cloud / GCS	`gs://`
`local`	Local filesystem	`./lakehouse`
`colab`	Google Colab	`/content/lake`

Placeholder Variables

Contracts use {placeholder} syntax that resolves from the system registry. This keeps your contracts portable — change the storage path in one place, all contracts update automatically.

Placeholder	Source	Example Value
`{domain}`	`domain:`	`marketing`
`{system}`	`system:`	`google_analytics`
`{bronze_layer}`	`bronze_layer:` (or inherited from domain)	`bronze`
`{silver_layer}`	`silver_layer:`	`silver`
`{gold_layer}`	`gold_layer:`	`gold`
`{domain_catalog}`	Environment-specific `catalog:`	`retail_marketing`
`{storage_root}`	Environment-specific	`abfss://...`
`{data_root}`	Computed	`{storage_root}/{domain}`
`{log_path}`	Computed	`{data_root}/_run_logs`

Usage in Contracts

Example: Placeholder usage in a contract

source:
  path: "{data_root}/{bronze_layer}_{system}_events"

materialization:
  target_path: "{data_root}/{silver_layer}_{system}_sessions"

metadata:
  run_log_table: "{log_path}"
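
Because computed placeholders such as {data_root} and {log_path} are themselves defined in terms of other placeholders, resolution has to be repeated until nothing changes. The following is a minimal sketch of that behaviour, assuming simple {name} tokens. The function resolve and the values dictionary are hypothetical and do not reflect LakeLogic's internal API.

```python
import re

_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def resolve(template, values, max_depth=10):
    """Expand {placeholder} tokens repeatedly until the string is stable."""
    result = template
    for _ in range(max_depth):
        # Unknown placeholders are left untouched rather than raising.
        expanded = _PLACEHOLDER.sub(
            lambda m: str(values.get(m.group(1), m.group(0))), result
        )
        if expanded == result:
            return result
        result = expanded
    raise ValueError(f"placeholders in {template!r} did not settle (cycle?)")


# Illustrative values for the `local` environment.
values = {
    "domain": "marketing",
    "system": "google_analytics",
    "bronze_layer": "bronze",
    "storage_root": "./lakehouse",
    "data_root": "{storage_root}/{domain}",
    "log_path": "{data_root}/_run_logs",
}

source_path = resolve("{data_root}/{bronze_layer}_{system}_events", values)
run_log = resolve("{log_path}", values)
```

In this sketch source_path expands in two passes, first through {data_root} and then through {storage_root} and {domain}. The max_depth guard exists only to stop circular definitions.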

Environment Variable Resolution

Any string value in the environments: block that uses the ${ENV_VAR} syntax is automatically resolved from your system's environment variables at load time. This keeps infrastructure names, storage accounts, and catalog identifiers out of source control.

Syntax

Wrap an environment variable name in ${...}:

environments:
  prod:
    catalog: "${MY_PROD_CATALOG}"           # resolves os.environ["MY_PROD_CATALOG"]
    storage_account: "${MY_PROD_STORAGE}"   # resolves os.environ["MY_PROD_STORAGE"]

Note: The entire value must be a single ${...} expression. Partial interpolation like "prefix-${VAR}-suffix" is not supported. Use {placeholder} syntax (see above) for template composition within storage paths.

Which Fields Support It?

All string fields inside environments.<env_name> are resolved, including:

Field	Example
`catalog`	`"${RIDEFLOW_DEV_CATALOG}"`
`storage_account`	`"${RIDEFLOW_DEV_STORAGE_ACCOUNT}"`
`region`	`"${DEPLOY_REGION}"`
Any extra field	Custom fields added via `ConfigDict(extra="allow")`

Note: local and colab environments typically use hardcoded values since they don't contain real infrastructure secrets.

Where to Set the Variables

Runtime	How to Set
Databricks	Cluster environment variables or Databricks Secrets scope
Azure DevOps	Pipeline variables / variable groups
GitHub Actions	Repository secrets → `env:` block in workflow
Local development	`.env` file, shell exports, or IDE run config
Google Colab	`os.environ["KEY"] = "value"` in a setup cell

Example: Before and After

❌ Hardcoded (secrets in source control)

environments:
  prod:
    catalog: "rideflow-lakehouse-prod-001"
    storage_account: "sarideflowprodadls001"

✅ Environment Variables (secrets externalized)

environments:
  prod:
    catalog: "${RIDEFLOW_PROD_CATALOG}"
    storage_account: "${RIDEFLOW_PROD_STORAGE_ACCOUNT}"

Missing Variables

If an environment variable is not set, the value resolves to an empty string (""). This will typically surface as a clear error when the pipeline tries to connect to storage — e.g., abfss://domain@.dfs.core.windows.net (missing account name).
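
The two rules in this section (whole-value ${...} only, and an empty string for unset variables) can be sketched as follows. The function resolve_env and its environ parameter are hypothetical, introduced only to make the behaviour easy to try out. The loader's real implementation may differ.

```python
import os
import re

# Matches a value that consists of exactly one ${NAME} expression.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, environ=None):
    """Resolve a whole-value ${ENV_VAR} reference; leave other strings alone."""
    environ = os.environ if environ is None else environ
    match = _ENV_REF.fullmatch(value)
    if match is None:
        # Plain strings and partial interpolation are returned unchanged.
        return value
    # An unset variable resolves to an empty string, as described above.
    return environ.get(match.group(1), "")


fake_env = {"MY_PROD_CATALOG": "prod_catalog"}
catalog = resolve_env("${MY_PROD_CATALOG}", fake_env)
missing = resolve_env("${MY_PROD_STORAGE}", fake_env)
partial = resolve_env("prefix-${MY_PROD_CATALOG}-suffix", fake_env)
```

Because a missing variable yields "" rather than an exception, a pre-flight check that rejects empty catalog or storage_account values can surface the problem earlier than the first storage connection.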

Contract Listing

List which contracts belong to this system. Set enabled: false to disable a contract without deleting it.

Example: Contract registry

contracts:
  - path: bronze/events_v1.0.yaml
    entity: events
    layer: bronze
    enabled: true

  - path: silver/sessions_v1.0.yaml
    entity: sessions
    layer: silver
    enabled: true

  - path: gold/dim_users_v1.0.yaml
    entity: dim_users
    layer: gold
    enabled: false   # Disabled: won't run in pipeline

External Sources (Cross-Domain Lineage)

When your domain consumes tables from another domain's pipeline, declare them here so the DAG shows the full data lineage across teams.

Why this matters: Without this, your data lineage stops at your domain boundary. With it, you can trace data from its origin all the way through to your final dashboard — even across team boundaries.

Example: Cross-domain source declaration

external_sources:
  - name: "silver_crm_customers"
    catalog_path: "catalog.silver.crm_customers"
    source_domain: "sales/crm"
    consumed_by: ["gold_customer_360"]

External nodes appear in the DAG with dashed borders. LakeLogic does not orchestrate the external pipeline — this is metadata-only for lineage tracking.
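
Since the declaration is metadata-only, its contribution to the DAG is just a set of extra edges from each external node to its consumers. The sketch below shows that derivation. The helper external_edges is hypothetical, and how LakeLogic represents DAG edges internally is not specified on this page.

```python
# Parsed form of the external_sources: block from the example above.
external_sources = [
    {
        "name": "silver_crm_customers",
        "catalog_path": "catalog.silver.crm_customers",
        "source_domain": "sales/crm",
        "consumed_by": ["gold_customer_360"],
    }
]


def external_edges(sources):
    """Return (external_node, consumer) pairs for lineage drawing only."""
    return [
        (source["name"], consumer)
        for source in sources
        for consumer in source.get("consumed_by", [])
    ]


edges = external_edges(external_sources)
```

Nothing in this sketch schedules or runs the external table. It only yields the edges a renderer would draw from the dashed external node.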
