System Configuration (_system.yaml)
The system registry defines storage, environments, and contract listings for a single source system within a domain.
Think of it like a building's utility plan. The domain config says "this is the Marketing building," and the system config says "here's where the electricity, water, and data pipes connect for the Google Analytics floor."
```
domains_retail/marketing/
└── google_analytics/
    ├── _system.yaml          ← This file
    ├── bronze/
    │   └── events_v1.0.yaml
    └── silver/
        └── sessions_v1.0.yaml
```
Core Structure
Every _system.yaml follows this pattern. Customise the values to match your environment:
Example: Minimal system registry
```yaml
domain: marketing
system: google_analytics

# ── Metadata & Run Logs ──────────────────────────────────────
metadata:
  run_log_table: "{log_path}"
  run_log_backend: "delta"

# ── Server (per layer, contract overrides) ───────────────────
# Applied to all contracts in this system unless overridden.
server:
  bronze:
    mode: "ingest"
    format: "delta"
    schema_policy:
      evolution: "append"
      unknown_fields: "allow"
      cast_to_string: true
  silver:
    mode: "validate"
    format: "delta"
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"
  gold:
    mode: "validate"
    format: "delta"
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"

# ── Materialization Defaults ─────────────────────────────────
materialization:
  bronze:
    strategy: append
    format: delta
  silver:
    strategy: merge
    format: delta
    merge_dedup_guard: true
  gold:
    strategy: merge
    format: delta

# ── Lineage ──────────────────────────────────────────────────
lineage:
  enabled: true
  source_column_name: "_lakelogic_source"
  timestamp_column_name: "_lakelogic_processed_at"

# ── Quarantine ───────────────────────────────────────────────
quarantine:
  enabled: true
  fail_on_quarantine: false
  target: "{quarantine_root}/{domain}/{system}"

# ── Extraction Defaults ──────────────────────────────────────
# System-level LLM extraction defaults (override domain defaults).
extraction_defaults:
  provider: "azure_openai"
  model: "gpt-4o"
  temperature: 0.0
  max_cost_per_run: 25.00
  redact_pii_before_llm: true

# ── Notifications ────────────────────────────────────────────
# System-specific channels (concatenated with domain channels).
notifications_enabled: true  # Global switch to disable all system, domain, and contract notifications
notifications:
  - target: "https://hooks.slack.com/services/YOUR/SYSTEM-SPECIFIC/WEBHOOK"
    on_events: ["failure", "quarantine"]

# ── Cost Observability ───────────────────────────────────────
# Track estimated compute cost per pipeline run.
# Budget limits and currency are inherited from _domain.yaml.
cost:
  provider: "manual"
  attribution: "duration_proportional"
  currency: "USD"
  rates:
    dbu_per_hour: 0.22
    storage_per_gb_month: 0.023
  # Optional: account for autoscaling clusters
  cluster:
    min_nodes: 2
    max_nodes: 8
    scaling_assumption: "avg"

# ── Storage ──────────────────────────────────────────────────
storage:
  # ── Table Resolution ───────────────────────────────────────
  # UC mode (Databricks): tables resolve via domain_catalog.
  # Direct mode (DuckDB/Polars/Fabric/Synapse/EMR): tables resolve via external_location_root.
  domain_catalog: "`{catalog}`.{domain}"
  external_location_root: "{data_root}"

  # ── Operational Paths (UC mode — Databricks Volumes) ───────
  # Only used when storage_mode="uc". Direct mode ignores them.
  contract_root: "/Workspace/Shared/data_platform/domains/{domain}/{system}"
  landing_root: "/Volumes/{catalog}/nondelta/landing_{domain}/{system}"
  log_root: "/Volumes/{catalog}/nondelta/_logs"

  # ── Storage Paths (direct mode — cloud/local) ──────────────
  # Driven by {storage_root}, {data_root}, {quarantine_root},
  # which are defined per-environment below.
  landing_path: "{storage_root}/_data/{domain}/{system}"
  contract_path: "{storage_root}/_contracts/{domain}/{system}"
  log_path: "{storage_root}/_logs/{domain}"
  quarantine_path: "{quarantine_root}/{domain}/{system}"

# ── Cloud Storage Anchors (DRY) ──────────────────────────────
x-azure-storage: &azure_storage
  storage_root: "abfss://nondelta@{storage_account}.dfs.core.windows.net"
  data_root: "abfss://{domain}@{storage_account}.dfs.core.windows.net"
  quarantine_root: "abfss://quarantine@{storage_account}.dfs.core.windows.net"

# ── Environments ─────────────────────────────────────────────
environments:
  dev:
    catalog: "lakelogic-lakehouse-dev-001"
    storage_account: "salakelogicdevadls001"
    <<: *azure_storage
  local:
    catalog: "local"
    storage_root: "./lakehouse"
    data_root: "./lakehouse/{domain}"
    quarantine_root: "./lakehouse/_quarantine"
  colab:
    catalog: "colab"
    storage_root: "/content/lake"
    data_root: "/content/lake/{domain}"
    quarantine_root: "/content/lake/_quarantine"

# ── Contracts ────────────────────────────────────────────────
contracts:
  # ── Bronze Layer ───────────────────────────────────────────
  - layer: bronze
    entity: events
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_events_v1.0.yaml"
    enabled: true
  - layer: bronze
    entity: sessions
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_sessions_v1.0.yaml"
    enabled: true
  # ── Silver Layer ───────────────────────────────────────────
  - layer: silver
    entity: sessions_cleaned
    path: "contracts/{silver_layer}/{silver_layer}_{system}_sessions_v1.0.yaml"
    enabled: true
  - layer: silver
    entity: events_cleaned
    path: "contracts/{silver_layer}/{silver_layer}_{system}_events_v1.0.yaml"
    depends_on: [sessions]
    enabled: true
  # ── Gold Layer ─────────────────────────────────────────────
  - layer: gold
    entity: fact_aggregate_channel_performance
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_aggregate_channel_performance_v1.0.yaml"
    enabled: true
  - layer: gold
    entity: dim_events_scd2
    path: "contracts/{gold_layer}/{gold_layer}_{system}_dim_events_scd2_v1.0.yaml"
    depends_on: [fact_aggregate_channel_performance]
    enabled: true
  - layer: gold
    entity: fact_accumulating_snapshot_session_funnel
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_accumulating_snapshot_session_funnel_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: true
  - layer: gold
    entity: fact_periodic_snapshot_user_daily
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_periodic_snapshot_user_daily_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: true
  - layer: gold
    entity: fact_factless_user_conversions
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_factless_user_conversions_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: false
```
Global Defaults vs. Local Overrides
LakeLogic uses a powerful inheritance model to keep your data contracts clean. It explicitly separates Global Defaults from Local Overrides:
1. server in _system.yaml (The Global Template)
This block lives in your _system.yaml registry. It sets the blanket rule for each layer across the entire system. If you define evolution: strict here under bronze:, you are telling the engine: "Unless told otherwise, treat every single Bronze contract in this system as strictly locked down."
This saves you from copying and pasting the exact same server: boilerplate into 50 different contract YAML files!
2. server in a Contract (The Local Override)
This block lives inside a specific individual Data Contract file (e.g., bronze_google_analytics_events_v1.0.yaml). Any setting you define here overrides whatever was set in the global system defaults.
Example Use Case: Imagine you have 50 Bronze tables. You want 49 of them to be highly regulated and locked down, but one specific API is notoriously messy and you just want to let it drift safely.
You would set your _system.yaml to:
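A sketch of the system-level default, mirroring the structure of the server: block shown earlier:

```yaml
# _system.yaml — global template: lock down every Bronze contract by default
server:
  bronze:
    schema_policy:
      evolution: "strict"
      unknown_fields: "quarantine"
```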
And then, strictly inside that one messy contract file, you would define:
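A sketch of the local override, assuming the contract-level server: block uses the same keys as the system-level one:

```yaml
# bronze_google_analytics_messy_api_v1.0.yaml — local override (illustrative)
server:
  schema_policy:
    evolution: "append"       # let this one schema drift safely
    unknown_fields: "allow"   # don't quarantine unexpected columns
    cast_to_string: true
```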
LakeLogic automatically merges them at runtime: inheriting the broad infrastructure defaults from the system registry, while respecting the fine-grained custom behaviors of an individual contract.
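The merge behavior described above can be sketched as a recursive dictionary overlay. This is a hypothetical illustration of the semantics, not LakeLogic's actual implementation:

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Overlay contract-level overrides onto system-level defaults.

    Nested dicts are merged key by key; scalar values in `overrides`
    replace the corresponding default.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

system_default = {"schema_policy": {"evolution": "strict",
                                    "unknown_fields": "quarantine"}}
contract_override = {"schema_policy": {"evolution": "append"}}

deep_merge(system_default, contract_override)
# → {'schema_policy': {'evolution': 'append', 'unknown_fields': 'quarantine'}}
```

The override only names the keys it changes; everything else is inherited untouched.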
Cost Observability
The cost: block enables automatic compute cost estimation for every pipeline run in this system. This system-level configuration handles how cost is measured, while the budget and authoritative reporting currency are defined centrally in _domain.yaml.
| Field | Required | Default | Description |
|---|---|---|---|
| `provider` | No | `"none"` | `"none"` (disabled), `"manual"` (duration × rate), `"databricks_uc"` (billing API) |
| `attribution` | No | `"duration_proportional"` | `"duration_proportional"`, `"row_proportional"`, or `"direct"` |
| `currency` | No | Inherited | Must match `cost.currency` in `_domain.yaml`. Mismatches log a warning. |
| `rates.dbu_per_hour` | No | `0.22` | Databricks Jobs Compute DBU rate per hour |
| `rates.storage_per_gb_month` | No | `0.023` | Delta storage cost per GB per month |
| `cluster.min_nodes` | No | `1` | Minimum nodes in the cluster |
| `cluster.max_nodes` | No | `min_nodes` | Maximum nodes in the cluster |
| `cluster.scaling_assumption` | No | `"avg"` | How to estimate node count during a run. Options: `"avg"`, `"peak"`, `"min"`, `"p75"` |
Provider Options
- `"none"`: Cost tracking disabled.
- `"manual"`: Estimates cost using the formula `run_duration_seconds × dbu_per_hour × avg_nodes / 3600`.
- `"databricks_uc"`: Queries `system.billing.usage` by `run_id` tag. Falls back to manual if no billing row is found.
Cluster Scaling Assumptions
When using autoscaling clusters (min_nodes < max_nodes), the manual provider uses a scaling assumption to blend the hourly rate.
- `"avg"` – Uses `(min + max) / 2`. Most common default for varied workloads.
- `"peak"` – Uses `max`. Conservative estimation; assumes worst case.
- `"min"` – Uses `min`. Optimistic estimation for steady-state workloads.
- `"p75"` – Uses `min + 0.75 × (max - min)`. Good for near-peak, spiky workloads.
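Putting the manual formula and the scaling assumptions together, the estimate can be sketched like this (a hypothetical helper, not LakeLogic's actual code):

```python
def estimate_run_cost(duration_s: float, dbu_per_hour: float,
                      min_nodes: int, max_nodes: int,
                      scaling_assumption: str = "avg") -> float:
    """Manual cost estimate: duration × DBU rate × assumed node count / 3600."""
    nodes = {
        "avg":  (min_nodes + max_nodes) / 2,
        "peak": max_nodes,
        "min":  min_nodes,
        "p75":  min_nodes + 0.75 * (max_nodes - min_nodes),
    }[scaling_assumption]
    return duration_s * dbu_per_hour * nodes / 3600

# One hour on a 2-8 node autoscaling cluster at 0.22 DBU/hour:
estimate_run_cost(3600, 0.22, 2, 8, "avg")   # → 1.1  (5 avg nodes × 0.22)
estimate_run_cost(3600, 0.22, 2, 8, "peak")  # → 1.76 (8 nodes × 0.22)
```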
Tip: Start with `provider: "manual"` to get immediate cost visibility using duration-based estimates. Upgrade to `"databricks_uc"` when you need exact cost attribution from the Unity Catalog billing tables.
Cost data is recorded in the run log as `estimated_cost`, `cost_currency`, and `cost_confidence` columns. See the Observability docs for analytical queries and SaaS integration.
Platform-Portable Storage
Use YAML anchors to define storage patterns once and reference them across environments. This means you write your cloud connection details once and reuse them everywhere.
Example: Multi-cloud storage with YAML anchors
```yaml
# ── Storage Anchors (define once) ────────────────────────────
storage:
  azure: &azure_storage
    storage_account: "youraccount"
    container: "lakehouse"
    storage_root: "abfss://lakehouse@youraccount.dfs.core.windows.net"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"
  aws: &aws_storage
    bucket: "your-data-lake"
    storage_root: "s3://your-data-lake"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"
  gcp: &gcp_storage
    bucket: "your-data-lake"
    storage_root: "gs://your-data-lake"
    data_root: "{storage_root}/{domain}"
  local: &local_storage
    storage_root: "./lakehouse"
    data_root: "{storage_root}/{domain}"
    quarantine_root: "{storage_root}/_quarantine"
    log_path: "{data_root}/_run_logs"

# ── Environments ─────────────────────────────────────────────
environments:
  dev:
    <<: *azure_storage
    catalog: "dev_catalog"
  prod:
    <<: *azure_storage
    catalog: "prod_catalog"
  aws:
    <<: *aws_storage
    catalog: "glue_catalog"
  local:
    <<: *local_storage
    catalog: "local"
```
Supported Platforms
| Environment | Platform | URI Scheme |
|---|---|---|
| `dev`/`staging`/`prod` | Databricks UC (Azure) | `abfss://...dfs.core.windows.net` |
| `fabric` | Microsoft Fabric OneLake | `abfss://...onelake.dfs.fabric.microsoft.com` |
| `synapse` | Microsoft Synapse Spark | `abfss://...dfs.core.windows.net` |
| `aws` | Amazon EMR / S3 | `s3://` |
| `gcp` | Google Cloud / GCS | `gs://` |
| `local` | Local filesystem | `./lakehouse` |
| `colab` | Google Colab | `/content/lake` |
Placeholder Variables
Contracts use {placeholder} syntax that resolves from the system registry. This keeps your contracts portable — change the storage path in one place, all contracts update automatically.
| Placeholder | Source | Example Value |
|---|---|---|
| `{domain}` | `domain:` | `marketing` |
| `{system}` | `system:` | `google_analytics` |
| `{bronze_layer}` | `bronze_layer:` (or inherited from domain) | `bronze` |
| `{silver_layer}` | `silver_layer:` | `silver` |
| `{gold_layer}` | `gold_layer:` | `gold` |
| `{domain_catalog}` | Environment-specific `catalog:` | `retail_marketing` |
| `{storage_root}` | Environment-specific | `abfss://...` |
| `{data_root}` | Computed | `{storage_root}/{domain}` |
| `{log_path}` | Computed | `{data_root}/_run_logs` |
Usage in Contracts
Example: Placeholder usage in a contract
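A minimal sketch of placeholders inside a contract file. The field names (source:, target:, and so on) are illustrative assumptions; only the {placeholder} tokens come from the registry above:

```yaml
# bronze_google_analytics_events_v1.0.yaml (excerpt — field names illustrative)
source:
  path: "{landing_path}/events/*.json"
target:
  table: "{domain_catalog}.{bronze_layer}_{system}_events"
quarantine:
  target: "{quarantine_path}/events"
```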
Contract Listing
List which contracts belong to this system. Set enabled: false to disable a contract without deleting it.
Example: Contract registry
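A trimmed version of the contracts: block from the full registry above:

```yaml
contracts:
  - layer: bronze
    entity: events
    path: "contracts/{bronze_layer}/{bronze_layer}_{system}_events_v1.0.yaml"
    enabled: true
  - layer: gold
    entity: fact_factless_user_conversions
    path: "contracts/{gold_layer}/{gold_layer}_{system}_fact_factless_user_conversions_v1.0.yaml"
    depends_on: [dim_events_scd2]
    enabled: false   # disabled without deleting the file
```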
External Sources (Cross-Domain Lineage)
When your domain consumes tables from another domain's pipeline, declare them here so the DAG shows the full data lineage across teams.
Why this matters: Without this, your data lineage stops at your domain boundary. With it, you can trace data from its origin all the way through to your final dashboard — even across team boundaries.
Example: Cross-domain source declaration
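A hedged sketch of what such a declaration might look like. The external_sources key and its fields are illustrative assumptions; check your LakeLogic schema reference for the exact names:

```yaml
# _system.yaml (excerpt — key and field names illustrative)
external_sources:
  - domain: sales                     # owning team's domain
    system: pos                       # their source system
    entity: fact_orders               # table this system consumes
    description: "Order facts consumed by marketing attribution"
```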
External nodes appear in the DAG with dashed borders. LakeLogic does not orchestrate the external pipeline — this is metadata-only for lineage tracking.