## Schema & Model

The `model:` block defines the expected schema — field names, types, constraints, PII classification, and masking strategies.
### Field Definition

The `fields:` array is the core of your contract. It defines the exact shape of your data, including data types, nullability, descriptions, and data quality rules that the LakeLogic engine will enforce upstream.

```yaml
model:
  fields:
    - name: "customer_id"
      type: "long"              # string | int | long | double | boolean | date | timestamp
      required: true            # Generates automatic not_null rule
      pii: false
      classification: "public"  # public | internal | confidential | restricted
      description: "Unique customer identifier"

      # Generator hints (used by DataGenerator for synthetic data)
      min: 1
      max: 999999
      accepted_values: ["premium", "standard", "basic"]

      # Foreign key reference
      foreign_key:
        contract: "silver_agents"
        column: "agent_id"
        severity: "error"

      # Field-level quality rules
      rules:
        - name: "customer_id_positive"
          sql: "customer_id > 0"
          category: "correctness"
          severity: "error"
```
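To make the constraint semantics concrete, here is a minimal Python sketch of the checks a field definition like the one above implies. `check_field` is a hypothetical helper written for illustration only (it is not a LakeLogic API), and it treats the `min`/`max` generator hints as range checks purely for the example.

```python
def check_field(value, *, required=False, min=None, max=None, accepted_values=None):
    """Return the names of violated rules for a single field value.

    Mirrors the contract keys above: required -> automatic not_null rule,
    min/max as range bounds, accepted_values as an enum check.
    """
    violations = []
    if value is None:
        if required:
            violations.append("not_null")  # required: true => not_null rule
        return violations                  # other checks don't apply to NULLs
    if min is not None and value < min:
        violations.append("min")
    if max is not None and value > max:
        violations.append("max")
    if accepted_values is not None and value not in accepted_values:
        violations.append("accepted_values")
    return violations

print(check_field(None, required=True))                   # ['not_null']
print(check_field(0, min=1, max=999999))                  # ['min']
print(check_field(42, required=True, min=1, max=999999))  # []
```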
### PII Masking

LakeLogic provides native, declarative PII masking. By tagging a field with `pii: true` and choosing a strategy, the engine will automatically mask or encrypt the data during processing — ensuring sensitive information is secured before it ever lands in your downstream analytical tables.

```yaml
- name: "email"
  type: "string"
  pii: true
  classification: "confidential"
  security_groups: ["pii-readers", "compliance-team"]
  masking: "partial"
```
#### Masking Strategies

| Strategy | Output | Joinable | Reversible | GDPR |
|---|---|---|---|---|
| `nullify` | `NULL` | No | No | ✓ |
| `hash` | `a3f8b2c1d4...` (SHA-256) | Yes | No | ✓ |
| `redact` | `***REDACTED***` | No | No | ✓ |
| `partial` | `j***@company.com` | No | No | ✗ |
| `encrypt` | `enc:gAAAAABh...` (AES-256) | Yes | Yes | ✓ |
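The "Joinable" column follows from determinism. A minimal sketch of why `hash` keeps join keys usable while `nullify` and `redact` do not — using plain `hashlib` and assuming an unsalted SHA-256; LakeLogic's actual implementation may salt or otherwise differ:

```python
import hashlib

def hash_mask(value: str) -> str:
    """Deterministic SHA-256 mask: equal inputs give equal digests,
    so two tables masked the same way still join on the column."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

orders_key = hash_mask("jane.doe@company.com")
clicks_key = hash_mask("jane.doe@company.com")
print(orders_key == clicks_key)  # True: joinable across tables
print(len(orders_key))           # 64 hex chars; the original is not recoverable
```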
#### Custom Partial Masking Format

```yaml
masking: "partial"
masking_format: "{first1}***@{domain}"
# Tokens: {first1}-{first9}, {last1}-{last9}, {domain}
# Examples: "j***@company.com", "***-***-1234", "SW** ***"
```
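The token semantics can be illustrated with a small re-implementation. `apply_partial_mask` is hypothetical, written only to show how `{firstN}`, `{lastN}`, and `{domain}` expand; LakeLogic's actual parser may behave differently on edge cases:

```python
import re

def apply_partial_mask(value: str, fmt: str) -> str:
    """Expand {firstN}/{lastN}/{domain} tokens against a value.
    {firstN}/{lastN} slice the part before '@' (or the whole value
    when there is no '@'); {domain} is the part after '@'."""
    if "@" in value:
        local, domain = value.split("@", 1)
    else:
        local, domain = value, ""

    def expand(match):
        token = match.group(1)
        if token == "domain":
            return domain
        n = int(token[-1])
        return local[:n] if token.startswith("first") else local[-n:]

    return re.sub(r"\{(first[1-9]|last[1-9]|domain)\}", expand, fmt)

print(apply_partial_mask("jane@company.com", "{first1}***@{domain}"))
# j***@company.com
print(apply_partial_mask("555-123-1234", "***-***-{last4}"))
# ***-***-1234
```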
#### Key Injection

```yaml
masking: "encrypt"
# Keys are injected via the orchestrator using
# the LAKELOGIC_PII_KEY environment variable.
```
### Decryption Workflows

Because `encrypt` uses symmetric encryption (Fernet / AES), the same key used for masking is required to decrypt. LakeLogic encrypts PII at rest during the pipeline run, so your Delta/Parquet files on storage are always protected — regardless of engine or platform.
#### Contract Example

```yaml
model:
  fields:
    - name: "user_id"
      type: "string"
      pii: true
      masking: "encrypt"
      security_groups: ["pii-readers", "compliance-team"]
      # Users NOT in these groups see: enc:gAAAAABh...
      # Users IN these groups can decrypt with the key

    - name: "email"
      type: "string"
      pii: true
      masking: "partial"
      masking_format: "{first1}***@{domain}"
      security_groups: ["pii-readers"]
      # Users NOT in these groups see: j***@company.com

    - name: "session_id"
      type: "long"
      pii: false
      # No masking — visible to everyone
```
#### 1. Encryption at Rest + Programmatic Decryption (Recommended)

LakeLogic physically writes encrypted values (`enc:gAAAAABh...`) into your Bronze/Silver tables. The raw files on ADLS, S3, or local disk are protected even if storage credentials leak.

Encrypted data at rest (what everyone sees by default):

| session_id | user_id | email |
|---|---|---|
| 1001 | `enc:gAAAAABhX9...kQ2dE=` | j***@company.com |
| 1002 | `enc:gAAAAABhY1...pR7mA=` | a***@gmail.com |
| 1003 | `enc:gAAAAABhZ3...nT4bC=` | m***@corp.co.uk |
Decrypted query (authorized users with the key):

**PySpark:**

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Prepare the same padded key used during pipeline masking
raw_key = os.environ["LAKELOGIC_PII_KEY"]
padded_key = (raw_key * ((16 // len(raw_key)) + 1))[:16]

# 2. Read the encrypted table
df = spark.read.format("delta").load("lakehouse/marketing/bronze_google_analytics_events")

# 3. Decrypt the user_id column
df = df.withColumn(
    "user_id",
    F.when(
        F.col("user_id").startswith("enc:"),
        F.expr(f"CAST(aes_decrypt(unbase64(substring(user_id, 5)), '{padded_key}') AS STRING)")
    ).otherwise(F.col("user_id"))
)
```
**Polars:**

```python
import base64
import hashlib
import os

import polars as pl
from cryptography.fernet import Fernet

# 1. Build the Fernet cipher from the same pipeline key
raw_key = os.environ["LAKELOGIC_PII_KEY"].encode("utf-8")
fernet_key = base64.urlsafe_b64encode(hashlib.sha256(raw_key).digest())
fern = Fernet(fernet_key)

# 2. Read the encrypted table
df = pl.read_delta("lakehouse/marketing/bronze_google_analytics_events")

# 3. Decrypt the user_id column
df = df.with_columns(
    pl.col("user_id")
    .map_elements(
        lambda v: fern.decrypt(v[4:].encode("ascii")).decode("utf-8")
        if v and v.startswith("enc:") else v,
        return_dtype=pl.Utf8,
    )
    .alias("user_id")
)
```
Decrypted output:

| session_id | user_id | email |
|---|---|---|
| 1001 | USR-44821 | j***@company.com |
| 1002 | USR-10392 | a***@gmail.com |
| 1003 | USR-77564 | m***@corp.co.uk |
Works with Polars, Spark, or DuckDB — zero vendor lock-in. Delete the key to achieve crypto-shredding (GDPR Art. 17 Right to Erasure).
#### 2. Dynamic Column Masking (Databricks Unity Catalog)

If you are on Databricks, you can optionally layer Unity Catalog column masks on top. Use `MaskingEngine.generate_uc_masks()` to generate `CREATE FUNCTION` + `ALTER TABLE SET MASK` DDL tied to your contract's `security_groups`. Unauthorized users see encrypted values; authorized users see plaintext — transparently, with no application code required.
### Pipeline Masking vs Query-Time Access Control

LakeLogic always applies maximum PII protection by default. When the pipeline runs, every field with `pii: true` is masked before materialization — regardless of who triggered the pipeline. You will see this confirmed in the logs: `user_groups=(none)` means no special access groups were provided, so all PII fields are fully masked. This is the correct default behaviour.
#### The `user_groups` Escape Hatch

LakeLogic supports an optional `user_groups` parameter that can selectively skip masking for specific fields. This is designed for teams that do not have enterprise-grade column-level security (e.g., Unity Catalog, Ranger, or Lake Formation) and need to produce separate materialized outputs for different access tiers.

```yaml
- name: user_id
  type: string
  pii: true
  masking: "encrypt"
  security_groups: ["fraud-team", "compliance-team"]
```

If you pass `user_groups=["fraud-team"]` into the `DataProcessor`, the engine skips masking for fields where the caller's group is listed in `security_groups`.
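The per-field decision described above can be sketched in a few lines. `should_mask` is an illustrative helper, not a LakeLogic API:

```python
def should_mask(field, user_groups):
    """A PII field stays masked unless the caller belongs to one of its
    security_groups; with no user_groups at all, everything stays masked."""
    if not field.get("pii", False):
        return False                         # non-PII fields are never masked
    if not user_groups:
        return True                          # user_groups=(none): maximum protection
    allowed = set(field.get("security_groups", []))
    return not (allowed & set(user_groups))  # skip masking on any group overlap

user_id = {"pii": True, "security_groups": ["fraud-team", "compliance-team"]}
print(should_mask(user_id, None))            # True: default run, fully masked
print(should_mask(user_id, ["fraud-team"]))  # False: caller authorized, masking skipped
```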
#### When to Use Each Approach

| Approach | Use When | Trade-Off |
|---|---|---|
| Pipeline masking (default) | Always | Data is protected at rest — even if storage credentials leak |
| Query-time RBAC (Unity Catalog, Ranger) | You have enterprise governance tooling | Unmasking happens transparently at read time; raw PII never sits unprotected on disk |
| `user_groups` at pipeline time | No enterprise RBAC available; need separate masked/unmasked outputs | Unmasked PII is permanently written to the target table — secure the storage path with tight ACLs |

> [!IMPORTANT]
> Recommended pattern: Always mask at rest during the pipeline (the default). Enforce selective unmasking at query time using your platform's column-level security (Unity Catalog, Ranger, Lake Formation). Reserve `user_groups` for environments where query-time RBAC is not available and you need to produce a separate, tightly ACL'd materialization for a specific consumer team.
### Kimball Modeling Flags

When building analytical data models, specifically Accumulating Snapshot Fact Tables, you often track an entity (like an Order) as it moves through various stages (e.g., Placed -> Shipped -> Delivered).

LakeLogic provides native contract flags to automate this lifecycle tracking so you don't have to write complex SQL merges:

- `milestone: true`: Tells the system this column represents a stage in a process. The pipeline will automatically preserve the timestamp of this milestone once it is set, preventing late-arriving source updates from accidentally overwriting the original event time.
- `generated: true`: Tells the system to completely ignore this column if it arrives from the upstream source system. Instead, LakeLogic will automatically generate the timestamp internally when the milestone condition is met.

```yaml
- name: "shipped_date"
  type: "timestamp"
  nullable: true   # Milestones start as NULL until the event occurs
  milestone: true  # Locks the timestamp once the order ships
  generated: true  # LakeLogic generates this, ignoring the source API
```
#### Example: Order Lifecycle Tracking

Imagine you are tracking orders. The system receives repeated updates for `order_1` over several days as its status changes from "placed" to "shipped" to "delivered".

**Day 1: Order is placed (Source System)**

The upstream API sends a new order. The `shipped_date` doesn't exist yet, so it is `NULL` in our Lakehouse.

| order_id | status | shipped_date (Lakehouse) |
|---|---|---|
| order_1 | placed | `NULL` |

**Day 2: Order ships (Source System)**

The API sends an update for `order_1` with `status: "shipped"`. Because `shipped_date` has `generated: true`, LakeLogic ignores any date the API might have sent and automatically stamps the current extraction/processing time into the Silver table.

| order_id | status | shipped_date (Lakehouse) |
|---|---|---|
| order_1 | shipped | 2026-04-12 09:14:00 |

**Day 3: Order is delivered (Source System)**

The API sends a final update for `order_1` with `status: "delivered"`. Because `shipped_date` has `milestone: true`, LakeLogic protects the historical timestamp. It prevents the record's `shipped_date` from being overwritten by `NULL` or updated to a newer date, permanently locking the milestone in the accumulating snapshot fact table!

| order_id | status | shipped_date (Lakehouse) |
|---|---|---|
| order_1 | delivered | 2026-04-12 09:14:00 (Preserved!) |
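The Day 1–3 behaviour reduces to a simple merge rule per record. This is a sketch of the semantics only: `merge_milestone` is a hypothetical helper, and it elides the status-change condition that actually triggers the stamp in the real engine.

```python
from datetime import datetime

def merge_milestone(existing, incoming, *, generated, now=None):
    """milestone: once a value is set, it is locked and never overwritten.
    generated: the source-provided value is ignored; the engine stamps its
    own processing time the first time the milestone occurs."""
    if existing is not None:
        return existing                    # Day 3: historical value preserved
    if generated:
        return now or datetime.now()       # Day 2: engine-stamped timestamp
    return incoming                        # plain column: take the source value

t0 = datetime(2026, 4, 12, 9, 14)
shipped = merge_milestone(None, "2026-04-15 (from API, ignored)", generated=True, now=t0)
print(shipped)                                         # 2026-04-12 09:14:00
print(merge_milestone(shipped, None, generated=True))  # 2026-04-12 09:14:00 (locked)
```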
### Lineage Columns

LakeLogic automatically injects audit and lineage metadata columns into every record as it flows through the medallion architecture. You can customize the column names and toggle specific tracker fields in your contract's `lineage:` block.

```yaml
lineage:
  enabled: true

  capture_source_path: true
  source_column_name: "_lakelogic_source"

  capture_timestamp: true
  timestamp_column_name: "_lakelogic_processed_at"

  capture_run_id: true
  run_id_column_name: "_lakelogic_run_id"

  capture_contract_name: true
  contract_name_column_name: "_lakelogic_contract_name"

  capture_domain: true
  domain_column_name: "_lakelogic_domain"

  capture_system: true
  system_column_name: "_lakelogic_system"

  capture_created_at: true
  created_at_column_name: "_lakelogic_created_at"

  capture_created_by: true
  created_by_column_name: "_lakelogic_created_by"
  created_by_override: "etl_pipeline_svc"

  # Upstream lineage
  preserve_upstream: ["_upstream_run_id", "_upstream_source"]
  upstream_prefix: "_upstream"
  run_id_source: "run_id"  # run_id | pipeline_run_id
```
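Conceptually, the block above tells the engine to append columns like these to each record. A hedged sketch using the default column names from the contract (an illustrative helper, not LakeLogic's implementation; the `created_at`/`created_by` trackers are omitted for brevity):

```python
from datetime import datetime, timezone

def inject_lineage(record, *, source_path, run_id, contract_name, domain, system):
    """Append the main lineage columns configured above to one record."""
    return {
        **record,
        "_lakelogic_source": source_path,
        "_lakelogic_processed_at": datetime.now(timezone.utc).isoformat(),
        "_lakelogic_run_id": run_id,
        "_lakelogic_contract_name": contract_name,
        "_lakelogic_domain": domain,
        "_lakelogic_system": system,
    }

row = inject_lineage(
    {"customer_id": 42},
    source_path="landing/crm/2026-04-12.json",
    run_id="run-0001",
    contract_name="bronze_crm_customers",
    domain="marketing",
    system="crm",
)
print(sorted(k for k in row if k.startswith("_lakelogic")))
```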
### Schema Policy

The `schema_policy` block acts as a gatekeeper, controlling how the pipeline should react when the upstream data source unexpectedly changes its schema (e.g., adding undocumented columns or changing data types).

```yaml
schema_policy:
  evolution: "allow"       # strict | compatible | allow (default: allow)
  unknown_fields: "allow"  # quarantine | drop | allow (default: allow)
```

> [!NOTE]
> **Default: `allow` for frictionless prototyping.** Both `evolution` and `unknown_fields` default to `"allow"`. This means contracts without an explicit `schema_policy` block will accept new columns and schema changes automatically — ideal for rapid development and Bronze-layer ingestion. When you're ready to harden a production pipeline, explicitly set `evolution: "strict"` and/or `unknown_fields: "quarantine"` in your contract or `_system.yaml`.
#### Evolution (`evolution`)

Controls how the engine reacts when the source data's schema changes (e.g., fields are added or data types change).

- `strict`: Any change to the schema causes the pipeline to fail with a clear error.
  - **Use Case:** Highly regulated environments (Finance, Healthcare) or Bronze-to-Silver pipelines where you want explicit human approval before any contract evolution is allowed.
- `compatible`: New fields are allowed, and "safe" type promotions (e.g., `INT` to `LONG`) are permitted, but dangerous type changes (e.g., `STRING` to `INT`) are routed to quarantine (and will only fail the pipeline if `fail_on_quarantine: true` is set).
  - **Use Case:** Agile environments where source teams frequently add new metrics or dimensions and you don't want the pipeline failing constantly, but you still need downstream queries to remain stable.
- `allow` (Default): All schema changes, including type overrides and missing required fields, are allowed through.
  - **Use Case:** Ingesting raw JSON into the Landing/Bronze layer where you simply want 100% of the data persisted, regardless of structural changes upstream. Great for prototyping and rapid iteration.
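The three modes can be sketched as a small classifier. The promotion table here is an assumption made for illustration (the engine's actual list of "safe" promotions may differ), and `classify_type_change` is not a LakeLogic API:

```python
# Assumed set of safe widenings for the sketch only.
SAFE_PROMOTIONS = {
    ("int", "long"), ("int", "double"), ("long", "double"),
    ("date", "timestamp"),
}

def classify_type_change(old, new, mode):
    """Return the action ('pass' | 'quarantine' | 'fail') for a type change."""
    if old == new:
        return "pass"
    if mode == "allow":
        return "pass"        # everything goes through
    if mode == "strict":
        return "fail"        # any change fails the run with a clear error
    # mode == "compatible": safe widenings pass, anything else is quarantined
    return "pass" if (old, new) in SAFE_PROMOTIONS else "quarantine"

print(classify_type_change("int", "long", "compatible"))    # pass
print(classify_type_change("string", "int", "compatible"))  # quarantine
print(classify_type_change("int", "long", "strict"))        # fail
```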
#### Unknown Fields (`unknown_fields`)

Controls what happens when a field arrives in the source data that is not explicitly defined in the contract's `model.fields` block.

- `quarantine`: The row is routed to the quarantine table so analytical tables stay clean, but no data is lost.
  - **Use Case:** Ensuring zero data loss while strictly maintaining the integrity of the Silver/Gold tables. You want to see the unexpected fields before deciding whether to modify the contract to include them.
- `drop`: The row proceeds through the pipeline, but the undefined field is silently stripped out.
  - **Use Case:** Ingesting wide tables (e.g., a massive 500-column Salesforce export) where you only care about 15 specific fields. Dropping the rest saves compute and storage.
- `allow` (Default): The row proceeds through the pipeline and the undefined field is actively preserved. This relies entirely on your physical `schema_evolution` (`allow` or `compatible`) permitting structural mutations of the target table; otherwise the final Delta write will fail.
  - **Use Case:** Highly resilient continuous-ingestion pipelines (like Bronze APIs) where your goal is zero data loss, preserving all unknown metadata columns for future analytical evaluation without halting the pipeline. Also the recommended default for prototyping.
### Server Schema Directives

While `schema_policy` governs the logical validation layer, the target storage engine and input parsing behaviors are governed by server-level directives. These map directly to physical database interactions and can be configured globally via `server` in `_system.yaml` or overridden per-contract in the `server:` block.
```yaml
# Inside _system.yaml (Global Defaults — per-layer)
server:
  bronze:
    mode: "ingest"
    schema_policy:
      evolution: "append"           # accept new fields automatically
      unknown_fields: "allow"       # preserve undocumented columns
    cast_to_string: true            # "all strings" bronze pattern
  silver:
    mode: "validate"
    schema_policy:
      evolution: "strict"           # curated — no surprises
      unknown_fields: "quarantine"  # route unknown columns to quarantine
  gold:
    mode: "validate"
    schema_policy:
      evolution: "strict"           # business layer — tightly controlled
      unknown_fields: "quarantine"

# ─────────────────────────────────────────────────────────────
# Inside bronze_messy_api_v1.0.yaml (Local Override)
server:
  type: local
  path: "."
  schema_policy:
    evolution: "allow"        # This contract overrides the system default
    unknown_fields: "allow"
```
#### Deprecated Keys

The following root-level server keys are deprecated and will emit warnings during validation:

- `schema_evolution` → use `schema_policy.evolution`
- `allow_schema_drift` → use `schema_policy.unknown_fields`

See Schema Validation API for migration details.
#### Cast to String (`cast_to_string`)

A boolean flag that instructs the engine to forcibly coerce all incoming data fields into strings, entirely bypassing any type definitions or inferences made by the source format.

- `true`: Every parsed field is immediately and safely cast to `STRING`/`VARCHAR` during the initial read phase.
  - **Use Case:** The recommended pattern for raw "Bronze" ingestion. By forcing everything to strings, you eliminate data type mismatch errors entirely (e.g., an upstream developer unexpectedly changing an `integer` status ID to an alphanumeric `string` code). You safely land 100% of the raw data, preserving its exact state, and perform structured type-casting and quality gating later in the pipeline when moving from Bronze to Silver.
- `false` (Default): Fields maintain their native or inferred strong types during data parsing and materialization.
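In plain Python terms, `cast_to_string: true` amounts to a per-row coercion like the one below. This is a sketch only; whether NULLs stay NULL (rather than becoming the string `"None"`) is an assumption made here:

```python
def cast_row_to_strings(row):
    """The 'all strings' bronze pattern: coerce every value to a string at
    read time, deferring real typing to the Bronze-to-Silver step.
    NULLs are kept as NULL (an assumption for this sketch)."""
    return {k: (None if v is None else str(v)) for k, v in row.items()}

print(cast_row_to_strings({"status_id": 7, "active": True, "note": None}))
# {'status_id': '7', 'active': 'True', 'note': None}
```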