Quality Rules
Quality rules are your data's safety net — and they're SQL-powered. Every rule, from simple null checks to complex business logic, is expressed as SQL under the hood. The YAML shorthand (not_null:, range:, accepted_values:) compiles to SQL, and you can always write custom SQL rules directly.
Bottom line: If you can write a SQL `WHERE` clause, you can write a quality rule. No Python, no custom code — just SQL.
Quality validates data at two levels:
- Row-level rules — inspect each row individually and quarantine bad ones
- Dataset-level rules — check aggregate properties of the entire dataset
Think of it like airport security. Row rules are like the X-ray scanner — each bag (row) goes through individually, and anything suspicious gets pulled aside (quarantined). Dataset rules are like checking the total passenger count against the manifest — if the numbers don't add up, something is wrong.
Execution Timing
This is the actual sequence the engine follows for quality checks:
| Step | Stage | What Happens |
|---|---|---|
| 1 | Pre-transforms complete | Raw data has been cleaned/renamed |
| 2 | Pre quality rules (default) | Validate source columns |
| 3 | Good/bad split | Quarantine rows failing error-severity rules |
| 4 | Post-transforms complete | Derived columns now exist |
| 5 | Post quality rules (phase: post) | Validate derived columns |
- Pre-phase rules can only reference source columns
- Post-phase rules can reference source and derived columns
- Errors are tagged `[pre]` or `[post]` in `_lakelogic_errors` for traceability
Configuration
```yaml
quality:
  enforce_required: true                  # Auto-generate not_null rules for required fields
  fail_pipeline_on_dataset_error: false   # If true, abort the entire pipeline when a dataset rule fails
```
Row-Level Rules
Not Null
Example: Require customer_id to be present
```yaml
row_rules:
  # Simple
  - not_null: "email"

  # With config
  - not_null:
      field: "customer_id"
      name: "customer_id_required"
      category: "completeness"
      description: "Customer ID must be present"
      severity: "error"   # error (quarantine) | warning (log) | info

  # Multiple fields
  - not_null:
      fields: ["email", "status", "created_at"]
      category: "completeness"
```
Accepted Values
Example: Restrict status to valid options
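A sketch of what this might look like, following the key structure of the `not_null` example above; the `values:` key name is an assumption:

```yaml
row_rules:
  - accepted_values:
      field: "status"
      values: ["active", "inactive", "pending"]   # assumed key name for the allow-list
      severity: "error"
```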
Regex Pattern
Example: Validate email format
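A possible shape for a regex rule, mirroring the `not_null` configuration style; the `regex:` rule name and `pattern:` key are assumptions:

```yaml
row_rules:
  - regex:                # assumed rule name
      field: "email"
      pattern: "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$"
      severity: "warning" # log but keep the row
```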
Numeric Range
Example: Validate age is realistic
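A sketch using the `range:` shorthand mentioned in the introduction; the `min:`/`max:` key names are assumptions:

```yaml
row_rules:
  - range:
      field: "age"
      min: 0        # assumed bound keys
      max: 120
      severity: "error"
```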
Referential Integrity
Example: Ensure country codes exist in reference data
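One way this could be expressed, by analogy with the rules above; the `foreign_key:` rule name and the reference keys are assumptions, not documented names:

```yaml
row_rules:
  - foreign_key:                  # assumed rule name
      field: "country_code"
      reference_dataset: "ref_countries"   # assumed keys for the lookup target
      reference_field: "code"
      severity: "error"
```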
Custom SQL Rule
Example: Block disposable email addresses
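Since every rule compiles to SQL, a custom rule is essentially a `WHERE`-style expression that rows must satisfy. A sketch, where the `sql:` rule name and `expression:` key are assumptions:

```yaml
row_rules:
  - sql:                          # assumed rule name for custom SQL
      name: "no_disposable_email"
      expression: "email NOT LIKE '%@mailinator.com'"   # rows failing this are quarantined
      severity: "error"
```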
Dataset-Level Rules
These check aggregate properties of the entire good dataset — not individual rows.
Uniqueness
Example: Ensure no duplicate customer IDs
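A sketch of a dataset-level uniqueness check; the `unique:` rule name is an assumption:

```yaml
dataset_rules:
  - unique:               # assumed rule name
      field: "customer_id"
      severity: "error"
```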
Null Ratio
Example: Phone numbers can be null, but not more than 20%
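A possible configuration for a null-ratio threshold; the `null_ratio:` rule name and `max:` key are assumptions:

```yaml
dataset_rules:
  - null_ratio:           # assumed rule name
      field: "phone"
      max: 0.2            # at most 20% of rows may have a null phone
      severity: "warning"
```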
Row Count
Example: Ensure dataset has between 1K and 10M rows
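A sketch of a row-count bound; the `row_count:` rule name and `min:`/`max:` keys are assumptions:

```yaml
dataset_rules:
  - row_count:            # assumed rule name
      min: 1000           # fewer rows suggests a broken upstream feed
      max: 10000000       # more suggests duplication or a runaway join
      severity: "error"
```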
Custom SQL
Example: At least 60% of customers must be active
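A custom dataset rule would be an aggregate SQL expression evaluated over the good dataset. A sketch, where the `sql:` rule name and `expression:` key are assumptions:

```yaml
dataset_rules:
  - sql:                  # assumed rule name for custom SQL
      name: "active_customer_ratio"
      expression: "AVG(CASE WHEN status = 'active' THEN 1 ELSE 0 END) >= 0.6"
      severity: "error"
```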
Severity Levels
| Severity | Behaviour | Business Impact |
|---|---|---|
| `error` | Quarantine the row (default) | Bad data never reaches downstream consumers |
| `warning` | Log the issue, keep the row | Data flows through, but the team is alerted |
| `info` | Log only, no action | Informational tracking for observability |
Quarantine Configuration
When rows fail quality rules, they're sent to quarantine — a separate table where you can inspect and fix problems without blocking the pipeline.
Example: Quarantine setup
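A minimal sketch of what quarantine configuration might look like; the `quarantine:` section and its key names are assumptions, not documented options:

```yaml
quality:
  quarantine:                      # assumed section and key names
    enabled: true
    table: "customers_quarantine"  # where rows failing error-severity rules land
```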
See Notifications for alert configuration.