Quality Rules

Quality rules are your data's safety net — and they're SQL-powered. Every rule, from simple null checks to complex business logic, is expressed as SQL under the hood. The YAML shorthand (not_null:, range:, accepted_values:) compiles to SQL, and you can always write custom SQL rules directly.

Bottom line: If you can write a SQL WHERE clause, you can write a quality rule. No Python, no custom code — just SQL.
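To make the "YAML compiles to SQL" idea concrete, here is a minimal sketch of how the shorthands might lower to SQL WHERE-clause predicates. The helper functions are hypothetical, purely for illustration — they are not the engine's real internals or API:

```python
# Hypothetical sketch: how YAML shorthands could map to SQL predicates.
# These helpers are illustrative only, not the engine's actual compiler.

def compile_not_null(field):
    return f"{field} IS NOT NULL"

def compile_accepted_values(field, values):
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"{field} IN ({quoted})"

def compile_range(field, lo, hi, inclusive=True):
    if inclusive:
        return f"{field} BETWEEN {lo} AND {hi}"
    return f"{field} > {lo} AND {field} < {hi}"

print(compile_not_null("email"))                          # email IS NOT NULL
print(compile_accepted_values("status", ["ACTIVE", "INACTIVE"]))
print(compile_range("age", 18, 120))                      # age BETWEEN 18 AND 120
```

Whatever the real compilation looks like, the mental model holds: each rule is a boolean SQL expression that a row either passes or fails.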

Quality validates data at two levels:

  • Row-level rules — inspect each row individually and quarantine bad ones
  • Dataset-level rules — check aggregate properties of the entire dataset

Think of it like airport security. Row rules are like the X-ray scanner — each bag (row) goes through individually, and anything suspicious gets pulled aside (quarantined). Dataset rules are like checking the total passenger count against the manifest — if the numbers don't add up, something is wrong.


Execution Timing

This is the actual sequence the engine follows for quality checks:

  1. Pre-transforms complete: raw data has been cleaned/renamed
  2. Pre quality rules (default): validate source columns
  3. Good/bad split: rows failing error-severity rules are quarantined
  4. Post-transforms complete: derived columns now exist
  5. Post quality rules (phase: post): validate derived columns
  • Pre-phase rules can only reference source columns
  • Post-phase rules can reference source and derived columns
  • Errors are tagged [pre] or [post] in _lakelogic_errors for traceability

Configuration

quality:
  enforce_required: true                 # Auto-generate not_null rules for required fields
  fail_pipeline_on_dataset_error: false  # If true, aborts the entire pipeline if a dataset rule fails

Row-Level Rules

Not Null

Example: Require key fields to be present

row_rules:
  # Simple
  - not_null: "email"

  # With config
  - not_null:
      field: "customer_id"
      name: "customer_id_required"
      category: "completeness"
      description: "Customer ID must be present"
      severity: "error"          # error (quarantine) | warning (log) | info

  # Multiple fields
  - not_null:
      fields: ["email", "status", "created_at"]
      category: "completeness"

Accepted Values

Example: Restrict status to valid options

  - accepted_values:
      field: "status"
      values: ["ACTIVE", "INACTIVE", "PENDING", "SUSPENDED"]
      category: "consistency"

Regex Pattern

Example: Validate email format

  - regex_match:
      field: "email"
      pattern: "^[^@]+@[^@]+\\.[^@]+$"
      category: "correctness"
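The sketch below checks this pattern against a few sample values using Python's re module — note this is only an illustration of what the pattern accepts; the engine evaluates it in SQL, and regex dialects can differ slightly between backends:

```python
import re

# The email pattern from the rule above: something@something.something,
# with no extra @ signs. Python's re is used here only for illustration.
pattern = r"^[^@]+@[^@]+\.[^@]+$"

for email in ["alice@example.com", "no-at-sign.com", "a@b"]:
    ok = re.match(pattern, email) is not None
    print(email, "->", "pass" if ok else "fail")
```

"alice@example.com" passes; "no-at-sign.com" (no @) and "a@b" (no dot after the @) fail.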

Numeric Range

Example: Validate age is realistic

  - range:
      field: "age"
      min: 18
      max: 120
      inclusive: true
      category: "correctness"
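With inclusive: true, the boundary values 18 and 120 themselves pass. A plausible SQL equivalent is a BETWEEN predicate, sketched here with SQLite purely for illustration (the table name source mirrors the examples in this page; it is not necessarily what the engine uses internally):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (age INTEGER)")
con.executemany("INSERT INTO source VALUES (?)", [(17,), (18,), (120,), (121,)])

# BETWEEN is inclusive on both ends, matching inclusive: true
rows = con.execute(
    "SELECT age FROM source WHERE age BETWEEN 18 AND 120 ORDER BY age"
).fetchall()
print(rows)  # [(18,), (120,)]
```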

Referential Integrity

Example: Ensure country codes exist in reference data

  - referential_integrity:
      field: "country_code"
      reference: "dim_countries"
      key: "code"
      category: "consistency"
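Conceptually this rule amounts to an IN-subquery against the reference dataset: a row fails when its key has no match. A sketch using SQLite for illustration, with the field and table names from the example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (country_code TEXT)")
con.execute("CREATE TABLE dim_countries (code TEXT)")
con.executemany("INSERT INTO source VALUES (?)", [("DE",), ("XX",)])
con.executemany("INSERT INTO dim_countries VALUES (?)", [("DE",), ("FR",)])

# Rows whose country_code is missing from the reference are the failures
bad = con.execute(
    "SELECT country_code FROM source "
    "WHERE country_code NOT IN (SELECT code FROM dim_countries)"
).fetchall()
print(bad)  # [('XX',)]
```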

Custom SQL Rule

Example: Block disposable email addresses

  - name: "email_domain_valid"
    sql: "email NOT LIKE '%@temp-mail.%' AND email NOT LIKE '%@disposable.%'"
    category: "validity"
    severity: "error"

  # Post-phase rule (runs AFTER transforms)
  - name: "derived_date_not_null"
    sql: "event_date_parsed IS NOT NULL"
    phase: "post"
    category: "correctness"
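Because a row rule is just a boolean SQL expression, the good/bad split reduces to filtering on the expression and its negation. A rough sketch using the email rule above (SQLite in-memory, illustrative only — not the engine's actual execution plan):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (email TEXT)")
con.executemany("INSERT INTO source VALUES (?)",
                [("a@corp.com",), ("b@temp-mail.org",)])

# The rule's SQL, verbatim; good rows satisfy it, bad rows its negation
rule = "email NOT LIKE '%@temp-mail.%' AND email NOT LIKE '%@disposable.%'"
good = con.execute(f"SELECT email FROM source WHERE {rule}").fetchall()
bad = con.execute(f"SELECT email FROM source WHERE NOT ({rule})").fetchall()
print(good, bad)  # [('a@corp.com',)] [('b@temp-mail.org',)]
```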

Dataset-Level Rules

These check aggregate properties of the entire good dataset — not individual rows.

Uniqueness

Example: Ensure no duplicate customer IDs

dataset_rules:
  - unique: "customer_id"

  - unique:
      field: "email"
      name: "email_unique"
      category: "uniqueness"
      severity: "error"
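A uniqueness rule can be thought of as a duplicate-count query over the whole dataset: the rule fails if any key value occurs more than once. Sketch (SQLite, illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (customer_id INTEGER)")
con.executemany("INSERT INTO source VALUES (?)", [(1,), (2,), (2,)])

# Any row returned here is a duplicated key, so the rule would fail
dupes = con.execute(
    "SELECT customer_id, COUNT(*) FROM source "
    "GROUP BY customer_id HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [(2, 2)]
```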

Null Ratio

Example: Phone numbers can be null, but not more than 20%

  - null_ratio:
      field: "phone"
      max: 0.20                  # Max 20% null values
      category: "completeness"
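Under the hood this is a single aggregate: the fraction of NULLs compared against max. A sketch (SQLite for illustration; the 1.0 * is there because SQLite divides integers as integers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (phone TEXT)")
con.executemany("INSERT INTO source VALUES (?)",
                [("555-0101",), (None,), ("555-0102",), ("555-0103",)])

# Fraction of NULL phones; 1.0 * forces float division in SQLite
ratio = con.execute(
    "SELECT 1.0 * SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END) / COUNT(*) "
    "FROM source"
).fetchone()[0]
print(ratio)  # 0.25, which exceeds max: 0.20, so the rule would fail
```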

Row Count

Example: Ensure dataset has between 1K and 10M rows

  - row_count_between:
      min: 1000
      max: 10000000
      category: "completeness"

Custom SQL

Example: At least 60% of customers must be active

  - name: "active_customer_ratio"
    sql: "SELECT SUM(CASE WHEN status = 'ACTIVE' THEN 1 ELSE 0 END) / COUNT(*) FROM source"
    must_be_greater_than: 0.60
    category: "validity"
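The same aggregate can be exercised directly. The sketch below runs it in SQLite purely for illustration — note the 1.0 * cast, because SQLite (unlike engines such as Spark SQL) performs integer division on integer operands, and the actual behavior depends on your backend:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source (status TEXT)")
con.executemany("INSERT INTO source VALUES (?)",
                [("ACTIVE",), ("ACTIVE",), ("ACTIVE",), ("INACTIVE",)])

# Same aggregate as the rule above; 1.0 * avoids SQLite integer division
ratio = con.execute(
    "SELECT 1.0 * SUM(CASE WHEN status = 'ACTIVE' THEN 1 ELSE 0 END) "
    "/ COUNT(*) FROM source"
).fetchone()[0]
print(ratio, ratio > 0.60)  # 0.75 True: passes must_be_greater_than: 0.60
```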

Severity Levels

  • error: quarantine the row (default). Bad data never reaches downstream consumers.
  • warning: log the issue and keep the row. Data flows through, but the team is alerted.
  • info: log only, no action. Informational tracking for observability.

Quarantine Configuration

When rows fail quality rules, they're sent to quarantine — a separate table where you can inspect and fix problems without blocking the pipeline.

Example: Quarantine setup

quarantine:
  enabled: true
  target: "s3://quarantine-bucket/customers"
  include_error_reason: true     # Include _lakelogic_errors column
  format: "parquet"              # parquet | csv | delta | json
  write_mode: "append"           # append | overwrite
  strict_notifications: true     # Fail pipeline if notification fails

See Notifications for alert configuration.