Reprocessing & Partitioning

Note: The OSS release supports local materialization, Spark table writes, and table-only warehouse adapters. File staging into warehouses is not included.

In a professional Data Lakehouse, you don't just "upload" data. You need a way to handle Late Arriving Data and Reprocessing without creating a mess.

LakeLogic makes your pipelines Idempotent (meaning you can run them multiple times safely).

1. Partitioning

Partitioning is like putting your data into folders by date. This makes it much faster to read and easier to manage.

materialization:
  strategy: append
  partition_by: ["event_date"]
  reprocess_policy: overwrite_partition
  target_path: output/silver_pos_sales_events
  format: csv  # use delta/iceberg with Spark

When LakeLogic runs with this config, it ensures data is written to the correct "folder" (event_date=2024-01-01).
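The resulting folder layout can be sketched with plain Python. This is an illustrative mock (the rows, file names, and writing logic are made up for the example; LakeLogic does this for you), showing the Hive-style `event_date=YYYY-MM-DD` directories the config above produces:

```python
from pathlib import Path
import csv

# Illustrative rows only; in practice LakeLogic writes your validated data.
rows = [
    {"order_id": 1, "event_date": "2024-01-01", "amount": 10.0},
    {"order_id": 2, "event_date": "2024-01-01", "amount": 5.5},
    {"order_id": 3, "event_date": "2024-01-02", "amount": 7.25},
]

target = Path("output/silver_pos_sales_events")

# One folder per partition value, mirroring event_date=YYYY-MM-DD/.
for date in sorted({r["event_date"] for r in rows}):
    part_dir = target / f"event_date={date}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "event_date", "amount"])
        writer.writeheader()
        writer.writerows(r for r in rows if r["event_date"] == date)
```

Readers that filter on event_date can then skip whole folders instead of scanning every file.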

2. Handling Late Arriving Data

What if a sale from yesterday arrives today?

If you use strategy: merge, LakeLogic doesn't care when the data arrives. It will:

  1. Look for the primary_key (e.g., order_id).
  2. If it exists: update the old record.
  3. If it's new: insert the new record.

This ensures your reports are always accurate, even if the internet was slow yesterday.
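The merge behaviour above can be sketched with plain dicts. This is a minimal illustration of the upsert idea (the data is made up; order_id is the primary_key from the contract), not LakeLogic's actual implementation:

```python
# Rows already in the Silver table, keyed by the primary key.
existing = {
    101: {"order_id": 101, "amount": 20.0, "event_date": "2024-01-01"},
    102: {"order_id": 102, "amount": 15.0, "event_date": "2024-01-01"},
}

# A late-arriving batch: one correction, one brand-new sale.
late_batch = [
    {"order_id": 102, "amount": 18.0, "event_date": "2024-01-01"},
    {"order_id": 103, "amount": 9.0, "event_date": "2024-01-01"},
]

for row in late_batch:
    # If the key exists, the old record is updated; otherwise it's inserted.
    existing[row["order_id"]] = row
```

Running the same batch twice leaves `existing` unchanged, which is exactly the idempotency guarantee described above.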

3. Reprocessing (The "Delete & Re-run" Pattern)

Sometimes you find a bug in your logic and need to fix data from the last 30 days.

sequenceDiagram
    participant D as Developer
    participant L as LakeLogic
    participant S as Silver Layer

    D->>L: Update Logic (Add new column)
    D->>L: Run for last 30 days
    L->>S: overwrite_partition (Clear old data)
    L->>S: Insert Fixed Data

By using reprocess_policy: overwrite_partition, LakeLogic handles the clean-up step for you. It deletes the data for the specific days you are re-running and replaces it with the new, fixed data.

If you want an extra safety guard, use reprocess_policy: overwrite_partition_safe. This writes the new partition to a temporary folder first and only swaps it into place after the write succeeds.
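The write-then-swap idea behind overwrite_partition_safe can be sketched as follows. This is a simplified local-filesystem illustration (the function name and writer callback are made up for the example), not LakeLogic's code:

```python
import shutil
from pathlib import Path

def overwrite_partition_safe(partition_dir: Path, write_fn) -> None:
    """Write to a temporary sibling folder, then swap it into place only
    after the write succeeds, so a failed run never clobbers good data."""
    tmp_dir = partition_dir.with_name(partition_dir.name + ".tmp")
    shutil.rmtree(tmp_dir, ignore_errors=True)
    tmp_dir.mkdir(parents=True)
    try:
        write_fn(tmp_dir)  # may raise; the old partition stays untouched
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    shutil.rmtree(partition_dir, ignore_errors=True)
    tmp_dir.rename(partition_dir)  # swap the finished write into place

# Usage with a trivial writer:
part = Path("output/demo/event_date=2024-01-01")
overwrite_partition_safe(
    part,
    lambda d: (d / "part-0000.csv").write_text("order_id,amount\n1,10.0\n"),
)
```

If the writer raises, the temporary folder is discarded and the existing partition is left exactly as it was.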

Key Benefits:

  • No Duplicates: Running the same job twice won't double your sales numbers.
  • Safety: LakeLogic won't "Overwrite" your whole table by accident; it only touches the partitions it needs.
  • Speed: By using partition_by, the engine only reads the data it needs to work on.

4. Quarantine Reprocessing Strategy

Quarantine is best treated as an audit trail. When corrected data arrives, re-run the contract against the corrected batch and write the fixed records into the target table using a safe reprocessing policy.

A) Partition overwrite (batch corrections)

materialization:
  strategy: append
  partition_by: ["event_date"]
  reprocess_policy: overwrite_partition_safe
  target_path: s3://lake/silver/pos_sales_events
  format: parquet

Re-running the corrected batch will replace only the affected partition.

B) Merge by primary key (dimension corrections)

primary_key: ["customer_id"]
materialization:
  strategy: merge
  target_path: s3://lake/silver/crm_customers
  format: parquet

This updates existing rows and inserts new ones without duplicating data.

Quarantine as a Table (Lakehouse)

You can store quarantine rows in a table and keep it as an immutable audit log.

quarantine:
  target: "table:main.silver.quarantine_pos_sales_events"
metadata:
  quarantine_table_backend: spark
  quarantine_table_format: iceberg

If you want to track which rows were reprocessed, enable lineage and update your quarantine table using the run id.

lineage:
  enabled: true

Then mark the rows in a follow-up SQL step (example for Spark):

UPDATE main.silver.quarantine_pos_sales_events
SET quarantine_reprocessed = true
WHERE _lakelogic_run_id = '<run_id_from_reprocess>';

5. Date-Range Reprocessing (Backfill)

Sometimes you need to reprocess a specific date window, for example after fixing a bug in a transformation that affected two weeks of data.

Contract Config

Tell LakeLogic which column to filter on:

materialization:
  strategy: append
  partition_by: ["event_date"]
  reprocess_policy: overwrite_partition
  reprocess_date_column: event_date   # optional; defaults to first partition_by
  target_path: output/silver_pos_sales_events
  format: parquet

If reprocess_date_column is not set, LakeLogic automatically uses the first partition_by column.
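That fallback rule can be sketched in a few lines. The function name and dict shape here are illustrative, not LakeLogic's internal API:

```python
def resolve_reprocess_column(materialization: dict) -> str:
    """Use reprocess_date_column if set, otherwise fall back to the
    first partition_by column (the documented default behaviour)."""
    explicit = materialization.get("reprocess_date_column")
    if explicit:
        return explicit
    partition_by = materialization.get("partition_by") or []
    if not partition_by:
        raise ValueError(
            "date-range reprocessing needs reprocess_date_column or partition_by"
        )
    return partition_by[0]

# With the config above, both of these resolve to "event_date":
resolve_reprocess_column({"partition_by": ["event_date"]})
resolve_reprocess_column({"reprocess_date_column": "event_date"})
```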

Python API

from lakelogic import DataProcessor

processor = DataProcessor(
    contract="contracts/silver_pos_sales_events.yaml",
    engine="polars",
)

# Reprocess a specific date range
result = processor.run_source(
    reprocess_from="2026-01-15",
    reprocess_to="2026-02-01",
)

# What happens under the hood:
# 1. Loads ALL source data (incremental watermark is bypassed)
# 2. Filters to rows where event_date is between 2026-01-15 and 2026-02-01
# 3. Validates against the contract
# 4. Materializes with overwrite_partition for affected dates only

Databricks Pipeline Driver

The reference architecture pipeline driver exposes reprocessing as job widgets:

reprocess_from:  2026-01-15
reprocess_to:    2026-02-01

When these are left blank, the pipeline runs in normal incremental mode.

Key Behaviours

Scenario                   What happens
-------------------------  ---------------------------------------------
reprocess_from only        Filters rows >= that date (open-ended)
reprocess_to only          Filters rows <= that date (open-ended)
Both set                   Filters rows within the closed range
Neither set                Normal run (incremental or full)
partition_by configured    Only affected partitions are overwritten
strategy: merge            Upserts by primary key within the date range
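The date-window behaviours in the table can be condensed into one predicate. This is a standalone illustration of the documented semantics (the function name is made up), not the engine's code:

```python
from datetime import date

def in_reprocess_window(event_date, reprocess_from=None, reprocess_to=None):
    """Open-ended when only one bound is set, closed range when both
    are set, and a pass-through (normal run) when neither is set."""
    if reprocess_from is not None and event_date < reprocess_from:
        return False
    if reprocess_to is not None and event_date > reprocess_to:
        return False
    return True

rows = [date(2026, 1, 10), date(2026, 1, 20), date(2026, 2, 5)]

# Both bounds set: only rows inside the closed range survive.
window = [d for d in rows if in_reprocess_window(d, date(2026, 1, 15), date(2026, 2, 1))]
```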