Soft Delete Pattern¶
Business Scenario¶
In regulated environments (GDPR, HIPAA, SOX), hard-deleting rows removes audit history and makes recovery impossible. You need a consistent strategy that flags records as deleted, stamps when and why, and keeps every row available for compliance queries — without writing custom merge logic in every pipeline.
Value Proposition¶
- Preserve full audit trails — deleted rows stay in the table, just flagged
- Capture deletion timestamp and reason automatically
- Support easy recovery: flip `_lakelogic_is_deleted` back to `false`
- Standardize delete metadata across all datasets via a single contract setting
Goals¶
- Load an initial batch of active users
- Process a CDC delete signal — watch LakeLogic flag the row instead of removing it
- Filter active vs. soft-deleted rows
Setup¶
In [ ]:
import importlib.util
import os
import sys
import shutil

import polars as pl
from pathlib import Path

if importlib.util.find_spec("lakelogic") is None:
    import subprocess

    subprocess.run([sys.executable, "-m", "pip", "install", "lakelogic", "-q"], check=True)
    print("lakelogic installed.")
else:
    print("lakelogic ready.")

if "google.colab" in sys.modules:
    repo = Path("/content/LakeLogic")
    if not repo.exists():
        import subprocess

        subprocess.run(
            ["git", "clone", "--quiet", "https://github.com/lakelogic/LakeLogic.git", str(repo)],
            check=True,
        )
    os.chdir(repo / "examples" / "02_core_patterns" / "soft_delete")

def get_path(*parts):
    base = Path.cwd()
    for candidate in [base] + list(base.parents):
        target = candidate / "soft_delete" if candidate.name != "soft_delete" else candidate
        if (target / "contract.yaml").exists():
            return (target / Path(*parts)).resolve()
    return (base / Path(*parts)).resolve()

# Clear previous output for a clean run
output_dir = get_path("data", "users_silver")
if output_dir.exists():
    shutil.rmtree(output_dir)

from lakelogic import DataProcessor

print("Setup complete.")
How It Works¶
LakeLogic's merge strategy with soft-delete columns intercepts CDC delete signals
and, instead of removing the row, stamps it with configurable metadata columns.
Data flow¶
Batch 1 (Alice I, Bob I)
→ merge into silver table
→ _lakelogic_is_deleted = false for both
Batch 2 (Bob D ← CDC delete signal)
→ cdc_op_field detects op = "D"
→ instead of removing Bob:
_lakelogic_is_deleted = true
_lakelogic_deleted_at = <current UTC timestamp>
_lakelogic_delete_reason = "cdc_delete_signal"
→ Alice untouched
Key contract settings¶
| Setting | Value | What it does |
|---|---|---|
| `source.cdc_op_field` | `op` | Column that signals the operation type |
| `source.cdc_delete_values` | `["D"]` | Values in `op` that mean delete |
| `materialization.strategy` | `merge` | Upserts by primary key; applies soft-delete logic |
| `soft_delete_column` | `_lakelogic_is_deleted` | Boolean flag set to `true` when deleted |
| `soft_delete_time_column` | `_lakelogic_deleted_at` | Auto-stamped with the current UTC timestamp |
| `soft_delete_reason_column` | `_lakelogic_delete_reason` | Reason string — defaults to `cdc_delete_signal` |
Known data¶

| user_id | Name | Batch | op | Expected outcome |
|---|---|---|---|---|
| 1 | Alice | 1 | I | Active — `is_deleted = false` |
| 2 | Bob | 1 | I | Active — `is_deleted = false` |
| 2 | Bob | 2 | D | Soft-deleted — `is_deleted = true`, timestamp set |
1. Initial Load — Insert Active Users¶
Two users arrive with op = "I". Both are merged as active records.
In [ ]:
contract_path = get_path("contract.yaml")
output_path = get_path("data", "users_silver", "data.parquet")

processor = DataProcessor(contract=contract_path)
processor.run(
    [
        {"user_id": 1, "name": "Alice", "op": "I"},
        {"user_id": 2, "name": "Bob", "op": "I"},
    ],
    source_path="initial_load",
    materialize=True,
)

print("SILVER TABLE — BATCH 1 (initial load):")
display(
    pl.read_parquet(output_path).select(
        [
            "user_id",
            "name",
            "_lakelogic_is_deleted",
            "_lakelogic_deleted_at",
            "_lakelogic_delete_reason",
        ]
    )
)
2. Soft Delete — Process a CDC Delete Signal¶
Bob arrives with op = "D". LakeLogic stamps the row in-place; no hard delete.
In [ ]:
processor.run(
    [{"user_id": 2, "name": "Bob", "op": "D"}],
    source_path="cdc_batch_2",
    materialize=True,
)

print("SILVER TABLE — BATCH 2 (soft delete applied):")
display(
    pl.read_parquet(output_path).select(
        [
            "user_id",
            "name",
            "_lakelogic_is_deleted",
            "_lakelogic_deleted_at",
            "_lakelogic_delete_reason",
        ]
    )
)
3. Filter — Active vs. Deleted¶
Both rows are preserved. Use _lakelogic_is_deleted to split them for downstream views.
In [ ]:
full_table = pl.read_parquet(output_path)

print("ACTIVE (_lakelogic_is_deleted = false):")
display(full_table.filter(~pl.col("_lakelogic_is_deleted")).select(["user_id", "name"]))

print("SOFT-DELETED (_lakelogic_is_deleted = true):")
display(
    full_table.filter(pl.col("_lakelogic_is_deleted")).select(
        ["user_id", "name", "_lakelogic_deleted_at", "_lakelogic_delete_reason"]
    )
)
Summary¶
| Metric | Batch 1 | Batch 2 |
|---|---|---|
| Records processed | 2 inserts | 1 delete signal |
| Active rows after | 2 | 1 (Alice) |
| Soft-deleted rows after | 0 | 1 (Bob) |
| Hard deletes | 0 | 0 — row preserved |
What LakeLogic did automatically¶
- Intercepted `op = "D"` via `cdc_op_field` / `cdc_delete_values` — no conditional code needed
- Stamped `_lakelogic_is_deleted = true` on the matched row (by `primary_key: user_id`)
- Auto-filled `_lakelogic_deleted_at` with the current UTC timestamp
- Auto-filled `_lakelogic_delete_reason = "cdc_delete_signal"`
- Left Alice unchanged
Next Steps — Try It Yourself¶
1. Edit the source data¶
# Add a third user and supply a custom delete reason:
v1_rows = [
{"user_id": 1, "name": "Alice", "op": "I"},
{"user_id": 2, "name": "Bob", "op": "I"},
{"user_id": 3, "name": "Carol", "op": "I"}, # <-- new
]
v2_rows = [
{"user_id": 1, "name": "Alice", "op": "D",
"_lakelogic_delete_reason": "gdpr_erasure_request"}, # custom reason
]
Ideas:
- Delete a user never inserted → tombstone row created as already-deleted
- Mix inserts and deletes in the same batch
- Add `_lakelogic_delete_reason` to the source row to override the default reason
2. Edit the contract¶
source:
  cdc_op_field: "op"          # column that signals the operation type
  cdc_delete_values: ["D"]    # → try ["DELETE"] or ["d", "D"] for your CDC
  primary_key: ["user_id"]    # key used to match incoming row to existing record

materialization:
  strategy: merge
  soft_delete_column: "_lakelogic_is_deleted"            # boolean flag column
  soft_delete_value: true                                # value written when deleted
  soft_delete_time_column: "_lakelogic_deleted_at"       # auto-stamped UTC timestamp
  soft_delete_reason_column: "_lakelogic_delete_reason"  # overridable per-row
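The two `source` knobs meet in a single test: a row is treated as a delete when its `cdc_op_field` value appears in `cdc_delete_values`. A minimal sketch of that check (`classify_op` is a hypothetical helper for illustration, not part of the LakeLogic API):

```python
def classify_op(row, cdc_op_field="op", cdc_delete_values=("D",)):
    """Classify a CDC row: 'delete' if its op value is a delete signal, else 'upsert'."""
    return "delete" if row.get(cdc_op_field) in cdc_delete_values else "upsert"

# Default contract: only "D" means delete.
print(classify_op({"user_id": 2, "op": "D"}))  # -> delete
print(classify_op({"user_id": 1, "op": "I"}))  # -> upsert

# Widened contract, as in cdc_delete_values: ["d", "D"].
print(classify_op({"user_id": 2, "op": "d"}, cdc_delete_values=("d", "D")))  # -> delete
```

Changing `cdc_delete_values` in the contract is all it takes to adapt to a CDC feed that emits `"DELETE"` or lowercase codes.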
Key contract knobs:
| What to change | Where in contract.yaml | Effect |
|---|---|---|
| CDC operation column | `source.cdc_op_field` | Which column signals insert / delete |
| Delete signal values | `source.cdc_delete_values` | e.g. `["D"]`, `["DELETE"]` |
| Deleted flag column | `soft_delete_column` | Boolean column written on deleted rows |
| Deletion timestamp column | `soft_delete_time_column` | Auto-stamped; omit to skip |
| Deletion reason column | `soft_delete_reason_column` | Overridable per-row in source |
| Match key | `primary_key` | Column(s) used to find the existing row |
3. Explore related playbooks¶
- `../scd2_dimension/` — preserve full history with versioned rows
- `../reference_joins/` — enrich records in the same contract