HIPAA & GDPR Compliance with LakeLogic¶
Business Scenario¶
A global healthcare and e-commerce company operates across the United States and European Union. They ingest two types of sensitive data:
- US Patient Records — containing Protected Health Information (PHI) under HIPAA (names, SSNs, diagnoses)
- EU Customer Records — containing Personally Identifiable Information (PII) under GDPR (names, emails, phone numbers, consent status)
Both regulations require strict controls, but they differ in important ways:
| Requirement | HIPAA (US) | GDPR (EU) |
|---|---|---|
| Scope | Protected Health Information (PHI) | All personal data of EU/EEA residents |
| Consent | Implied for treatment | Must be explicit and recorded (Art. 7) |
| Lawful Basis | Not required per record | Must be one of six types (Art. 6) |
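| Right to Erasure | No general erasure right; patients may request amendment of their records | Right to erasure, subject to exceptions (Art. 17) |
| Data Retention | Compliance documentation retained at least 6 years; medical record retention is set by state law | Storage limitation required (Art. 5(1)(e)) |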
| PII Masking | De-identification of 18 PHI identifiers | Pseudonymisation or anonymisation |
Value Proposition¶
- Multi-Regulation Compliance — apply HIPAA and GDPR Policy Packs to different tables using the same framework
- Automated PII Protection — mask sensitive fields at ingestion, not as a downstream afterthought
- Audit-Ready Quarantine — every failed record is quarantined with the exact rule violation
- Portable Governance — same YAML contracts work on Polars (dev), Spark (prod), DuckDB (CI/CD)
Goals¶
- Apply HIPAA Policy Pack to US patient records — validate format, mask PHI
- Apply GDPR Policy Pack to EU customer records — enforce consent, lawful basis, retention
- Compare results side-by-side to understand the regulatory differences
Setup¶
import importlib.util
import os
import sys
import yaml
import polars as pl
from pathlib import Path

# Install lakelogic on first run if it is not already available.
if importlib.util.find_spec("lakelogic") is None:
    import subprocess

    subprocess.run([sys.executable, "-m", "pip", "install", "lakelogic", "-q"], check=True)
    print("lakelogic installed.")
else:
    print("lakelogic ready.")

# On Google Colab, clone the repository and switch to the example directory.
if "google.colab" in sys.modules:
    repo = Path("/content/LakeLogic")
    if not repo.exists():
        import subprocess

        subprocess.run(
            [
                "git",
                "clone",
                "--quiet",
                "https://github.com/lakelogic/LakeLogic.git",
                str(repo),
            ],
            check=True,
        )
    os.chdir(repo / "examples" / "03_compliance_governance" / "hipaa_gdpr_pii_masking")
    print(f"Working directory: {Path.cwd()}")


def resolve_example_dir(name: str) -> Path:
    # Walk up from the current directory until the example folder is found.
    cwd = Path.cwd()
    for base in [cwd] + list(cwd.parents):
        for candidate in [
            base / name,
            base / "examples" / "03_compliance_governance" / name,
        ]:
            if candidate.exists():
                return candidate
    return cwd / name


def resolve_repo_root(start: Path) -> Path:
    # The repo root is the first ancestor that contains the HIPAA policy pack.
    for base in [start] + list(start.parents):
        if (base / "policy_packs" / "hipaa_compliance.yaml").exists():
            return base
    return start


base_path = resolve_example_dir("hipaa_gdpr_pii_masking")
repo_root = resolve_repo_root(base_path)
print(f"Example dir: {base_path}")
print(f"Repo root : {repo_root}")
print("Setup complete.")
How It Works¶
LakeLogic's Policy Pack system lets you define compliance rules once and apply them across any number of contracts. Each regulation gets its own pack:
policy_packs/
    hipaa_compliance.yaml   ← SSN format, email validation, PHI de-identification
    gdpr_compliance.yaml    ← Consent (Art.7), lawful basis (Art.6), retention (Art.5)
contracts/
    medical_records.yaml    ← references hipaa_compliance + pii_masking hook
    eu_customers.yaml       ← references gdpr_compliance + pii_masking hook
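To make the pack-to-contract relationship concrete, here is a minimal sketch of how a pack's defaults might be combined with a contract's own rules. Only the `policy_pack` key and the `defaults.quality.row_rules` path appear in this playbook. The `resolve_rules` helper, the contract-level `quality` block, the sample rule names, and the append-style merge are all assumptions for illustration, and LakeLogic's real resolution logic may differ.

```python
from typing import Any

# Hypothetical in-memory stand-ins for the YAML files.
# Rule names and SQL snippets are illustrative, not copied from the real pack.
hipaa_pack: dict[str, Any] = {
    "defaults": {
        "quality": {
            "row_rules": [
                {"name": "ssn_format", "sql": "ssn LIKE '___-__-____'"},
                {"name": "email_valid", "sql": "email LIKE '%@%'"},
            ]
        }
    }
}

contract: dict[str, Any] = {
    "policy_pack": "hipaa_compliance",
    # Assumed: a contract may carry extra rules of its own.
    "quality": {"row_rules": [{"name": "diagnosis_present", "sql": "diagnosis IS NOT NULL"}]},
}


def resolve_rules(contract: dict[str, Any], packs: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Assumed semantics: pack rules come first, contract rules are appended."""
    pack = packs.get(contract.get("policy_pack", ""), {})
    pack_rules = pack.get("defaults", {}).get("quality", {}).get("row_rules", [])
    own_rules = contract.get("quality", {}).get("row_rules", [])
    return [*pack_rules, *own_rules]


rules = resolve_rules(contract, {"hipaa_compliance": hipaa_pack})
print([rule["name"] for rule in rules])
```

Under this reading, changing a rule in the pack changes it for every contract that names the pack, which is the point of managing compliance rules in one place.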
HIPAA vs GDPR — rules at a glance¶
| Pack | Key rules | PII treatment |
|---|---|---|
| HIPAA | SSN format, email validation | Fields → [PROTECTED] |
| GDPR | Consent required, lawful basis, retention date, email | Fields → [GDPR_REDACTED] |
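As a rough illustration of the format rules in the table, the sketch below checks an SSN and an email with plain regular expressions. The patterns, the rule names, and the `hipaa_violations` helper are assumptions; the actual expressions live in the policy pack YAML and may be stricter.

```python
import re

# Assumed patterns for illustration only.
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")            # XXX-XX-XXXX
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")   # deliberately loose syntax check


def hipaa_violations(row: dict[str, str]) -> list[str]:
    """Return the names of the format rules a row fails (empty list means it passes)."""
    failed = []
    if not SSN_RE.fullmatch(row.get("ssn") or ""):
        failed.append("ssn_format")
    if not EMAIL_RE.fullmatch(row.get("email") or ""):
        failed.append("email_valid")
    return failed


# Made-up sample rows, not taken from phi_records.csv.
print(hipaa_violations({"ssn": "123-45-6789", "email": "alice@example.com"}))
print(hipaa_violations({"ssn": "12345", "email": "frank-no-at-sign"}))
```

Returning rule names rather than a single boolean mirrors the idea that a quarantined record should say which rule it broke.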
Source data overview¶
US patient records (phi_records.csv): Bob — invalid email; Edward — malformed SSN
EU customer records (eu_customers.csv):
| Customer | Violation | GDPR Article |
|---|---|---|
| Sofia García | Invalid email format | Art. 5 (Accuracy) |
| Lars Eriksson | Missing retention date, empty phone | Art. 5(1)(e) (Storage Limitation) |
| Maria Rossi | Missing consent status | Art. 7 (Consent) |
| Hans Weber | Invalid lawful basis | Art. 6 (Lawful Basis) |
1. Load HIPAA Policy Pack¶
The HIPAA pack enforces SSN format, email validation, and strict schema evolution across every contract that references it.
hipaa_pack_path = repo_root / "policy_packs" / "hipaa_compliance.yaml"
with open(hipaa_pack_path, "r") as f:
    hipaa_pack = yaml.safe_load(f)
print("HIPAA Policy Pack rules:")
print(f" Framework : {hipaa_pack.get('defaults', {}).get('metadata', {}).get('compliance_framework', 'N/A')}")
print(f" Schema policy: {hipaa_pack.get('defaults', {}).get('schema_policy', {}).get('evolution', 'N/A')}")
print()
for rule in hipaa_pack.get("defaults", {}).get("quality", {}).get("row_rules", []):
    print(f" - {rule.get('name')}: {rule.get('description')}")
2. Inspect Raw Patient Records (PHI)¶
The source CSV contains Protected Health Information — patient names, SSNs, and emails. Bob has an invalid email; Edward has a malformed SSN.
phi_data_path = base_path / "data" / "phi_records.csv"
raw_phi = pl.read_csv(phi_data_path)
print("Raw Patient Records (contains PHI):")
display(raw_phi)
3. Run HIPAA Validation with PII Masking¶
Two layers activate in one run_source() call:
- Quality gates — SSN format and email validation from the HIPAA Policy Pack
- PII masking — names, SSNs, and emails are replaced with [PROTECTED] before Silver
from lakelogic import DataProcessor
hipaa_contract = base_path / "contracts" / "medical_records.yaml"
hipaa_processor = DataProcessor(contract=hipaa_contract)
hipaa_result = hipaa_processor.run_source(phi_data_path)
print("HIPAA Results:")
print(f" Source records : {len(hipaa_result.raw)}")
print(f" Valid (Silver) : {len(hipaa_result.good)}")
print(f" Quarantined : {len(hipaa_result.bad)}")
print(f" Reconciled : {len(hipaa_result.raw)} = {len(hipaa_result.good)} + {len(hipaa_result.bad)}")
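The two layers and the reconciliation line can be pictured with a small, library-free sketch: each row is checked against every rule, failing rows go to quarantine with their failure reasons, and passing rows are masked. This is a conceptual model only. The `split_and_mask` helper, the `_failed_rules` column, the simplified rules, and the choice to leave quarantined values unmasked are assumptions, not LakeLogic's implementation.

```python
from typing import Callable

Row = dict[str, str]
MASK = "[PROTECTED]"


def split_and_mask(
    rows: list[Row],
    rules: dict[str, Callable[[Row], bool]],
    mask_columns: set[str],
) -> tuple[list[Row], list[Row]]:
    """Route each row to good (masked) or bad (annotated with failed rule names)."""
    good, bad = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            # Assumed: quarantined rows keep their values and gain a reason column.
            bad.append({**row, "_failed_rules": ", ".join(failed)})
        else:
            good.append({k: (MASK if k in mask_columns else v) for k, v in row.items()})
    return good, bad


# Made-up sample rows, not the contents of phi_records.csv.
rows = [
    {"patient_name": "Alice", "ssn": "111-22-3333", "email": "alice@example.com"},
    {"patient_name": "Bob", "ssn": "222-33-4444", "email": "bob-at-example"},
]
rules = {
    "ssn_format": lambda r: len(r["ssn"]) == 11 and r["ssn"].count("-") == 2,
    "email_valid": lambda r: "@" in r["email"],
}

good, bad = split_and_mask(rows, rules, {"patient_name", "ssn", "email"})
assert len(rows) == len(good) + len(bad)  # reconciliation: every row lands somewhere
print(good)
print(bad)
```

Because every row is appended to exactly one of the two lists, the source count always equals valid plus quarantined, which is the property the "Reconciled" line reports.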
4. Inspect HIPAA Quarantine¶
Bob and Edward failed the HIPAA Policy Pack rules. Each quarantined record includes the exact failure reason for compliance audit.
print("QUARANTINE — Failed HIPAA Rules:")
display(hipaa_result.bad)
5. Inspect Masked Silver Layer (HIPAA)¶
Alice, Charlie, and Diane passed validation. Their patient_name, ssn, and email
are replaced with [PROTECTED] — safe for analytics teams, no raw PHI in Silver.
print("SILVER LAYER — Clean & Masked (HIPAA Safe):")
display(hipaa_result.good)
6. Load GDPR Policy Pack¶
GDPR requires different controls than HIPAA. The GDPR pack adds:
- Consent tracking (Art. 7) — every record must have explicit consent
- Lawful basis (Art. 6) — one of six legal justifications must be documented
- Storage limitation (Art. 5(1)(e)) — a retention expiry date must be set
- Data accuracy (Art. 5) — email must be valid
gdpr_pack_path = repo_root / "policy_packs" / "gdpr_compliance.yaml"
with open(gdpr_pack_path, "r") as f:
    gdpr_pack = yaml.safe_load(f)
print("GDPR Policy Pack rules:")
print(f" Framework: {gdpr_pack.get('defaults', {}).get('metadata', {}).get('compliance_framework', 'N/A')}")
print(f" Regions : {gdpr_pack.get('defaults', {}).get('metadata', {}).get('applicable_regions', 'N/A')}")
print()
for rule in gdpr_pack.get("defaults", {}).get("quality", {}).get("row_rules", []):
    print(f" - {rule.get('name')}: {rule.get('description')}")
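The four GDPR controls listed at the start of this step could be expressed as row-level predicates along the following lines. This is a hedged sketch: the column names come from this playbook, but the value types (a boolean consent flag, a `date` for retention), the spellings of the lawful-basis values, and the `gdpr_violations` helper are assumptions rather than the pack's actual rules.

```python
from datetime import date

# The six lawful bases of GDPR Art. 6(1); the string spellings are an assumption.
LAWFUL_BASES = {
    "consent",
    "contract",
    "legal_obligation",
    "vital_interests",
    "public_task",
    "legitimate_interests",
}


def gdpr_violations(row: dict) -> list[str]:
    """Return a label for every GDPR control the row fails."""
    failed = []
    if row.get("consent_given") is not True:
        failed.append("consent_required (Art. 7)")
    if row.get("lawful_basis") not in LAWFUL_BASES:
        failed.append("lawful_basis (Art. 6)")
    if not isinstance(row.get("retention_expires_at"), date):
        failed.append("retention_date (Art. 5(1)(e))")
    email = row.get("email") or ""
    if email.count("@") != 1 or "." not in email.split("@")[-1]:
        failed.append("email_valid (Art. 5)")
    return failed


# Made-up sample rows, not the contents of eu_customers.csv.
ok = {
    "consent_given": True,
    "lawful_basis": "consent",
    "retention_expires_at": date(2030, 1, 1),
    "email": "anna@example.eu",
}
broken = {
    "consent_given": None,
    "lawful_basis": "marketing",
    "retention_expires_at": None,
    "email": "not-an-email",
}
print(gdpr_violations(ok))
print(gdpr_violations(broken))
```

Tagging each failure with its article keeps the regulatory reference attached to the record, which is what makes the quarantine useful as audit evidence.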
7. Inspect Raw EU Customer Data¶
Four records have intentional GDPR violations across different articles.
eu_data_path = base_path / "data" / "eu_customers.csv"
raw_eu = pl.read_csv(eu_data_path)
print("Raw EU Customer Records (contains PII):")
display(raw_eu)
8. Run GDPR Validation with PII Masking¶
The EU customer contract validates consent, lawful basis, retention date, and email format, then pseudonymises PII fields in passing records.
gdpr_contract = base_path / "contracts" / "eu_customers.yaml"
gdpr_processor = DataProcessor(contract=gdpr_contract)
gdpr_result = gdpr_processor.run_source(eu_data_path)
print("GDPR Results:")
print(f" Source records : {len(gdpr_result.raw)}")
print(f" Valid (Silver) : {len(gdpr_result.good)}")
print(f" Quarantined : {len(gdpr_result.bad)}")
print(f" Reconciled : {len(gdpr_result.raw)} = {len(gdpr_result.good)} + {len(gdpr_result.bad)}")
9. Inspect GDPR Quarantine¶
Each quarantined record includes the specific GDPR article violation — exactly what Data Protection Officers (DPOs) need for audit evidence.
print("QUARANTINE — Failed GDPR Rules:")
display(gdpr_result.bad)
10. Inspect GDPR-Compliant Silver Layer¶
Valid records have PII pseudonymised ([GDPR_REDACTED]). Notice:
- full_name, email, and phone are masked
- consent_given, lawful_basis, and retention_expires_at are preserved (needed for compliance tracking)
- Non-PII fields (country_code, id) remain untouched
print("SILVER LAYER — GDPR-Compliant & Pseudonymised:")
display(gdpr_result.good)
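Replacing a value with a fixed token such as [GDPR_REDACTED] removes it entirely, so masked rows can no longer be joined or deduplicated on that field. Where a stable join key is still needed, keyed hashing is a common pseudonymisation technique. The sketch below contrasts the two using only the Python standard library. The function names and key handling are assumptions for illustration, and this is not a description of LakeLogic's masking hook.

```python
import hashlib
import hmac

REDACTED = "[GDPR_REDACTED]"


def redact(_value: str) -> str:
    """Constant-token masking: the original value is discarded and all rows look alike."""
    return REDACTED


def pseudonymise(value: str, key: bytes) -> str:
    """Keyed HMAC-SHA256: the same input and key always yield the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"demo-key-load-from-a-secret-store"  # placeholder; never hard-code a real key
a = pseudonymise("anna@example.eu", key)
b = pseudonymise("anna@example.eu", key)
c = pseudonymise("lars@example.eu", key)

print(redact("anna@example.eu"))
print(a == b, a == c, len(a))
```

The keyed variant still counts as personal data under GDPR for anyone holding the key, so the key needs the same protection as the raw values.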
11. Side-by-Side Results¶
print("COMPLIANCE SUMMARY")
print("=" * 60)
print(" HIPAA (US Healthcare)")
print(
    f" Source: {len(hipaa_result.raw):>3} | Valid: {len(hipaa_result.good):>3} | Quarantined: {len(hipaa_result.bad):>3}"
)
print(" Masking: patient_name, ssn, email -> [PROTECTED]")
print(" Rules: SSN format (XXX-XX-XXXX), email validation")
print()
print(" GDPR (EU Data Protection)")
print(
    f" Source: {len(gdpr_result.raw):>3} | Valid: {len(gdpr_result.good):>3} | Quarantined: {len(gdpr_result.bad):>3}"
)
print(" Masking: full_name, email, phone -> [GDPR_REDACTED]")
print(" Rules: consent (Art.7), lawful basis (Art.6), retention (Art.5), email")
print("=" * 60)
print(" Key difference:")
print(" HIPAA focuses on data FORMAT (SSN patterns, email syntax)")
print(" GDPR focuses on data RIGHTS (consent, legal basis, retention)")
print(" LakeLogic handles BOTH with the same contract framework.")
Summary¶
What LakeLogic did automatically¶
- Policy Packs as Compliance Templates — HIPAA and GDPR rules are defined once and applied across any number of contracts. Compliance teams manage the pack; data engineers just reference it.
- Automated PII Masking at Ingestion — PII is masked before data reaches Silver. Analysts never see raw identifiers.
- Audit-Ready Quarantine — every failed record includes the exact rule name, the regulatory article reference, and the failure reason.
- 100% Reconciliation — source = valid + quarantined. No silent data loss.
- Engine-Agnostic — same contracts run identically on Polars (dev), Spark (prod), or DuckDB (CI/CD). No compliance logic rewrite when changing platforms.
Next Steps — Try It Yourself¶
1. Add a new patient record with a violation¶
# data/phi_records.csv
P006,Frank,999-88-7777,frank-no-at-sign,Flu
2. Add a GDPR rule to an existing policy pack¶
# policy_packs/gdpr_compliance.yaml
defaults:
  quality:
    row_rules:
      - name: Phone Not Empty  # <-- add this
        sql: "phone IS NOT NULL AND phone != ''"
        description: "Art. 5 — contact data must be complete"
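Before committing a new rule, it can help to confirm how its SQL predicate treats NULLs and empty strings. The sketch below evaluates the predicate above against three made-up rows using Python's built-in `sqlite3` module. SQLite is used here only to illustrate the predicate's semantics; the engine LakeLogic uses to evaluate `sql` rules may differ, and the table and column layout are assumptions.

```python
import sqlite3

RULE_SQL = "phone IS NOT NULL AND phone != ''"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("C1", "+46 70 000 00 00"),  # populated phone
        ("C2", ""),                  # empty string
        ("C3", None),                # NULL
    ],
)

# Evaluate the predicate per row; COALESCE maps a NULL result to 0 (fail).
results = conn.execute(
    f"SELECT id, COALESCE(({RULE_SQL}), 0) FROM customers ORDER BY id"
).fetchall()
conn.close()

print(results)  # [('C1', 1), ('C2', 0), ('C3', 0)]
```

Both the empty string and the NULL fail the rule, so either kind of missing phone number would be quarantined.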
Key knobs:
| What to change | Where | Effect |
|---|---|---|
| Add a PHI column to mask | hooks.post_validate[pii_masking].columns | That column becomes [PROTECTED] in Silver |
| Swap policy pack | policy_pack: in contract | Switch between HIPAA / GDPR / custom |
| Add a quality rule | defaults.quality.row_rules in policy pack | Applies to all contracts using that pack |
3. Explore related playbooks¶
- tutorial_hipaa_compliance.ipynb — HIPAA only, simpler walkthrough
- ../../02_core_patterns/bronze_quality_gate/ — quality gating without compliance layer
- ../../notifications_and_secrets/ — alert on quarantine thresholds