Ingestion & Cloud

Think of ingestion as border control. You don't inspect every detail of what's crossing the border — but you do check passports (schema) and flag anything suspicious. The deep security check happens later (Silver layer).

LakeLogic acts as a schema gate for ingestion, ensuring that what lands in your Bronze layer has a known shape and a clean lineage — even if the data itself is messy.


Cloud Storage Support

LakeLogic reads from all major cloud storage providers:

| Provider | Path format |
| --- | --- |
| Amazon S3 | s3://my-bucket/raw_data/ |
| Google GCS | gs://my-bucket/raw_data/ |
| Azure ADLS | abfss://container@account.dfs.core.windows.net/path/ |
| Local files | ./data/raw_customers.parquet |
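Under the hood, the provider is determined by the URI scheme of the path. A minimal sketch of that dispatch, using only the Python standard library (the helper name `classify_path` is illustrative, not part of LakeLogic's API):

```python
from urllib.parse import urlparse

def classify_path(path: str) -> str:
    """Map a storage path to its provider by URI scheme.
    Illustrative helper; not LakeLogic code."""
    scheme = urlparse(path).scheme
    return {
        "s3": "Amazon S3",
        "gs": "Google GCS",
        "abfss": "Azure ADLS",
    }.get(scheme, "Local files")  # no scheme -> treat as a local path

print(classify_path("s3://my-bucket/raw_data/"))      # Amazon S3
print(classify_path("./data/raw_customers.parquet"))  # Local files
```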

Ingestion Mode (Raw → Bronze)

When moving data from external sources into Bronze, you want schema protection without heavy validation:

server:
  type: gcs
  path: gs://landing-zone/daily_extract/
  mode: ingest
  schema_policy:
    evolution: append

Schema Evolution Strategies

| Strategy | Behaviour | Best for |
| --- | --- | --- |
| strict | Fails if the incoming file doesn't match the schema exactly | Stable, well-governed sources |
| append | Allows new columns to pass through; existing columns must keep their types | APIs that add fields over time |
| merge | Evolves the schema to a superset that accommodates all incoming files | Multi-file ingestion with mixed schemas |
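The three strategies above can be sketched as a single function over schemas modelled as column-to-type dicts. This is an illustrative approximation of the described behaviour, not LakeLogic's implementation; in particular, widening conflicting types to string under merge is an assumption:

```python
def evolve_schema(current: dict, incoming: dict, strategy: str) -> dict:
    """Apply a schema evolution strategy. Schemas are {column: type-name} dicts."""
    if strategy == "strict":
        # Exact match required: any difference is a failure.
        if incoming != current:
            raise ValueError("strict: incoming schema does not match exactly")
        return dict(current)
    if strategy == "append":
        # New columns pass through; existing columns must keep their type.
        for col, typ in current.items():
            if col in incoming and incoming[col] != typ:
                raise ValueError(f"append: type change on existing column {col}")
        new_cols = {c: t for c, t in incoming.items() if c not in current}
        return {**current, **new_cols}
    if strategy == "merge":
        # Union of all fields; widen to string on a type conflict (assumption).
        merged = dict(current)
        for col, typ in incoming.items():
            if col in merged and merged[col] != typ:
                merged[col] = "string"
            else:
                merged[col] = typ
        return merged
    raise ValueError(f"unknown strategy: {strategy}")
```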

Schema Drift Protection

Think of schema drift like a supplier changing their packaging without telling you. The product inside might be fine, but if your warehouse is set up for boxes and they start sending tubes, things break.

LakeLogic detects unknown or missing fields during ingestion and can trigger alerts:

server:
  mode: ingest
  schema_policy:
    evolution: append
    unknown_fields: quarantine  # Alert when drift is detected

quarantine:
  notifications:
    - type: slack
      channel: "#data-alerts"
      on_events: ["schema_drift"]

Why this matters: You catch schema changes the moment they arrive, not three days later when a dashboard breaks.
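Conceptually, the drift check is a set comparison between the declared schema and the columns that actually arrived. A minimal sketch (the function name and return shape are assumptions for illustration):

```python
def detect_drift(expected: set, incoming: set) -> dict:
    """Compare incoming column names against the declared schema.
    Illustrative sketch of a drift check, not LakeLogic code."""
    return {
        "unknown_fields": sorted(incoming - expected),  # candidates for quarantine
        "missing_fields": sorted(expected - incoming),
    }

drift = detect_drift({"user_id", "signup_date"},
                     {"user_id", "signup_date", "referrer"})
if drift["unknown_fields"] or drift["missing_fields"]:
    # In production, this is the point where a schema_drift event
    # would fire and the Slack notification above would be sent.
    print("schema_drift:", drift)
```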


Cleanse-on-Arrival

Bronze data often arrives with duplicates or soft-deleted records. Clean them at the gate:

transformations:
  - sql: |
      SELECT * FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM source
        WHERE is_deleted = false
      ) deduped
      WHERE rn = 1
    phase: pre

Why this matters: A lean Bronze layer saves storage costs and compute time in every downstream layer.
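The SQL above keeps only the newest non-deleted version of each id. The same logic in plain Python, as a sketch of what the window function is doing (assumes rows are dicts with id, updated_at, and is_deleted keys):

```python
def cleanse(rows: list) -> list:
    """Drop soft-deleted records, then keep the latest version per id,
    mirroring the ROW_NUMBER() dedup query. Illustrative sketch."""
    latest = {}
    for row in rows:
        if row["is_deleted"]:
            continue  # WHERE is_deleted = false
        prev = latest.get(row["id"])
        # ORDER BY updated_at DESC ... rn = 1 -> keep the newest per id
        if prev is None or row["updated_at"] > prev["updated_at"]:
            latest[row["id"]] = row
    return list(latest.values())
```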


Example: Azure to Bronze

version: 1.0.0
info:
  title: CRM Ingestion
  target_layer: bronze

server:
  type: adls
  path: abfss://raw@datalake.dfs.core.windows.net/crm/
  mode: ingest
  schema_policy:
    evolution: append

model:
  fields:
    - name: user_id
      type: long
    - name: signup_date
      type: timestamp

No quality rules here — we want an exact copy of the source. But we still define the expected schema to catch drift early.
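What "defining the expected schema" buys you can be sketched as a per-record check against the model block above. The Python type mapping and the `check_record` helper are illustrative assumptions, not LakeLogic's actual validation:

```python
from datetime import datetime

# Expected schema from the model block above; long -> int and
# timestamp -> datetime are assumed Python equivalents.
EXPECTED = {"user_id": int, "signup_date": datetime}

def check_record(record: dict) -> list:
    """Return a list of deviations from the declared model.
    Sketch only; LakeLogic's drift check is configuration-driven."""
    problems = []
    for name, typ in EXPECTED.items():
        if name not in record:
            problems.append(f"missing:{name}")
        elif not isinstance(record[name], typ):
            problems.append(f"type:{name}")
    # Unknown fields are what the drift policy would flag or quarantine.
    problems += [f"unknown:{k}" for k in record if k not in EXPECTED]
    return problems
```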


The "All Strings" Bronze Pattern

Think of this as "photograph the evidence before you touch it."

Many high-scale teams read every column as a string in Bronze:

server:
  mode: ingest
  cast_to_string: true

Why this works:

| Benefit | How |
| --- | --- |
| Zero ingestion failures | You never crash because an API sent "N/A" into a numeric field |
| 100% data capture | Every value is preserved exactly as the source sent it |
| Fix in Silver | Casting and cleaning happen in Silver, where quarantine catches rows that won't convert |
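The Silver-side half of this pattern, cast what converts and quarantine what doesn't, can be sketched like so (function and parameter names are illustrative, not LakeLogic's API):

```python
def cast_or_quarantine(rows: list, casts: dict) -> tuple:
    """Try to cast all-strings Bronze rows to typed Silver rows.
    Rows that fail any cast go to quarantine instead of crashing
    the pipeline. Illustrative sketch."""
    clean, quarantine = [], []
    for row in rows:
        try:
            clean.append({col: casts.get(col, str)(val) for col, val in row.items()})
        except (ValueError, TypeError):
            quarantine.append(row)  # e.g. int("N/A") raises ValueError
    return clean, quarantine

rows = [{"age": "42"}, {"age": "N/A"}]
clean, quarantined = cast_or_quarantine(rows, {"age": int})
# clean == [{"age": 42}], quarantined == [{"age": "N/A"}]
```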

What's Next?