Hello World: Remote Data Ingestion¶
Business Scenario¶
You have a raw data file at a public URL and you need to ingest it into your lakehouse with quality validation — without downloading files, setting up databases, or writing custom ETL code.
Value Proposition¶
- Point at any remote URL and get a governed, validated table in seconds
- Define rules once in a contract; they apply every run
- Switch between Python dict contracts (prototyping) and YAML files (production) with zero code change
Goals¶
- Run LakeLogic against a remote CSV with an in-memory contract (Python dict)
- Repeat using a YAML file contract — same result, production-ready
- Inspect `result.raw`, `result.good`, and `result.bad`
Setup¶
import importlib.util
import os
import urllib.request
import sys
from pathlib import Path
if importlib.util.find_spec("lakelogic") is None:
    import subprocess

    subprocess.run([sys.executable, "-m", "pip", "install", "lakelogic", "-q"], check=True)
    print("lakelogic installed.")
else:
    print("lakelogic ready.")

if "google.colab" in sys.modules:
    repo = Path("/content/LakeLogic")
    if not repo.exists():
        import subprocess

        subprocess.run(
            [
                "git",
                "clone",
                "--quiet",
                "https://github.com/lakelogic/LakeLogic.git",
                str(repo),
            ],
            check=True,
        )
    os.chdir(repo / "examples" / "01_quickstart")
print(f"Working directory: {Path.cwd()}")
from lakelogic import DataProcessor
REMOTE_URL = (
    "https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv"
)
print(f"Source URL: {REMOTE_URL}")
print("Setup complete.")
lakelogic ready. Source URL: https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv Setup complete.
Engine Selection¶
LakeLogic supports multiple engines from the same contract. Choose your engine below.
Spark note: Spark cannot read files directly from HTTPS URLs (the Hadoop
`HttpsFileSystem` driver doesn't implement `listStatus`). When `ENGINE = 'spark'`, the cell below automatically downloads the file to `/tmp/` first and passes the local path to LakeLogic. All other engines (Polars, DuckDB, Pandas) read the remote URL directly.
# ── Choose your engine ──────────────────────────────────────────────────
# Options: 'polars' | 'duckdb' | 'pandas' | 'spark'
ENGINE = 'polars' # change this to try a different engine
# ── Resolve source path ──────────────────────────────────────────────────
# Spark can't read from HTTPS URLs directly — download locally first.
if ENGINE == 'spark':
    LOCAL_FILE = '/tmp/lakelogic_employees.csv'
    if not os.path.exists(LOCAL_FILE):
        print(f"Downloading {REMOTE_URL} for Spark...")
        urllib.request.urlretrieve(REMOTE_URL, LOCAL_FILE)
    SOURCE = LOCAL_FILE
    print(f"Spark engine: using local file → {SOURCE}")
else:
    SOURCE = REMOTE_URL
print(f"Engine: {ENGINE} | Source: {SOURCE}")
Engine: polars | Source: https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv
How It Works¶
LakeLogic reads the source URL, applies the contract's quality rules row-by-row,
and returns a ValidationResult with three DataFrames:
| Attribute | Contents |
|---|---|
| `result.raw` | Every row exactly as read from the source |
| `result.good` | Rows that passed all quality rules — safe for analytics |
| `result.bad` | Rows that failed — quarantined with an error reason column |
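The three-way split can be sketched in plain Python. This is a stand-in illustration of the shape only, not the real `ValidationResult` class (which holds engine-native DataFrames):

```python
from dataclasses import dataclass


# Stand-in for the ValidationResult shape described above; the real class
# holds DataFrames for the chosen engine, this sketch uses lists of dicts.
@dataclass
class SketchResult:
    raw: list
    good: list
    bad: list


rows = [
    {"id": 1, "email": "frank@company.com"},
    {"id": 2, "email": "not-an-email"},  # fails the email rule
]
result = SketchResult(
    raw=rows,
    good=[r for r in rows if "@" in r["email"]],
    bad=[r for r in rows if "@" not in r["email"]],
)
print(len(result.raw), len(result.good), len(result.bad))  # 2 1 1
```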
Contract formats¶
LakeLogic accepts contracts two ways — both produce identical results:
| Format | Best for | How to pass |
|---|---|---|
| Python dict | Prototyping, dynamic generation, notebooks | `DataProcessor(contract=my_dict)` |
| YAML file | Production, Git versioning, team governance | `DataProcessor(contract="path/to.yaml")` |
Quality rule applied¶
quality:
  row_rules:
    - name: Valid Email
      sql: "email LIKE '%@%'"  # any row without @ in email → quarantined
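Conceptually the rule is a per-row predicate. A plain-Python equivalent of `email LIKE '%@%'` (illustrative only; LakeLogic evaluates the SQL via the chosen engine):

```python
# Per-row predicate equivalent to the SQL rule "email LIKE '%@%'".
# A NULL/missing email is treated as failing, matching SQL LIKE semantics.
def valid_email(row: dict) -> bool:
    return "@" in (row.get("email") or "")


rows = [
    {"email": "frank@company.com"},
    {"email": "grace.company.com"},  # hypothetical failing row
]
quarantined = [r for r in rows if not valid_email(r)]
print(len(quarantined))  # 1
```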
1. Run via In-Memory Contract (Python Dict)¶
Best for prototyping. Define and run the contract in a single cell — no files needed.
contract_dict = {
    "version": "1.0.0",
    "dataset": "remote_employees",
    "source": {"type": "landing"},
    "quality": {"row_rules": [{"name": "Valid Email", "sql": "email LIKE '%@%'"}]},
}
processor = DataProcessor(contract=contract_dict, engine=ENGINE)
result = processor.run_source(SOURCE)
print(f"Engine : {ENGINE}")
print(f"Source rows : {result.source_count}")
print(f"Good rows : {result.good_count}")
print(f"Bad rows : {result.bad_count}")
print("\nGOOD DATA (passed quality rules):")
display(result.good)
2026-03-04 02:44:11.168 | INFO | lakelogic.core.processor:run_source:724 - Loading source: https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv via polars 2026-03-04 02:44:20.634 | INFO | lakelogic.core.processor:run:500 - Run complete. Source: 5, Total (post-transform): 5, Good: 4, Quarantined: 1, Pre-Transform Dropped: 0, Ratio: 20.00%
Engine : polars Source rows : 5 Good rows : 4 Bad rows : 1 GOOD DATA (passed quality rules):
| id | name | email | department | salary | hire_date | status |
|---|---|---|---|---|---|---|
| i64 | str | str | str | i64 | str | str |
| 1 | "Frank Wilson" | "frank@company.com" | "Engineering" | 105000 | "2023-02-10" | "active" |
| 3 | "Henry Brown" | "henry@company.com" | "Marketing" | -10000 | "2023-06-20" | "inactive" |
| 4 | "Iris Taylor" | "iris@company.com" | "Sales" | 92000 | "2022-12-05" | "active" |
| 5 | "Jack Anderson" | "jack@company.com" | "HR" | 79000 | "2024-02-14" | "active" |
2. Run via YAML File Contract¶
Best for production. The contract lives in a .yaml file tracked in Git;
the code never changes, only the contract does.
import yaml
yaml_path = Path("users_contract_remote.yaml")
yaml_path.write_text(yaml.dump(contract_dict), encoding="utf-8")
processor_prod = DataProcessor(contract=str(yaml_path), engine=ENGINE)
result_prod = processor_prod.run_source(SOURCE)
print(f"YAML contract — Good: {result_prod.good_count}, Bad: {result_prod.bad_count}")
print("BAD DATA (quarantined):")
display(result_prod.bad)
2026-03-04 02:41:35.803 | INFO | lakelogic.core.processor:run_source:724 - Loading source: D:\tmp\lakelogic_employees.csv via spark 2026-03-04 02:41:36.763 | INFO | lakelogic.core.processor:run:500 - Run complete. Source: 5, Total (post-transform): 5, Good: 4, Quarantined: 1, Pre-Transform Dropped: 0, Ratio: 20.00%
YAML contract — Good: 4, Bad: 1 BAD DATA (quarantined):
DataFrame[id: string, name: string, email: string, department: string, salary: string, hire_date: string, status: string, _lakelogic_errors: array<string>, _lakelogic_categories: array<string>, quarantine_state: string, quarantine_reprocessed: boolean]
Summary¶
| Step | What happened |
|---|---|
| Source | Remote CSV read directly — no download needed |
| Quality rule | email LIKE '%@%' filtered rows without a valid email |
| `result.good` | Rows safe for analytics |
| `result.bad` | Rows quarantined with error reason |
| YAML vs dict | Both produced identical results |
What LakeLogic did automatically¶
- Fetched the remote CSV without local file I/O
- Applied the quality SQL rule row-by-row
- Split results into `good` / `bad` with zero custom code
- Added `_lakelogic_processed_at` and `_lakelogic_run_id` audit columns
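The audit-column pattern can be sketched like this. Column names come from the list above; the exact value formats (ISO timestamp, UUID run id) are assumptions for illustration:

```python
import uuid
from datetime import datetime, timezone

# Stamp each row with a processing timestamp and a run id, mirroring the
# _lakelogic_processed_at / _lakelogic_run_id audit columns named above.
run_id = str(uuid.uuid4())
processed_at = datetime.now(timezone.utc).isoformat()

good_rows = [{"id": 1, "email": "frank@company.com"}]
stamped = [
    {**row, "_lakelogic_processed_at": processed_at, "_lakelogic_run_id": run_id}
    for row in good_rows
]
```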
Next Steps — Try It Yourself¶
1. Change the quality rules¶
contract_dict = {
    "version": "1.0.0",
    "dataset": "remote_employees",
    "source": {"type": "landing"},
    "quality": {
        "row_rules": [
            {"name": "Valid Email", "sql": "email LIKE '%@%'"},
            {"name": "Has Name", "sql": "name IS NOT NULL AND name != ''"},
            {"name": "Dept Not Null", "sql": "department IS NOT NULL"},  # add a rule
        ]
    },
}
Ideas:
- Point `REMOTE_URL` at your own CSV hosted on GitHub or S3
- Add `accepted_values` rules: `"department IN ('Engineering', 'Sales')"`
- Add `materialization` to write `result.good` directly to a Parquet file
2. Key contract knobs¶
| What to change | Where in contract | Effect |
|---|---|---|
| Quality rules | quality.row_rules |
Filter rows into good / bad |
| Source type | source.type |
landing (files/URLs), table (dict/rows) |
| Write output | materialization.target_path + format |
Persist good rows to Parquet/CSV/Delta |
| Schema enforcement | model.fields |
Validate column types and required fields |
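Putting the knobs together, a contract might look like the dict below. Key names follow the table above, but the `materialization` shape has not been verified against the LakeLogic contract schema; treat it as an assumption:

```python
# Hypothetical contract combining the knobs from the table above.
contract = {
    "version": "1.0.0",
    "dataset": "remote_employees",
    "source": {"type": "landing"},  # landing = files / URLs
    "quality": {"row_rules": [{"name": "Valid Email", "sql": "email LIKE '%@%'"}]},
    # Assumed shape, per the "Write output" row of the table:
    "materialization": {"target_path": "out/employees.parquet", "format": "parquet"},
}
```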
3. Explore related quickstarts¶
- `02_database_governance.ipynb` — same quality-gate pattern applied to a SQLite database
- `../02_core_patterns/scd2_dimension/` — add SCD2 history tracking to any pipeline
- `../02_core_patterns/soft_delete/` — flag deletes instead of removing rows