Hello World: Remote Data Ingestion¶
Business Scenario¶
You have a raw data file at a public URL and you need to ingest it into your lakehouse with quality validation — without downloading files, setting up databases, or writing custom ETL code.
Value Proposition¶
- Point at any remote URL and get a governed, validated table in seconds
- Define rules once in a contract; they apply every run
- Switch between Python dict contracts (prototyping) and YAML files (production) with zero code change
Goals¶
- Run LakeLogic against a remote CSV with an in-memory contract (Python dict)
- Repeat using a YAML file contract — same result, production-ready
- Inspect `result.raw`, `result.good`, and `result.bad`
Setup¶
import importlib.util
import os
import urllib.request
import sys
from pathlib import Path
if importlib.util.find_spec("lakelogic") is None:
    import subprocess

    subprocess.run([sys.executable, "-m", "pip", "install", "lakelogic", "-q"], check=True)
    print("lakelogic installed.")
else:
    print("lakelogic ready.")

if "google.colab" in sys.modules:
    repo = Path("/content/LakeLogic")
    if not repo.exists():
        import subprocess

        subprocess.run(
            [
                "git",
                "clone",
                "--quiet",
                "https://github.com/lakelogic/LakeLogic.git",
                str(repo),
            ],
            check=True,
        )
    os.chdir(repo / "examples" / "01_quickstart")
print(f"Working directory: {Path.cwd()}")
from lakelogic import DataProcessor
REMOTE_URL = (
    "https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv"
)
print(f"Source URL: {REMOTE_URL}")
print("Setup complete.")
lakelogic ready. Source URL: https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv Setup complete.
Engine Selection¶
LakeLogic supports multiple engines from the same contract. Choose your engine below.
Spark note: Spark cannot read files directly from HTTPS URLs (the Hadoop
`HttpsFileSystem` driver doesn't implement `listStatus`). When `ENGINE = 'spark'`, the cell below automatically downloads the file to `/tmp/` first and passes the local path to LakeLogic. All other engines (Polars, DuckDB, Pandas) read the remote URL directly.
# ── Choose your engine ──────────────────────────────────────────────────
# Options: 'polars' | 'duckdb' | 'pandas' | 'spark'
ENGINE = 'polars' # change this to try a different engine
# ── Resolve source path ──────────────────────────────────────────────────
# Spark can't read from HTTPS URLs directly — download locally first.
if ENGINE == 'spark':
    LOCAL_FILE = '/tmp/lakelogic_employees.csv'
    if not os.path.exists(LOCAL_FILE):
        print(f"Downloading {REMOTE_URL} for Spark...")
        urllib.request.urlretrieve(REMOTE_URL, LOCAL_FILE)
    SOURCE = LOCAL_FILE
    print(f"Spark engine: using local file → {SOURCE}")
else:
    SOURCE = REMOTE_URL
print(f"Engine: {ENGINE} | Source: {SOURCE}")
Engine: polars | Source: https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv
How It Works¶
LakeLogic reads the source URL, applies the contract's quality rules row-by-row,
and returns a ValidationResult with three DataFrames:
| Attribute | Contents |
|---|---|
| `result.raw` | Every row exactly as read from the source |
| `result.good` | Rows that passed all quality rules — safe for analytics |
| `result.bad` | Rows that failed — quarantined with an error reason column |
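The three-way split can be sketched in plain Python. This is a stand-in illustration of the shape only, not the real `ValidationResult` class (which holds engine-native DataFrames):

```python
from dataclasses import dataclass


# Stand-in for the ValidationResult shape described above; the real class
# holds DataFrames for the chosen engine, this sketch uses lists of dicts.
@dataclass
class SketchResult:
    raw: list
    good: list
    bad: list


rows = [
    {"id": 1, "email": "frank@company.com"},
    {"id": 2, "email": "not-an-email"},  # fails the email rule
]
result = SketchResult(
    raw=rows,
    good=[r for r in rows if "@" in r["email"]],
    bad=[r for r in rows if "@" not in r["email"]],
)
print(len(result.raw), len(result.good), len(result.bad))  # 2 1 1
```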
Contract formats¶
LakeLogic accepts contracts two ways — both produce identical results:
| Format | Best for | How to pass |
|---|---|---|
| Python dict | Prototyping, dynamic generation, notebooks | `DataProcessor(contract=my_dict)` |
| YAML file | Production, Git versioning, team governance | `DataProcessor(contract="path/to.yaml")` |
Quality rule applied¶
quality:
  row_rules:
    - name: Valid Email
      sql: "email LIKE '%@%'"  # any row without @ in email → quarantined
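Conceptually the rule is a per-row predicate. A plain-Python equivalent of `email LIKE '%@%'` (illustrative only; LakeLogic evaluates the SQL via the chosen engine):

```python
# Per-row predicate equivalent to the SQL rule "email LIKE '%@%'".
# A NULL/missing email is treated as failing, matching SQL LIKE semantics.
def valid_email(row: dict) -> bool:
    return "@" in (row.get("email") or "")


rows = [
    {"email": "frank@company.com"},
    {"email": "grace.company.com"},  # hypothetical failing row
]
quarantined = [r for r in rows if not valid_email(r)]
print(len(quarantined))  # 1
```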
1. Run via In-Memory Contract (Python Dict)¶
Best for prototyping. Define and run the contract in a single cell — no files needed.
contract_dict = {
    "version": "1.0.0",
    "dataset": "remote_employees",
    "source": {"type": "landing"},
    "quality": {"row_rules": [{"name": "Valid Email", "sql": "email LIKE '%@%'"}]},
}
processor = DataProcessor(contract=contract_dict, engine=ENGINE)
result = processor.run_source(SOURCE)
print(f"Engine : {ENGINE}")
print(f"Source rows : {result.source_count}")
print(f"Good rows : {result.good_count}")
print(f"Bad rows : {result.bad_count}")
print("\nGOOD DATA (passed quality rules):")
display(result.good)
2026-03-04 02:44:11.168 | INFO | lakelogic.core.processor:run_source:724 - Loading source: https://raw.githubusercontent.com/lakelogic/LakeLogic/main/examples/01_quickstart/files/excel/data/employees.csv via polars 2026-03-04 02:44:20.634 | INFO | lakelogic.core.processor:run:500 - Run complete. Source: 5, Total (post-transform): 5, Good: 4, Quarantined: 1, Pre-Transform Dropped: 0, Ratio: 20.00%
Engine : polars Source rows : 5 Good rows : 4 Bad rows : 1 GOOD DATA (passed quality rules):
| id | name | email | department | salary | hire_date | status |
|---|---|---|---|---|---|---|
| i64 | str | str | str | i64 | str | str |
| 1 | "Frank Wilson" | "frank@company.com" | "Engineering" | 105000 | "2023-02-10" | "active" |
| 3 | "Henry Brown" | "henry@company.com" | "Marketing" | -10000 | "2023-06-20" | "inactive" |
| 4 | "Iris Taylor" | "iris@company.com" | "Sales" | 92000 | "2022-12-05" | "active" |
| 5 | "Jack Anderson" | "jack@company.com" | "HR" | 79000 | "2024-02-14" | "active" |
2. Run via YAML File Contract¶
Best for production. The contract lives in a .yaml file tracked in Git;
the code never changes, only the contract does.
import yaml
yaml_path = Path("users_contract_remote.yaml")
yaml_path.write_text(yaml.dump(contract_dict), encoding="utf-8")
processor_prod = DataProcessor(contract=str(yaml_path), engine=ENGINE)
result_prod = processor_prod.run_source(SOURCE)
print(f"YAML contract — Good: {result_prod.good_count}, Bad: {result_prod.bad_count}")
print("BAD DATA (quarantined):")
display(result_prod.bad)
2026-03-04 02:41:35.803 | INFO | lakelogic.core.processor:run_source:724 - Loading source: D:\tmp\lakelogic_employees.csv via spark 2026-03-04 02:41:36.763 | INFO | lakelogic.core.processor:run:500 - Run complete. Source: 5, Total (post-transform): 5, Good: 4, Quarantined: 1, Pre-Transform Dropped: 0, Ratio: 20.00%
YAML contract — Good: 4, Bad: 1 BAD DATA (quarantined):
DataFrame[id: string, name: string, email: string, department: string, salary: string, hire_date: string, status: string, _lakelogic_errors: array<string>, _lakelogic_categories: array<string>, quarantine_state: string, quarantine_reprocessed: boolean]
Summary¶
| Step | What happened |
|---|---|
| Source | Remote CSV read directly — no download needed |
| Quality rule | email LIKE '%@%' filtered rows without a valid email |
| `result.good` | Rows safe for analytics |
| `result.bad` | Rows quarantined with error reason |
| YAML vs dict | Both produced identical results |
What LakeLogic did automatically¶
- Fetched the remote CSV without local file I/O
- Applied the quality SQL rule row-by-row
- Split results into `good` / `bad` with zero custom code
- Added `_lakelogic_processed_at` and `_lakelogic_run_id` audit columns
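The audit-column pattern can be sketched like this. Column names come from the list above; the exact value formats (ISO timestamp, UUID run id) are assumptions for illustration:

```python
import uuid
from datetime import datetime, timezone

# Stamp each row with a processing timestamp and a run id, mirroring the
# _lakelogic_processed_at / _lakelogic_run_id audit columns named above.
run_id = str(uuid.uuid4())
processed_at = datetime.now(timezone.utc).isoformat()

good_rows = [{"id": 1, "email": "frank@company.com"}]
stamped = [
    {**row, "_lakelogic_processed_at": processed_at, "_lakelogic_run_id": run_id}
    for row in good_rows
]
```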
Next Steps — Try It Yourself¶
1. Change the quality rules¶
contract_dict = {
    "version": "1.0.0",
    "dataset": "remote_employees",
    "source": {"type": "landing"},
    "quality": {
        "row_rules": [
            {"name": "Valid Email", "sql": "email LIKE '%@%'"},
            {"name": "Has Name", "sql": "name IS NOT NULL AND name != ''"},
            {"name": "Dept Not Null", "sql": "department IS NOT NULL"},  # add a rule
        ]
    },
}
Ideas:
- Point `REMOTE_URL` at your own CSV hosted on GitHub or S3
- Add `accepted_values` rules: `"department IN ('Engineering', 'Sales')"`
- Add `materialization` to write `result.good` directly to a Parquet file
2. Key contract knobs¶
| What to change | Where in contract | Effect |
|---|---|---|
| Quality rules | quality.row_rules |
Filter rows into good / bad |
| Source type | source.type |
landing (files/URLs), table (dict/rows) |
| Write output | materialization.target_path + format |
Persist good rows to Parquet/CSV/Delta |
| Schema enforcement | model.fields |
Validate column types and required fields |
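Putting the knobs together, a contract might look like the dict below. Key names follow the table above, but the `materialization` shape has not been verified against the LakeLogic contract schema; treat it as an assumption:

```python
# Hypothetical contract combining the knobs from the table above.
contract = {
    "version": "1.0.0",
    "dataset": "remote_employees",
    "source": {"type": "landing"},  # landing = files / URLs
    "quality": {"row_rules": [{"name": "Valid Email", "sql": "email LIKE '%@%'"}]},
    # Assumed shape, per the "Write output" row of the table:
    "materialization": {"target_path": "out/employees.parquet", "format": "parquet"},
}
```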
3. Explore related quickstarts¶
- `02_database_governance.ipynb` — same quality-gate pattern applied to a SQLite database
- `../02_core_patterns/scd2_dimension/` — add SCD2 history tracking to any pipeline
- `../02_core_patterns/soft_delete/` — flag deletes instead of removing rows