Capability Matrix

This matrix summarizes what each engine supports in the OSS runtime. When a feature is engine-specific, use an explicit engine selection.

Engine Support

| Engine | File Sources | Table Sources | File Outputs | Table Outputs | Quarantine Targets | Notes |
|---|---|---|---|---|---|---|
| Polars | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | Delta tables¹ | Delta¹ file | Delta-RS provides Spark-free lakehouse table support. |
| Spark | CSV, Parquet, Delta, Iceberg, JSON | Spark tables | CSV, Parquet, Delta, Iceberg, JSON | Spark tables | Spark tables (Delta/Iceberg) | Recommended for lakehouse catalogs (Unity Catalog/Fabric). |
| Snowflake | Table-only | Table-only | Table writes (Snowflake) | Snowflake tables | Snowflake tables | Requires snowflake-connector-python and credentials. |
| BigQuery | Table-only | Table-only | Table writes (BigQuery) | BigQuery tables | BigQuery tables | Requires google-cloud-bigquery and credentials. |
| LLM | Text, PDF, Image, Audio, Video | N/A | Structured rows | N/A | Via parent engine | Unstructured → structured extraction via LLM providers. |

¹ Delta support via delta-rs (pip install deltalake). Includes read, write, atomic MERGE, vacuum, optimize, and time travel, with no Spark/JVM required. Cloud storage auto-credentials via DeltaAdapter.
² Iceberg file output uses DuckDB's iceberg extension (COPY ... FORMAT ICEBERG).

Materialization Strategies

| Strategy | Polars | Spark |
|---|---|---|
| overwrite | Native | Native |
| append | Native | Native |
| merge | Native (Delta via delta-rs) | Native (distributed) |
| scd2 | Native | Native (distributed) |

Spark advantage: Merge and SCD2 operations run natively using distributed DataFrame operations, avoiding driver memory bottlenecks. For Delta Lake tables, LakeLogic uses MERGE INTO when available.

Delta merge via delta-rs: When the output format is Delta, Polars uses DeltaTable.merge() from delta-rs for atomic merge operations. This uses the same MERGE INTO semantics as Spark Delta but runs without a JVM.
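The shared MERGE INTO semantics can be illustrated with a minimal pure-Python upsert; the keys and rows are hypothetical and this is not the engine's implementation:

```python
# Minimal sketch of MERGE semantics: match rows on a key,
# update the matched ones, insert the unmatched ones.
def merge(target: dict, source: dict) -> dict:
    merged = dict(target)
    for key, row in source.items():
        merged[key] = row  # WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT
    return merged

target = {1: {"name": "alice"}, 2: {"name": "bob"}}
source = {2: {"name": "bobby"}, 3: {"name": "carol"}}
print(merge(target, source))
# {1: {'name': 'alice'}, 2: {'name': 'bobby'}, 3: {'name': 'carol'}}
```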

Non-Delta merge/SCD2: For CSV/Parquet targets, Polars runs merge and SCD2 as native single-node DataFrame operations, entirely in memory. For very large datasets, use Spark or the Delta format instead.
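The SCD2 strategy above can be sketched in plain Python: on a changed value, the current row is closed and a new open row is appended. The record shape and function name here are illustrative, not the engine's schema:

```python
# Sketch of SCD2 (type-2 slowly changing dimension) semantics:
# close the superseded row, open a new one with a fresh validity window.
from datetime import date

def scd2_apply(history: list[dict], key: str, value: str, today: date) -> list[dict]:
    out = [dict(r) for r in history]
    current = next((r for r in out if r["key"] == key and r["valid_to"] is None), None)
    if current is not None and current["value"] == value:
        return out                    # no change: keep the open row as-is
    if current is not None:
        current["valid_to"] = today   # close the superseded row
    out.append({"key": key, "value": value, "valid_from": today, "valid_to": None})
    return out

rows = scd2_apply([], "cust-1", "NY", date(2024, 1, 1))
rows = scd2_apply(rows, "cust-1", "CA", date(2024, 6, 1))
print(rows)  # NY row closed on 2024-06-01, CA row open
```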

Format Defaults

  • Quarantine file targets default to Parquet (override with quarantine.format, metadata.quarantine_format, or the file suffix).
  • Quarantine table targets on Spark default to Delta (override with metadata.quarantine_table_format).
  • Non-Spark engines support CSV, Parquet, and Delta for quarantine file targets.

If you need an unsupported combination, use Spark or route through an external staging step.
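The documented precedence for quarantine file formats can be sketched as a small resolver; the key names come from the bullets above, but the function itself is hypothetical, not the actual implementation:

```python
# Illustrative resolution order for a quarantine file format:
# quarantine.format > metadata.quarantine_format > file suffix > Parquet default.
from pathlib import PurePosixPath

SUFFIXES = {".csv": "csv", ".parquet": "parquet", ".delta": "delta"}

def resolve_quarantine_format(path: str, quarantine=None, metadata=None) -> str:
    quarantine, metadata = quarantine or {}, metadata or {}
    if "format" in quarantine:
        return quarantine["format"]
    if "quarantine_format" in metadata:
        return metadata["quarantine_format"]
    suffix = PurePosixPath(path).suffix
    if suffix in SUFFIXES:
        return SUFFIXES[suffix]
    return "parquet"  # documented default

print(resolve_quarantine_format("bad_rows.csv"))                                      # csv
print(resolve_quarantine_format("bad_rows", metadata={"quarantine_format": "delta"})) # delta
print(resolve_quarantine_format("bad_rows"))                                          # parquet
```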

LLM Extraction Engine

The LLM engine (engines/llm.py) extracts structured data from unstructured inputs:

| Provider | API Key Env Var | Notes |
|---|---|---|
| openai | OPENAI_API_KEY | GPT-4o, GPT-4o-mini |
| azure_openai | AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT | Azure-hosted OpenAI |
| anthropic | ANTHROPIC_API_KEY | Claude 3.5 Sonnet, Haiku |
| google | GOOGLE_API_KEY | Gemini 2.0 Flash |
| bedrock | AWS credentials (boto3) | Claude, Llama, Titan via AWS |
| ollama | None (local) | Any Ollama model; default http://localhost:11434 |
| local | None (local) | HuggingFace Transformers (Phi-3-mini default) |

Preprocessing supports PDF (OCR), image, audio (Whisper), and video inputs.
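A preflight credential check based on the table above can be sketched as follows; the mapping mirrors the documented env vars, the helper name is hypothetical, and bedrock is omitted because it uses the standard boto3 credential chain rather than a single variable:

```python
# Illustrative per-provider credential check using the documented env vars.
import os

REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "azure_openai": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_API_KEY"],
    "ollama": [],  # local server, no key needed
    "local": [],   # local transformers, no key needed
}

def missing_env(provider: str) -> list[str]:
    """Return the env vars a provider needs that are not currently set."""
    return [v for v in REQUIRED_ENV.get(provider, []) if not os.environ.get(v)]

os.environ["OPENAI_API_KEY"] = "sk-test"  # illustrative value
print(missing_env("openai"))  # []
print(missing_env("ollama"))  # []
```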

Lakehouse Catalog Notes (Unity Catalog / Fabric)

  • Unity Catalog external tables work with Polars. LakeLogic resolves 3-part table names (catalog.schema.table) to storage paths via the Databricks API, then reads/writes via delta-rs. No Spark needed.
  • UC managed tables require Spark. Managed tables don't expose a direct storage path, so if resolution returns no storage_location, Spark is required.
  • Iceberg tables in a catalog require Spark. File-based Iceberg output is not the same as a catalog-managed table.
  • Delta via delta-rs is used automatically by LakeLogic for read, write, merge, vacuum, and time travel on any Delta table accessible via path. A standalone DeltaAdapter class is also available for ad-hoc use outside the pipeline.