
Capability Matrix

This matrix summarizes what each engine supports in the OSS runtime. When a feature is engine-specific, use an explicit engine selection.

Engine Support

| Engine | File Sources | Table Sources | File Outputs | Table Outputs | Quarantine Targets | Notes |
|---|---|---|---|---|---|---|
| Polars | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | Delta tables¹ | DuckDB/SQLite table, Delta¹ file | Delta-RS provides Spark-free lakehouse table support. |
| Pandas | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | Delta tables¹ | DuckDB/SQLite table, Delta¹ file | Uses DuckDB for SQL execution. |
| DuckDB | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | DuckDB + Delta tables¹ | DuckDB/SQLite table, Delta¹ file | Table outputs are DuckDB or Delta tables. |
| Spark | CSV, Parquet, Delta, Iceberg, JSON | Spark tables | CSV, Parquet, Delta, Iceberg, JSON | Spark tables | Spark tables (Delta/Iceberg) | Recommended for lakehouse catalogs (Unity Catalog/Fabric). |
| Snowflake | Table-only | Table-only | Table writes (Snowflake) | Snowflake tables | Snowflake tables | Requires snowflake-connector-python and credentials. |
| BigQuery | Table-only | Table-only | Table writes (BigQuery) | BigQuery tables | BigQuery tables | Requires google-cloud-bigquery and credentials. |
| LLM | Text, PDF, Image, Audio, Video | N/A | Structured rows | N/A | Via parent engine | Unstructured → structured extraction via LLM providers. |

¹ Delta support via delta-rs (pip install deltalake): read, write, atomic MERGE, vacuum, optimize, and time travel, with no Spark/JVM required. Cloud-storage credentials are resolved automatically via DeltaAdapter.
² Iceberg file output uses DuckDB's iceberg extension (COPY ... FORMAT ICEBERG).

Materialization Strategies

| Strategy | Polars | Pandas | DuckDB | Spark |
|---|---|---|---|---|
| overwrite | Native | Native | Native | Native |
| append | Native | Native | Native | Native |
| merge | Native (Delta via delta-rs) / pandas (CSV/Parquet) | Native | Native (Delta via delta-rs) / pandas (CSV/Parquet) | Native (distributed) |
| scd2 | Via pandas | Native | Via pandas | Native (distributed) |
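
For intuition, SCD2 "via pandas" boils down to closing out the current row for each changed key and appending a new current row. A toy sketch under assumed column names (valid_from/valid_to/is_current are illustrative, not LakeLogic's actual schema):

```python
import pandas as pd

def scd2_apply(dim: pd.DataFrame, updates: pd.DataFrame, key: str, ts: str) -> pd.DataFrame:
    """Close current rows for keys present in updates, then append new current rows."""
    changing = dim["is_current"] & dim[key].isin(updates[key])
    dim = dim.copy()
    dim.loc[changing, "is_current"] = False
    dim.loc[changing, "valid_to"] = ts
    new_rows = updates.assign(valid_from=ts, valid_to=None, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)

dim = pd.DataFrame({
    "id": [1], "city": ["Oslo"],
    "valid_from": ["2024-01-01"], "valid_to": [None], "is_current": [True],
})
updates = pd.DataFrame({"id": [1], "city": ["Bergen"]})
dim = scd2_apply(dim, updates, key="id", ts="2024-06-01")
print(dim[["id", "city", "is_current"]].to_dict("records"))
# [{'id': 1, 'city': 'Oslo', 'is_current': False}, {'id': 1, 'city': 'Bergen', 'is_current': True}]
```

Because the whole dimension table is held in memory, this pattern is exactly where the driver-memory caveat below applies.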

Spark advantage: Merge and SCD2 operations run natively using distributed DataFrame operations, avoiding driver memory bottlenecks. For Delta Lake tables, LakeLogic uses MERGE INTO when available.

Delta merge via delta-rs: When the output format is Delta, Polars and DuckDB use DeltaTable.merge() from delta-rs for atomic merge operations — no pandas conversion needed. This uses the same MERGE INTO semantics as Spark Delta but runs without JVM.

Non-Delta merge/SCD2: For CSV/Parquet targets, Polars and DuckDB fall back to pandas for merge and SCD2 operations. This works well for moderate data volumes but may hit driver memory limits at scale — use Spark or Delta format for large datasets.
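
The CSV/Parquet fallback is conceptually just an in-memory upsert. A minimal sketch of that pattern in plain pandas (function and column names are invented, not LakeLogic's exact implementation):

```python
import pandas as pd

def upsert(target: pd.DataFrame, source: pd.DataFrame, key: str) -> pd.DataFrame:
    """Rows in source replace matching rows in target; unmatched rows are appended."""
    combined = pd.concat([target, source], ignore_index=True)
    # keep="last" keeps the source row whenever both sides share a key.
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

target = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})
source = pd.DataFrame({"id": [2, 3], "val": ["B", "c"]})
result = upsert(target, source, "id")
print(result.to_dict("records"))
# [{'id': 1, 'val': 'a'}, {'id': 2, 'val': 'B'}, {'id': 3, 'val': 'c'}]
```

Both the existing target and the source must fit in memory at once, which is why large datasets should go through Spark or a Delta target instead.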

Format Defaults

  • Quarantine file targets default to Parquet (override with quarantine.format, or metadata.quarantine_format, or file suffix).
  • Quarantine table targets on Spark default to Delta (override with metadata.quarantine_table_format).
  • Non-Spark engines support CSV, Parquet, and Delta for quarantine file targets.
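
A hypothetical shape for these overrides in a pipeline config. Only the key names quarantine.format, metadata.quarantine_format, and metadata.quarantine_table_format come from this page; the surrounding structure and values are invented for illustration, so check your actual config schema:

```yaml
quarantine:
  target: s3://bucket/quarantine/orders   # file target; a file suffix can also imply the format
  format: csv                             # overrides the Parquet default for file targets

metadata:
  quarantine_format: csv                  # alternative location for the same override
  quarantine_table_format: delta          # Spark table targets (Delta is already the default)
```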

If you need an unsupported combination, use Spark or route through an external staging step.

LLM Extraction Engine

The LLM engine (engines/llm.py) extracts structured data from unstructured inputs:

| Provider | API Key Env Var | Notes |
|---|---|---|
| openai | OPENAI_API_KEY | GPT-4o, GPT-4o-mini |
| azure_openai | AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT | Azure-hosted OpenAI |
| anthropic | ANTHROPIC_API_KEY | Claude 3.5 Sonnet, Haiku |
| google | GOOGLE_API_KEY | Gemini 2.0 Flash |
| bedrock | AWS credentials (boto3) | Claude, Llama, Titan via AWS |
| ollama | None (local) | Any Ollama model, default http://localhost:11434 |
| local | None (local) | HuggingFace Transformers (Phi-3-mini default) |

Preprocessing supports PDF (OCR), image, audio (Whisper), and video inputs.
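
The extraction pattern itself is provider-agnostic: send the unstructured text plus a target schema, get JSON back, validate it into rows. A stripped-down sketch with the model call stubbed out; the prompt wording and field names are invented here, and the real logic lives in engines/llm.py:

```python
import json

SCHEMA = {"invoice_id": "string", "total": "number"}

def build_prompt(text: str) -> str:
    return (
        "Extract one JSON object matching this schema, and nothing else:\n"
        f"{json.dumps(SCHEMA)}\n\nDocument:\n{text}"
    )

def call_model(prompt: str) -> str:
    # Stub: a real implementation would call openai/anthropic/ollama/... here.
    return '{"invoice_id": "INV-17", "total": 249.5}'

def extract_rows(text: str) -> list[dict]:
    raw = call_model(build_prompt(text))
    row = json.loads(raw)
    missing = set(SCHEMA) - set(row)
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return [row]

rows = extract_rows("Invoice INV-17 ... amount due: 249.50 EUR")
print(rows)  # [{'invoice_id': 'INV-17', 'total': 249.5}]
```

The validated rows are then handed to the parent engine like any other tabular source.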

Lakehouse Catalog Notes (Unity Catalog / Fabric)

  • Unity Catalog external tables work with Polars/Pandas. LakeLogic resolves 3-part table names (catalog.schema.table) to storage paths via the Databricks API, then reads/writes via delta-rs. No Spark needed.
  • UC managed tables require Spark. Managed tables don't expose a direct storage path — if resolution returns no storage_location, Spark is required.
  • Iceberg tables in a catalog require Spark. File-based Iceberg output is not the same as a catalog-managed table.
  • Delta via delta-rs is used automatically by LakeLogic for read, write, merge, vacuum, and time travel on any Delta table accessible via path. A standalone DeltaAdapter class is also available for ad-hoc use outside the pipeline.