# Capability Matrix
This matrix summarizes what each engine supports in the OSS runtime. When a feature is engine-specific, select that engine explicitly.
## Engine Support
| Engine | File Sources | Table Sources | File Outputs | Table Outputs | Quarantine Targets | Notes |
|---|---|---|---|---|---|---|
| Polars | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | Delta tables¹ | DuckDB/SQLite table, Delta¹ file | Delta-RS provides Spark-free lakehouse table support. |
| Pandas | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | Delta tables¹ | DuckDB/SQLite table, Delta¹ file | Uses DuckDB for SQL execution. |
| DuckDB | CSV, Parquet, Delta¹ | Delta tables¹ | CSV, Parquet, Delta¹, Iceberg² | DuckDB + Delta tables¹ | DuckDB/SQLite table, Delta¹ file | Table outputs are DuckDB or Delta tables. |
| Spark | CSV, Parquet, Delta, Iceberg, JSON | Spark tables | CSV, Parquet, Delta, Iceberg, JSON | Spark tables | Spark tables (Delta/Iceberg) | Recommended for lakehouse catalogs (Unity Catalog/Fabric). |
| Snowflake | Table-only | Table-only | Table writes (Snowflake) | Snowflake tables | Snowflake tables | Requires snowflake-connector-python and credentials. |
| BigQuery | Table-only | Table-only | Table writes (BigQuery) | BigQuery tables | BigQuery tables | Requires google-cloud-bigquery and credentials. |
| LLM | Text, PDF, Image, Audio, Video | N/A | Structured rows | N/A | Via parent engine | Unstructured → structured extraction via LLM providers. |
¹ Delta support via delta-rs (`pip install deltalake`). Includes read, write, atomic MERGE, vacuum, optimize, and time travel; no Spark/JVM required. Cloud storage auto-credentials via `DeltaAdapter`.
² Iceberg file output uses DuckDB's iceberg extension (`COPY ... FORMAT ICEBERG`).
## Materialization Strategies
| Strategy | Polars | Pandas | DuckDB | Spark |
|---|---|---|---|---|
| overwrite | Native | Native | Native | Native |
| append | Native | Native | Native | Native |
| merge | Native (Delta via delta-rs) / pandas (CSV/Parquet) | Native | Native (Delta via delta-rs) / pandas (CSV/Parquet) | Native (distributed) |
| scd2 | Via pandas | Native | Via pandas | Native (distributed) |
**Spark advantage:** Merge and SCD2 operations run natively using distributed DataFrame operations, avoiding driver memory bottlenecks. For Delta Lake tables, LakeLogic uses `MERGE INTO` when available.
**Delta merge via delta-rs:** When the output format is Delta, Polars and DuckDB use `DeltaTable.merge()` from delta-rs for atomic merge operations; no pandas conversion is needed. This uses the same `MERGE INTO` semantics as Spark Delta but runs without the JVM.
**Non-Delta merge/SCD2:** For CSV/Parquet targets, Polars and DuckDB fall back to pandas for merge and SCD2 operations. This works well for moderate data volumes but may hit driver memory limits at scale; use Spark or the Delta format for large datasets.
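The pandas fallback can be pictured roughly like this. It is an illustrative upsert helper, not the pipeline's exact implementation; note that both frames sit fully in driver memory, which is the scaling limit mentioned above:

```python
# Rough sketch of a pandas-based upsert for CSV/Parquet targets.
import pandas as pd


def upsert(target: pd.DataFrame, source: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Replace target rows whose keys appear in source; append rows with new keys."""
    combined = pd.concat([target, source], ignore_index=True)
    # keep="last" keeps the source version wherever a key collides
    return combined.drop_duplicates(subset=keys, keep="last").reset_index(drop=True)


target = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})
source = pd.DataFrame({"id": [2, 3], "val": ["b2", "c"]})
merged = upsert(target, source, ["id"])
```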
## Format Defaults
- Quarantine file targets default to Parquet (override with `quarantine.format`, `metadata.quarantine_format`, or file suffix).
- Quarantine table targets on Spark default to Delta (override with `metadata.quarantine_table_format`).
- Non-Spark engines support CSV, Parquet, and Delta for quarantine file targets.
If you need an unsupported combination, use Spark or route through an external staging step.
## LLM Extraction Engine
The LLM engine (`engines/llm.py`) extracts structured data from unstructured inputs:
| Provider | API Key Env Var | Notes |
|---|---|---|
| `openai` | `OPENAI_API_KEY` | GPT-4o, GPT-4o-mini |
| `azure_openai` | `AZURE_OPENAI_API_KEY` + `AZURE_OPENAI_ENDPOINT` | Azure-hosted OpenAI |
| `anthropic` | `ANTHROPIC_API_KEY` | Claude 3.5 Sonnet, Haiku |
| `google` | `GOOGLE_API_KEY` | Gemini 2.0 Flash |
| `bedrock` | AWS credentials (boto3) | Claude, Llama, Titan via AWS |
| `ollama` | None (local) | Any Ollama model, default `http://localhost:11434` |
| `local` | None (local) | HuggingFace Transformers (Phi-3-mini default) |
Preprocessing supports PDF (OCR), image, audio (Whisper), and video inputs.
## Lakehouse Catalog Notes (Unity Catalog / Fabric)
- Unity Catalog external tables work with Polars/Pandas. LakeLogic resolves 3-part table names (`catalog.schema.table`) to storage paths via the Databricks API, then reads/writes via delta-rs. No Spark needed.
- UC managed tables require Spark. Managed tables don't expose a direct storage path; if resolution returns no `storage_location`, Spark is required.
- Iceberg tables in a catalog require Spark. File-based Iceberg output is not the same as a catalog-managed table.
- Delta via delta-rs is used automatically by LakeLogic for read, write, merge, vacuum, and time travel on any Delta table accessible via path. A standalone `DeltaAdapter` class is also available for ad-hoc use outside the pipeline.