
LLM Extraction — Unstructured Data Processing

LakeLogic extracts structured data from PDFs, images, and free text using a contract-first approach: define what to extract in model.fields and how in the extraction block, and LakeLogic picks the right library, extracts, validates, and materialises the result. No custom parsers. No glue code.


How It Works

  Contract                   DataProcessor.run()
┌────────────────────┐      ┌──────────────────────────────────────────────┐
│ model.fields       │      │ 1. Preprocess  (pdfplumber / OCR / spaCy)    │
│  → what to extract │─────▶│ 2. Extract     (LLM prompt / NER / tables)   │
│ extraction.config  │      │ 3. Validate    (quality rules + quarantine)  │
│  → how to extract  │      │ 4. Materialise (parquet / delta / CSV)       │
└────────────────────┘      └──────────────────────────────────────────────┘

The same pattern runs for every provider. For text sources, pass a DataFrame directly; for binary files (PDF, image), pass a DataFrame with a file_path column pointing at each file:

import polars as pl
from lakelogic import DataProcessor

proc = DataProcessor("contract.yaml", engine="polars")

# Text / CSV source — pass the DataFrame directly
good, bad = proc.run(tickets_df)

# Binary source (PDF, image) — point the engine at the file via a DataFrame
files_df = pl.DataFrame({"file_path": ["invoice.pdf"]})
good, bad = proc.run(files_df)   # contract.extraction.preprocessing.file_column: file_path

Supported Providers

Provider            Install                    Input            Cost
local (pdfplumber)  lakelogic[extraction-ocr]  PDF, DOCX        $0, offline
rapidocr            lakelogic[extraction-ocr]  Scanned images   $0, ONNX, no torch
spacy               lakelogic[nlp]             Free text        $0, local NER
unstructured        lakelogic[extraction]      PDF, DOCX, HTML  $0, offline
openai              lakelogic[ai]              Any              Per token
anthropic           lakelogic[ai]              Any              Per token
ollama              lakelogic[ai]              Any              $0, local LLM
azure_openai        lakelogic[ai]              Any              Per token
bedrock             lakelogic[ai]              Any              Per token

Quick Start

# contracts/ticket_extraction.yaml
version: 1.0.0
info:
  title: "Support Ticket Extraction"
dataset: "support_tickets"

source:
  type: "landing"
  path: "data/tickets/*.csv"

extraction:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.1
  text_column: "ticket_body"
  output_schema:
    - name: "sentiment"
      type: "string"
      accepted_values: ["positive", "neutral", "negative"]
    - name: "issue_category"
      type: "string"
      extraction_examples: ["billing", "technical", "account"]

model:
  fields:
    - name: ticket_id
      type: integer
    - name: ticket_body
      type: string
    - name: sentiment
      type: string
    - name: issue_category
      type: string

quality:
  row_rules:
    - not_null: ticket_id
    - accepted_values:
        field: sentiment
        values: ["positive", "neutral", "negative"]

Run the contract:

from lakelogic import DataProcessor

processor = DataProcessor(engine="polars", contract="contracts/ticket_extraction.yaml")
good_df, bad_df = processor.run_source("data/tickets/batch_001.csv")

Cloud Providers

Provider      Env Var                                       Example Models
openai        OPENAI_API_KEY                                gpt-4o, gpt-4o-mini
anthropic     ANTHROPIC_API_KEY                             claude-sonnet-4-20250514, claude-3-haiku
azure_openai  AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT  Azure-hosted OpenAI models
google        GOOGLE_API_KEY                                gemini-2.0-flash, gemini-pro
bedrock       AWS credentials (boto3)                       Amazon Bedrock models

Local Providers (No API Key, No Data Leaves Your Network)

Provider  Setup                         Example Models
ollama    Install Ollama, pull a model  llama3.1, mistral, phi3
local     pip install lakelogic[local]  HuggingFace Transformers (Phi-3-mini default)

# Override default provider globally
export LAKELOGIC_AI_PROVIDER=ollama
export LAKELOGIC_AI_MODEL=llama3.1

# Override Ollama URL
export OLLAMA_BASE_URL="http://localhost:11434"
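The environment variables act as global overrides of whatever the contract specifies. A minimal sketch of the likely resolution order (env var wins over the contract value; the helper name and defaults are illustrative, not LakeLogic's actual API):

```python
import os

def resolve_ai_settings(contract_extraction: dict) -> tuple[str, str]:
    """Resolve provider/model: env override first, then the contract value."""
    provider = os.environ.get(
        "LAKELOGIC_AI_PROVIDER", contract_extraction.get("provider", "openai")
    )
    model = os.environ.get(
        "LAKELOGIC_AI_MODEL", contract_extraction.get("model", "gpt-4o-mini")
    )
    return provider, model

# With the exports above in place, the env wins over the contract:
os.environ["LAKELOGIC_AI_PROVIDER"] = "ollama"
os.environ["LAKELOGIC_AI_MODEL"] = "llama3.1"
print(resolve_ai_settings({"provider": "openai", "model": "gpt-4o"}))
```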

Configuration Reference

Prompt Templates

Use Jinja2 templates with column names as variables:

extraction:
  text_column: "description"
  context_columns: ["category", "date"]

  prompt_template: |
    Given this product description:
    {{ description }}

    Category: {{ category }}
    Date: {{ date }}

    Extract the following fields as JSON.

  system_prompt: "You are a product data extraction assistant."
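At render time, each row's column values are substituted into the template placeholders. LakeLogic uses Jinja2 for this; the dependency-free stand-in below mimics only the simple {{ column }} substitution to show the per-row mechanics:

```python
import re

def render_prompt(template: str, row: dict) -> str:
    """Substitute {{ column }} placeholders with row values (Jinja2 stand-in)."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(row.get(m.group(1), "")),
        template,
    )

template = "Given this product description:\n{{ description }}\n\nCategory: {{ category }}"
row = {"description": "Red running shoes", "category": "footwear"}
print(render_prompt(template, row))
```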

Output Schema

model.fields is the single source of truth for field names and types. output_schema only lists fields that need extraction-specific hints — anything not listed defaults to extraction_task: extraction. You never need to repeat type in output_schema.

model:
  fields:
    - name: brand
      type: string
    - name: price
      type: float
    - name: condition
      type: string

extraction:
  # Only fields needing non-default hints appear here — types are inherited from model.fields
  output_schema:
    - name: brand
      extraction_task: ner           # named-entity recognition, not plain extraction

    - name: condition
      extraction_task: classification
      accepted_values: ["new", "used", "refurbished"]
      extraction_examples: ["new", "like new", "refurbished"]

    # 'price' omitted — defaults to extraction_task: extraction

extraction_task  Description
classification   Pick from accepted_values
extraction       Extract a specific value
ner              Named entity recognition
summarization    Summarize text
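Conceptually, the merge works like a left join from model.fields onto the hints: every field keeps its contract type, and any field absent from output_schema falls back to the default task. A rough sketch of that logic (not LakeLogic's internals):

```python
DEFAULT_TASK = "extraction"

def merge_output_schema(model_fields: list[dict], output_schema: list[dict]) -> list[dict]:
    """Combine model.fields (names + types) with per-field extraction hints."""
    hints = {h["name"]: h for h in output_schema}
    merged = []
    for field in model_fields:
        hint = hints.get(field["name"], {})
        merged.append({
            "name": field["name"],
            "type": field["type"],  # type always comes from model.fields
            "extraction_task": hint.get("extraction_task", DEFAULT_TASK),
            "accepted_values": hint.get("accepted_values"),
        })
    return merged

fields = [{"name": "brand", "type": "string"}, {"name": "price", "type": "float"}]
schema = [{"name": "brand", "extraction_task": "ner"}]
merged = merge_output_schema(fields, schema)
# 'price' is absent from output_schema, so it gets the default task
```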

Confidence Scoring

extraction:
  confidence:
    enabled: true
    method: "field_completeness"
    column: "_lakelogic_extraction_confidence"

Method              Description
field_completeness  Ratio of non-null extracted fields
log_probs           LLM log probabilities (OpenAI only)
self_assessment     Ask the LLM to rate its own confidence
consistency         Run twice, compare results
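field_completeness is the simplest of the four: the score is just the fraction of expected fields the extraction actually filled in. A minimal sketch of that calculation (illustrative, not LakeLogic's implementation):

```python
def field_completeness(extracted: dict, expected_fields: list[str]) -> float:
    """Confidence = fraction of expected fields that came back non-null."""
    if not expected_fields:
        return 0.0
    filled = sum(1 for f in expected_fields if extracted.get(f) is not None)
    return filled / len(expected_fields)

row = {"sentiment": "positive", "issue_category": None}
print(field_completeness(row, ["sentiment", "issue_category"]))  # 0.5
```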

Cost & Safety Controls

extraction:
  # Budget limits
  max_cost_per_run: 50.00       # USD cap
  max_rows_per_run: 10000       # Row limit

  # Throughput
  batch_size: 50                # Rows per API call
  concurrency: 5                # Parallel API calls

  # Retry
  retry:
    max_attempts: 3
    backoff: "exponential"
    initial_delay: 1.0

  # Fallback
  fallback_model: "gpt-4o-mini"
  fallback_provider: "openai"

  # PII safety — redact before sending to LLM
  redact_pii_before_llm: true
  pii_fields: ["email", "phone", "ssn"]
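With redact_pii_before_llm enabled, each configured PII type is masked out of the text before it leaves your network. A rough sketch of what such a pass might look like (the patterns below are illustrative only; production PII detection needs far more robust rules):

```python
import re

# Illustrative patterns only -- not LakeLogic's actual detectors
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str, pii_fields: list[str]) -> str:
    """Replace each configured PII type with a [REDACTED_*] token before the LLM call."""
    for name in pii_fields:
        pattern = PII_PATTERNS.get(name)
        if pattern:
            text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    return text

print(redact_pii("Contact jane@acme.com or 555-867-5309", ["email", "phone"]))
```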

Preprocessing: PDF, Image, Audio, Video

For non-text sources, add a preprocessing block to convert raw files into text before LLM extraction.

PDF

extraction:
  preprocessing:
    content_type: "pdf"
    ocr:
      enabled: true
      engine: "tesseract"          # or azure_di, textract, google_vision
      language: "eng"
    chunking:
      strategy: "page"             # or paragraph, sentence, fixed_size
      max_chunk_tokens: 4000
      overlap_tokens: 200
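The chunking settings walk the text in windows of max_chunk_tokens, with each window re-reading overlap_tokens from the end of the previous one so entities straddling a boundary are not lost. A simplified sketch, using whitespace-split words as a stand-in for real tokens:

```python
def chunk_text(text: str, max_chunk_tokens: int, overlap_tokens: int) -> list[str]:
    """Fixed-size chunking with overlap; whitespace words stand in for tokens."""
    words = text.split()
    if not words:
        return []
    step = max(1, max_chunk_tokens - overlap_tokens)  # advance minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_chunk_tokens]))
        if start + max_chunk_tokens >= len(words):
            break
    return chunks

chunks = chunk_text("one two three four five six seven eight",
                    max_chunk_tokens=4, overlap_tokens=1)
print(chunks)  # ['one two three four', 'four five six seven', 'seven eight']
```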

Image

extraction:
  preprocessing:
    content_type: "image"
    ocr:
      enabled: true
      engine: "tesseract"

Supported Content Types

Type   Preprocessing          Notes
pdf    OCR + chunking         Page-level chunking recommended
image  OCR → text             tesseract, Azure DI, Textract
audio  Whisper transcription  Audio → text → extraction
video  Audio track → Whisper  Video → audio → text → extraction
html   Built-in parser        HTML → clean text
email  Built-in parser        Parse headers + body
text   None                   Direct extraction

End-to-End Example: Invoice Processing

version: 1.0.0
info:
  title: "Bronze Invoice Extraction"
  target_layer: "bronze"

dataset: "invoices"

source:
  type: "landing"
  path: "data/invoices/*.pdf"

extraction:
  provider: "openai"
  model: "gpt-4o"
  temperature: 0.0

  preprocessing:
    content_type: "pdf"
    ocr:
      enabled: true
      engine: "azure_di"
    chunking:
      strategy: "page"
      max_chunk_tokens: 4000

  output_schema:
    - name: "invoice_number"
      type: "string"
      extraction_task: "extraction"
    - name: "vendor_name"
      type: "string"
      extraction_task: "ner"
    - name: "total_amount"
      type: "float"
      extraction_task: "extraction"
    - name: "currency"
      type: "string"
      accepted_values: ["USD", "EUR", "GBP"]
    - name: "invoice_date"
      type: "date"
      extraction_task: "extraction"

  confidence:
    enabled: true
    method: "field_completeness"
    column: "_extraction_confidence"

  max_cost_per_run: 25.00
  batch_size: 10
  concurrency: 3
  redact_pii_before_llm: true
  pii_fields: ["bank_account", "tax_id"]

model:
  fields:
    - name: invoice_number
      type: string
    - name: vendor_name
      type: string
    - name: total_amount
      type: float
    - name: currency
      type: string
    - name: invoice_date
      type: date
    - name: _extraction_confidence
      type: float

quality:
  row_rules:
    - not_null: invoice_number
    - not_null: total_amount
    - accepted_values:
        field: currency
        values: ["USD", "EUR", "GBP"]

materialization:
  strategy: "append"
  path: "s3://bronze/invoices"
  format: "delta"

lineage:
  enabled: true
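Running this contract returns a (good, bad) pair: rows passing every row_rule are materialised, the rest are quarantined. A minimal pure-Python sketch of how the rules above split rows (illustrative, not LakeLogic's validator):

```python
def apply_row_rules(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split extracted rows into good (all rules pass) and quarantined (any failure)."""
    good, bad = [], []
    for row in rows:
        ok = (
            row.get("invoice_number") is not None       # not_null: invoice_number
            and row.get("total_amount") is not None     # not_null: total_amount
            and row.get("currency") in {"USD", "EUR", "GBP"}  # accepted_values
        )
        (good if ok else bad).append(row)
    return good, bad

rows = [
    {"invoice_number": "INV-1", "total_amount": 99.5, "currency": "USD"},
    {"invoice_number": None, "total_amount": 10.0, "currency": "USD"},
]
good, bad = apply_row_rules(rows)
print(len(good), len(bad))  # 1 1
```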


Last Updated: March 2026