OCI Pipeline - Document OCR with Vision-Language Model

Technology Stack

Document OCR with vision-language model - leveraging DeepSeek Vision for intelligent document understanding and text extraction powered by vLLM inference server.

Vision Model DeepSeek VL

Inference Server vLLM

Processing Document OCR

Output Structured Data

Workflow 1: DeepSeek Vision OCR

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
    subgraph Input["📥 DOCUMENT INPUT"]
        DOC[/"📄 Document Image
PDF / PNG / JPG"/]
        CFG[/"⚙️ OCR Config
Language, Mode"/]
    end

    subgraph Preprocessing["🔧 IMAGE PREPROCESSING"]
        LOAD["📥 Load Image
PIL / OpenCV"]
        RESIZE["📐 Resize & Normalize
Optimal Resolution"]
        ENCODE["🔢 Encode to Base64
Model Input Format"]
    end

    subgraph VisionModel["🧠 DEEPSEEK VISION MODEL"]
        direction TB
        PROMPT["📝 Craft OCR Prompt
Extract all text..."]
        VL[("🧠 DeepSeek VL
Vision-Language Model")]
        INFER["⚡ Model Inference
Text Extraction"]
    end

    subgraph Output["📤 TEXT OUTPUT"]
        RAW["📜 Raw Extracted Text"]
        CLEAN["✨ Clean & Format"]
        RESULT["📋 Final OCR Result"]
    end

    DOC --> LOAD
    CFG --> PROMPT
    LOAD --> RESIZE
    RESIZE --> ENCODE
    ENCODE --> VL
    PROMPT --> VL
    VL --> INFER
    INFER --> RAW
    RAW --> CLEAN
    CLEAN --> RESULT

DeepSeek Vision: Multimodal vision-language model capable of understanding document layouts, handwriting, tables, and complex formatting for accurate text extraction.

Workflow 2: vLLM Inference Server

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
    subgraph Client["📱 CLIENT"]
        REQ["🌐 HTTP Request
Image + Prompt"]
        RESP["📥 Response
Extracted Text"]
    end

    subgraph Gateway["🔐 API GATEWAY"]
        AUTH["🔑 Authentication"]
        RATE["⏱️ Rate Limiting"]
        QUEUE["📋 Request Queue"]
    end

    subgraph vLLMServer["⚡ vLLM SERVER"]
        direction TB
        ENGINE[("⚡ vLLM Engine
PagedAttention")]
        BATCH["📦 Continuous Batching
Dynamic Batching"]
        KV["💾 KV Cache
Memory Optimization"]
        GPU["🎮 GPU Inference
CUDA Acceleration"]
    end

    subgraph Model["🧠 DEEPSEEK VL"]
        WEIGHTS[("🧠 Model Weights
Vision + Language")]
    end

    REQ --> AUTH
    AUTH --> RATE
    RATE --> QUEUE
    QUEUE --> ENGINE
    ENGINE --> BATCH
    BATCH --> KV
    KV --> GPU
    GPU --> WEIGHTS
    WEIGHTS --> GPU
    GPU --> RESP

vLLM Performance: High-throughput inference with PagedAttention for efficient KV cache management, continuous batching for optimal GPU utilization, and memory-efficient serving.

Workflow 3: Document Preprocessing

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
    subgraph Input["📥 RAW INPUT"]
        PDF[/"📄 PDF Document"/]
        IMG[/"🖼️ Image File"/]
        SCAN[/"📷 Scanned Doc"/]
    end

    subgraph Detection["🔍 FORMAT DETECTION"]
        DETECT["🔎 File Type Detection
MIME / Magic Bytes"]
        ROUTE{"🔀 Route by Type"}
    end

    subgraph PDFProcess["📄 PDF PROCESSING"]
        EXTRACT["📑 Extract Pages
pdf2image"]
        DPI["📐 Set DPI
300 DPI Default"]
    end

    subgraph ImageProcess["🖼️ IMAGE PROCESSING"]
        LOAD2["📥 Load Image"]
        ORIENT["🔄 Auto-Orient
EXIF Rotation"]
        DESKEW["📏 Deskew
Angle Correction"]
    end

    subgraph Normalize["📊 NORMALIZATION"]
        RESIZE2["📐 Resize
Max 2048px"]
        CONTRAST["🎨 Enhance Contrast"]
        DENOISE["🔇 Denoise
Gaussian Blur"]
        SHARP["✨ Sharpen
Edge Enhancement"]
    end

    subgraph Output["📤 MODEL INPUT"]
        TENSOR["🔢 To Tensor
Normalized Array"]
        BASE64["📝 Base64 Encode"]
        READY["✅ Ready for VL Model"]
    end

    PDF --> DETECT
    IMG --> DETECT
    SCAN --> DETECT
    DETECT --> ROUTE
    ROUTE -->|"PDF"| EXTRACT
    ROUTE -->|"Image"| LOAD2
    EXTRACT --> DPI
    DPI --> ORIENT
    LOAD2 --> ORIENT
    ORIENT --> DESKEW
    DESKEW --> RESIZE2
    RESIZE2 --> CONTRAST
    CONTRAST --> DENOISE
    DENOISE --> SHARP
    SHARP --> TENSOR
    TENSOR --> BASE64
    BASE64 --> READY

Workflow 4: Multi-Page Processing

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
    subgraph Input["📥 BATCH INPUT"]
        DOCS[/"📚 Multi-Page PDF
or Document Batch"/]
        CONFIG[/"⚙️ Batch Config
Concurrency, Priority"/]
    end

    subgraph Splitter["✂️ PAGE SPLITTER"]
        SPLIT["📑 Split to Pages"]
        INDEX["🔢 Index Pages
Maintain Order"]
        QUEUE2["📋 Page Queue"]
    end

    subgraph ParallelOCR["⚡ PARALLEL OCR"]
        direction LR
        subgraph Worker1["Worker 1"]
            W1["🧠 DeepSeek VL"]
        end
        subgraph Worker2["Worker 2"]
            W2["🧠 DeepSeek VL"]
        end
        subgraph Worker3["Worker 3"]
            W3["🧠 DeepSeek VL"]
        end
        subgraph WorkerN["Worker N"]
            WN["🧠 DeepSeek VL"]
        end
    end

    subgraph Aggregator["🔗 RESULT AGGREGATOR"]
        COLLECT["📥 Collect Results"]
        ORDER["🔢 Restore Order"]
        MERGE["🔗 Merge Text
Page Separators"]
    end

    subgraph Output["📤 FINAL OUTPUT"]
        COMBINED["📄 Combined Document"]
        META[("📊 Metadata
Page Count, Confidence")]
    end

    DOCS --> SPLIT
    CONFIG --> QUEUE2
    SPLIT --> INDEX
    INDEX --> QUEUE2
    QUEUE2 --> W1
    QUEUE2 --> W2
    QUEUE2 --> W3
    QUEUE2 --> WN
    W1 --> COLLECT
    W2 --> COLLECT
    W3 --> COLLECT
    WN --> COLLECT
    COLLECT --> ORDER
    ORDER --> MERGE
    MERGE --> COMBINED
    MERGE --> META

Parallel Processing: Distribute pages across multiple vLLM workers for high-throughput batch document processing with automatic load balancing and result aggregation.

Workflow 5: Structured Data Extraction

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
    subgraph Input["📥 RAW OCR OUTPUT"]
        RAW2[/"📜 Raw Extracted Text"/]
        SCHEMA[/"📋 Target Schema
JSON Template"/]
    end

    subgraph Analysis["🔍 TEXT ANALYSIS"]
        SEGMENT["📊 Segment Text
Headers, Body, Tables"]
        DETECT2["🏷️ Entity Detection
Dates, Names, Numbers"]
        PATTERN["🔎 Pattern Matching
Regex Extraction"]
    end

    subgraph LLMParsing["🧠 LLM STRUCTURED PARSING"]
        PROMPT2["📝 Parsing Prompt
Schema-Guided"]
        VL2[("🧠 DeepSeek VL
JSON Mode")]
        VALIDATE["✅ JSON Validation
Schema Check"]
    end

    subgraph Transform["🔄 TRANSFORMATION"]
        NORMALIZE2["📐 Normalize Values
Dates, Currency"]
        ENRICH["✨ Enrich Data
Computed Fields"]
        CLEAN2["🧹 Clean Nulls
Default Values"]
    end

    subgraph Output["📤 STRUCTURED OUTPUT"]
        JSON2[("📋 JSON Document")]
        CSV["📊 CSV Export"]
        DB[("🗄️ Database
Insert/Update")]
    end

    RAW2 --> SEGMENT
    SCHEMA --> PROMPT2
    SEGMENT --> DETECT2
    DETECT2 --> PATTERN
    PATTERN --> PROMPT2
    PROMPT2 --> VL2
    VL2 --> VALIDATE
    VALIDATE --> NORMALIZE2
    NORMALIZE2 --> ENRICH
    ENRICH --> CLEAN2
    CLEAN2 --> JSON2
    JSON2 --> CSV
    JSON2 --> DB

Workflow 6: Full OCR Architecture

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
    subgraph Intake["📥 DOCUMENT INTAKE"]
        UPLOAD["📤 Upload API
REST / gRPC"]
        STORAGE[("☁️ Object Storage
S3 / GCS")]
        TRIGGER["⚡ Event Trigger
New Document"]
    end

    subgraph Pipeline["🔄 OCR PIPELINE"]
        direction TB
        PRE["🔧 Preprocessing
Normalize Images"]
        SPLIT2["✂️ Page Splitting
Multi-Page Support"]

        subgraph InferenceCluster["⚡ vLLM CLUSTER"]
            LB["⚖️ Load Balancer"]
            N1["🧠 DeepSeek VL #1"]
            N2["🧠 DeepSeek VL #2"]
            N3["🧠 DeepSeek VL #3"]
        end

        AGG["🔗 Aggregator
Merge Results"]
    end

    subgraph PostProcess["✨ POST-PROCESSING"]
        STRUCT["📋 Structure Extraction
Tables, Forms"]
        VALIDATE2["✅ Confidence Check
Quality Score"]
        FORMAT["📝 Output Formatting
JSON, Markdown"]
    end

    subgraph Delivery["📤 DELIVERY"]
        WEBHOOK["🔔 Webhook
Callback URL"]
        QUEUE3["📬 Message Queue
Kafka / RabbitMQ"]
        API["🌐 REST API
Poll Results"]
        ARCHIVE[("🗄️ Archive
Long-term Storage")]
    end

    UPLOAD --> STORAGE
    STORAGE --> TRIGGER
    TRIGGER --> PRE
    PRE --> SPLIT2
    SPLIT2 --> LB
    LB --> N1
    LB --> N2
    LB --> N3
    N1 --> AGG
    N2 --> AGG
    N3 --> AGG
    AGG --> STRUCT
    STRUCT --> VALIDATE2
    VALIDATE2 --> FORMAT
    FORMAT --> WEBHOOK
    FORMAT --> QUEUE3
    FORMAT --> API
    FORMAT --> ARCHIVE

Production Architecture: Scalable document processing pipeline with load-balanced vLLM inference cluster, async processing via message queues, and multiple delivery options.