
How It Works

Concept-RAG processes your documents through a multi-stage pipeline that extracts concepts, generates embeddings, and enables hybrid search that combines vector, keyword, and concept signals.

flowchart TB
    subgraph Input["📄 Documents"]
        PDF[PDF Files]
        EPUB[EPUB Files]
    end

    subgraph Processing["⚙️ Processing Pipeline"]
        Parse[Parse & Extract Text]
        OCR[OCR Fallback]
        Chunk[Chunk Text]
        Extract[Extract Concepts]
        Embed[Generate Embeddings]
        Summarize[Generate Summary]
    end

    subgraph Storage["💾 LanceDB Storage"]
        Catalog[(Catalog<br/>Documents)]
        Chunks[(Chunks<br/>Text Segments)]
        Concepts[(Concepts<br/>Index)]
        Categories[(Categories<br/>Taxonomy)]
    end

    subgraph Search["🔍 Hybrid Search"]
        Vector[Vector Similarity]
        BM25[BM25 Keywords]
        ConceptMatch[Concept Matching]
        WordNet[WordNet Expansion]
        Rank[Weighted Ranking]
    end

    subgraph Output["🤖 MCP Tools"]
        Tools[10 Specialized Tools]
        AI[AI Assistants]
    end

    PDF --> Parse
    EPUB --> Parse
    Parse --> OCR
    OCR --> Chunk
    Chunk --> Extract
    Chunk --> Embed
    Extract --> Summarize

    Embed --> Chunks
    Extract --> Concepts
    Summarize --> Catalog
    Extract --> Categories

    Catalog --> Vector
    Chunks --> Vector
    Concepts --> ConceptMatch

    Vector --> Rank
    BM25 --> Rank
    ConceptMatch --> Rank
    WordNet --> Rank

    Rank --> Tools
    Tools --> AI

Pipeline Stages

1. Document Ingestion

  • PDF Processing: Text extraction with layout preservation
  • EPUB Processing: Structured content extraction from chapters
  • OCR Fallback: Tesseract for scanned documents with no extractable text
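
The fallback decision described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `parse` and `ocr` are hypothetical caller-supplied callables (for example, wrappers around a PDF parser and Tesseract), and the `min_chars` threshold is an assumption.

```python
def extract_text(path, parse, ocr, min_chars=1):
    """Return (text, used_ocr) for a document.

    `parse` and `ocr` are hypothetical callables that take a path and
    return a string. OCR runs only when parsing yields no usable text.
    """
    text = parse(path)
    if len(text.strip()) >= min_chars:
        return text, False
    # No extractable text: treat the file as a scanned document.
    return ocr(path), True
```

Passing the extractors in as arguments keeps the sketch independent of any particular PDF or OCR library.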

2. Text Chunking

Documents are split into semantic chunks optimized for retrieval:

  • Target size: ~500 tokens per chunk
  • Overlap between chunks to preserve context
  • Page number tracking for citations
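
A minimal sketch of overlapping, page-tagged chunking is shown below. It makes several assumptions that the description above does not specify: tokens are approximated by whitespace-separated words, the overlap is 50 tokens, and chunks are cut on a fixed window rather than on semantic boundaries. The names `Chunk` and `chunk_pages` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page_start: int  # first page contributing tokens to this chunk
    page_end: int    # last page contributing tokens to this chunk


def chunk_pages(pages, target_tokens=500, overlap_tokens=50):
    """Split (page_number, text) pairs into overlapping, page-tagged chunks.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    if not 0 <= overlap_tokens < target_tokens:
        raise ValueError("overlap_tokens must be in [0, target_tokens)")

    # Flatten the document into (token, page_number) pairs.
    tokens = [(word, page) for page, text in pages for word in text.split()]

    chunks = []
    step = target_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        window = tokens[start:start + target_tokens]
        chunks.append(Chunk(
            text=" ".join(word for word, _ in window),
            page_start=window[0][1],
            page_end=window[-1][1],
        ))
        if start + target_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

Because every token carries its page number, each chunk can report the page range it covers, which is what makes page-level citations possible.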

3. Concept Extraction

Each document is analyzed by an LLM to extract:

  • Primary Concepts: Core topics and themes (15-25 per document)
  • Technical Terms: Domain-specific vocabulary
  • Related Concepts: Secondary ideas and connections

4. Embedding Generation

Vector embeddings are generated for:

  • Document summaries (catalog search)
  • Individual chunks (content search)
  • Concept definitions (semantic matching)

5. Hybrid Search

Queries are scored using four signals:

| Signal | Weight | Purpose |
|---|---|---|
| Vector Similarity | 35% | Semantic meaning match |
| BM25 Keywords | 35% | Exact term matching |
| Concept Matching | 15% | Extracted concept overlap |
| WordNet Expansion | 15% | Synonym and hypernym matching |

Each result's final score is the weighted combination of these four signals, and results are ranked by that score.
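
As an illustration, the weighting can be sketched as a plain weighted sum. This assumes each signal is already normalized to the range 0 to 1 and that a missing signal counts as 0; the source does not state either. The names `WEIGHTS`, `hybrid_score`, and `rank` are hypothetical.

```python
# Weights taken from the signal weights listed above.
WEIGHTS = {
    "vector": 0.35,
    "bm25": 0.35,
    "concept": 0.15,
    "wordnet": 0.15,
}


def hybrid_score(signals):
    """Weighted sum of per-signal scores; missing signals count as 0 (assumption)."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(candidates):
    """Sort (doc_id, signals) pairs by descending hybrid score."""
    scored = [(doc_id, hybrid_score(signals)) for doc_id, signals in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these weights, a document with moderate vector similarity but strong keyword and concept matches can outrank one that scores highly on vector similarity alone.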

6. Gap Detection (Elbow Method)

Search results are filtered using gap detection instead of fixed limits:

Scores: [0.85, 0.82, 0.78, 0.75, 0.40, 0.38, 0.35]
Gaps:   [0.03, 0.04, 0.03, 0.35, 0.02, 0.03]
                            ↑ largest gap
Returns: [0.85, 0.82, 0.78, 0.75] (high-scoring cluster)

This approach is:

  • Adaptive: Returns more results for broad queries, fewer for specific ones
  • Quality-focused: Filters based on score quality, not arbitrary counts
  • Automatic: Finds the natural boundary between relevant and less-relevant results
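
The cut at the largest gap can be sketched in a few lines. This is one straightforward reading of the example above, not necessarily the exact rule the project uses; the function name `filter_by_largest_gap` is hypothetical, and ties between equal gaps are resolved in favor of the earliest one (an assumption).

```python
def filter_by_largest_gap(scores):
    """Keep the scores that sit above the largest drop between adjacent ranked scores."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return ranked  # nothing to compare, so return everything
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Index of the largest gap; max() returns the first one on ties.
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return ranked[:cut + 1]
```

Called with the scores from the example, `filter_by_largest_gap([0.85, 0.82, 0.78, 0.75, 0.40, 0.38, 0.35])` returns `[0.85, 0.82, 0.78, 0.75]`.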

| Tool | Result Filtering |
|---|---|
| catalog_search | Gap detection (1-30 results) |
| broad_chunks_search | Gap detection (1-30 results) |
| chunks_search | Fixed limit (5 results) |
| concept_search | All matching content |
| category_search | All documents in category |
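
The per-tool filtering rules above could be expressed as a small policy table. The sketch below is hypothetical: `FILTER_POLICY` and `apply_filter` are illustrative names, and reading "1-30 results" as a lower and upper bound applied to the gap-detection cut is an assumption.

```python
# Hypothetical policy table: (mode, minimum results, maximum results).
# None means the bound does not apply.
FILTER_POLICY = {
    "catalog_search": ("gap", 1, 30),
    "broad_chunks_search": ("gap", 1, 30),
    "chunks_search": ("fixed", 5, 5),
    "concept_search": ("all", None, None),
    "category_search": ("all", None, None),
}


def apply_filter(tool, scores):
    """Return the ranked scores a tool would keep under its filtering policy."""
    mode, low, high = FILTER_POLICY[tool]
    ranked = sorted(scores, reverse=True)
    if mode == "all":
        return ranked
    if mode == "fixed":
        return ranked[:high]
    # mode == "gap": cut at the largest drop, then clamp to [low, high].
    gaps = [a - b for a, b in zip(ranked, ranked[1:])]
    cut = gaps.index(max(gaps)) + 1 if gaps else len(ranked)
    return ranked[:max(low, min(cut, high))]
```

Keeping the policy in data rather than in branching code means changing a tool's limits touches only the table.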