How It Works¶

Concept-RAG processes your documents through a multi-stage pipeline that extracts meaning, generates embeddings, and enables powerful hybrid search.

flowchart TB
    subgraph Input["📄 Documents"]
        PDF[PDF Files]
        EPUB[EPUB Files]
    end

    subgraph Processing["⚙️ Processing Pipeline"]
        Parse[Parse & Extract Text]
        OCR[OCR Fallback]
        Chunk[Chunk Text]
        Extract[Extract Concepts]
        Embed[Generate Embeddings]
        Summarize[Generate Summary]
    end

    subgraph Storage["💾 LanceDB Storage"]
        Catalog[(Catalog<br/>Documents)]
        Chunks[(Chunks<br/>Text Segments)]
        Concepts[(Concepts<br/>Index)]
        Categories[(Categories<br/>Taxonomy)]
    end

    subgraph Search["🔍 Hybrid Search"]
        Vector[Vector Similarity]
        BM25[BM25 Keywords]
        ConceptMatch[Concept Matching]
        WordNet[WordNet Expansion]
        Rank[Weighted Ranking]
    end

    subgraph Output["🤖 MCP Tools"]
        Tools[10 Specialized Tools]
        AI[AI Assistants]
    end

    PDF --> Parse
    EPUB --> Parse
    Parse --> OCR
    OCR --> Chunk
    Chunk --> Extract
    Chunk --> Embed
    Extract --> Summarize

    Embed --> Chunks
    Extract --> Concepts
    Summarize --> Catalog
    Extract --> Categories

    Catalog --> Vector
    Chunks --> Vector
    Concepts --> ConceptMatch

    Vector --> Rank
    BM25 --> Rank
    ConceptMatch --> Rank
    WordNet --> Rank

    Rank --> Tools
    Tools --> AI

Pipeline Stages¶

1. Document Ingestion¶

PDF Processing: Text extraction with layout preservation
EPUB Processing: Structured content extraction from chapters
OCR Fallback: Tesseract for scanned documents with no extractable text

2. Text Chunking¶

Documents are split into semantic chunks optimized for retrieval:

Target size: ~500 tokens per chunk
Overlap between chunks to preserve context
Page number tracking for citations

3. Concept Extraction¶

Each document is analyzed by an LLM to extract:

Primary Concepts: Core topics and themes (15-25 per document)
Technical Terms: Domain-specific vocabulary
Related Concepts: Secondary ideas and connections

4. Embedding Generation¶

Vector embeddings are generated for:

Document summaries (catalog search)
Individual chunks (content search)
Concept definitions (semantic matching)

5. Hybrid Search¶

Queries are scored using four signals:

Signal	Weight	Purpose
Vector Similarity	35%	Semantic meaning match
BM25 Keywords	35%	Exact term matching
Concept Matching	15%	Extracted concept overlap
WordNet Expansion	15%	Synonym and hypernym matching

Results are combined using weighted ranking for optimal retrieval accuracy.

6. Gap Detection (Elbow Method)¶

Search results are filtered using gap detection instead of fixed limits:

Scores: [0.85, 0.82, 0.78, 0.75, 0.40, 0.38, 0.35]
Gaps:   [0.03, 0.04, 0.03, 0.35, 0.02, 0.03]
                            ↑ largest gap
Returns: [0.85, 0.82, 0.78, 0.75] (high-scoring cluster)

This approach:

Adaptive: Returns more results for broad queries, fewer for specific ones
Quality-focused: Filters based on score quality, not arbitrary counts
Automatic: Finds the natural boundary between relevant and less-relevant results

Tool	Result Filtering
`catalog_search`	Gap detection (1-30 results)
`broad_chunks_search`	Gap detection (1-30 results)
`chunks_search`	Fixed limit (5 results)
`concept_search`	All matching content
`category_search`	All documents in category