Concept-RAG Database Schema¶

Last Updated: 2025-11-29
Database: LanceDB (embedded vector database)
Embedding Model: all-MiniLM-L6-v2 (384 dimensions)
Schema Version: Normalized v7 (derived text fields, no ID caches needed)

Overview¶

Concept-RAG uses a four-table normalized architecture optimized for concept-heavy workloads:

┌─────────────┐       ┌─────────────┐       ┌─────────────┐       ┌──────────────┐
│   catalog   │──────<│   chunks    │       │  concepts   │       │  categories  │
│ (documents) │       │  (segments) │       │   (index)   │       │  (taxonomy)  │
└─────────────┘       └─────────────┘       └─────────────┘       └──────────────┘
      │                     │                     │                      │
      │                     │                     │                      │
      └─────────────────────┴─────────────────────┴──────────────────────┘
                        Many-to-many via native array columns

Design Principles¶

Normalization: No redundant data - single source of truth for each piece of information
Hash-based IDs: FNV-1a hashing for stable integer IDs across database rebuilds
Native arrays: Many-to-many relationships stored as LanceDB native arrays (not JSON strings)
ID-based references: All relationships use integer IDs (no string paths in relationships)
Derived text fields: Human-readable names stored alongside IDs for direct text queries
No runtime caches: Derived fields eliminate need for ID-to-name cache lookups
Vector storage: 384-dimensional embeddings on all tables for semantic search
LLM as source: Concepts table is populated by LLM extraction, serving as canonical source

Tables¶

1. Catalog Table¶

Purpose: Document-level metadata and summaries for document discovery.

Column	Type	Description
`id`	`number`	Hash-based integer ID (FNV-1a of source path)
`source`	`string`	Document file path (unique identifier)
`title`	`string`	Document title (parsed from filename or content)
`summary`	`string`	LLM-generated document summary
`hash`	`string`	SHA-256 content hash for deduplication
`vector`	`Float32Array`	384-dimensional embedding of summary
`concept_ids`	`number[]`	Native array of concept IDs (foreign keys to concepts table)
`concept_names`	`string[]`	DERIVED: Concept names for display and text search
`category_ids`	`number[]`	Native array of category integer IDs
`category_names`	`string[]`	DERIVED: Category names for display and text search
`origin_hash`	`string`	Reserved: Hash of original file before processing
`author`	`string`	Document author(s) (parsed from filename)
`year`	`number`	Publication year (parsed from filename)
`publisher`	`string`	Publisher name (parsed from filename)
`isbn`	`string`	ISBN (parsed from filename, preserved with hyphens)
`document_type`	`string`	Document classification: 'book', 'paper', 'article', 'unknown'
`doi`	`string`	Digital Object Identifier (e.g., "10.1109/MS.2022.3166266")
`arxiv_id`	`string`	ArXiv identifier (e.g., "2204.11193v1")
`venue`	`string`	Publication venue (journal/conference name)
`keywords`	`string[]`	Keywords from paper metadata or extraction
`abstract`	`string`	Paper abstract (distinct from LLM summary)
`authors`	`string[]`	Array of author names (for multi-author papers)

Filename Metadata Parsing¶

Bibliographic fields are automatically extracted from filenames using the -- delimiter format:

Title -- Author -- Date -- Publisher -- ISBN -- Hash.ext

Example filename:

Effective software testing -- Elfriede Dustin -- December 18, 2002 -- Addison-Wesley -- 9780201794298 -- 5297d243.pdf

Normalization rules: - Underscores (_) are converted to spaces - URL-encoded spaces (%20, _20) are converted to spaces
- Multiple consecutive spaces are collapsed to single space - Fields that cannot be parsed are left empty (string) or zero (number) - If no -- delimiters exist, entire filename (without extension) becomes the title

Research Paper Metadata Extraction¶

For research papers (especially LaTeX-generated PDFs like arXiv preprints), metadata is extracted from multiple sources:

1. Filename Patterns: - ArXiv IDs: 2204.11193v1.pdf → arxiv_id: "2204.11193v1" - DOI-based: 10.1109-MS.2022.3166266.pdf → doi: "10.1109/MS.2022.3166266"

2. PDF Metadata Dictionary: - Title, Author, Subject, Keywords (when available) - Often empty for LaTeX-generated PDFs

3. Content-Based Extraction: - Title: First prominent text on page 1 - Authors: Names following title, before abstract - Abstract: Text following "Abstract" heading - Keywords: Extracted from "Keywords:" section if present

Document Type Detection: Papers are distinguished from books using heuristics: - Page count (papers typically <50 pages) - Presence of "Abstract" section - Citation patterns ([1], [2] style references) - Filename patterns (arXiv ID, DOI)

Priority Order: 1. Content-based extraction (most reliable for LaTeX PDFs) 2. PDF metadata dictionary (for publisher PDFs) 3. Filename parsing (fallback)

Example Record (Book)¶

{
  id: 3847293847,
  source: "/home/user/ebooks/Clean Architecture -- Robert Martin -- 2017 -- Pearson -- 9780134494166 -- abc123.pdf",
  title: "Clean Architecture",
  summary: "Comprehensive guide to Clean Architecture principles...",
  hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  vector: Float32Array(384),
  concept_ids: [2938475683, 1029384756, 3847293900],
  concept_names: ["clean architecture", "dependency injection", "solid principles"],
  category_ids: [1847362847, 2938476523],
  category_names: ["software architecture", "design patterns"],
  // Bibliographic fields (parsed from filename)
  origin_hash: "",
  author: "Robert Martin",
  year: 2017,
  publisher: "Pearson",
  isbn: "9780134494166",
  // Research paper fields (empty for books)
  document_type: "book",
  doi: "",
  arxiv_id: "",
  venue: "",
  keywords: [],
  abstract: "",
  authors: []
}

Example Record (Research Paper)¶

{
  id: 1029384756,
  source: "/home/user/papers/2204.11193v1.pdf",
  title: "Exploring Security Practices of Smart Contract Developers",
  summary: "Study examining how developers approach security in smart contract development...",
  hash: "a1b2c3d4e5f6...",
  vector: Float32Array(384),
  concept_ids: [1111111111, 2222222222],
  concept_names: ["smart contracts", "blockchain security"],
  category_ids: [3333333333],
  category_names: ["blockchain", "security"],
  // Bibliographic fields
  origin_hash: "",
  author: "Tanusree Sharma et al.",
  year: 2022,
  publisher: "",
  isbn: "",
  // Research paper fields (extracted from content/metadata)
  document_type: "paper",
  doi: "",
  arxiv_id: "2204.11193v1",
  venue: "arXiv preprint",
  keywords: ["smart contract", "security", "blockchain", "developer"],
  abstract: "Smart contracts are self-executing programs that run on blockchains...",
  authors: ["Tanusree Sharma", "Zhixuan Zhou", "Andrew Miller", "Yang Wang"]
}

2. Chunks Table¶

Purpose: Text segments for fine-grained retrieval and semantic search.

Column	Type	Description
`id`	`number`	Hash-based integer ID (FNV-1a of source-hash-index)
`catalog_id`	`number`	Required. Hash-based catalog entry ID (foreign key to catalog)
`catalog_title`	`string`	DERIVED: Document title from catalog (for display)
`text`	`string`	Chunk text content (typically 100-500 words)
`hash`	`string`	Content hash for deduplication
`vector`	`Float32Array`	384-dimensional embedding
`concept_ids`	`number[]`	Native array of concept integer IDs (for concept-chunk linkage)
`concept_names`	`string[]`	DERIVED: Concept names for display and text search
`concept_density`	`number`	Concept richness score: `concept_ids.length / (word_count / 10)`
`page_number`	`number`	Page number in source document (from PDF loader)
`is_reference`	`boolean`	True if chunk is from bibliography/references section
`has_math`	`boolean`	True if chunk contains mathematical content (symbols, equations)
`has_extraction_issues`	`boolean`	True if chunk has extraction quality issues (e.g., garbled math)

Note: The source field was removed in v7. Chunks store catalog_title for display and catalog_id for joins. Tools should use catalog_title directly.

Note: The category_ids field was removed - use catalog_id → catalog.category_ids for category lookup.

Note: No ID-to-name caches are needed at runtime. Use concept_names and catalog_title directly.

Example Record¶

{
  id: 2938475612,  // hash-based integer
  catalog_id: 3847293847,  // foreign key to catalog
  catalog_title: "Clean Architecture",  // DERIVED: for display
  text: "Clean architecture emphasizes separation of concerns...",
  hash: "def456",
  vector: Float32Array(384),
  concept_ids: [3847293847, 1928374652, 2837465928],
  concept_names: ["clean architecture", "separation of concerns", "dependency rule"],  // DERIVED
  page_number: 15,
  is_reference: false,  // Not from bibliography section
  has_math: false,  // No mathematical content
  has_extraction_issues: false  // Clean extraction
}

3. Concepts Table¶

Purpose: Deduplicated concept index with semantic enrichment. Canonical source - populated by LLM extraction.

Column	Type	Description
`id`	`number`	Hash-based integer ID (FNV-1a of concept name)
`name`	`string`	Concept name (unique, lowercase, e.g., "dependency injection")
`summary`	`string`	LLM-generated contextual summary of the concept
`catalog_ids`	`number[]`	Native array of catalog entry integer IDs
`catalog_titles`	`string[]`	DERIVED: Document titles for display
`chunk_ids`	`number[]`	Native array of chunk IDs where concept appears
`adjacent_ids`	`number[]`	Co-occurrence links (concepts appearing together in documents)
`related_ids`	`number[]`	Lexical links (concepts sharing significant words)
`synonyms`	`string[]`	Native array of synonyms (from WordNet)
`broader_terms`	`string[]`	Native array of hypernyms (from WordNet)
`narrower_terms`	`string[]`	Native array of hyponyms (from WordNet)
`weight`	`number`	Importance weight (0-1, based on frequency/centrality)
`vector`	`Float32Array`	384-dimensional concept embedding

Concept Summaries¶

Concept summaries are generated during extraction (same LLM call) to ensure alignment with the source text: - 15-30 words describing the concept in the document's context - Domain-specific terminology from the source document - Enables fuzzy text search on concept meanings

Concept Linking¶

Two types of concept relationships:

Adjacent (co-occurrence): Concepts that appear together in the same document
E.g., "clean architecture" and "dependency injection" both in same book
Related (lexical): Concepts sharing significant words (≥5 chars)
E.g., "military strategy" ↔ "strategy pattern" (share "strategy")

Example Record¶

{
  id: 3847293847,
  name: "clean architecture",
  summary: "A software design approach that separates business logic from infrastructure through dependency inversion and clear boundaries.",
  catalog_ids: [1029384756, 2938475612],
  catalog_titles: ["Clean Architecture", "Domain-Driven Design"],  // DERIVED
  chunk_ids: [123456, 234567, 345678],
  adjacent_ids: [2938475683, 1029384756],  // co-occurrence
  related_ids: [1847362999, 2938476000],   // lexical links
  synonyms: ["uncle bob architecture"],
  broader_terms: ["software architecture", "system design"],
  narrower_terms: ["dependency rule", "boundary layer"],
  weight: 0.85,
  vector: Float32Array(384)
}

4. Categories Table¶

Purpose: Taxonomy of semantic categories with hierarchy for domain browsing.

Column	Type	Description
`id`	`number`	Hash-based integer ID (FNV-1a of category name)
`category`	`string`	Normalized category name (e.g., "software architecture")
`description`	`string`	Human-readable category description
`summary`	`string`	LLM-generated one-sentence summary of the category
`parent_category_id`	`number \\| null`	Parent category ID (null for root categories)
`aliases`	`string[]`	Native array of alternative names
`related_categories`	`number[]`	Native array of co-occurring category IDs
`document_count`	`number`	Number of documents in this category
`chunk_count`	`number`	Number of chunks tagged with this category
`concept_count`	`number`	Number of unique concepts in this category
`vector`	`Float32Array`	384-dimensional category embedding

Example Record¶

{
  id: 1847362847,
  category: "software architecture",
  description: "Patterns and practices for designing software systems",
  summary: "The high-level structure and organization of software systems...",
  parent_category_id: null,
  aliases: ["software design", "system architecture"],
  related_categories: [2938476523, 1029384756],
  document_count: 25,
  chunk_count: 1250,
  concept_count: 340,
  vector: Float32Array(384)
}

Derived Fields¶

DERIVED fields are denormalized for query performance and display. They can be regenerated from canonical sources at any time.

Design Philosophy: IDs for Truth, Names for Queries¶

IDs (concept_ids, catalog_ids, etc.)    →  SOURCE OF TRUTH (foreign keys)
Names (concept_names, catalog_title)    →  PRIMARY FOR QUERIES (human-readable)

Runtime queries use TEXT fields directly (fast, human-readable, no cache lookups):

// Query on derived text fields - no cache needed
chunks.query().where(`array_contains(concept_names, 'dependency injection')`)
concepts.query().where(`array_contains(catalog_titles, 'Clean Architecture')`)

// Display uses derived fields directly
const title = chunk.catalog_title;  // Not: cache.getTitle(chunk.catalog_id)
const concepts = chunk.concept_names;  // Not: cache.getNames(chunk.concept_ids)

Derived Field Summary¶

Table	Field	Type	Regeneration Source
`chunks`	`catalog_title`	`string`	`catalog_id` → `catalog.title`
`chunks`	`concept_names`	`string[]`	`concept_ids` → `concepts.name`
`catalog`	`concept_names`	`string[]`	`concept_ids` → `concepts.name`
`catalog`	`category_names`	`string[]`	`category_ids` → `categories.category`
`concepts`	`catalog_titles`	`string[]`	`catalog_ids` → `catalog.title`

Regenerating Derived Fields¶

Use the regeneration script to rebuild all derived name fields:

# Full regeneration
npx tsx scripts/rebuild_derived_names.ts --dbpath ~/.concept_rag

# Target specific table
npx tsx scripts/rebuild_derived_names.ts --table chunks

# Dry run to see what would change
npx tsx scripts/rebuild_derived_names.ts --dry-run

Relationships¶

Catalog (1) ──────< (N) Chunks       // One document has many chunks (via catalog_id)
Catalog (N) >─────< (N) Categories   // Documents belong to categories (via category_ids)
Catalog (N) >─────< (N) Concepts     // Documents tagged with concepts (via concept_ids)
Chunks (N) >──────< (N) Concepts     // Chunks tagged with concepts (via concept_ids)
Concepts (N) >────< (N) Catalog      // Concepts appear in documents (via catalog_ids)
Concepts (N) >────< (N) Concepts     // Adjacent concepts (via adjacent_ids)
Concepts (N) >────< (N) Concepts     // Related concepts (via related_ids)
Categories (N) ───< (1) Categories   // Hierarchical parent-child (via parent_category_id)

Concept Retrieval Flow¶

concept_search("strategy pattern")
       │
       ▼
┌─────────────────┐
│ concepts table  │  Find concept by name → get catalog_ids, catalog_titles
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ catalog table   │  Documents listed in catalog_titles (no extra lookup)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ chunks table    │  Find chunks with concept_name, display catalog_title
└─────────────────┘

Querying Relationships¶

// Find chunks by concept name (uses derived field)
const chunks = await chunksTable
  .query()
  .where(`array_contains(concept_names, 'dependency injection')`)
  .toArray();

// Display chunk results (uses derived fields directly)
for (const chunk of chunks) {
  console.log(`Title: ${chunk.catalog_title}`);
  console.log(`Concepts: ${chunk.concept_names.join(', ')}`);
}

// Find documents with a concept
const docs = await catalogTable
  .query()
  .where(`array_contains(concept_names, 'clean architecture')`)
  .toArray();

// Find documents in a category
const docs = await catalogTable
  .query()
  .where(`array_contains(category_names, 'software architecture')`)
  .toArray();

ID Generation¶

All integer IDs use FNV-1a hashing for deterministic, stable IDs:

const FNV_OFFSET_BASIS = 2166136261;
const FNV_PRIME = 16777619;

function hashToId(str: string): number {
  let hash = FNV_OFFSET_BASIS;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, FNV_PRIME);
  }
  return hash >>> 0; // Unsigned 32-bit integer
}

Benefits: - Same input → same ID (deterministic) - No external mapping files needed - IDs stable across database rebuilds - Efficient integer comparisons

Vector Index¶

The chunks table uses an IVF-PQ vector index for fast similarity search:

await chunksTable.createIndex("vector", {
  config: {
    type: 'ivf_pq',
    numPartitions: Math.max(2, Math.floor(rowCount / 100)),
    numSubVectors: 16
  }
});

Index Configuration: - Type: IVF-PQ (Inverted File with Product Quantization) - Partitions: Dynamic based on table size - Sub-vectors: 16 (for 384-dim embeddings) - Minimum rows: 256 (required for IVF-PQ)

Production Statistics (November 2025)¶

Metric	Value
Total Size	~180 MB (40 docs) to 1.1 GB (259 docs)
Documents (catalog)	40-259 (varies by deployment)
Chunks	~82,000 (40 docs) to 471,000+ (259 docs)
Concepts	~5,000 (40 docs) to 60,000+ (259 docs)
Categories	76-687 (varies by content)
Avg concepts per chunk	~3-5

Note: Statistics vary by deployment. The preprod database with 40 documents is approximately 183 MB with 82,663 chunks and 4,787 concepts.

Schema Evolution¶

Date	Change	ADR
2024-11	Initial two-table (catalog, chunks)	ADR-0002
2025-10-13	Added concepts table (three-table architecture)	ADR-0009
2025-11-19	Hash-based integer IDs	ADR-0027
2025-11-19	Added categories table (four-table architecture)	ADR-0028
2025-11-26	Schema normalization (redundant field removal)	ADR-0043
2025-11-28	Added derived name fields: `concept_names`, `category_names`, `catalog_titles`	-
2025-11-28	Replaced `source` with `catalog_title` in chunks (v7)	-
2025-11-28	Removed ID mapping caches (ConceptIdCache, CatalogSourceCache, CategoryIdCache)	-
2025-11-28	Added `summary` field to concepts (extracted with concept)	-
2025-11-29	Parallel concept extraction support (SharedRateLimiter, ParallelConceptExtractor)	-
2025-12-07	Added research paper metadata fields: `document_type`, `doi`, `arxiv_id`, `venue`, `keywords`, `abstract`, `authors`	-

Schema Changes (v7 - November 2025)¶

Philosophy: Derived Text Fields Replace Caches¶

The v7 schema eliminates the need for runtime ID-to-name caches by storing derived text fields directly:

Before (v6)	After (v7)
`chunk.catalog_id` → `CatalogSourceCache.getSource()`	`chunk.catalog_title` (direct)
`chunk.concept_ids` → `ConceptIdCache.getNames()`	`chunk.concept_names` (direct)
`concept.catalog_ids` → cache lookup	`concept.catalog_titles` (direct)

Chunks Table Changes (v7)¶

Change	Details
`source`	Removed - Use `catalog_title` for display
`catalog_title`	Added - Document title from catalog (derived)
`concept_names`	Added - Concept names for display (derived)

Research Paper Fields (v8)¶

New fields added for research paper support:

Field	Type	Description
`is_reference`	`boolean`	True if chunk is from references/bibliography section
`has_math`	`boolean`	True if chunk contains mathematical content
`has_extraction_issues`	`boolean`	True if extraction quality issues detected

Mathematical Content Detection: - Greek letters (α, β, γ, etc.) - Mathematical symbols (∑, ∫, ∂, etc.) - Subscripts/superscripts (x₁, x²) - LaTeX commands that leaked through - Garbled math: Detects broken PDF extraction where Mathematical Alphanumeric Symbols (U+1D400-U+1D7FF) appear as Hangul syllables (U+D400-U+D7FF)

Use Cases: - is_reference=true: Exclude from semantic search (citations not meaningful for content search) - has_math=true: Route to specialized math-aware processing - has_extraction_issues=true: Flag for OCR fallback or manual review

Removed: ID Mapping Caches¶

The following caches are no longer needed and have been removed:

ConceptIdCache - Replaced by concept_names derived field
CatalogSourceCache - Replaced by catalog_title derived field
CategoryIdCache - Replaced by category_names derived field

Performance Caches (Still Used)¶

These performance caches remain for CPU/DB optimization:

EmbeddingCache - Avoids recomputing embeddings
SearchResultCache - Caches repeated search queries

ADR-0002: LanceDB Vector Storage
ADR-0009: Three-Table Architecture
ADR-0027: Hash-Based Integer IDs
ADR-0043: Schema Normalization
Domain Models: src/domain/models/
Schema Validators: src/infrastructure/lancedb/utils/schema-validators.ts
Lexical Linking: scripts/link_related_concepts.ts
Derived Fields Regeneration: scripts/rebuild_derived_names.ts
Seeding Script: hybrid_fast_seed.ts (supports --parallel N for parallel concept extraction)

Concept-RAG Database Schema¶

Overview¶

Design Principles¶

Tables¶

1. Catalog Table¶

Filename Metadata Parsing¶

Research Paper Metadata Extraction¶

Example Record (Book)¶

Example Record (Research Paper)¶

2. Chunks Table¶

Example Record¶

3. Concepts Table¶

Concept Summaries¶

Concept Linking¶

Example Record¶

4. Categories Table¶

Example Record¶

Derived Fields¶

Design Philosophy: IDs for Truth, Names for Queries¶

Derived Field Summary¶

Regenerating Derived Fields¶

Relationships¶

Concept Retrieval Flow¶

Querying Relationships¶

ID Generation¶

Vector Index¶

Production Statistics (November 2025)¶

Schema Evolution¶

Schema Changes (v7 - November 2025)¶

Philosophy: Derived Text Fields Replace Caches¶

Chunks Table Changes (v7)¶

Research Paper Fields (v8)¶

Removed: ID Mapping Caches¶

Performance Caches (Still Used)¶

Related Documentation¶