Concept-RAG Database Schema¶
Last Updated: 2025-11-29
Database: LanceDB (embedded vector database)
Embedding Model: all-MiniLM-L6-v2 (384 dimensions)
Schema Version: Normalized v7 (derived text fields, no ID caches needed)
Overview¶
Concept-RAG uses a four-table normalized architecture optimized for concept-heavy workloads:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐
│ catalog │──────<│ chunks │ │ concepts │ │ categories │
│ (documents) │ │ (segments) │ │ (index) │ │ (taxonomy) │
└─────────────┘ └─────────────┘ └─────────────┘ └──────────────┘
│ │ │ │
│ │ │ │
└─────────────────────┴─────────────────────┴──────────────────────┘
Many-to-many via native array columns
Design Principles¶
- Normalization: No redundant data - single source of truth for each piece of information
- Hash-based IDs: FNV-1a hashing for stable integer IDs across database rebuilds
- Native arrays: Many-to-many relationships stored as LanceDB native arrays (not JSON strings)
- ID-based references: All relationships use integer IDs (no string paths in relationships)
- Derived text fields: Human-readable names stored alongside IDs for direct text queries
- No runtime caches: Derived fields eliminate need for ID-to-name cache lookups
- Vector storage: 384-dimensional embeddings on all tables for semantic search
- LLM as source: Concepts table is populated by LLM extraction, serving as canonical source
Tables¶
1. Catalog Table¶
Purpose: Document-level metadata and summaries for document discovery.
| Column | Type | Description |
|---|---|---|
id |
number |
Hash-based integer ID (FNV-1a of source path) |
source |
string |
Document file path (unique identifier) |
title |
string |
Document title (parsed from filename or content) |
summary |
string |
LLM-generated document summary |
hash |
string |
SHA-256 content hash for deduplication |
vector |
Float32Array |
384-dimensional embedding of summary |
concept_ids |
number[] |
Native array of concept IDs (foreign keys to concepts table) |
concept_names |
string[] |
DERIVED: Concept names for display and text search |
category_ids |
number[] |
Native array of category integer IDs |
category_names |
string[] |
DERIVED: Category names for display and text search |
origin_hash |
string |
Reserved: Hash of original file before processing |
author |
string |
Document author(s) (parsed from filename) |
year |
number |
Publication year (parsed from filename) |
publisher |
string |
Publisher name (parsed from filename) |
isbn |
string |
ISBN (parsed from filename, preserved with hyphens) |
document_type |
string |
Document classification: 'book', 'paper', 'article', 'unknown' |
doi |
string |
Digital Object Identifier (e.g., "10.1109/MS.2022.3166266") |
arxiv_id |
string |
ArXiv identifier (e.g., "2204.11193v1") |
venue |
string |
Publication venue (journal/conference name) |
keywords |
string[] |
Keywords from paper metadata or extraction |
abstract |
string |
Paper abstract (distinct from LLM summary) |
authors |
string[] |
Array of author names (for multi-author papers) |
Filename Metadata Parsing¶
Bibliographic fields are automatically extracted from filenames using the -- delimiter format:
Example filename:
Effective software testing -- Elfriede Dustin -- December 18, 2002 -- Addison-Wesley -- 9780201794298 -- 5297d243.pdf
Normalization rules:
- Underscores (_) are converted to spaces
- URL-encoded spaces (%20, _20) are converted to spaces
- Multiple consecutive spaces are collapsed to single space
- Fields that cannot be parsed are left empty (string) or zero (number)
- If no -- delimiters exist, entire filename (without extension) becomes the title
Research Paper Metadata Extraction¶
For research papers (especially LaTeX-generated PDFs like arXiv preprints), metadata is extracted from multiple sources:
1. Filename Patterns:
- ArXiv IDs: 2204.11193v1.pdf → arxiv_id: "2204.11193v1"
- DOI-based: 10.1109-MS.2022.3166266.pdf → doi: "10.1109/MS.2022.3166266"
2. PDF Metadata Dictionary: - Title, Author, Subject, Keywords (when available) - Often empty for LaTeX-generated PDFs
3. Content-Based Extraction: - Title: First prominent text on page 1 - Authors: Names following title, before abstract - Abstract: Text following "Abstract" heading - Keywords: Extracted from "Keywords:" section if present
Document Type Detection:
Papers are distinguished from books using heuristics:
- Page count (papers typically <50 pages)
- Presence of "Abstract" section
- Citation patterns ([1], [2] style references)
- Filename patterns (arXiv ID, DOI)
Priority Order: 1. Content-based extraction (most reliable for LaTeX PDFs) 2. PDF metadata dictionary (for publisher PDFs) 3. Filename parsing (fallback)
Example Record (Book)¶
{
id: 3847293847,
source: "/home/user/ebooks/Clean Architecture -- Robert Martin -- 2017 -- Pearson -- 9780134494166 -- abc123.pdf",
title: "Clean Architecture",
summary: "Comprehensive guide to Clean Architecture principles...",
hash: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
vector: Float32Array(384),
concept_ids: [2938475683, 1029384756, 3847293900],
concept_names: ["clean architecture", "dependency injection", "solid principles"],
category_ids: [1847362847, 2938476523],
category_names: ["software architecture", "design patterns"],
// Bibliographic fields (parsed from filename)
origin_hash: "",
author: "Robert Martin",
year: 2017,
publisher: "Pearson",
isbn: "9780134494166",
// Research paper fields (empty for books)
document_type: "book",
doi: "",
arxiv_id: "",
venue: "",
keywords: [],
abstract: "",
authors: []
}
Example Record (Research Paper)¶
{
id: 1029384756,
source: "/home/user/papers/2204.11193v1.pdf",
title: "Exploring Security Practices of Smart Contract Developers",
summary: "Study examining how developers approach security in smart contract development...",
hash: "a1b2c3d4e5f6...",
vector: Float32Array(384),
concept_ids: [1111111111, 2222222222],
concept_names: ["smart contracts", "blockchain security"],
category_ids: [3333333333],
category_names: ["blockchain", "security"],
// Bibliographic fields
origin_hash: "",
author: "Tanusree Sharma et al.",
year: 2022,
publisher: "",
isbn: "",
// Research paper fields (extracted from content/metadata)
document_type: "paper",
doi: "",
arxiv_id: "2204.11193v1",
venue: "arXiv preprint",
keywords: ["smart contract", "security", "blockchain", "developer"],
abstract: "Smart contracts are self-executing programs that run on blockchains...",
authors: ["Tanusree Sharma", "Zhixuan Zhou", "Andrew Miller", "Yang Wang"]
}
2. Chunks Table¶
Purpose: Text segments for fine-grained retrieval and semantic search.
| Column | Type | Description |
|---|---|---|
id |
number |
Hash-based integer ID (FNV-1a of source-hash-index) |
catalog_id |
number |
Required. Hash-based catalog entry ID (foreign key to catalog) |
catalog_title |
string |
DERIVED: Document title from catalog (for display) |
text |
string |
Chunk text content (typically 100-500 words) |
hash |
string |
Content hash for deduplication |
vector |
Float32Array |
384-dimensional embedding |
concept_ids |
number[] |
Native array of concept integer IDs (for concept-chunk linkage) |
concept_names |
string[] |
DERIVED: Concept names for display and text search |
concept_density |
number |
Concept richness score: concept_ids.length / (word_count / 10) |
page_number |
number |
Page number in source document (from PDF loader) |
is_reference |
boolean |
True if chunk is from bibliography/references section |
has_math |
boolean |
True if chunk contains mathematical content (symbols, equations) |
has_extraction_issues |
boolean |
True if chunk has extraction quality issues (e.g., garbled math) |
Note: The
sourcefield was removed in v7. Chunks storecatalog_titlefor display andcatalog_idfor joins. Tools should usecatalog_titledirectly.Note: The
category_idsfield was removed - usecatalog_id→catalog.category_idsfor category lookup.Note: No ID-to-name caches are needed at runtime. Use
concept_namesandcatalog_titledirectly.
Example Record¶
{
id: 2938475612, // hash-based integer
catalog_id: 3847293847, // foreign key to catalog
catalog_title: "Clean Architecture", // DERIVED: for display
text: "Clean architecture emphasizes separation of concerns...",
hash: "def456",
vector: Float32Array(384),
concept_ids: [3847293847, 1928374652, 2837465928],
concept_names: ["clean architecture", "separation of concerns", "dependency rule"], // DERIVED
page_number: 15,
is_reference: false, // Not from bibliography section
has_math: false, // No mathematical content
has_extraction_issues: false // Clean extraction
}
3. Concepts Table¶
Purpose: Deduplicated concept index with semantic enrichment. Canonical source - populated by LLM extraction.
| Column | Type | Description |
|---|---|---|
id |
number |
Hash-based integer ID (FNV-1a of concept name) |
name |
string |
Concept name (unique, lowercase, e.g., "dependency injection") |
summary |
string |
LLM-generated contextual summary of the concept |
catalog_ids |
number[] |
Native array of catalog entry integer IDs |
catalog_titles |
string[] |
DERIVED: Document titles for display |
chunk_ids |
number[] |
Native array of chunk IDs where concept appears |
adjacent_ids |
number[] |
Co-occurrence links (concepts appearing together in documents) |
related_ids |
number[] |
Lexical links (concepts sharing significant words) |
synonyms |
string[] |
Native array of synonyms (from WordNet) |
broader_terms |
string[] |
Native array of hypernyms (from WordNet) |
narrower_terms |
string[] |
Native array of hyponyms (from WordNet) |
weight |
number |
Importance weight (0-1, based on frequency/centrality) |
vector |
Float32Array |
384-dimensional concept embedding |
Concept Summaries¶
Concept summaries are generated during extraction (same LLM call) to ensure alignment with the source text: - 15-30 words describing the concept in the document's context - Domain-specific terminology from the source document - Enables fuzzy text search on concept meanings
Concept Linking¶
Two types of concept relationships:
- Adjacent (co-occurrence): Concepts that appear together in the same document
-
E.g., "clean architecture" and "dependency injection" both in same book
-
Related (lexical): Concepts sharing significant words (≥5 chars)
- E.g., "military strategy" ↔ "strategy pattern" (share "strategy")
Example Record¶
{
id: 3847293847,
name: "clean architecture",
summary: "A software design approach that separates business logic from infrastructure through dependency inversion and clear boundaries.",
catalog_ids: [1029384756, 2938475612],
catalog_titles: ["Clean Architecture", "Domain-Driven Design"], // DERIVED
chunk_ids: [123456, 234567, 345678],
adjacent_ids: [2938475683, 1029384756], // co-occurrence
related_ids: [1847362999, 2938476000], // lexical links
synonyms: ["uncle bob architecture"],
broader_terms: ["software architecture", "system design"],
narrower_terms: ["dependency rule", "boundary layer"],
weight: 0.85,
vector: Float32Array(384)
}
4. Categories Table¶
Purpose: Taxonomy of semantic categories with hierarchy for domain browsing.
| Column | Type | Description |
|---|---|---|
id |
number |
Hash-based integer ID (FNV-1a of category name) |
category |
string |
Normalized category name (e.g., "software architecture") |
description |
string |
Human-readable category description |
summary |
string |
LLM-generated one-sentence summary of the category |
parent_category_id |
number \| null |
Parent category ID (null for root categories) |
aliases |
string[] |
Native array of alternative names |
related_categories |
number[] |
Native array of co-occurring category IDs |
document_count |
number |
Number of documents in this category |
chunk_count |
number |
Number of chunks tagged with this category |
concept_count |
number |
Number of unique concepts in this category |
vector |
Float32Array |
384-dimensional category embedding |
Example Record¶
{
id: 1847362847,
category: "software architecture",
description: "Patterns and practices for designing software systems",
summary: "The high-level structure and organization of software systems...",
parent_category_id: null,
aliases: ["software design", "system architecture"],
related_categories: [2938476523, 1029384756],
document_count: 25,
chunk_count: 1250,
concept_count: 340,
vector: Float32Array(384)
}
Derived Fields¶
DERIVED fields are denormalized for query performance and display. They can be regenerated from canonical sources at any time.
Design Philosophy: IDs for Truth, Names for Queries¶
IDs (concept_ids, catalog_ids, etc.) → SOURCE OF TRUTH (foreign keys)
Names (concept_names, catalog_title) → PRIMARY FOR QUERIES (human-readable)
Runtime queries use TEXT fields directly (fast, human-readable, no cache lookups):
// Query on derived text fields - no cache needed
chunks.query().where(`array_contains(concept_names, 'dependency injection')`)
concepts.query().where(`array_contains(catalog_titles, 'Clean Architecture')`)
// Display uses derived fields directly
const title = chunk.catalog_title; // Not: cache.getTitle(chunk.catalog_id)
const concepts = chunk.concept_names; // Not: cache.getNames(chunk.concept_ids)
Derived Field Summary¶
| Table | Field | Type | Regeneration Source |
|---|---|---|---|
chunks |
catalog_title |
string |
catalog_id → catalog.title |
chunks |
concept_names |
string[] |
concept_ids → concepts.name |
catalog |
concept_names |
string[] |
concept_ids → concepts.name |
catalog |
category_names |
string[] |
category_ids → categories.category |
concepts |
catalog_titles |
string[] |
catalog_ids → catalog.title |
Regenerating Derived Fields¶
Use the regeneration script to rebuild all derived name fields:
# Full regeneration
npx tsx scripts/rebuild_derived_names.ts --dbpath ~/.concept_rag
# Target specific table
npx tsx scripts/rebuild_derived_names.ts --table chunks
# Dry run to see what would change
npx tsx scripts/rebuild_derived_names.ts --dry-run
Relationships¶
Catalog (1) ──────< (N) Chunks // One document has many chunks (via catalog_id)
Catalog (N) >─────< (N) Categories // Documents belong to categories (via category_ids)
Catalog (N) >─────< (N) Concepts // Documents tagged with concepts (via concept_ids)
Chunks (N) >──────< (N) Concepts // Chunks tagged with concepts (via concept_ids)
Concepts (N) >────< (N) Catalog // Concepts appear in documents (via catalog_ids)
Concepts (N) >────< (N) Concepts // Adjacent concepts (via adjacent_ids)
Concepts (N) >────< (N) Concepts // Related concepts (via related_ids)
Categories (N) ───< (1) Categories // Hierarchical parent-child (via parent_category_id)
Concept Retrieval Flow¶
concept_search("strategy pattern")
│
▼
┌─────────────────┐
│ concepts table │ Find concept by name → get catalog_ids, catalog_titles
└────────┬────────┘
│
▼
┌─────────────────┐
│ catalog table │ Documents listed in catalog_titles (no extra lookup)
└────────┬────────┘
│
▼
┌─────────────────┐
│ chunks table │ Find chunks with concept_name, display catalog_title
└─────────────────┘
Querying Relationships¶
// Find chunks by concept name (uses derived field)
const chunks = await chunksTable
.query()
.where(`array_contains(concept_names, 'dependency injection')`)
.toArray();
// Display chunk results (uses derived fields directly)
for (const chunk of chunks) {
console.log(`Title: ${chunk.catalog_title}`);
console.log(`Concepts: ${chunk.concept_names.join(', ')}`);
}
// Find documents with a concept
const docs = await catalogTable
.query()
.where(`array_contains(concept_names, 'clean architecture')`)
.toArray();
// Find documents in a category
const docs = await catalogTable
.query()
.where(`array_contains(category_names, 'software architecture')`)
.toArray();
ID Generation¶
All integer IDs use FNV-1a hashing for deterministic, stable IDs:
const FNV_OFFSET_BASIS = 2166136261;
const FNV_PRIME = 16777619;
function hashToId(str: string): number {
let hash = FNV_OFFSET_BASIS;
for (let i = 0; i < str.length; i++) {
hash ^= str.charCodeAt(i);
hash = Math.imul(hash, FNV_PRIME);
}
return hash >>> 0; // Unsigned 32-bit integer
}
Benefits: - Same input → same ID (deterministic) - No external mapping files needed - IDs stable across database rebuilds - Efficient integer comparisons
Vector Index¶
The chunks table uses an IVF-PQ vector index for fast similarity search:
await chunksTable.createIndex("vector", {
config: {
type: 'ivf_pq',
numPartitions: Math.max(2, Math.floor(rowCount / 100)),
numSubVectors: 16
}
});
Index Configuration: - Type: IVF-PQ (Inverted File with Product Quantization) - Partitions: Dynamic based on table size - Sub-vectors: 16 (for 384-dim embeddings) - Minimum rows: 256 (required for IVF-PQ)
Production Statistics (November 2025)¶
| Metric | Value |
|---|---|
| Total Size | ~180 MB (40 docs) to 1.1 GB (259 docs) |
| Documents (catalog) | 40-259 (varies by deployment) |
| Chunks | ~82,000 (40 docs) to 471,000+ (259 docs) |
| Concepts | ~5,000 (40 docs) to 60,000+ (259 docs) |
| Categories | 76-687 (varies by content) |
| Avg concepts per chunk | ~3-5 |
Note: Statistics vary by deployment. The preprod database with 40 documents is approximately 183 MB with 82,663 chunks and 4,787 concepts.
Schema Evolution¶
| Date | Change | ADR |
|---|---|---|
| 2024-11 | Initial two-table (catalog, chunks) | ADR-0002 |
| 2025-10-13 | Added concepts table (three-table architecture) | ADR-0009 |
| 2025-11-19 | Hash-based integer IDs | ADR-0027 |
| 2025-11-19 | Added categories table (four-table architecture) | ADR-0028 |
| 2025-11-26 | Schema normalization (redundant field removal) | ADR-0043 |
| 2025-11-28 | Added derived name fields: concept_names, category_names, catalog_titles |
- |
| 2025-11-28 | Replaced source with catalog_title in chunks (v7) |
- |
| 2025-11-28 | Removed ID mapping caches (ConceptIdCache, CatalogSourceCache, CategoryIdCache) | - |
| 2025-11-28 | Added summary field to concepts (extracted with concept) |
- |
| 2025-11-29 | Parallel concept extraction support (SharedRateLimiter, ParallelConceptExtractor) | - |
| 2025-12-07 | Added research paper metadata fields: document_type, doi, arxiv_id, venue, keywords, abstract, authors |
- |
Schema Changes (v7 - November 2025)¶
Philosophy: Derived Text Fields Replace Caches¶
The v7 schema eliminates the need for runtime ID-to-name caches by storing derived text fields directly:
| Before (v6) | After (v7) |
|---|---|
chunk.catalog_id → CatalogSourceCache.getSource() |
chunk.catalog_title (direct) |
chunk.concept_ids → ConceptIdCache.getNames() |
chunk.concept_names (direct) |
concept.catalog_ids → cache lookup |
concept.catalog_titles (direct) |
Chunks Table Changes (v7)¶
| Change | Details |
|---|---|
source |
Removed - Use catalog_title for display |
catalog_title |
Added - Document title from catalog (derived) |
concept_names |
Added - Concept names for display (derived) |
Research Paper Fields (v8)¶
New fields added for research paper support:
| Field | Type | Description |
|---|---|---|
is_reference |
boolean |
True if chunk is from references/bibliography section |
has_math |
boolean |
True if chunk contains mathematical content |
has_extraction_issues |
boolean |
True if extraction quality issues detected |
Mathematical Content Detection: - Greek letters (α, β, γ, etc.) - Mathematical symbols (∑, ∫, ∂, etc.) - Subscripts/superscripts (x₁, x²) - LaTeX commands that leaked through - Garbled math: Detects broken PDF extraction where Mathematical Alphanumeric Symbols (U+1D400-U+1D7FF) appear as Hangul syllables (U+D400-U+D7FF)
Use Cases:
- is_reference=true: Exclude from semantic search (citations not meaningful for content search)
- has_math=true: Route to specialized math-aware processing
- has_extraction_issues=true: Flag for OCR fallback or manual review
Removed: ID Mapping Caches¶
The following caches are no longer needed and have been removed:
- ConceptIdCache - Replaced by
concept_namesderived field - CatalogSourceCache - Replaced by
catalog_titlederived field - CategoryIdCache - Replaced by
category_namesderived field
Performance Caches (Still Used)¶
These performance caches remain for CPU/DB optimization:
- EmbeddingCache - Avoids recomputing embeddings
- SearchResultCache - Caches repeated search queries
Related Documentation¶
- ADR-0002: LanceDB Vector Storage
- ADR-0009: Three-Table Architecture
- ADR-0027: Hash-Based Integer IDs
- ADR-0043: Schema Normalization
- Domain Models:
src/domain/models/ - Schema Validators:
src/infrastructure/lancedb/utils/schema-validators.ts - Lexical Linking:
scripts/link_related_concepts.ts - Derived Fields Regeneration:
scripts/rebuild_derived_names.ts - Seeding Script:
hybrid_fast_seed.ts(supports--parallel Nfor parallel concept extraction)