Modular RAG Hub
Modular RAG is the architectural evolution beyond naive Retrieve-Then-Generate pipelines. Where naive RAG applies a single fixed retrieval step, Modular RAG decomposes retrieval into independently configurable, swappable components: routers, retrievers, rerankers, and critics. Each component has a defined interface; the orchestrator composes them at runtime based on query characteristics.
This hub routes to the key Modular RAG sub-patterns.
The Modular Stack
```
Query
  │
  ▼
[Router] ─── decides which retrievers to invoke and in what order
  │
  ▼
[Retrievers] ─── dense, sparse, graph, structured (parallel or sequential)
  │
  ▼
[Reranker] ─── cross-encoder or LLM-as-judge relevance scoring
  │
  ▼
[Critic / Self-RAG] ─── decides if retrieved context is sufficient
  │     └── if NOT: reformulate query, invoke retrievers again
  ▼
[Generator] ─── final LLM synthesis over validated context
```
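The stack above can be sketched as duck-typed component interfaces plus a small orchestrator loop. This is a minimal illustration, not a real framework API; the `Protocol` definitions, the callable `router`, and the placeholder query reformulation are all assumptions made for the sketch.

```python
from typing import Callable, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...


class Reranker(Protocol):
    def rerank(self, query: str, passages: list[str]) -> list[str]: ...


class Critic(Protocol):
    def sufficient(self, query: str, passages: list[str]) -> bool: ...


def answer(query: str,
           router: Callable[[str], list[Retriever]],
           reranker: Reranker,
           critic: Critic,
           generate: Callable[[str, list[str]], str],
           max_retries: int = 2) -> str:
    """Orchestrator: route -> retrieve -> rerank -> critique -> generate."""
    passages: list[str] = []
    for _ in range(max_retries + 1):
        passages = []
        for retriever in router(query):      # router picks the retrievers
            passages += retriever.retrieve(query, k=5)
        passages = reranker.rerank(query, passages)
        if critic.sufficient(query, passages):
            break
        query = f"rephrased: {query}"        # placeholder reformulation
    return generate(query, passages)
```

Because every component is behind an interface, swapping a cross-encoder reranker for an LLM-as-judge, or adding a graph retriever, changes the wiring but not the orchestrator.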
Adaptive Routing
Adaptive Routing is the query classification step that selects the retrieval strategy before any retrieval occurs. The router prevents every query from hitting every retriever — a naive pattern that is expensive, slow, and produces noisy merged results.
Router Types
| Router | Decision Basis | Use Case |
|---|---|---|
| Rule-based | Keyword or regex triggers | Simple domain dispatch |
| Classifier | Fine-tuned embedding or LLM classifier | Structured taxonomy of query types |
| LLM-as-Router | Zero-shot LLM judgment with schema output | Flexible, extensible — highest latency |
| Confidence-based | Route based on retriever confidence scores from the previous step | Fallback cascades |
Routing Targets
A router typically dispatches to:
- Dense retrieval (vector similarity) for semantic / natural-language questions
- Sparse retrieval (BM25/keyword) for exact terminology, proper nouns, code identifiers
- Graph traversal (GraphRAG) for relational or multi-hop questions
- Structured query (SQL/SPARQL) for tabular or ontology-anchored data
- No retrieval (direct LLM generation) for simple factual or reasoning-only questions
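A rule-based router (the first row of the table above) can be as simple as an ordered list of regex triggers with a dense-retrieval default. This is an illustrative sketch; the trigger patterns and strategy names are assumptions, not a standard taxonomy.

```python
import re

# Ordered (pattern, strategy) pairs: first match wins.
# Patterns and strategy names are illustrative placeholders.
ROUTES = [
    # Aggregation phrasing suggests tabular/structured data
    (re.compile(r"\b(how many|count|average|sum)\b", re.I), "structured"),
    # Quoted phrases, acronyms, or code identifiers favor keyword search
    (re.compile(r'"[^"]+"|\b[A-Z]{2,}\b|`[^`]+`'), "sparse"),
    # Relational phrasing favors graph traversal
    (re.compile(r"\b(related to|connected|cites?|depends on)\b", re.I), "graph"),
]


def route(query: str) -> str:
    for pattern, strategy in ROUTES:
        if pattern.search(query):
            return strategy
    return "dense"  # default: semantic vector search
```

In practice a rule-based router like this often sits in front of an LLM-as-Router, handling the cheap unambiguous cases and deferring only genuinely ambiguous queries to the slower zero-shot judgment.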
Self-RAG
Self-RAG (Self-Reflective RAG) adds a critique loop after retrieval: the model itself decides whether the retrieved documents are relevant enough to answer the query, and whether its generated answer is actually supported by the retrieved context. If either check fails, the loop iterates.
Self-RAG Decision Points
Self-RAG inserts classifier tokens (or structured LLM judgment calls) at three points:
| Decision Point | Token / Output | Meaning |
|---|---|---|
| Pre-generation | [Retrieve] / [No Retrieve] | Is retrieval needed for this query? |
| Post-retrieval | [Relevant] / [Irrelevant] per passage | Which retrieved chunks are actually useful? |
| Post-generation | [Supported] / [Partially] / [Contradiction] | Is the answer grounded in the context? |
On [Irrelevant] or [Contradiction], the loop reformulates the query (query rewriting or query expansion) and retrieves again, up to a configured max-retry limit.
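The retrieve-critique-retry loop can be sketched as follows. The `judge_*` callables are stand-ins for the model's classifier-token decisions, and `rewrite` is a stand-in for query reformulation; all names here are hypothetical.

```python
def self_rag(query, retrieve, judge_relevance, generate, judge_support,
             rewrite, max_retries=2):
    """Self-RAG loop: retrieve, drop [Irrelevant] passages, generate,
    and retry with a rewritten query if the answer is not [Supported]."""
    for _ in range(max_retries + 1):
        passages = retrieve(query)
        # Post-retrieval check: keep only passages judged [Relevant]
        relevant = [p for p in passages if judge_relevance(query, p)]
        if not relevant:
            query = rewrite(query)   # all [Irrelevant]: reformulate, retry
            continue
        answer = generate(query, relevant)
        # Post-generation check: [Supported] vs [Contradiction]
        if judge_support(answer, relevant):
            return answer
        query = rewrite(query)
    return None  # exhausted retries without a supported answer
```

The pre-generation [Retrieve] / [No Retrieve] decision is omitted here for brevity; it would simply short-circuit the loop and generate directly when retrieval is judged unnecessary.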
Self-RAG vs. Agentic RAG
| Dimension | [[agentic-rag | Agentic RAG]] | Self-RAG |
|---|---|---|
| Decision maker | External agent with tool calls | Internal model with classifier tokens |
| Loop control | Agent orchestrator | Token-conditioned generation |
| Latency | Higher (separate agent hops) | Lower (single model pass) |
| Flexibility | High — can call arbitrary tools | Limited to retrieval reformulation |
Self-RAG is better for cost/latency-sensitive pipelines; Agentic RAG is better when the failure modes require calling diverse tool types.
Hyperbolic Embeddings
Standard vector embeddings (Euclidean space) are inefficient for hierarchical data — the volume of Euclidean space doesn't grow fast enough to preserve tree distances without distortion. Hyperbolic embeddings map data into a Poincaré disk (negatively curved space) where the available volume grows exponentially with radius, mirroring the exponential branching of hierarchies.
When to use:
- Document collections with strong categorical / ontology structure (medical codes, legal taxonomies, product catalogs)
- Knowledge graphs where entity relationships are predominantly is-a or part-of
- Multi-level topic hierarchies where leaf nodes should be far from root but close to siblings
Practical state (May 2026): Libraries like geoopt (PyTorch) and geomstats support Poincaré disk models. Performance advantages over Euclidean are strongest at low embedding dimensions (32–64d). At 256d+, Euclidean models close the gap. Start with standard dense embeddings; reach for hyperbolic when hierarchical distortion is measurably hurting retrieval quality.
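The Poincaré metric can be computed directly from its closed form, which makes the "exponential volume" claim concrete: distances blow up as points approach the boundary of the unit ball. This is a plain-Python sketch of the standard formula, not the geoopt API.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between points u, v inside the unit Poincaré ball:
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))"""
    sq_norm = lambda x: sum(xi * xi for xi in x)
    diff = sq_norm([ui - vi for ui, vi in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))
```

Two points separated by a fixed Euclidean gap are hyperbolically much farther apart near the boundary (radius → 1) than near the origin. That is exactly the room a tree needs: siblings deep in the hierarchy land near the boundary, far from the root and from other branches, with little distortion.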
GraphRAG + Agentic Planner Hybrid
GraphRAG builds an entity-relationship graph over the corpus and answers queries by traversing the graph rather than (only) retrieving chunks. Combining it with an agentic planner produces a Graph-Agent Hybrid:
```
Planner Agent
  │
  ├── Sub-task: entity lookup → Graph traversal → entity subgraph
  ├── Sub-task: relationship reasoning → Graph path query → relationship chain
  └── Sub-task: supporting evidence → Dense retrieval over chunk store
        │
        ▼
Synthesizer Agent
  └── Merges graph subgraph + chunk evidence → final answer
```
Key advantage over pure GraphRAG: The planner can mix graph and dense retrieval strategies within a single answer — using the graph for relational reasoning and chunk retrieval for supporting quotations. Pure GraphRAG forces all reasoning through the graph, which is slow for simple semantic queries.
Key advantage over pure Agentic RAG: The graph's pre-computed structure lets the planner issue precise relational queries (e.g., "all papers that cite X and also cite Y") that would require multiple uncertain LLM steps in a purely chunk-based retrieval system.
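A toy version of the "cite X and also cite Y" example makes the division of labor visible: the relational sub-task is an exact set operation over the pre-built graph, and chunk retrieval only supplies supporting text for the entities the graph already identified. The citation graph, chunk store, and function names below are illustrative stand-ins, not a real GraphRAG implementation.

```python
# Paper -> set of papers it cites (pre-computed graph structure)
cites = {
    "P1": {"X", "Y"},
    "P2": {"X"},
    "P3": {"X", "Y", "Z"},
}

# Chunk store standing in for the dense-retrieval corpus
chunks = {
    "P1": "P1 extends X with Y's method.",
    "P3": "P3 benchmarks X against Y.",
}


def cites_both(graph, a, b):
    """Precise relational sub-task: papers citing both a and b.
    One exact graph query, no LLM steps needed."""
    return sorted(p for p, refs in graph.items() if a in refs and b in refs)


def supporting_evidence(store, papers):
    """Dense-retrieval stand-in: fetch supporting text for the
    entities the graph step already pinned down."""
    return [store[p] for p in papers if p in store]


papers = cites_both(cites, "X", "Y")            # graph traversal step
evidence = supporting_evidence(chunks, papers)  # chunk retrieval step
```

In a chunk-only system the same question would need the LLM to infer citation relationships from retrieved text, with one uncertain step per hop; here the graph answers the relational part exactly and retrieval is demoted to fetching quotations.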
Where to Start
- Dense + Sparse only, small corpus → agentic-rag patterns; llamaindex QueryEngineTool
- Large, heterogeneous corpus → Add Adaptive Routing to route by query type
- Factual accuracy is critical → Add Self-RAG critique loop
- Hierarchical knowledge (taxonomies, ontologies) → Evaluate Hyperbolic Embeddings
- Relational multi-hop questions → GraphRAG + planner hybrid