RAM-Backed MCP Memory Architecture for Consumer LLM Inference — 900K token context on 16GB VRAM
Platform-specific configuration: add the following to `.claude/settings.json` under the `mcpServers` key.

```json
{
  "mcpServers": {
    "angruvadal": {
      "command": "npx",
      "args": ["-y", "angruvadal"]
    }
  }
}
```
RAM-Backed MCP Memory Architecture for Consumer LLM Inference
*Codename: Angruvadal*
> Sub-10ms retrieval. 90% RAG accuracy. 100% tool compliance. ~200 lines of code.
---
Angruvadal is a working implementation of two complementary ideas:
1. **RAM as first-class LLM memory (proven today).** A FastAPI MCP server backed by 192GB of DDR5. Any llama.cpp-served model calls `context_retrieve` as a tool; the server performs semantic search and returns relevant context in under 10ms. The model never hits a context limit: it simply calls a tool when it needs to remember something.
2. **RotorQuant KV compression (in progress).** 3-bit KV-cache compression at a 3.5× ratio, with Triton kernels confirmed working on AMD RDNA4 (gfx1201). Once integrated into llama.cpp, it extends the in-VRAM context window from ~52K to ~192K tokens. Combined with the MCP layer, this gives effectively unlimited context on consumer hardware.
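The RAM-resident retrieval in (1) can be sketched in a few lines of Python: keep chunk embeddings in an in-memory matrix and answer each `context_retrieve` call with a top-k cosine scan. Everything below (the toy hashed bag-of-words `embed`, the `RamStore` class) is illustrative, not the project's actual code.

```python
import re
import numpy as np

DIM = 384  # embedding width (illustrative)

def embed(text: str) -> np.ndarray:
    """Stand-in for a real sentence embedder: hashed bag-of-words, L2-normalized."""
    v = np.zeros(DIM)
    for tok in re.findall(r"\w+", text.lower()):
        v[hash(tok) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class RamStore:
    """Chunks and embeddings live entirely in process RAM (plain NumPy arrays)."""
    def __init__(self):
        self.chunks: list[str] = []
        self.matrix = np.empty((0, DIM))

    def add(self, text: str) -> None:
        self.chunks.append(text)
        self.matrix = np.vstack([self.matrix, embed(text)])

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # One dense matvec over every chunk: an O(n) scan, no index to maintain.
        scores = self.matrix @ embed(query)
        top = np.argsort(scores)[::-1][:k]
        return [self.chunks[i] for i in top]

store = RamStore()
store.add("The deploy key lives in ops/keys/deploy.pem")
store.add("Standup is at 9:30 on Tuesdays")
print(store.retrieve("where is the deploy key?", k=1))
```

The query shares tokens with the first chunk, so that chunk comes back. In the real server this scan sits behind a FastAPI endpoint exposed over MCP; at 1.82 KB per chunk, 192GB of DDR5 holds on the order of 105M chunks, and a dense O(n) scan is what gives the linear, cliff-free scaling in the benchmarks below.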
---
MCP memory server benchmarks:

| Metric | Value |
|--------|-------|
| Retrieve p50 @ 1K chunks | 9.0ms |
| Retrieve p50 @ 5K chunks | 16.6ms |
| Sequential throughput | 62.5 queries/sec |
| Memory per chunk | 1.82 KB (→ 105M chunks in 192GB) |
| RAG accuracy (10 QA pairs) | 90% (9/10) |
| Tool call compliance | 100% (10/10) |
| MCP overhead in E2E latency | <0.2% |
| Scaling behavior | Linear O(n), no cliffs to 25K+ chunks |
Model serving (GPT-OSS 20B):

| Metric | Value |
|--------|-------|
| Token generation | 134 tok/s |
| Prompt processing | 3,574 tok/s |
| VRAM | 12.5GB / 16GB |
| Context window | 128K (YaRN RoPE) |
| Load time | 3.4s |
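The 3-bit compression behind RotorQuant can be illustrated with a minimal symmetric per-group quantizer in NumPy. The group size, the fp16 scales, and the [-3, 3] code range here are assumptions made for the sketch, not RotorQuant's actual layout (which targets the 3.5× end-to-end ratio with Triton kernels).

```python
import numpy as np

GROUP = 32  # values per quantization group (assumed; not RotorQuant's real layout)

def quantize_3bit(x: np.ndarray):
    """Symmetric per-group quantization: 3-bit codes in [-3, 3] plus one fp16 scale per group."""
    g = x.reshape(-1, GROUP)
    scale = np.abs(g).max(axis=1, keepdims=True) / 3.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero groups
    codes = np.clip(np.round(g / scale), -3, 3).astype(np.int8)
    return codes, scale.astype(np.float16)

def dequantize_3bit(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

kv = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
codes, scale = quantize_3bit(kv)
kv_hat = dequantize_3bit(codes, scale)

# Storage cost: 3 bits/code + 16 bits of scale amortized over 32 values
# = 3.5 bits/value, vs. 16 bits/value for an fp16 KV cache.
print(np.abs(kv - kv_hat).mean())
```

A per-value budget in this range is what lets a ~52K-token in-VRAM KV cache stretch toward ~192K tokens; the real kernels additionally have to pack the 3-bit codes and run the dequantization on-GPU during attention.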
---
```
User prompt
  ↓
GPT-OSS 20B — GPU (16GB VRAM, 134 tok/s)
  ↓ tool call: context_retrieve
Angruvadal MCP — RAM (192GB DDR5, 9ms)
  ↓ semantic search → relevant chunks returned
  ↓ model incorporates context, answers
Response
```

The model decides when to retrieve context.