RAM-Backed MCP Memory Architecture for Consumer LLM Inference — 900K token context on 16GB VRAM
Platform-specific configuration: add the following to `.claude/settings.json` under the `mcpServers` key.

```json
{
  "mcpServers": {
    "angruvadal": {
      "command": "npx",
      "args": ["-y", "angruvadal"]
    }
  }
}
```
RAM-Backed MCP Memory Architecture for Consumer LLM Inference
*Codename: Angruvadal*
> Sub-10ms retrieval. 90% RAG accuracy. 100% tool compliance. ~200 lines of code.
---
Angruvadal is a working implementation of two complementary ideas:
1. **RAM as first-class LLM memory (proven today).** A FastAPI MCP server backed by 192GB of DDR5. Any llama.cpp-served model calls `context_retrieve` as a tool; the server performs semantic search and returns relevant context in under 10ms. The model never hits a context limit: it simply calls a tool when it needs to remember something.
2. **RotorQuant KV compression (in progress).** 3-bit KV-cache compression at a 3.5× ratio, with Triton kernels confirmed working on AMD RDNA4 (gfx1201). Once integrated into llama.cpp, it extends the in-VRAM context window from ~52K to ~192K tokens. Combined with the MCP layer, this gives effectively unlimited context on consumer hardware.
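The RAM-resident retrieval in (1) can be sketched in a few lines of Python: keep chunk embeddings in an in-memory matrix and answer each `context_retrieve` call with a top-k cosine scan. Everything below (the toy hashed bag-of-words `embed`, the `RamStore` class) is illustrative, not the project's actual code.

```python
import re
import numpy as np

DIM = 384  # embedding width (illustrative)

def embed(text: str) -> np.ndarray:
    """Stand-in for a real sentence embedder: hashed bag-of-words, L2-normalized."""
    v = np.zeros(DIM)
    for tok in re.findall(r"\w+", text.lower()):
        v[hash(tok) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class RamStore:
    """Chunks and embeddings live entirely in process RAM (plain NumPy arrays)."""
    def __init__(self):
        self.chunks: list[str] = []
        self.matrix = np.empty((0, DIM))

    def add(self, text: str) -> None:
        self.chunks.append(text)
        self.matrix = np.vstack([self.matrix, embed(text)])

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # One dense matvec over every chunk: an O(n) scan, no index to maintain.
        scores = self.matrix @ embed(query)
        top = np.argsort(scores)[::-1][:k]
        return [self.chunks[i] for i in top]

store = RamStore()
store.add("The deploy key lives in ops/keys/deploy.pem")
store.add("Standup is at 9:30 on Tuesdays")
print(store.retrieve("where is the deploy key?", k=1))
```

The query shares tokens with the first chunk, so that chunk comes back. In the real server this scan sits behind a FastAPI endpoint exposed over MCP; at 1.82 KB per chunk, 192GB of DDR5 holds on the order of 105M chunks, and a dense O(n) scan is what gives the linear, cliff-free scaling in the benchmarks below.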
---
MCP memory server benchmarks:

| Metric | Value |
|--------|-------|
| Retrieve p50 @ 1K chunks | 9.0ms |
| Retrieve p50 @ 5K chunks | 16.6ms |
| Sequential throughput | 62.5 queries/sec |
| Memory per chunk | 1.82 KB (→ 105M chunks in 192GB) |
| RAG accuracy (10 QA pairs) | 90% (9/10) |
| Tool call compliance | 100% (10/10) |
| MCP overhead in E2E latency | <0.2% |
| Scaling behavior | Linear O(n), no cliffs to 25K+ chunks |
Model serving (GPT-OSS 20B):

| Metric | Value |
|--------|-------|
| Token generation | 134 tok/s |
| Prompt processing | 3,574 tok/s |
| VRAM | 12.5GB / 16GB |
| Context window | 128K (YaRN RoPE) |
| Load time | 3.4s |
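The 3-bit compression behind RotorQuant can be illustrated with a minimal symmetric per-group quantizer in NumPy. The group size, the fp16 scales, and the [-3, 3] code range here are assumptions made for the sketch, not RotorQuant's actual layout (which targets the 3.5× end-to-end ratio with Triton kernels).

```python
import numpy as np

GROUP = 32  # values per quantization group (assumed; not RotorQuant's real layout)

def quantize_3bit(x: np.ndarray):
    """Symmetric per-group quantization: 3-bit codes in [-3, 3] plus one fp16 scale per group."""
    g = x.reshape(-1, GROUP)
    scale = np.abs(g).max(axis=1, keepdims=True) / 3.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero groups
    codes = np.clip(np.round(g / scale), -3, 3).astype(np.int8)
    return codes, scale.astype(np.float16)

def dequantize_3bit(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

kv = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
codes, scale = quantize_3bit(kv)
kv_hat = dequantize_3bit(codes, scale)

# Storage cost: 3 bits/code + 16 bits of scale amortized over 32 values
# = 3.5 bits/value, vs. 16 bits/value for an fp16 KV cache.
print(np.abs(kv - kv_hat).mean())
```

A per-value budget in this range is what lets a ~52K-token in-VRAM KV cache stretch toward ~192K tokens; the real kernels additionally have to pack the 3-bit codes and run the dequantization on-GPU during attention.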
---
```
User prompt
  ↓
GPT-OSS 20B — GPU (16GB VRAM, 134 tok/s)
  ↓ tool call: context_retrieve
Angruvadal MCP — RAM (192GB DDR5, 9ms)
  ↓ semantic search → relevant chunks returned
  ↓ model incorporates context, answers
Response
```

The model decides when to retrieve context.