kroq86/honeybadger
benchmark copilot for execution-style model evaluation
Platform-specific configuration:
```
{
  "mcpServers": {
    "honeybadger": {
      "command": "npx",
      "args": [
        "-y",
        "honeybadger"
      ]
    }
  }
}
```

Add the config above to `.claude/settings.json` under the `mcpServers` key.
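The merge step above can be scripted. This is a minimal sketch (the helper name and merge behavior are my own, not part of honeybadger) that adds the entry to `.claude/settings.json` without discarding any servers already configured there:

```python
import json
from pathlib import Path

def add_mcp_server(settings_path, name, command, args):
    """Merge one MCP server entry into settings.json, preserving existing keys.

    Illustrative helper, not shipped with honeybadger; back up the file first.
    """
    path = Path(settings_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    path.write_text(json.dumps(settings, indent=2) + "\n")

# Example:
# add_mcp_server(".claude/settings.json", "honeybadger", "npx", ["-y", "honeybadger"])
```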
vmbench is a small toolkit for testing whether language models can follow a formal virtual-machine execution model.
It gives you:
This repo started as a research workbench around AI-native execution ideas.
The strongest reusable outcome is not a new programming language inside the model. It is a practical benchmark/tooling stack for asking a narrower question:
Can a model follow formal machine semantics on synthetic execution tasks?
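To make that question concrete, here is a minimal sketch of the kind of synthetic execution task such a benchmark can pose, using a hypothetical three-instruction stack machine. The instruction set here is purely illustrative and is not vmbench's actual machine semantics:

```python
# Toy stack machine with PUSH n, ADD, MUL. Hypothetical, for illustration only.
def run(program):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(arg[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown op: {op}")
    return stack[-1]

# A benchmark item pairs a program with its ground-truth result; the model is
# scored on whether it predicts the final stack top by following the semantics.
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
print(run(program))  # (2 + 3) * 4 = 20
```

Because the reference interpreter computes the ground truth mechanically, scoring reduces to exact-match comparison against the model's predicted final state.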
Primary surface:
Secondary surface:
Optional surface:
Generate the core benchmark dataset:
```
python3 /Users/ll/honeybadger/vmbench_cli.py generate
```

Run a local baseline against the generated dataset:
```
python3 /Users/ll/honeybadger/vmbench_cli.py eval \
  --model llama3.2:latest \
  --host http://127.0.0.1:11434
```

Host contract:
`vmbench_cli.py` runs on the host, so `http://127.0.0.1:11434` is the normal local default. When the client runs inside a container, remap `localhost` and `127.0.0.1` to `http://host.docker.internal:11434` for Ollama-backed baseline runs.

Inspect the manifest, repo map, and available commands with structured JSON output:
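That remapping can be expressed as a small helper. The `/.dockerenv` probe is a common heuristic for detecting a Docker container; both the probe and the function are a sketch of the host contract, not behavior vmbench documents:

```python
import os

def resolve_ollama_host(raw, in_container=None):
    """Rewrite loopback hosts to host.docker.internal when inside a container.

    The /.dockerenv file check is a common Docker heuristic; pass in_container
    explicitly to override it. Illustrative helper, not part of vmbench.
    """
    if in_container is None:
        in_container = os.path.exists("/.dockerenv")
    if not in_container:
        return raw
    for loopback in ("127.0.0.1", "localhost"):
        raw = raw.replace(loopback, "host.docker.internal")
    return raw
```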
```
python3 /Users/ll/honeybadge
```
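As a side note, before an Ollama-backed baseline run it can help to confirm the host is actually reachable. This check uses Ollama's standard `/api/tags` endpoint (the endpoint belongs to Ollama, not to vmbench):

```python
import json
from urllib.request import urlopen

def ollama_models(host="http://127.0.0.1:11434"):
    """Return model names served by an Ollama host, or None if unreachable."""
    try:
        with urlopen(f"{host}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:
        return None
```

If this returns `None`, fix the `--host` value (or the container remapping above) before blaming the eval harness.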