kroq86/honeybadger
benchmark copilot for execution-style model evaluation
Platform-specific configuration:
```
{
  "mcpServers": {
    "honeybadger": {
      "command": "npx",
      "args": [
        "-y",
        "honeybadger"
      ]
    }
  }
}
```

Add the config above to `.claude/settings.json` under the `mcpServers` key.
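The merge step above can be scripted. This is a minimal sketch (the helper name and merge behavior are my own, not part of honeybadger) that adds the entry to `.claude/settings.json` without discarding any servers already configured there:

```python
import json
from pathlib import Path

def add_mcp_server(settings_path, name, command, args):
    """Merge one MCP server entry into settings.json, preserving existing keys.

    Illustrative helper, not shipped with honeybadger; back up the file first.
    """
    path = Path(settings_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings.setdefault("mcpServers", {})[name] = {"command": command, "args": args}
    path.write_text(json.dumps(settings, indent=2) + "\n")

# Example:
# add_mcp_server(".claude/settings.json", "honeybadger", "npx", ["-y", "honeybadger"])
```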
vmbench is a small toolkit for testing whether language models can follow a formal virtual-machine execution model.
It gives you:
This repo started as a research workbench around AI-native execution ideas.
The strongest reusable outcome is not a new programming language inside the model. It is a practical benchmark/tooling stack for asking a narrower question:
Can a model follow formal machine semantics on synthetic execution tasks?
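To make that question concrete, here is a minimal sketch of the kind of synthetic execution task such a benchmark can pose, using a hypothetical three-instruction stack machine. The instruction set here is purely illustrative and is not vmbench's actual machine semantics:

```python
# Toy stack machine with PUSH n, ADD, MUL. Hypothetical, for illustration only.
def run(program):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(arg[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown op: {op}")
    return stack[-1]

# A benchmark item pairs a program with its ground-truth result; the model is
# scored on whether it predicts the final stack top by following the semantics.
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
print(run(program))  # (2 + 3) * 4 = 20
```

Because the reference interpreter computes the ground truth mechanically, scoring reduces to exact-match comparison against the model's predicted final state.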
Primary surface:
Secondary surface:
Optional surface:
Generate the core benchmark dataset:
```
python3 /Users/ll/honeybadger/vmbench_cli.py generate
```

Run a local baseline against the generated dataset:
```
python3 /Users/ll/honeybadger/vmbench_cli.py eval \
  --model llama3.2:latest \
  --host http://127.0.0.1:11434
```

Host contract:
`vmbench_cli.py` runs on the host, so `http://127.0.0.1:11434` is the normal local default. When the client runs inside a container, remap `localhost` and `127.0.0.1` to `http://host.docker.internal:11434` for Ollama-backed baseline runs.

Inspect the manifest, repo map, and available commands with structured JSON output:
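That remapping can be expressed as a small helper. The `/.dockerenv` probe is a common heuristic for detecting a Docker container; both the probe and the function are a sketch of the host contract, not behavior vmbench documents:

```python
import os

def resolve_ollama_host(raw, in_container=None):
    """Rewrite loopback hosts to host.docker.internal when inside a container.

    The /.dockerenv file check is a common Docker heuristic; pass in_container
    explicitly to override it. Illustrative helper, not part of vmbench.
    """
    if in_container is None:
        in_container = os.path.exists("/.dockerenv")
    if not in_container:
        return raw
    for loopback in ("127.0.0.1", "localhost"):
        raw = raw.replace(loopback, "host.docker.internal")
    return raw
```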
```
python3 /Users/ll/honeybadge
```
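As a side note, before an Ollama-backed baseline run it can help to confirm the host is actually reachable. This check uses Ollama's standard `/api/tags` endpoint (the endpoint belongs to Ollama, not to vmbench):

```python
import json
from urllib.request import urlopen

def ollama_models(host="http://127.0.0.1:11434"):
    """Return model names served by an Ollama host, or None if unreachable."""
    try:
        with urlopen(f"{host}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:
        return None
```

If this returns `None`, fix the `--host` value (or the container remapping above) before blaming the eval harness.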