
glyphbox

A framework for an LLM-based NetHack agent that dynamically synthesizes Python code in a secure sandbox to perform complex dungeon exploration and gameplay actions via a high-level API.

Introduction

Glyphbox provides a sandbox-restricted environment in which large language models interact with the NetHack 3.6.7 simulator through the NetHack Learning Environment (NLE). By exposing a high-level Python API, the system lets LLMs reason about game state, manage inventories, perform pathfinding, and execute combat strategies through code synthesis. Designed for AI research in reinforcement learning and agentic behavior, it bridges the gap between raw ASCII screen input and high-level behavioral planning. Users can deploy the agent to navigate complex dungeon layouts, handle diverse combat encounters, and optimize resource management, all within a secure execution boundary that prevents unauthorized system access.

  • Enables LLMs to issue Python-based tool calls for real-time game interactions with the NetHackAPI.

  • Implements a hardened execution sandbox with strict AST validation and forbidden-call filtering to ensure secure code evaluation.

  • Provides a modular skill system for defining reusable behaviors like exploration, melee combat, and resource consumption as asynchronous Python functions.

  • Integrates with modern LLM providers including OpenAI and Anthropic to support sophisticated decision-making and strategic planning in stochastic environments.

  • Offers diagnostic tools such as TUI visualization, session recording via asciinema, and log analysis scripts for performance evaluation and debugging.

  • Requires an API key for the chosen LLM provider; Python dependencies are installed via uv.

  • Inputs consist of raw 24x80 ASCII screen frames, message buffers, and internal game statistics, which the agent processes to output actionable Python code.

  • Developers can define new agent skills by following the template in SKILL.md, ensuring they return specific SkillResult objects with stop reasons and success metrics.

  • Performance depends heavily on the model's ability to reason about stateful API calls; models with strong code-generation capabilities, such as Claude 3 or GPT-4, are recommended.

  • The execution loop is subject to strict signal-based timeouts, and all imports are forbidden, requiring agents to rely on provided globals like Direction, Position, and the NetHackAPI instance.
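
The AST-validation and forbidden-call filtering described above can be sketched with the standard-library `ast` module: walk the parsed tree, reject any import statement, and check direct calls against a denylist. The specific forbidden-name set here is an assumption for illustration; glyphbox's actual filter may differ.

```python
import ast

FORBIDDEN_CALLS = {"eval", "exec", "open", "__import__", "compile"}

def validate(source: str) -> list[str]:
    """Return a list of violations found in the source (empty = OK)."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import statements are forbidden")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in FORBIDDEN_CALLS:
                violations.append(f"forbidden call: {func.id}")
    return violations

assert validate("import os") == ["import statements are forbidden"]
assert validate("pos = api.player_position()") == []  # hypothetical API call passes
```

A denylist like this is only one layer; pairing it with restricted globals and execution timeouts (as the bullets above describe) closes off indirect escape routes that pure AST checks miss.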

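The skill contract described above (asynchronous functions returning a result with a stop reason and success flag) might look like the sketch below. The `SkillResult` field names and the `FakeAPI` stub are assumptions for illustration; the authoritative template lives in SKILL.md.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class SkillResult:
    success: bool
    stop_reason: str
    metrics: dict = field(default_factory=dict)

class FakeAPI:
    """Stand-in for the NetHackAPI instance the sandbox provides."""
    def __init__(self):
        self.enemy_hp = 3

    async def attack_adjacent(self) -> bool:
        self.enemy_hp -= 1            # toy combat: enemy loses 1 HP per attack
        return self.enemy_hp <= 0     # True once the enemy is defeated

async def melee_skill(api: FakeAPI, max_turns: int = 10) -> SkillResult:
    for turn in range(1, max_turns + 1):
        if await api.attack_adjacent():
            return SkillResult(True, "enemy_defeated", {"turns": turn})
    return SkillResult(False, "turn_limit", {"turns": max_turns})

result = asyncio.run(melee_skill(FakeAPI()))
assert result.success and result.stop_reason == "enemy_defeated"
```

Returning an explicit stop reason rather than raising lets the calling agent distinguish "goal achieved" from "budget exhausted" and plan its next tool call accordingly.
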
Repository Stats

  • Stars: 8
  • Forks: 1
  • Open Issues: 0
  • Language: Python
  • Default Branch: main