reflect-appworld-failure
Analyze AppWorld task failures to extract specific API patterns and generate actionable playbook bullets with concrete code examples.
Introduction
The reflect-appworld-failure skill is a component of the ACE (Agentic Context Engineering) framework for autonomous agents operating in the AppWorld environment. Its purpose is to turn execution failures into persistent, reusable knowledge so that the same errors are not repeated. When an agent hits an exception, timeout, or logic failure while interacting with applications such as Spotify, Venmo, Gmail, or Calendar, the skill converts the error logs into structured, actionable guidance. It identifies the root cause (for example an incorrect API name, a missing authentication step, or improper navigation of a data structure) and records the fix as a playbook bullet that conforms to a standardized JSON schema. The agent's playbook therefore evolves as failures are analyzed, with the aim of improving success rates on subsequent tasks.
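To make the idea of a structured bullet concrete, here is a minimal sketch of what one might look like when built in Python. The skill's actual JSON schema is not shown in this description, so every field name below (`title`, `content`, `code_example`, `tags`, `evidence`, `confidence`) and the `make_bullet` helper are assumptions chosen for illustration. Only the `apis.supervisor.complete_task()` call and the three confidence levels come from the description itself.

```python
import json

# The three confidence levels named in the skill description.
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def make_bullet(task_id, title, content, code, tags, confidence):
    """Build a playbook bullet as a plain dict.

    All field names are hypothetical; the real schema may differ.
    """
    if confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(
            f"confidence must be one of {sorted(ALLOWED_CONFIDENCE)}"
        )
    return {
        "title": title,
        "content": content,
        "code_example": code,
        "tags": sorted(tags),              # sorted for stable indexing
        "evidence": {"task_id": task_id},  # which failure produced this bullet
        "confidence": confidence,
    }


bullet = make_bullet(
    task_id="task_001",  # placeholder ID, not a real AppWorld task
    title="Signal completion explicitly",
    content="Call the supervisor completion API once the task's work is done.",
    code="apis.supervisor.complete_task()",
    tags=["supervisor", "completion"],
    confidence="high",
)

# Serialize to JSON, the format the description says bullets are stored in.
print(json.dumps(bullet, indent=2))
```

Keeping the bullet a plain dict means it serializes directly with `json.dumps` and could be checked against a schema before being handed to other parts of the framework.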
- Root cause identification for common failures like API misuse, logic errors, and authentication timeouts.
- Automatic extraction of design patterns, such as mandated API sequence orders (e.g., login before search) and correct method naming conventions.
- Generation of high-quality, actionable bullets containing specific code snippets that demonstrate the correct API interaction pattern.
- Integration with the broader ACE context management system for TF-IDF based retrieval and conflict detection.
- Metadata-rich output including evidence tracking (task ID), confidence scoring, and categorical tagging for efficient indexing.
- The skill requires a structured input format, including task instructions, used applications, error messages, and failed code snippets.
- Outputs are strictly validated against a JSON schema to ensure compatibility with generator and curator workflows.
- Designed for developers and automated agents working on AppWorld task automation, it requires consistent use of apis.supervisor.complete_task() to signal finalization.
- Users should focus on identifying generalizable patterns rather than task-specific quirks to maximize the utility of generated bullets across diverse scenarios.
- Use the generated confidence level (high, medium, low) to determine whether a bullet should be automatically applied or reviewed by a human-in-the-loop.
- Effectively bridges the gap between raw execution error logs and the long-term context evolution of the agentic system.
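The structured input and the confidence-based routing described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the `FailureRecord` class, its field names, and the `route_bullet` helper are hypothetical, and the policy of auto-applying only `high` confidence bullets is one possible choice rather than a documented rule.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """Hypothetical container for the inputs the skill is said to require."""
    task_id: str
    instruction: str    # the task instructions
    apps: list          # applications used, e.g. ["spotify"]
    error_message: str  # the exception or failure text
    failed_code: str    # the code snippet that failed

    def missing_fields(self):
        """Return the names of fields that are empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


def route_bullet(bullet):
    """Decide how to handle a bullet based on its confidence level.

    Assumed policy: only 'high' is applied automatically; 'medium' and
    'low' go to a human reviewer.
    """
    confidence = bullet.get("confidence")
    if confidence not in ("high", "medium", "low"):
        raise ValueError(f"unknown confidence level: {confidence!r}")
    return "auto_apply" if confidence == "high" else "human_review"


record = FailureRecord(
    task_id="task_001",  # placeholder ID
    instruction="Example instruction text",
    apps=["spotify"],
    error_message="",    # deliberately left empty to show the check
    failed_code="apis.supervisor.complete_task()",
)

print(record.missing_fields())
print(route_bullet({"confidence": "medium"}))
```

Checking for missing inputs before reflection keeps incomplete failure reports from producing low-quality bullets, and a single routing function keeps the review policy in one place where it is easy to tighten or relax.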
Repository Stats
- Stars: 27
- Forks: 3
- Open Issues: 2
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: May 3, 2026, 05:29 PM