# ALTK Components
We summarize the components currently in ALTK in the table below.

| Lifecycle Step | Component | Problem | Description | Performance | Resources |
|---|---|---|---|---|---|
| Pre-LLM | Spotlight | Agent does not follow instructions in the prompt. | Spotlight enables users to emphasize important spans within their prompt and steers the LLM's attention towards those spans. It is an inference-time hook and does not involve any training or changes to model weights. | 5- and 40-point accuracy improvements | Paper |
| Pre-LLM | Retrieval Augmented Thinking | Agent struggles to select the best collaborator or tool to answer the query. | Retrieval Augmented Thinking gives hints to the agent regarding which collaborator(s) might be best to consult to answer a user query. | 25% improvement in accuracy when selecting the correct collaborator/tool | |
| Pre-tool | Refraction | Agent generates inconsistent tool sequences. | Verifies the syntax of tool-call sequences and repairs any errors that would result in execution failures (see sketch after the table). | 48% error correction | Demo |
| Pre-tool | SPARC | The agent calls incorrect tools (in the wrong order, redundantly, etc.) or uses incorrect or hallucinated arguments. | Evaluates tool calls before execution, identifying potential issues and suggesting corrections with reasoning for tool selection or argument values, including the corrected values. | Achieved 88% accuracy in detecting tool-calling mistakes and +15% improvement in end-to-end tool-calling agent pass^k performance across GPT-4o, GPT-4o-mini, and Mistral-Large models. | |
| Post-tool | JSON Processor | Agent gets overwhelmed with large JSON payloads in its context. | If the agent calls tools that return complex JSON objects, this component uses LLM-based Python code generation to process those responses and extract the relevant information from them (see sketch after the table). | +3 to +50 percentage-point gains observed across 15 models from various families and sizes on a dataset of 1,298 samples | Paper, Demo |
| Post-tool | Silent Review | Tool calls return subtle semantic errors that aren't handled by the agent. | A prompt-based approach to identifying silent errors in tool calls (errors that do not produce any visible or explicit error message); determines whether the tool response is relevant, accurate, and complete with respect to the user's query (see sketch after the table). | 4% improvement observed in end-to-end agent accuracy | |
| Post-tool | RAG Repair | Agent isn't able to recover from tool call failures. | Given a failing tool call, this component uses an LLM to repair the call, drawing on domain documents such as documentation or troubleshooting examples via RAG. It requires a set of related documents to ingest. | 8% improvement observed with models such as GPT-4o | Paper |
| Pre-Response | Policy Guard | Agent returns responses that violate policies or instructions. | Checks whether the agent's output adheres to the policy statement and repairs the output if it does not (see sketch after the table). | +10-point improvement in accuracy | Paper |
| Build-Time | Tool Enrichment | Tool does not have clear metadata or docstrings for the agent. | Generates tool and parameter descriptions to enhance tool calling. | +10-point improvement in correct tool invocations | Paper |
| Build-Time | Test Case Generation | The agent needs to be tested for calling the correct tool with the right arguments. | Generates user utterances to test the agent's behavior across a variety of scenarios. | Enhances test coverage, prevents runtime issues, and strengthens regression testing. | |
| Build-Time | Tool Validation | The agent's tool selections and arguments need to be validated against the generated test cases. | Invokes the agent with test utterances and identifies different types of tool-selection and argument-related errors. | In our evaluations, a major source of errors was incorrect generation of input schemas, particularly parameter type or value mismatches, observed in 13% to 19% of test cases. Based on this error taxonomy, the module provides targeted recommendations for tool repair. | Paper |
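
To make the Refraction row more concrete, here is a minimal sketch of the kind of syntax check and repair it describes: validating a planned tool-call sequence against tool specifications and fixing issues that would fail at execution time. This is not ALTK's actual API; `TOOL_SPECS`, `validate_and_repair`, and the specific repair rules are illustrative assumptions.

```python
# Illustrative sketch only -- not ALTK's actual API.
from typing import Any

# Hypothetical registry of tool signatures the agent is allowed to call.
TOOL_SPECS: dict[str, dict[str, set[str]]] = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
    "send_email": {"required": {"to", "body"}, "optional": {"subject"}},
}

def validate_and_repair(calls: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[str]]:
    """Check a planned tool-call sequence against the specs and repair
    issues that would cause execution failures."""
    repaired, issues = [], []
    for call in calls:
        name, args = call.get("name"), dict(call.get("arguments", {}))
        spec = TOOL_SPECS.get(name)
        if spec is None:
            issues.append(f"unknown tool '{name}' -- call dropped")
            continue
        # Drop hallucinated arguments the tool does not accept.
        allowed = spec["required"] | spec["optional"]
        for arg in list(args):
            if arg not in allowed:
                issues.append(f"{name}: removed unsupported argument '{arg}'")
                del args[arg]
        # Flag missing required arguments so the agent can re-plan.
        for arg in spec["required"] - args.keys():
            issues.append(f"{name}: missing required argument '{arg}'")
        repaired.append({"name": name, "arguments": args})
    return repaired, issues

calls = [{"name": "get_weather", "arguments": {"city": "Oslo", "mood": "sunny"}}]
print(validate_and_repair(calls))
```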
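
The JSON Processor row describes using LLM-generated Python code to pull only the relevant fields out of a large tool response instead of placing the full payload in the agent's context. The sketch below shows that flow under stated assumptions: `generate_extraction_code` is a stand-in for a real model call, and none of the names are ALTK's API.

```python
# Illustrative sketch only -- not ALTK's actual API.
import json

def generate_extraction_code(question: str, payload_keys: list[str]) -> str:
    """Stand-in for an LLM call that writes extraction code from the user
    question and a compact summary of the payload's structure. Here it
    simply returns a hard-coded snippet for demonstration."""
    return (
        "result = [item['name'] for item in payload['items'] "
        "if item['status'] == 'open']"
    )

def process_tool_response(question: str, raw_response: str) -> object:
    payload = json.loads(raw_response)
    code = generate_extraction_code(question, list(payload))
    # Run the generated snippet over the payload so only the extracted
    # result, not the whole JSON object, reaches the agent's context.
    scope = {"payload": payload}
    exec(code, {"__builtins__": {}}, scope)  # proper sandboxing omitted for brevity
    return scope["result"]

raw = json.dumps({"items": [
    {"name": "ticket-1", "status": "open"},
    {"name": "ticket-2", "status": "closed"},
]})
print(process_tool_response("Which tickets are still open?", raw))
```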
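
Silent Review is described as a prompt-based check of whether a tool response is relevant, accurate, and complete for the user's query. The following is a hedged sketch of such a reviewer; the `llm` stub, prompt wording, and JSON verdict format are assumptions rather than ALTK's implementation.

```python
# Illustrative sketch only -- not ALTK's actual implementation.
import json

REVIEW_PROMPT = """You are reviewing a tool call for silent errors.
User query: {query}
Tool called: {tool} with arguments {arguments}
Tool response: {response}

Judge whether the response is relevant, accurate, and complete for the
user query. Reply with JSON: {{"relevant": bool, "accurate": bool,
"complete": bool, "explanation": str}}"""

def llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed verdict here."""
    return ('{"relevant": true, "accurate": false, "complete": true, '
            '"explanation": "Response covers a different quarter than requested."}')

def silent_review(query: str, tool: str, arguments: dict, response: str) -> dict:
    verdict = json.loads(llm(REVIEW_PROMPT.format(
        query=query, tool=tool, arguments=arguments, response=response)))
    # Flag the call as a silent error if any dimension fails.
    verdict["ok"] = all(verdict[k] for k in ("relevant", "accurate", "complete"))
    return verdict

print(silent_review("Sales for Q3 2024?", "get_sales",
                    {"quarter": "Q3"}, '{"quarter": "Q2", "total": 1.2e6}'))
```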
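
Finally, Policy Guard is a check-then-repair step applied to the agent's draft response before it is returned. The loop below sketches that control flow under assumed names (`check_policy`, `repair_response`, and the `llm` stub are hypothetical; ALTK's interface may differ).

```python
# Illustrative sketch only -- names and prompts are assumptions.
def llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    if "Answer COMPLIANT or VIOLATION" in prompt:
        return "VIOLATION" if "issued your refund" in prompt else "COMPLIANT"
    return ("I cannot issue refunds myself; I have forwarded your "
            "request to a human support agent.")

def check_policy(policy: str, draft: str) -> bool:
    verdict = llm(f"Policy: {policy}\nResponse: {draft}\n"
                  "Answer COMPLIANT or VIOLATION.")
    return verdict.strip().upper().startswith("COMPLIANT")

def repair_response(policy: str, draft: str) -> str:
    return llm(f"Rewrite the response so it follows the policy.\n"
               f"Policy: {policy}\nResponse: {draft}")

def policy_guard(policy: str, draft: str, max_attempts: int = 2) -> str:
    """Return the draft unchanged if compliant, otherwise repair it."""
    for _ in range(max_attempts):
        if check_policy(policy, draft):
            return draft
        draft = repair_response(policy, draft)
    return draft

policy = "Never promise refunds; only human agents may approve them."
print(policy_guard(policy, "Sure, I have issued your refund."))
```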