# ALTK Components
We summarize the components currently in ALTK in the table below.

| Lifecycle Step | Component | Problem | Description | Performance | Resources |
|---|---|---|---|---|---|
| Pre-LLM | Spotlight | Agent does not follow instructions in the prompt. | Spotlight enables users to emphasize important spans within their prompt and steers the LLM's attention towards those spans. It is an inference-time hook and does not involve any training or changes to model weights. | 5- and 40-point accuracy improvements | Paper |
| Pre-LLM | Retrieval Augmented Thinking | Agent struggles to select the best collaborator or tool to answer the query. | Retrieval Augmented Thinking gives hints to the agent regarding which collaborator(s) might be best to consult to answer a user query. | 25% improvement in accuracy when selecting the correct collaborator/tool | |
| Pre-tool | Refraction | Agent generates inconsistent tool sequences. | Verifies the syntax of tool-call sequences and repairs any errors that would result in execution failures (see sketch after the table). | 48% error correction | Demo |
| Pre-tool | SPARC | The agent calls incorrect tools (in the wrong order, redundantly, etc.) or uses incorrect or hallucinated arguments. | Evaluates tool calls before execution, identifying potential issues and suggesting corrections with reasoning for tool selection or argument values, including the corrected values. | Achieved 88% accuracy in detecting tool-calling mistakes and +15% improvement in end-to-end tool-calling agent pass^k performance across GPT-4o, GPT-4o-mini, and Mistral-Large models. | |
| Post-tool | JSON Processor | Agent gets overwhelmed with large JSON payloads in its context. | If the agent calls tools that return complex JSON objects, this component uses LLM-based Python code generation to process those responses and extract the relevant information from them (see sketch after the table). | +3 to +50 percentage-point gains observed across 15 models from various families and sizes on a dataset of 1,298 samples | Paper, Demo |
| Post-tool | Silent Review | Tool calls return subtle semantic errors that aren't handled by the agent. | A prompt-based approach to identifying silent errors in tool calls (errors that do not produce any visible or explicit error message); determines whether the tool response is relevant, accurate, and complete with respect to the user's query (see sketch after the table). | 4% improvement observed in end-to-end agent accuracy | |
| Post-tool | RAG Repair | Agent isn't able to recover from tool call failures. | Given a failing tool call, this component uses an LLM to repair the call, drawing on domain documents such as documentation or troubleshooting examples via RAG. It requires a set of related documents to ingest. | 8% improvement observed with models such as GPT-4o | Paper |
| Pre-Response | Policy Guard | Agent returns responses that violate policies or instructions. | Checks whether the agent's output adheres to the policy statement and repairs the output if it does not (see sketch after the table). | +10-point improvement in accuracy | Paper |
| Build-Time | Tool Enrichment | Tool does not have clear metadata or docstrings for the agent. | Generates tool and parameter descriptions to enhance tool calling. | +10-point improvement in correct tool invocations | Paper |
| Build-Time | Test Case Generation | The agent needs to be tested for calling the correct tool with the right arguments. | Generates user utterances to test the agent's behavior across a variety of scenarios. | Enhances test coverage, prevents runtime issues, and strengthens regression testing. | |
| Build-Time | Tool Validation | The agent's tool selections and arguments need to be validated against the generated test cases. | Invokes the agent with test utterances and identifies different types of tool-selection and argument-related errors. | In our evaluations, a major source of errors was incorrect generation of input schemas, particularly parameter type or value mismatches, observed in 13% to 19% of test cases. Based on this error taxonomy, the module provides targeted recommendations for tool repair. | Paper |
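
To make the Refraction row more concrete, here is a minimal sketch of the kind of syntax check and repair it describes: validating a planned tool-call sequence against tool specifications and fixing issues that would fail at execution time. This is not ALTK's actual API; `TOOL_SPECS`, `validate_and_repair`, and the specific repair rules are illustrative assumptions.

```python
# Illustrative sketch only -- not ALTK's actual API.
from typing import Any

# Hypothetical registry of tool signatures the agent is allowed to call.
TOOL_SPECS: dict[str, dict[str, set[str]]] = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
    "send_email": {"required": {"to", "body"}, "optional": {"subject"}},
}

def validate_and_repair(calls: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[str]]:
    """Check a planned tool-call sequence against the specs and repair
    issues that would cause execution failures."""
    repaired, issues = [], []
    for call in calls:
        name, args = call.get("name"), dict(call.get("arguments", {}))
        spec = TOOL_SPECS.get(name)
        if spec is None:
            issues.append(f"unknown tool '{name}' -- call dropped")
            continue
        # Drop hallucinated arguments the tool does not accept.
        allowed = spec["required"] | spec["optional"]
        for arg in list(args):
            if arg not in allowed:
                issues.append(f"{name}: removed unsupported argument '{arg}'")
                del args[arg]
        # Flag missing required arguments so the agent can re-plan.
        for arg in spec["required"] - args.keys():
            issues.append(f"{name}: missing required argument '{arg}'")
        repaired.append({"name": name, "arguments": args})
    return repaired, issues

calls = [{"name": "get_weather", "arguments": {"city": "Oslo", "mood": "sunny"}}]
print(validate_and_repair(calls))
```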
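
The JSON Processor row describes using LLM-generated Python code to pull only the relevant fields out of a large tool response instead of placing the full payload in the agent's context. The sketch below shows that flow under stated assumptions: `generate_extraction_code` is a stand-in for a real model call, and none of the names are ALTK's API.

```python
# Illustrative sketch only -- not ALTK's actual API.
import json

def generate_extraction_code(question: str, payload_keys: list[str]) -> str:
    """Stand-in for an LLM call that writes extraction code from the user
    question and a compact summary of the payload's structure. Here it
    simply returns a hard-coded snippet for demonstration."""
    return (
        "result = [item['name'] for item in payload['items'] "
        "if item['status'] == 'open']"
    )

def process_tool_response(question: str, raw_response: str) -> object:
    payload = json.loads(raw_response)
    code = generate_extraction_code(question, list(payload))
    # Run the generated snippet over the payload so only the extracted
    # result, not the whole JSON object, reaches the agent's context.
    scope = {"payload": payload}
    exec(code, {"__builtins__": {}}, scope)  # proper sandboxing omitted for brevity
    return scope["result"]

raw = json.dumps({"items": [
    {"name": "ticket-1", "status": "open"},
    {"name": "ticket-2", "status": "closed"},
]})
print(process_tool_response("Which tickets are still open?", raw))
```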
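
Silent Review is described as a prompt-based check of whether a tool response is relevant, accurate, and complete for the user's query. The following is a hedged sketch of such a reviewer; the `llm` stub, prompt wording, and JSON verdict format are assumptions rather than ALTK's implementation.

```python
# Illustrative sketch only -- not ALTK's actual implementation.
import json

REVIEW_PROMPT = """You are reviewing a tool call for silent errors.
User query: {query}
Tool called: {tool} with arguments {arguments}
Tool response: {response}

Judge whether the response is relevant, accurate, and complete for the
user query. Reply with JSON: {{"relevant": bool, "accurate": bool,
"complete": bool, "explanation": str}}"""

def llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed verdict here."""
    return ('{"relevant": true, "accurate": false, "complete": true, '
            '"explanation": "Response covers a different quarter than requested."}')

def silent_review(query: str, tool: str, arguments: dict, response: str) -> dict:
    verdict = json.loads(llm(REVIEW_PROMPT.format(
        query=query, tool=tool, arguments=arguments, response=response)))
    # Flag the call as a silent error if any dimension fails.
    verdict["ok"] = all(verdict[k] for k in ("relevant", "accurate", "complete"))
    return verdict

print(silent_review("Sales for Q3 2024?", "get_sales",
                    {"quarter": "Q3"}, '{"quarter": "Q2", "total": 1.2e6}'))
```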
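
Finally, Policy Guard is a check-then-repair step applied to the agent's draft response before it is returned. The loop below sketches that control flow under assumed names (`check_policy`, `repair_response`, and the `llm` stub are hypothetical; ALTK's interface may differ).

```python
# Illustrative sketch only -- names and prompts are assumptions.
def llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    if "Answer COMPLIANT or VIOLATION" in prompt:
        return "VIOLATION" if "issued your refund" in prompt else "COMPLIANT"
    return ("I cannot issue refunds myself; I have forwarded your "
            "request to a human support agent.")

def check_policy(policy: str, draft: str) -> bool:
    verdict = llm(f"Policy: {policy}\nResponse: {draft}\n"
                  "Answer COMPLIANT or VIOLATION.")
    return verdict.strip().upper().startswith("COMPLIANT")

def repair_response(policy: str, draft: str) -> str:
    return llm(f"Rewrite the response so it follows the policy.\n"
               f"Policy: {policy}\nResponse: {draft}")

def policy_guard(policy: str, draft: str, max_attempts: int = 2) -> str:
    """Return the draft unchanged if compliant, otherwise repair it."""
    for _ in range(max_attempts):
        if check_policy(policy, draft):
            return draft
        draft = repair_response(policy, draft)
    return draft

policy = "Never promise refunds; only human agents may approve them."
print(policy_guard(policy, "Sure, I have issued your refund."))
```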