Offline Analysis

So far, we have discussed how to refract a single tool call or a sequence of tool calls. While this is useful at execution time for generating immediate feedback, the output of refraction can also be used offline to improve model accuracy, or even to provide feedback to agent builders on where their agent is going wrong.

Furthermore, we can do things offline that we cannot at runtime: e.g. 1) compute new metrics based on evaluation criteria (e.g. the goodness of a sequence with respect to its stated goals); 2) identify redundant steps; and 3) generate aggregate statistics over a set of observed calls.

💡 Got ideas for more fun stuff we can measure? Open an issue.

7.1 Offline Basics

The basic API for running in offline mode is the same as at runtime. We already saw this once on the NESTFUL data here.

However, the interpretation of the output is different from runtime usage. For example, in the following situation, the runtime response would be to repair the call(s) and execute them using the corrected call API here, or to send them downstream to a reflector component as illustrated here. In offline mode, on the other hand, we need to cache this response as feedback to be used later.

  var4 = TripadvisorSearchLocation(query="London")
- var5 = TripadvisorSearchHotels(geoId="$var3.ID$", checkIn="2024-08-15", checkOut="2024-08-18")
?                                              ^

+ var5 = TripadvisorSearchHotels(geoId="$var4.geoId$", checkIn="2024-08-15", checkOut="2024-08-18")
?                                           ^ +++ ^
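
A minimal sketch of such offline feedback caching might look like the following (assuming refract is importable from the top-level package like compress is; the feedback store and what exactly gets cached are illustrative, not a fixed API):

from refraction import refract

feedback_cache = []

result = refract(sequence=sequence, catalog=..., mappings=...)

if not result.report.determination:
    # At runtime we would repair and re-execute; offline, we instead store
    # the suggested fix alongside the offending sequence for later use.
    feedback_cache.append({"sequence": sequence, "feedback": result.report})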

Multiple Fixes

Some of the restrictions of runtime usage, such as sub-second response times, do not apply in offline mode. So, even on the use cases that overlap with runtime usage, we can do more: for example, allow for multiple or additional (computationally expensive but interesting) fixes, longer sequences, larger catalogs, more mappings, and so on. 😌

We can generate multiple fixes for the same problem like so:

from refraction.batch_actions import run_all_modes

run_all_modes(
    sequence=[
        # Note: $var1.id$ is a bad reference -- memory below has artist_id, not id
        {"name": "Spotify_Scraper_Get_Artist_Overview", "arguments": {"artistId": "$var1.id$"}}
    ],
    catalog=...,
    mappings=...,
    memory_objects={
        "var1": {
            "artist_id": 12345,
        }
    },
)

Here, the reference in the assignment to the parameter artistId is messed up. The refraction call will refract the original tool call under different conditions and generate multiple possible fixes as feedback. Note the following:

  1. The call can be fixed by using the reference $var1.artist_id$ from memory instead.
  2. We can attempt to extract artistId directly from context / user utterance.
  3. We can make an extra function call (e.g. Spotify_Scraper_List_Artist_Albums_Singles) to get the required information.
  4. Some of the conditions produce the same fix.
- var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.id$")
+ var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.artist_id$")
?                                                            +++++++

- var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.id$")
+ var10 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.artist_id$")
?     +                                                       +++++++

+ ask(artistId)
- var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.id$")
?                                                       -  ^^

+ var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$artistId$")
?                                                         ^ +++

+ ask(artistId)
+ var1 = Spotify_Scraper_List_Artist_Albums_Singles(artistId="$artistId$")
  var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.id$")

+ ask(artistId)
- var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$var1.id$")
?                                                       -  ^^

+ var1 = Spotify_Scraper_Get_Artist_Overview(artistId="$artistId$")
?                                                         ^ +++

For a set of calls, we can gather these insights into a report. More on this later.

7.2 Sequence-Level Analysis

In offline mode, we have additional options to play with when deciding whether a sequence passes the refraction test, particularly in the context of the goodness of a sequence.

From the perspective of the AI planning community, a plan can take one of three forms:

  1. Sound (it is executable)
  2. Valid (it is sound + achieves whatever goals it was supposed to)
  3. Optimal (it is valid + the best of valid plans according to some metric)

The metric in (3) is "least cost" by default, but to move from (1) to either of (2) or (3), we need goal annotations. For example, a sequence of airport searches that never searches for a flight may well be sound, but it is not valid with respect to a flight-booking goal.

7.2.1 Goal Annotations for Testing

At runtime, we only use the soundness check. This is because we want to allow an agent to do what it wants to (agent-knows-best mode), especially since we are not given any additional information on what is expected of a sequence.

During offline analysis, we can allow for stricter checks by allowing for goal specifications. This is particularly useful if we already have ground truth data to check against, or if the developer is available to annotate their tests for more detailed analysis. For example, you can annotate a refraction call with goal annotations as follows:

Operator Goals

An operator goal is a tool call that must be present in the input sequence.

from nl2flow.compile.schemas import Step

result = refract(
    ...,
    goals=[Step(name="SkyScrapperFlightSearch")],
)

Memory Goals

A memory goal is a variable that must be present in memory after the input sequence completes.

from nl2flow.compile.schemas import MemoryItem

result = refract(
    ...,
    goals=[MemoryItem(item_id="flightId")],
)

Step Goals

Operator goals can also be (partially or wholly) instantiated to indicate there must be a particular tool call in the input trajectory.

result = refract(
    ...,
    memory_objects={...},
    goals=[
        Step(
            name="SkyScrapperFlightSearch",
            parameters=[
                "destinationSkyId",
                "destinationEntityId",
                "date",
            ],
            maps=[
                "$var2.skyId$",
                "$var2.entityId$",
                "2024-08-18",
            ],
        )
    ],
)

Once you have goal annotations, you can check how good your plan is with respect to those goals, beyond just the soundness check we have been doing so far.

7.2.2 Validity

By default, once you specify a goal in the input to refraction, it checks for validity.

result = refract(
    sequence=[
        {
            "name": "SkyScrapperSearchAirport",
            "arguments": {"query": "New York"},
        },
        {
            "name": "SkyScrapperSearchAirport",
            "arguments": {"query": "San Juan"},
        },
        {
            "name": "SkyScrapperFlightSearch",
            "arguments": {
                "originSkyId": "...",
                "destinationSkyId": "...",
                "originEntityId": "...",
                "destinationEntityId": "...",
                "date": "...",
            },
        },
    ],
    catalog=...,
    mappings=...,
    goals=[MemoryItem(item_id="flightId")],
)

assert result.report.determination

7.2.3 Optimality

You can also specify an optimality check, as shown below. If no goals are specified, the refractor will assume that each mentioned tool call must appear at least once.

from nl2flow.compile.options import SolutionQuality  # import path assumed

result = refract(
    tool_calls,
    tools,
    goals=[Step(name="concur")],
    report_type=SolutionQuality.OPTIMAL,
)

Consider the following tool calls. The corresponding tool specs are here. In the above call, we have required that this be an optimal sequence for the concur call.

[
    {
        "name": "w3",
        "arguments": {"email": "tchakra2@ibm.com"},
        "label": "var1",
    },
    {
        "name": "author_workbench",
        "arguments": {"id": "$var1.id$"},
        "label": "var2",
    },
    {
        "name": "hr_bot",
        "arguments": {"id": "$var1.id$", "email": "tchakra2@ibm.com"},
        "label": "var3",
    },
    {
        "name": "concur",
        "arguments": {
            "employee_info": "$var3.info$",
            "travel_justification": "$var2.papers$",
        },
        "label": "var4",
    },
]

Here, hr_bot can be called with a different optional input that allows us to save one call. In the refracted output shown below, the w3 call has been removed and the hr_bot call has been moved up. The id from this call is then reused for the call to author_workbench, which has been moved further down. The final call to concur has been adjusted accordingly.

- var1 = w3(email="tchakra2@ibm.com")
?    ^   ^^

+ var3 = hr_bot(email="tchakra2@ibm.com")
?    ^   ^^^^^^

- var2 = author_workbench(id="$var1.id$")
?                                 ^

+ var2 = author_workbench(id="$var3.id$")
?                                 ^

- var3 = hr_bot(id="$var1.id$", email="tchakra2@ibm.com")
- var4 = concur(employee_info="$var3.info$", travel_justification="$var2.papers$")
?    ^

+ var1 = concur(employee_info="$var3.info$", travel_justification="$var2.papers$")
?    ^

Unnecessary calls

In the previous example, we saw how all the steps of the original plan contributed to the goal, but the plan was suboptimal. It can also be the case that the plan has unnecessary steps that do not contribute to the goal at all.

result = refract(
    tool_calls,
    tools,
    goals=[MemoryItem(item_id="employee_info")],
    report_type=SolutionQuality.OPTIMAL,
)

The steps that do not contribute to the memory goal are removed, and the call to hr_bot is adjusted to reflect that it alone is enough to achieve this goal.

- var1 = w3(email="tchakra2@ibm.com")
?    ^   ^^

+ var3 = hr_bot(email="tchakra2@ibm.com")
?    ^   ^^^^^^

- var2 = author_workbench(id="$var1.id$")
- var3 = hr_bot(id="$var1.id$", email="tchakra2@ibm.com")
- var4 = concur(employee_info="$var3.info$", travel_justification="$var2.papers$")

Serendipity

Finally, the plan will also adjust to serendipitous resolutions of the state of the world. Suppose we have already executed hr_bot; the optimality check will then adjust the remaining calls accordingly.

result = refract(
    sequence=tool_calls,
    catalog=tools,
    memory_objects={
        "id": 213213,
        "var1": {
            "info": "...",
        },
    },
    goals=[Step(name="concur")],
    report_type=SolutionQuality.OPTIMAL,
)

The refractor now removes the calls to w3 and hr_bot, which are unnecessary given the current memory, reassigns the call to author_workbench to items already in memory, and adjusts the call to concur accordingly.

- var1 = w3(email="tchakra2@ibm.com")
- var2 = author_workbench(id="$var1.id$")
?                              -----

+ var2 = author_workbench(id="$id$")
- var3 = hr_bot(id="$var1.id$", email="tchakra2@ibm.com")
- var4 = concur(employee_info="$var3.info$", travel_justification="$var2.papers$")
?    ^                             ^

+ var1 = concur(employee_info="$var1.info$", travel_justification="$var2.papers$")
?    ^                             ^

7.3 Compression vs Refraction

One of the interesting aspects of doing post-hoc analysis is that we no longer have to follow the agent-knows-best strategy mentioned before. Instead, we can look at an entire trajectory and look for steps that need not be there. We call this compression. This is in fact one of the modes triggered internally when we ran refraction on all modes previously, but you can invoke it directly as well.

from refraction import compress

result = compress(
    sequence=sequence, catalog=..., mappings=...
)

Unlike the refraction call, which preserves and attempts to fix all tool calls in a trajectory, the compression call will attempt to remove duplicate, and possibly corrupt, calls.
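
A minimal usage sketch follows; determination is the same pass/fail field we saw on refraction reports earlier, and treating it the same way on compression results is an assumption:

result = compress(sequence=sequence, catalog=catalog, mappings=mappings)

if not result.report.determination:
    # The compressed trajectory is reported as a diff against the
    # original, as in the thrashing example below.
    print(result.report)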

Thrashing

A special case of duplicates is when an agent is stuck in a sequence of bad calls. Consider this trajectory from a ReAct agent trying to solve a NESTFUL task.

var1 = TripadvisorSearchLocation(query="Costa Rica")
var2 = TripadvisorSearchHotels(geoId="$var1.geoId$", checkIn="2024-12-01", checkOut="2024-12-15")
var3 = SkyScrapperSearchAirport(query="New York")
var4 = SkyScrapperSearchAirport(query="Costa Rica")
var5 = SkyScrapperFlightSearch(originSkyId="$var3.skyId$", destinationSkyId="$var4.skyId$", date="2024-12-01")
var6 = SkyScrapperSearchAirport(query="New York")
var7 = SkyScrapperFlightSearch(originEntityId="$var6.entityId$", destinationSkyId="$var4.skyId$", date="2024-12-01")
var8 = SkyScrapperSearchAirport(query="New York")
var9 = SkyScrapperFlightSearch(originSkyId="$var8.skyId$", destinationSkyId="$var4.skyId$", originEntityId="$var8.entityId$", date="2024-12-01")

There are several things to consider here for the compressor:

  • Step producing var4 should remain -- multiple correct tool invocations with different instantiations
  • Step producing var5 should go -- wrong parameters
  • Step producing var6 should go -- this is a correct call but unnecessarily repeated to support a failed call
  • Step producing var7 should go -- wrong parameters again, thrashing on wrong call
  • Step producing var8 should go -- back to thrashing on correct call to support a wrong call
  • Step producing var9 should go -- but either this or one of the previous wrong invocations needs to be fixed

The compression call removes all the bad calls and consolidates them into a single correct call, absent in the original trajectory, as shown below. During offline analysis, you can use this to monitor the thrashing behavior of an agent.

+ var7 = SkyScrapperSearchAirport(query="Costa Rica")
  var1 = TripadvisorSearchLocation(query="Costa Rica")
  var2 = TripadvisorSearchHotels(geoId="$var1.geoId$", checkIn="2024-12-01", checkOut="2024-12-15")
  var3 = SkyScrapperSearchAirport(query="New York")
+ var4 = SkyScrapperFlightSearch(originSkyId="$var7.skyId$", destinationSkyId="$var3.skyId$", date="2024-12-01", originEntityId="$var7.entityId$", destinationEntityId="$var3.entityId$")
- var4 = SkyScrapperSearchAirport(query="Costa Rica")
- var5 = SkyScrapperFlightSearch(originSkyId="$var3.skyId$", destinationSkyId="$var4.skyId$", date="2024-12-01")
- var6 = SkyScrapperSearchAirport(query="New York")
- var7 = SkyScrapperFlightSearch(originEntityId="$var6.entityId$", destinationSkyId="$var4.skyId$", date="2024-12-01")
- var8 = SkyScrapperSearchAirport(query="New York")
- var9 = SkyScrapperFlightSearch(originSkyId="$var8.skyId$", destinationSkyId="$var4.skyId$", originEntityId="$var8.entityId$", date="2024-12-01")
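
If you want a quick, library-independent signal of thrashing before running the full compressor, you can count repeated identical invocations in a trajectory. The following heuristic sketch is purely illustrative and is not the compressor's actual algorithm:

from collections import Counter

def repeated_calls(sequence):
    # Count identical (tool name, arguments) invocations, where each step
    # is a dict with "name" and "arguments" keys as elsewhere in this doc.
    signatures = Counter(
        (step["name"], tuple(sorted(step["arguments"].items())))
        for step in sequence
    )
    return {sig: count for sig, count in signatures.items() if count > 1}

On the trajectory above, this flags the repeated SkyScrapperSearchAirport(query="New York") calls.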

7.4 Aggregate Analysis

With all these pieces in place, we can run aggregate analysis on a bunch of trajectories and produce 1) specific feedback per sample for offline improvement of models / agents, and 2) aggregate statistics highlighting areas for improvement.

7.4.1 Error Tagging

In order to produce aggregate statistics, the first step is to tag the errors in a tool call (this is done internally, but if you ever need to tag elsewhere, you can do it like this).

Tagging a single call

from nestful import SequenceStep
from nestful.errors import tag_sequence_step

tool_call = {"name": "TripadvisorSearchHotels", "arguments": {"geoId": "$var4.locationId$", ...}}
tagged_call = tag_sequence_step(SequenceStep(**tool_call), ground_truth=..., memory={...})

To reveal the error tags, inspect tagged_call.errors. Note that some errors might appear in multiple forms -- e.g. an assignment to a missing memory item can be counted as a wrong assignment as well (but not the other way round).

[
    ErrorTag(error_type=<ErrorType.WRONG_ASSIGNMENT: 'wrong_assignment'>, info={'geoId': '$var4.locationId$'}),
    ErrorTag(error_type=<ErrorType.MADE_UP_ASSIGNMENT: 'made_up_assignment'>, info='locationId'),
    ErrorTag(error_type=<ErrorType.MISSING_MEMORY: 'missing_memory'>, info='$var4.locationId$')
]

Tagging a sequence object

You can also tag an entire sequence together, which will reveal both step-level and sequence-level errors.

from nestful import SequencingData
from nestful.errors import tag_sequence

tool_calls = [...]
tagged_sequence = tag_sequence(
    SequencingData(output=[SequenceStep(**call) for call in tool_calls]),
    ground_truth=...,
    memory={...},
    catalog=...,
)

To reveal the error tags, inspect tagged_sequence.errors.

[
    ErrorTag(error_type=<ErrorType.MADE_UP_API: 'made_up_api'>, info='Tripadvisor_Search_Hotels'),
    ErrorTag(error_type=<ErrorType.NEW_CALL: 'new_call'>, info='Tripadvisor_Search_Hotels'),
    ErrorTag(error_type=<ErrorType.WRONG_ASSIGNMENT: 'wrong_assignment'>, info={'query': 'London'}),
    ErrorTag(error_type=<ErrorType.WRONG_ASSIGNMENT: 'wrong_assignment'>, info={'query': 'New York'}),
    ErrorTag(error_type=<ErrorType.MADE_UP_ASSIGNMENT: 'made_up_assignment'>, info='geoId'),
    ErrorTag(error_type=<ErrorType.MISSING_MEMORY: 'missing_memory'>, info='$var4.geoId$'),
    ErrorTag(error_type=<ErrorType.WRONG_ASSIGNMENT: 'wrong_assignment'>, info={'query': 'London'}),
    ErrorTag(error_type=<ErrorType.WRONG_ASSIGNMENT: 'wrong_assignment'>, info={'query': 'New York'}),
    ErrorTag(error_type=<ErrorType.MADE_UP_ASSIGNMENT: 'made_up_assignment'>, info='geoId'),
    ErrorTag(error_type=<ErrorType.MISSING_MEMORY: 'missing_memory'>, info='$var4.geoId$')
]
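
To turn these tags into aggregate statistics (as in the next section), you can, for instance, tally error types across a set of tagged sequences; this sketch assumes only the errors attribute shown above:

from collections import Counter

def error_type_counts(tagged_sequences):
    # Tally ErrorType occurrences across all tagged sequences.
    counts = Counter()
    for tagged in tagged_sequences:
        counts.update(tag.error_type for tag in tagged.errors)
    return counts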

7.4.2 Report Generation

Now we have everything in place to run the offline mode in batch and generate a report about the individual and aggregate goodness of the generated trajectories. The report has multiple intended outcomes:

  1. For each trajectory, identify whether errors occurred and of what type, and whether they were fixable on the spot;
  2. Rate the quality of each trajectory per its validity/optimality and the possibility of compression;
  3. Produce aggregate pass/fail/quality statistics for the entire dataset; and
  4. Produce per-API/tool statistics for common failure modes.

Run as sequence

The default batch runner rates a trajectory as if it were generated up front. Here, data is a list of 3-tuples, each containing a sequence, its ground truth (optional), and the catalog.
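
For instance, the batch input might be assembled like this (the element names are hypothetical, and passing None for a missing ground truth is an assumption):

data = [
    (sequence_1, ground_truth_1, catalog),
    (sequence_2, None, catalog),  # ground truth is optional
]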

from refraction.batch_actions import run_all_batch

# BatchResults (the aggregate report type) is assumed to be importable alongside it.
results: BatchResults = run_all_batch(data)
Validating sequence 7/10


  var1 = NewsAPISearchByKeyWord(query="UK sports news", language="en", region="GB")
  var2 = RedditTopPostsBySubreddit(subreddit="sports", time="day")


Validating sequence 8/10


  var1 = NewsAPISearchByKeyWord(query="India politics", language="en", region="IN")
  var2 = RedditTopPostsBySubreddit(subreddit="india", time="week")


Validating sequence 9/10
Error generating step predicate: 'nlp.extract_topics'
Error generating step predicate: 'nlp.filter_posts'


  var1 = NewsAPISearchByKeyWord(query="2024 US election")
- var2 = NLP.extract_topics(text="$var1.articles$")
- var3 = RedditTopPostsBySubreddit(subreddit="news", time="day", query="$var2.topics$")
?                                                              -----------------------

+ var3 = RedditTopPostsBySubreddit(subreddit="news", time="day")
- var4 = NLP.filter_posts(posts="$var3.posts$", query="2024 US election")
+ var3 = NewsAPISearchByKeyWord(query="2024 US election")


Validating sequence 10/10

Average time taken: 3.42 sec
Success Rate: 6/10
Compression Rate: 0.11
Troubled Indices: 0, 1, 3, 8

Time taken: 34.28 secs

Run step-by-step

The above call ran post-hoc refraction on each whole sequence. You can also run batch analysis assuming the trajectory is generated and executed one step at a time. This makes a difference if the output of each step is noisy: for example, a repeated tool call might be deemed unnecessary at the trajectory level, but necessary in hindsight if a tool call failed in the middle. You can run the step-by-step mode like this:

run_all_batch(..., run_step_by_step=True)
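
For instance, you could run both modes on the same data and compare the resulting statistics; this sketch assumes only the run_all_batch API shown above:

whole_sequence_results = run_all_batch(data)
step_by_step_results = run_all_batch(data, run_step_by_step=True)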