
Scaling Characteristics


While introducing the concept of refraction, we noted that one of the key considerations is response time. To reiterate: the cost of an edit call must be insignificant compared to the cost of a reasoning call to an LLM. Otherwise, we might as well reason again by reflection.
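To make the comparison concrete, here is a minimal timing sketch. The `edit_call` and `llm_reflection_call` functions below are hypothetical placeholders, not the actual refraction API; substitute the real calls when reproducing this measurement.

```python
import time

# Placeholder stand-ins for the real calls (assumptions, not the actual API).
def edit_call(sequence):
    return sequence  # a fast, local symbolic check

def llm_reflection_call(sequence):
    time.sleep(0.005)  # stands in for network + generation latency
    return sequence

def overhead_ratio(sequence, n_trials=5):
    """Edit-call latency as a fraction of LLM-call latency."""
    def mean_time(fn):
        start = time.perf_counter()
        for _ in range(n_trials):
            fn(sequence)
        return (time.perf_counter() - start) / n_trials
    return mean_time(edit_call) / mean_time(llm_reflection_call)

# The edit call is only worthwhile when this ratio is well below 1.
print(overhead_ratio(["step_1", "step_2"]) < 1.0)
```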

In the following sections, we will explore how the refraction API scales with different aspects of a sequence of API calls. These results are from running the debugger on the NESTFUL dataset, as described in more detail here.

5.1 Counting Seconds

⚠️ The times reported here should be interpreted only in relative terms. Absolute times, of course, depend on the infrastructure running the code.

5.2 Scaling with Number of Tokens

As we discussed previously here, the debugger decomposes the input sequence into a set of possible tokens to enforce. Hence, the complexity of the optimization step increases with both the number of steps and the number of parameters in each step. However, based on the results below, the impact of this on the time to debug appears negligible compared to other factors.
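The following sketch illustrates how the token count might grow with steps and parameters. The sequence format here (a list of steps, each with a name and a parameter dict) is an assumption for illustration, not the debugger's actual representation:

```python
# Assumed representation: one token per step name plus one per parameter
# assignment, so the count grows with both steps and parameters per step.
def count_tokens(sequence):
    return sum(1 + len(step["parameters"]) for step in sequence)

sequence = [
    {"name": "search_flights", "parameters": {"origin": "JFK", "dest": "SFO"}},
    {"name": "book_flight", "parameters": {"flight_id": "F123"}},
]
print(count_tokens(sequence))  # 5
```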

5.2.1 Time taken per instance

Average time: 0.678 secs

[Figure: time taken per instance]

5.2.2 Time taken per length of sequence

[Figure: time taken per length of sequence]

5.2.3 Time taken per length of sequence x parameters

[Figure: time taken per length of sequence x parameters]

5.2.4 Time taken with edit distance

Refraction on a set of tokens is likely to be faster if there are fewer edits to make. The intuition here is similar to how planning problems with longer solution plans are, more often than not, harder to solve. Hence, running the debugger just to get a YES/NO signal (e.g. to send back for reflection) is likely to be faster than generating all the edits.
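A toy sketch of why the YES/NO check can be cheaper: validity can short-circuit at the first violation, while producing the full edit list must examine every problem in the sequence. The validity criterion used here (membership in a set of known APIs) is a deliberately simplified stand-in for the debugger's actual checks:

```python
# Hypothetical, simplified validity criterion: a step is valid iff its API
# name is in the catalog. The real debugger enforces much richer constraints.
def find_edits(sequence, known_apis):
    """Return every step that would need an edit (full scan)."""
    return [step for step in sequence if step not in known_apis]

def is_valid(sequence, known_apis):
    """Short-circuits on the first violation (the YES/NO signal)."""
    return all(step in known_apis for step in sequence)

known = {"search", "filter", "book"}
broken = ["search", "fly", "teleport"]
print(is_valid(broken, known))    # False
print(find_edits(broken, known))  # ['fly', 'teleport']
```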

5.3 Scaling with the underlying facts

Apart from the input sequence itself, the other input to the debugger is the set of API specs or signatures. This affects scaling in two possible ways: 1) the size of the catalog, i.e. the number of APIs; and 2) how coupled the APIs are, i.e. the number of possible mappings between them. From a scaling point of view, these inputs seem to dominate the token size.
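To illustrate what "coupling" means here, the sketch below counts producer-consumer mappings in a toy catalog. The spec format (named inputs and a single output type per API) is an assumption made for this example:

```python
from itertools import permutations

# Hypothetical API specs: each API consumes named input types and
# produces one output type. Not the debugger's actual spec format.
specs = {
    "get_user":   {"inputs": [],             "output": "user_id"},
    "get_orders": {"inputs": ["user_id"],    "output": "order_list"},
    "summarize":  {"inputs": ["order_list"], "output": "text"},
}

def count_mappings(specs):
    """Count producer -> consumer pairs where an output can feed an input."""
    return sum(
        1
        for a, b in permutations(specs, 2)
        if specs[a]["output"] in specs[b]["inputs"]
    )

print(count_mappings(specs))  # 2
```

The more such pairs a catalog contains, the more possible mappings the debugger must reason over.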

5.3.1 Scaling with the Size of the Catalog (with no mappings)

[Figure: scaling with the size of the catalog, with no mappings]

5.3.2 Scaling with the Number of Maps (with all APIs)

[Figure: scaling with the number of maps, with all APIs]

5.4 Impact of Recovery Patterns on Scaling

Finally, the more recovery patterns the debugger has to reason about, the more time it takes. Chief among these is the presence of defensive actions, because they effectively double the number of edits (due to confirmations on newly introduced items in the sequence).
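A rough cost model of that doubling effect, under the assumption stated above that each item newly introduced by a defensive action also requires a confirmation edit:

```python
# Assumed cost model, not measured: with defensive actions enabled, every
# newly introduced item carries an extra confirmation edit.
def edit_count(base_edits, introduced_items, defensive=False):
    confirmations = introduced_items if defensive else 0
    return base_edits + confirmations

print(edit_count(base_edits=3, introduced_items=3, defensive=False))  # 3
print(edit_count(base_edits=3, introduced_items=3, defensive=True))   # 6
```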

5.4.1 Defensive Actions

|                    | Default    | With defensive actions |
|--------------------|------------|------------------------|
| Average time taken | 0.678 secs | 0.699 secs             |

Of course, since we are currently running on the ground truth, this effect is not very pronounced, as there are hardly any edits to make. We will find out more in future experiments.