
Scaling Characteristics


While introducing the concept of refraction, we noted that one of the key considerations is response time. To reiterate: the cost of an edit call must be insignificant compared to the cost of a reasoning call to an LLM. Otherwise, we might as well reason again by reflection.
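To make the comparison concrete, here is a minimal timing sketch. The `edit_call` and `llm_reflection_call` functions below are hypothetical placeholders, not the actual refraction API; substitute the real calls when reproducing this measurement.

```python
import time

# Placeholder stand-ins for the real calls (assumptions, not the actual API).
def edit_call(sequence):
    return sequence  # a fast, local symbolic check

def llm_reflection_call(sequence):
    time.sleep(0.005)  # stands in for network + generation latency
    return sequence

def overhead_ratio(sequence, n_trials=5):
    """Edit-call latency as a fraction of LLM-call latency."""
    def mean_time(fn):
        start = time.perf_counter()
        for _ in range(n_trials):
            fn(sequence)
        return (time.perf_counter() - start) / n_trials
    return mean_time(edit_call) / mean_time(llm_reflection_call)

# The edit call is only worthwhile when this ratio is well below 1.
print(overhead_ratio(["step_1", "step_2"]) < 1.0)
```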

In the following sections, we will explore how the refraction API scales with different aspects of a sequence of API calls. These results are from running the debugger on the NESTFUL dataset, as described in more detail here.

5.1 Counting Seconds

⚠️ The times reported here should be interpreted only in relative terms. Absolute times, of course, depend on the infrastructure running the code.

5.2 Scaling with Number of Tokens

As we discussed previously here, the debugger decomposes the input sequence into a set of possible tokens to enforce. Hence, the complexity of the optimization step increases with both the number of steps and the number of parameters in each step. However, based on the results below, the impact of this on the time to debug appears negligible compared to other factors.
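The following sketch illustrates how the token count might grow with steps and parameters. The sequence format here (a list of steps, each with a name and a parameter dict) is an assumption for illustration, not the debugger's actual representation:

```python
# Assumed representation: one token per step name plus one per parameter
# assignment, so the count grows with both steps and parameters per step.
def count_tokens(sequence):
    return sum(1 + len(step["parameters"]) for step in sequence)

sequence = [
    {"name": "search_flights", "parameters": {"origin": "JFK", "dest": "SFO"}},
    {"name": "book_flight", "parameters": {"flight_id": "F123"}},
]
print(count_tokens(sequence))  # 5
```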

5.2.1 Time taken per instance

Average time: 0.678 secs

[Figure: time taken per instance]

5.2.2 Time taken per length of sequence

[Figure: time taken per length of sequence]

5.2.3 Time taken per length of sequence x parameters

[Figure: time taken per length of sequence x parameters]

5.2.4 Time taken with edit distance

Refraction on a set of tokens is likely to be faster if there are fewer edits to make. The intuition here is similar to how planning problems with longer solution plans are, more often than not, harder to solve. Hence, running the debugger just to get a YES/NO signal (e.g. to send back for reflection) is likely to be faster than generating all the edits.
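A toy sketch of why the YES/NO check can be cheaper: validity can short-circuit at the first violation, while producing the full edit list must examine every problem in the sequence. The validity criterion used here (membership in a set of known APIs) is a deliberately simplified stand-in for the debugger's actual checks:

```python
# Hypothetical, simplified validity criterion: a step is valid iff its API
# name is in the catalog. The real debugger enforces much richer constraints.
def find_edits(sequence, known_apis):
    """Return every step that would need an edit (full scan)."""
    return [step for step in sequence if step not in known_apis]

def is_valid(sequence, known_apis):
    """Short-circuits on the first violation (the YES/NO signal)."""
    return all(step in known_apis for step in sequence)

known = {"search", "filter", "book"}
broken = ["search", "fly", "teleport"]
print(is_valid(broken, known))    # False
print(find_edits(broken, known))  # ['fly', 'teleport']
```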

5.3 Scaling with the underlying facts

Apart from the input sequence itself, the other input to the debugger is the set of API specs or signatures. This affects scaling in two possible ways: 1) the size of the catalog, i.e. the number of APIs; and 2) how coupled the APIs are, i.e. the number of possible mappings between them. From a scaling point of view, these inputs seem to dominate the token size.
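To illustrate what "coupling" means here, the sketch below counts producer-consumer mappings in a toy catalog. The spec format (named inputs and a single output type per API) is an assumption made for this example:

```python
from itertools import permutations

# Hypothetical API specs: each API consumes named input types and
# produces one output type. Not the debugger's actual spec format.
specs = {
    "get_user":   {"inputs": [],             "output": "user_id"},
    "get_orders": {"inputs": ["user_id"],    "output": "order_list"},
    "summarize":  {"inputs": ["order_list"], "output": "text"},
}

def count_mappings(specs):
    """Count producer -> consumer pairs where an output can feed an input."""
    return sum(
        1
        for a, b in permutations(specs, 2)
        if specs[a]["output"] in specs[b]["inputs"]
    )

print(count_mappings(specs))  # 2
```

The more such pairs a catalog contains, the more possible mappings the debugger must reason over.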

5.3.1 Scaling with the Size of the Catalog (with no mappings)

[Figure: scaling with the size of the catalog, with no mappings]

5.3.2 Scaling with the Number of Maps (with all APIs)

[Figure: scaling with the number of maps, with all APIs]

5.4 Impact of Recovery Patterns on Scaling

Finally, the more recovery patterns the debugger has to reason about, the more time it takes. Chief among these is the presence of defensive actions, because they effectively double the number of edits (due to confirmations on newly introduced items in the sequence).
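A rough cost model of that doubling effect, under the assumption stated above that each item newly introduced by a defensive action also requires a confirmation edit:

```python
# Assumed cost model, not measured: with defensive actions enabled, every
# newly introduced item carries an extra confirmation edit.
def edit_count(base_edits, introduced_items, defensive=False):
    confirmations = introduced_items if defensive else 0
    return base_edits + confirmations

print(edit_count(base_edits=3, introduced_items=3, defensive=False))  # 3
print(edit_count(base_edits=3, introduced_items=3, defensive=True))   # 6
```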

5.4.1 Defensive Actions

|                    | Default    | With defensive actions |
|--------------------|------------|------------------------|
| Average time taken | 0.678 secs | 0.699 secs             |

Of course, since we are currently running on the ground truth, this effect is not very pronounced, as there are hardly any edits to make. We will find out more in future experiments.