AI agent billing: meter multi-step agents and tool calls
Metering AI agents is harder than metering single LLM calls. One "agent run" can fan out into 20 tool calls and 50 LLM calls. AIPricingLab handles agent-level and step-level metering with composite events and atomic reservations.
Last updated: 2026-05-10
The problem
AI agents - research agents, code agents, browser agents - make many sub-calls per user request. One "summarize this PDF" request might trigger 12 LLM calls and 4 tool invocations.
Should you bill per-agent-run? Per-LLM-call? Per-token? The answer depends on your pricing, but the metering layer needs to express all three views without you re-instrumenting every call.
You also need quota gating: "this user has $5 of agent runs remaining" has to block the next agent run before it kicks off 100 sub-calls and burns $20.
The solution
Reserve the agent run upfront, before the first sub-call: vevee.reserve(userId, "agent.run", expectedCost, { agent: "research" }). Use an upper-bound estimate.
Track each sub-step as it happens with a child event tagged with the parent run ID. AIPricingLab's match-rule engine routes each sub-event into the right limit groups (agent runs, LLM tokens, tool calls).
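To make the parent/child idea concrete, here is a minimal sketch of how child events tagged with a run ID can be rolled up into per-run, per-call, and per-token views. The `StepEvent` shape and `summarizeRuns` helper are assumptions made for illustration, not AIPricingLab's event format or API.

```typescript
// Hypothetical event shape for illustration; not the AIPricingLab wire format.
type StepEvent = {
  runId: string;          // parent agent run this step belongs to
  kind: "llm" | "tool";
  tokens: number;         // 0 for tool calls
};

type RunSummary = { llmCalls: number; toolCalls: number; tokens: number };

// Group child events by their parent run ID.
function summarizeRuns(events: StepEvent[]): Map<string, RunSummary> {
  const runs = new Map<string, RunSummary>();
  for (const e of events) {
    const s = runs.get(e.runId) ?? { llmCalls: 0, toolCalls: 0, tokens: 0 };
    if (e.kind === "llm") s.llmCalls += 1;
    else s.toolCalls += 1;
    s.tokens += e.tokens;
    runs.set(e.runId, s);
  }
  return runs;
}

const events: StepEvent[] = [
  { runId: "run_a", kind: "llm", tokens: 1200 },
  { runId: "run_a", kind: "tool", tokens: 0 },
  { runId: "run_a", kind: "llm", tokens: 800 },
  { runId: "run_b", kind: "llm", tokens: 500 },
];

const summary = summarizeRuns(events);
const summaries = Array.from(summary.values());

// Three billing views derived from the same event stream.
const totalRuns = summary.size;
const totalLlmCalls = summaries.reduce((n, s) => n + s.llmCalls, 0);
const totalTokens = summaries.reduce((n, s) => n + s.tokens, 0);
console.log({ totalRuns, totalLlmCalls, totalTokens });
```

Because every child event carries the run ID, the billing view can change later without touching the instrumentation.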
On success, commit the reservation and reconcile the actual cost against the reserved upper bound, refunding the difference via a track event. On error, release the reservation so the held quota goes back to the user.
Example
Agent reserves a budget upfront. Sub-calls track separately for analytics, but the reservation guards the wallet.
import { createClient } from "@vevee/sdk";
import { nanoid } from "nanoid";

const vevee = createClient({ apiKey: process.env.VEVEE_KEY! });

// done() and runOneStep() stand in for your own agent loop; they are not part of the SDK.
export async function runResearchAgent(userId: string, query: string) {
  const runId = nanoid();
  const budgetCents = 50; // upper-bound estimate for one research run

  // Hold the full budget before the first sub-call.
  const r = await vevee.reserve(userId, "agent.run", budgetCents, {
    agent: "research",
    runId,
  });
  if (!r.allowed) return { error: "limit_reached", reasons: r.reasons };

  let actualCents = 0;
  try {
    while (!done()) {
      const stepCost = await runOneStep(runId);
      actualCents += stepCost;
      // Track each step for analytics (does NOT increment the agent.run group)
      await vevee.track(userId, "agent.step", 1, { runId, kind: "llm" });
      // Safety stop: the check runs after the step, so the last step can still overshoot.
      if (actualCents >= budgetCents) break;
    }
    await vevee.commit(r.reservationId!);
    // Refund the unused part of the reservation.
    if (actualCents < budgetCents) {
      await vevee.track(userId, "agent.run.refund", budgetCents - actualCents, { runId });
    }
    return { runId, costCents: actualCents };
  } catch (err) {
    // On error, release the hold so the user is not charged for a failed run.
    await vevee.release(r.reservationId!);
    throw err;
  }
}
Pick the billable unit deliberately
For most agent products the right billable unit is the agent run, not the underlying token count. Users buy "5 research runs / month", not "100k tokens of agent activity." Token cost still matters internally as a margin guardrail - track it as a sub-event, but don't expose it to the customer.
Upper-bound reservations protect against runaway agents
A pathological agent loop can chew through hundreds of sub-calls. By reserving a budget upfront and breaking the loop once actual cost reaches it, you cap the blast radius. The reserved quota is held atomically, so two parallel agent runs can't both pass the check when the remaining quota only covers one of them.
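The atomic hold can be illustrated with a small in-memory ledger. This is a sketch under stated assumptions: `QuotaLedger` is a hypothetical stand-in for what a hosted metering service would do server-side, and it folds the refund into `commit` instead of using a separate track event.

```typescript
// Hypothetical in-memory ledger, for illustration only.
class QuotaLedger {
  private heldCents = 0;
  private nextId = 1;
  private holds = new Map<number, number>();
  private limitCents: number;
  private usedCents: number;

  constructor(limitCents: number, usedCents: number) {
    this.limitCents = limitCents;
    this.usedCents = usedCents;
  }

  // Check and hold happen in one synchronous step, so no other
  // caller can slip in between the check and the hold.
  reserve(amountCents: number): { allowed: boolean; id?: number } {
    if (this.usedCents + this.heldCents + amountCents > this.limitCents) {
      return { allowed: false };
    }
    const id = this.nextId++;
    this.heldCents += amountCents;
    this.holds.set(id, amountCents);
    return { allowed: true, id };
  }

  // Turn a hold into usage, charging the actual cost capped at the held amount.
  commit(id: number, actualCents: number): void {
    const held = this.holds.get(id);
    if (held === undefined) throw new Error("unknown reservation");
    this.holds.delete(id);
    this.heldCents -= held;
    this.usedCents += Math.min(actualCents, held);
  }

  // Drop a hold without charging anything.
  release(id: number): void {
    const held = this.holds.get(id);
    if (held === undefined) return;
    this.holds.delete(id);
    this.heldCents -= held;
  }

  remaining(): number {
    return this.limitCents - this.usedCents - this.heldCents;
  }
}

const ledger = new QuotaLedger(500, 400); // $5 limit, $4 already used
const first = ledger.reserve(60);         // fits in the remaining $1
const second = ledger.reserve(60);        // parallel run: only 40 cents are unheld
ledger.commit(first.id!, 25);             // first run actually cost 25 cents
const third = ledger.reserve(60);         // unused part of the first hold is free again
console.log(first.allowed, second.allowed, third.allowed, ledger.remaining());
```

The second run is refused while the first hold is open, even though no money has been spent yet. That is the property that stops parallel runs from overshooting.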
Per-sub-event analytics without quota impact
You probably want to see "this agent on average makes 14 tool calls and 38 LLM calls" - but you don't want each of those incrementing the user's agent.run quota. Use match rules so agent.step events route only into analytics groups, not into quota groups.
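As a rough illustration of the routing idea, the sketch below sends each event name to the groups whose rule matches it and separates quota groups from analytics-only groups. The `MatchRule` shape, the `*` prefix convention, and the group names are all assumptions; AIPricingLab's real match-rule syntax may differ.

```typescript
// Hypothetical rule shape; not the product's actual configuration format.
type MatchRule = {
  event: string;    // exact event name, or a prefix ending in "*"
  group: string;    // group to increment when the rule matches
  enforce: boolean; // true = counts against quota, false = analytics only
};

const rules: MatchRule[] = [
  { event: "agent.run", group: "agent_runs", enforce: true },
  { event: "agent.step", group: "agent_steps_analytics", enforce: false },
  { event: "tool.*", group: "tool_calls_analytics", enforce: false },
];

function matches(pattern: string, eventName: string): boolean {
  return pattern.endsWith("*")
    ? eventName.startsWith(pattern.slice(0, -1))
    : pattern === eventName;
}

// Every rule that applies to this event name.
function route(eventName: string): MatchRule[] {
  return rules.filter((r) => matches(r.event, eventName));
}

// Only the groups that would count against the user's quota.
function quotaGroupsFor(eventName: string): string[] {
  return route(eventName)
    .filter((r) => r.enforce)
    .map((r) => r.group);
}

console.log(quotaGroupsFor("agent.run"));  // quota groups hit by a run
console.log(quotaGroupsFor("agent.step")); // steps hit no quota group
console.log(route("tool.search").map((r) => r.group));
```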
Tool calls and external API costs
If your agent calls paid external APIs (search, scraping, code execution), track those as separate events and either include them in the agent.run cents budget or in their own limit group ("100 tool calls / month"). Your pricing is your call.
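The two options can be sketched as a single switch. This is only an illustration: `ToolBilling`, `RunState`, and `recordToolCall` are hypothetical names, and the costs are placeholders.

```typescript
// Hypothetical names throughout; a sketch of the two billing choices.
type ToolBilling = "in_run_budget" | "own_limit_group";

type RunState = {
  spentCents: number;   // counts against the agent.run cents budget
  toolCallsBilled: number; // counts against a separate tool-call limit group
};

function recordToolCall(state: RunState, costCents: number, mode: ToolBilling): RunState {
  return mode === "in_run_budget"
    ? { ...state, spentCents: state.spentCents + costCents }
    : { ...state, toolCallsBilled: state.toolCallsBilled + 1 };
}

const start: RunState = { spentCents: 10, toolCallsBilled: 0 };
const bundled = recordToolCall(start, 3, "in_run_budget");    // tool cost eats into the run budget
const separate = recordToolCall(start, 3, "own_limit_group"); // tool call counts toward its own quota
console.log(bundled, separate);
```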
Frequently asked questions
How do I estimate the upfront budget for a reservation?
Use a conservative upper bound based on agent type and config. For research agents, 50 cents is usually safe. Refund the unused portion at commit time so the user only pays what they actually used.
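One way to derive such an upper bound is from the agent's configured step and tool limits. The sketch below assumes a hypothetical `AgentConfig`; every price and limit in it is a placeholder, not a measured value.

```typescript
// All fields and numbers are illustrative assumptions.
type AgentConfig = {
  maxSteps: number;         // hard cap on LLM calls per run
  maxTokensPerStep: number; // prompt + completion tokens per call
  centsPer1kTokens: number; // blended token price
  maxToolCalls: number;     // hard cap on paid tool calls per run
  centsPerToolCall: number;
};

// Worst case: every step uses its full token allowance and every tool call is made.
function upperBoundCents(cfg: AgentConfig): number {
  const llm = cfg.maxSteps * (cfg.maxTokensPerStep / 1000) * cfg.centsPer1kTokens;
  const tools = cfg.maxToolCalls * cfg.centsPerToolCall;
  return Math.ceil(llm + tools);
}

// Unused part of the reservation, never negative.
function refundCents(reservedCents: number, actualCents: number): number {
  return Math.max(0, reservedCents - actualCents);
}

const research: AgentConfig = {
  maxSteps: 12,
  maxTokensPerStep: 4000,
  centsPer1kTokens: 0.5,
  maxToolCalls: 4,
  centsPerToolCall: 1,
};

const reserved = upperBoundCents(research); // 12 * 4 * 0.5 + 4 * 1
const refund = refundCents(reserved, 19);   // run actually cost 19 cents
console.log({ reserved, refund });
```

Tying the bound to hard caps in the agent config keeps the reservation honest: if the caps change, the estimate changes with them.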
What if the agent times out mid-run?
If your code crashes, the reservation auto-releases after 60 seconds. If your agent legitimately runs longer, call vevee.commit() on partial success and track a refund for the unused portion.
Can I bill on both runs and total tokens?
Yes. Define two limit groups: agent.run (count) and llm.tokens (tokens). One sub-event can hit both. Free plans usually only enforce runs; usage-based premium plans can also enforce tokens for safety.
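A minimal sketch of checking one event against both groups, with enforcement depending on the plan. The plan names, limits, and the `checkLimits` helper are assumptions for illustration; only the `agent.run` and `llm.tokens` group names come from the text above.

```typescript
// Hypothetical plans and limits; placeholder numbers.
type Plan = "free" | "premium";
type Usage = { runs: number; tokens: number };
type Limits = { runs: number; tokens: number | null }; // null = not enforced

const limitsByPlan: Record<Plan, Limits> = {
  free: { runs: 5, tokens: null },          // free plan enforces runs only
  premium: { runs: 200, tokens: 2000000 },  // premium also enforces tokens as a safety cap
};

// One event can increment both groups; it is allowed only if every enforced group has room.
function checkLimits(plan: Plan, used: Usage, delta: Usage): { allowed: boolean; reasons: string[] } {
  const limits = limitsByPlan[plan];
  const reasons: string[] = [];
  if (used.runs + delta.runs > limits.runs) reasons.push("agent.run");
  if (limits.tokens !== null && used.tokens + delta.tokens > limits.tokens) {
    reasons.push("llm.tokens");
  }
  return { allowed: reasons.length === 0, reasons };
}

const freeOk = checkLimits("free", { runs: 4, tokens: 900000 }, { runs: 1, tokens: 50000 });
const freeBlocked = checkLimits("free", { runs: 5, tokens: 0 }, { runs: 1, tokens: 1000 });
const premiumBlocked = checkLimits("premium", { runs: 10, tokens: 1990000 }, { runs: 1, tokens: 20000 });
console.log(freeOk, freeBlocked, premiumBlocked);
```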
How do I show the user what their agent run cost?
Each event's response includes counters and remaining quota. Surface those in your UI. Or use vevee.usage(userId) on the client with a pk_live_ key for a live "X runs left this month" display.
Other use cases
LLM usage metering: track tokens per end-user, across providers
Meter LLM token usage per end-user across OpenAI, Anthropic, Gemini, Mistral, and any other provider. Composite events for prompt + completion tokens, real-time per-user limits, atomic enforcement. The drop-in pattern for AI apps.
Image generation quotas: per-user limits for DALL·E, Flux, Stable Diffusion
Enforce per-user quotas on image generation across DALL·E, Flux, Stable Diffusion, Midjourney API, and Replicate. Atomic reservation pattern stops parallel renders from overshooting. Free tier, premium tier, hard caps - drop in.
Freemium AI SaaS: ship a free → paid funnel without a backend
Build a freemium AI product where the free plan has hard quotas, the paid plan unlocks more, and "you have used 80% of your free renders" nudges drive upgrades. Drop-in implementation, ten minutes from zero to live.
Token-based pricing: charge users for actual AI consumption
Charge AI app users by tokens, requests, or compute seconds. Pre-paid credits, post-paid invoicing, hybrid models - implementation patterns and trade-offs from someone who has shipped all three.