LLM usage metering: track tokens per end-user, across providers
Meter LLM token usage per end-user across OpenAI, Anthropic, Gemini, Mistral, and any other provider. Composite events for prompt + completion tokens, real-time per-user limits, atomic enforcement. The drop-in pattern for AI apps.
Last updated: 2026-05-10
The problem
Your AI app calls OpenAI (or Anthropic, or Gemini). Different users send different prompt sizes, hit different models, and pay you on different plans.
You need to know: how many tokens has each user used this month, across all of their requests, across every provider you call? And: when they hit their plan limit, can the next call return a clean 429 instead of running anyway?
Building this yourself means a counter table, a periodic reset job, atomic increments under concurrency, and an admin UI to debug it. Most teams underestimate how hard the concurrency story is - until two parallel requests both pass canUse().
The solution
AIPricingLab models LLM usage as composite events. One LLM call becomes one event with prompt_tokens and completion_tokens metadata, and AIPricingLab routes it into every limit group it matches - total tokens, per-model tokens, per-tier tokens, anything you defined.
Atomic reserve / commit / release means parallel requests can't both pass the same quota. Reservations auto-release after 60 seconds if your code crashes mid-call.
You don't need to write the counter logic, the period reset, the dashboard, or the developer SDK. Drop in @vevee/sdk, define limit groups in the dashboard, and ship.
Example
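After your OpenAI call, you know the actual token counts. Send them as one composite event with prompt and completion as metadata. AIPricingLab counts it against every matching limit group on the user's plan. The example below reserves an upper bound before the call so that parallel requests cannot overshoot, then commits the reservation and refunds the unused portion once the real usage is known.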
import OpenAI from "openai";
import { createClient } from "@vevee/sdk";

const vevee = createClient({ apiKey: process.env.VEVEE_KEY! });
const openai = new OpenAI();

export async function chat(userId: string, prompt: string) {
  // Atomically reserve an upper bound of 4,000 tokens before calling the model.
  const r = await vevee.reserve(userId, "llm.tokens", 4000, { model: "gpt-4o" });
  if (!r.allowed) {
    throw new Error(`limit_reached: ${r.reasons?.join(",")}`);
  }
  try {
    const res = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    });
    // Actual usage as reported by the provider.
    const used =
      (res.usage?.prompt_tokens ?? 0) +
      (res.usage?.completion_tokens ?? 0);
    // Commit the reservation, then refund the unused portion.
    await vevee.commit(r.reservationId!);
    if (used < 4000) {
      await vevee.track(userId, "llm.tokens.refund", 4000 - used, {
        reservationId: r.reservationId!,
      });
    }
    return res.choices[0]?.message;
  } catch (err) {
    // The call failed: release the reservation so no quota is consumed.
    await vevee.release(r.reservationId!);
    throw err;
  }
}
Why composite events
A single LLM call generates two billable quantities: prompt tokens (typically cheaper) and completion tokens (typically more expensive). With composite events, both travel in one event and AIPricingLab routes them into the right limit groups based on metadata. You don't need to track them as separate events.
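To make the routing idea concrete, here is a minimal in-memory sketch of one event fanning out into several limit groups. It is an illustration only: the UsageEvent and LimitGroup types, the routeEvent function, and the group ids are all hypothetical and do not describe how AIPricingLab is implemented or configured.

```typescript
// Hypothetical illustration; none of these names are part of the AIPricingLab API.
type Metadata = Record<string, string | number>;

interface UsageEvent {
  userId: string;
  name: string;
  metadata: Metadata;
}

interface LimitGroup {
  id: string;
  // Metadata key/value pairs an event must carry to count against this group.
  match: Record<string, string>;
  // Numeric metadata fields that are summed into this group's usage.
  fields: string[];
}

// Returns the amount to add to each matching group for a single event.
function routeEvent(event: UsageEvent, groups: LimitGroup[]): Record<string, number> {
  const increments: Record<string, number> = {};
  for (const group of groups) {
    const matches = Object.keys(group.match).every(
      (key) => event.metadata[key] === group.match[key],
    );
    if (!matches) continue;
    let amount = 0;
    for (const field of group.fields) {
      const value = event.metadata[field];
      if (typeof value === "number") amount += value;
    }
    increments[group.id] = amount;
  }
  return increments;
}

const groups: LimitGroup[] = [
  { id: "total-tokens", match: {}, fields: ["prompt_tokens", "completion_tokens"] },
  { id: "gpt-4o-completion", match: { model: "gpt-4o" }, fields: ["completion_tokens"] },
  { id: "anthropic-tokens", match: { provider: "anthropic" }, fields: ["prompt_tokens", "completion_tokens"] },
];

const result = routeEvent(
  {
    userId: "user_1",
    name: "llm.tokens",
    metadata: { provider: "openai", model: "gpt-4o", prompt_tokens: 120, completion_tokens: 80 },
  },
  groups,
);
```

In this sketch the single event adds 200 to "total-tokens" and 80 to "gpt-4o-completion", and leaves "anthropic-tokens" untouched because the provider does not match.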
Concurrency: why naive metering breaks
The pattern canUse → call OpenAI → track has a race: two concurrent requests can both see "user has 100 tokens left, allow" before either has tracked. Both then track 100 tokens, for a total of 200 against the 100 that remained. AIPricingLab's reserve / commit / release closes this race by making the check and the hold a single atomic step, so the second request sees the first one's reservation.
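The race can be reproduced deterministically with a toy in-memory model. The sketch below is not the AIPricingLab implementation; NaiveMeter and ReservingMeter are hypothetical classes that only illustrate why check-then-track overshoots and why a combined check-and-hold does not.

```typescript
// Hypothetical in-memory model for illustration only.
class NaiveMeter {
  used = 0;
  limit: number;
  constructor(limit: number) {
    this.limit = limit;
  }
  canUse(amount: number): boolean {
    return this.used + amount <= this.limit;
  }
  track(amount: number): void {
    this.used += amount;
  }
}

class ReservingMeter {
  limit: number;
  private committed = 0;
  private heldTotal = 0;
  private held: Record<number, number> = {};
  private nextId = 1;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Check and hold happen in one step, so a later caller sees earlier holds.
  reserve(amount: number): number | null {
    if (this.committed + this.heldTotal + amount > this.limit) return null;
    const id = this.nextId++;
    this.held[id] = amount;
    this.heldTotal += amount;
    return id;
  }

  // Turn a hold into real usage.
  commit(id: number): void {
    this.committed += this.take(id);
  }

  // Drop a hold without consuming quota.
  release(id: number): void {
    this.take(id);
  }

  get used(): number {
    return this.committed;
  }

  private take(id: number): number {
    const amount = this.held[id] ?? 0;
    delete this.held[id];
    this.heldTotal -= amount;
    return amount;
  }
}

// Naive flow: both requests check before either one tracks.
const naive = new NaiveMeter(100);
const requestAAllowed = naive.canUse(100);
const requestBAllowed = naive.canUse(100);
if (requestAAllowed) naive.track(100);
if (requestBAllowed) naive.track(100);

// Reserving flow: the second reserve sees the first hold and is rejected.
const safe = new ReservingMeter(100);
const first = safe.reserve(100);
const second = safe.reserve(100);
if (first !== null) safe.commit(first);
```

Here the naive meter ends at 200 used against a limit of 100, while the reserving meter rejects the second request and ends at exactly 100. A real service has to provide the same guarantee across processes, which is the part that is hard to build yourself.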
Provider-agnostic by design
AIPricingLab does not know or care that you're calling OpenAI. The same pattern works for Anthropic, Gemini, Mistral, Replicate, fal, your self-hosted Llama 3 model, or all of the above behind a router. You decide what an event means; AIPricingLab counts it.
Token bundles vs per-request units
For most apps, token-level metering is right. For some - e.g. an app where each user gets "10 chats per day" regardless of length - request-level metering is simpler. AIPricingLab supports both: define a limit group with unit "tokens" or "count" and you're done.
Frequently asked questions
Do I have to send tokens or can I use request count?
Either. AIPricingLab supports four units: count, tokens, seconds, and cents. Pick the one that maps to how you want to charge your users.
How do I handle streaming responses where I do not know token counts upfront?
Reserve an upper bound, stream the response, then commit and track a refund event for the unused portion. The example above shows the pattern.
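A minimal sketch of that streaming flow, written against small hypothetical interfaces so it can run without any SDK. The Meter interface mirrors only the calls used in the example above (reserve, commit, release, track), and its exact shapes are assumptions rather than the published SDK types. StreamChunk and meteredStream are likewise illustrative names. It assumes the provider reports usage totals on the final chunk, which OpenAI does when stream_options.include_usage is set.

```typescript
// Assumed shapes, modeled on the calls in the example above; not the published SDK types.
interface Reservation {
  allowed: boolean;
  reservationId?: string;
}

interface Meter {
  reserve(userId: string, event: string, amount: number, metadata?: Record<string, unknown>): Promise<Reservation>;
  commit(reservationId: string): Promise<void>;
  release(reservationId: string): Promise<void>;
  track(userId: string, event: string, amount: number, metadata?: Record<string, unknown>): Promise<void>;
}

// One streamed chunk: a text delta, plus usage totals if the provider sends them.
interface StreamChunk {
  text: string;
  usage?: { promptTokens: number; completionTokens: number };
}

async function meteredStream(
  meter: Meter,
  userId: string,
  upperBound: number,
  stream: AsyncIterable<StreamChunk>,
  onText: (text: string) => void,
): Promise<number> {
  const r = await meter.reserve(userId, "llm.tokens", upperBound);
  if (!r.allowed || !r.reservationId) throw new Error("limit_reached");

  // If the provider never reports usage, fall back to charging the full bound.
  let used = upperBound;
  try {
    for await (const chunk of stream) {
      onText(chunk.text);
      if (chunk.usage) {
        used = chunk.usage.promptTokens + chunk.usage.completionTokens;
      }
    }
  } catch (err) {
    // Stream failed: release the hold, as the non-streaming example does.
    await meter.release(r.reservationId);
    throw err;
  }

  await meter.commit(r.reservationId);
  if (used < upperBound) {
    // Refund the part of the bound that was not consumed.
    await meter.track(userId, "llm.tokens.refund", upperBound - used, {
      reservationId: r.reservationId,
    });
  }
  return used;
}
```

Two design choices here are worth reviewing for your own app: a stream that fails midway releases the whole reservation even though some tokens were generated, and usage above the bound is charged at the bound. Pick an upper bound that matches the max token setting you send to the provider so the second case does not arise.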
Can I bill per-token cost in cents instead of token count?
Yes. Define a limit group with unit "cents" and track the cost in cents rather than the token count. AIPricingLab counts whatever quantity you send.
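The conversion from tokens to cents happens in your code before you track. A rough sketch, assuming per-model prices quoted per million tokens: the RATES table, the "example-model" name, and the prices in it are placeholders, not real provider pricing.

```typescript
// Placeholder prices in cents per million tokens; replace with your provider's current rates.
interface Rate {
  promptCentsPerMTok: number;
  completionCentsPerMTok: number;
}

const RATES: Record<string, Rate> = {
  "example-model": { promptCentsPerMTok: 250, completionCentsPerMTok: 1000 },
};

// Cost of one call in cents, pricing prompt and completion tokens separately.
function costInCents(model: string, promptTokens: number, completionTokens: number): number {
  const rate = RATES[model];
  if (!rate) throw new Error(`no rate configured for ${model}`);
  const microCents =
    promptTokens * rate.promptCentsPerMTok +
    completionTokens * rate.completionCentsPerMTok;
  return microCents / 1_000_000;
}

// 2,000 prompt tokens and 1,000 completion tokens at the placeholder rates.
const cents = costInCents("example-model", 2000, 1000);
```

With these placeholder rates the call costs 1.5 cents. Single calls often cost a fraction of a cent, so decide up front whether to round up per call or to track in a smaller unit; the source does not say whether fractional amounts are accepted.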
Does this work with multiple AI providers in one app?
Yes. Use metadata like { provider: "openai", model: "gpt-4o" } and define limit groups that match per-provider, per-model, or globally. One image render or LLM call can hit multiple groups simultaneously.
Other use cases
Image generation quotas: per-user limits for DALL·E, Flux, Stable Diffusion
Enforce per-user quotas on image generation across DALL·E, Flux, Stable Diffusion, Midjourney API, and Replicate. Atomic reservation pattern stops parallel renders from overshooting. Free tier, premium tier, hard caps - drop in.
AI agent billing: meter multi-step agents and tool calls
Metering AI agents is harder than metering single LLM calls. One "agent run" can fan out into 20 tool calls and 50 LLM calls. AIPricingLab handles agent-level and step-level metering with composite events and atomic reservations.
Freemium AI SaaS: ship a free → paid funnel without a backend
Build a freemium AI product where the free plan has hard quotas, the paid plan unlocks more, and "you have used 80% of your free renders" nudges drive upgrades. Drop-in implementation, ten minutes from zero to live.
Token-based pricing: charge users for actual AI consumption
Charge AI app users by tokens, requests, or compute seconds. Pre-paid credits, post-paid invoicing, hybrid models - implementation patterns and trade-offs from someone who has shipped all three.