Use case

LLM usage metering: track tokens per end-user, across providers

Meter LLM token usage per end-user across OpenAI, Anthropic, Gemini, Mistral, and any other provider. Composite events for prompt + completion tokens, real-time per-user limits, atomic enforcement. The drop-in pattern for AI apps.

Last updated: 2026-05-10

The problem

Your AI app calls OpenAI (or Anthropic, or Gemini). Different users send different prompt sizes, hit different models, and pay you on different plans.

You need to know: how many tokens has each user used this month, across all of their requests, across every provider you call? And: when they hit their plan limit, can the next call return a clean 429 instead of running anyway?

Building this yourself means a counter table, a periodic reset job, atomic increments under concurrency, and an admin UI to debug it. Most teams underestimate how hard the concurrency story is - until two parallel requests both pass canUse().

The solution

Vevee models LLM usage as composite events. One LLM call becomes one event with prompt_tokens and completion_tokens metadata, and Vevee routes it into every limit group it matches - total tokens, per-model tokens, per-tier tokens, anything you defined.

Atomic reserve / commit / release means parallel requests can't both pass the same quota. Reservations auto-release after 60 seconds if your code crashes mid-call.
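To make the reserve / commit / release lifecycle concrete, here is a minimal in-memory sketch. It is not Vevee's implementation: the `ReservationLedger` class, its method names, and the explicit `now` argument are assumptions made for illustration. Only the 60-second auto-release window comes from the description above.

```typescript
// Illustrative in-memory model of reserve / commit / release.
// Every name here is hypothetical; this is not the Vevee SDK.
type Reservation = { amount: number; expiresAt: number };

class ReservationLedger {
  private limit: number;
  private ttlMs: number;
  private used = 0; // committed usage
  private held = new Map<string, Reservation>(); // open reservations
  private nextId = 1;

  constructor(limit: number, ttlMs = 60_000) {
    this.limit = limit;
    this.ttlMs = ttlMs; // auto-release window
  }

  // Drop reservations whose TTL has passed, returning their quota to the pool.
  private expire(now: number): void {
    this.held.forEach((r, id) => {
      if (r.expiresAt <= now) this.held.delete(id);
    });
  }

  private heldTotal(): number {
    let total = 0;
    this.held.forEach((r) => {
      total += r.amount;
    });
    return total;
  }

  // Check and hold in one synchronous step, so no other caller can interleave.
  reserve(amount: number, now: number): string | null {
    this.expire(now);
    if (this.used + this.heldTotal() + amount > this.limit) return null;
    const id = `res_${this.nextId++}`;
    this.held.set(id, { amount, expiresAt: now + this.ttlMs });
    return id;
  }

  // Turn a live reservation into committed usage. Returns false if it expired.
  commit(id: string, now: number): boolean {
    this.expire(now);
    const r = this.held.get(id);
    if (!r) return false;
    this.held.delete(id);
    this.used += r.amount;
    return true;
  }

  // Give a reservation back without counting it.
  release(id: string): void {
    this.held.delete(id);
  }

  // Quota that is neither committed nor currently held.
  remaining(now: number): number {
    this.expire(now);
    return this.limit - this.used - this.heldTotal();
  }
}
```

Passing `now` in explicitly keeps the sketch deterministic. A hosted service would use its own clock and a shared datastore rather than process memory.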

You don't need to write the counter logic, the period reset, the dashboard, or the developer SDK. Drop in @vevee/sdk, define limit groups in the dashboard, and ship.

Example

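The example reserves an upper bound of 4,000 tokens before the OpenAI call. Once the response reports the actual prompt and completion token counts, it commits the reservation and refunds the unused portion. If the call fails, it releases the reservation instead. Vevee counts the usage against every matching limit group on the user's plan.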

import OpenAI from "openai";
import { createClient } from "@vevee/sdk";

const vevee = createClient({ apiKey: process.env.VEVEE_KEY! });
const openai = new OpenAI();

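// Upper bound reserved per call. It is a budget for metering, not a cap on the model's output.
const RESERVED_TOKENS = 4000;

export async function chat(userId: string, prompt: string) {
  // Atomically reserve the upper bound before calling the provider.
  const r = await vevee.reserve(userId, "llm.tokens", RESERVED_TOKENS, { model: "gpt-4o" });
  if (!r.allowed) {
    throw new Error(`limit_reached: ${r.reasons?.join(",")}`);
  }

  let res;
  try {
    res = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    });
  } catch (err) {
    // The provider call failed, so hand the reserved tokens back.
    await vevee.release(r.reservationId!);
    throw err;
  }

  // Commit only after the call succeeded. This stays outside the try block
  // so a failed refund can never release an already committed reservation.
  await vevee.commit(r.reservationId!);

  const used =
    (res.usage?.prompt_tokens ?? 0) +
    (res.usage?.completion_tokens ?? 0);

  // Refund the unused portion of the reservation.
  if (used < RESERVED_TOKENS) {
    await vevee.track(userId, "llm.tokens.refund", RESERVED_TOKENS - used, {
      reservationId: r.reservationId!,
    });
  }
  return res.choices[0]?.message;
}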

Why composite events

A single LLM call generates two billable quantities: prompt tokens (typically cheaper) and completion tokens (typically more expensive). With composite events, both travel in one event and Vevee routes them into the right limit groups based on metadata. You don't need to track them as separate events.
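As a rough mental model of that routing, the sketch below matches one event against several limit groups and computes how much each group's counter would grow. The `LlmEvent` and `LimitGroup` shapes, the group ids, and `routeEvent` are all assumptions for illustration; in Vevee the groups are defined in the dashboard, not in code like this.

```typescript
// Hypothetical shapes for illustration; not the Vevee SDK or its dashboard config.
type LlmEvent = {
  userId: string;
  name: string;
  metadata: { model: string; prompt_tokens: number; completion_tokens: number };
};

type LimitGroup = {
  id: string;
  // Which events this group counts.
  matches: (e: LlmEvent) => boolean;
  // How much one matching event adds to the group's counter.
  amount: (e: LlmEvent) => number;
};

const sumTokens = (e: LlmEvent): number =>
  e.metadata.prompt_tokens + e.metadata.completion_tokens;

const exampleGroups: LimitGroup[] = [
  // Every token from every model.
  { id: "total_tokens", matches: () => true, amount: sumTokens },
  // Tokens spent on one specific model.
  { id: "gpt4o_tokens", matches: (e) => e.metadata.model === "gpt-4o", amount: sumTokens },
  // Only the completion side of each call.
  { id: "completion_tokens", matches: () => true, amount: (e) => e.metadata.completion_tokens },
];

// Return the counter increment for each group the event matches.
function routeEvent(event: LlmEvent, groups: LimitGroup[]): Record<string, number> {
  const increments: Record<string, number> = {};
  for (const g of groups) {
    if (g.matches(event)) increments[g.id] = g.amount(event);
  }
  return increments;
}
```

One event with 120 prompt tokens and 380 completion tokens on "gpt-4o" would add 500 to the first two groups and 380 to the third, without the caller sending three separate events.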

Concurrency: why naive metering breaks

The pattern canUse → call OpenAI → track has a race: two concurrent requests can both see "user has 100 tokens left, allow" before either has tracked anything. Each then tracks 100 tokens, for a total of 200 against a 100-token quota. Vevee's reserve / commit / release closes this race: the check and the hold happen in one atomic step, so the second request sees the first request's reservation.
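The race is easy to reproduce in a few lines. The simulation below uses no real SDK or provider: `fakeLlmCall`, `naiveRun`, and `reservedRun` are made-up names, and the "atomic" step relies on JavaScript running each synchronous section without interruption. A real service would need its datastore to provide that atomicity across processes.

```typescript
// Simulated race: two parallel requests against a 100-token quota.
// All names are illustrative; nothing here calls a real SDK or provider.
const QUOTA = 100;

// Stand-in for a provider call that yields to the event loop once.
const fakeLlmCall = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Naive: check, call, then track. The check and the increment are separate steps.
async function naiveRun(): Promise<number> {
  let used = 0;
  const request = async (tokens: number): Promise<void> => {
    if (used + tokens > QUOTA) return; // canUse()
    await fakeLlmCall(); // the other request runs its check during this wait
    used += tokens; // track()
  };
  await Promise.all([request(100), request(100)]);
  return used;
}

// Reserved: the check and the hold happen together, before the call.
async function reservedRun(): Promise<number> {
  let used = 0;
  let held = 0;
  const request = async (tokens: number): Promise<void> => {
    if (used + held + tokens > QUOTA) return; // reserve() denied
    held += tokens; // reserve() granted
    await fakeLlmCall();
    held -= tokens; // commit(): move the amount from held to used
    used += tokens;
  };
  await Promise.all([request(100), request(100)]);
  return used;
}
```

Here `naiveRun()` resolves to 200, double the quota, while `reservedRun()` resolves to 100 because the second request is denied while the first reservation is held.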

Provider-agnostic by design

Vevee does not know or care that you're calling OpenAI. The same pattern works for Anthropic, Gemini, Mistral, Replicate, fal, your self-hosted Llama 3 model, or all of the above behind a router. You decide what an event means; Vevee counts it.
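In practice, being provider-agnostic means normalizing each provider's usage fields into one shape before you meter. The sketch below does that for three response formats. The field names reflect the OpenAI, Anthropic, and Gemini APIs as commonly documented, but treat them as assumptions and check each provider's current reference. `NormalizedUsage` and the helper names are invented for this example.

```typescript
// Common shape to meter against, regardless of provider (hypothetical name).
type NormalizedUsage = { promptTokens: number; completionTokens: number };

// Minimal structural types covering only the usage fields read below.
type OpenAIUsageSource = {
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
};
type AnthropicUsageSource = {
  usage: { input_tokens: number; output_tokens: number };
};
type GeminiUsageSource = {
  usageMetadata?: { promptTokenCount?: number; candidatesTokenCount?: number };
};

function fromOpenAI(res: OpenAIUsageSource): NormalizedUsage {
  return {
    promptTokens: res.usage?.prompt_tokens ?? 0,
    completionTokens: res.usage?.completion_tokens ?? 0,
  };
}

function fromAnthropic(res: AnthropicUsageSource): NormalizedUsage {
  return {
    promptTokens: res.usage.input_tokens,
    completionTokens: res.usage.output_tokens,
  };
}

function fromGemini(res: GeminiUsageSource): NormalizedUsage {
  return {
    promptTokens: res.usageMetadata?.promptTokenCount ?? 0,
    completionTokens: res.usageMetadata?.candidatesTokenCount ?? 0,
  };
}

// Total to meter for one call, whichever provider served it.
function totalTokens(u: NormalizedUsage): number {
  return u.promptTokens + u.completionTokens;
}
```

Whichever helper runs, the result feeds the same reserve, commit, and refund flow. Only the metadata you attach to the event changes.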

Token bundles vs per-request units

For most apps, token-level metering is right. For some - e.g. an app where each user gets "10 chats per day" regardless of length - request-level metering is simpler. Vevee supports both: define a limit group with unit "tokens" or "count" and you're done.

Frequently asked questions

Do I have to send tokens or can I use request count?

Either. Vevee supports four units: count, tokens, seconds, and cents. Pick the one that maps to how you want to charge your users.

How do I handle streaming responses where I do not know token counts upfront?

Reserve an upper bound, stream the response, then commit and track a refund event for the unused portion once the final token counts are known. The example above shows the same reserve, commit, and refund pattern for a non-streaming call.
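For the streaming side, here is a sketch of collecting text and the final usage from a stream. It assumes OpenAI-style chunks where, with the `stream_options: { include_usage: true }` request option, the last chunk carries a `usage` object; verify that behavior against the provider's current documentation. `StreamChunk`, `collectStream`, `refundAmount`, and `fakeStream` are illustrative names, and the fake stream stands in for a real response.

```typescript
// Only the chunk fields this sketch reads; modeled on OpenAI-style stream chunks.
type StreamChunk = {
  choices: { delta: { content?: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
};

// Consume the stream, accumulating text and remembering usage if a chunk reports it.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
): Promise<{ text: string; used: number | null }> {
  let text = "";
  let used: number | null = null;
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta.content ?? "";
    if (chunk.usage) {
      used = chunk.usage.prompt_tokens + chunk.usage.completion_tokens;
    }
  }
  return { text, used };
}

// Tokens to refund after commit. If usage never arrived, refund nothing
// rather than guess (a conservative choice made for this sketch).
function refundAmount(reserved: number, used: number | null): number {
  if (used === null) return 0;
  return Math.max(0, reserved - used);
}

// Stand-in stream for local experimentation: two text chunks, then a usage chunk.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: { content: "lo" } }] };
  yield { choices: [], usage: { prompt_tokens: 12, completion_tokens: 5 } };
}
```

With a 4,000-token reservation, the fake stream above reports 17 tokens used, so `refundAmount(4000, 17)` is 3983. That is the amount you would pass to the refund event.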

Can I bill per-token cost in cents instead of token count?

Yes. Define a limit group with unit "cents" and track the cost of each call in cents rather than its token count. Vevee counts whatever amount you send; converting tokens to cost is up to you.
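Converting tokens to cents is a small calculation you run before tracking. The sketch below shows one way to do it. The price table is made up for illustration and does not reflect any provider's real rates; `TokenPrice`, `EXAMPLE_PRICES`, and `costInCents` are hypothetical names.

```typescript
// Price per one million tokens, in cents, split by direction.
type TokenPrice = {
  promptCentsPerMillion: number;
  completionCentsPerMillion: number;
};

// Made-up prices for illustration only; look up your provider's current rates.
const EXAMPLE_PRICES: Record<string, TokenPrice> = {
  "example-model": { promptCentsPerMillion: 250, completionCentsPerMillion: 1000 },
};

// Cost of one call in cents. The result is usually a fraction of a cent.
function costInCents(
  model: string,
  promptTokens: number,
  completionTokens: number,
): number {
  const price = EXAMPLE_PRICES[model];
  if (!price) throw new Error(`no price configured for model: ${model}`);
  return (
    (promptTokens * price.promptCentsPerMillion +
      completionTokens * price.completionCentsPerMillion) /
    1_000_000
  );
}
```

With these example prices, 1,000 prompt tokens and 500 completion tokens cost 0.75 cents. Because single calls often cost less than one cent, decide up front whether you track fractional amounts or accumulate and round; which of those a "cents" limit group accepts is something to confirm in the Vevee docs.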

Does this work with multiple AI providers in one app?

Yes. Use metadata like { provider: "openai", model: "gpt-4o" } and define limit groups that match per-provider, per-model, or globally. A single LLM call can count against multiple groups simultaneously.
