How to track OpenAI API usage by user (with quotas, in real time)
Step-by-step guide to tracking OpenAI API usage per end-user with real-time quotas. Pure-TypeScript pattern using @vevee/sdk. Concurrency-safe, provider-agnostic, ten-minute integration.
Last updated: 2026-05-10
OpenAI does not give you a per-end-user usage view. The OpenAI dashboard shows your org-level token spend, but it does not know which of your users sent which prompts. To track usage by user - and gate calls when they hit a quota - you need a layer in your own code.
This guide shows the smallest, correct implementation: a few lines of @vevee/sdk in your existing chat handler. No counter tables, no cron jobs, no lock primitives. About ten minutes of work end-to-end.
Step-by-step
1. Install the SDK
Add @vevee/sdk to your backend project. Zero runtime dependencies; works in Node, Deno, Bun, and edge runtimes.
pnpm add @vevee/sdk
# or npm install @vevee/sdk
2. Create an app and a plan in the dashboard
Sign up at vevee.org (free up to 1M events/month). Create a workspace, then an app. In the app, define a plan with a limit group: name "tokens", unit "tokens", quota whatever you want (e.g. 100,000 / month, calendar-anchored). Match rule: event_type = "openai.chat".
3. Initialize the client on your server
Use your sk_live_ key from the dashboard. Keep it server-only - never ship it to the browser.
import { createClient } from "@vevee/sdk";
export const vevee = createClient({
  apiKey: process.env.VEVEE_KEY!,
});
4. Wrap your OpenAI call with reserve / commit / release
Reserve an upper-bound token estimate before the call (4,000 tokens in the example below). On success, commit the reservation and track a refund for the unused difference. On failure, release it. Reservations auto-release after 60 seconds.
import OpenAI from "openai";
import { vevee } from "./vevee";
const openai = new OpenAI();
// Upper-bound token estimate held for each call.
const RESERVED_TOKENS = 4000;
export async function chat(userId: string, messages: any[]) {
  // Atomically hold quota before calling the provider.
  const reservation = await vevee.reserve(userId, "openai.chat", RESERVED_TOKENS, {
    model: "gpt-4o",
  });
  if (!reservation.allowed) {
    throw new Error("limit_reached");
  }
  try {
    const res = await openai.chat.completions.create({ model: "gpt-4o", messages });
    const used =
      (res.usage?.prompt_tokens ?? 0) +
      (res.usage?.completion_tokens ?? 0);
    // Turn the hold into committed usage, then refund what was not consumed.
    await vevee.commit(reservation.reservationId!);
    if (used < RESERVED_TOKENS) {
      await vevee.track(userId, "openai.chat.refund", RESERVED_TOKENS - used, {
        reservationId: reservation.reservationId!,
      });
    }
    return res;
  } catch (err) {
    // Give the held quota back if anything above failed.
    await vevee.release(reservation.reservationId!);
    throw err;
  }
}
5. Assign a plan to each user
When a user signs up (or upgrades), call vevee.upsertSubscription. The call is idempotent - calling it with the same plan ID is a no-op, so it is safe to call on every login.
await vevee.upsertSubscription({
  userId: "user_8FbKq2",
  planId: "plan_free",
});
6. Show usage to the user
On the client, use a pk_live_ public key. It can only read the calling user's own counters - safe to ship in browser code.
// In the browser
import { createClient } from "@vevee/sdk";
const vevee = createClient({ apiKey: PK_LIVE_KEY });
const usage = await vevee.usage(currentUserId);
const tokens = usage.counters.find(c => c.label === "Tokens");
// → { quota: 100_000, count: 8_421, remaining: 91_579 }
Why this works under concurrency
The naive pattern - check whether the user is allowed, call OpenAI, increment the counter - has a race: two parallel requests can both pass the check before either has incremented. With reserve / commit / release, the reserve is atomic and creates a 60-second hold, so parallel requests cannot together reserve more than the user has remaining.
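To make the race concrete, here is a minimal in-memory sketch. It is not the hosted service: QuotaLedger, fakeProviderCall, and the other names are hypothetical stand-ins. It only contrasts a check-then-increment flow with a flow where the check and the hold happen in one step.

```typescript
// In-memory stand-in for one user's quota. All names are hypothetical; a hosted
// service would keep this state server-side.
class QuotaLedger {
  private readonly quota: number;
  private used = 0; // committed usage
  private held = 0; // reserved but not yet committed

  constructor(quota: number) {
    this.quota = quota;
  }

  remaining(): number {
    return this.quota - this.used - this.held;
  }

  // Naive pattern, step 1: a read-only check.
  canSpend(amount: number): boolean {
    return this.remaining() >= amount;
  }

  // Naive pattern, step 3: increment after the provider call returns.
  increment(amount: number): void {
    this.used += amount;
  }

  // Reserve pattern: check and hold in a single step, with no await in between.
  reserve(amount: number): boolean {
    if (this.remaining() < amount) return false;
    this.held += amount;
    return true;
  }

  commit(amount: number): void {
    this.held -= amount;
    this.used += amount;
  }

  totalUsed(): number {
    return this.used;
  }
}

// Stands in for the slow provider call.
const fakeProviderCall = (): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, 10));

async function naiveRequest(ledger: QuotaLedger, amount: number): Promise<boolean> {
  if (!ledger.canSpend(amount)) return false;
  await fakeProviderCall(); // a parallel request runs its own check during this wait
  ledger.increment(amount);
  return true;
}

async function reservedRequest(ledger: QuotaLedger, amount: number): Promise<boolean> {
  if (!ledger.reserve(amount)) return false;
  await fakeProviderCall();
  ledger.commit(amount);
  return true;
}

async function runDemo() {
  const naive = new QuotaLedger(4000);
  const naiveResults = await Promise.all([
    naiveRequest(naive, 4000),
    naiveRequest(naive, 4000),
  ]);

  const safe = new QuotaLedger(4000);
  const safeResults = await Promise.all([
    reservedRequest(safe, 4000),
    reservedRequest(safe, 4000),
  ]);

  return { naiveResults, naiveUsed: naive.totalUsed(), safeResults, safeUsed: safe.totalUsed() };
}
```

With a quota of 4,000 and two parallel 4,000-token requests, the naive flow admits both and records 8,000 tokens, while the reserve flow admits exactly one.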
Streaming responses
For streaming responses, you do not know the actual completion token count until the stream ends. The pattern is the same: reserve an upper bound, stream the response, commit at the end, and track a refund for the unused portion. The reservation holds quota during the stream so a parallel request cannot race past you.
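The streaming flow can be sketched as an async generator. This is a sketch under stated assumptions, not the SDK's own helper: the QuotaClient interface mirrors the reserve / commit / release / track calls used earlier in this guide, but its types are inferred rather than taken from the SDK, and StreamChunk is a hypothetical simplified chunk shape. With OpenAI, a usage report on the final chunk is available when the request sets stream_options: { include_usage: true }; you would adapt the real chunks to this shape.

```typescript
// Minimal shape of the quota calls used in this guide. The parameter and return
// types are assumptions inferred from the earlier example, not the SDK's typings.
interface QuotaClient {
  reserve(userId: string, eventType: string, amount: number):
    Promise<{ allowed: boolean; reservationId?: string }>;
  commit(reservationId: string): Promise<unknown>;
  release(reservationId: string): Promise<unknown>;
  track(userId: string, eventType: string, amount: number,
    meta?: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical, simplified chunk: text deltas plus an optional usage report.
interface StreamChunk {
  text?: string;
  usage?: { prompt_tokens: number; completion_tokens: number };
}

async function* streamWithQuota(
  quota: QuotaClient,
  userId: string,
  upperBound: number,
  startStream: () => AsyncIterable<StreamChunk>,
): AsyncGenerator<string, void, void> {
  // Hold the upper bound for the whole duration of the stream.
  const reservation = await quota.reserve(userId, "openai.chat", upperBound);
  if (!reservation.allowed || !reservation.reservationId) {
    throw new Error("limit_reached");
  }
  const reservationId = reservation.reservationId;

  let used: number | undefined;
  try {
    for await (const chunk of startStream()) {
      if (chunk.text) yield chunk.text;
      if (chunk.usage) {
        used = chunk.usage.prompt_tokens + chunk.usage.completion_tokens;
      }
    }
  } catch (err) {
    // The stream failed part-way: give the held quota back.
    await quota.release(reservationId);
    throw err;
  }

  // The stream ended normally: commit, then refund the unused portion.
  await quota.commit(reservationId);
  if (used !== undefined && used < upperBound) {
    await quota.track(userId, "openai.chat.refund", upperBound - used, { reservationId });
  }
}
```

Two caveats in this sketch: if the stream never reports usage, nothing is refunded and the full upper bound stays charged; and if the consumer stops iterating early, neither commit nor release runs, so the hold lasts until the reservation auto-releases.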
Multiple OpenAI models with different costs
If gpt-4o-mini and gpt-4o have different per-token costs to you, model that with multiple limit groups: a "total tokens" group plus a "premium tokens" group with match rule { model: ["gpt-4o", "gpt-4-turbo"] }. One event hits both. Free users get 0 premium tokens, paid users get 50,000.
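The "one event hits both" behaviour can be illustrated with a small in-memory model. Everything here is hypothetical: the real plan and match rules are configured in the dashboard, and the sketch assumes an event is allowed only when every matching group still has room.

```typescript
// Hypothetical shapes for illustration only.
interface UsageEvent {
  eventType: string;
  model: string;
  tokens: number;
}

interface LimitGroup {
  name: string;
  quota: number;
  used: number;
  models?: string[]; // match rule; undefined matches every model
}

const PREMIUM_MODELS = ["gpt-4o", "gpt-4-turbo"];

// A plan with a "total tokens" group and a model-restricted "premium tokens" group.
function makePlan(totalQuota: number, premiumQuota: number): LimitGroup[] {
  return [
    { name: "total tokens", quota: totalQuota, used: 0 },
    { name: "premium tokens", quota: premiumQuota, used: 0, models: PREMIUM_MODELS },
  ];
}

// Every group whose match rule accepts the event's model.
function matchingGroups(event: UsageEvent, groups: LimitGroup[]): LimitGroup[] {
  return groups.filter((g) => g.models === undefined || g.models.includes(event.model));
}

// Assumed semantics: record the event only if all matching groups have room.
function tryRecord(event: UsageEvent, groups: LimitGroup[]): boolean {
  const hit = matchingGroups(event, groups);
  const fits = hit.every((g) => g.used + event.tokens <= g.quota);
  if (!fits) return false;
  for (const g of hit) g.used += event.tokens;
  return true;
}
```

Under these assumptions, a free plan built with makePlan(100_000, 0) rejects any gpt-4o event but accepts gpt-4o-mini events, while a paid plan built with makePlan(100_000, 50_000) counts a gpt-4o event against both groups.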
Adding Anthropic, Gemini, or other providers
Same pattern. Use a different event type - "anthropic.chat", "gemini.chat" - and a separate (or shared) limit group. AIPricingLab does not know or care which provider you are calling.
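One way to reuse the pattern across providers is a small generic wrapper. This is a sketch, not part of the SDK: withQuota and the QuotaClient interface are hypothetical names, the interface types are inferred from the earlier example, and the `${eventType}.refund` naming simply extends the "openai.chat.refund" convention used above.

```typescript
// Minimal shape of the quota calls used in this guide. The parameter and return
// types are assumptions inferred from the earlier example, not the SDK's typings.
interface QuotaClient {
  reserve(userId: string, eventType: string, amount: number):
    Promise<{ allowed: boolean; reservationId?: string }>;
  commit(reservationId: string): Promise<unknown>;
  release(reservationId: string): Promise<unknown>;
  track(userId: string, eventType: string, amount: number,
    meta?: Record<string, unknown>): Promise<unknown>;
}

interface QuotaOptions {
  userId: string;
  eventType: string; // e.g. "anthropic.chat" or "gemini.chat"
  upperBound: number;
}

async function withQuota<T>(
  quota: QuotaClient,
  opts: QuotaOptions,
  call: () => Promise<T>,             // the provider call
  tokensUsed: (result: T) => number,  // how to read actual usage from the result
): Promise<T> {
  const reservation = await quota.reserve(opts.userId, opts.eventType, opts.upperBound);
  if (!reservation.allowed || !reservation.reservationId) {
    throw new Error("limit_reached");
  }
  const reservationId = reservation.reservationId;

  let result: T;
  try {
    result = await call();
  } catch (err) {
    // Only a failed provider call releases the hold.
    await quota.release(reservationId);
    throw err;
  }

  await quota.commit(reservationId);
  const used = tokensUsed(result);
  if (used < opts.upperBound) {
    await quota.track(opts.userId, `${opts.eventType}.refund`, opts.upperBound - used, {
      reservationId,
    });
  }
  return result;
}
```

For an Anthropic Messages response, tokensUsed would typically read usage.input_tokens + usage.output_tokens. Keeping the provider call and the usage reader as parameters is the design choice that keeps the quota logic free of provider-specific code.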
Frequently asked questions
Do I need to log every individual OpenAI call?
You only need to track what counts toward limits. The event row is persisted regardless, so you have an audit trail. AIPricingLab does not store prompt or response content - only event metadata you choose.
What if my server crashes between reserve and commit?
The reservation auto-releases after 60 seconds. The user gets their quota back. Worst case: a 60-second window during which their effective quota is reduced.
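The effect of the auto-release can be modelled with a tiny in-memory ledger and an injectable clock. HoldLedger is a hypothetical illustration, not the service's implementation; in particular, returning false when committing an expired reservation is an assumption of this sketch.

```typescript
// Hypothetical in-memory model of holds that expire after a TTL.
class HoldLedger {
  private readonly quota: number;
  private readonly ttlMs: number;
  private readonly now: () => number;
  private used = 0;
  private nextId = 1;
  private holds = new Map<string, { amount: number; expiresAt: number }>();

  constructor(quota: number, ttlMs: number, now: () => number) {
    this.quota = quota;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Drop every hold whose TTL has passed, returning its quota to the user.
  private dropExpired(): void {
    const t = this.now();
    this.holds.forEach((hold, id) => {
      if (hold.expiresAt <= t) this.holds.delete(id);
    });
  }

  remaining(): number {
    this.dropExpired();
    let held = 0;
    this.holds.forEach((hold) => {
      held += hold.amount;
    });
    return this.quota - this.used - held;
  }

  // Returns a reservation id, or null when the quota cannot cover the amount.
  reserve(amount: number): string | null {
    if (this.remaining() < amount) return null;
    const id = `res_${this.nextId++}`;
    this.holds.set(id, { amount, expiresAt: this.now() + this.ttlMs });
    return id;
  }

  // Assumption: committing an expired or unknown reservation fails.
  commit(id: string): boolean {
    this.dropExpired();
    const hold = this.holds.get(id);
    if (!hold) return false;
    this.holds.delete(id);
    this.used += hold.amount;
    return true;
  }
}
```

With a 100,000-token quota and a 60-second TTL, a 4,000-token reservation that is never committed leaves 96,000 tokens available until the TTL passes, after which the full 100,000 is available again.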
Can I use this without a paid plan if I am just testing?
Yes. AIPricingLab is free up to 1M events / month with no credit card required.
How does this differ from rate-limiting middleware?
Rate limits are short-window (per second / per minute) and provider-agnostic. AIPricingLab quotas are long-window (per day / per month) and plan-aware. Use both - they are complementary.
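A minimal sketch of using both layers together, assuming a simple fixed-window limiter in front of the quota reservation. FixedWindowLimiter and admit are hypothetical names, and reserveQuota stands in for whatever long-window reserve call you use; a production rate limiter would usually be per-user and shared across server instances.

```typescript
// Short-window guard: at most `limit` calls per `windowMs`.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number, now: () => number) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  tryAcquire(): boolean {
    const t = this.now();
    if (t - this.windowStart >= this.windowMs) {
      // A new window has started: reset the counter.
      this.windowStart = t;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count++;
    return true;
  }
}

type Decision = "ok" | "rate_limited" | "limit_reached";

// Cheap short-window check first, then the long-window, plan-aware reservation.
function admit(limiter: FixedWindowLimiter, reserveQuota: () => boolean): Decision {
  if (!limiter.tryAcquire()) return "rate_limited";
  if (!reserveQuota()) return "limit_reached";
  return "ok";
}
```

Checking the rate limit first means a burst of requests is rejected without touching the quota, so a rate-limited call never consumes or holds any of the user's monthly allowance.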