AIPricingLab · Guide · 8 min read

How to track OpenAI API usage by user (with quotas, in real time)

Step-by-step guide to tracking OpenAI API usage per end-user with real-time quotas. Pure-TypeScript pattern using @vevee/sdk. Concurrency-safe, provider-agnostic, ten-minute integration.

Last updated: 2026-05-10

OpenAI does not give you a per-end-user usage view. The OpenAI dashboard shows your org-level token spend, but it does not know which of your users sent which prompts. To track usage by user - and gate calls when they hit a quota - you need a layer in your own code.

This guide shows the smallest correct implementation: a few lines of @vevee/sdk in your existing chat handler. No counter tables, no cron jobs, no lock primitives. About ten minutes of work end-to-end.

Step-by-step

1. Install the SDK

Add @vevee/sdk to your backend project. Zero runtime dependencies; works in Node, Deno, Bun, and edge runtimes.

pnpm add @vevee/sdk
# or npm install @vevee/sdk

2. Create an app and a plan in the dashboard

Sign up at vevee.org (free up to 1M events/month). Create a workspace, then an app. In the app, define a plan with a limit group: name "tokens", unit "tokens", and a quota of your choice (e.g. 100,000 / month, calendar-anchored). Set the match rule to event_type = "openai.chat".

3. Initialize the client on your server

Use your sk_live_ key from the dashboard. Keep it server-only - never ship it to the browser.

import { createClient } from "@vevee/sdk";

export const vevee = createClient({
  apiKey: process.env.VEVEE_KEY!,
});

4. Wrap your OpenAI call with reserve / commit / release

Reserve an upper-bound token estimate before the call. On success, commit the reservation and track a refund for the unused difference. On failure, release the reservation. Uncommitted reservations auto-release after 60 seconds.

import OpenAI from "openai";
import { vevee } from "./vevee";

const openai = new OpenAI();

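// Upper-bound token estimate reserved per call (prompt + completion).
const RESERVE_TOKENS = 4000;

export async function chat(
  userId: string,
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
) {
  // Atomically hold quota before spending any tokens.
  const reservation = await vevee.reserve(userId, "openai.chat", RESERVE_TOKENS, {
    model: "gpt-4o",
  });
  if (!reservation.allowed) {
    throw new Error("limit_reached");
  }
  const reservationId = reservation.reservationId!;

  let res;
  try {
    res = await openai.chat.completions.create({ model: "gpt-4o", messages });
  } catch (err) {
    // The OpenAI call failed, so hand the held quota back.
    await vevee.release(reservationId);
    throw err;
  }

  // Commit and refund sit outside the try block, so a failure here never
  // calls release on a reservation that has already been committed.
  await vevee.commit(reservationId);

  // If usage data is missing, skip the refund instead of refunding the full hold.
  const used = res.usage
    ? res.usage.prompt_tokens + res.usage.completion_tokens
    : RESERVE_TOKENS;

  if (used < RESERVE_TOKENS) {
    await vevee.track(userId, "openai.chat.refund", RESERVE_TOKENS - used, {
      reservationId,
    });
  }

  return res;
}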
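
Step 4 hard-codes 4000 as the reservation size. The sketch below shows one way to derive that number per request instead. It assumes a rough heuristic of about three characters per token plus a small per-message overhead. The helper estimateReservation and its constants are hypothetical and not part of @vevee/sdk or the OpenAI SDK. It is a heuristic, not a guaranteed bound, so it works best alongside an explicit completion cap on the OpenAI request.

```typescript
// Hypothetical helper: the names and constants here are assumptions for illustration.
interface ChatMessage {
  role: string;
  content: string;
}

const CHARS_PER_TOKEN = 3; // deliberately pessimistic; English text averages closer to 4
const PER_MESSAGE_OVERHEAD = 4; // rough allowance for role and message framing tokens

function estimateReservation(
  messages: ChatMessage[],
  maxCompletionTokens: number,
): number {
  let promptTokens = 0;
  for (const m of messages) {
    promptTokens +=
      Math.ceil(m.content.length / CHARS_PER_TOKEN) + PER_MESSAGE_OVERHEAD;
  }
  // Reserve for the worst case: estimated prompt plus the full completion cap.
  return promptTokens + maxCompletionTokens;
}

const estimate = estimateReservation(
  [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize this paragraph in one sentence." },
  ],
  500,
);
```

The result would replace the fixed 4000 in the reserve call, and the same number is what the refund is computed against after the response arrives.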

5. Assign a plan to each user

When a user signs up (or upgrades), call vevee.upsertSubscription. The call is idempotent - calling it again with the same plan ID is a no-op, so it is safe to call on every login.

await vevee.upsertSubscription({
  userId: "user_8FbKq2",
  planId: "plan_free",
});

6. Show usage to the user

On the client, use a pk_live_ public key. It can only read the calling user's own counters - safe to ship in browser code.

// In the browser
import { createClient } from "@vevee/sdk";

// PK_LIVE_KEY is your pk_live_ public key; currentUserId is the signed-in user's ID.
const vevee = createClient({ apiKey: PK_LIVE_KEY });

const usage = await vevee.usage(currentUserId);
// Pick out the counter for the token limit group.
const tokens = usage.counters.find(c => c.label === "Tokens");
// → { quota: 100_000, count: 8_421, remaining: 91_579 }
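
To turn that counter into something a UI can render, a small pure helper is usually enough. This is a minimal sketch that assumes only the { quota, count, remaining } shape shown in the comment above. The helper summarizeCounter and the 80% warning threshold are illustrative choices, not SDK features.

```typescript
// Counter shape taken from the example output above; everything else is hypothetical.
interface Counter {
  quota: number;
  count: number;
  remaining: number;
}

type UsageLevel = "ok" | "warning" | "exhausted";

function summarizeCounter(
  c: Counter,
  warnAt: number = 0.8,
): { percentUsed: number; level: UsageLevel } {
  // A zero quota (for example, a group the plan does not include) counts as fully used.
  const ratio = c.quota > 0 ? Math.min(c.count / c.quota, 1) : 1;
  const level: UsageLevel =
    c.remaining <= 0 ? "exhausted" : ratio >= warnAt ? "warning" : "ok";
  return { percentUsed: Math.round(ratio * 100), level };
}

const summary = summarizeCounter({ quota: 100_000, count: 8_421, remaining: 91_579 });
// summary is { percentUsed: 8, level: "ok" }
```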

Why this works under concurrency

The naive pattern - check if the user is allowed, call OpenAI, increment the counter - has a race: two parallel requests can both pass the check before either has incremented. With reserve / commit / release, the reserve is atomic and creates a 60-second hold. Between them, two parallel requests cannot reserve more than the user has remaining.
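
The difference is easy to see in a toy in-memory model. The sketch below is only an illustration of the two patterns with the interleaving written out by hand. It is not how the hosted service is implemented, and the class names are made up for this example.

```typescript
// Toy model for illustration only; class names are hypothetical.
class NaiveCounter {
  used = 0;
  quota: number;
  constructor(quota: number) {
    this.quota = quota;
  }
  check(amount: number): boolean {
    return this.used + amount <= this.quota;
  }
  increment(amount: number): void {
    this.used += amount;
  }
}

class ReservingCounter {
  committed = 0;
  held = 0;
  quota: number;
  constructor(quota: number) {
    this.quota = quota;
  }
  // Check and hold happen in one step, so no other request can slip in between.
  reserve(amount: number): boolean {
    if (this.committed + this.held + amount > this.quota) return false;
    this.held += amount;
    return true;
  }
  commit(amount: number): void {
    this.held -= amount;
    this.committed += amount;
  }
  release(amount: number): void {
    this.held -= amount;
  }
}

// Two parallel requests of 60 tokens each against a 100-token quota.
const naive = new NaiveCounter(100);
const aPasses = naive.check(60); // request A checks
const bPasses = naive.check(60); // request B checks before A has incremented
if (aPasses) naive.increment(60);
if (bPasses) naive.increment(60);
// naive.used is now 120: the quota was overshot by 20.

const safe = new ReservingCounter(100);
const aHeld = safe.reserve(60); // true: 60 of 100 is now held
const bHeld = safe.reserve(60); // false: only 40 remain
if (aHeld) safe.commit(60);
```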

Streaming responses

For streaming responses, you do not know the actual completion token count until the stream ends. The pattern is the same: reserve an upper bound, stream the response, commit at the end, and track a refund for the unused portion. The reservation holds quota during the stream so a parallel request cannot race past you.
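
One way to get the final count is to fold each chunk into a running state and read the usage off the last chunk. The sketch below assumes the OpenAI request was made with stream: true and stream_options: { include_usage: true }, in which case the final chunk carries a usage object and an empty choices array. The chunk type is trimmed to the fields used here, and foldChunk and refundAmount are hypothetical helpers.

```typescript
// Minimal chunk shape, modeled on OpenAI chat completion stream chunks.
interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
}

interface StreamState {
  text: string;
  usedTokens: number | null; // stays null until a chunk reports usage
}

function foldChunk(state: StreamState, chunk: StreamChunk): StreamState {
  const piece = chunk.choices[0]?.delta.content ?? "";
  const usedTokens = chunk.usage
    ? chunk.usage.prompt_tokens + chunk.usage.completion_tokens
    : state.usedTokens;
  return { text: state.text + piece, usedTokens };
}

// Refund only when usage is known; otherwise keep the full reservation charged.
function refundAmount(reserved: number, usedTokens: number | null): number {
  if (usedTokens === null) return 0;
  return Math.max(reserved - usedTokens, 0);
}

// Simulated stream: two content chunks, then the usage-only final chunk.
const chunks: StreamChunk[] = [
  { choices: [{ delta: { content: "Hel" } }], usage: null },
  { choices: [{ delta: { content: "lo" } }], usage: null },
  { choices: [], usage: { prompt_tokens: 12, completion_tokens: 2 } },
];

let state: StreamState = { text: "", usedTokens: null };
for (const chunk of chunks) {
  state = foldChunk(state, chunk);
}
const refund = refundAmount(4000, state.usedTokens);
```

In a real handler the loop would be a for await over the stream returned by the OpenAI SDK, followed by the same commit and refund calls as in step 4. If the client disconnects mid-stream and no usage chunk arrives, keeping the full reservation charged is the conservative choice.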

Multiple OpenAI models with different costs

If gpt-4o-mini and gpt-4o have different per-token costs to you, model that with multiple limit groups: a "total tokens" group plus a "premium tokens" group with match rule { model: ["gpt-4o", "gpt-4-turbo"] }. A single gpt-4o event counts against both groups. For example, free users get 0 premium tokens and paid users get 50,000.
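
To make "one event hits both" concrete, here is a small sketch of the matching logic as described above. It is an illustration written for this guide, not the service's actual rule engine, and the UsageEvent and LimitGroup types are assumptions.

```typescript
// Hypothetical types: a simplified view of events and limit-group match rules.
interface UsageEvent {
  eventType: string;
  model: string;
  amount: number;
}

interface LimitGroup {
  name: string;
  eventType: string;
  models?: string[]; // when omitted, the group matches every model
}

function matchingGroups(event: UsageEvent, groups: LimitGroup[]): string[] {
  return groups
    .filter((g) => g.eventType === event.eventType)
    .filter((g) => g.models === undefined || g.models.includes(event.model))
    .map((g) => g.name);
}

const groups: LimitGroup[] = [
  { name: "total tokens", eventType: "openai.chat" },
  {
    name: "premium tokens",
    eventType: "openai.chat",
    models: ["gpt-4o", "gpt-4-turbo"],
  },
];

const premiumHit = matchingGroups(
  { eventType: "openai.chat", model: "gpt-4o", amount: 1200 },
  groups,
);
// premiumHit is ["total tokens", "premium tokens"]

const miniHit = matchingGroups(
  { eventType: "openai.chat", model: "gpt-4o-mini", amount: 1200 },
  groups,
);
// miniHit is ["total tokens"]
```

With a premium quota of 0 on the free plan, the gpt-4o event would be rejected by the premium group even if the total group still has room.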

Adding Anthropic, Gemini, or other providers

Same pattern. Use a different event type - "anthropic.chat", "gemini.chat" - and a separate (or shared) limit group. AIPricingLab does not know or care which provider you are calling.
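
The only provider-specific part is reading the token counts off the response, since each API names them differently. A minimal sketch of a normalizer follows. The field names reflect the response shapes of the OpenAI, Anthropic, and Gemini APIs as commonly documented, so check them against the current provider docs. The ProviderUsage type, totalTokens, and eventTypeFor are hypothetical helpers.

```typescript
// Trimmed response shapes; only the usage fields are modeled here.
type ProviderUsage =
  | {
      provider: "openai";
      usage: { prompt_tokens: number; completion_tokens: number };
    }
  | {
      provider: "anthropic";
      usage: { input_tokens: number; output_tokens: number };
    }
  | {
      provider: "gemini";
      usageMetadata: { promptTokenCount: number; candidatesTokenCount: number };
    };

// Collapse each provider's usage report into one prompt + completion total.
function totalTokens(r: ProviderUsage): number {
  switch (r.provider) {
    case "openai":
      return r.usage.prompt_tokens + r.usage.completion_tokens;
    case "anthropic":
      return r.usage.input_tokens + r.usage.output_tokens;
    case "gemini":
      return (
        r.usageMetadata.promptTokenCount + r.usageMetadata.candidatesTokenCount
      );
  }
}

// Event type naming follows the convention used in this guide.
function eventTypeFor(provider: ProviderUsage["provider"]): string {
  return `${provider}.chat`;
}

const anthropicTotal = totalTokens({
  provider: "anthropic",
  usage: { input_tokens: 900, output_tokens: 250 },
});
// anthropicTotal is 1150, tracked under eventTypeFor("anthropic") === "anthropic.chat"
```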

Frequently asked questions

Do I need to log every individual OpenAI call?

You only need to track what counts toward limits. The event row is persisted regardless, so you have an audit trail. AIPricingLab does not store prompt or response content - only event metadata you choose.

What if my server crashes between reserve and commit?

The reservation auto-releases after 60 seconds. The user gets their quota back. Worst case: a 60-second window during which their effective quota is reduced.
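
The effect of an abandoned hold can be sketched with a small ledger in which expired holds simply stop counting. This is a toy model under the 60-second expiry stated above, with time passed in explicitly so the behavior is easy to follow. QuotaLedger is a hypothetical name and does not describe the service's internals.

```typescript
// Toy model for illustration only; names are hypothetical.
interface Hold {
  amount: number;
  expiresAt: number; // milliseconds on the same clock as `now`
}

const HOLD_TTL_MS = 60_000;

class QuotaLedger {
  committed = 0;
  holds = new Map<string, Hold>();
  quota: number;
  constructor(quota: number) {
    this.quota = quota;
  }

  remaining(now: number): number {
    let held = 0;
    for (const h of this.holds.values()) {
      if (h.expiresAt > now) held += h.amount; // expired holds no longer count
    }
    return this.quota - this.committed - held;
  }

  reserve(id: string, amount: number, now: number): boolean {
    if (amount > this.remaining(now)) return false;
    this.holds.set(id, { amount, expiresAt: now + HOLD_TTL_MS });
    return true;
  }
}

const ledger = new QuotaLedger(100_000);
ledger.reserve("r1", 4000, 0); // the server crashes here, before commit or release
const duringHold = ledger.remaining(30_000); // 96_000 while the hold is active
const afterExpiry = ledger.remaining(60_000); // 100_000 once the hold has expired
```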

Can I use this without a paid plan if I am just testing?

Yes. AIPricingLab is free up to 1M events / month with no credit card required.

How does this differ from rate-limiting middleware?

Rate limits are short-window (per second / per minute) and provider-agnostic. AIPricingLab quotas are long-window (per day / per month) and plan-aware. Use both - they are complementary.
