AIPricingLab Guide · 7 min read

How to implement per-user rate limits in your AI app

Per-user rate limits for AI apps need atomic enforcement, plan awareness, and refundable reservations. Here is the pattern that works under load - using @vevee/sdk and ~30 lines of code.

Last updated: 2026-05-10

Most rate-limiting tools (Upstash Redis, Cloudflare WAF, express-rate-limit) count requests per IP or user ID against a fixed window. That is fine for protecting an HTTP endpoint, but wrong for gating an AI call.

AI apps need rate limits that are: per-end-user (not per-IP), plan-aware (free vs paid get different limits), atomic under concurrency (two parallel calls can't both claim the last unit of quota), and refundable when the AI call fails.

This guide walks through the right pattern.

Step-by-step

1. Decide on the unit and window

Common combos: requests/minute (anti-abuse), images/day (cost cap), tokens/month (margin cap). You can stack multiple - AIPricingLab applies all matching limit groups to a single event.
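To make the unit-and-window choice concrete, here is a minimal sketch of how a limit could be described as data and how the start of its current window could be computed. The `Limit` type, its field names, and the UTC calendar alignment are assumptions for illustration, not AIPricingLab's actual schema or reset behavior.

```typescript
// Hypothetical shapes for illustration; not AIPricingLab's real schema.
type Unit = "requests" | "images" | "tokens";
type LimitWindow = "minute" | "day" | "month";

interface Limit {
  purpose: string; // e.g. "anti-abuse", "cost cap", "margin cap"
  unit: Unit;
  per: LimitWindow;
}

// The three combos named above, expressed as data.
const commonCombos: Limit[] = [
  { purpose: "anti-abuse", unit: "requests", per: "minute" },
  { purpose: "cost cap", unit: "images", per: "day" },
  { purpose: "margin cap", unit: "tokens", per: "month" },
];

// Start of the window that contains `now`.
// Assumption: windows align to UTC calendar boundaries.
function windowStart(now: Date, per: LimitWindow): Date {
  const y = now.getUTCFullYear();
  const m = now.getUTCMonth();
  const d = now.getUTCDate();
  switch (per) {
    case "minute":
      return new Date(Date.UTC(y, m, d, now.getUTCHours(), now.getUTCMinutes()));
    case "day":
      return new Date(Date.UTC(y, m, d));
    case "month":
      return new Date(Date.UTC(y, m, 1));
  }
}
```

Short windows suit abuse protection because they reset quickly; long windows suit cost and margin caps because they track what a user costs over a billing period.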

2. Define limit groups in the dashboard

Create one limit group per cap. For example: "anti-abuse: 60 requests / minute / user", "cost cap: 100 image renders / month / user", "margin cap: 100k tokens / month / user". Match rules decide which events count against each group.
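Limit groups are configured in the dashboard rather than in code, but it can help to picture the routing as data. The sketch below is a hypothetical model: `LimitGroup`, `MatchRule`, and `groupsFor` are illustrative names, and the real match-rule options may differ.

```typescript
// Hypothetical data shapes; the real dashboard configuration may look different.
interface UsageEvent {
  eventName: string; // e.g. "image.render"
  model?: string; // e.g. "flux-pro"
}

interface MatchRule {
  eventName?: string; // undefined means "any event"
  model?: string; // undefined means "any model"
}

interface LimitGroup {
  label: string;
  quota: number;
  per: "minute" | "month";
  match: MatchRule;
}

const limitGroups: LimitGroup[] = [
  { label: "anti-abuse", quota: 60, per: "minute", match: {} },
  { label: "cost cap", quota: 100, per: "month", match: { eventName: "image.render" } },
];

function ruleMatches(rule: MatchRule, ev: UsageEvent): boolean {
  const nameOk = rule.eventName === undefined || rule.eventName === ev.eventName;
  const modelOk = rule.model === undefined || rule.model === ev.model;
  return nameOk && modelOk;
}

// Every matching group applies to the event, not just the first one.
function groupsFor(ev: UsageEvent, all: LimitGroup[]): LimitGroup[] {
  return all.filter((g) => ruleMatches(g.match, ev));
}
```

In this model an "image.render" event counts against both groups, while any other event counts only against the anti-abuse group.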

3. Reserve before the AI call

reserve() atomically holds quota. If allowed=false, return a clean 429 to the caller - do not call OpenAI / Replicate / etc.

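// Hold 1 unit of quota before spending money on the provider call.
const r = await vevee.reserve(userId, "image.render", 1, { model: "flux-pro" });
if (!r.allowed) {
  // Denied: answer with 429 and skip the AI call entirely.
  return new Response(
    JSON.stringify({ error: "rate_limited", reasons: r.reasons }),
    { status: 429, headers: { "Content-Type": "application/json" } },
  );
}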
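To illustrate what "atomically holds quota" means, here is an in-memory sketch of a reservation ledger. It is not how AIPricingLab implements reserve(); `QuotaLedger` and its methods are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow. The point is that the check and the hold happen in one step, and that held quota counts against the limit until it is committed, released, or expires.

```typescript
// In-memory illustration only; names and structure are hypothetical.
interface Hold {
  id: string;
  amount: number;
  expiresAt: number; // epoch milliseconds
}

class QuotaLedger {
  private committed = 0;
  private holds: Hold[] = [];
  private nextId = 1;
  private readonly quota: number;
  private readonly holdMs: number;

  constructor(quota: number, holdMs: number) {
    this.quota = quota;
    this.holdMs = holdMs;
  }

  // Sum of live holds; expired holds are dropped and stop counting.
  private held(now: number): number {
    this.holds = this.holds.filter((h) => h.expiresAt > now);
    return this.holds.reduce((sum, h) => sum + h.amount, 0);
  }

  // Check and hold in one synchronous step, so no other caller can slip in between.
  reserve(amount: number, now: number): { allowed: boolean; reservationId?: string } {
    if (this.committed + this.held(now) + amount > this.quota) {
      return { allowed: false };
    }
    const id = `res_${this.nextId++}`;
    this.holds.push({ id, amount, expiresAt: now + this.holdMs });
    return { allowed: true, reservationId: id };
  }

  // Turn a live hold into permanent usage. Returns false if the hold is unknown or expired.
  commit(id: string, now: number): boolean {
    this.held(now);
    const hold = this.holds.find((h) => h.id === id);
    if (hold === undefined) return false;
    this.committed += hold.amount;
    this.holds = this.holds.filter((h) => h.id !== id);
    return true;
  }

  // Give the held quota back, for example after a failed AI call.
  release(id: string): void {
    this.holds = this.holds.filter((h) => h.id !== id);
  }
}
```

With a quota of 1, two reserve calls made back to back cannot both succeed: the first hold already counts against the limit when the second call is checked.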

4. Commit on success, release on failure

Wrap the AI call in try/catch. On success, commit the reservation. On failure, release it so the user gets the quota back. If your code crashes before doing either, the reservation auto-releases after 60 seconds.

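try {
  const result = await callAiProvider();
  // The call succeeded: turn the hold into recorded usage.
  await vevee.commit(r.reservationId!);
  return result;
} catch (err) {
  // The call failed: give the held quota back, then surface the error.
  await vevee.release(r.reservationId!);
  throw err;
}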
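Steps 3 and 4 can be folded into one reusable helper. The sketch below is written against a hypothetical `QuotaClient` interface modeled on the calls used in this guide (reserve, commit, release); the real @vevee/sdk types and signatures may differ, and `withQuota` and `Gated` are illustrative names.

```typescript
// Hypothetical interface modeled on the calls used in this guide;
// the real @vevee/sdk types may differ.
interface Reservation {
  allowed: boolean;
  reservationId?: string;
  reasons?: string[];
}

interface QuotaClient {
  reserve(userId: string, eventName: string, amount: number): Promise<Reservation>;
  commit(reservationId: string): Promise<void>;
  release(reservationId: string): Promise<void>;
}

type Gated<T> =
  | { ok: true; value: T }
  | { ok: false; reasons: string[] };

async function withQuota<T>(
  client: QuotaClient,
  userId: string,
  eventName: string,
  amount: number,
  work: () => Promise<T>,
): Promise<Gated<T>> {
  const r = await client.reserve(userId, eventName, amount);
  if (!r.allowed || r.reservationId === undefined) {
    // Denied: the provider is never called.
    return { ok: false, reasons: r.reasons ?? [] };
  }
  let value: T;
  try {
    value = await work();
  } catch (err) {
    // The AI call failed, so hand the held quota back before rethrowing.
    await client.release(r.reservationId);
    throw err;
  }
  // Commit outside the try block so a commit error does not trigger a release.
  await client.commit(r.reservationId);
  return { ok: true, value };
}
```

One design choice to note: this helper commits after the try block, so an error thrown by commit itself is not followed by a release of the same reservation. Whether that matters depends on how the service treats a release after a partial commit, which this guide does not specify.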

5. Surface remaining quota in your response headers

Many rate-limited APIs return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers. Read the values from the reservation's counters and include them in your response so clients can self-throttle.

// With stacked limits there can be several counters; this reports the first one.
const counter = r.counters[0];
const headers = new Headers();
if (counter) {
  headers.set("X-RateLimit-Limit", String(counter.quota));
  headers.set("X-RateLimit-Remaining", String(counter.remaining));
  if (counter.resetsAt) {
    headers.set("X-RateLimit-Reset", counter.resetsAt);
  }
}
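When several limits apply to one event, the first counter is not necessarily the one the user will hit first. One reasonable choice is to report the counter closest to exhaustion. The sketch below assumes the field names from the snippet above (`quota`, `remaining`, `resetsAt`) and assumes `resetsAt` is a string; `tightest` and `rateLimitHeaders` are hypothetical helpers, and the headers are returned as a plain object so the example has no framework dependency.

```typescript
// Field names follow the snippet above; the exact SDK types are an assumption.
interface Counter {
  quota: number;
  remaining: number;
  resetsAt?: string;
}

// Pick the counter with the least quota left, since clients will hit that one first.
function tightest(counters: Counter[]): Counter | undefined {
  let best: Counter | undefined;
  for (const c of counters) {
    if (best === undefined || c.remaining < best.remaining) best = c;
  }
  return best;
}

function rateLimitHeaders(counters: Counter[]): Record<string, string> {
  const headers: Record<string, string> = {};
  const c = tightest(counters);
  if (c === undefined) return headers; // no counters, nothing to report
  headers["X-RateLimit-Limit"] = String(c.quota);
  headers["X-RateLimit-Remaining"] = String(Math.max(0, c.remaining));
  if (c.resetsAt !== undefined) headers["X-RateLimit-Reset"] = c.resetsAt;
  return headers;
}
```

Comparing absolute `remaining` values works when the stacked limits share a unit. If they do not (requests versus tokens, for example), comparing the remaining fraction of each quota may be a better fit.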

Plan-aware limits

Free users get tighter limits than paid users. Implement this by setting different quotas per plan, then assigning each user the right plan via vevee.upsertSubscription. When they upgrade, you change the plan; counters and rules adjust automatically.
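The relationship between plan and quota can be pictured as a lookup table. The sketch below is a local model only: plan names and quota values are placeholders, `setPlan` stands in for what vevee.upsertSubscription would record (its real signature is not shown in this guide), and the fallback to the free plan for unknown users is an assumption.

```typescript
// Plan names and quota values are placeholders for illustration.
type Plan = "free" | "paid";

const monthlyImageQuota: Record<Plan, number> = { free: 20, paid: 500 };

// Local stand-in for the subscription records the service would keep.
const subscriptions: Record<string, Plan> = {};

function setPlan(userId: string, plan: Plan): void {
  subscriptions[userId] = plan;
}

function quotaFor(userId: string): number {
  // Assumption: users without a subscription are treated as free.
  const plan: Plan = subscriptions[userId] ?? "free";
  return monthlyImageQuota[plan];
}

// Usage already recorded this window carries over; only the ceiling changes on upgrade.
function remainingFor(userId: string, usedThisWindow: number): number {
  return Math.max(0, quotaFor(userId) - usedThisWindow);
}
```

In this model an upgrade is a single write to the subscription record. The usage count is untouched, so a user who was blocked at the free ceiling immediately has headroom under the paid one.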

Stacking short and long windows

You can have a "60 requests / minute" anti-abuse limit AND a "1000 requests / month" cost cap on the same plan. Both are checked. The user is blocked if either is at quota. AIPricingLab handles the stacking - your code does not need a separate library per window.
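The "blocked if either is at quota" rule can be sketched as an all-or-nothing check over every counter that applies. This is an in-memory illustration with hypothetical names (`StackedCounter`, `consume`); AIPricingLab performs the equivalent check on its side.

```typescript
// Illustrative only; the service performs the equivalent check server-side.
interface StackedCounter {
  label: string;
  quota: number;
  used: number;
}

interface Decision {
  allowed: boolean;
  reasons: string[]; // labels of the limits that blocked the request
}

// All-or-nothing: usage is recorded only if every stacked limit has room.
function consume(counters: StackedCounter[], amount: number): Decision {
  const reasons = counters
    .filter((c) => c.used + amount > c.quota)
    .map((c) => c.label);
  if (reasons.length > 0) {
    return { allowed: false, reasons };
  }
  for (const c of counters) {
    c.used += amount;
  }
  return { allowed: true, reasons: [] };
}
```

Recording usage only when every limit passes matters: a request blocked by the monthly cap should not also eat into the per-minute allowance.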

Avoiding accidental DoS on yourself

A subtle gotcha: your API endpoint itself may have its own connection-pool or worker limits. AIPricingLab gates AI calls but does not gate ingress. Pair AIPricingLab quotas with a Cloudflare or Vercel rate limit on the ingress to protect your own infrastructure from abuse.

When to use a separate Redis-based rate limiter

For sub-second, per-IP anti-DDoS bursts, a Redis-based rate limiter is still faster (lower latency, no network hop to AIPricingLab). The two coexist: an ingress rate limit at the edge (for example, 100 req/sec) absorbs bursts, while AIPricingLab quotas enforce the cost-relevant per-user limits.
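For a sense of what the ingress layer does, here is a minimal in-process token bucket keyed by IP. It is a sketch, not a production limiter: a real deployment would keep the state in Redis or at the CDN so that all instances share it. `TokenBucket` and `allowIngress` are hypothetical names, and the 100-token capacity with a 100-per-second refill is just one way to read "100 req/sec".

```typescript
// Minimal in-process token bucket; state is per process and not shared across instances.
class TokenBucket {
  private tokens: number;
  private last: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, now: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full so a fresh client can burst
    this.last = now;
  }

  // Returns true and spends one token if the request may proceed.
  // `now` is in epoch milliseconds.
  tryTake(now: number): boolean {
    const elapsedSec = Math.max(0, now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const buckets: Record<string, TokenBucket> = {};

function allowIngress(ip: string, now: number): boolean {
  let bucket = buckets[ip];
  if (bucket === undefined) {
    bucket = new TokenBucket(100, 100, now);
    buckets[ip] = bucket;
  }
  return bucket.tryTake(now);
}
```

A request would pass through `allowIngress` first and only then reach the reserve() call, so abusive traffic is rejected before it costs a round trip to the quota service.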

Frequently asked questions

Can I rate-limit by IP or do I have to use a user ID?

AIPricingLab is per-user-id. For pre-auth endpoints, pass a stable identifier like the IP or session ID as the userId. After auth, switch to the real user ID.
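A small helper can centralize that choice. The sketch below uses a hypothetical `CallerInfo` shape; adapt it to whatever your framework exposes. Prefixing anonymous identifiers is an assumption added here so that an IP or session string can never collide with a real user ID.

```typescript
// Hypothetical request summary; adapt to your framework.
interface CallerInfo {
  authUserId?: string; // set once the user is authenticated
  sessionId?: string; // set for anonymous sessions, if you issue them
  ip: string;
}

// Choose the identifier to pass as userId.
function quotaSubject(caller: CallerInfo): string {
  if (caller.authUserId) return caller.authUserId; // real user ID, unchanged
  if (caller.sessionId) return `anon:session:${caller.sessionId}`;
  return `anon:ip:${caller.ip}`;
}
```

Session IDs are usually a better anonymous key than IPs, since many users can share one IP behind a NAT or corporate proxy.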

How does this compare to express-rate-limit or @upstash/ratelimit?

Those are great for short-window per-IP HTTP rate limiting. AIPricingLab is for plan-aware, long-window, refundable-on-failure quotas tied to AI events. Use both.

How fast is reserve() - does it add latency?

reserve() is a single HTTP call to AIPricingLab - typically 30-100ms p99 from an EU or US region. If your AI call takes 2-10 seconds, this is a rounding error.

What if I want to allow unlimited usage on a plan?

Set the limit group quota to a very high number, or omit limit groups from that plan entirely. canUse and reserve return allowed=true when no limit group matches AND the plan opts in to unmatched_event allow.
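The decision described above can be written out as a small function. This is a sketch of the logic as stated in this answer, not the service's code: `MatchedCounter`, `UnmatchedPolicy`, and `isAllowed` are hypothetical names, and "deny" as the alternative to "allow" is an assumption.

```typescript
// Sketch of the decision logic; names and the "deny" value are assumptions.
interface MatchedCounter {
  quota: number;
  used: number;
}

type UnmatchedPolicy = "allow" | "deny";

function isAllowed(
  matched: MatchedCounter[],
  amount: number,
  unmatched: UnmatchedPolicy,
): boolean {
  if (matched.length === 0) {
    // No limit group matches: the plan's unmatched-event setting decides.
    return unmatched === "allow";
  }
  // Otherwise every matching group must have room.
  return matched.every((c) => c.used + amount <= c.quota);
}
```

Both routes to "unlimited" fall out of this: a very high quota keeps the event in the matched branch with room to spare, while omitting limit groups sends it down the unmatched branch, where the plan must opt in.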
