AI Image Detection Accuracy: How to Evaluate Detectors Without Getting Fooled
Every AI image detector marketing page advertises an accuracy number. 95%. 98.7%. 99.1%. 99.5%. The numbers look credible — they're at the level you'd expect a real classifier to hit — but they're nearly useless for buying decisions because they're not measuring the same thing.
This guide is for buyers, builders, and engineers who need to evaluate detection APIs before integrating them into something that matters. We'll cover what accuracy actually means, where the numbers come from, the metrics that matter more than headline accuracy, how to run your own evaluation, and the questions that separate honest vendors from optimistic ones.
What "accuracy" means — and doesn't
Headline accuracy in detection-API marketing typically means: the percentage of images in a benchmark dataset where the API's verdict matched ground truth. A claim of "99.1% accuracy" usually means: on a balanced test set of 10,000 images (5,000 real, 5,000 AI-generated), the detector correctly classified 9,910.
That's a real metric. It's also a misleading one for at least four reasons:
1. The benchmark composition is rarely disclosed. A detector trained heavily on Midjourney v6 outputs will look great on a benchmark heavy on Midjourney v6 outputs and underperform on Stable Diffusion or Flux outputs. Without knowing what's in the test set, "99% accuracy" tells you almost nothing about real-world performance.
2. Real-world image distributions are not 50/50. In a content moderation pipeline, AI-generated images might be 1% or 5% or 30% of the total — but rarely 50%. At a 1% positive rate, even a 99% accurate detector flags as many false positives (legitimate real images marked as AI) as true positives (AI images correctly caught). Accuracy stops being the right metric; the arithmetic sketch after this list makes the effect concrete.
3. Adversarial robustness isn't measured. Most public benchmarks include "clean" AI-generated images straight from generators. Real fraud and content-moderation cases involve images that have been compressed, downsampled, recompressed, color-shifted, and cropped — sometimes deliberately to defeat detectors. Accuracy on adversarial inputs is typically 5-20 percentage points lower than on clean test sets.
4. Calibration usually isn't reported. A detector that says "95% confidence AI" should be right 95% of the time it makes that claim. Many detectors aren't well-calibrated — they're over-confident at the extremes (claiming 99% confidence when reality is 80%). You can have a 99%-accurate detector whose confidence scores are useless for downstream risk-scoring.
The result: two detectors both advertising "99.1% accuracy" can perform very differently in production, sometimes by 10-20 absolute percentage points on the cases that matter to you.
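To make the base-rate point (item 2 above) concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the advertised 99% applies equally to catching AI images and to clearing real ones; real detectors are rarely that symmetric.

```python
# Back-of-the-envelope: a "99% accurate" detector at 1% AI prevalence.
# Assumes 99% sensitivity AND 99% specificity, which is an illustration,
# not a claim about any particular product.
total = 10_000
prevalence = 0.01                    # 1% of traffic is AI-generated
ai_images = total * prevalence       # 100
real_images = total - ai_images      # 9,900

true_positives = 0.99 * ai_images    # ~99 AI images correctly flagged
false_positives = 0.01 * real_images # ~99 real images wrongly flagged

precision = true_positives / (true_positives + false_positives)
print(f"flagged: {true_positives + false_positives:.0f} images, "
      f"precision: {precision:.0%}")  # roughly half the flags are wrong
```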
The metrics you should actually care about
For buying decisions, the metrics worth asking about, in priority order:
1. Per-model accuracy
Don't accept aggregate accuracy. Ask for the accuracy on each of:
- Midjourney v7 outputs (~30% of generative images in the wild today)
- Flux Pro outputs
- DALL-E 4 outputs
- Stable Diffusion 4 outputs
- Sora-image outputs (newest, hardest)
- Inpainted/outpainted images (where part of a real image has been edited with AI)
A vendor who can quote per-model accuracy is doing the homework. A vendor who can't is reporting averages over an opaque mix.
2. False positive rate at production threshold
If you're going to deploy the detector at, say, "flag any image with confidence > 70%", what's the false positive rate at that threshold? On real-world (not benchmark) data?
A detector with 99% overall accuracy can have a 5% false positive rate at the threshold you'll deploy at, which makes it operationally noisy. Always evaluate at the actual threshold you'll use.
3. Precision and recall, not accuracy
For imbalanced production data (where AI-generated images are a small minority of total), the relevant metrics are:
- Precision — of all images the detector flags as AI, what fraction actually are? High precision = low false positive rate = analysts trust the alerts.
- Recall — of all the AI images that pass through, what fraction does the detector catch? High recall = low false negative rate = bad actors don't slip through.
These trade off. You can tune any classifier to higher precision (fewer false positives, more false negatives) or higher recall (more false positives, fewer false negatives) by moving the decision threshold. The right tradeoff depends on the cost of each error type for your use case:
- Content moderation: false negatives (missed AI content) usually cost more than false positives (annoyed users), so tune for recall.
- Insurance fraud detection: false negatives (paid-out fraud) cost a lot more than false positives (claims sent to SIU), so tune for recall but expect higher SIU volume.
- Marketplace product listings: false positives (real-product listings flagged) frustrate sellers and hurt revenue, so tune for precision and accept that some AI-generated listings will slip through.
The vendor's reported "99% accuracy" tells you nothing about precision or recall. Ask for both.
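If you have a labeled test set, computing precision and recall at the thresholds you might actually deploy takes only a few lines. A minimal sketch, assuming y_true holds your ground-truth labels and y_score the detector's confidence output; the synthetic values below are stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a labeled eval set. In practice, y_true comes from your
# ground truth (1 = AI-generated, 0 = real) and y_score from the detector's
# confidence output; the synthetic values here are for illustration only.
y_true = rng.integers(0, 2, size=5000)
y_score = np.clip(0.65 * y_true + rng.normal(0.25, 0.2, size=5000), 0.0, 1.0)

for t in (0.5, 0.7, 0.9):
    flagged = y_score >= t
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    fn = np.sum(~flagged & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    print(f"threshold {t:.1f}: precision={precision:.2f} recall={recall:.2f}")
```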
4. Calibration
A well-calibrated detector's confidence scores match reality. When it says "85% confidence AI," it should be right 85% of the time. You can measure this with a calibration plot: bin predictions by confidence, then for each bin, compute the actual accuracy on that bin's predictions.
For risk-scoring downstream of the detector, calibration matters more than accuracy. If the detector is well-calibrated, your downstream risk model can combine its score with other signals and trust the result. If it's miscalibrated, you'll either over-flag (the detector is over-confident, so its high scores overstate the real risk) or under-flag (it's under-confident, so genuinely suspicious images sit at middling scores) in ways that are hard to predict.
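A minimal sketch of the binning procedure described above, using the same y_true / y_score convention as the earlier snippet (ground-truth labels and detector confidences from your own labeled data):

```python
import numpy as np

def calibration_report(y_true, y_score, n_bins=10):
    """Bin detector confidences and compare claimed vs. observed rates.

    y_true: 1 = AI-generated, 0 = real; y_score: the detector's confidence
    that the image is AI. Both should come from your own labeled eval set.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0  # expected calibration error: weighted gap between claim and reality
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (y_score >= lo) & ((y_score <= hi) if hi >= 1.0 else (y_score < hi))
        if not in_bin.any():
            continue
        claimed = y_score[in_bin].mean()   # average confidence in this bin
        observed = y_true[in_bin].mean()   # how often those images really were AI
        ece += in_bin.mean() * abs(claimed - observed)
        print(f"{lo:.1f}-{hi:.1f}: claimed {claimed:.2f}, "
              f"observed {observed:.2f}, n={in_bin.sum()}")
    print(f"expected calibration error: {ece:.3f}")
```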
5. Adversarial robustness
The honest test: take known AI-generated images, apply common adversarial transforms (re-compression, downsampling, slight color shift, cropping, a light layer of added noise), then evaluate. Many detectors lose 10-30 percentage points of accuracy on adversarial inputs.
Some vendors publish adversarial-robustness numbers. Most don't. If you're protecting against motivated adversaries (fraud, disinformation), this is the metric that determines whether you'll catch them.
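You can build an adversarial copy of a clean test set yourself with a few standard image transforms. A sketch using Pillow; the file names are placeholders, and the quality and scale values are plausible defaults rather than a standard protocol:

```python
from io import BytesIO
from PIL import Image

def recompress(img: Image.Image, quality: int = 60) -> Image.Image:
    """Round-trip through JPEG, roughly what a social-platform upload does."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

def downsample(img: Image.Image, scale: float = 0.5) -> Image.Image:
    """Shrink the image, as thumbnailing or messaging-app pipelines do."""
    w, h = img.size
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))),
                      Image.Resampling.LANCZOS)

# Build a transformed copy of a clean AI-generated test image, then run the
# detector on both versions and compare. The file names are placeholders.
original = Image.open("clean_ai_sample.png")
transformed = downsample(recompress(original), scale=0.5)
transformed.save("adversarial_ai_sample.jpg", quality=85)
```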
6. Latency
Detection that takes 5 seconds is unusable in real-time content moderation. Detection that takes 200ms is borderline. Detection that takes <100ms is invisible to users. Latency at p50, p95, and p99 matters for production deployment.
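Record per-request latency during your evaluation run and summarize it as percentiles rather than an average, since a long tail hides behind a good mean. A minimal sketch (the example values are synthetic):

```python
import numpy as np

# Per-request latencies (in ms) recorded during your evaluation run;
# the values below are synthetic placeholders.
latencies_ms = np.array([42, 55, 61, 48, 120, 300, 51, 47, 95, 58])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```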
7. Throughput and cost
For high-volume use cases (millions of scans per month), the cost per scan and the maximum throughput per second matter as much as accuracy. A 99.5% accurate detector that costs $0.05 per scan and tops out at 10 RPS isn't useful in a pipeline scanning 1M images per day. A 98.5% accurate detector at $0.001 per scan and 1,000 RPS often is.
8. Coverage of edge cases
Specific cases that matter:
- Inpainted/outpainted images — only part of the image is AI-generated. Does the detector return a region heatmap?
- Heavy real-world editing — HDR, beauty filters, aggressive denoising. Real photos that look "AI-like."
- Compressed and re-compressed — the path most images take through social platforms.
- Low-resolution thumbnails — sometimes the only image you have.
- Non-photographic styles — illustrations, anime, paintings, 3D renders. Most detectors struggle here.
How to run your own evaluation
If accuracy matters enough to make a buying decision around, run your own evaluation. The process:
1. Build a test set that matches your production distribution.
- 1,000 to 10,000 images is plenty
- Sample from your actual production traffic (or production-similar sources)
- Include a mix of: clean real photos, clean AI-generated images from the major models, edited real photos, compressed images, low-resolution images, edge cases relevant to your use case
- Establish ground truth labels — for AI images, know which generator produced them; for real images, ideally have provenance
2. Score each candidate API on your test set.
- Run every image through every API
- Record: verdict, confidence score, latency, region heatmap (if any), model attribution (if any)
- Be careful about rate limits and concurrent connections — most providers have free-tier limits that won't sustain a full evaluation; plan for paid usage during eval
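A sketch of this scoring step. The endpoint URL, request fields, and response keys below are hypothetical; adapt them to each vendor's actual documentation:

```python
import csv
import time
import requests

API_URL = "https://api.example-detector.com/v1/detect"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def score_image(path: str) -> dict:
    """Send one image to a candidate API and record what later analysis needs.

    The request format and response keys ("verdict", "confidence", "model")
    are illustrative, not any specific vendor's schema.
    """
    with open(path, "rb") as f:
        start = time.perf_counter()
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"image": f},
            timeout=30,
        )
    latency_ms = (time.perf_counter() - start) * 1000
    resp.raise_for_status()
    body = resp.json()
    return {
        "path": path,
        "verdict": body.get("verdict"),
        "confidence": body.get("confidence"),
        "model_attribution": body.get("model"),
        "latency_ms": round(latency_ms, 1),
    }

test_images = ["samples/real_001.jpg", "samples/ai_001.png"]  # your test set here

with open("results.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=[
        "path", "verdict", "confidence", "model_attribution", "latency_ms"])
    writer.writeheader()
    for image_path in test_images:
        writer.writerow(score_image(image_path))
        time.sleep(0.2)  # keep a margin under per-second rate limits
```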
3. Compute the metrics that matter for your use case.
- Confusion matrix at multiple thresholds (0.5, 0.7, 0.9 are reasonable starting points)
- Precision-recall curve
- Calibration plot
- Per-model breakdown
- Latency distribution
- Per-error-mode analysis (which categories of image does each API get wrong?)
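A sketch of the threshold and per-model breakdowns, assuming you've joined the scored results with your ground-truth labels into a CSV with confidence, is_ai, and generator columns (those column names are assumptions, not a standard format):

```python
import pandas as pd

# Scored results joined with ground truth; the column names (confidence,
# is_ai, generator) are assumptions about how you structured your labels.
df = pd.read_csv("results_with_labels.csv")

# Confusion matrix at several candidate thresholds.
for t in (0.5, 0.7, 0.9):
    pred = df["confidence"] >= t
    tp = int((pred & (df["is_ai"] == 1)).sum())
    fp = int((pred & (df["is_ai"] == 0)).sum())
    fn = int((~pred & (df["is_ai"] == 1)).sum())
    tn = int((~pred & (df["is_ai"] == 0)).sum())
    print(f"threshold {t}: TP={tp} FP={fp} FN={fn} TN={tn}")

# Per-model breakdown at the threshold you actually plan to deploy.
deploy_t = 0.7
df["correct"] = (df["confidence"] >= deploy_t) == (df["is_ai"] == 1)
print(df.groupby("generator")["correct"].mean().sort_values())
```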
4. Inspect the false positives and false negatives.
This is where the value is. The aggregate metrics tell you which API is best on average; the error analysis tells you why. Patterns matter:
- Does one API mostly fail on Sora-image outputs and the other mostly fail on Stable Diffusion? Choose based on which generators dominate your traffic.
- Does one API have a higher false positive rate on heavily-edited real photos? If your traffic includes lots of those, that's a dealbreaker.
- Does the calibration on borderline cases make sense to you? An API that's mostly wrong in the 50-70% confidence range is hard to use in a risk-scoring system.
5. Pilot in production before committing.
Run the chosen API on a small fraction (5-10%) of production traffic for 2-4 weeks. Compare its verdicts against your existing process or against another API in shadow mode. Real production data has distribution drift that a static test set never fully captures.
What honest vendor disclosures look like
Patterns of vendor honesty (signs you can trust the numbers):
- Discloses test set composition. Names the generators, the proportion of each, the version numbers, the date the eval was run.
- Reports per-model accuracy, not just aggregate.
- Publishes adversarial-robustness numbers on at least one common adversarial transform (recompression is the standard).
- Reports false positive rate at deployment threshold, not just accuracy.
- Provides a calibration plot or expected calibration error.
- Distinguishes accuracy on "clean" vs "in-the-wild" images.
- Shares sample false positives and false negatives so you can evaluate whether the error modes match your use case.
- Re-runs evals quarterly as new generators ship and old ones improve, with version-stamped results.
Patterns of vendor optimism (signs to be skeptical):
- A single accuracy number with no breakdown
- "Trained on millions of images" without specifying sources or composition
- Refuses to share a sample of failed cases
- Hasn't updated the published number in 12+ months despite multiple new generators shipping
- Marketing copy claims "100% accuracy" anywhere — that's never true; if a vendor says it, they're either misleading you or they don't understand the field
Our own numbers
For full disclosure: our AI Image Detector API currently reports 99.1% aggregate accuracy on our internal benchmark, which is composed roughly as follows:
- 25% Midjourney (v6, v7, mixed)
- 20% Flux (Schnell, Dev, Pro)
- 15% Stable Diffusion (v3, v4, plus several popular fine-tunes)
- 10% DALL-E (v3, v4)
- 10% Sora-image
- 5% inpainted/outpainted real photos
- 5% generated illustrations and paintings
- 10% real photos (mix of professional, smartphone, edited, and compressed)
Per-model accuracy varies from 96.5% (Sora-image, the newest and hardest) to 99.7% (DALL-E, which embeds C2PA manifests we cross-check). False positive rate at our default threshold (0.7 confidence) is 1.4% on real-world traffic. Adversarial-robustness on standard recompression: 96.2%.
We update the benchmark and re-train the model every 6-8 weeks as new generators ship. Our most recent benchmark and calibration plot are shared with Pro and Enterprise customers; we'll publish a public version on /changelog/ (coming soon — see our site audit for the planned page).
How to interpret accuracy claims in this market
A pragmatic mental model:
- Anything over 99% headline accuracy on a non-disclosed benchmark — treat as "probably 95-98% in your real conditions, maybe lower on adversarial inputs."
- Anything between 95% and 99% — likely honest within reasonable error bars.
- Anything below 90% — old or under-resourced detector; not competitive in 2026.
- "Up to 99.5% accuracy" in marketing copy — the "up to" is doing work; assume the worst-case is 5-10 points lower.
The detector you choose will be wrong on some images. The questions are: which images, how often, in which direction (false positive vs false negative), and how confident is it in its wrong answer? Only an evaluation on your own data answers those.
What makes a detector get better over time
If you're committing long-term to a detection vendor, ask about their re-training cadence:
- How often is the model re-trained? Quarterly is the floor; monthly is better.
- What's their data pipeline? Are they ingesting new generators' outputs as they ship? How quickly after a new model release?
- Do they incorporate customer feedback? False positives and false negatives that customers flag should flow back into training data.
- Do they have a research team, and do they publish research or at least share it with customers? Vendors investing in research are more likely to keep up; vendors selling a 2023 model without re-training will fall behind.
Detection is a moving target. The vendor's commitment to keeping up matters as much as their current numbers.
Frequently asked questions
Is 99% accuracy good enough for production?
Depends on volume and stakes. For low-volume manual-review workflows, 99% is fine — you'll catch most cases and the false positives are manageable. For high-volume automated decisions (auto-approving claims, auto-blocking content), even 99.5% accuracy means thousands of errors per million decisions, which may or may not be acceptable depending on per-error cost.
Why don't all vendors publish adversarial-robustness numbers?
Honest answer: because most are bad at it. Adversarial transforms degrade most detectors significantly, and the public numbers would be embarrassing. The vendors who are good at it tend to publish; the rest stay quiet.
Can I trust a vendor's calibration claim?
Verify it on your own test set. Calibration is easy to test — run the API on labeled data, bin by confidence, compare predicted accuracy to actual accuracy in each bin. A vendor whose claims survive this test is reliable; a vendor whose calibration is way off should make you nervous about everything else they claim.
How do I evaluate detection accuracy on illustrations and non-photographic content?
Build a separate test set for that content type and evaluate separately. Most detectors are weaker on illustrations because their training data is photo-heavy. Some vendors offer separate models for photographic vs illustrative content; if you care about both, ask whether they have specialized models.
What's the relationship between accuracy and confidence threshold?
Inverse — sort of. At a low threshold (flag anything with confidence > 0.3), you catch nearly all AI images but also false-flag many real ones. At a high threshold (flag only confidence > 0.95), you have very few false positives but miss many AI images. The "operating point" you choose is a business decision, not a technical one. Headline accuracy is usually reported at the threshold that maximizes F1 score, which is rarely the threshold you'd actually deploy at.
Should I use multiple detection APIs?
For high-stakes use cases, yes. Two independent detectors that agree give you much higher confidence than either alone. Two detectors that disagree route the image to human review, which is exactly the right outcome for borderline cases. The cost is roughly 2x the per-scan price; the accuracy improvement is usually worth it.
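The routing logic is simple enough to sketch. The verdict strings here are placeholders for whatever each vendor actually returns:

```python
def combined_verdict(verdict_a: str, verdict_b: str) -> str:
    """Route an image based on agreement between two independent detectors.

    The verdict strings ("ai", "real") are placeholders; map them from each
    vendor's actual response values.
    """
    if verdict_a == verdict_b == "ai":
        return "block"          # both detectors agree it's AI-generated
    if verdict_a == verdict_b == "real":
        return "allow"          # both agree it's a real photo
    return "human_review"       # disagreement marks exactly the borderline cases

print(combined_verdict("ai", "real"))  # -> human_review
```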
Detection accuracy is a real metric and matters a lot, but the headline number is only the starting point. Per-model accuracy, false positive rate at deployment threshold, adversarial robustness, calibration, and per-error-mode analysis are what separate good detectors from average ones — and what separate honest vendors from optimistic ones.
If you want to evaluate our own API against your own data, start with the free tier — 500 scans per month, no credit card. We'll send you our most recent benchmark composition and calibration plot on request so you can run apples-to-apples comparisons.
Try the AI Image Detector API
500 free scans per month. No credit card. Sub-100ms detection with model attribution and region heatmaps.
Get an API key →