Blog / Guide
How to Detect Deepfakes: Visual Tells and Detection Tools (2026)
On this page
- What "deepfake" means in 2026
- Layer 1: Visual tells you can spot by eye
- Layer 2: Audio analysis
- Layer 3: Frame-level forensics
- Layer 4: Detection APIs
- Layer 5: Provenance and chain of custody
- A practical workflow for verifying a single video
- High-volume use cases: pipeline design
- What about fraud signals beyond detection?
- Frequently asked questions
Deepfakes started as a curiosity in 2017 and became a billion-dollar fraud category by 2025. The FBI's IC3 report logged $4.2B in deepfake-driven fraud losses in the US alone in 2025, more than triple the 2023 figure. Insurance carriers, banks, dating apps, journalists, courts, and HR teams all now routinely encounter synthetic videos that look indistinguishable from real footage at first glance.
The good news: detection has kept pace. The visual tells that worked in 2020 ("the eyes don't blink right") still work on amateur deepfakes, and modern detection APIs catch most professional-grade ones. The hard ones — sophisticated fraud-grade deepfakes engineered specifically to defeat detectors — require defense-in-depth.
This guide covers the full detection stack for video deepfakes. Some techniques overlap with detecting AI-generated still images, but video adds powerful temporal signals that don't exist in stills.
What "deepfake" means in 2026
The word has shifted. "Deepfake" originally meant a face swap — replacing one person's face in existing video with another's. By 2026 it's used more broadly:
- Face swaps — original meaning. The body and scene are real footage; only the face has been replaced.
- Lip-sync deepfakes — the original video is unchanged except for the mouth, which has been re-animated to match dubbed audio.
- Full-synthesis videos — the entire video is generated from scratch using models like Sora, Veo, Kling, and Runway. No real footage involved.
- Voice-cloning deepfakes — audio-only fakes using cloned voices, often paired with a still photo or simple lip-sync video.
- Identity-puppet deepfakes — a real person's likeness is reanimated in real time using a different person's facial movements (the "puppet").
Detection techniques differ across these categories. We'll cover each.
Layer 1: Visual tells you can spot by eye
Most amateur deepfakes still fail the eye test if you know where to look. The five tells that remain reliable in 2026:
1. Eye and reflection inconsistency. Real eyes show consistent reflections of the lighting environment. Both eyes in a portrait should show the same scene reflected. Deepfakes often get this wrong because the model generates each eye largely independently. Look at:
- Catchlights (the bright reflection of light sources) — same shape, same location relative to the iris on both eyes?
- The whites of the eyes — symmetric? Same redness/clarity?
- Pupils — both responding the same way to the lighting? Same dilation?
2. Edge artifacts at the face boundary. Face-swap deepfakes paste a generated face onto real footage. The transition at the hairline, jaw, and neck is the hardest part. Look for:
- Slight shimmer or warping along the hairline that doesn't appear elsewhere in the frame.
- Mismatched skin tones between face and neck (especially noticeable when the lighting is dramatic).
- Unnatural smoothing along the jaw — real skin has texture, especially under shadows.
3. Temporal flicker. Frame-by-frame, real video has consistent micro-details — a stray hair, a freckle, a skin highlight — that move smoothly between frames. Deepfakes often have details that flicker or drift slightly because the model regenerates each frame. Slow the video down to 0.25x speed and watch the face for a few seconds. If something seems to shimmer or pulse, that's a tell.
4. Audio-video desync (lip-sync deepfakes). Lip-sync models in 2026 are excellent but still occasionally drift. Watch for:
- Bilabial consonants (P, B, M) — do the lips fully close on each one?
- Sibilants and fricatives (S, SH, F) — do the lip and tongue positions match the sound?
- Long vowels — does the mouth shape hold steady or does it flicker?
With good headphones, also listen closely to the audio itself. Voice-cloning models in 2026 are very good, but they can sound slightly too "smooth" — fewer mouth-noise artifacts (lip smacks, breaths, throat clicks) than real speech.
5. Scene logic and physics. This is where Sora-class full-synthesis videos still slip:
- Do hands holding objects pass through them?
- When a person turns their head, does the back of the head match the front?
- Do shadows and reflections persist correctly when the camera moves?
- Do multiple people in the same scene cast lighting consistent with the same environment?
Honest disclosure: each of these tells fails on the highest-end deepfakes from 2026. Together, they catch most amateur and mid-tier fakes. Professional fraud-grade deepfakes engineered to defeat the eye require layer-2+ detection.
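The temporal-flicker check (tell 3) can also be approximated numerically. Below is a minimal sketch, assuming grayscale frames represented as plain 2D lists; a real pipeline would decode actual video frames (for example with OpenCV) and restrict the measurement to the face region. All names here are illustrative.

```python
# Toy frame-differencing metric behind the 0.25x slow-down check:
# smooth real motion gives steady frame-to-frame differences, while
# regenerated (deepfaked) detail tends to pulse, so the differences
# themselves oscillate.

def frame_diff(a, b):
    """Mean absolute pixel difference between two same-sized frames."""
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def flicker_score(frames):
    """Variance of successive frame differences (higher = more shimmer)."""
    diffs = [frame_diff(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)

# Synthetic 2x2 "face patch": steady drift vs. an on/off shimmer.
base = [[10, 20], [30, 40]]
steady = [[[v + t for v in row] for row in base] for t in range(6)]
shimmer = [[[v + o for v in row] for row in base] for o in [0, 4, 0, 0, 4, 0]]
```

A steady sequence scores zero here while the shimmering one does not; on real footage you would compare scores against a baseline for the same camera and compression settings rather than against zero.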
Layer 2: Audio analysis
Voice deepfakes are easier to detect than visual ones, partly because audio has fewer dimensions and partly because human auditory perception is more sensitive to artifacts than visual perception is.
What audio detectors look for:
Spectral artifacts. Cloned voices often have characteristic patterns in their frequency spectrum — slightly elevated noise in the 4-8 kHz range, unnatural smoothness in the formant transitions, or frequency masking inconsistencies that real recordings don't have.
Prosody patterns. Cloned voices struggle to reproduce the natural rhythm and pacing of human speech. Sentence-final intonation, emphasis patterns, and the subtle pauses around emotional content all deviate slightly from real speech in ways that classifier models can detect.
Background ambient consistency. Real recordings have ambient noise — HVAC hum, room reverb, distant traffic — that's consistent throughout. Deepfakes often have either implausibly clean audio or ambient noise that's slightly too consistent (real ambient noise has random variation that synthetic ambient noise doesn't reproduce well).
Compression history. Like images, audio carries forensic traces of its compression history. Cloned voices typically have a single compression generation; real recordings often have multiple (recorded → uploaded → re-encoded by a platform → downloaded → re-uploaded).
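The spectral check above can be sketched with a band-energy ratio. This is a toy illustration, not how production detectors work (they run trained classifiers over full spectrograms); it uses a naive DFT in pure Python on synthetic tones, and every function name is invented for this example.

```python
import math

def dft_mag(signal):
    """Naive DFT magnitudes (fine for short windows; use an FFT in practice)."""
    n = len(signal)
    mags = []
    for k in range(n // 2):
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def band_energy_ratio(signal, sample_rate, lo_hz, hi_hz):
    """Fraction of spectral energy falling between lo_hz and hi_hz."""
    mags = dft_mag(signal)
    n = len(signal)
    total = sum(m * m for m in mags) or 1.0
    band = sum(m * m for k, m in enumerate(mags)
               if lo_hz <= k * sample_rate / n < hi_hz)
    return band / total

# A 6 kHz tone (sampled at 16 kHz) puts nearly all energy in the
# 4-8 kHz band; a 1 kHz tone puts almost none there.
sr, n = 16000, 512
tone6k = [math.sin(2 * math.pi * 6000 * t / sr) for t in range(n)]
tone1k = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(n)]
```

In practice, an elevated 4-8 kHz ratio is only meaningful relative to a baseline for the same codec and recording setup, which is why this feature feeds a classifier rather than a fixed threshold.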
Most modern deepfake-detection APIs analyze audio alongside video. If you're verifying a video, run it through a detector that covers both modalities — single-modality detectors miss fakes where the visual is good but the audio is cloned, and vice versa.
Layer 3: Frame-level forensics
This is what dedicated deepfake-detection APIs do under the hood. The core techniques:
Optical flow consistency. Optical flow measures pixel-level motion between consecutive frames. Real video has flow patterns consistent with the laws of motion (objects move in continuous trajectories, surfaces deform smoothly). Deepfakes often have flow inconsistencies at the boundaries of synthesized regions because the model generates each frame partly independently.
Frequency-domain fingerprints. Generative models — both image and video — leave characteristic signatures in the Fourier transform of their outputs. Diffusion models, GAN-based models, and autoregressive models each have distinctive fingerprints. Trained classifiers can identify these even on heavily compressed video.
Compression artifact analysis. Real video usually goes through a single round of codec compression (H.264, H.265, AV1). Deepfakes often go through multiple rounds — generation pipeline output → compression → re-rendering → final compression — leaving compounded artifacts that are detectable.
Biological signal extraction. Real videos of real people contain subtle biological signals: heart-rate-driven micro-color-changes in skin (rPPG), micro-expressions during speech, and breathing-driven chest motion. Several research groups have shown that deepfake faces lack these signals, or have them in patterns that don't match real cardiovascular activity. As of 2026, rPPG-based detection is one of the most robust techniques against deepfakes that defeat other detectors.
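The rPPG idea can be illustrated with a toy periodicity check: average skin-pixel intensity per frame gives a time series, and a real pulse shows up as a dominant frequency in the plausible human heart-rate band (roughly 0.7-3 Hz, i.e. 42-180 bpm). Real rPPG requires careful skin segmentation, motion compensation, and denoising; the names and data below are invented for illustration.

```python
import math

def dominant_freq_hz(series, fps):
    """Frequency (Hz) of the strongest non-DC DFT bin of a time series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2):
        re = sum(centered[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(centered[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * fps / n

def plausible_pulse(series, fps, lo=0.7, hi=3.0):
    """True if the dominant periodicity falls in the human heart-rate band."""
    return lo <= dominant_freq_hz(series, fps) <= hi

# Synthetic 10 s of 30 fps "mean green channel" data with a 1.2 Hz
# (72 bpm) pulse, vs. a flat series with no biological signal at all.
fps, secs = 30, 10
pulse = [100 + 0.5 * math.sin(2 * math.pi * 1.2 * t / fps) for t in range(fps * secs)]
flat = [100.0] * (fps * secs)
```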
Layer 4: Detection APIs
The practical choice for most teams. Modern deepfake-detection APIs combine layers 2 and 3 into a single endpoint. The good ones report:
- A confidence score — probability that the video contains synthetic content
- A modality breakdown — visual vs audio vs both
- A region heatmap — which parts of the frame are most suspicious, frame-by-frame
- A temporal heatmap — which segments of the video timeline are most suspicious
- Source-model attribution when possible — which generator produced the deepfake
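Consuming a response shaped like the fields above usually reduces to thresholding the confidence score into a policy action. The sketch below invents its own field names and thresholds, since response schemas and calibrations vary by vendor.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    confidence: float      # P(synthetic content), 0..1 (hypothetical field)
    modality: str          # "visual", "audio", or "both" (hypothetical field)
    worst_segment: tuple   # (start_s, end_s) of the most suspicious span

def triage(v: Verdict, block_at=0.95, review_at=0.60):
    """Map a detector verdict to a moderation action.

    The thresholds are policy choices, not vendor recommendations.
    """
    if v.confidence >= block_at:
        return "block"
    if v.confidence >= review_at:
        return "human_review"
    return "allow"

# Example verdict, as if parsed from a detector's JSON response:
v = Verdict(confidence=0.87, modality="both", worst_segment=(12.0, 14.5))
```

A 0.87 verdict lands in human review here; where you set the thresholds should depend on the relative cost of false positives vs. false negatives for your product.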
Latency for video detection is necessarily higher than for stills. A 30-second clip typically takes 2-8 seconds to fully analyze on a modern API. For real-time moderation at scale, most providers offer a fast-screen mode that analyzes a sample of frames (say, 1 frame per second) and falls back to deep analysis only on flagged content.
Pricing is usually metered by video-second, not by API call. Plan accordingly: at $0.01 per second of video, a content moderation pipeline reviewing 1,000 minutes of user-uploaded video per day costs about $600 per day, or roughly $18,000 per month.
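The metering math is worth writing down explicitly, because per-second pricing compounds quickly. A minimal calculation, assuming a flat per-video-second rate:

```python
def monthly_cost(minutes_per_day, price_per_second, days=30):
    """Cost of a metered-by-video-second detection pipeline."""
    return minutes_per_day * 60 * price_per_second * days

# 1,000 minutes of uploads per day at $0.01 per video-second:
# 1,000 * 60 s * $0.01 = $600/day -> $18,000 over a 30-day month.
cost = monthly_cost(1000, 0.01)
```

This is why the fast-screen-then-deep-analysis pattern matters: if 95%+ of uploads are cleared by the cheap pre-screen, only a small fraction of those video-seconds are billed at the full deep-analysis rate.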
Layer 5: Provenance and chain of custody
For the highest-stakes cases — court evidence, journalism, large insurance claims — even a confident detection-API result isn't enough on its own. You need a documented chain of custody.
The questions you should be able to answer for any video that matters:
- Who originally captured it? With what device? What's the device's serial number or unique identifier?
- When was it captured? Does the timestamp match metadata, GPS, weather records, lighting conditions in the frame?
- Where was it captured? Does the geographic claim match what's visible (signage, language, terrain, time of day)?
- How did it get to you? Every hand-off — phone → email → platform upload → forwarded to investigator — should be logged.
- Has it been edited? Is there an unmodified original somewhere with C2PA metadata, EXIF, or device signatures?
C2PA-compatible cameras (Sony, Leica, Canon's 2025+ pro lines) sign video at capture time. If the original video has a signed C2PA manifest, you have cryptographic proof of origin that no detection API can give you. Same goes for major social platforms that are starting to attach upload-time C2PA assertions in 2026.
A practical workflow for verifying a single video
For a casual case (someone shared a video and you want to know if it's real):
- Do the visual check. Watch at 0.25x speed once through. Look for the layer-1 tells.
- Reverse-search a key frame. Take a screenshot of an early frame, drag it into Google Images and TinEye. Has this footage appeared before?
- Run it through a deepfake-detection API. Most have free tiers that allow a few minutes of video per month.
- Make your call.
For a high-stakes case (legal, insurance fraud, journalism):
- Establish chain of custody first. Before doing any detection work, document where the video came from and every hand-off.
- Request the original file. Platform copies (downloaded from Twitter, TikTok, YouTube) have been re-encoded and stripped of most useful metadata. The original from the device that captured it has 100x more forensic value.
- Run two independent detection APIs. False positives are costly in high-stakes contexts; require agreement between two detectors.
- Get human expert review. Forensic-grade deepfake detection is a real specialty; for cases that will reach a courtroom, work with a vendor or expert who can produce a defensible written report.
High-volume use cases: pipeline design
If you're moderating user-uploaded video at scale (social platforms, dating apps, marketplaces), the architecture pattern that works in 2026:
- Upload → cheap pre-screen. A fast classifier (50-200ms per video) catches the obvious cases. Roughly 95-99% of uploaded video should pass at this stage.
- Suspicious uploads → deep analysis. Anything flagged by the pre-screen goes to a full deepfake-detection API. This is more expensive (a few seconds, a few cents) but produces a confident verdict.
- High-confidence flags → action. Auto-block, soft-flag, or queue for human review depending on policy.
- Borderline → human review. A small fraction of cases will be ambiguous; route to a moderation team.
- Audit logging on everything. Every detection result, every decision, every override — log it. This is critical for trust-and-safety audits, regulatory compliance, and improving the pipeline over time.
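The five steps above can be sketched as a single routing function. The score functions and thresholds here are stand-ins for a real fast classifier and a real detection API; everything is illustrative.

```python
# Two-stage moderation: cheap pre-screen, deep analysis only on
# suspicious uploads, and an audit log entry for every decision.

AUDIT_LOG = []

def moderate(video_id, prescreen_score, deep_analyze):
    """Route one upload through pre-screen -> deep analysis -> action.

    prescreen_score: cheap classifier score in [0, 1].
    deep_analyze: callable returning a confident score in [0, 1]
                  (stands in for a full detection-API call).
    """
    if prescreen_score < 0.5:          # the ~95-99% that pass cheaply
        decision = "allow"
    else:
        deep = deep_analyze(video_id)  # expensive, confident verdict
        if deep >= 0.95:
            decision = "block"
        elif deep >= 0.60:
            decision = "human_review"  # borderline -> moderation team
        else:
            decision = "allow"
    AUDIT_LOG.append((video_id, prescreen_score, decision))
    return decision
```

Note that the audit log records every upload, including the ones the pre-screen cleared; that is what makes the pipeline auditable and tunable later.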
We have a complete guide to AI image moderation pipelines that covers the equivalent architecture for still images.
What about fraud signals beyond detection?
Detection-API verdicts aren't the only signal you have. For fraud-specific use cases (insurance, identity verification, KYC), pair detection with behavioral signals:
- Liveness checks — for ID verification, require live video capture (random head movements, expressions on demand) so the user can't submit a pre-recorded deepfake.
- Device and session signals — device fingerprint, IP reputation, session timing patterns. Deepfake fraud rings often share devices.
- Cross-modal consistency — if a claim includes a video, a written statement, and supporting documents, do they all tell the same story? Sophisticated AI-generated fraud is now common but full-stack consistency is still hard to fake.
- Outcome statistics — if claims with video evidence have suspiciously different outcomes than claims without, you have a portfolio-level signal even before any individual claim is investigated.
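As a concrete instance of the device-and-session signal, here is a toy version of shared-device detection: flag device fingerprints tied to many distinct claimant identities, a pattern the section above attributes to fraud rings that share hardware. Field names and the threshold are invented for illustration.

```python
from collections import defaultdict

def shared_device_flags(events, threshold=3):
    """events: iterable of (device_fingerprint, claimant_id) pairs.

    Returns the device fingerprints used by >= threshold distinct
    claimants. One person on one device is normal; many identities
    on one device is a portfolio-level fraud signal.
    """
    claimants = defaultdict(set)
    for device, claimant in events:
        claimants[device].add(claimant)
    return {d for d, c in claimants.items() if len(c) >= threshold}

# "dev1" is shared by three identities; "dev2" is one user seen twice.
events = [("dev1", "alice"), ("dev1", "bob"), ("dev1", "carol"),
          ("dev2", "dave"), ("dev2", "dave")]
```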
We dive into the deepfake-fraud landscape in detail in Deepfake fraud is a $40B problem.
Frequently asked questions
Is there a free deepfake detector I can use right now?
Yes — most major detection APIs (ours included) offer a free tier with enough monthly capacity for casual verification of a few videos per week. Browser extensions from several detection providers also let you right-click any video on YouTube, Twitter/X, or TikTok and get an inline verdict.
How accurate are deepfake detectors in 2026?
State-of-the-art detectors report 95-99% accuracy on benchmarks (datasets like FaceForensics++, DFDC, and the 2025 EU Deepfake Detection Challenge). Real-world accuracy on adversarial deepfakes — those engineered specifically to defeat detectors — drops to 80-90%. This is why defense-in-depth (combining multiple detection methods plus chain-of-custody) is the standard for high-stakes cases.
What's the difference between a deepfake and an AI-generated video?
In strict usage, "deepfake" means a manipulation of real video (face swap, lip-sync) while "AI-generated video" means synthesis from scratch. In 2026 the terms are used interchangeably for any synthetic video meant to mislead. Detection techniques cover both.
Can deepfakes be detected on mobile devices?
Yes — lightweight on-device deepfake detectors are now common in mobile SDKs. On-device accuracy is typically 5-10 percentage points lower than cloud detection, but latency is much lower (under 500ms for a short clip) and the video never leaves the device, which avoids the privacy implications of cloud analysis.
Are voice deepfakes harder to detect than visual ones?
Counter-intuitively, no. Audio has fewer dimensions and fewer ways to fool detectors. Modern voice-deepfake classifiers report 97%+ accuracy on standard benchmarks, vs 95% on visual deepfakes. The audio-only fraud cases (cloned-voice phone scams) are dangerous specifically because most victims don't have access to detection tools, not because the deepfakes are technically hard to flag.
Is there a single magic detection technique?
No. Anyone selling a 100%-accurate deepfake detector is overselling. The honest 2026 answer is that defense-in-depth — combining provenance, classifier-based detection, behavioral signals, and human expert review where stakes are high — gets you to roughly 99% reliability for the cases that matter.
Deepfake fraud is real and growing, but detection has kept up. The methods in this guide — visual inspection, audio analysis, frame-level forensics, detection APIs, and provenance — work in combination. Don't rely on any single layer.
If you're building deepfake detection into a product, start with our free API tier — it covers both image and video detection, returns calibrated confidence scores, and gives you the layer-by-layer breakdown so you can make informed decisions on borderline cases.
Try the AI Image Detector API
500 free scans per month. No credit card. Sub-100ms detection with model attribution and region heatmaps.
Get an API key →