AI text detector with honest limits and stylometry signal

A stylometric signal for text: three statistical components plus an honest limit. No percent score, no AI verdict, no promise the field cannot keep.

An AI text detector scans writing for statistical traces of machine-generated style. On this page you see three stylometric values - perplexity proxy, n-gram uniformity, sentence-length variance - plus a categorical label. Proving that a specific text came from an AI is not what this kind of tool does. No publicly available detector does it reliably.

Three stylometric components, a hint, never a verdict. Instead of a percent score you see where the text sits statistically - and a permanent note about what this signal cannot do.

Can AI text be detected reliably?

No, not reliably. Peer-reviewed research from 2023 onward shows a clear pattern: on unmodified text from known models, public detectors can hit reasonable accuracy. Performance drops sharply once text is paraphrased, lightly edited, mixed with human writing, or generated by an unfamiliar model - down toward chance under active paraphrasing attacks (Sadasivan et al., 2023). Detectors also systematically over-flag non-native English writers as AI, a well-documented bias with serious consequences when these tools enter grading workflows.

The structural problem cuts deep. Language models are explicitly trained to produce statistically plausible word sequences, the kind humans typically write. A user who lightly edits an AI output or weaves in their own sentences can shift the statistical signature with only a handful of changes. What remains are hints, not proof.

How does this detector compute the signal?

You see three values for your text: a perplexity proxy (how 'surprising' the character sequences look against natural prose), the n-gram uniformity (how strongly word pairs repeat), and the sentence-length variance. Those three values map to a label drawn from a fixed vocabulary, never a percent score and never a binary verdict.

What each component measures:

  • Perplexity proxy. A simple character-trigram model compares your text to a small reference distribution of natural prose. High = unusual sequences, more typical of humans. Low = highly predictable. This is an approximation, not a true LLM-based perplexity score.
  • n-gram uniformity. We count word pairs and see how often they repeat. Base-model output without style prompts tends to lean on stock phrases (It is important to..., In conclusion...), and that pattern shows up here as a heuristic.
  • Sentence-length variance. The standard deviation of sentence lengths in words. Humans vary more; uninstructed base-model output in a standard register stays in a tighter band. Instruction-tuned models prompted for varied style can match human variance.
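The two count-based components can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names, the tokenizer regex, the choice of population standard deviation, and the definition of repetition as "share of word pairs that duplicate an earlier pair" are all assumptions.

```python
import re
from statistics import pstdev


def words(text: str) -> list[str]:
    # Assumed tokenizer: lowercase word characters, keeping simple apostrophes.
    return re.findall(r"\w+(?:'\w+)?", text.lower())


def bigram_repetition(text: str) -> float:
    """Share of word pairs that repeat an earlier pair (0.0 = no repeats)."""
    tokens = words(text)
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return 1.0 - len(set(pairs)) / len(pairs)


def sentence_length_std(text: str) -> float:
    """Standard deviation of sentence lengths, measured in words."""
    # Assumed sentence splitter: runs of '.', '!' or '?' end a sentence.
    sentences = re.split(r"[.!?]+", text)
    lengths = [len(words(s)) for s in sentences]
    lengths = [n for n in lengths if n > 0]
    if len(lengths) < 2:
        return 0.0  # a single sentence has no spread
    return pstdev(lengths)
```

For example, `bigram_repetition("it is important it is clear")` finds five word pairs of which one ("it is") repeats, giving roughly 0.2. Raw values like these would still need scaling before they can be compared against fixed thresholds; how the tool does that is not stated here.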

From these three values the label follows by a simple rule. The result reads as one of four fixed sentences, never a percent score:

  • Two or more components at 0.40 or below: AI pattern (full text: "Contains patterns often seen in AI-generated text").
  • Two or more at 0.60 or above with a mean of 0.50 or higher: human pattern (full text: "Contains patterns often seen in human-written text").
  • All three clustered tightly around 0.50: "No clear signal".
  • Otherwise: "Mixed".

The thresholds are heuristic and visible so you can argue with them.
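The rule above is small enough to write out. The sketch below assumes each component has already been scaled to a 0-to-1 range in which lower values lean machine-like; that scaling, the function name, and the `CLUSTER_TOLERANCE` value (how tight "clustered tightly around 0.50" is) are assumptions, since the page does not specify them.

```python
from statistics import mean

AI_MAX = 0.40             # "0.40 or below"
HUMAN_MIN = 0.60          # "0.60 or above"
HUMAN_MEAN_MIN = 0.50     # "mean of 0.50 or higher"
CENTER = 0.50
CLUSTER_TOLERANCE = 0.05  # assumption: the page does not define "tightly"

FULL_TEXT = {
    "AI pattern": "Contains patterns often seen in AI-generated text",
    "human pattern": "Contains patterns often seen in human-written text",
}


def classify(perplexity: float, uniformity: float, variance: float) -> str:
    """Map three scaled component scores to one of the four fixed labels."""
    scores = (perplexity, uniformity, variance)
    low = sum(s <= AI_MAX for s in scores)
    high = sum(s >= HUMAN_MIN for s in scores)

    if low >= 2:
        return "AI pattern"
    if high >= 2 and mean(scores) >= HUMAN_MEAN_MIN:
        return "human pattern"
    if all(abs(s - CENTER) <= CLUSTER_TOLERANCE for s in scores):
        return "No clear signal"
    return "Mixed"
```

With three components, the first two branches cannot both match, so the order of those checks does not change the result. A case such as `classify(0.6, 0.6, 0.1)` shows why the mean matters: two components are high, but the mean falls below 0.50, so the label is "Mixed" rather than "human pattern".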

Who gets falsely flagged?

Three groups can land in AI-pattern territory under this kind of heuristic: non-native English writers, text in formal genres like law or technical writing, and heavily edited drafts. Each can produce values that resemble AI patterns without any AI involved. Only the non-native bias is well documented empirically; the formal-genre and heavy-editing risks are plausible as heuristics but not supported by the same body of evidence.

Group | Why the signal misfires
Non-native English writers | Simpler syntax and a narrower vocabulary mimic base-model output
Formal genres (law, technical, academic) | Genre conventions can suppress sentence-length variance (heuristic)
Heavily edited drafts | Multiple revision rounds can smooth out the statistical signature (heuristic)

Studies from 2023 onward found that mainstream detectors flag essays by non-native writers as AI-generated at significantly higher rates than essays by native speakers - the strongest evidence comes from studies using TOEFL essays compared with US-educated student writing (Liang et al., 2023). Treating a detector signal as proof systematically introduces bias against groups that already face heavier scrutiny.

What should teachers do with this signal?

Treat the signal as a reason to talk, never as evidence. When the label reads "AI pattern", ask the writer about their process: what sources they used, how long they sat with a passage, what they discarded along the way. If the conversation shows they understood their own text, the signal is overruled, regardless of which component flagged what.

Three rules keep the tool useful instead of harmful:

  1. Never confront a student on the signal alone. The false-positive rate is too high for a decision of that weight.
  2. Don't route student writing through a detector pre-screen as part of grading. A tool that brands work "suspicious" shifts the burden of proof unfairly.
  3. State the limitation aloud before any class-wide use. No publicly available detector reliably identifies AI text. Bring that into the conversation with colleagues who believe otherwise.

Frequently Asked Questions

How accurate are AI text detectors?

Accurate enough for marketing claims, not accurate enough for decisions with consequences. Vendors often advertise very high accuracy rates, but research under realistic conditions finds detection rates that drop sharply with even light paraphrasing. For consequential decisions, combine multiple independent methods with human review; never rely on a single tool.

What does the perplexity proxy measure?

It estimates how 'surprising' your character trigrams look against a small reference distribution. Predictable sequences are typical of machine writing patterns; unusual ones lean human. It is an approximation, not a true LLM-based perplexity score. The reference distribution is just a few paragraphs of natural prose, used as the baseline.
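A character-trigram proxy of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the tool's implementation: the function names, the whitespace normalization, add-one smoothing, and the single shared bucket for unseen trigrams are all choices made here for the example.

```python
import math
from collections import Counter


def char_trigrams(text: str) -> list[str]:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


def perplexity_proxy(text: str, reference: str) -> float:
    """Perplexity of the text's trigrams under a smoothed reference distribution."""
    ref_counts = Counter(char_trigrams(reference))
    grams = char_trigrams(text)
    if not grams or not ref_counts:
        raise ValueError("text and reference each need at least three characters")

    total = sum(ref_counts.values())
    vocab = len(ref_counts) + 1  # assumption: one extra bucket for unseen trigrams

    # Add-one smoothing keeps unseen trigrams from producing log(0).
    log_prob = sum(
        math.log((ref_counts[g] + 1) / (total + vocab)) for g in grams
    )
    # Exponentiated average negative log-probability per trigram.
    return math.exp(-log_prob / len(grams))
```

Higher return values mean the text's trigrams are less expected under the reference. Because the reference is only a few paragraphs, the number mostly reflects overlap with that small sample, which is one reason it can only serve as a rough proxy.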

What should I do if the signal reads "AI pattern"?

Talk, don't assume. Ask the writer about sources, drafts, the order of their argument. If the answer holds together, the signal is overruled. False positives are documented most clearly for non-native writers; formal genres and heavily edited drafts are plausible risk cases. In either situation, a snap judgment does the most harm.