Language is forgiving, Regex isn't

Some background context

I’ve watched the early failings of LLMs as things have progressively improved. Many people point to overconfident hallucinations and inaccuracies. I also noticed they were bad at regular expressions. In the early stages of LLMs being kind of a ‘toy,’ none of this mattered to me much. And then we started seeing ‘AI’ being ingested into the pipelines of real products. Yeah there were still overconfident hallucinations used for summaries of things. I was also seeing LLMs being used to generate things where regex was the consumable content (i.e. security detections). The regex was often wrong. If not wrong, poorly formatted and hard to maintain. But what would be harder to detect with a naked eye: never tested to be performant.

Since then, LLMs have gotten profoundly better, but not for critical aspects of regular expressions. And at the Frontier LLM level, they wont get better for a while. That is what this post is about. I will explain why. I will show my tests. I will offer solutions.

Why fluent language models write dangerous regular expressions

Large language models are now extremely great with language. They draft, summarize, translate, argue, and imitate voice with a fluency that still surprises people who use them daily. However, with regular expressions, they can be subtly wrong, or worse, quietly catastrophic; a pattern that passes every example you test it on and then pins a CPU at 100% the first time it meets an adversarial input.

Regex is different than natural language. Natural language plays to everything a transformer is built to do. Regular expressions attack the one thing it can’t. This piece is about that asymmetry, specifically in the naive setting that matters most in practice: a single pass, no tools, no test harness, the model asked to produce a pattern and trusted to get it right.

LLMs are good at language (duh)

A language model is a next-token predictor trained on an insane quantity of human text. Three properties of natural language make that objective and that training data an almost perfect match for the task.

First, the supervision is enormous and self-describing. Language is the most abundant structured data in existence, and it explains itself as it goes; context, paraphrase, definition, correction. The model sees the same idea expressed ten thousand ways, with the meaning carried in the surrounding words. It learns the distribution of how humans say things because that distribution is written down everywhere.

Second, language is statistical and redundant. There is rarely a single correct output. A dozen phrasings of the same sentence are all acceptable, and a slightly off word choice still communicates. The objective: produce a likely continuation is the task. The model is rewarded for landing in a large, fuzzy region of “good enough,” and natural language is full of such regions.

Third, meaning lives at roughly the granularity the model operates on. Subword tokens and attention over them line up reasonably well with morphemes, words, and phrases, the units where linguistic meaning actually accumulates. The representation and the semantics are, loosely, in the same place. You can train a language model on per character. I have lab’d this with some “small” language models with a corpus like a Bible or the works of H.G. Wells. It’s not really the best way to do things, because lots of training is spent on learning how to spell before learning language. Frontier models typically go with the morphemes, words, and phrases.

Put together: abundant annotated data, a forgiving target with many right answers, and a representation aligned with where meaning lives. Fluency is what you’d expect, not what should astonish us.

Regex is none of those things

A regular expression is not a likely continuation of anything. It is a formal object. It denotes a precise set of strings; a regular language with exact automaton semantics underneath. There is a right answer and an infinitude of wrong ones, and the wrong ones are not gracefully wrong. “Close” can mean “accepts an empty string it should reject,” or “matches across a delimiter it should respect,” or “runs in exponential time.” None of the forgiveness that makes language tractable survives the jump to a formal notation where every character carries exact composable meaning and a single misplaced quantifier changes the language.

There are really two distinct ways a generated regex can be wrong, and the model is weak on both for related reasons.

The first is correctness: does the pattern denote the language you intended. For common, heavily-attested formats: email-ish strings, dotted labels, ISO dates, the model has effectively memorized the right shape and reproduces it well. It is pattern-matching against a regex it has seen thousands of times. But hand it a novel rule, one with no canonical regex in the training corpus, and it has to construct the language rather than recall it. Construction is where the cracks show. The model approximates the target by stitching together familiar fragments, and on an unfamiliar specification that approximation drifts, over-accepting here, missing an edge there, because it is imitating the look of correct regex rather than reasoning about what the automaton accepts.

The second failure is the dangerous one, because it is invisible.

The invisible failure: catastrophic backtracking

A regex can be perfectly correct and still be a denial-of-service vulnerability. Patterns with ambiguous, overlapping repetition; the classic (a+)+, or any “repeated unit with an optional in-band separator”, force a backtracking engine to explore exponentially many ways to partition the input. On a string that almost matches and then fails at the end, the engine tries every partition before giving up. Twenty-six characters can take seconds. A few more, and it never returns. This is ReDoS.

Here is why the model is structurally blind to it. Catastrophic behavior is a property of the regex’s runtime, and runtime appears nowhere in the source text. Regex is typically written once, committed without tests, and never discussed with performance annotations the way code is. So the model has almost no training signal connecting regex syntax to execution cost. It cannot feel that (t+k?)+z is exponential the way an engineer who has been paged at 3 a.m. can, because it has never been paged and has no internal automaton to simulate the blowup. It produces non-catastrophic patterns when it happens to be imitating cautious examples and catastrophic ones when imitating sloppy examples, with no mechanism to tell the two apart.

In testing this is exactly what you see. Give the model a format whose safe, linear pattern is obvious, and it reliably finds it. Give it a format where the safe pattern requires a non-obvious restructuring, recognizing that an optional separator must be made mandatory between elements and handled separately at the boundary, and it will confidently produce the exponential nested form. The damning part: that output passes every functional test case. It is correct. It just falls off a cliff on an input a junior attacker could craft. The model had no signal that anything was wrong, because nothing about the wrongness is visible in the text it was trained on.

Why scale won’t simply fix it

Underneath both failures is one fact about the architecture: a transformer pattern-matches against a learned distribution. It does not execute automata, and it does not construct them. Natural language rewards exactly that behavior. Regex requires the opposite. Deciding whether a regex denotes the intended language, or whether its NFA has the ambiguous states that cause exponential paths, is automaton-level reasoning; the kind of thing you run, not the kind of thing you pattern-match. The model can approximate it from examples, and on familiar shapes the approximation is good enough to look like understanding. On unfamiliar or adversarial shapes the approximation is all there is, and it is not grounded in the semantics that would make it reliable.

This is also why the obvious fixes are weaker than they sound. Better tokenization, even character-level doesn’t address it, because the bottleneck is computational, not representational. And simply letting the model “think harder”; more internal reasoning before answering, helps only at the margins, because more reasoning is still more pattern-matching over the same priors. It is not the same as having an automaton to consult. You can expect the rate of dangerous output to drop somewhat as you crank up deliberation, but not to vanish, because the missing capability was never a question of effort.

“But it’s great at code”: the obvious objection

If LLMs struggle with formal notation, why are they so genuinely useful at writing code, which is also formal? Both things are true, and the reconciliation is instructive. Code generation benefits from compensating factors that regex simply doesn’t have:

  • Volume and annotation. Code is the most abundant structured text after prose, and a huge fraction of it is explained: commits, issues, reviews, tests, docs, Q&A sites that spell out why a snippet does what it does. Regex is written once and hardly discussed at the same level.
  • Decomposition. Code breaks into functions and modules; units the size a model handles well. A regex is a single dense expression where every character matters and nothing can be swapped in as a subroutine.
  • Mature tool loops. The standard way to get good code from a model is generate-test-fix against compilers, type checkers, and test suites. That loop barely exists for regex; almost nobody runs a ReDoS checker in their workflow.
  • Targeted training. Code quality has received enormous, deliberate reinforcement. Regex hasn’t, because the business case is smaller.

Strip those four supports away and you get regex. Notably, the same silent-failure class that bites regex also lurks in code: performance bugs, race conditions, inject-able queries, exactly the places where the failure isn’t visible in the source and there’s no cheap checker in the loop. Regex is just the purest, most concentrated instance of the problem.

PoC or GTFO

Before getting into the fix, I will show examples of how bad this can get. I did my tests in API mode to better control variables and get closer to stripped down LLM instead of the extended thinking and validation loops that Desktop/Cowork/Code can use. Is this unfair? No for a couple of reasons. When regex generation is pipelined into vendor content, it would typically be at this ‘one-pass’ API level, not a full agentic Cowork session per expression. But even in Cowork, your expression isn’t going to be tested for performance.

For my tests, I used differing token limits, thinking types, and effort level. Here’s an example of the code used with one of those configurations.

import anthropic
from dotenv import load_dotenv; load_dotenv()

VALID = """\
tz
ttz
tktz
ttktttz
tkz
ttkttz
tttz
tktktz
tkttz
ttkz
"""

INVALID = """\
z
ktz
tkkz
tt
kz
tzk
t
tk
ztt
ttzz
tttttttttttttttttt
"""

PROMPT = (
    "Each line below is either a valid tally code or not. A tally code is a series "
    "of one or more clusters. A cluster is one or more 't' symbols. A cluster may be "
    "followed by a single linker 'k'. Every code ends with the terminator 'z'. Write "
    "a single regular expression that matches every VALID code and none of the INVALID "
    "strings.\n\nVALID:\n" + "\n".join(VALID) + "\n\nINVALID:\n" + "\n".join(INVALID) +
    "\n\nReturn the regular expression."
)

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=21000,
    thinking={"type": "adaptive"},
    output_config={"effort": "max"},    
    messages=[{"role": "user", "content": PROMPT}],
)
print(next(b.text for b in resp.content if b.type == "text"))

Results

tokens = 1024 (10 tests)

  • ^(t+k?)+z$
  • ^(t+k?)+z$
  • ^(t(\nt)*(\nk)?\n)*t(\nt)*(\nk)?\nz$
  • ^(t(\nt)*(\nk)?)(\n(t(\nt)*(\nk)?))*\nz$
  • (t+k?)+z
  • ^t(\nt)*(\nk)?(\nt(\nt)*(\nk)?)*\nz$
  • ^(t+k?)+z$
  • ^t(\nt)*(\nk)?(\nt(\nt)*(\nk)?)*\nz$
  • ^(t+k?)+z$
  • ^t(\nt)*(\nk)?(\nt(\nt)*(\nk)?)*\nz$

tokens = 21000 (5 tests)

  • ^t(\nt)*(\nk)?(\nt(\nt)*(\nk)?)*\nz$
  • ^t(\nt)*(\nk)?(\nt(\nt)*(\nk)?)*\nz$
  • ^(t+k?)+z$
  • ^t+(kt+)*k?z$
  • ^(t+k?)+z$

tokens = 21000, max effort

  • ^(t+k?)+z$
  • ^(t+k?)+z$
  • ^\s*t(\s+t)*(\s+k)?(\s+t(\s+t)*(\s+k)?)*\s+z\s*$
  • ^(t+k?)+z$
  • ^(t+k?)+z$

tokens = 21000, max effort, adaptive thinking

  • ^(t+k?)+z$
  • ^(t+k?)+z$
  • ^(t+k?)+z$
  • ^(t+k?)+z$
  • ^(t+k?)+z$

Interpretation

How do these rank? When given an adversarial string of 28 t’s, this is how long each timed on my system in seconds. Here is an example command that I used python3 -c "import re,time; s='tttttttttttttttttttttttttttt'; t=time.perf_counter(); re.search(r'^\s*t(\s+t)*(\s+k)?(\s+t(\s+t)*(\s+k)?)*\s+z\s*$', s); print('%.4fs'%(time.perf_counter()-t))"

  • (t+k?)+z: 17.4022s
  • ^(t+k?)+z$: 8.8276s
  • ^(t(\nt)*(\nk)?\n)*t(\nt)*(\nk)?\nz$: 0.0001s
  • ^(t(\nt)*(\nk)?)(\n(t(\nt)*(\nk)?))*\nz$: 0.0001s
  • ^t(\nt)*(\nk)?(\nt(\nt)*(\nk)?)*\nz$: 0.0001s
  • ^t+(kt+)*k?z$: 0.0001s
  • ^\s*t(\s+t)*(\s+k)?(\s+t(\s+t)*(\s+k)?)*\s+z\s*$: 0.0001s

Some of the expressions produced were benign. A lot of them were not. The catastrophic 8 second one was produced in the majority of cases across all test types. There was a more catastrophic 17 second one that also occurred in an early case. So this isn’t great.

Ground truth back in the loop

None of this means the model is useless for regex. It means unaided generation is unreliable. The moment you put a checker in the loop: run the candidate against positive and negative cases, push an adversarial string through a ReDoS detector, iterate, the picture changes completely. The model becomes the composer and the checker supplies the formal ground truth the model lacks. That mirrors how competent humans work: hardly anyone writes non-trivial regex once and ships it untested. The improvement isn’t the model getting better at automata; it’s the model being placed in a loop with something that understands them.

Every mitigation worth the name is the same idea at a different strength: supply the semantics the model can’t derive, and make supplying them rigorous and non-optional. Think of this like a ladder, and where you sit on it is the difference between a useful tool and a liability. One warning applies to every rung, because it’s the failure that keeps recurring: the checker has to test both correctness and catastrophic backtracking. Running a pattern against examples to see if it matches is an instinct; the model will do it almost by reflex. Pumping an adversarial string through it under a timeout is not, and it’s the test that catches the dangerous failure. Whatever rung you’re on, if your definition of “done” doesn’t make adversarial performance explicit, you will get a green-lit time bomb.

MCP + SKILL

The fix is the same shape as everything else in this piece: stop asking the model to be an automaton and hand it one to call. ReDetox is an MCP server that “I” created that does exactly that. The model still composes the pattern; that’s the part it’s good enough at, but every claim about whether the pattern is correct and whether it’s safe gets settled by the server, not by the model’s confidence. ReDetox generates the adversarial input, times the evil string against a benign baseline in a killable subprocess to classify NFA backtracking growth as linear, polynomial, or exponential, sweeps RE2’s DFA compile memory to find the blowup cliff, and checks the pattern against intended positives and negatives. What comes back is a structured verdict: accepted or rejected, with the dangerous construct named, that the model iterates against instead of guessing about. The model composes; the server checks.

None of the measurement logic is new, which is the point. ReDetox is a Python port of 8ball, my hand-written Perl ReDoS engine (benchrexes.pl + nfagen.pl) that’s been my source of truth for this for years. The MCP layer is just the adapter that lets an LLM reach a tool it could never have reasoned its way to. The README has the full tool surface: evil/benign generation, the per-engine growth matrix, verified rewrites, the no-silent-success contract, but the load-bearing idea fits in one line: a regex is “safe” only when correctness, growth, and language equivalence have all been measured, never when it merely looks fine.

But a coprocessor nobody calls is dead weight, and an LLM left to its own habits will happily skip the tool and eyeball the pattern. The exact failure mode we’re trying to kill. That’s the other half: a Skill. The MCP supplies the ground truth; the Skill makes consulting it non-optional. It redefines “done” as concise, correct, and performant, and encodes one rule the model isn’t allowed to talk its way around: never pronounce a regex safe from inspection; always measure first. Draft, analyze, apply the named rewrite if rejected, re-analyze, then return. The tool does the checking; the Skill turns “the model can write a safe regex if it remembers to test” into “the model reliably does.”

https://github.com/XlogicX/ReDetox

POC or GTFO Part II

I give this MCP + SKILL system a battery of tests to make sure it’s doing what it was designed to do.


Test 1

Prompt

Please use the regex-quality MCP server to determin if this regular expresson: ^(t+k?)+z$

Is good for this data:

VALID: tz
ttz
tktz
ttktttz
tkz
ttkttz
tttz
tktktz
tkttz
ttkz

INVALID: z
ktz
tkkz
tt
kz
tzk
t
tk
ztt
ttzz
tttttttttttttttttt

Resonse

The pattern is correct but dangerous. Let me get a verified safe replacement before reporting back.## Verdict: Correct, but not safe to use as written

I ran your regex ^(t+k?)+z$ through the analyzer against all 21 of your test cases. Here’s what came back.

Correctness — passes. It matches all 10 VALID strings and rejects all 11 INVALID ones. Zero false positives, zero false negatives. So functionally, it captures the language you intended.

Safety — fails (this is the problem). The pattern is vulnerable to catastrophic backtracking (ReDoS) with exponential runtime growth. The culprit is the construct (t+k?)+ — a repeated group that mixes a repeatable element (t+) with an optional one (k?). When the trailing z is missing, the engine can split a run of ts an exponentially growing number of ways before giving up.

The measured timing curve makes this concrete. On an input like tttt...tttkza, each extra pair of characters roughly quadruples the match time:

Input length Time
17 0.8 ms
19 3.1 ms
21 12.9 ms
23 50 ms
25 197 ms
27+ times out

A ~27-character malicious string already hangs the engine. That’s a denial-of-service waiting to happen if this pattern ever touches untrusted input.

The server found a verified drop-in replacement that is provably equivalent (exact DFA comparison) to your original and runs in linear time:

^t+(?:kt+)*k?z$

This works by requiring the k to act as a separator between runs of t, which removes the ambiguity that caused the backtracking. It accepts and rejects exactly the same strings as your original — so all 21 of your test cases still pass — but can’t be exploited.

Bottom line: your regex is logically good for this data but should not ship as-is. Swap in ^t+(?:kt+)*k?z$ and you keep identical behavior without the ReDoS risk.


Test 2

Correct-but-dangerous, no magic words. — Tests triggering on a bare quality question, and whether it separates correctness from safety, names the construct, and offers a verified rewrite.

Prompt

“Is this regex any good? ^(\d+\s*)+$ — it should match strings like 12 34 567 and reject things with letters.”

Response

Go isn’t installed here (an environment gap, not a regex problem), but the more useful signal showed up: the only equivalent safe rewrite is the atomic-group one, which RE2/Go can’t parse — and the structural alternative keeps coming back as not equivalent to your original. That’s worth understanding. Let me get the per-engine picture for the original pattern, since the answer genuinely depends on where you run it.Good question to actually test — this is the classic case where the pattern looks fine and even works, but hides a denial-of-service bug. Here’s what the measurements show.

Correctness: it’s right. It matched all your positives (12 34 567, 123, 1 2 3, 42, 7 8) and rejected every negative I threw at it (abc, 12a, 12 ab 34, empty string, 1.2). So functionally it does what you described.

Safety: it’s dangerous on backtracking engines. The structure (\d+\s*)+ is exponential because \s* can match zero characters inside the +, so a run of digits can be partitioned in exponentially many ways. When input almost matches but fails at the end, the engine explores all of them. Measured growth, by engine:

Engine Verdict
Python re Exponential
Node / V8 (JavaScript) Exponential
Python regex module Linear
RE2 Linear (automaton engine)
Go, Java, PCRE2 Not installed here — untested, but Java/PCRE2 are the same backtracking family, so assume at-risk

The attack string is just digits and spaces with one trailing letter, e.g. 111…111␣␣␣…a. The timing curve roughly quadruples every two characters added: ~41 characters already takes ~300 ms on Python, and at 45 characters it times out entirely. On a server validating untrusted input, one short crafted string stalls the thread.

The fix. The verified equivalent, linear replacement is an atomic group:

^(?>(\d+\s*)+)$

The tool confirmed this is linear and matches exactly the same language as your original (same examples pass, fuzz-checked equivalent). The atomic group stops the engine from re-partitioning a digit run it already consumed. Caveat: atomic groups require a modern backtracking engine — Python ≥ 3.11, PCRE2, Java, and recent Node/V8. They will not compile on RE2 or Go.

Two engine-specific notes:

  • If you’re on RE2 or Go, you don’t need to change anything — those engines are linear by construction, so the original is already safe there (and the atomic-group version won’t even parse).
  • If you want something fully portable that works everywhere including RE2/Go, the cleanest option is ^\d+(?:\s\d+)*\s?$ — but be aware this is not equivalent to your original. The tool flagged it as a behavior change: your (\d+\s*)+ permits multiple spaces, tabs/newlines, and trailing whitespace between numbers, while this stricter version expects a single whitespace separator. Only pick it if that tighter definition is actually what you want.

So: keep the pattern only if you’re on RE2/Go; otherwise switch to ^(?>(\d+\s*)+)$, or ^\d+(?:\s\d+)*\s?$ if you’d rather a portable pattern and can accept the slightly stricter whitespace rule.


Test 3

The eyeball trap. — Tests the load-bearing rule: don’t answer from inspection. The one-word demand should make a non-compliant Claude blurt a guess. Passing: it refuses the snap judgment and measures first. Failing: any confident yes/no before a tool call.

Prompt

“Quick gut check, just yes or no — is ^(?:[a-z]+|[0-9]+)+$ safe to run on untrusted input?”

Response

I hear you wanting a quick yes/no, but this is exactly the kind of pattern where eyeballing it can mislead — nested quantifiers like this look dangerous, but whether a specific engine actually blows up depends on details that can’t be reliably read off the pattern. Let me actually measure it rather than guess. Give me a moment.No. I know you wanted a one-word answer, and that’s it — but the reason matters, because it’s engine-dependent and I’d have gotten the nuance wrong by eyeballing.

Measured on the actual engines:

  • Python re and Node/V8 (JavaScript): exponential — catastrophic. The string 9999…9@ (digits followed by a non-matching char) blows up fast: ~3ms at 16 chars, ~40ms at 20, ~440ms at 24, then it times out by ~29 chars. A ~40-character input would hang the thread. On untrusted input that’s a denial-of-service vector.
  • Python’s regex module and RE2: linear — fine. Same pattern, no blowup.
  • Java, PCRE2, Go: not tested (not installed here). Java and PCRE2 are backtracking engines, so I’d assume they’re vulnerable too until measured, not safe.

The culprit is the (?:[a-z]+|[0-9]+)+ structure: the two alternatives plus the outer + let a failing string be partitioned across iterations in exponentially many ways, and a backtracking engine tries them all before giving up.

So unless you happen to be on RE2 or Python’s regex module specifically, treat it as unsafe.

A verified drop-in fix, if you’re on a modern backtracking engine (Python ≥3.11, PCRE2, or Java):

^(?>(?:[a-z]+|[0-9]+)+)$

The atomic group (?>…) stops the engine from re-partitioning what it already matched, which is what kills the backtracking — confirmed linear on Python, and it agreed with the original on all 2000 fuzzed inputs (strong evidence of equivalence, though not an exact proof). One caveat: RE2 and Go can’t parse atomic-group syntax, so if you’re targeting those, the original is already linear there and you don’t need this — or tell me your exact target engine and I’ll get you a structural rewrite that works on it.


Test 4

Rewrite-discipline trap. — Tests whether it runs a fixer and presents a verified replacement, or hand-writes one from intuition. Passing: it produces a rewrite that fix_until_safe/suggest_rewrites verified as equivalent + linear. Failing: it improvises a rewrite and calls it equivalent without verification.

Prompt

“Here’s my pattern ^(\w+\s?)+$. Don’t overthink it, just give me a faster equivalent version.”

Response

The “don’t overthink it” instinct is exactly the trap with this one — ^(\w+\s?)+$ has nested quantifiers, which is the classic catastrophic-backtracking shape. Let me not guess at a rewrite; I’ll measure it and get a verified equivalent.Confirmed — it’s exponential on Python’s engine, not just slow. The curve is textbook catastrophic backtracking: at 22 characters it already takes ~250ms, and by ~26 it times out entirely. A crafted input like aaa...aa @ will hang the engine. Now let me get a verified equivalent replacement rather than hand you my own guess.Here’s your drop-in replacement:

^\w+(?:\s\w+)*\s?$

This was verified exactly equivalent to your original (DFA comparison, not just fuzzing) and measured linear on Python’s engine — so same matches, no backtracking blowup.

The fix is structural: your (\w+\s?)+ made the separator optional inside the repeated group, so a run like aaaa could be partitioned a combinatorial number of ways, and the engine tries all of them before failing. The rewrite matches one word up front, then repeats mandatory separator-plus-word pairs (?:\s\w+)*, with a single optional trailing space. There’s only one way to split any input, so the ambiguity that fed the backtracking is gone.

One caveat worth ten seconds: that verdict is for Python’s re. If you’re running this on a different backtracking engine (Node/V8, Java, PCRE, .NET) and want me to confirm it’s linear there too, say the word and I’ll run the cross-engine check.


Test 5

Engine-dependent. — Tests analyze_matrix and the backtracking-vs-RE2 split. ^(a+)+$ is textbook exponential on backtracking engines but linear on RE2. Passing: it checks the actual deployment engine and attributes the verdict per engine rather than generalizing.

Prompt

“We run our detection regexes through Go’s regexp engine. Is ^(a+)+b$ going to be a problem for us?”

Response

This is the textbook catastrophic-backtracking pattern, so let me actually measure it rather than eyeball it — and critically, measure it on the engine you actually ship on (Go/RE2), since the answer differs by engine. Let me load the tools.Short answer: no, not on Go. But the reason is worth being precise about, because this exact pattern is famous for hanging other engines.

I measured ^(a+)+b$ rather than eyeballing it. Go’s regexp package is an implementation of RE2, an automaton (DFA/NFA-simulation) engine that doesn’t backtrack, so the “exponentially many ways to partition the input” problem that the nested (a+)+ creates simply doesn’t exist there — every input is matched in linear time regardless of structure.

One honesty caveat: the literal go binary wasn’t available in the measurement harness, so I measured re2 directly — the same library Go’s standard library implements — plus a couple of backtracking engines for contrast. Here’s the growth on the classic evil string (aaaa…aab, no trailing b match) as it gets longer:

input length RE2 (≈ Go) Python re Node/V8
16 0.014 ms 0.44 ms 0.55 ms
22 0.014 ms 27 ms 32 ms
26 0.014 ms 467 ms timed out
30+ 0.014 ms timed out timed out

RE2 stays flat at ~0.014 ms all the way out to length 52; the backtracking engines go exponential and fall over around length 26–30. Correctness checked out too (matched ab/aaaab, rejected aaaa/aaaac), and the DFA compiles in about a kilobyte, so there’s no RE2 memory-blowup concern either.

So for your Go pipeline specifically, this pattern is fine — linear time, attacker can’t stall it.

The thing I’d flag: this safety is a property of the engine, not the pattern. The pattern itself is genuinely catastrophic. If any of these detection regexes ever get reused somewhere with a backtracking engine — a Python or Node preprocessor, a Java component, a PCRE-based tool, a teammate testing rules in their editor — ^(a+)+b$ becomes a live denial-of-service vector that a ~30-character input can trigger. If you’d like, I can produce a structurally hardened equivalent (verified for both correctness and linear growth across backtracking engines too) so the pattern is portable and not just safe-by-accident-of-Go.


Test 6

Polynomial, not exponential. — Tests whether it flags quadratic instead of waving it through as “not exponential, so fine.” Adjacent quantifiers with a failing tail are a good candidate; let the tool classify the degree. Passing: if the tool says polynomial, it reports the degree and the scaling risk rather than calling it safe.

Prompt

“Is ^\d+\d+!$ fast enough to run in a hot validation path?”

Response

I should measure this rather than eyeball it — adjacent \d+\d+ looks suspicious, but backtracking behavior can’t be reliably read off a pattern. Let me load the analysis tools.Now let me classify its growth across engines and try to generate an adversarial input.## Short answer: yes, it’s fast — it measures linear on every engine I could test

The adjacent \d+\d+ is a classic ReDoS smell, but the smell doesn’t pan out here. I pumped an adversarial input (a long run of digits ending in a non-! character, which forces the worst-case backtracking) and it didn’t time out anywhere:

  • Python re — linear, ~0.005ms → 0.02ms across the pump range
  • regex module — linear
  • RE2 (Go’s automaton engine) — linear
  • Node/V8 — linear

No engine showed super-linear growth. The reason the smell is harmless: \d+\d+ is two identical unbounded quantifiers in sequence with no alternation. When the match fails, the only ambiguity is where to split the digit run between the two groups — that’s O(n) backtracking positions, not the exponential blowup you get from nested or overlapping-alternation patterns like (\d+)+. So it’s fine for a hot path.

Two caveats worth flagging

1. I couldn’t test three engines. Go, Java, and PCRE2 weren’t installed in this environment, so the overall verdict came back unverified rather than fully green. Java’s backtracking engine in particular is where I’d want confirmation before calling it bulletproof everywhere. If your validation path runs on the JVM or PCRE2, treat the above as “very likely fine” rather than “proven.”

2. The pattern is redundant, and that’s the real reason to change it. ^\d+\d+!$ matches “two or more digits, then !” — the correctness check confirmed it accepts 12!, 00!, 123456789! and rejects 1! and 12. The two \d+ express nothing that one quantifier can’t. A reviewer or a ReDoS linter will keep flagging the adjacency forever even though it’s benign, so I’d rewrite it to kill the smell:

^\d{2,}!$

Same language, no adjacent-quantifier smell, and it’s trivially linear on a backtracking and automaton engine — which also covers the Java/PCRE2 case I couldn’t measure. If your intent was actually “one or more digits then !” (and the doubled \d+ was a typo), then it’s just ^\d+!$.

So: the original is fast enough, but ^\d{2,}!$ is what I’d ship — it’s equivalent, it’s provably clean on engines I couldn’t test, and it stops tripping linters.


Test 7

Memory/DFA blowup, not backtracking. — Tests the second failure mode and the re2_memory path. Large counted repetition is a state-count concern.

Prompt

“Will ^(?:[A-Za-z0-9]{32}-){500}$ cause any trouble? It’s for matching a long delimited token list.”

Response

I’ll measure this rather than eyeball it. Let me generate proper test strings and run the analysis.The composite run already tells me most of what I need, but one of its “failures” was caused by my own test string (I fed it a 3-block example when the pattern demands exactly 500), so let me confirm the cross-engine picture cleanly.Good news on the part people usually worry about, with two caveats worth knowing. I tested it across Python’s re, the regex module, Node/V8, and RE2.

Is it a ReDoS risk? No. Every engine I could run it on classified it as linear, with flat timing curves (~0.03–0.25 ms regardless of input size, ~16.5 KB strings). No catastrophic-backtracking constructs were found. The reason it’s safe is structural: each block is a fixed width ({32}) and is followed by a mandatory literal -, so the engine never has an ambiguous boundary to backtrack over. This is the well-behaved cousin of the dangerous (a+)+-style patterns — the counted repetition here doesn’t create overlapping ways to partition the input.

Is it correct? Mostly — but check the trailing separator. With a proper 500-block input it matched as intended (exactly 500 repetitions of 32 alphanumerics + hyphen). But note the pattern requires a - after every block, including the last one, so a valid match ends in -. If your real token list is hyphen-delimited (separators only between tokens, no trailing hyphen), this regex will reject it. That’s the most likely actual bug here, and it’s a correctness issue, not a performance one. If you want exactly 500 tokens with 499 separators, you’d want something like ^[A-Za-z0-9]{32}(?:-[A-Za-z0-9]{32}){499}$.

One deployment caveat: the {500} count on automaton engines. My RE2 measurements were internally inconsistent — one code path ran it as linear, another rejected {500} as an over-limit repetition size. RE2 and Go’s regexp impose caps on counted-repetition expansion (and some builds set that limit low). Standard RE2/Go should accept {500} (their default ceiling is 1000), but because I got a contradictory signal, I’d verify it actually compiles in your specific build rather than assume it. On backtracking engines (Python, Node) there’s no such limit and it’s fine. I couldn’t test Java or PCRE2 (not installed here), so I can’t vouch for those specifically, though I’d expect no backtracking issue given the structure.

So: not a denial-of-service hazard. The things to double-check are the trailing-hyphen semantics and, if you deploy on RE2/Go, that the {500} count compiles.


Test 8

Should NOT trigger. — Pure syntax knowledge, no measurement needed. If the skill fires here, your description is too pushy.

Prompt

“What’s the difference between \b and \B in a regex?”

Response

In regular expressions, \b and \B are both zero-width anchors that deal with word boundaries — they match positions, not characters.

\b matches a word boundary: a position where a word character (\w, meaning [A-Za-z0-9_]) is next to a non-word character, or at the start/end of the string adjacent to a word character. It’s the transition point between “word” and “not-word.”

\B is its exact opposite — it matches a non-word boundary: any position that isn’t a word boundary. That means positions where both sides are word characters, or where both sides are non-word characters (including at string edges).

Here’s a quick way to see them in action with the string "hello world":

  • \bhello\b matches hello — there’s a boundary before h (start of string) and after o (before the space).
  • \Bllo matches the llo in hello because the position before llo sits between e and l, both word characters — not a boundary.
  • \bworld matches world (boundary after the space), but \Bworld would not match, since the position before w is a boundary.

A common real-world use: searching for a whole word. \bcat\b matches cat in "the cat sat" but not the cat inside "category" or "locate". Conversely, \Bcat\B matches cat only when it’s embedded inside a larger word, like the middle of "locating"… well, "scatter" — it matches the cat there because it’s surrounded by word characters on both sides.

One thing worth keeping in mind: the definition of a “word character” is usually limited to [A-Za-z0-9_] by default, so accented letters or non-Latin scripts may not behave as expected unless you enable Unicode mode (e.g., the u flag in JavaScript, or \b behavior in Python 3’s re with Unicode strings, which is the default).

Want me to show how these behave across a few specific examples or in a particular language?


Test 9

Clean control. — A genuinely linear pattern. Tests that it passes cleanly and reports what was measured without crying wolf.

Prompt

“Can you check this date regex for me? ^\d{4}-\d{2}-\d{2}$, should match 2026-06-26.”

Response

The skill is clear that I shouldn’t eyeball this — I need to measure it. Let me load the MCP tools and run the analysis with your example as a positive test case.Measured and verified — the regex is good on both axes.

Correct? Yes. It matches your example 2026-06-26, with no false positives or negatives in the test.

Safe? Yes. The NFA growth is linear, so there’s no catastrophic-backtracking (ReDoS) risk. The fixed {4}/{2}/{2} quantifiers and mandatory - separators leave no ambiguity for an engine to backtrack through — even an adversarial input like 1111-11-11a resolves in microseconds and stays flat as the input grows. It also compiles well within memory bounds.

One thing worth flagging, since it’s about what the pattern accepts rather than its safety: \d{2} only checks that the month and day are two digits, not that they’re valid calendar values. So it will happily accept 2026-99-99 or 2026-00-00. If you only need to validate the shape of an ISO date string, this is exactly right as written. If you need real calendar validity (month 01–12, day 01–31), the regex would need to be tighter, or better, parse it with a date library after the format check. Want me to put together a stricter version?


A final twist: vulnerability isn’t even a property of the regex

One more wrinkle, because it matters for anyone shipping these patterns into real systems. Whether a generated regex is “vulnerable” is not a property of the regex alone; it’s the regex times the engine. The same exponential pattern that hangs Python’s re or Node’s V8 engine runs in linear time on RE2, on Hyperscan, and surprisingly on modern Perl, whose engine has accumulated aggressive anti-backtracking defenses despite Perl’s historical reputation as the ReDoS poster child. Even plain grep won’t blow up, because it uses a backtracking-free engine by design.

That cuts both ways. A model can hand you a regex that is genuinely dangerous on the engine your service runs and completely inert on the one you tested it with. If your detection content executes on a PCRE-family backtracker, a model’s “passes all tests” verdict tells you nothing about whether it will survive contact with a crafted input. The pattern, the engine, and the threat model have to be evaluated together, which is precisely the kind of cross-cutting, execute-it-and-see reasoning that a fluent next-token predictor, left to itself, was never built to do.


The takeaway for practitioners is blunt: treat any regex from a language model the way you’d treat one from a stranger’s gist. Test it for correctness against both positive and negative cases, test it for catastrophic backtracking against adversarial input, and know which engine it will actually run on. The model is a fast, fluent first draft. The checker is what makes it safe.