Every major foundation model lab now runs red-teaming programs before release. This is progress: a few years ago, systematic adversarial testing of AI systems was a niche practice. Today it's expected. But the way most red-teaming programs are structured limits their effectiveness in ways that are worth examining.
The Homogeneity Problem
The most common structural flaw in LLM red-teaming is annotator homogeneity. Red-team programs that draw primarily from a single demographic, cultural, or linguistic background will systematically miss the attack vectors that don't occur to that group. A red-teamer who has only ever lived in a Western, English-speaking context is not well-positioned to discover culturally specific jailbreaks that work in Hausa, or harm patterns that are obvious to someone with a different religious or political frame of reference.
This failure mode has been documented extensively in the context of facial recognition and NLP toxicity classifiers. The same dynamic applies to red-teaming: diverse red-team cohorts find meaningfully different vulnerabilities than homogeneous ones.
Multi-Turn Evasion Is Underweighted
Single-turn red-teaming (crafting a single prompt designed to elicit a harmful response) catches the obvious cases. What it misses is multi-turn evasion: building up conversational context that shifts the model's behaviour incrementally until it produces outputs it would have refused if asked directly at the start.
Multi-turn evasion is harder to red-team because it requires more time and a more sophisticated prompting strategy. But it's also more representative of how adversarial users actually interact with deployed models. Red-teaming programs that weight single-turn testing heavily are optimising for the wrong threat model.
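To make this concrete, here is a minimal sketch of what a multi-turn test harness might look like. The `query_model` client and the `looks_like_refusal` check are hypothetical placeholders, not any particular lab's API; a real harness would plug in its own inference endpoint and a proper refusal classifier.

```python
# Minimal sketch of a multi-turn evasion probe. `query_model` and
# `looks_like_refusal` are hypothetical placeholders, not a real API:
# a real harness would call its own inference endpoint and a proper
# refusal classifier.

from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Accumulates the message history for one multi-turn probe."""
    messages: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})


def query_model(messages: list) -> str:
    """Placeholder for the model under test (assumed interface)."""
    raise NotImplementedError("Wire this to your inference endpoint.")


def looks_like_refusal(response: str) -> bool:
    """Crude keyword heuristic; a production harness would use a classifier."""
    return any(m in response.lower() for m in ("i can't", "i cannot", "i won't"))


def run_multi_turn_probe(escalation_steps: list[str]) -> dict:
    """Plays a scripted escalation sequence and records where, if anywhere,
    the model shifts from refusal to compliance."""
    convo = Conversation()
    transcript = []
    for turn, prompt in enumerate(escalation_steps, start=1):
        convo.add("user", prompt)
        response = query_model(convo.messages)
        convo.add("assistant", response)
        transcript.append({
            "turn": turn,
            "prompt": prompt,
            "refused": looks_like_refusal(response),
        })
    # The probe "evades" if the final, most sensitive request is not refused.
    return {"transcript": transcript, "evaded": not transcript[-1]["refused"]}
```

The useful comparison is against a single-turn control: running only the final step of the escalation in a fresh conversation shows whether the accumulated context, rather than the prompt itself, is what shifted the model's behaviour.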
What a More Effective Program Looks Like
Effective LLM red-teaming programs combine four elements: diverse cohorts (across language, culture, domain expertise, and adversarial sophistication); multi-turn test protocols; structured findings documentation (not just "the model said this" but "this is the attack pattern and this is why it worked"); and iterative cycles (red-teaming once at release is not the same as maintaining a continuous evaluation capability).
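To make "structured findings documentation" concrete, here is one possible shape for a findings record. The field names are illustrative assumptions, not any lab's actual schema.

```python
# One possible shape for a structured findings record. Field names are
# illustrative assumptions, not any lab's actual schema; the point is to
# capture the attack pattern and the hypothesised reason it worked, not
# just the offending output.

from dataclasses import dataclass, field
from datetime import date


@dataclass
class RedTeamFinding:
    finding_id: str                  # stable identifier for tracking across cycles
    date_found: date
    attack_pattern: str              # e.g. "incremental persona shift over several turns"
    languages: list[str]             # languages in which the attack reproduced
    turns_to_evasion: int            # 1 for single-turn attacks
    model_output_summary: str        # what the model produced, summarised
    why_it_worked: str               # analyst's hypothesis about the failure mode
    reproduction_steps: list[str]    # prompts, or a pointer to the transcript
    severity: str                    # per the program's own severity rubric
    status: str = "open"             # open / mitigated / regression-tested
    tags: list[str] = field(default_factory=list)
```

Records shaped like this are what make iterative cycles workable: a mitigated finding can be replayed as a regression test against the next model version instead of being rediscovered from scratch.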
The labs doing this well treat red-teaming as a recurring operational capability, not a pre-release checkbox.
