Stance Validity

Stance Is Not a Construct: LLM Validity Gaps in Annotation

Working Paper

Large language models are fast becoming political science’s default coders, and on simple labels they agree with humans often enough to seem trustworthy. Yet the constructs we actually care about are layered and multidimensional, and validation almost never looks beneath the top label. So what is an LLM’s annotation measuring: the theoretical construct, or a surface reading of the text? I evaluate open-source LLMs on a hand-annotated corpus of Turkish tweets about anti-Americanism, scoring them both on overall stance and on the theoretical dimensions underneath it. The result is a consistent construct validity gap: models detect stance reasonably well but miss much of the dimensional structure beneath it, and clever prompting does not close the gap. I turn this finding into a practical pre-deployment diagnostic for anyone using LLM annotation in a measurement pipeline.
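To make the idea of a validity gap concrete, here is a minimal sketch of one way such a diagnostic could be computed. It is an illustration under stated assumptions, not the procedure used in the paper: the agreement metric (Cohen's kappa), the function names `cohen_kappa` and `validity_gap`, the dimension names `dim_a` and `dim_b`, and the toy annotations are all hypothetical. The sketch compares human and model agreement on the top-level stance label with mean agreement on the dimensions beneath it.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Share of items on which the two coders agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        # Both coders used the same single label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def validity_gap(human, model, stance_key="stance"):
    """Stance-level kappa minus the mean kappa over the other dimensions."""
    dims = [k for k in human[0] if k != stance_key]
    stance = cohen_kappa(
        [h[stance_key] for h in human], [m[stance_key] for m in model]
    )
    per_dim = {
        d: cohen_kappa([h[d] for h in human], [m[d] for m in model])
        for d in dims
    }
    mean_dim = sum(per_dim.values()) / len(per_dim)
    return {
        "stance_kappa": stance,
        "dimension_kappa": per_dim,
        "gap": stance - mean_dim,
    }


# Toy annotations (invented for illustration, not data from the paper).
human = [
    {"stance": "anti", "dim_a": 1, "dim_b": 0},
    {"stance": "anti", "dim_a": 0, "dim_b": 1},
    {"stance": "neutral", "dim_a": 0, "dim_b": 0},
    {"stance": "neutral", "dim_a": 1, "dim_b": 1},
]
model = [
    {"stance": "anti", "dim_a": 1, "dim_b": 1},
    {"stance": "anti", "dim_a": 1, "dim_b": 1},
    {"stance": "neutral", "dim_a": 0, "dim_b": 0},
    {"stance": "neutral", "dim_a": 0, "dim_b": 0},
]

result = validity_gap(human, model)
print(result)
```

In this toy case the model matches the human coder on every stance label yet agrees on the two dimensions no better than chance, which is the pattern a top-label-only validation would miss. A large positive `gap` on a held-out, hand-annotated sample would be a signal to inspect the dimensional labels before deploying the annotator.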

Concept: Measurement validity in LLM annotation
Methods: LLM evaluation across open-source models and prompt designs
Data: Hand-annotated Turkish-language tweets on anti-Americanism
Presented: MPSA 2026 · PolMeth and APSA 2026 upcoming

Y. Emre Tapan
