Stance Is Not a Construct: LLM Validity Gaps in Annotation
Working Paper
Large language models are fast becoming political science’s default coders, and on simple labels they agree with humans often enough to seem trustworthy. Yet the constructs we actually care about are layered and multidimensional, and validation almost never looks beneath the top label. So what is an LLM’s annotation measuring: the theoretical construct, or a surface reading of the text? I evaluate open-source LLMs on a hand-annotated corpus of Turkish tweets about anti-Americanism, scoring them both on overall stance and on the theoretical dimensions underneath it. The result is a consistent construct validity gap: models detect stance reasonably well but miss much of the dimensional structure beneath it, and clever prompting does not close the gap. I turn this finding into a practical pre-deployment diagnostic for anyone using LLM annotation in a measurement pipeline.