The use of established personality inventories to evaluate large language models (LLMs) such as GPT-4 or LLaMA-2 has gained traction in recent research. However, this study presents compelling evidence that such psychometric instruments are not directly transferable to artificial systems.
Specifically, the authors demonstrate that LLMs frequently agree with semantically opposing items (e.g., endorsing both introversion and extraversion) and fail to reproduce the factor structures typically observed in human personality data, such as the Big Five.
Drawing on key concepts from psychometrics, especially measurement invariance and construct validity, the authors argue that personality tests designed for humans cannot be meaningfully applied to LLMs without extensive adaptation. The paper calls for a cautious, theoretically grounded approach to assessing latent traits in artificial agents.
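The inconsistency described above, a model endorsing both poles of a trait, can be illustrated with a minimal sketch. The item wordings, response scores, and helper function below are hypothetical and are not taken from the paper; they merely show how such contradictions on reverse-keyed item pairs could be flagged:

```python
# Sketch: flagging contradictory answers on reverse-keyed item pairs.
# Item texts and scores are hypothetical illustrations, not data from
# the paper. Responses use a 1-5 Likert scale (3 = neutral midpoint).

REVERSE_KEYED_PAIRS = [
    ("I am the life of the party.",           # extraversion-keyed
     "I prefer to keep in the background."),  # introversion-keyed
    ("I get stressed out easily.",            # neuroticism-keyed
     "I am relaxed most of the time."),       # reverse-keyed
]

def inconsistent_pairs(responses, midpoint=3):
    """Return pairs where BOTH opposing items were endorsed (> midpoint)."""
    flagged = []
    for item, reversed_item in REVERSE_KEYED_PAIRS:
        if responses[item] > midpoint and responses[reversed_item] > midpoint:
            flagged.append((item, reversed_item))
    return flagged

# Hypothetical model answers endorsing both poles of the first pair:
answers = {
    "I am the life of the party.": 5,
    "I prefer to keep in the background.": 4,
    "I get stressed out easily.": 2,
    "I am relaxed most of the time.": 4,
}
print(len(inconsistent_pairs(answers)))  # prints 1: one contradictory pair
```

In human respondents, reverse-keyed pairs typically correlate negatively; simultaneous endorsement of both poles, as in this sketch, is exactly the kind of response pattern that undermines construct validity when the instrument is applied to an LLM.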
Literature
Sühr, T., Dorner, F. E., Samadi, S., & Kelava, A. (2023). Challenging the validity of personality tests for large language models. arXiv preprint, arXiv:2311.18351. https://doi.org/10.48550/arXiv.2311.18351