As the capabilities of Large Language Models (LLMs) continue to expand, the need for rigorous evaluation methods becomes increasingly critical. This workshop explores what the machine learning community can learn from psychometrics, specifically Item Response Theory (IRT) and Classical Test Theory (CTT), to strengthen the benchmarking of LLMs. We will also discuss potential pitfalls and critically examine how well existing psychometric methods transfer to LLMs.
The workshop will begin with a theoretical and practical introduction to LLMs, including hands-on coding examples that demonstrate how to prompt and finetune these models efficiently, with a focus on reducing memory requirements. Participants will then learn how to administer current benchmarks and evaluate LLM responses. Finally, we will analyze existing benchmarks with psychometric tools; the sketches below give a flavor of both the LLM handling and the analysis.
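As a preview of the hands-on portion, here is a minimal sketch of memory-efficient prompting. It assumes the Hugging Face transformers, bitsandbytes, and accelerate packages plus a CUDA GPU; the model name is only a placeholder, not necessarily the model distributed in the workshop.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

    # Quantize weights to 4 bits at load time to reduce GPU memory use.
    quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                      bnb_4bit_compute_dtype=torch.float16)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=quant_config, device_map="auto"
    )

    # Prompt the model with a single benchmark-style item.
    prompt = "Question: What is 2 + 2?\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Parameter-efficient finetuning methods (for example, LoRA adapters) follow the same pattern of loading a quantized base model and training only a small number of additional weights.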
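Similarly, the psychometric analysis can be previewed with a few lines of Classical Test Theory: the sketch below computes item difficulty, corrected item-total correlations, and Cronbach's alpha from a matrix of scored (0 = incorrect, 1 = correct) LLM responses. The data are purely illustrative and only NumPy is assumed.

    import numpy as np

    # Rows are LLMs (test takers), columns are benchmark items; 1 = correct, 0 = incorrect.
    scores = np.array([
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 0],
    ])
    n_items = scores.shape[1]

    # CTT item difficulty: the proportion of models answering each item correctly.
    difficulty = scores.mean(axis=0)

    # Item discrimination: corrected item-total correlation
    # (how well each item separates stronger from weaker models).
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(n_items)
    ])

    # Cronbach's alpha: internal consistency of the benchmark as a whole.
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = total.var(ddof=1)
    alpha = n_items / (n_items - 1) * (1 - item_variances / total_variance)

    print("difficulty:      ", difficulty)
    print("discrimination:  ", discrimination)
    print("Cronbach's alpha:", round(alpha, 3))

IRT goes beyond these descriptive statistics by fitting a latent-variable model of examinee ability and item parameters to the same kind of scored response matrix.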
By the end of this workshop, attendees will have a foundational understanding of how LLMs work and how to effectively administer benchmarks for their evaluation. They will also learn how psychometric tools can offer insights into LLM performance and gain an awareness of the challenges involved in applying these methods.
This session is ideal for machine learning researchers and practitioners looking to adopt or refine psychometric techniques in their work with LLMs, as well as for psychometric or econometric researchers interested in an introduction to LLMs. Examples and distributed code will be in Python and R.
Please bring a laptop with Python (version 3.10+) and R installed for the exercises. We will send an email before the event with more information about the required packages.