Large Language Models (LLMs) are often evaluated on data similar to what they saw during training. Sometimes the test data even ends up in the training set without the knowledge of the people training the model (a problem known as data contamination). This can lead to inflated or biased results.
The Elo score, borrowed from chess rankings, compares two models head to head. This approach is fairer because it doesn't rely on a fixed test set that the models may have already seen. Models are ranked by direct pairwise comparisons, typically judged by humans, which gives a clearer picture of how they perform against each other than raw benchmark scores do.
In areas like programming, Elo scores can also help us understand which models are more effective.
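For intuition, here is a minimal sketch of the standard Elo update rule applied to a single model-vs-model comparison. The function names, starting ratings, and K-factor are illustrative assumptions, not the exact parameters used by any particular leaderboard.

```python
# Minimal Elo update sketch (illustrative; the K-factor of 32 and the starting
# rating of 1000 are assumptions, not values from any specific leaderboard).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one head-to-head comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one comparison.
a, b = update_elo(1000, 1000, score_a=1.0)
print(round(a), round(b))  # 1016 984
```

Repeated over many crowd-sourced matchups, these small updates converge to a relative ranking without needing a static benchmark dataset.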
Links to leaderboards
- LMSys Chatbot Arena Leaderboard - note that you can choose different categories (Overall, Coding, English, etc.)
- EvalPlus Leaderboard - evaluates LLMs on code generation
- TheFastest.ai - latency benchmarks
- Berkeley Function Calling Leaderboard (aka Berkeley Tool Calling Leaderboard)