Leaderboards have become one of the main ways to measure and compare large language models (LLMs). They help researchers, enterprises, and regulators understand how different models perform across areas such as reasoning, coding, compliance, and multilingual use.
This guide reviews the most important leaderboards of 2025 and the specialized ones that continue to shape model evaluation.
Top LLM leaderboards in 2025
Vellum.ai LLM Leaderboard
The Vellum leaderboard tracks the newest models released after April 2024. It compares reasoning, context length, cost, and accuracy on cutting-edge benchmarks like GPQA Diamond and AIME.
Open LLM Leaderboard (Vellum)
Vellum’s open-source leaderboard highlights top-performing community models, with updated scores for reasoning and problem-solving.
LLM-Stats (Verified AI)
LLM-Stats updates daily, showing speed, context window, pricing, and performance for models like GPT-5, Grok-4, and Gemini 2.5 Pro.
ScaleAI SEAL
ScaleAI’s SEAL Leaderboard uses private datasets and expert reviews to compare the robustness and reliability of frontier models.
LiveBench
LiveBench tests models monthly with contamination-free benchmarks, emphasizing reasoning, coding, and math.
Prompt-to-Leaderboard (P2L)
P2L introduces prompt-specific rankings. Instead of one broad aggregate score, it evaluates how models perform on individual prompts, which makes it useful for routing requests to the best-suited model in production.
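To make the routing idea concrete, here is a minimal, purely illustrative sketch in Python. The category names, model names, and ranking table are hypothetical placeholders, not real P2L data.

```python
# Conceptual sketch of prompt-aware routing: pick the model that a
# prompt-level leaderboard ranks highest for prompts like the incoming one.
# All names and rankings below are hypothetical, for illustration only.

PROMPT_CATEGORY_RANKINGS = {
    # category -> models ordered from strongest to weakest (illustrative)
    "code_generation": ["model-a", "model-b", "model-c"],
    "legal_summarization": ["model-b", "model-a", "model-c"],
    "casual_chat": ["model-c", "model-a", "model-b"],
}

def categorize(prompt: str) -> str:
    """Toy keyword classifier; in practice this would be a learned router."""
    text = prompt.lower()
    if "def " in text or "function" in text:
        return "code_generation"
    if "contract" in text or "clause" in text:
        return "legal_summarization"
    return "casual_chat"

def route(prompt: str) -> str:
    """Return the top-ranked model for this prompt's category."""
    category = categorize(prompt)
    return PROMPT_CATEGORY_RANKINGS[category][0]

print(route("Write a function that parses CSV files"))  # -> model-a
```

In a real deployment the categorizer would itself be a learned model and the ranking table would be refreshed from leaderboard data over time.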
Libra-Leaderboard
Libra is designed to balance raw capability with safety and alignment, reflecting the growing regulatory focus on responsible AI.
MCP-Universe
MCP-Universe benchmarks LLMs in application-specific domains like finance, 3D design, and web browsing. Even frontier models such as GPT-5, Grok-4, and Claude-4.0 are challenged here.
Other leaderboards
LMSYS Chatbot Arena
Chatbot Arena compares models through head-to-head human preference voting and Elo-style rankings. It remains influential for judging conversational quality.
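The arena's published rating methodology has evolved over time (toward statistical models such as Bradley-Terry), but the basic mechanics of pairwise preference ratings can be sketched with a plain Elo update. The starting ratings and K-factor below are illustrative, not Chatbot Arena's actual parameters.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a head-to-head vote.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1000; model A wins one preference vote.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```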
Hugging Face Open LLM Leaderboard
The Hugging Face leaderboard evaluates open models with EleutherAI’s lm-evaluation-harness. It continues to shape the open-source ecosystem.
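If you want to reproduce this style of evaluation locally, a rough sketch with the lm-eval Python package looks like the following. The exact API, task names, and the example model are assumptions that may differ across harness versions.

```python
# Sketch of running EleutherAI's lm-evaluation-harness locally
# (pip install lm-eval). API details and task names vary by version,
# and the model below is just an example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```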
MTEB (Massive Text Embedding Benchmark)
The MTEB leaderboard benchmarks text embedding models across 58 datasets and 112 languages. It is the main reference for comparing embedding models.
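Running a model against MTEB tasks yourself is straightforward with the mteb package. The snippet below is a sketch using a small SentenceTransformers model and a single classification task; the exact API can differ between mteb releases.

```python
# Sketch of evaluating an embedding model with the mteb package
# (pip install mteb sentence-transformers). The chosen model and task
# are examples; newer mteb versions may expose a slightly different API.
from sentence_transformers import SentenceTransformer
from mteb import MTEB

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```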
OpenCompass: CompassRank
OpenCompass is a versatile evaluation hub, widely used in Asia, that tests multilingual models and compliance-sensitive tasks.
EQ-Bench
EQ-Bench measures emotional intelligence with over 170 prompts, offering a lens into social reasoning and empathy.
Berkeley function-calling leaderboard
The Berkeley leaderboard focuses on structured outputs and function calling, a key feature for enterprise copilots.
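As a rough illustration of what such evaluations exercise, the snippet below sends an OpenAI-style tool schema with a request and inspects the returned tool call. The get_weather function and the model name are hypothetical examples, not part of the Berkeley benchmark itself.

```python
# Illustration of the kind of task function-calling leaderboards measure:
# given a JSON-schema tool definition, does the model emit a well-formed call?
# Uses the OpenAI Python SDK; the tool and model here are placeholder examples.
from openai import OpenAI

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris in celsius?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```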
CanAiCode leaderboard
CanAiCode benchmarks smaller code-focused LLMs, providing visibility into text-to-code performance.
Open multilingual LLM evaluation leaderboard
This leaderboard evaluates models in 29 languages on translated benchmarks such as ARC and TruthfulQA, expanding global coverage.
AlpacaEval leaderboard
AlpacaEval ranks instruction-following models against GPT-4 references, offering a quick check for smaller models.
UGI leaderboard
The Uncensored General Intelligence leaderboard evaluates how models handle sensitive or uncensored content with confidential datasets.
Conclusion
Leaderboards in 2025 are more dynamic, specialized, and tied to real-world enterprise needs. Vellum, LLM-Stats, LiveBench, and MCP-Universe set the pace for evaluating reasoning, speed, and safety at scale. At the same time, focused leaderboards for embeddings, function-calling, or emotional intelligence continue to provide critical signals for specific use cases.
For enterprises, technical benchmarks only tell part of the story. Pairing leaderboard insights with user analytics ensures you know not just how models score, but how people actually use them.
About Nebuly
Nebuly is the user analytics platform for GenAI. We help enterprises connect leaderboard results with real-world adoption, tracking user prompts, satisfaction, and risks across AI products.
See how user analytics complements benchmarking: book a demo with us.