August 8, 2024

Best LLM Leaderboards: A Comprehensive List

Discover the best LLM leaderboards of 2025, from Vellum and LLM-Stats to MCP-Universe and Prompt-to-Leaderboard. See how GPT-5, Grok-4, and new safety benchmarks are shaping evaluation trends this year. Last updated: September 2025.

Leaderboards have become one of the main ways to measure and compare large language models (LLMs). They help researchers, enterprises, and regulators understand how different models perform across tasks such as reasoning, coding, compliance, and multilingual capabilities.

This guide reviews the most important leaderboards of 2025 and the specialized ones that continue to shape model evaluation.

Top LLM leaderboards in 2025

Vellum.ai LLM Leaderboard

The Vellum leaderboard tracks the newest models released after April 2024. It compares reasoning, context length, cost, and accuracy on cutting-edge benchmarks like GPQA Diamond and AIME.

Open LLM Leaderboard (Vellum)

Vellum’s open-source leaderboard highlights top-performing community models, with updated scores for reasoning and problem-solving.

LLM-Stats (Verified AI)

LLM-Stats updates daily, showing speed, context window, pricing, and performance for models like GPT-5, Grok-4, and Gemini 2.5 Pro.

ScaleAI SEAL

ScaleAI’s SEAL Leaderboard uses private datasets and expert reviews to compare the robustness and reliability of frontier models.

LiveBench

LiveBench tests models monthly with contamination-free benchmarks, emphasizing reasoning, coding, and math.

Prompt-to-Leaderboard (P2L)

P2L introduces prompt-specific rankings. Instead of broad scores, it evaluates how models perform on individual prompts — useful for routing in production.
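
Prompt-level scores lend themselves naturally to routing logic. As a loose illustration (not P2L’s actual data or API), the sketch below picks a model from a hypothetical table of per-prompt-category scores:

```python
# Hypothetical sketch of prompt-level routing in the spirit of P2L.
# The model names, categories, and scores are illustrative placeholders,
# not real P2L output or an official API.
PROMPT_SCORES = {
    "summarize legal contract": {"model-a": 0.81, "model-b": 0.74},
    "write python unit tests":  {"model-a": 0.69, "model-b": 0.88},
}

def route(prompt_category: str, default: str = "model-a") -> str:
    """Pick the highest-scoring model for a prompt category, with a fallback."""
    scores = PROMPT_SCORES.get(prompt_category)
    if not scores:
        return default  # no prompt-level signal, fall back to a default model
    return max(scores, key=scores.get)

print(route("write python unit tests"))  # -> model-b
```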

Libra-Leaderboard

Libra is designed to balance raw capability with safety and alignment, reflecting the growing regulatory focus on responsible AI.

MCP-Universe

MCP-Universe benchmarks LLMs in application-specific domains like finance, 3D design, and web browsing. Even frontier models such as GPT-5, Grok-4, and Claude-4.0 are challenged here.

Other leaderboards

LMSYS Chatbot Arena

Chatbot Arena compares models with head-to-head human preference voting and Elo ranking. It remains influential for conversational quality.
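
The ranking idea is simple: every human vote nudges the two models’ ratings toward the observed outcome. The snippet below is a minimal Elo update for a single vote; the K-factor and starting ratings are illustrative defaults, not the Arena’s exact parameters (its published rankings now also rely on Bradley-Terry-style statistics).

```python
# Minimal Elo update for one head-to-head vote between two models.
# K-factor and starting ratings are illustrative, not Chatbot Arena's parameters.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after a single vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

print(elo_update(1000.0, 1000.0, a_wins=True))  # -> (1016.0, 984.0)
```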

Hugging Face Open LLM Leaderboard

The Hugging Face leaderboard evaluates open models using the EleutherAI harness. It continues to shape the open-source ecosystem.

MTEB (Massive Text Embedding Benchmark)

The MTEB leaderboard benchmarks text embedding models across 58 datasets and 112 languages. It is the main reference for embeddings.
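
To make the task concrete: embedding benchmarks largely score how well a model’s vectors rank relevant texts close to a query. The toy example below uses random vectors as stand-ins for real embeddings, so the numbers mean nothing; it only shows the shape of the computation.

```python
# Toy illustration of the comparison embedding benchmarks score:
# rank candidate texts by cosine similarity to a query embedding.
# The vectors here are random stand-ins, not outputs of a real embedding model.
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=384)             # pretend query embedding
candidates = rng.normal(size=(3, 384))   # pretend document embeddings

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query, c) for c in candidates]
print(sorted(range(len(scores)), key=lambda i: -scores[i]))  # best-first ranking
```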

OpenCompass: CompassRank

OpenCompass’s CompassRank is a versatile evaluation hub, popular in Asia, that tests multilingual models and compliance-sensitive tasks.

EQ-Bench

EQ-Bench measures emotional intelligence with over 170 prompts, offering a lens into social reasoning and empathy.

Berkeley function-calling leaderboard

The Berkeley leaderboard focuses on structured outputs and function calling, a key feature for enterprise copilots.
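
For context, function-calling benchmarks typically check whether a model can turn a user request into a well-formed call against a JSON-schema tool definition. The example below uses an OpenAI-style tool spec with a made-up tool name and fields, purely for illustration.

```python
# Hypothetical example of the structured tool definition style that
# function-calling benchmarks exercise (OpenAI-style JSON schema).
# The tool name and parameters are invented for illustration.
get_invoice_tool = {
    "type": "function",
    "function": {
        "name": "get_invoice",
        "description": "Fetch an invoice by its identifier.",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_id": {"type": "string", "description": "Invoice identifier"},
            },
            "required": ["invoice_id"],
        },
    },
}
# A model is scored on whether it emits a well-formed call such as:
# {"name": "get_invoice", "arguments": {"invoice_id": "INV-1042"}}
```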

CanAiCode leaderboard

CanAiCode benchmarks smaller code-focused LLMs, providing visibility into text-to-code performance.

Open multilingual LLM evaluation leaderboard

This leaderboard evaluates models in 29 languages on benchmarks such as TruthfulQA and ARC, expanding global coverage.

AlpacaEval leaderboard

AlpacaEval ranks instruction-following models against GPT-4 references, offering a quick check for smaller models.
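
The headline metric is essentially a win rate: how often a judge prefers the candidate model’s answer over the reference answer. A toy calculation (with made-up verdicts, and ties counted as half a win, one common convention) looks like this:

```python
# Toy win-rate calculation over judge verdicts; the verdict list is made up.
# Ties are counted as half a win here, one common convention.
verdicts = ["model", "reference", "model", "tie", "model"]

wins = sum(v == "model" for v in verdicts) + 0.5 * sum(v == "tie" for v in verdicts)
win_rate = wins / len(verdicts)
print(f"Win rate vs. reference: {win_rate:.0%}")  # -> 70%
```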

UGI leaderboard

The Uncensored General Intelligence leaderboard evaluates how models handle sensitive or uncensored content with confidential datasets.

Conclusion

Leaderboards in 2025 are more dynamic, specialized, and tied to real-world enterprise needs. Vellum, LLM-Stats, LiveBench, and MCP-Universe set the pace for evaluating reasoning, speed, and safety at scale. At the same time, focused leaderboards for embeddings, function-calling, or emotional intelligence continue to provide critical signals for specific use cases.

For enterprises, technical benchmarks only tell part of the story. Pairing leaderboard insights with user analytics ensures you know not just how models score, but how people actually use them.

About Nebuly

Nebuly is the user analytics platform for GenAI. We help enterprises connect leaderboard results with real-world adoption, tracking user prompts, satisfaction, and risks across AI products.

See how user analytics complements benchmarking: book a demo with us.
