LiveBench.ai

Website: https://livebench.ai
LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties: it limits potential contamination by releasing new questions...
LiveCodeBench

Website: https://livecodebench.github.io/leaderboard.html
Research Paper: https://arxiv.org/abs/2403.07974
LiveCodeBench is an evaluation framework to assess an LLM’s coding ability. “LiveCodeBench provides holistic and contamination-free...
Aider LLM Leaderboards

Website: https://aider.chat/docs/leaderboards/
The Aider LLM Leaderboards specifically evaluate a model’s ability to edit and refactor code. The Code Editing benchmark asks the LLM to edit Python source files to complete 133 small...
EQ-Bench

Website: https://eqbench.com/
EQ-Bench is an LLM benchmark framework to measure emotional intelligence. “Why emotional intelligence? One reason is that it represents a subset of abilities that are important for the user experience, and which isn’t...
HELM Leaderboards

Website: https://crfm.stanford.edu/helm/lite/latest/#/leaderboard
More info: https://github.com/stanford-crfm/helm
HELM (Holistic Evaluation of Language Models) is a transparent and open framework for evaluating LLMs. It was created by the Center...