Aider LLM Leaderboards

Website: https://aider.chat/docs/leaderboards/

The Aider LLM Leaderboards specifically evaluate a model’s ability to edit and refactor code.

The Code Editing benchmark asks the LLM to edit Python source files to complete 133 small coding exercises from Exercism. This measures the LLM's coding ability and whether it can write new code that integrates into existing code. The model also has to successfully apply all of its changes to the source file without human intervention.
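
To illustrate how a single exercise might be marked pass or fail, here is a minimal sketch of a scoring step: apply the model's edited file to the exercise directory and run the exercise's unit tests. The paths, helper name, and pytest invocation are illustrative assumptions, not aider's actual benchmark harness.

```python
import shutil
import subprocess
from pathlib import Path


def score_exercise(exercise_dir: Path, edited_file: Path) -> bool:
    """Hypothetical sketch: copy the model-edited solution into the
    exercise directory, then run that exercise's unit tests with pytest.
    Returns True only if every test passes with no human intervention."""
    target = exercise_dir / edited_file.name
    shutil.copy(edited_file, target)  # apply the model's edit to the exercise

    result = subprocess.run(
        ["python", "-m", "pytest", str(exercise_dir), "-q"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # exit code 0 means all tests passed
```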

The Code Refactoring benchmark asks the LLM to refactor 89 large methods from large Python classes. This is a more challenging benchmark that tests the model's ability to output long chunks of code without skipping sections or making mistakes.
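
To make the failure mode concrete, the sketch below shows one naive way such truncation could be detected: checking that the refactored file still parses and does not contain placeholder text standing in for elided code. The marker strings and heuristic are assumptions for illustration, not the benchmark's actual scoring method.

```python
import ast

# Hypothetical elision markers; the real benchmark verifies the refactor by
# applying the edit, not by scanning for these strings.
ELISION_MARKERS = ("...", "# rest of", "# remaining code", "# unchanged")


def looks_truncated(refactored_source: str) -> bool:
    """Return True if the refactored file appears to skip sections."""
    try:
        ast.parse(refactored_source)  # the output must still be valid Python
    except SyntaxError:
        return True
    lowered = refactored_source.lower()
    return any(marker in lowered for marker in ELISION_MARKERS)
```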