SWE-Perf: TikTok’s Groundbreaking Benchmark for Repository-Level Code Performance Optimization
Introduction
As AI models increasingly integrate into the software development workflow, a new frontier emerges: performance optimization at the repository level. Traditional benchmarks have focused on code correctness, bug fixing, or single-function improvements. However, optimizing large-scale, real-world codebases for performance remains a largely uncharted challenge for large language models (LLMs). Addressing this gap, researchers from TikTok have introduced SWE-Perf, the first benchmark explicitly designed to evaluate how well LLMs can optimize code performance across entire repositories.

Unlike prior efforts that target small-scale transformations, SWE-Perf aims to systematically evaluate LLMs on complex, multi-file, context-rich tasks, bringing us closer to a future where AI agents contribute meaningfully to production-grade software performance improvements.
Why Performance Optimization Needs a New Benchmark
While recent years have seen major breakthroughs in code generation, bug fixing, and refactoring thanks to LLMs, performance optimization remains a harder nut to crack, especially in large, modular repositories. Optimizing performance isn’t about isolated syntax improvements; it demands a deep understanding of:
- Cross-function dependencies
- Systemic bottlenecks
- Data access patterns
- Computational complexity
Most benchmarks today (such as SWE-Bench, CodeContests, or HumanEval) assess correctness or function-level improvements, which only partially reflect the true demands of software performance optimization in real-world development environments.
SWE-Perf fills this gap by shifting the evaluation to the repository level, introducing tests that assess how well models can detect, reason about, and fix performance bottlenecks in a complex system.
How SWE-Perf Is Built: A Closer Look at the Dataset
The SWE-Perf benchmark is constructed from over 100,000 GitHub pull requests, filtered to identify real-world patches that resulted in statistically significant performance improvements. After rigorous vetting and validation, the final dataset includes the following (a hypothetical instance record is sketched after the list):

- 140 curated optimization examples across 9 open-source repositories
- Both pre- and post-optimization codebases
- Oracle-level and repo-level target function annotations
- Verified unit tests for functionality validation
- Dockerized environments for reproducible benchmarks
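For illustration, one can picture a single benchmark instance as a record like the one below. This is a minimal sketch with hypothetical field names; the actual schema in the SWE-Perf release may differ.

```python
from dataclasses import dataclass


@dataclass
class SwePerfInstance:
    """Hypothetical shape of one SWE-Perf example; field names are illustrative, not the official schema."""
    repo: str                    # e.g. "owner/project" on GitHub
    base_commit: str             # commit hash of the pre-optimization codebase
    human_patch: str             # expert-authored diff that delivered the verified speedup
    target_functions: list[str]  # oracle-/repo-level functions to optimize
    test_files: list[str]        # unit tests used to verify functional correctness
    docker_image: str            # tag of the reproducible evaluation environment


# Illustrative values only -- not real data from the benchmark.
example = SwePerfInstance(
    repo="example-org/example-lib",
    base_commit="abc1234",
    human_patch="diff --git a/core.py b/core.py ...",
    target_functions=["core.slow_path"],
    test_files=["tests/test_core.py"],
    docker_image="swe-perf/example-lib:abc1234",
)
```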
Key Design Principles
- Expert-Verified Patches: Each optimization instance is backed by human-authored patches validated for performance improvement using statistical testing (Mann–Whitney U test, p < 0.1); a sketch of this style of check follows the list.
- Reproducibility: Every instance includes test harnesses and Docker environments, allowing researchers and developers to measure performance gains consistently.
- Dual Settings: SWE-Perf supports two key evaluation paradigms. The Oracle Setting provides only the performance-critical function and file to the LLM, while the Realistic Setting provides the entire repository and leaves identification and optimization to the model, which is more aligned with real-world workflows.
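For intuition, here is a minimal sketch of how such a significance check could look, using SciPy's Mann-Whitney U test on repeated runtime measurements. The helper names and acceptance criterion are assumptions for illustration; the benchmark's own harness may apply additional filtering.

```python
import time

from scipy.stats import mannwhitneyu


def measure_runtimes(workload, repeats: int = 20) -> list[float]:
    """Run a workload several times and collect wall-clock runtimes (illustrative helper)."""
    runtimes = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        runtimes.append(time.perf_counter() - start)
    return runtimes


def is_significant_speedup(before: list[float], after: list[float], alpha: float = 0.1) -> bool:
    """Return True if post-patch runtimes are statistically lower than pre-patch runtimes.

    Uses a one-sided Mann-Whitney U test, mirroring the p < 0.1 criterion described
    for SWE-Perf; the benchmark's actual procedure may differ in detail.
    """
    _, p_value = mannwhitneyu(after, before, alternative="less")
    return p_value < alpha
```

Repeated measurements matter here because single runs are noisy; the test asks whether the post-patch runtime distribution is shifted below the pre-patch one, not whether one lucky run happened to be faster.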
Evaluation Framework: More Than Just Accuracy
SWE-Perf defines a three-part evaluation to assess LLM performance holistically:
| Metric | What It Measures |
|---|---|
| Apply | Can the model-generated patch be applied cleanly? |
| Correctness | Do all unit tests pass after applying the patch? |
| Performance | Does the patch deliver a statistically measurable speedup? |
Unlike benchmarks that reduce everything to a single score, SWE-Perf reports each metric independently, offering granular insights into where models succeed or struggle.
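To make the three metrics concrete, the sketch below shows how a simple harness might report them independently for a single instance: apply the candidate patch with git, run the verified unit tests with pytest, and leave a slot for the runtime comparison. The function name, commands, and result structure are assumptions for illustration, not the benchmark's actual tooling.

```python
import subprocess


def evaluate_patch(repo_dir: str, patch_file: str, test_path: str) -> dict:
    """Report Apply, Correctness, and Performance separately for one candidate patch (illustrative)."""
    results = {"apply": False, "correctness": False, "performance_gain": None}

    # Apply: does the model-generated patch apply cleanly to the pre-optimization codebase?
    check = subprocess.run(["git", "apply", "--check", patch_file],
                           cwd=repo_dir, capture_output=True)
    results["apply"] = check.returncode == 0
    if not results["apply"]:
        return results
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

    # Correctness: do all verified unit tests still pass after the patch?
    tests = subprocess.run(["pytest", test_path, "-q"], cwd=repo_dir, capture_output=True)
    results["correctness"] = tests.returncode == 0

    # Performance: compare pre- vs. post-patch runtimes, e.g. with the significance
    # check sketched earlier; left as a placeholder in this illustration.
    return results
```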
Benchmarking Results: How Do Current Models Perform?
The SWE-Perf team evaluated a range of top-performing LLMs, including GPT-4o, Claude 3.7, Claude 4 Opus, and Gemini 2.5 Pro, with human-authored patches serving as the baseline.
| Model | Setting | Performance Gain (%) |
|---|---|---|
| Claude-4-opus | Oracle | 1.28 |
| GPT-4o | Oracle | 0.60 |
| Gemini-2.5-Pro | Oracle | 1.48 |
| Claude-3.7 (Agentless) | Realistic | 0.41 |
| Claude-3.7 (OpenHands) | Realistic | 2.26 |
| Human Expert (Baseline) | — | 10.85 |

Key Takeaways:
- Human developers still outperform all LLMs by a significant margin.
- Agent-based systems (like OpenHands, built on Claude 3.7 Sonnet) outperform prompt-based or pipeline approaches in realistic settings.
- Model performance degrades in long-runtime or multi-function scenarios, exposing limitations in scalability and contextual reasoning.
- LLMs often rely on surface-level changes, such as tweaking imports or adding cache layers, whereas human experts address deep semantic bottlenecks involving data structures, algorithms, and execution order, as the sketch below illustrates.
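As a contrived illustration of that contrast (not an example taken from the benchmark), the first function below merely caches a slow computation, while the second removes the underlying quadratic membership check by switching to a set.

```python
from functools import lru_cache


# Surface-level change: cache results without touching the underlying algorithm.
@lru_cache(maxsize=None)
def count_common_cached(a: tuple, b: tuple) -> int:
    return sum(1 for x in a if x in b)  # still O(len(a) * len(b)) for every new input


# Deeper change: a set makes each membership test O(1), so the whole call is O(len(a) + len(b)).
def count_common_fast(a: list, b: list) -> int:
    b_set = set(b)
    return sum(1 for x in a if x in b_set)
```

Caching only pays off when the same inputs recur; the algorithmic rewrite speeds up every call, which is closer to the kind of change the human patches in SWE-Perf make.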
Why SWE-Perf Is a Landmark Contribution
1. Repository-Scale Focus
SWE-Perf goes beyond individual file or function benchmarks, capturing the multi-level complexity of real-world software development.
2. Grounded in Real-World Performance
By requiring statistically significant speedups, SWE-Perf filters out placebo patches and ensures that evaluated changes have measurable impact on runtime.
3. Encourages Agentic AI Research
The results show that autonomous agents, capable of tool use, search, planning, and self-evaluation, are better equipped to handle performance optimization tasks, highlighting new directions for LLM system design.
4. Benchmark Transparency
By keeping evaluation metrics disaggregated, SWE-Perf allows researchers to make more informed tradeoffs—for example, prioritizing correctness over performance, or vice versa.
Future Directions
SWE-Perf opens the door to a range of research questions and engineering applications:
- How can agent-based systems scale to larger codebases without performance drops?
- Can LLMs be trained or fine-tuned explicitly for repository-level performance optimization?
- How might IDE integrations and code profilers guide LLM-based optimizations more effectively?
- What role do hybrid workflows (AI + human) play in achieving state-of-the-art performance improvements?
As performance optimization becomes a critical concern in enterprise software systems, tools and benchmarks like SWE-Perf will shape how LLMs contribute meaningfully in production-grade environments.
Conclusion
SWE-Perf is a transformative benchmark that fills a crucial gap in the AI + software development ecosystem. By focusing on performance optimization across repositories, it challenges current LLMs and exposes their limitations, especially in high-complexity, long-context environments.
While no current LLM matches human expertise, SWE-Perf sets a clear research trajectory. It not only enables rigorous evaluation but also encourages innovation in agentic architectures, multi-step planning, and realistic optimization tasks.
As LLMs become increasingly embedded in developer workflows, SWE-Perf will serve as a foundational benchmark for measuring true productivity, beyond syntax and into performance-critical engineering.
Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project.