COFFE: A Code Efficiency Benchmark for Code Generation

Yun Peng1, Jun Wan2, Yichen Li1, Xiaoxue Ren2
1The Chinese University of Hong Kong
2Zhejiang University
FSE 2025

Why COFFE?

Correctness test cases are too small to yield a stable evaluation. Test cases in typical code generation benchmarks are used to evaluate the correctness of generated code solutions (correctness test cases). However, they usually have small inputs designed to probe corner cases and therefore cannot effectively distinguish the time efficiency of two different solutions.


COFFE adds stressful test cases to evaluate time efficiency. Stressful test cases have much larger inputs than correctness test cases and act like stress tests, making the time efficiency of different code solutions distinguishable.
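
For intuition, consider a hypothetical problem (not taken from COFFE) asking for the maximum pairwise sum in a list. A correctness test case uses a handful of elements, while a stressful test case uses an input orders of magnitude larger, so that a brute-force O(n^2) solution and an O(n log n) solution become clearly distinguishable in CPU cost:

    # Hypothetical example: the same problem under a correctness test case
    # and a stressful test case.
    def max_pair_sum(nums):
        # O(n log n) solution: the two largest elements give the maximum pair sum.
        a, b = sorted(nums)[-2:]
        return a + b

    # Correctness test case: a tiny input that checks the logic.
    assert max_pair_sum([1, 2, 3, 4]) == 7

    # Stressful test case: a much larger input; an O(n^2) brute-force solution
    # would need far more CPU instructions here than the O(n log n) one.
    assert max_pair_sum(list(range(100_000))) == 99_999 + 99_998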

Abstract

Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite this focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. In addition, current execution-time measurement is neither stable nor comprehensive, which threatens the validity of time efficiency evaluation.

To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases that improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on these findings, we draw several implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation.

STGen: Stressful Test Case Generation

Phase I: Contract Generation. Contracts are collections of assertion statements that record the types and scale of the inputs and the internal constraints between them. Providing contracts in the test case generation process helps LLMs understand the dependencies between test inputs. In addition, STGen can easily identify incorrect test cases from the assertion errors that contracts raise, which largely improves the accuracy of stressful test case generation.
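
As an illustration, a hypothetical contract (not one shipped with COFFE) for a problem that takes a list of integers nums and a window size k might look as follows; the assertions record the input types, the input scale, and the constraint between the two inputs:

    # Hypothetical contract (illustrative only): assertion statements recording
    # the types, scale, and internal constraints of the inputs.
    def contract(nums, k):
        assert isinstance(nums, list) and all(isinstance(x, int) for x in nums)
        assert isinstance(k, int)
        assert 1 <= len(nums) <= 10 ** 5                      # input scale
        assert 1 <= k <= len(nums)                            # constraint between inputs
        assert all(-10 ** 9 <= x <= 10 ** 9 for x in nums)    # value range

    # A generated test input that violates the contract (e.g., k > len(nums))
    # raises an AssertionError and can be discarded as incorrect.
    contract([3, 1, 4, 1, 5], 2)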


Phase II: Stressful Test Case Generation. To avoid overlong stressful test cases that hinder the performance of LLMs, we reformulate test case generation as a code generation task and design two new formats of test cases: expression test cases and generator test cases. Unlike raw test cases, which provide test inputs directly, expression and generator test cases contain code that generates the test inputs, which greatly shortens the test cases.
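
A hypothetical sketch of the two formats (the concrete representation used in COFFE may differ): instead of spelling out a huge input literally, an expression test case stores a short expression whose value is the input, and a generator test case stores a small function that constructs it:

    # Hypothetical sketches of the two formats; the concrete representation in
    # COFFE may differ.

    # Expression test case: a short expression whose value is the test input,
    # replacing a raw literal that would span hundreds of thousands of characters.
    expression_test = "sorted(range(10**5), reverse=True)"
    stress_input = eval(expression_test)  # materialized when the test runs

    # Generator test case: a function that constructs the test input, allowing
    # seeded randomness and more elaborate construction logic.
    def generate_input():
        import random
        random.seed(42)
        return [random.randint(-10**9, 10**9) for _ in range(10**5)]

    stress_input_2 = generate_input()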


Efficient@k: A stable metric evaluating both correctness and efficiency

Similar to Pass@k. Efficient@k adopts the basic formula of Pass@k to estimate the performance of LLMs from several sampled code solutions.

Stable measurement. Efficient@k does not use execution time to measure time efficiency; instead, it uses CPU instruction count. CPU instruction count measurement (RSD: 0.005%) is much more stable than execution time measurement (RSD: 2%-5%).

Relative comparison. Efficient@k does not use the absolute value of CPU instruction count measurements, as this value can differ widely across platforms. Instead, it compares each code solution with the ground truth solution and counts the correct code solutions that are faster than the ground truth solution (cf) when calculating the metric.
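
A minimal sketch of the computation, assuming efficient@k follows the same unbiased estimator as Pass@k with the number of correct samples c replaced by cf, the number of correct samples that use fewer CPU instructions than the ground truth solution (the paper gives the exact definition):

    # Sketch of the efficient@k estimator, assuming it mirrors the standard
    # pass@k estimator with c replaced by cf. See the paper for the exact definition.
    from math import comb

    def estimator(n, x, k):
        # Unbiased pass@k-style estimator: 1 - C(n - x, k) / C(n, k)
        if n - x < k:
            return 1.0
        return 1.0 - comb(n - x, k) / comb(n, k)

    def pass_at_k(n, c, k):
        # c: number of correct samples among n generated samples
        return estimator(n, c, k)

    def efficient_at_k(n, cf, k):
        # cf: number of samples that are both correct and use fewer CPU
        # instructions than the ground truth (counts collected, for example,
        # with a hardware-counter tool such as Linux perf)
        return estimator(n, cf, k)

    # Example: 10 samples, 6 correct, 3 of them faster than the ground truth
    print(pass_at_k(10, 6, 1))       # -> 0.6
    print(efficient_at_k(10, 3, 1))  # -> 0.3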


Statistics of COFFE

COFFE directly selects high-quality instances from popular existing benchmarks: HumanEval, MBPP, APPS, and Code Contests. As file-level code generation problems are more difficult than function-level problems, we do not select all instances from APPS and Code Contests when building COFFE. Instead, we only select the problems from these two benchmarks that at least one LLM can successfully solve: correctness is a prerequisite for evaluating time efficiency, so COFFE provides relatively easy problems on which different LLMs can be compared.

Category    #Instance    #Solution/#Instance    #Correctness/#Instance    #Stressful/#Instance    Source
Function    398          1.00                   5.72                      4.99                    HumanEval, MBPP
File        358          66.93                  43.68                     4.95                    APPS, Code Contests

Time Efficiency of Code Generated by Current LLMs

We evaluate 14 popular LLMs on COFFE. For each LLM, we sample one code solution with temperature 0 and evaluate it with efficient@1, speedup, and Pass@1. The code solutions in the following results were sampled in September 2024.

Model               Size      Function-level                     File-level
                              Efficient@1  Speedup  Pass@1       Efficient@1  Speedup  Pass@1
Phi3                3.8B      26.65        2.59     43.47        7.36         0.08     22.63
MagicCoder          DS-6.7B   21.90        3.04     32.41        12.02        0.10     22.91
MagicCoder          CL-7B     29.82        3.41     46.48        5.04         0.14     15.92
CodeLlama           7B        26.65        2.49     38.69        4.26         0.95     8.66
CodeLlama           13B       25.60        1.03     41.71        1.16         1.02     2.23
CodeLlama           34B       40.37        3.51     64.74        22.87        0.09     53.63
Llama3              8B        27.70        3.91     42.46        0.00         0.21     0.84
Llama3              70B       40.90        3.30     67.59        38.76        0.14     68.99
StarCoder           15B       38.52        3.52     61.31        21.71        0.10     51.11
WizardCoder         15B       28.76        1.95     48.49        10.08        0.07     20.67
Mixtral             8x7B      25.59        5.14     44.72        8.53         1.43     22.91
DeepSeek V2         236B      46.70        2.79     78.39        41.09        0.18     89.94
DeepSeek V2 Coder   236B      46.97        2.53     79.90        42.25        0.44     78.77
Llama3.1            405B      39.58        3.21     67.34        46.51        0.90     89.11
Claude 3.5 Sonnet   -         43.54        4.90     77.64        39.15        0.23     86.59
Gemini 1.5 Pro      -         45.12        1.76     75.38        42.64        0.16     75.44
ChatGPT             -         37.73        2.46     68.19        39.15        0.12     75.98
GPT-4o              -         44.59        8.28     77.64        43.02        1.11     90.78
Findings:
  • The performance of current LLMs drops significantly in efficient code generation, indicating that the code solutions generated by current LLMs are often correct but not time-efficient.
  • Compared with function-level code generation, the code solutions generated by current LLMs are less efficient in file-level code generation.
  • Larger LLMs generally perform better in correct code generation but do not significantly outperform smaller LLMs in efficient code generation, indicating that larger parameter sizes of current LLMs do not contribute much to efficient code generation.

Cite Our Work

@misc{peng2025coffe,
      title={COFFE: A Code Efficiency Benchmark for Code Generation},
      author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
      year={2025},
      eprint={2502.02827},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2502.02827}
}