COFFE: A Code Efficiency Benchmark for Code Generation

Yun Peng1, Jun Wan2, Yichen Li1, Xiaoxue Ren2
1The Chinese University of Hong Kong
2Zhejiang University
FSE 2025

Why COFFE?

Correctness test cases are too small to yield a stable evaluation. Test cases in typical code generation benchmarks are used to evaluate the correctness of generated code solutions (correctness test cases). However, they usually have small inputs designed to probe corner cases and therefore cannot effectively distinguish the time efficiency of two different solutions.


COFFE adds stressful test cases to evaluate time efficiency. Stressful test cases have much larger inputs than correctness test cases and act like stress tests, making the time efficiency of different code solutions distinguishable.
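
For intuition, consider a hypothetical problem (not taken from COFFE) asking for the maximum pairwise sum in a list. A correctness test case uses a handful of elements, while a stressful test case uses an input orders of magnitude larger, so that a brute-force O(n^2) solution and an O(n log n) solution become clearly distinguishable in CPU cost:

    # Hypothetical example: the same problem under a correctness test case
    # and a stressful test case.
    def max_pair_sum(nums):
        # O(n log n) solution: the two largest elements give the maximum pair sum.
        a, b = sorted(nums)[-2:]
        return a + b

    # Correctness test case: a tiny input that checks the logic.
    assert max_pair_sum([1, 2, 3, 4]) == 7

    # Stressful test case: a much larger input; an O(n^2) brute-force solution
    # would need far more CPU instructions here than the O(n log n) one.
    assert max_pair_sum(list(range(100_000))) == 99_999 + 99_998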

Abstract

Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural language. Many research efforts are devoted to improving the correctness of LLM-generated code, and many benchmarks have been proposed to evaluate correctness comprehensively. Despite this focus on correctness, the time efficiency of LLM-generated code solutions is under-explored. Current correctness benchmarks are not suitable for time efficiency evaluation since their test cases cannot effectively distinguish the time efficiency of different code solutions. In addition, current execution-time measurement is neither stable nor comprehensive, which threatens the validity of time efficiency evaluation.

To address the challenges in the time efficiency evaluation of code generation, we propose COFFE, a code generation benchmark for evaluating the time efficiency of LLM-generated code solutions. COFFE contains 398 and 358 problems for function-level and file-level code generation, respectively. To improve distinguishability, we design a novel stressful test case generation approach with contracts and two new formats of test cases that improve the accuracy of generation. For the time evaluation metric, we propose efficient@k based on CPU instruction count to ensure a stable and solid comparison between different solutions. We evaluate 14 popular LLMs on COFFE and identify four findings. Based on these findings, we draw several implications for LLM researchers and software practitioners to facilitate future research and usage of LLMs in code generation.

STGen: Stressful Test Case Generation

Phase I: Contract Generation. Contracts are collections of assertion statements that record the types and scale of the inputs and the internal constraints between them. Providing contracts in the test case generation process helps LLMs understand the dependencies between test inputs. In addition, STGen can easily identify incorrect test cases from the assertion errors that contracts raise, which largely improves the accuracy of stressful test case generation.
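
As an illustration, a hypothetical contract (not one shipped with COFFE) for a problem that takes a list of integers nums and a window size k might look as follows; the assertions record the input types, the input scale, and the constraint between the two inputs:

    # Hypothetical contract (illustrative only): assertion statements recording
    # the types, scale, and internal constraints of the inputs.
    def contract(nums, k):
        assert isinstance(nums, list) and all(isinstance(x, int) for x in nums)
        assert isinstance(k, int)
        assert 1 <= len(nums) <= 10 ** 5                      # input scale
        assert 1 <= k <= len(nums)                            # constraint between inputs
        assert all(-10 ** 9 <= x <= 10 ** 9 for x in nums)    # value range

    # A generated test input that violates the contract (e.g., k > len(nums))
    # raises an AssertionError and can be discarded as incorrect.
    contract([3, 1, 4, 1, 5], 2)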


Phase II: Stressful Test Case Generation. To avoid overlong stressful test cases that hinder the performance of LLMs, we reformulate test case generation as a code generation task and design two new formats of test cases: expression test cases and generator test cases. Unlike raw test cases, which provide test inputs directly, expression and generator test cases contain code that generates the test inputs, which greatly shortens the test cases.
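
A hypothetical sketch of the two formats (the concrete representation used in COFFE may differ): instead of spelling out a huge input literally, an expression test case stores a short expression whose value is the input, and a generator test case stores a small function that constructs it:

    # Hypothetical sketches of the two formats; the concrete representation in
    # COFFE may differ.

    # Expression test case: a short expression whose value is the test input,
    # replacing a raw literal that would span hundreds of thousands of characters.
    expression_test = "sorted(range(10**5), reverse=True)"
    stress_input = eval(expression_test)  # materialized when the test runs

    # Generator test case: a function that constructs the test input, allowing
    # seeded randomness and more elaborate construction logic.
    def generate_input():
        import random
        random.seed(42)
        return [random.randint(-10**9, 10**9) for _ in range(10**5)]

    stress_input_2 = generate_input()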


Efficient@k: A stable metric evaluating both correctness and efficiency

Similar to Pass@k. Efficient@k adopts the basic formula of Pass@k to estimate the performance of LLMs from several sampled code solutions.

Stable measurement. Efficient@k does not use execution time to measure time efficiency; instead, it uses CPU instruction count. CPU instruction count measurement (RSD: 0.005%) is much more stable than execution time measurement (RSD: 2%-5%).

Relative comparison. Efficient@k does not use the absolute value of CPU instruction count measurements, as this value can differ widely across platforms. Instead, it compares each code solution with the ground truth solution and counts the correct code solutions that are faster than the ground truth solution (cf) when calculating the metric.
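
A minimal sketch of the computation, assuming efficient@k follows the same unbiased estimator as Pass@k with the number of correct samples c replaced by cf, the number of correct samples that use fewer CPU instructions than the ground truth solution (the paper gives the exact definition):

    # Sketch of the efficient@k estimator, assuming it mirrors the standard
    # pass@k estimator with c replaced by cf. See the paper for the exact definition.
    from math import comb

    def estimator(n, x, k):
        # Unbiased pass@k-style estimator: 1 - C(n - x, k) / C(n, k)
        if n - x < k:
            return 1.0
        return 1.0 - comb(n - x, k) / comb(n, k)

    def pass_at_k(n, c, k):
        # c: number of correct samples among n generated samples
        return estimator(n, c, k)

    def efficient_at_k(n, cf, k):
        # cf: number of samples that are both correct and use fewer CPU
        # instructions than the ground truth (counts collected, for example,
        # with a hardware-counter tool such as Linux perf)
        return estimator(n, cf, k)

    # Example: 10 samples, 6 correct, 3 of them faster than the ground truth
    print(pass_at_k(10, 6, 1))       # -> 0.6
    print(efficient_at_k(10, 3, 1))  # -> 0.3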


Statistics of COFFE

COFFE directly selects high-quality instances from popular existing benchmarks: HumanEval, MBPP, APPS, and Code Contests. As file-level code generation problems are more difficult than function-level problems, we do not select all instances from APPS and Code Contests when building COFFE. Instead, we only select the problems from these two benchmarks that at least one LLM can successfully solve: correctness is a prerequisite for evaluating time efficiency, so COFFE provides relatively easy problems on which different LLMs can be compared.

Category    #Instance    #Solution/#Instance    #Correctness/#Instance    #Stressful/#Instance    Source
Function    398          1.00                   5.72                      4.99                    HumanEval, MBPP
File        358          66.93                  43.68                     4.95                    APPS, Code Contests

Time Efficiency of Code Generated by Current LLMs

We evaluate 14 popular LLMs on COFFE. For each LLM, we sample one code solution with temperature 0 and evaluate it with efficient@1, speedup, and Pass@1. The code solutions in the following results were sampled in September 2024.

Model               Size      Function-level                     File-level
                              Efficient@1  Speedup  Pass@1       Efficient@1  Speedup  Pass@1
Phi3                3.8B      26.65        2.59     43.47        7.36         0.08     22.63
MagicCoder          DS-6.7B   21.90        3.04     32.41        12.02        0.10     22.91
MagicCoder          CL-7B     29.82        3.41     46.48        5.04         0.14     15.92
CodeLlama           7B        26.65        2.49     38.69        4.26         0.95     8.66
CodeLlama           13B       25.60        1.03     41.71        1.16         1.02     2.23
CodeLlama           34B       40.37        3.51     64.74        22.87        0.09     53.63
Llama3              8B        27.70        3.91     42.46        0.00         0.21     0.84
Llama3              70B       40.90        3.30     67.59        38.76        0.14     68.99
StarCoder           15B       38.52        3.52     61.31        21.71        0.10     51.11
WizardCoder         15B       28.76        1.95     48.49        10.08        0.07     20.67
Mixtral             8x7B      25.59        5.14     44.72        8.53         1.43     22.91
DeepSeek V2         236B      46.70        2.79     78.39        41.09        0.18     89.94
DeepSeek V2 Coder   236B      46.97        2.53     79.90        42.25        0.44     78.77
Llama3.1            405B      39.58        3.21     67.34        46.51        0.90     89.11
Claude 3.5 Sonnet   -         43.54        4.90     77.64        39.15        0.23     86.59
Gemini 1.5 Pro      -         45.12        1.76     75.38        42.64        0.16     75.44
ChatGPT             -         37.73        2.46     68.19        39.15        0.12     75.98
GPT-4o              -         44.59        8.28     77.64        43.02        1.11     90.78
Findings:
  • The performance of current LLMs drops significantly in efficient code generation, indicating that the code solutions generated by current LLMs are often correct but not time-efficient.
  • Compared with function-level code generation, the code solutions generated by current LLMs are less efficient in file-level code generation.
  • Larger LLMs generally perform better in correct code generation but do not significantly outperform smaller LLMs in efficient code generation, indicating that larger parameter sizes of current LLMs do not contribute much to efficient code generation.

Cite Our Work

@misc{peng2025coffe,
      title={COFFE: A Code Efficiency Benchmark for Code Generation},
      author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
      year={2025},
      eprint={2502.02827},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2502.02827}
}