Usage of Large Language Models for Code Generation Tasks: A Review
TL;DR Highlight
A survey paper summarizing the current state and limitations of LLM-based code generation.
Who Should Read
Developers and researchers who want a comprehensive overview of where LLM code generation stands today and what challenges remain.
Core Mechanics
- Covers the full landscape of LLM-based code generation: models, benchmarks, prompting strategies, and evaluation metrics
- Identifies key limitations: poor performance on complex multi-file tasks, limited understanding of project-level context, and hallucinated APIs (one cheap check for the latter is sketched after this list)
- Benchmarks like HumanEval are saturated and no longer discriminate between top models; harder benchmarks are needed
- Security and correctness of generated code remain open problems — models frequently produce vulnerable or subtly wrong code
- Highlights the gap between benchmark performance and real-world usability
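Hallucinated APIs are often catchable with a cheap static check before the code ever runs. The following is a minimal sketch, not from the survey: parse the generated Python with the standard-library ast module and flag imported top-level modules that do not resolve in the current environment. The function name and sample strings are illustrative.

# Minimal sketch: flag imports in generated code that cannot be resolved,
# one inexpensive signal for hallucinated APIs.
import ast
import importlib.util

def find_unresolvable_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return [m for m in sorted(modules) if importlib.util.find_spec(m) is None]

generated = "import json\nimport totally_made_up_sdk\n"
print(find_unresolvable_imports(generated))  # ['totally_made_up_sdk']

This only verifies that a module exists locally, not that the called functions or signatures are real, but it filters out the most blatant fabrications cheaply.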
Evidence
- Comprehensive review of dozens of models and benchmarks published through 2024
- HumanEval pass@1 rates for top models now exceed 90%, indicating benchmark saturation (the pass@k estimator behind such figures is sketched after this list)
- Multiple studies show high rates of security vulnerabilities in LLM-generated code
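For context on the pass@1 figures above: HumanEval-style results are conventionally computed with the unbiased pass@k estimator introduced in the Codex paper (Chen et al., 2021). Given n samples per problem of which c pass, the probability that at least one of k drawn samples passes is 1 - C(n-c, k) / C(n, k). A minimal sketch:

# Unbiased pass@k estimator (Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (from n total, c correct) passes."""
    if n - c < k:  # every size-k draw must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 180 of them correct:
print(pass_at_k(200, 180, 1))   # 0.9
print(pass_at_k(200, 180, 10))  # ~1.0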
How to Apply
- Use this survey as a reference when deciding which code generation model or prompting strategy to adopt for your use case.
- Don't rely solely on HumanEval scores when comparing models — look at harder benchmarks like SWE-bench or LiveCodeBench.
- Always run static analysis and security scanners on LLM-generated code before merging to production, as sketched below.
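As one concrete way to wire the last point into a pipeline, here is a minimal sketch that scans a generated snippet with Bandit, a widely used Python security linter. It assumes Bandit is installed (pip install bandit); the function name and the decision to surface every finding are illustrative choices, not from the survey.

# Minimal sketch of a pre-merge gate: write generated code to a temp file
# and scan it with Bandit (https://bandit.readthedocs.io).
import json
import os
import subprocess
import tempfile

def scan_generated_code(source: str) -> list[dict]:
    """Run Bandit on `source` and return the reported issues."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            ["bandit", "-f", "json", path],
            capture_output=True, text=True,
        )
        return json.loads(proc.stdout).get("results", [])
    finally:
        os.unlink(path)

issues = scan_generated_code("import subprocess\nsubprocess.call('ls -l', shell=True)\n")
for issue in issues:
    print(issue["test_id"], issue["issue_severity"], issue["issue_text"])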
Code Example
# Example of improving code generation quality with few-shot prompts (OpenAI API)
import openai

client = openai.OpenAI()

prompt = """
Please write a function based on the examples below.

Example 1:
# Returns the sum of two numbers
def add(a, b):
    return a + b

Example 2:
# Returns the product of two numbers
def multiply(a, b):
    return a * b

Now implement the following:
# A function that returns the difference between the maximum and minimum values in a list
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert Python developer. Write clear and efficient code."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.2,  # Low temperature is recommended for code generation
)
print(response.choices[0].message.content)
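Two design choices in this example carry most of the weight: the few-shot examples pin down the expected function style and comment format, and the low temperature trades output diversity for determinism, which generally suits code generation better than open-ended text.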
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study showing that while LLM-written TLA+ specifications largely pass syntax checks, their behavioral conformance to the actual systems reaches only about 46%, highlighting the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published NLA, a technique that translates the numeric vectors (activations) inside an LLM into natural language that can be read directly, a new advance in interpretability research into what the model is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only the documentation; even the best model achieved a 95%+ pass rate on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and even Claude/GPT will write code containing security vulnerabilities 53–86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark evaluates LLM JSON handling across seven metrics, revealing performance differences that schema compliance alone does not capture.