DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

Jan 25, 2024 • Daya Guo, Qihao Zhu, Dejian Yang +10

TL;DR Highlight

How DeepSeek-Coder, an open-source family of code LLMs (1.3B to 33B), surpassed GPT-3.5 and outperformed CodeLlama-34B with a 6.7B model.

Who Should Read

Developers building code autocompletion or code generation features, or building a Copilot-like tool themselves. AI engineers looking for the strongest open-source code model.

Core Mechanics

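A series of open-source code models ranging from 1.3B to 33B parameters, trained on 2 trillion tokens covering 87 programming languages. The license permits both research and commercial use.
Training data is organized at the repository level rather than the file level - files are reordered by a topological sort over their import/dependency relationships so that the data reflects real project structure. This substantially improves cross-file code completion.
FIM (Fill-In-the-Middle) training is applied at a 50% rate to strengthen code autocompletion. The authors found a trade-off: a 100% rate improves FIM but degrades ordinary code generation.
The context length is extended to 16K tokens, which makes long files and multi-file scenarios tractable. This is implemented by adjusting the RoPE scaling factor.
DeepSeek-Coder-Instruct 33B surpasses GPT-3.5-Turbo on HumanEval. On LeetCode contest problems it is the only open-source model to exceed GPT-3.5-Turbo.
DeepSeek-Coder-v1.5 is obtained by continued pre-training from a general LLM (DeepSeek-LLM-7B) - it keeps coding performance while substantially improving math reasoning and natural language understanding.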
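
The repository-level ordering described above can be sketched as follows. This is a minimal illustration using plain Kahn's algorithm, not necessarily the authors' exact procedure; the function name `order_files_by_dependency`, the `deps` input shape, and the way import cycles are handled are all assumptions made for this example.

```python
from collections import defaultdict, deque


def order_files_by_dependency(deps):
    """Return file names so that each file comes after the files it imports.

    `deps` maps a file name to the set of files it depends on. This input is
    hypothetical; a real pipeline would extract it from import statements.
    """
    files = set(deps)
    for targets in deps.values():
        files.update(targets)

    in_degree = {name: 0 for name in files}
    dependents = defaultdict(list)
    for name, targets in deps.items():
        for target in targets:
            dependents[target].append(name)
            in_degree[name] += 1

    # Kahn's algorithm; sorted() keeps the result deterministic.
    ready = deque(sorted(name for name in files if in_degree[name] == 0))
    ordered = []
    while ready:
        current = ready.popleft()
        ordered.append(current)
        for nxt in sorted(dependents[current]):
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Assumption: files stuck in an import cycle are appended in name order.
    ordered.extend(sorted(files - set(ordered)))
    return ordered


repo = {
    "main.py": {"utils.py", "model.py"},
    "model.py": {"utils.py"},
    "utils.py": set(),
}
print(order_files_by_dependency(repo))  # ['utils.py', 'model.py', 'main.py']
```

Concatenating the files in this order means that a definition generally appears in the training sequence before the code that uses it.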

Evidence

HumanEval multilingual average: DeepSeek-Coder-Base 33B 50.3% vs CodeLlama-Base 34B 41.0% (a 9.3 pp gap). The 6.7B model reaches 44.7%, beating CodeLlama 34B's 41.0% with about one-fifth the parameters.
LeetCode Contest, 180 problems: DeepSeek-Coder-Instruct 33B 27.8% vs GPT-3.5-Turbo 23.3%. CodeLlama-Instruct 34B reaches only 9.4%.
FIM code completion (Single-Line Infilling average): DeepSeek-Coder-Base 6.7B 80.7% vs CodeLlama-Base 13B 75.5% - the smaller model beats the larger one.
DeepSeek-Coder-v1.5 math reasoning: GSM8K 62.4% (up 19.2 pp from the original 6.7B model's 43.2%), MATH 24.7% (up 5.5 pp from 19.2%).

How to Apply

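For a code autocompletion tool, DeepSeek-Coder-Base 6.7B is the recommended choice - given the surrounding context in FIM format, it fills in the code in the middle. Prompt format in the paper's notation: `<|fim_start|>prefix<|fim_hole|>suffix<|fim_end|>`. The released tokenizer spells these sentinels `<｜fim▁begin｜>`, `<｜fim▁hole｜>`, `<｜fim▁end｜>`.
When asking an LLM to solve a complex coding problem, add a CoT (Chain-of-Thought) instruction - telling the model to first write a step-by-step outline and then write the code improves performance on hard LeetCode problems.
When a multi-file project has to go into the context, retrieve the relevant files with BM25 and prepend a cross-file context of at most 512 tokens. This substantially improves cross-file completion accuracy (Python EM: 9.53% → 16.14%).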
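
The BM25 retrieval step above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the benchmark's retrieval code: the helper names (`tokenize`, `bm25_scores`, `build_cross_file_context`), the `k1` and `b` defaults, the identifier-based tokenizer, and the whitespace-based token count standing in for the model tokenizer are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split code into lowercase identifier-like tokens (a crude stand-in)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(doc_tokens)
    avg_len = sum(len(tokens) for tokens in doc_tokens) / max(n_docs, 1)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_cross_file_context(query, files, budget=512):
    """Concatenate the best-matching files until the token budget is used up.

    `files` maps file name to source text. Tokens are counted by whitespace
    split here; a real setup would count with the model's tokenizer.
    """
    names = list(files)
    scores = bm25_scores(query, [files[name] for name in names])
    ranked = sorted(zip(scores, names), key=lambda pair: (-pair[0], pair[1]))

    parts, used = [], 0
    for score, name in ranked:
        if score <= 0:
            break  # remaining files share no terms with the query
        snippet = f"# {name}\n{files[name]}"
        cost = len(snippet.split())
        if used + cost > budget:
            continue  # skip files that do not fit in the remaining budget
        parts.append(snippet)
        used += cost
    return "\n".join(parts)


files = {
    "geometry.py": "def circle_area(radius):\n    return 3.14159 * radius * radius\n",
    "strings.py": "def shout(text):\n    return text.upper()\n",
}
print(build_cross_file_context("area = circle_area(radius)", files))
```

The returned string would be placed in front of the current file's content so that the model sees the retrieved definitions before the code it has to complete.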

Code Example

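# DeepSeek-Coder FIM (Fill-In-the-Middle) usage example
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Code before and after the gap that the model should fill
prefix = "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n"
suffix = "\n    return arr"

# PSM mode: <｜fim▁begin｜>prefix<｜fim▁hole｜>suffix<｜fim▁end｜>
# The released tokenizer spells the sentinels with full-width bars and "▁";
# the ASCII form <|fim_start|> is the paper's notation, not a special token.
input_text = f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, i.e. the infilled middle
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))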

# LeetCode-style CoT prompt example
problem = """Given two integer arrays nums1 and nums2, return their intersection.
Please complete the code below to solve the above problem:
```python
class Solution:
    def intersection(self, nums1, nums2):
```"""

# Append the CoT instruction
cot_prompt = problem + "\nYou need first to write a step-by-step outline and then write the code."
print(cot_prompt)

Terminology

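FIM (Fill-In-the-Middle): A training scheme in which the model is given the beginning and the end of a piece of code and has to fill in the middle. It is the core technique behind IDE features that autocomplete code at the cursor position.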
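
As a rough illustration of how a training document could be turned into a FIM sample at a given rate, here is a minimal sketch. It assumes character-level random split points and uses the paper's sentinel notation as placeholder strings; the function names `to_psm` and `maybe_apply_fim` are hypothetical, and the real pipeline may split and tokenize differently.

```python
import random

# Sentinel strings in the paper's notation; treat them as placeholders for
# whatever special tokens the tokenizer actually defines.
FIM_START, FIM_HOLE, FIM_END = "<|fim_start|>", "<|fim_hole|>", "<|fim_end|>"


def to_psm(document, rng):
    """Cut the document at two random points and emit prefix-suffix-middle order."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_START}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"


def maybe_apply_fim(document, rng, fim_rate=0.5):
    """Apply the PSM transform with probability `fim_rate`, else keep the text."""
    if rng.random() < fim_rate:
        return to_psm(document, rng)
    return document


rng = random.Random(0)
source = "def add(a, b):\n    return a + b\n"
print(maybe_apply_fim(source, rng, fim_rate=1.0))
```

With `fim_rate=0.5`, roughly half of the documents stay in ordinary left-to-right form, which matches the trade-off noted under Core Mechanics between infilling and plain code generation.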

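Repository-level Pre-training: Training on whole GitHub repositories bundled together instead of one file at a time. The model learns the import relationships between files, so it gets better at writing code that spans several files, as in a real project.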

BPE (Byte Pair Encoding): A method for splitting text into tokens. It builds the vocabulary by repeatedly merging frequently co-occurring character sequences into single tokens.

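RoPE (Rotary Position Embedding): A technique that tells a transformer model where each token sits in the sequence. It encodes position by applying rotation matrices to the query and key vectors, which also lends itself to handling long contexts.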
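
To make the role of a RoPE scaling factor concrete, here is a minimal sketch that assumes linear position scaling, meaning positions are divided by the factor before the rotation angles are computed. The function names, the default base of 10000, the pairing of adjacent vector components, and the factor of 4 in the usage are illustrative assumptions, not values taken from this summary.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scaling_factor=1.0):
    """Rotation angles for one position under linearly scaled RoPE."""
    scaled_position = position / scaling_factor
    return [scaled_position * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


def apply_rope(vector, position, base=10000.0, scaling_factor=1.0):
    """Rotate the pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    angles = rope_angles(position, len(vector), base, scaling_factor)
    rotated = []
    for i, theta in enumerate(angles):
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


# With a factor of 4, position 8 receives the same angles as position 2
# does without scaling, so longer sequences map into the trained range.
print(rope_angles(8, head_dim=4, scaling_factor=4.0) == rope_angles(2, head_dim=4))
```

Because each pair is only rotated, the vector's length is unchanged; the scaling factor changes how quickly the angles grow with position.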

Cross-file Code Completion: Completing code by consulting other files as well as the current one. This ability is needed in real projects when calling functions defined in other modules.

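CoT (Chain-of-Thought): A prompting technique that makes the LLM spell out its reasoning step by step instead of giving the answer right away. It raises accuracy on math problems and complex coding problems.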

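GQA (Grouped-Query Attention): A technique that groups the transformer's attention computation to reduce memory use and improve speed. It is used to make inference more efficient in large models such as the 33B.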

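Pass@1: The probability that code generated by the LLM in a single attempt passes the tests. A higher value means the model produces correct code on the first try more often.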
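
For reference, pass@k is often computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass, and estimate 1 - C(n-c, k) / C(n, k). The sketch below shows that estimator; whether a given leaderboard number was produced this way or from a single greedy sample is an assumption to check per benchmark, and the function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, of which c passed the tests."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# For k = 1 the estimate reduces to the fraction of passing samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score is the mean of this per-problem estimate over all problems.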

Original Abstract

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.