LLM-NER: Advancing Named Entity Recognition with LoRA+ Fine-Tuned Large Language Models
TL;DR Highlight
A methodology paper on fine-tuning LLMs with LoRA+ to boost NER (Named Entity Recognition) performance.
Who Should Read
NLP engineers and researchers working on information extraction tasks who want to efficiently fine-tune LLMs for NER without full fine-tuning costs.
Core Mechanics
- Standard LoRA updates the A and B adapter matrices with the same learning rate, which is suboptimal and slows convergence on NER tasks
- LoRA+ addresses this by giving the two matrices different learning rates (lower for A, higher for B), leading to faster and better convergence
- LLMs fine-tuned with LoRA+ for NER significantly outperform smaller specialized NER models on domain-specific datasets
- The approach is particularly effective in low-resource NER domains with limited training data, where LoRA+'s efficiency matters most
- Few-shot LoRA+ fine-tuning (even with 100-500 examples) achieves results competitive with full fine-tuning on specialized NER benchmarks
- The paper provides practical guidance on LoRA rank selection, learning rate ratios, and training duration for NER
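The learning-rate split described above can be sketched as optimizer parameter groups. This is a minimal illustration, not the paper's training code: the helper name `loraplus_param_groups` and the default `base_lr` are hypothetical, and it assumes adapter weights carry `lora_A` / `lora_B` in their parameter names, as PEFT names them.

```python
def loraplus_param_groups(named_params, base_lr=2e-4, lr_ratio=16.0):
    """Split trainable parameters into groups so lora_B gets lr_ratio x the lr of lora_A.

    named_params: iterable of (name, parameter) pairs, e.g. model.named_parameters().
    base_lr and lr_ratio are illustrative defaults, not values confirmed by the paper.
    """
    group_a, group_b, other = [], [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # skip frozen base-model weights
        if "lora_B" in name:
            group_b.append(param)
        elif "lora_A" in name:
            group_a.append(param)
        else:
            other.append(param)  # any other trainable weights keep the base rate
    groups = [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * lr_ratio},
        {"params": other, "lr": base_lr},
    ]
    # Drop empty groups so the optimizer only sees groups with parameters.
    return [g for g in groups if g["params"]]
```

The returned list has the shape PyTorch optimizers accept as parameter groups, so it could be passed as `torch.optim.AdamW(loraplus_param_groups(model.named_parameters()))`.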
Evidence
- On CoNLL-2003 NER: LoRA+ fine-tuned LLM achieved F1 92.4 vs. LoRA 91.1 vs. specialized NER model 91.8
- On biomedical NER (low-resource): LoRA+ F1 85.2 vs. LoRA 82.7 vs. BioBERT 83.5
- Convergence speed: LoRA+ reached peak performance in 60% of the training steps required by standard LoRA
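For readers reproducing F1 numbers like these, the following is a minimal sketch of entity-level micro F1. It assumes exact-match scoring on (sentence id, type, text) triples, which is the usual CoNLL convention; the scorer the paper actually used is not stated here, and the function name `entity_f1` is hypothetical.

```python
def entity_f1(gold, pred):
    """Micro-averaged precision, recall, and F1 over exact-match entities.

    gold, pred: iterables of (sentence_id, entity_type, entity_text) tuples.
    Using sets means repeated identical mentions in one sentence count once,
    which is a simplification.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, "ORG", "Apple"), (0, "PER", "Tim Cook"), (0, "LOC", "Seoul")]
pred = [(0, "ORG", "Apple"), (0, "PER", "Tim Cook"), (0, "LOC", "Korea")]
precision, recall, f1 = entity_f1(gold, pred)  # two of three entities match
```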
How to Apply
- For NER fine-tuning: use LoRA+ with rank=16, learning rate ratio B/A = 16 (B matrix 16x higher learning rate than A matrix), and target the attention + MLP layers.
- If you have fewer than 1K training examples: LoRA+ fine-tuning with a 7B LLM will likely outperform training a dedicated smaller NER model, because the LLM's pretrained language understanding is a strong prior.
- For production NER: fine-tune with LoRA+ for performance, then consider merging the LoRA weights into the base model, which removes the adapter overhead at inference time.
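The weight-merging step can be illustrated with a toy example. This sketch assumes the standard LoRA formulation, where the adapted layer computes W x + (alpha / r) B A x, so merging means replacing W with W + (alpha / r) B A. The matrices and helper names are made up for illustration and use plain Python lists rather than real model weights.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: W is 2x2, A is r x d_in = 1x2, B is d_out x r = 2x1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [1.0]]
alpha, r = 2, 1
x = [[3.0], [4.0]]  # one input as a column vector

# Adapter path: base output plus the scaled low-rank update.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_path = [[b[0] + (alpha / r) * l[0]] for b, l in zip(base_out, lora_out)]

# Merged path: a single matrix multiply with the merged weight.
merged_path = matmul(merge_lora(W, A, B, alpha, r), x)
assert adapter_path == merged_path
```

In PEFT, the corresponding operation on a trained adapter is `merge_and_unload()`; merging into a quantized base model may need extra care.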
Code Example
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # or another causal LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the base model in 8-bit to cut memory use (requires bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# This config only defines the adapters. LoRA+ is not a lora_alpha setting:
# it comes from the optimizer, which gives the lora_B weights a higher
# learning rate than the lora_A weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Attention projections only; add "gate_proj", "up_proj", "down_proj"
    # to also adapt the MLP layers.
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on q_proj and v_proj of a 7B Llama model, this reports roughly
# 8.4M trainable parameters, about 0.12% of the total.

# Example instruction format for NER
ner_prompt = """Extract named entities from the sentence below. Format: [TYPE: entity]
Sentence: Apple's CEO Tim Cook gave a presentation in Seoul.
Answer:"""
# Expected output: [ORG: Apple] [PER: Tim Cook] [LOC: Seoul]
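To use the generated answer downstream, the `[TYPE: entity]` format shown in the prompt needs to be parsed back into structured entities. The following is a minimal sketch; the function name `parse_entities` and the assumption that type labels are uppercase letters are illustrative choices, not part of the paper.

```python
import re

# Matches spans like "[ORG: Apple]": an uppercase type label, a colon,
# then the entity text up to the closing bracket.
ENTITY_PATTERN = re.compile(r"\[([A-Z]+):\s*([^\]]+?)\s*\]")


def parse_entities(generated_text):
    """Return a list of (entity_type, entity_text) tuples found in the model output."""
    return ENTITY_PATTERN.findall(generated_text)


entities = parse_entities("[ORG: Apple] [PER: Tim Cook] [LOC: Seoul]")
# entities == [("ORG", "Apple"), ("PER", "Tim Cook"), ("LOC", "Seoul")]
```

Text that does not follow the format yields an empty list, which makes malformed generations easy to detect and count during evaluation.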