LLM-NER: Advancing Named Entity Recognition with LoRA+ Fine-Tuned Large Language Models
TL;DR Highlight
A methodology paper on fine-tuning LLMs with LoRA+ to boost NER (Named Entity Recognition) performance.
Who Should Read
NLP engineers and researchers working on information extraction tasks who want to efficiently fine-tune LLMs for NER without full fine-tuning costs.
Core Mechanics
- Standard LoRA trains the A and B matrices with one shared learning rate, which the LoRA+ analysis shows is suboptimal and slows convergence on NER tasks
- LoRA+ addresses this by setting different learning rates for the A matrix (lower) and B matrix (higher), leading to faster and better convergence
- LLMs fine-tuned with LoRA+ for NER significantly outperform smaller specialized NER models on domain-specific datasets
- The approach is particularly effective for low-resource NER domains where training data is limited — LoRA+'s efficiency matters more here
- Few-shot LoRA+ fine-tuning (even with 100-500 examples) achieves results competitive with full fine-tuning on specialized NER benchmarks
- The paper provides practical guidance on LoRA rank selection, learning rate ratios, and training duration for NER
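The A/B learning-rate split described above can be sketched with ordinary optimizer parameter groups. This is a minimal illustration, not the paper's code: the `loraplus_param_groups` helper is hypothetical, and it keys off the `lora_A`/`lora_B` substrings that PEFT uses in adapter parameter names. (Recent PEFT releases also ship a ready-made `create_loraplus_optimizer` in `peft.optimizers`; check your version.)

```python
def loraplus_param_groups(named_params, lr, ratio=16.0):
    """Split (name, parameter) pairs into two optimizer groups,
    per the LoRA+ recipe: B matrices get `ratio` times the base
    learning rate; everything else (including A matrices) gets
    the base rate. Hypothetical helper, not from the paper."""
    base = {"params": [], "lr": lr}
    boosted = {"params": [], "lr": lr * ratio}
    for name, param in named_params:
        (boosted if "lora_B" in name else base)["params"].append(param)
    return [base, boosted]

# Toy (name, parameter) pairs standing in for model.named_parameters()
toy_params = [
    ("layers.0.q_proj.lora_A.weight", "A0"),
    ("layers.0.q_proj.lora_B.weight", "B0"),
]
groups = loraplus_param_groups(toy_params, lr=2e-4, ratio=16.0)
print(groups[1]["lr"] / groups[0]["lr"])  # -> 16.0
```

With a real PEFT model you would pass `model.named_parameters()` and hand the resulting groups to `torch.optim.AdamW(groups)`.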
Evidence
- On CoNLL-2003 NER: LoRA+ fine-tuned LLM achieved F1 92.4 vs. LoRA 91.1 vs. specialized NER model 91.8
- On biomedical NER (low-resource): LoRA+ F1 85.2 vs. LoRA 82.7 vs. BioBERT 83.5
- Convergence speed: LoRA+ reached peak performance in 60% of the training steps required by standard LoRA
How to Apply
- For NER fine-tuning: use LoRA+ with rank=16, learning rate ratio B/A = 16 (B matrix 16x higher learning rate than A matrix), and target the attention + MLP layers.
- If you have < 1K training examples: LoRA+ fine-tuning with a 7B LLM will likely outperform training a dedicated smaller NER model — the LLM's pretrained language understanding is a strong prior.
- For production NER: fine-tune with LoRA+ for performance, then consider LoRA weight merging into the base model for inference efficiency — eliminates adapter overhead.
Code Example
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "meta-llama/Llama-2-7b-hf"  # or another LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Note: lora_alpha is standard LoRA scaling. LoRA+'s distinct A/B learning
# rates are set in the optimizer (e.g. via parameter groups), not in LoraConfig.
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count (a fraction of a percent of the full model)
# Example instruction format for NER
ner_prompt = """Extract named entities from the sentence below. Format: [TYPE: entity]
Sentence: Apple's CEO Tim Cook gave a presentation in Seoul.
Answer:"""
# Expected output: [ORG: Apple] [PER: Tim Cook] [LOC: Seoul]
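The bracketed `[TYPE: entity]` format above is straightforward to post-process into structured predictions. A small sketch (the `parse_entities` helper is illustrative, not from the paper):

```python
import re

def parse_entities(text):
    """Parse '[TYPE: entity]' spans emitted by the model into
    (type, entity) tuples. Illustrative helper, not from the paper."""
    return re.findall(r"\[([A-Z]+):\s*([^\]]+)\]", text)

output = "[ORG: Apple] [PER: Tim Cook] [LOC: Seoul]"
print(parse_entities(output))
# -> [('ORG', 'Apple'), ('PER', 'Tim Cook'), ('LOC', 'Seoul')]
```

For F1 scoring against gold labels, the parsed tuples can be compared set-wise per sentence.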