Fine-tuning

Latest 60 papers on Fine-tuning.

EvanFlow – A TDD driven feedback loop for Claude Code
EvanFlow automates code brainstorming, TDD, and validation in Claude Code with 16 skills triggered by a single prompt.
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
The HarDBench benchmark reveals that large language models readily complete draft instructions for building explosives.
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE automatically ensembles multiple LLM verification models without ground truth labels, achieving Best-of-N performance comparable to semi-supervised learning.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
This is an educational project that implements a single-layer Transformer with 1,216 parameters in HyperTalk, a scripting language from 1987, and trains it on a real Macintosh SE/30. It demonstrates that the core mathematics of modern LLMs works the same on hardware more than 30 years old.
One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
We discovered that LLM responses can shrink by up to 48% with a single instruction: "Don't use commas".
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
A benchmark that measures whether an AI coding agent, given incomplete specifications, can recognize when to ask a human for help.
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
Introducing MegaTrain, a system that leverages CPU memory as the primary storage and utilizes the GPU solely as a compute engine, enabling full-precision training of 120B parameter models with just a single H200 GPU.
Show HN: I built a tiny LLM to demystify how language models work
This educational project lets you build a mini LLM with 8.7 million parameters, trained on a guppy fish persona, from scratch in about 5 minutes in a single Colab notebook, with the aim of demystifying how LLMs work.
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
We open-sourced a real-time multimodal AI speech and video conversation system that runs completely locally on Apple Silicon M3 Pro without the internet. It is attracting attention for its ability to handle speech recognition, video understanding, and TTS simultaneously without cloud costs.
Nanocode: The best Claude Code that $200 can buy in pure JAX on TPUs
An open-source library that lets you train a 1.3B parameter coding agent model from scratch on a $200 (approximately 270,000 KRW) TPU budget, following Anthropic's Constitutional AI approach. It can serve as a hands-on reference for developers who want to understand the entire AI training pipeline directly.
Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs
PrismML has released the Bonsai LLM series (8B/4B/1.7B) based on 1-bit weights, claiming 14x memory reduction, 8x speed improvement, and 5x energy savings compared to conventional 16-bit models, while achieving comparable benchmark performance.
Ollama is now powered by MLX on Apple Silicon in preview
Ollama has switched its inference backend on Apple Silicon from llama.cpp to Apple's MLX framework, delivering up to nearly 2x faster inference speeds. On M5 chips, it also leverages the GPU Neural Accelerator, bringing meaningful performance gains to coding agent workflows.
Hamilton-Jacobi-Bellman Equation: Reinforcement Learning and Diffusion Models
A math blog post showing how 1840s physics equations connect modern RL and Diffusion Models, explaining that continuous-time RL and generative model training are two faces of the same optimal control problem.
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
A breakdown of how LLM KV Cache architecture has evolved from GPT-2 to DeepSeek V3, comparing per-token memory costs across architectures as they dropped from 300KB to 69KB.
CERN uses ultra-compact AI models on FPGAs for real-time LHC data filtering
CERN uses a 'hardware-first' inference approach at the LHC by burning PyTorch/TensorFlow models directly into FPGAs to filter hundreds of terabytes of collision data per second at nanosecond latency — a radical departure from conventional GPU/TPU-based AI.
If you don't opt out by Apr 24 GitHub will train on your private repos
Starting April 24, GitHub's policy uses Copilot users' private repo interaction data for AI training by default. You need to know exactly where the opt-out link is and what data is actually in scope.
Running Claude Code fully offline on a MacBook — no API key, no cloud, 17s per task
A post sharing how to run Claude Code fully offline on a MacBook by connecting it to a local LLM without an API key or cloud, useful for developers who want to use an AI coding assistant at no cost.
TurboQuant: Redefining AI efficiency with extreme compression
Google Research's two-stage vector compression, which combines PolarQuant and QJL, achieves a 6x KV cache reduction with zero accuracy loss and an 8x attention speedup on H100 GPUs.
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
A Rust-based open-source project that distributes LLM weights across GPU, RAM, and NVMe when a model exceeds your Mac's physical memory, so that models that crash llama.cpp with OOM errors can actually run.
LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?
A training-free technique (RYS) that duplicates Transformer layers works across all modern LLMs — and reveals that internal representations converge toward a "universal language" independent of human language.
SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Seven cognitively grounded prompt templates turn a small domain corpus into massive synthetic training data and outperform complex RL and multi-stage approaches at knowledge injection.
[R] Doc-to-LoRA: Learning to Instantly Internalize Contexts from Sakana AI
Sakana AI's D2L uses a hypernetwork to generate a LoRA adapter from a document in a single forward pass with sub-second latency, extending the context window 5x beyond base model capacity.
NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute
Achieved 10x data efficiency in a few weeks — training a 1.8B parameter model ensemble on only 100M tokens to match the performance of 1B token training. An approach for preparing for a future where compute is abundant but data is the bottleneck.
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Multilingual embeddings supporting 200 languages without English bias that outperform Qwen3-Embedding at smaller sizes.
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
NVIDIA's recipe for training a 30B MoE open-source reasoning model that won gold medals at IMO and IOI 2025.
Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster
An experiment report: give Claude Code 16 GPUs and it runs 910 experiments in 8 hours, achieves a 2.87% improvement in validation loss, and develops its own strategy for leveraging a mixed H100/H200 hardware pool.
Context Bootstrapped Reinforcement Learning
Gradually injecting few-shot examples early in RL training then slowly removing them lets the model internalize reasoning patterns on its own.
Memento-Skills: Let Agents Design Agents
A system where agents self-evolve by accumulating executable 'Skill' files as external memory, without touching LLM parameters.
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
A lightweight token pruning module that cuts 50% of visual tokens in video AI models with only 0.7% performance loss.
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
A framework that gives VLMs 3D spatial understanding and self-localization using only regular monocular video.
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
A study where image layout generation and image understanding (grounding) help each other within a single model, improving both tasks.
Online Experiential Learning for Language Models
An LLM framework that keeps learning from real-world usage after deployment — no reward functions, no human labeling needed.
LLM Architecture Gallery
Dr. Sebastian Raschka put together a one-page gallery with architecture diagrams and key specs for dozens of major LLMs — Llama, DeepSeek, Qwen, Gemma and more — so you can compare design decisions at a glance.
Tree Search Distillation for Language Models Using PPO
Like AlphaZero, this trains LLMs by using MCTS to find stronger reasoning paths and then uses PPO to distill those paths back into the model.
Visual-ERM: Reward Modeling for Visual Equivalence
An 8B multimodal Reward Model that catches fine-grained visual errors in chart/table/SVG-to-code RL training that DINO and text-based rewards miss.
Neuron-Aware Data Selection In Instruction Tuning For Large Language Models
A framework that automatically selects high-quality fine-tuning data by analyzing internal neuron activation patterns in the model.
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
A benchmark dataset for systematically evaluating and reducing LLM hallucinations when analyzing ESG reports.
Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
CRYSTAL benchmark: step-by-step verification of whether multimodal AI models' reasoning processes are actually correct, even when they get the right answer.
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
Experiments show that RL-trained attacker LLMs can break through all current state-of-the-art prompt injection defenses.
daVinci-Env: Open SWE Environment Synthesis at Scale
An open-source pipeline that auto-generates 45,320 Docker environments for SWE agent training, enabling a Qwen2.5-72B-based model to top SWE-bench.
Long-form RewardBench: Evaluating Reward Models for Long-form Generation
The first benchmark for evaluating reward models on long-form generation, addressing the gap in existing reward model benchmarks that only cover short texts.
Can I run AI locally?
A browser tool that detects your GPU specs via WebGPU and recommends which LLMs you can actually run locally on your hardware.
Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization
An ant colony optimization-based routing framework that smartly distributes queries across multiple LLM agents — cutting costs while achieving 4.7x throughput improvement.
DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning
A framework that auto-generates specialized fine-tuning data for finance, medicine, math and more from just a task definition — no human labeling needed.
Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design
Experiments show that improving distractor quality in multiple-choice questions significantly boosts RLVR training effectiveness, and the work provides an automated pipeline for doing so.
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
HSL color structure discovered in FLUX.1's latent space — enabling direct color control during generation with no additional training.
Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models
EBFT: a new fine-tuning method that matches feature statistics of model outputs to the ground truth rather than training at the token level as SFT does. It significantly improves over SFT on several benchmarks.
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Key finding: models trained with a reasoning judge learn adversarial output strategies that game the LLM judge rather than improve reasoning.
QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
A reverse data selection technique that selects only 25% of synthetic code training data and achieves the same performance as training on the full dataset.
Linking Perception, Confidence and Accuracy in MLLMs
Found a bug where multimodal LLMs stay overconfident even with blurry images, fixed it with RL, and built a Test-Time Scaling framework on top of it.
Automatic Generation of High-Performance RL Environments
A recipe for AI coding agents to automatically convert RL training environments to JAX/Rust, making them up to 22,320x faster — for under $10.
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
Analysis of the 'self-locking' phenomenon in RL-trained LLM agents that stop asking questions and fail to use information, with a simple directional signal injection fix that boosts performance by up to 60%.
Can RL Improve Generalization of LLM Agents? An Empirical Study
RFT-trained LLM agents generalize well within the same environment but transfer to new environments is limited — sequential multi-environment training may be the solution.
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
A Human-in-the-Loop grading system that auto-grades only when the LLM is confident, and routes uncertain answers to teachers.
PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
An LLM agent framework that starts from personal profiles and automatically generates realistic digital records like emails, messages, and calendar entries.
Resurfacing Paralinguistic Awareness in Large Audio Language Models
A fine-tuning technique that enables voice AI to recognize age, gender, and emotion from a speaker's voice so that it can respond differently to children and adults.
AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
MoE+LoRA combinations slow inference by 2.5x — AdaFuse solves this by fusing all layer adapters in a single CUDA kernel call, achieving 2.4x speedup.
Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models
Implementing GPT-4o-mini-level RAG noise filtering with a 1.7B small model — 98% cost reduction, 94.6% latency reduction.