AdaFuse: Token-Level Pre-Gating과 Fused Kernel 최적화로 Dynamic Adapter 추론 가속화

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

Mar 12, 2026•Qiyang Li, Rui Kong, Yuchen Li +5•View PDF

TL;DR Highlight

MoE+LoRA 조합의 추론 속도가 2.5배 느려지는 문제를, CUDA 커널 한 번으로 모든 레이어 어댑터를 한꺼번에 합쳐서 2.4배 빠르게 해결

Who Should Read

LLM 서빙 인프라를 운영하면서 LoRA 기반 다중 어댑터나 MoE 구조의 추론 지연을 줄여야 하는 ML 엔지니어. 멀티태스크/멀티도메인 모델을 파인튜닝하고 배포하는 팀에서 특히 유용.

Core Mechanics

Dynamic Adapter(입력에 따라 다른 LoRA를 켜는 구조)는 파라미터를 1~5%만 추가하는데도 추론 지연을 250~950%나 늘림 — 연산량 문제가 아니라 CUDA 커널 호출 횟수가 문제
기존 layer-wise/block-wise 라우팅은 레이어마다 라우팅 결정을 내려서 어댑터를 미리 합칠 수 없는 구조적 한계가 있음
AdaFuse는 '첫 번째 레이어에서 한 번만 라우팅, 결과를 전 레이어에 적용'하는 token-level pre-gating 방식으로 실행 경로를 정적으로 고정
SGMM(Segmented Gather Matrix Multiplication)이라는 커스텀 CUDA 커널로 모든 레이어의 활성화된 LoRA를 단 1번의 커널 호출로 백본에 합침
토큰이 바뀔 때마다 이전 토큰 어댑터는 빼고 새 어댑터를 더하는 'fused switching' 연산도 SGMM 커널 1회로 처리
Llama2-7B, Mistral-7B 기준으로 기존 가장 빠른 dynamic adapter(PESC) 대비 2.7배 빠르고, 기존 backbone 대비 29% 지연 증가에 그침

Evidence

Llama2-7B baseline 2.4ms/token 대비 AdaFuse 3.1ms/token(+29%) — 기존 MoRAL 8.6ms(+258%), MOLA 25.3ms(+954%)와 비교
도메인 특화 태스크 평균 정확도: Llama2-7B에서 AdaFuse 83.60%, 최강 베이스라인 MoLA 84.20%로 경쟁력 유지
Mistral-7B 도메인 태스크 평균: AdaFuse 87.24%로 PESC 87.06%, MoRAL 87.05%를 소폭 상회
SGMM 없이 Simple merge만 적용 시 4.2ms/token → SGMM 적용 시 3.1ms/token으로 커널 최적화만으로 26% 추가 단축

How to Apply

멀티태스크 LoRA 서빙 중 레이어마다 라우팅하는 MoE-LoRA 구조를 쓰고 있다면, 라우터를 첫 번째 레이어 하나로 통합하고 그 결과를 전 레이어에 전파하는 pre-gating 방식으로 아키텍처를 변경하면 됨
어댑터 전환 비용이 큰 추론 서버에서는 토큰 생성 시 이전 어댑터 언머지 + 새 어댑터 머지를 별도 커널로 2회 호출하는 대신, concat 후 단일 batched GEMM으로 처리하는 SGMM 패턴을 적용하면 커널 런치 오버헤드를 대폭 줄일 수 있음
새로 dynamic adapter를 설계할 때, 라우팅 결정을 레이어 깊이와 무관하게 앞단에서 1회만 내리는 'decide-once, apply-everywhere' 원칙을 채택하면 추론 경로를 정적으로 만들어 기존 LLM 최적화 기법(PagedAttention 등)과 호환성이 높아짐

Code Example

snippet

# AdaFuse의 token-level pre-gating 핵심 로직 (개념 코드)
# 기존 layer-wise routing:
# for each layer l:
#     gate_l = router_l(hidden_l)  # 레이어마다 라우팅
#     output_l = backbone_l(hidden_l) + sum(gate_l[i] * lora_i(hidden_l))

# AdaFuse pre-gating:
# 1. 첫 레이어에서 한 번만 라우팅
gate = router_first_layer(x_first)  # shape: [top_k]
top_k_indices = gate.topk(k=2).indices

# 2. 선택된 LoRA들을 모든 레이어에 걸쳐 SGMM으로 한 번에 머지
# fused_down[l] = concat([lora_down[l][i] for i in top_k_indices])
# fused_up[l]   = concat([lora_up[l][i]   for i in top_k_indices])
# SGMM: backbone[l] += fused_down[l] @ fused_up[l]  (모든 l 동시 처리)
sgmm_kernel(
    backbone_weights=backbone_weights_all_layers,   # 포인터 배열
    lora_down=fused_down_all_layers,
    lora_up=fused_up_all_layers,
    gates=gate[top_k_indices],
    num_layers=num_layers
)  # CUDA 커널 1회 호출로 전 레이어 머지 완료

# 3. 머지된 backbone으로 일반 forward (어댑터 오버헤드 없음)
for l in range(num_layers):
    hidden = fused_backbone[l](hidden)  # 일반 행렬곱만

Terminology

LoRA모델 전체를 재학습하지 않고 작은 행렬 두 개만 추가해서 학습하는 기법. 원본 모델은 그대로 두고 '보정 레이어'만 학습하는 것.

MoEMixture-of-Experts. 여러 전문가(Expert) 네트워크 중 입력에 맞는 몇 개만 골라 쓰는 구조. 식당에서 전문 셰프 여럿 중 요리 종류에 맞는 셰프만 부르는 것과 비슷.

Dynamic Adapter입력 토큰에 따라 어떤 LoRA를 켤지 실시간으로 결정하는 어댑터. 정적 어댑터가 항상 같은 보정을 하는 것과 달리, 문맥에 맞게 다른 보정을 선택함.

CUDA 커널GPU에서 실행되는 병렬 연산 단위. 커널을 '호출'하는 것 자체에 고정 오버헤드가 있어서, 작은 계산도 호출이 많으면 느려짐.

Pre-gating처리 전에 미리 어떤 경로를 쓸지 결정하는 것. 고속도로 진입 전에 목적지 경로를 완전히 정해두는 것과 유사.

GEMMGeneral Matrix Multiplication. GPU에서 행렬 곱을 수행하는 핵심 연산. 딥러닝의 대부분 계산이 결국 GEMM으로 환원됨.

PEFTParameter-Efficient Fine-Tuning. 모델 전체가 아닌 일부 파라미터만 학습해 비용을 줄이는 파인튜닝 기법들의 총칭. LoRA, Prefix Tuning 등이 여기 속함.

Original Abstract (Expand)

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.