Resurfacing Paralinguistic Awareness in Large Audio Language Models
TL;DR Highlight
A fine-tuning technique (PE-FT) that restores a voice AI's ability to recognize age, gender, and emotion from speech, so it can respond differently to, for example, children and adults.
Who Should Read
ML engineers building voice-based AI assistants whose responses should vary with user context (children vs. adults, emotional state), especially developers looking to fine-tune Large Audio Language Models such as Qwen2.5-Omni or Kimi-Audio.
Core Mechanics
- Current audio LLMs like Qwen2.5-Omni and Kimi-Audio almost completely ignore paralinguistic signals (age/gender/emotion) from voice and respond based only on content — PA-score is nearly 0
- Layer analysis: early layers (0-6) carry strong paralinguistic signals, middle layers (7-14) handle semantic understanding, information transition happens at layer 7
- Selective-layer fine-tuning (training only layers 0-14) performs better than full-layer fine-tuning (0-27) — trains only paralinguistic + semantic layers
- Adding an auxiliary dual-level classification head (ADCH), which predicts paralinguistic attributes from the layer-14 output, yields a further significant gain, most notably on emotion recognition
- Child safety issue: original model gives the same detailed instructions (electrical repair, knife use, etc.) to children and adults. After PE-FT, PA-rate improves from 7% to 97%
- PE-FT generalizes to new topics not in training — achieves 97% on child safety evaluation even without child safety samples in training
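The paper's layer-wise analysis pipeline is not reproduced here, but the probing idea behind it can be sketched. The snippet below is a hedged illustration on synthetic hidden states: a nearest-centroid classifier stands in for whatever probe the authors actually use, and the signal-decay schedule is invented solely so the toy data has a paralinguistic-to-semantic transition to find.

```python
# Illustrative layer-probing sketch (synthetic data, not the paper's pipeline).
import numpy as np

rng = np.random.default_rng(0)
num_layers, n, d = 28, 200, 32

# Synthetic per-layer hidden states: a binary paralinguistic label
# (e.g. child/adult) is linearly decodable early and fades in later layers.
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
hidden = []
for layer in range(num_layers):
    strength = max(0.0, 1.0 - layer / 15.0)  # invented decay past ~layer 14
    h = rng.normal(size=(n, d)) + strength * np.outer(2 * labels - 1, direction)
    hidden.append(h)

def probe_accuracy(h, y):
    """Nearest-class-centroid probe: a cheap stand-in for a linear probe."""
    c0, c1 = h[y == 0].mean(0), h[y == 1].mean(0)
    pred = np.linalg.norm(h - c1, axis=1) < np.linalg.norm(h - c0, axis=1)
    return float((pred == y).mean())

accs = [probe_accuracy(h, labels) for h in hidden]
best_layer = int(np.argmax(accs))
print(f"best paralinguistic layer: {best_layer}, accuracy: {accs[best_layer]:.2f}")
```

Running the same per-layer probes on real hidden states (and real attribute labels) is how you would re-derive the 0-14 cutoff for your own model.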
Evidence
- Qwen2.5-Omni: original age PA-score 0.010 → after PE-FT 0.945, PA-rate 50.5% → 97.3%
- Kimi-Audio: child safety PA-rate 4.29% → after PE-FT 98.57% (corresponding samples were not in training data)
- Selective-layer (0-14) fine-tuning higher than full-layer (0-27): Qwen2.5-Omni emotion PA-score 0.393 → 0.460
- PE-FT preserves general capability on VoiceBench (HS) with minimal drop: 72.34 vs. 71.16 for full-layer fine-tuning (Qwen2.5-Omni)
How to Apply
- When fine-tuning Qwen2.5-Omni or Kimi-Audio, training only layers 0-14 with LoRA improves parameter efficiency and paralinguistic recognition. Layer range can be re-explored for your own model using the paper's layer-wise analysis pipeline.
- When composing training data, create paired audio samples with different speakers (child/adult, male/female, per emotion) for the same text query using TTS synthesis, and set different correct responses per paralinguistic attribute. Pipeline: generate text samples with GPT-4.1, then synthesize audio with TTS.
- Add a lightweight classification head (ADCH) at layer 14 output that simultaneously predicts category (age/gender/emotion) + attribute value (child/adult, etc.) as auxiliary loss summed with SFT loss at λ=0.5 for additional improvement on harder categories like emotion. Remove ADCH at inference time.
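The paired-data recipe above can be sketched as a data structure plus a pairing loop. This is a minimal sketch under stated assumptions: `synthesize_speech` is a hypothetical placeholder for your TTS system, the GPT-4.1 text-generation step is omitted, and the field names are invented for illustration.

```python
# Paired-sample construction sketch: same text query, different speaker
# profiles, attribute-specific target responses.
from dataclasses import dataclass

@dataclass
class PairedSample:
    text_query: str       # identical text for every speaker variant
    category: str         # "age" | "gender" | "emotion"
    attribute: str        # e.g. "child" / "adult"
    audio_path: str       # TTS output for this speaker profile
    target_response: str  # attribute-specific correct answer

def build_pairs(text_query, variants):
    """variants: {(category, attribute): (tts_voice, target_response)}"""
    samples = []
    for (category, attribute), (voice, response) in variants.items():
        # audio = synthesize_speech(text_query, voice=voice)  # hypothetical TTS call
        audio_path = f"tts_{voice}.wav"
        samples.append(PairedSample(text_query, category, attribute,
                                    audio_path, response))
    return samples

pairs = build_pairs(
    "How do I change a light bulb?",
    {
        ("age", "child"): ("child_voice", "Please ask an adult to help you with this."),
        ("age", "adult"): ("adult_voice", "First switch off the power, then unscrew the old bulb..."),
    },
)
```

The key design point is that every variant shares `text_query`, so any difference in the target response is attributable only to the paralinguistic attribute.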
Code Example
# PE-FT core loss configuration example
import torch
import torch.nn as nn

class ADCH(nn.Module):
    """Auxiliary Dual-level Classification Head."""

    def __init__(self, hidden_size, num_categories=3, num_attrs_per_cat=(2, 2, 6)):
        super().__init__()
        # Category classification head (age / gender / emotion)
        self.category_head = nn.Linear(hidden_size, num_categories)
        # Per-attribute classification heads (child/adult, male/female, happy/sad/...)
        self.attr_heads = nn.ModuleList([
            nn.Linear(hidden_size, n) for n in num_attrs_per_cat
        ])

    def forward(self, h_layer14, y_cate):
        logits_cate = self.category_head(h_layer14)
        # Route each sample to the head of its ground-truth category; pad with
        # -inf up to the largest head size so mixed-category batches can be
        # stacked (padded entries contribute nothing to the softmax).
        max_n = max(head.out_features for head in self.attr_heads)
        padded = []
        for i in range(h_layer14.shape[0]):
            logits = self.attr_heads[y_cate[i]](h_layer14[i])
            pad = logits.new_full((max_n - logits.shape[0],), float("-inf"))
            padded.append(torch.cat([logits, pad]))
        logits_attr = torch.stack(padded)
        return logits_cate, logits_attr

def pe_ft_loss(sft_loss, logits_cate, logits_attr, y_cate, y_attr, lam=0.5):
    """PE-FT total loss = SFT loss + λ * (category loss + attribute loss)."""
    ce = nn.CrossEntropyLoss()
    l_cate = ce(logits_cate, y_cate)
    l_attr = ce(logits_attr, y_attr)
    return sft_loss + lam * (l_cate + l_attr)
# When configuring LoRA, specify only layers 0-14 as training targets
# (e.g., when using HuggingFace PEFT)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Filter target_modules to include only attention/FFN of layers 0-14
    target_modules=[
        f"model.layers.{i}.self_attn.q_proj" for i in range(15)
    ] + [
        f"model.layers.{i}.self_attn.v_proj" for i in range(15)
    ],
    lora_dropout=0.05,
    bias="none",
)
# model = get_peft_model(base_lalm, lora_config)
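Instead of hard-coding the module list, the layer-0-14 targets can be collected programmatically from the model's module tree. This is a hedged sketch: the `model.layers.{i}.self_attn.{q,v}_proj` naming convention is assumed (it matches the config above, but verify it against `named_modules()` on your actual checkpoint), and a toy module tree stands in for a real LALM.

```python
# Collect LoRA targets for layers 0-14 by pattern-matching module names.
import re
import torch.nn as nn

class ToyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.Module()
        self.self_attn.q_proj = nn.Linear(8, 8)
        self.self_attn.v_proj = nn.Linear(8, 8)

class ToyModel(nn.Module):
    """Stand-in for a real LALM with a model.layers.{i} module tree."""
    def __init__(self, n_layers=28):
        super().__init__()
        self.model = nn.Module()
        self.model.layers = nn.ModuleList(ToyLayer() for _ in range(n_layers))

def selective_targets(model, max_layer=14,
                      pattern=r"model\.layers\.(\d+)\.self_attn\.(q|v)_proj$"):
    """Return names of q/v projections in layers 0..max_layer."""
    targets = []
    for name, _ in model.named_modules():
        m = re.match(pattern, name)
        if m and int(m.group(1)) <= max_layer:
            targets.append(name)
    return targets

targets = selective_targets(ToyModel())
print(len(targets))  # 15 layers x 2 projections = 30
```

The resulting list drops straight into `LoraConfig(target_modules=targets, ...)`, and adjusting `max_layer` is how you would apply a different cutoff found by the layer-wise analysis.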
Original Abstract
Large Audio Language Models (LALMs) have expanded the interaction with human to speech modality, which introduces great interactive potential, due to the paralinguistic cues implicitly indicating the user context. However, building on the current content-centred paradigm, LALMs usually neglect such paralinguistic cues and respond solely based on query content. In this work, to resurface the paralinguistic awareness in LALMs, we introduce five diverse layer-wise analyses to jointly identify paralinguistic layers and semantic understanding layers. Based on these insights, we propose a paralinguistic-enhanced fine-tuning (PE-FT) protocol accordingly to equip LALMs with paralinguistic-aware capabilities, including (1) selective-layer fine-tuning, and (2) an auxiliary dual-level classification head. Our experiments demonstrate that PE-FT protocol efficiently and effectively resurfaces the paralinguistic awareness, even surpassing the performance of the all-layer fine-tuning strategy.