Resurfacing Paralinguistic Awareness in Large Audio Language Models
TL;DR Highlight
A fine-tuning technique (PE-FT) that restores a voice AI's ability to recognize age, gender, and emotion from speech, so it can respond differently to, for example, children and adults.
Who Should Read
ML engineers building voice-based AI assistants whose responses should vary with user context (children vs. adults, emotional state), especially developers looking to fine-tune Large Audio Language Models such as Qwen2.5-Omni or Kimi-Audio.
Core Mechanics
- Current audio LLMs like Qwen2.5-Omni and Kimi-Audio almost completely ignore paralinguistic signals (age/gender/emotion) from voice and respond based only on content — PA-score is nearly 0
- Layer analysis: early layers (0-6) carry strong paralinguistic signals, middle layers (7-14) handle semantic understanding, information transition happens at layer 7
- Selective-layer fine-tuning (training only layers 0-14) performs better than full-layer fine-tuning (0-27) — trains only paralinguistic + semantic layers
- Adding an auxiliary dual-level classification head (ADCH), which predicts paralinguistic attributes from the layer-14 output, yields a further significant gain, most notably on emotion recognition
- Child safety issue: original model gives the same detailed instructions (electrical repair, knife use, etc.) to children and adults. After PE-FT, PA-rate improves from 7% to 97%
- PE-FT generalizes to new topics not in training — achieves 97% on child safety evaluation even without child safety samples in training
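The paper's layer-wise analysis pipeline is not reproduced here, but the probing idea behind it can be sketched. The snippet below is a hedged illustration on synthetic hidden states: a nearest-centroid classifier stands in for whatever probe the authors actually use, and the signal-decay schedule is invented solely so the toy data has a paralinguistic-to-semantic transition to find.

```python
# Illustrative layer-probing sketch (synthetic data, not the paper's pipeline).
import numpy as np

rng = np.random.default_rng(0)
num_layers, n, d = 28, 200, 32

# Synthetic per-layer hidden states: a binary paralinguistic label
# (e.g. child/adult) is linearly decodable early and fades in later layers.
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
hidden = []
for layer in range(num_layers):
    strength = max(0.0, 1.0 - layer / 15.0)  # invented decay past ~layer 14
    h = rng.normal(size=(n, d)) + strength * np.outer(2 * labels - 1, direction)
    hidden.append(h)

def probe_accuracy(h, y):
    """Nearest-class-centroid probe: a cheap stand-in for a linear probe."""
    c0, c1 = h[y == 0].mean(0), h[y == 1].mean(0)
    pred = np.linalg.norm(h - c1, axis=1) < np.linalg.norm(h - c0, axis=1)
    return float((pred == y).mean())

accs = [probe_accuracy(h, labels) for h in hidden]
best_layer = int(np.argmax(accs))
print(f"best paralinguistic layer: {best_layer}, accuracy: {accs[best_layer]:.2f}")
```

Running the same per-layer probes on real hidden states (and real attribute labels) is how you would re-derive the 0-14 cutoff for your own model.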
Evidence
- Qwen2.5-Omni: original age PA-score 0.010 → after PE-FT 0.945, PA-rate 50.5% → 97.3%
- Kimi-Audio: child safety PA-rate 4.29% → after PE-FT 98.57% (corresponding samples were not in training data)
- Selective-layer (0-14) fine-tuning higher than full-layer (0-27): Qwen2.5-Omni emotion PA-score 0.393 → 0.460
- PE-FT preserves general capability on VoiceBench (HS) with minimal drop: 72.34 vs. 71.16 for full-layer fine-tuning (Qwen2.5-Omni)
How to Apply
- When fine-tuning Qwen2.5-Omni or Kimi-Audio, training only layers 0-14 with LoRA improves parameter efficiency and paralinguistic recognition. Layer range can be re-explored for your own model using the paper's layer-wise analysis pipeline.
- When composing training data, create paired audio samples with different speakers (child/adult, male/female, per emotion) for the same text query using TTS synthesis, and set different correct responses per paralinguistic attribute. Pipeline: generate text samples with GPT-4.1, then synthesize audio with TTS.
- Add a lightweight classification head (ADCH) at layer 14 output that simultaneously predicts category (age/gender/emotion) + attribute value (child/adult, etc.) as auxiliary loss summed with SFT loss at λ=0.5 for additional improvement on harder categories like emotion. Remove ADCH at inference time.
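The paired-data recipe above can be sketched as a data structure plus a pairing loop. This is a minimal sketch under stated assumptions: `synthesize_speech` is a hypothetical placeholder for your TTS system, the GPT-4.1 text-generation step is omitted, and the field names are invented for illustration.

```python
# Paired-sample construction sketch: same text query, different speaker
# profiles, attribute-specific target responses.
from dataclasses import dataclass

@dataclass
class PairedSample:
    text_query: str       # identical text for every speaker variant
    category: str         # "age" | "gender" | "emotion"
    attribute: str        # e.g. "child" / "adult"
    audio_path: str       # TTS output for this speaker profile
    target_response: str  # attribute-specific correct answer

def build_pairs(text_query, variants):
    """variants: {(category, attribute): (tts_voice, target_response)}"""
    samples = []
    for (category, attribute), (voice, response) in variants.items():
        # audio = synthesize_speech(text_query, voice=voice)  # hypothetical TTS call
        audio_path = f"tts_{voice}.wav"
        samples.append(PairedSample(text_query, category, attribute,
                                    audio_path, response))
    return samples

pairs = build_pairs(
    "How do I change a light bulb?",
    {
        ("age", "child"): ("child_voice", "Please ask an adult to help you with this."),
        ("age", "adult"): ("adult_voice", "First switch off the power, then unscrew the old bulb..."),
    },
)
```

The key design point is that every variant shares `text_query`, so any difference in the target response is attributable only to the paralinguistic attribute.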
Code Example
# PE-FT core loss configuration example
import torch
import torch.nn as nn

class ADCH(nn.Module):
    """Auxiliary Dual-level Classification Head."""

    def __init__(self, hidden_size, num_categories=3, num_attrs_per_cat=(2, 2, 6)):
        super().__init__()
        # Category classification head (age / gender / emotion)
        self.category_head = nn.Linear(hidden_size, num_categories)
        # Per-attribute classification heads (child/adult, male/female, happy/sad/...)
        self.attr_heads = nn.ModuleList([
            nn.Linear(hidden_size, n) for n in num_attrs_per_cat
        ])

    def forward(self, h_layer14, y_cate):
        logits_cate = self.category_head(h_layer14)
        # Route each sample to the head of its ground-truth category; pad with
        # -inf up to the largest head size so mixed-category batches can be
        # stacked (padded entries contribute nothing to the softmax).
        max_n = max(head.out_features for head in self.attr_heads)
        padded = []
        for i in range(h_layer14.shape[0]):
            logits = self.attr_heads[y_cate[i]](h_layer14[i])
            pad = logits.new_full((max_n - logits.shape[0],), float("-inf"))
            padded.append(torch.cat([logits, pad]))
        logits_attr = torch.stack(padded)
        return logits_cate, logits_attr

def pe_ft_loss(sft_loss, logits_cate, logits_attr, y_cate, y_attr, lam=0.5):
    """PE-FT total loss = SFT loss + λ * (category loss + attribute loss)."""
    ce = nn.CrossEntropyLoss()
    l_cate = ce(logits_cate, y_cate)
    l_attr = ce(logits_attr, y_attr)
    return sft_loss + lam * (l_cate + l_attr)
# When configuring LoRA, specify only layers 0-14 as training targets
# (e.g., when using HuggingFace PEFT)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Filter target_modules to include only attention/FFN of layers 0-14
    target_modules=[
        f"model.layers.{i}.self_attn.q_proj" for i in range(15)
    ] + [
        f"model.layers.{i}.self_attn.v_proj" for i in range(15)
    ],
    lora_dropout=0.05,
    bias="none",
)
# model = get_peft_model(base_lalm, lora_config)
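Instead of hard-coding the module list, the layer-0-14 targets can be collected programmatically from the model's module tree. This is a hedged sketch: the `model.layers.{i}.self_attn.{q,v}_proj` naming convention is assumed (it matches the config above, but verify it against `named_modules()` on your actual checkpoint), and a toy module tree stands in for a real LALM.

```python
# Collect LoRA targets for layers 0-14 by pattern-matching module names.
import re
import torch.nn as nn

class ToyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.Module()
        self.self_attn.q_proj = nn.Linear(8, 8)
        self.self_attn.v_proj = nn.Linear(8, 8)

class ToyModel(nn.Module):
    """Stand-in for a real LALM with a model.layers.{i} module tree."""
    def __init__(self, n_layers=28):
        super().__init__()
        self.model = nn.Module()
        self.model.layers = nn.ModuleList(ToyLayer() for _ in range(n_layers))

def selective_targets(model, max_layer=14,
                      pattern=r"model\.layers\.(\d+)\.self_attn\.(q|v)_proj$"):
    """Return names of q/v projections in layers 0..max_layer."""
    targets = []
    for name, _ in model.named_modules():
        m = re.match(pattern, name)
        if m and int(m.group(1)) <= max_layer:
            targets.append(name)
    return targets

targets = selective_targets(ToyModel())
print(len(targets))  # 15 layers x 2 projections = 30
```

The resulting list drops straight into `LoraConfig(target_modules=targets, ...)`, and adjusting `max_layer` is how you would apply a different cutoff found by the layer-wise analysis.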
Original Abstract
Large Audio Language Models (LALMs) have expanded the interaction with human to speech modality, which introduces great interactive potential, due to the paralinguistic cues implicitly indicating the user context. However, building on the current content-centred paradigm, LALMs usually neglect such paralinguistic cues and respond solely based on query content. In this work, to resurface the paralinguistic awareness in LALMs, we introduce five diverse layer-wise analyses to jointly identify paralinguistic layers and semantic understanding layers. Based on these insights, we propose a paralinguistic-enhanced fine-tuning (PE-FT) protocol accordingly to equip LALMs with paralinguistic-aware capabilities, including (1) selective-layer fine-tuning, and (2) an auxiliary dual-level classification head. Our experiments demonstrate that PE-FT protocol efficiently and effectively resurfaces the paralinguistic awareness, even surpassing the performance of the all-layer fine-tuning strategy.