Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models | AI Paper Digest