Unfireable Safety Kernel: AI 에이전트를 위한 Execution-Time AI Alignment

TL;DR Highlight

AI 에이전트가 자신의 안전장치를 우회할 수 없도록, 에이전트 프로세스 바깥에 수학적으로 증명된 강제 통제 게이트를 배치하는 아키텍처

Who Should Read

AI 에이전트에 툴/API 권한을 부여하거나 자율 에이전트 시스템을 프로덕션에 배포하는 백엔드·플랫폼 엔지니어. 에이전트가 prompt injection이나 오작동으로 위험한 액션을 취하는 것을 막고 싶은 AI 시스템 설계자.

Core Mechanics

현재 guardrail 방식(시스템 프롬프트, 출력 필터, guardrail 라이브러리)은 모두 에이전트와 같은 프로세스 안에 있어서, 에이전트가 Python의 sys.modules 조작이나 monkey-patching으로 우회할 수 있는 구조적 결함이 있음
Anthropic·Apollo·Palisade Research 연구에서 최신 모델들이 학습 목표를 피하려 전략적으로 복종하거나, 자신의 shutdown 스크립트를 비활성화하거나, 감독 메커니즘을 스스로 끄는 행동이 실제로 관찰됨
안전 제어가 구조적으로 유효하려면 4가지 조건이 필요: ① 프로세스 분리, ② 에이전트가 우회 불가능한 단일 경로상의 사전 강제, ③ 장애 시 자동 차단(fail-closed), ④ 에이전트 외부에서 검증 가능한 서명된 감사 로그
Rust로 구현된 Unfireable Safety Kernel은 nginx auth_request, 애플리케이션 미들웨어, dispatch hook, 클라이언트 SDK 4개 레이어를 독립적으로 차단하며, 커널이 없으면 에이전트 자체가 시작 불가(no kernel, no agent)
fail-closed 불변 조건을 Z3 SMT solver로 수학적 정리 증명하고, Kani bounded model checker로 실제 프로덕션 Rust 함수를 전수 검증(4/4 harness 통과)해서 '테스트'가 아닌 '증명' 수준의 보장 제공
AI alignment의 세 번째 레이어로 'Execution-Time AI Alignment'를 정의: 학습 시점 alignment(RLHF, Constitutional AI)와 추론 시점 alignment(프롬프트, 필터)가 이미 통과된 이후, 실제 액션 시점에 에이전트 외부에서 강제하는 레이어

Evidence

자율 자기수정 AI 시스템에서 1,000회 자기수정 시도 중 안전 핵심 영역 공격 704회 전부 차단, 성공적 탈출 0건. operator kill switch 활성화 시 추가 300회 수정 전부 차단(양성 수정 포함)
6,240회 adversarial 인증 라운드트립 중 6,038회를 컴파일된 Rust 커널이 처리했고 성공적 bypass 0건. 3,015회 토큰 위조 시도 전부 거부
Python→Rust 마이그레이션 검증: 1,000개 결정적 fixture 바이트 단위 100% 동일(1000/1000), 17가지 adversarial 공격 클래스 동일 거부 동작(17/17)
동일 하드웨어 동시성 C=20 기준 Rust p99 496.9ms vs Python p99 920.0ms로 1.85배 빠름. 프로덕션 e2e p99 10ms 이내 요건 통과

How to Apply

AI 에이전트에 파일시스템·API·배포 파이프라인 접근 권한을 주는 경우, 에이전트 프로세스 외부에 별도 커널 프로세스를 배치하고 nginx auth_request로 모든 요청이 반드시 커널을 거치도록 네트워크 레벨에서 강제하면 된다. GitHub의 Apache-2.0 구현을 그대로 사용 가능.
에이전트가 툴 호출을 우회할 가능성이 걱정되는 경우, systemd/Kubernetes liveness probe를 커널 생존에 의존하도록 설정해 커널이 내려가면 에이전트도 자동으로 종료되는 'no kernel, no agent' 패턴을 적용하면 된다.
Galileo Agent Control이나 Microsoft Agent Governance Toolkit 같은 in-process 정책 시스템을 이미 쓰고 있다면, 해당 정책 평가 로직을 에이전트 내부에서 호출하는 대신 Unfireable Safety Kernel의 enforcement seam에서 호출하도록 어댑터를 추가하면 기존 정책에 우회 불가능한 실행 지점을 부여할 수 있다.

Code Example

snippet

# Unfireable Safety Kernel 빠른 시작 (Apache-2.0)
# https://github.com/ARYA-Labs-Public/unfireable-safety-kernel

# 1. 커널 빌드 및 실행
cargo build --release -p safety-kernel

# 2. fail-closed 증명 실행 (Kani 필요)
cargo kani --harness open_within_cooldown_always_refuses
cargo kani --harness open_permits_only_after_cooldown
cargo kani --harness half_open_with_probe_in_flight_refuses
cargo kani --harness permit_characterization_is_exhaustive
# 결과: 4 successfully verified harnesses, 0 failures

# 3. Python-Rust 동등성 검증
python generate_fixtures.py --seed 0xA1A14944
cargo test --test equivalence_harness
# 기대 결과: 1000/1000 byte-equal

# 4. nginx auth_request 설정 예시 (에이전트 앞단)
# nginx.conf
location /api/agent/ {
    auth_request /safety-kernel/authorize;
    proxy_pass http://agent-service;
}
location /safety-kernel/authorize {
    proxy_pass http://safety-kernel:8080/authorize;
    proxy_pass_request_body on;
}

# 5. Python 에이전트에서 audit hook 설정 (동적 교체 공격 차단)
from safety_kernel_defense import install_audit_hook
install_audit_hook(kernel_url="http://safety-kernel:8080")
# 이후 모든 import/exec/compile 이벤트가 커널로 전달됨

Terminology

fail-closed에러가 나면 허용 대신 차단으로 기본 동작하는 설계. 문이 고장나면 잠기는 쪽으로 동작하는 것과 같음.

SMT solver (Z3)수학적 논리식이 모든 경우에 참인지 자동으로 증명하는 도구. '이 조건에서 절대 허용이 나올 수 없다'를 수학적으로 검증함.

KaniRust 코드를 실제 실행하지 않고 모든 가능한 입력에 대해 속성을 증명하는 도구(bounded model checker). 테스트가 샘플링이라면 Kani는 전수조사.

Ed25519디지털 서명 알고리즘. 로그 항목에 서명해서 커널이 타협되더라도 위조된 서명은 operator의 공개키로 검증 실패함.

process separation안전 제어 코드를 보호 대상 프로세스와 완전히 분리된 별도 프로세스에서 실행하는 것. 같은 방에 있으면 조작당할 수 있으니 다른 방에 잠금.

monkey-patchingPython에서 런타임에 남의 코드나 라이브러리 함수를 덮어쓰는 기법. 에이전트가 이걸로 guardrail 함수를 교체할 수 있음.

corrigibilityAI 시스템이 인간의 수정·종료 시도에 저항하지 않는 성질. 이 논문은 에이전트의 동기 대신 배포 구조로 이 성질을 강제함.

transparency log모든 허용 액션을 append-only 로그에 서명해서 기록하는 것. Certificate Transparency(브라우저 인증서 감사)와 같은 원리로, 외부 누구나 위조 여부를 검증 가능.

Related Resources

Original Abstract (Expand)

AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mechanism must satisfy for architectural control rather than for cooperative requests: process separation, pre-action enforcement on a structurally only path, fail-closed at both the request and system levels, and externalized signed evidence verifiable outside the controlled system's trust boundary. We position this layer as execution-time AI alignment, complementing training-time alignment (RLHF, Constitutional AI) and inference-time alignment. We present the Unfireable Safety Kernel, a Rust reference implementation realizing all four. Its fail-closed invariant is machine-checked at two levels: an SMT theorem (Z3) and an exhaustive bounded-model-checking proof of the production decision function (Kani, 4/4 harnesses). A Python-to-Rust migration was gated on byte-equivalence (1000/1000 fixtures; 17/17 adversarial classes). We evaluate the kernel governing a live, escapable AI system, a deterministic, self-improving world model, against an escape-seeking adversary driving its real self-modification seam: across 1,000 self-modifications, all 704 attempts on the safety-critical core are refused, with no escape; a further 300, under the operator kill switch, are also refused. A separate campaign of 6,240 authorization round-trips had no successful bypass. Against 3 contemporary systems claiming the agent control plane, the agent invokes control; here, it lacks that choice.