GlyphBanana: Achieving Precise Text Rendering with an Agentic Workflow

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Mar 12, 2026 • Zexuan Yan, Jiarui Jin, Yue Ma +5

TL;DR Highlight

A training-free agent pipeline that accurately renders text in generated images, including mathematical formulas and rare Chinese characters.

Who Should Read

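AI engineers trying to fix text that comes out broken or wrong in image generation models, especially formulas, Chinese characters, and rare words. Product developers who want to add precise text rendering to a T2I (Text-to-Image) pipeline.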

Core Mechanics

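Training-free (no fine-tuning): it plugs into existing DiT-based image generation models such as Z-Image and Qwen-Image.
Four-stage agent pipeline: ① text and style extraction → ② draft image generation + layout planning → ③ glyph template injection → ④ style refinement.
Frequency Decomposition injects only the high-frequency structure of the glyphs into the latent space, which raises character accuracy while preserving the background style.
Attention Re-weighting strengthens glyph-token ↔ text-token connections in the self-attention of the DiT blocks and suppresses non-glyph regions.
A VLM (Qwen3-VL-235B) acts as both Style Refiner and Score Judger, refining iteratively to optimize the visual harmony between background and text automatically.
A new benchmark, GlyphBanana-Bench, is released: 290 samples graded by difficulty, from simple English to complex multi-line math formulas. It is presented as the first benchmark to cover formula evaluation.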
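
The frequency-decomposition idea above can be illustrated with a minimal sketch. It assumes a box blur as the low-pass filter and plain nested lists in place of latent tensors; this summary does not state which filter or injection rule the paper actually uses, so `low_pass`, `high_pass`, `inject_glyph`, `radius`, and `strength` are all hypothetical names chosen for illustration.

```python
from typing import List

Grid = List[List[float]]


def low_pass(x: Grid, radius: int = 1) -> Grid:
    """Box blur with edge clamping: keeps slowly varying content (overall tone)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total, count = 0.0, 0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ii = min(max(i + di, 0), h - 1)  # clamp row index to the grid
                    jj = min(max(j + dj, 0), w - 1)  # clamp column index to the grid
                    total += x[ii][jj]
                    count += 1
            out[i][j] = total / count
    return out


def high_pass(x: Grid, radius: int = 1) -> Grid:
    """Residual after blurring: keeps edges and fine strokes."""
    low = low_pass(x, radius)
    return [[x[i][j] - low[i][j] for j in range(len(x[0]))] for i in range(len(x))]


def inject_glyph(background: Grid, glyph: Grid, strength: float = 1.0, radius: int = 1) -> Grid:
    """Add only the glyph's high-frequency band on top of the background (assumed rule)."""
    high = high_pass(glyph, radius)
    return [
        [background[i][j] + strength * high[i][j] for j in range(len(background[0]))]
        for i in range(len(background))
    ]


# A flat 5x5 background and a glyph template with one vertical stroke in column 2.
background = [[0.5] * 5 for _ in range(5)]
stroke = [[1.0 if j == 2 else 0.0 for j in range(5)] for _ in range(5)]
merged = inject_glyph(background, stroke)
```

Because a flat glyph template has no high-frequency content, injecting it leaves the background unchanged; only strokes and edges alter the result, which is the property the technique relies on.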

Evidence

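OCR Accuracy reaches 85.9 (+19.6 percentage points) on Z-Image and 75.8 (+6.91 percentage points) on Qwen-Image, the highest accuracy among all baselines (AnyText2 33.8, TextCrafter 34.0, GLM-Image 62.1, and others).
In the user study (lower is better), Ours+Z-Image scores 2.27 versus 5.07 for the previous best, Z-Image, bringing the preference rank down to less than half.
Adding a 5×5 grid overlay to the Layout Planner raises Mean IoU from 0.2703 (VLM only) to 0.5531, a 104.6% relative improvement.
Over 3 rounds of Iterative Refinement, Visual Quality rises steadily from 7.62 to 9.12 while Text Accuracy stays at 85.9.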
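
For readers who want to reproduce the layout numbers above, here is a small sketch of how IoU between a planned and a reference bounding box is typically computed, plus the arithmetic behind the 104.6% figure. The helper names `box_iou` and `relative_gain` are illustrative and do not come from the paper's code.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two [xmin, ymin, xmax, ymax] boxes in normalized [0, 1] coordinates."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100.0


# Mean IoU values reported in the Evidence section.
print(round(relative_gain(0.2703, 0.5531), 1))  # 104.6
```

The gain is relative, not in percentage points: the absolute Mean IoU increase is 0.5531 - 0.2703 = 0.2828.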

How to Apply

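If your existing FLUX or Qwen-Image T2I pipeline produces frequent text errors, wrap it with the agentic workflow from the GlyphBanana GitHub repository: it adds latent injection only, with no model retraining.
For systems that auto-generate educational or scientific content with formulas, connect a MathJax-based Formula Renderer as an auxiliary tool to insert LaTeX formulas into images with pixel-level accuracy.
The Style Refinement prompt templates (Clean Prompt, Style Prompt) can be copied as-is into other img2img pipelines to preserve the background while harmonizing the text style.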
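
When wrapping an existing pipeline this way, the refine-then-judge loop might look like the sketch below. It is not the repository's API: `refine_until_good`, the `refine` and `judge` callables, `max_rounds`, and `target_score` are hypothetical, and images are represented by strings so the example runs without any model. The default of three rounds mirrors the ablation reported in the Evidence section.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    image: str           # placeholder handle; a real system would hold pixels or latents
    visual_score: float  # score returned by the judge (for example a VLM)


def refine_until_good(
    image: str,
    refine: Callable[[str], str],
    judge: Callable[[str], float],
    max_rounds: int = 3,
    target_score: float = 9.0,
) -> Tuple[Candidate, List[float]]:
    """Refine up to max_rounds times and return the best-scoring candidate."""
    best = Candidate(image, judge(image))
    history = [best.visual_score]
    current = image
    for _ in range(max_rounds):
        if best.visual_score >= target_score:
            break  # good enough, stop spending model calls
        current = refine(current)
        score = judge(current)
        history.append(score)
        if score > best.visual_score:
            best = Candidate(current, score)
    return best, history


# Stub stages: each refinement appends "+" and the judge awards one point per "+".
def fake_refine(image: str) -> str:
    return image + "+"


def fake_judge(image: str) -> float:
    return 7.0 + image.count("+")


best, history = refine_until_good("draft", fake_refine, fake_judge)
print(best.image, history)  # draft++ [7.0, 8.0, 9.0]
```

Keeping the best candidate rather than the last one guards against a refinement round that lowers the judged quality.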

Code Example

# GlyphBanana Typography Analysis prompt (can be passed to a VLM as-is)
typography_analysis_prompt = """
You are an expert in image typography analysis.
Given a reference image with a 5x5 grid and coordinate annotations,
analyze the natural text rendering style and overall scene.
Then plan the best typography layout for each text/formula item.

Critical constraints:
- Bounding boxes must remain flat and frontal (no perspective distortion)
- Red grid lines are positioning aids only — ignore them for style description
- Grid coordinates: {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} on each axis

Per-region output fields:
- content: target text or formula
- bbox: [xmin, ymin, xmax, ymax] in [0,1]
- font: from registered font list or 'auto'
- font_weight: light/regular/bold
- font_size_ratio: scalar in [0.1, 1.0]
- color: white/black/red/blue/green/yellow/orange/brown/gray/gold/silver/purple/pink
- is_latex: boolean
- alignment: left/center/right
- rotation: degrees (0 = horizontal)

Return strict JSON with keys: image_analysis, text_regions
"""

# Clean Prompt generation (keeps the background description, strips text instructions)
clean_prompt_template = """
Remove ALL quoted text, formulas, and text-rendering instructions from the prompt.
Keep ONLY the scene/background/style description.
Add 'no text visible' at the end.

Example:
Input: A classroom blackboard displays "E=mc²" in elegant chalk writing.
Output: An empty classroom blackboard as background, clear and without any text. No text visible.

Input prompt: {user_prompt}
Output ONLY the cleaned prompt, nothing else.
"""

# Style Prompt generation (for harmonizing text with the background)
style_prompt_template = """
Generate a SHORT image-editing instruction (10-30 words).
Goal: restyle foreground text to harmonize with the background
while keeping the background untouched.
Do NOT move, resize, or alter any text content or position.

Background style: {background_style}
Dominant colors: {colors}
Text style hint: {hint}

Output ONLY the instruction in English, 10-30 words.
"""
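
Since the typography prompt asks for strict JSON, it is worth validating the reply before passing it to a renderer. The following is a minimal sketch under the field definitions listed in that prompt; `validate_layout` and the sample reply are hypothetical and are not part of the GlyphBanana code.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("content", "bbox", "font_weight", "font_size_ratio", "is_latex", "alignment")
ALLOWED_WEIGHTS = {"light", "regular", "bold"}
ALLOWED_ALIGNMENTS = {"left", "center", "right"}


def validate_layout(raw: str) -> List[Dict[str, Any]]:
    """Parse a VLM reply and return its text_regions; raise ValueError on bad fields."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    for key in ("image_analysis", "text_regions"):
        if key not in data:
            raise ValueError(f"missing top-level key: {key}")
    regions = data["text_regions"]
    for i, region in enumerate(regions):
        missing = [f for f in REQUIRED_FIELDS if f not in region]
        if missing:
            raise ValueError(f"region {i}: missing fields {missing}")
        xmin, ymin, xmax, ymax = region["bbox"]
        if not (0.0 <= xmin < xmax <= 1.0 and 0.0 <= ymin < ymax <= 1.0):
            raise ValueError(f"region {i}: bbox out of range or inverted")
        if not 0.1 <= region["font_size_ratio"] <= 1.0:
            raise ValueError(f"region {i}: font_size_ratio outside [0.1, 1.0]")
        if region["font_weight"] not in ALLOWED_WEIGHTS:
            raise ValueError(f"region {i}: unknown font_weight")
        if region["alignment"] not in ALLOWED_ALIGNMENTS:
            raise ValueError(f"region {i}: unknown alignment")
        if not isinstance(region["is_latex"], bool):
            raise ValueError(f"region {i}: is_latex must be a boolean")
    return regions


# A hand-written reply standing in for real VLM output.
sample_region = {
    "content": "E=mc^2", "bbox": [0.2, 0.4, 0.8, 0.6], "font": "auto",
    "font_weight": "regular", "font_size_ratio": 0.5, "color": "white",
    "is_latex": True, "alignment": "center", "rotation": 0,
}
reply = json.dumps({"image_analysis": "chalkboard scene", "text_regions": [sample_region]})
regions = validate_layout(reply)
```

One caveat when filling the templates: the typography prompt contains literal braces in its grid-coordinate line, so `str.format` would fail on it unless those braces are doubled; the Clean Prompt and Style Prompt templates only contain intended placeholders.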

Terminology

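DiT (Diffusion Transformer): A diffusion-model architecture for image generation that uses a Transformer instead of a U-Net. It is the core architecture of recent image generation models such as FLUX and Qwen-Image.

Glyph: The visual form (the shape itself) of a character or symbol; the actual drawing of each character stored in a font file.

Latent Space: The low-dimensional mathematical space into which a VAE compresses an image. Images are generated and edited by manipulating this compressed representation rather than raw pixels.

Frequency Decomposition: A technique that splits an image signal into low-frequency (overall color and brightness) and high-frequency (edges and fine contours) components. Here, only the glyph outlines (high frequency) are injected, leaving the background colors untouched.

Attention Re-weighting: A technique that manually adjusts a Transformer's attention scores, strengthening or weakening the connection between specific token pairs so the model focuses on the desired regions.

OCR (Optical Character Recognition): Technology that reads text from images. Here it serves as the evaluation metric for how accurately text was rendered in the generated image.

VLM (Vision-Language Model): A multimodal model that understands images and text together, such as GPT-4V or Claude. Here it acts as the agent's brain for layout planning, style analysis, and quality evaluation.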
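
As an illustration of the Attention Re-weighting entry above, the sketch below scales selected attention weights and renormalizes each query row. It assumes a simple multiplicative rule applied through a log-bias on the logits; the paper's exact formulation is not given in this summary, and `reweight_attention`, `boost`, and `suppress` are hypothetical names and values.

```python
import math
from typing import List, Set, Tuple


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one query row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(
    logits: List[List[float]],
    boosted_pairs: Set[Tuple[int, int]],
    suppressed_pairs: Set[Tuple[int, int]],
    boost: float = 2.0,
    suppress: float = 0.5,
) -> List[List[float]]:
    """Scale chosen (query, key) weights, then renormalize each row.

    Adding log(k) to a logit multiplies its unnormalized weight by k.
    """
    out = []
    for q, row in enumerate(logits):
        biased = []
        for k, value in enumerate(row):
            if (q, k) in boosted_pairs:
                value += math.log(boost)      # e.g. glyph token attending to its text token
            elif (q, k) in suppressed_pairs:
                value += math.log(suppress)   # e.g. glyph token attending to a non-glyph region
            biased.append(value)
        out.append(softmax(biased))
    return out


# One query with three equally scored keys: boost key 0, suppress key 2.
weights = reweight_attention([[0.0, 0.0, 0.0]], {(0, 0)}, {(0, 2)})
```

With equal starting logits, the unnormalized weights become 2 : 1 : 0.5, so after renormalization the boosted key receives 2/3.5 of the attention mass while every row still sums to 1.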

Related Resources

GlyphBanana GitHub Repository: https://github.com/yuriYanZeXuan/GlyphBanana

Original Abstract

Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.