Multimodal
Latest 60 papers on Multimodal.
Show HN: High-Res Neural Cellular Automata
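A technique, jointly developed by EPFL and Google Research, that scales Neural Cellular Automata (NCA) to high resolution. This SIGGRAPH 2026 paper overcomes the resolution limits of earlier NCAs with a lightweight neural decoder.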
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
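In VLM self-training loops, a verifier that does not fit a given task makes performance drop the longer training runs, while the DPO loss keeps falling normally, so the regression is hard to notice.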
DiffusionGemma: 4x Faster Text Generation
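Google released DiffusionGemma, an open experimental model that achieves up to 4x faster inference by generating 256-token blocks at once with a diffusion approach, instead of generating tokens sequentially as conventional LLMs do. It is distributed under the Apache 2.0 license and runs on consumer GPUs, opening new possibilities for edge devices and real-time interactive workflows.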
Silurus/ooxml: Pixel-faithful Office documents, rendered in the browser
An open-source library that renders DOCX/XLSX/PPTX files directly to a browser Canvas using Rust + WebAssembly. It drew attention because the entire codebase was written by Claude (AI).
1-Bit Bonsai Image 4B Image Generation for Local Devices
A 4B-parameter image generation model whose weights are aggressively compressed to 1-bit/ternary values so that it runs even on an iPhone. The 7.75GB diffusion transformer was shrunk to 0.93GB.
Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents
A study analyzing why shared-memory (whiteboard) multi-agent collaboration actually degrades performance in small 4B to 8B vision models.
Tell HN: Dont use Claude Design, lost access to my projects after unsubscribing
A user's warning that cancelling a Claude Design subscription completely cut off access to existing projects, a case that illustrates the risk of relying on AI tools for important work.
Google Chrome silently installs a 4 GB AI model on your device without consent
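Google Chrome was found to automatically download a 4GB Gemini Nano model file without user consent, and to re-download it even after deletion. Concerns have been raised about possible GDPR violations and about the environmental cost when this applies across billions of devices.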
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
4TB of voice samples just stolen from 40k AI contractors at Mercor
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
Google's open-source model Gemma 4 can now run on iPhone with full local inference without the cloud, demonstrating that on-device AI has moved beyond the experimental stage and entered a practical phase.
Reverse engineering Gemini's SynthID detection
A project has been released that detects and removes SynthID, an invisible watermark inserted by Google Gemini into AI-generated images, using only signal processing and spectral analysis. This is controversial as it demonstrates vulnerabilities in AI-generated image identification technology.
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
We open-sourced a real-time multimodal AI speech and video conversation system that runs completely locally on Apple Silicon M3 Pro without the internet. It is attracting attention for its ability to handle speech recognition, video understanding, and TTS simultaneously without cloud costs.
What peak image prompt engineering looks like:
A post showcasing an image-generation prompt engineering example that drew attention on Reddit. The details could not be verified because the original post was inaccessible due to network blocking.
Show HN: Gemini can now natively embed video, so I built sub-second video search
Google's Gemini Embedding model can now embed video directly into vectors without text transcription, enabling natural language search over dashcam footage — describe 'red truck running a stop sign' and get the clip back.
Show HN: ProofShot – Give AI coding agents eyes to verify the UI they build
An open-source CLI that solves the problem of AI coding agents not being able to see what UI they've created — auto-generating video recordings, screenshots, and error reports via browser automation.
Show HN: Revise – An AI Editor for Documents
An AI-integrated word processor that lets you choose between OpenAI, Anthropic, and xAI models for document editing, correction, translation, and summarization — all in one interface. Tighter AI agent integration is what sets it apart from Google Docs/Word.
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
Matryoshka Gaussian Splatting
A technique for rendering 3D scenes with a single model whose level of detail can be dialed down for low-end devices or up for high-end ones, without sacrificing quality at full capacity.
On Optimizing Multimodal Jailbreaks for Spoken Language Models
Simultaneously manipulating text and audio can jailbreak voice AI models up to 10x more effectively than single-modality attacks.
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Drawing a single red circle on an image can completely flip a VLM's safety judgment — a visual vulnerability study.
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
A lightweight token pruning module that cuts 50% of visual tokens in video AI models with only 0.7% performance loss.
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
A framework that gives VLMs 3D spatial understanding and self-localization using only regular monocular video.
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
A study in which image layout generation and image understanding (grounding) reinforce each other within a single model, improving both tasks.
ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
A pipeline that auto-generates 100K physics-simulation-ready 3D robot manipulation datasets from a single image.
The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models
A paper analyzing how CoT reasoning improves accuracy but breaks the model's uncertainty estimation, leaving it confidently wrong.
Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
A training-free framework that lets vision-language models self-correct hallucinations by collecting visual evidence via SAM3 for iterative verification.
Show HN: Claude Code skills that build complete Godot games
An open-source pipeline where you input a game description and Claude Code handles everything — architecture design, asset generation, GDScript coding, and visual QA — to produce a complete Godot 4 project. Community consensus: impressive tech demo, not a practical tool.
Visual-ERM: Reward Modeling for Visual Equivalence
An 8B multimodal Reward Model that catches fine-grained visual errors in chart/table/SVG-to-code RL training that DINO and text-based rewards miss.
Geometry-Guided Camera Motion Understanding in VideoLLMs
VideoLLMs struggle to recognize camera movements (pan/tilt/dolly) — injecting camera motion info derived from 3D geometry models as prompts fixes it.
Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
CRYSTAL benchmark: step-by-step verification of whether multimodal AI models' reasoning processes are actually correct, even when they get the right answer.
Adaptive Vision-Language Model Routing for Computer Use Agents
A routing framework for GUI automation agents that auto-selects between 7B/72B models based on action difficulty, cutting costs up to 78%.
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
HSL color structure discovered in FLUX.1's latent space — enabling direct color control during generation with no additional training.
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
The MADQA benchmark (800 PDFs, 2,250 questions) shows that even top AI agents can't navigate documents 'strategically' the way humans do.
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
A training-free agent pipeline that accurately renders text in images — even mathematical formulas and rare CJK characters.
Linking Perception, Confidence and Accuracy in MLLMs
Found a bug where multimodal LLMs stay overconfident even with blurry images, fixed it with RL, and built a Test-Time Scaling framework on top of it.
Claude now creates interactive charts, diagrams and visualizations
Claude can now generate interactive charts, diagrams, and visualizations directly within conversations — now in beta.
XSkill: Continual Learning from Experience and Skills in Multimodal Agents
A multimodal agent that keeps getting smarter on its own by accumulating two types of parameter-free memory: past experiences (action-level) and skills (task-level).
Resurfacing Paralinguistic Awareness in Large Audio Language Models
A fine-tuning technique that enables voice AI to recognize age, gender, and emotion from voice to give different responses to children vs adults.
Hardening Firefox with Anthropic's Red Team
Anthropic partnered with Mozilla to use Claude Opus 4.6 to find security vulnerabilities in Firefox, a concrete example of AI-assisted browser hardening.
Meta’s AI smart glasses and data privacy concerns
Photos taken with Meta Ray-Ban smart glasses are being sent to workers in Kenya and other countries for labeling and review — raising major privacy concerns.
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
An uncertainty measurement framework that proactively detects queries where multimodal LLMs are likely to be wrong — without external tools — and auto-routes them to experts or larger models.
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
'Visual CoT', which generates images while reasoning, outperforms text-only CoT by up to 26 percentage points on spatial and physics problems.
MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
A benchmark showing that automatic per-query model routing achieves the same accuracy as the strongest single model at just 33% of the cost.
Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
MM-OOD: a framework that adds image+text multimodal reasoning to text-only OOD detection, catching anomalous samples better in zero-shot on top of CLIP.
FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference
A token pruning framework for audio-visual multimodal LLMs that cuts computation by 40%+ without additional training while maintaining or even improving performance.
LLM-Driven Accessible Interface: A Model-Based Approach
An architectural proposal for automatically generating WCAG-compliant accessible UIs by combining UserProfile, declarative rules, and LLM.
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
A benchmark that systematically measures how well AI maintains, reasons over, and updates memory across dozens of multi-session conversations mixing images and text.
Empowering Reliable Visual-Centric Instruction Following in MLLMs
We created a benchmark and 10k fine-tuning dataset to verify whether multimodal models actually reference images — existing evaluations could be passed without any image at all.
Exploring KV Cache Quantization in Multimodal Large Language Model Inference
Quantizing the KV Cache of multimodal LLMs with images makes first-token latency 1.7x faster and output throughput 4.3x faster.
Show HN: Gemini Pro 3 imagines the HN front page 10 years from now
An experiment feeding Gemini Pro 3 today's HN front page and asking it to predict what HN looks like in 2035 — exposing the limits of AI future prediction.
Nano Banana can be prompt engineered for nuanced AI image generation
Google's autoregressive image generation model Nano Banana matches or beats existing diffusion models on key metrics.
EuroLLM: LLM made in Europe built to support all 24 official EU languages
An open-source LLM jointly developed by 8 European universities and institutions supporting all 24 EU official languages with 1.7B parameters.
Gemini 2.5 Computer Use model
Google released a specialized model based on Gemini 2.5 Pro that can see computer screens and directly operate mouse/keyboard via API. Outperforms competitors on web/mobile benchmarks with lower latency.
Qwen3-Omni: Native Omni AI model for text, image and video
Alibaba's unified multimodal LLM that processes text, images, video, and audio in a single model.
I replaced Animal Crossing's dialogue with a live LLM by hacking GameCube memory
A project connecting LLM-powered real-time AI dialogue to Animal Crossing NPC conversations on GameCube via shared memory — without modifying a single line of game code. Demonstrates the potential of retro game modding meets LLM NPCs.
LLM Visualization
An interactive website that visualizes the entire process of how Transformer-based LLMs process tokens step by step — understand LLM internals intuitively without code.
Monitor your security cameras with locally processed AI
How to analyze CCTV in real-time with AI on edge devices — no cloud required.
Video Summarization with Large Language Models
Converting video frames to text captions, then having an LLM score importance for video summarization — achieving SotA over traditional visual feature-based approaches.
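The caption-then-score idea in the last entry can be illustrated with a minimal Python sketch. It is not the paper's implementation: `summarize_by_caption_scores`, `keyword_score`, and the sample captions are hypothetical names and data introduced here, and a toy keyword counter stands in for the LLM importance score. It assumes the captions already exist (one per frame) and that the summary is simply the highest-scoring frames under a fixed budget.

```python
from typing import Callable, List, Sequence


def summarize_by_caption_scores(
    captions: Sequence[str],
    score_caption: Callable[[str], float],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order."""
    if budget <= 0:
        return []
    scores = [score_caption(c) for c in captions]
    # Rank frame indices by score, highest first; ties keep the earlier frame.
    ranked = sorted(range(len(captions)), key=lambda i: (-scores[i], i))
    # Restore temporal order so the summary plays back chronologically.
    return sorted(ranked[:budget])


def keyword_score(caption: str) -> float:
    """Toy stand-in for an LLM importance score (hypothetical, for illustration)."""
    keywords = ("goal", "crash", "celebrates")
    return float(sum(word in caption.lower() for word in keywords))


# Hypothetical per-frame captions; a real pipeline would get these from a captioner.
captions = [
    "Players warm up on the field",
    "A striker scores a goal",
    "The crowd celebrates the goal",
    "Players walk back to midfield",
]
selected = summarize_by_caption_scores(captions, keyword_score, budget=2)
print(selected)
```

In a real pipeline, `score_caption` would wrap a call to whichever LLM is used, and the selection step could be swapped for a duration-constrained method. The sketch only shows the data flow from captions to scores to selected frames.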