Multimodal
The latest 58 papers and posts on multimodal AI.
4TB of voice samples just stolen from 40k AI contractors at Mercor
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
Google's open-source Gemma 4 model now runs on iPhone with fully local, offline inference, a sign that on-device AI has moved past the experimental stage into practical use.
Reverse engineering Gemini's SynthID detection
A newly released project detects and removes SynthID, the invisible watermark Google Gemini embeds in AI-generated images, using only signal processing and spectral analysis. It is controversial because it exposes vulnerabilities in AI-generated image identification.
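The project's actual algorithm isn't documented here; as a rough, hypothetical illustration of the spectral-analysis idea, one could compare how much of an image's energy sits in high spatial frequencies, where pixel-level watermarks tend to hide. The band cutoff, the green-channel choice, and the comparison logic below are all assumptions, not details from the repo.

```python
# Hypothetical sketch of frequency-domain watermark probing.
# NOT the released project's algorithm; band choice is an assumption.
import numpy as np
from PIL import Image

def highfreq_energy_ratio(path: str, band: float = 0.35) -> float:
    """Fraction of spectral energy above `band` * Nyquist (green channel)."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    g = img[:, :, 1] - img[:, :, 1].mean()            # remove the DC component
    spec = np.abs(np.fft.fftshift(np.fft.fft2(g))) ** 2
    h, w = spec.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2))
    return float(spec[r > band].sum() / spec.sum())

# A watermark hiding energy in high frequencies would shift this ratio
# relative to an unwatermarked image from the same generator.
print(highfreq_energy_ratio("sample.png"))
```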
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
We open-sourced a real-time multimodal speech-and-video conversation system (audio/video in, voice out) that runs entirely locally on an Apple Silicon M3 Pro, no internet required. It handles speech recognition, video understanding, and TTS simultaneously, with zero cloud costs.
What peak image prompt engineering looks like:
An image-generation prompt-engineering showcase that went viral on Reddit; the original post could not be accessed for verification due to a network block, so details remain unconfirmed.
Show HN: Gemini can now natively embed video, so I built sub-second video search
Google's Gemini Embedding model can now embed video directly into vectors without text transcription, enabling natural language search over dashcam footage — describe 'red truck running a stop sign' and get the clip back.
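The embedding calls themselves are specific to whichever API produced the vectors, so the sketch below keeps them abstract: `embed_text` is a placeholder for the text-embedding call, and the clip vectors are assumed precomputed. Only the retrieval core, cosine similarity over normalized vectors, is concrete.

```python
# Minimal retrieval core for natural-language video search.
# `embed_text` and the precomputed clip vectors stand in for whatever
# multimodal embedding API is used; only the search math is shown.
import numpy as np

def build_index(clip_vectors: list[np.ndarray]) -> np.ndarray:
    index = np.stack(clip_vectors)
    return index / np.linalg.norm(index, axis=1, keepdims=True)

def search(index: np.ndarray, query_vec: np.ndarray, k: int = 5) -> list[int]:
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                       # cosine similarity per clip
    return list(np.argsort(-scores)[:k])     # indices of best-matching clips

# Usage: ids = search(build_index(vecs), embed_text("red truck running a stop sign"))
```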
Show HN: ProofShot – Give AI coding agents eyes to verify the UI they build
An open-source CLI that solves the problem of AI coding agents not being able to see what UI they've created — auto-generating video recordings, screenshots, and error reports via browser automation.
Show HN: Revise – An AI Editor for Documents
An AI-integrated word processor that lets you choose between OpenAI, Anthropic, and xAI models for document editing, correction, translation, and summarization — all in one interface. Tighter AI agent integration is what sets it apart from Google Docs/Word.
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
Matryoshka Gaussian Splatting
A technique for rendering 3D scenes with a single model that scales freely from low-end to high-end devices without sacrificing quality at any tier.
On Optimizing Multimodal Jailbreaks for Spoken Language Models
Simultaneously manipulating text and audio can jailbreak voice AI models up to 10x more effectively than single-modality attacks.
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Drawing a single red circle on an image can completely flip a VLM's safety judgment — a visual vulnerability study.
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
A lightweight token pruning module that cuts 50% of visual tokens in video AI models with only 0.7% performance loss.
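The module's actual scoring function isn't reproduced here; a minimal sketch, assuming a per-token importance score is already available (a random stand-in below), reduces to top-k selection:

```python
# Toy version of score-based visual token pruning (keep the top 50%).
# The scores here are a stand-in; the paper learns a scoring module.
import torch

def prune_visual_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep: float = 0.5):
    """tokens: (N, d) visual tokens; scores: (N,) importance per token."""
    k = max(1, int(tokens.size(0) * keep))
    idx = torch.topk(scores, k).indices.sort().values   # keep temporal order
    return tokens[idx], idx

tokens = torch.randn(1024, 768)
scores = torch.rand(1024)            # placeholder importance scores
kept, idx = prune_visual_tokens(tokens, scores)
print(kept.shape)                    # torch.Size([512, 768])
```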
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
A framework that gives VLMs 3D spatial understanding and self-localization using only regular monocular video.
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
A study where image layout generation and image understanding (grounding) help each other within a single model, improving both tasks.
ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
A pipeline that auto-generates a 100K-scale, physics-simulation-ready 3D object dataset for robot manipulation, creating each object from a single image.
The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models
A paper analyzing how CoT reasoning improves accuracy but breaks the model's uncertainty estimation — making it confidently wrong.
Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
A training-free framework that lets vision-language models self-correct hallucinations by collecting visual evidence via SAM3 for iterative verification.
Show HN: Claude Code skills that build complete Godot games
An open-source pipeline where you input a game description and Claude Code handles everything — architecture design, asset generation, GDScript coding, and visual QA — to produce a complete Godot 4 project. Community consensus: impressive tech demo, not a practical tool.
Visual-ERM: Reward Modeling for Visual Equivalence
An 8B multimodal Reward Model that catches fine-grained visual errors in chart/table/SVG-to-code RL training that DINO and text-based rewards miss.
Geometry-Guided Camera Motion Understanding in VideoLLMs
VideoLLMs struggle to recognize camera movements (pan/tilt/dolly) — injecting camera motion info derived from 3D geometry models as prompts fixes it.
Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
CRYSTAL benchmark: step-by-step verification of whether multimodal AI models' reasoning processes are actually correct, even when they get the right answer.
Adaptive Vision-Language Model Routing for Computer Use Agents
A routing framework for GUI automation agents that auto-selects between 7B/72B models based on action difficulty, cutting costs up to 78%.
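The paper's router is learned; a minimal sketch of the idea, with a hypothetical `estimate_difficulty` scorer and a fixed threshold standing in for the learned policy:

```python
# Difficulty-based model routing for a GUI agent (illustrative only).
# `estimate_difficulty` is hypothetical; the paper's policy is learned.
def route(action_description: str, estimate_difficulty) -> str:
    score = estimate_difficulty(action_description)  # 0.0 (easy) .. 1.0 (hard)
    return "vlm-72b" if score > 0.6 else "vlm-7b"

# Easy clicks go to the cheap model; ambiguous multi-step actions escalate.
print(route("click the blue 'Submit' button", lambda s: 0.2))   # vlm-7b
print(route("fill in the tax form from this PDF", lambda s: 0.9))  # vlm-72b
```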
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
HSL color structure discovered in FLUX.1's latent space — enabling direct color control during generation with no additional training.
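The discovered directions are specific to FLUX.1's latent space; the sketch below shows only the generic steering mechanism, with a random placeholder vector standing in for a learned hue direction.

```python
# Generic latent steering: shift a latent along a precomputed color
# direction before decoding. The placeholder direction below is random;
# the paper extracts real hue/saturation/lightness axes from FLUX.1.
import numpy as np

def steer_latent(latent: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    d = direction / np.linalg.norm(direction)
    return latent + strength * d       # push the sample toward that color

rng = np.random.default_rng(0)
latent = rng.standard_normal(4096)
hue_direction = rng.standard_normal(4096)   # placeholder for a learned axis
print(steer_latent(latent, hue_direction, 2.0).shape)
```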
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
The MADQA benchmark (800 PDFs, 2,250 questions) shows that even top AI agents can't navigate documents 'strategically' the way humans do.
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
A training-free agent pipeline that accurately renders text in images — even mathematical formulas and rare CJK characters.
Linking Perception, Confidence and Accuracy in MLLMs
Identifies a failure mode where multimodal LLMs remain overconfident even on blurry images, mitigates it with RL, and builds a test-time scaling framework on top of it.
Claude now creates interactive charts, diagrams and visualizations
Claude can now generate interactive charts, diagrams, and visualizations directly within conversations; the feature is in beta.
XSkill: Continual Learning from Experience and Skills in Multimodal Agents
A multimodal agent that keeps getting smarter on its own by accumulating two types of parameter-free memory: past experiences (action-level) and skills (task-level).
Resurfacing Paralinguistic Awareness in Large Audio Language Models
A fine-tuning technique that lets voice AI recognize age, gender, and emotion from the speaker's voice, so it can, for example, respond differently to children and adults.
Hardening Firefox with Anthropic's Red Team
Anthropic partnered with Mozilla to use Claude Opus 4.6 for adding accessibility features to Firefox — a concrete AI+browser integration.
Meta’s AI smart glasses and data privacy concerns
Photos taken with Meta Ray-Ban smart glasses are being sent to workers in Kenya and other countries for labeling and review — raising major privacy concerns.
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
An uncertainty measurement framework that proactively detects queries where multimodal LLMs are likely to be wrong — without external tools — and auto-routes them to experts or larger models.
Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
'Visual CoT', which generates images while reasoning, outperforms text-only CoT by up to 26 percentage points on spatial and physics problems.
MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
A benchmark proving that automatic per-query model routing achieves the same accuracy as the strongest single model at just 33% of the cost.
Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
MM-OOD: a framework that adds image+text multimodal reasoning to text-only OOD detection, catching anomalous samples better in zero-shot on top of CLIP.
FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference
A token pruning framework for audio-visual multimodal LLMs that cuts computation by 40%+ without additional training while maintaining or even improving performance.
LLM-Driven Accessible Interface: A Model-Based Approach
An architectural proposal for automatically generating WCAG-compliant accessible UIs by combining UserProfile, declarative rules, and LLM.
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
A benchmark that systematically measures how well AI maintains, reasons over, and updates memory across dozens of multi-session conversations mixing images and text.
Empowering Reliable Visual-Centric Instruction Following in MLLMs
We created a benchmark and 10k fine-tuning dataset to verify whether multimodal models actually reference images — existing evaluations could be passed without any image at all.
Exploring KV Cache Quantization in Multimodal Large Language Model Inference
Quantizing the KV Cache of multimodal LLMs with images makes first-token latency 1.7x faster and output throughput 4.3x faster.
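The paper's exact scheme isn't detailed here; as a minimal sketch, the kind of symmetric int8 round trip applied to cached key/value tensors looks like this (a single per-tensor scale, which is cruder than production schemes):

```python
# Symmetric int8 quantize/dequantize round trip for a KV cache slice.
# Per-tensor scale for brevity; real schemes use finer granularity.
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().amax() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

kv = torch.randn(32, 128, 64)                # (heads, seq, head_dim) slice
q, s = quantize_int8(kv)
print((dequantize(q, s) - kv).abs().max())   # small error, 4x less memory than fp32
```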
Show HN: Gemini Pro 3 imagines the HN front page 10 years from now
An experiment feeding Gemini Pro 3 today's HN front page and asking it to predict what HN looks like in 2035 — exposing the limits of AI future prediction.
Nano Banana can be prompt engineered for nuanced AI image generation
Google's autoregressive image generation model Nano Banana matches or beats existing diffusion models on key metrics.
EuroLLM: LLM made in Europe built to support all 24 official EU languages
An open-source 1.7B-parameter LLM supporting all 24 official EU languages, jointly developed by 8 European universities and institutions.
Gemini 2.5 Computer Use model
Google released a specialized model based on Gemini 2.5 Pro that can see computer screens and directly operate mouse/keyboard via API. Outperforms competitors on web/mobile benchmarks with lower latency.
Qwen3-Omni: Native Omni AI model for text, image and video
Alibaba's unified multimodal LLM that processes text, images, video, and audio in a single model.
I replaced Animal Crossing's dialogue with a live LLM by hacking GameCube memory
A project connecting LLM-powered real-time AI dialogue to Animal Crossing NPC conversations on GameCube via shared memory — without modifying a single line of game code. Demonstrates the potential of retro game modding meets LLM NPCs.
LLM Visualization
An interactive website that visualizes the entire process of how Transformer-based LLMs process tokens step by step — understand LLM internals intuitively without code.
Monitor your security cameras with locally processed AI
How to analyze CCTV in real-time with AI on edge devices — no cloud required.
Video Summarization with Large Language Models
Converts video frames to text captions, then has an LLM score their importance for video summarization, achieving SOTA over traditional visual-feature-based approaches.
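A skeleton of the caption-then-score pipeline, with `caption_frame` and `llm_score` as placeholders for the captioning model and the LLM importance judge; only the selection logic is concrete:

```python
# Caption-then-score video summarization skeleton.
# `caption_frame` and `llm_score` are placeholder callables.
def summarize(frames: list, caption_frame, llm_score, budget: int = 10) -> list[int]:
    captions = [caption_frame(f) for f in frames]
    scores = [llm_score(c) for c in captions]       # higher = more important
    ranked = sorted(range(len(frames)), key=lambda i: -scores[i])
    return sorted(ranked[:budget])                  # keep chronological order

# The summary is the `budget` frames whose captions the LLM rated highest.
```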
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Training a GUI automation agent with RL on just 0.02% of the data used by existing SOTA SFT approaches, and beating them.
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
An 8B-scale judge model that scores the accuracy of each solution step in image+text reasoning problems, pluggable into existing models to boost reasoning performance by up to 8.4 points.
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
A plug-and-play inference optimization technique that removes up to 90% of visual tokens in image/video multimodal models while barely losing performance.
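Diversity-based selection is commonly formulated as greedy max-min (farthest-point) sampling; whether this matches DivPrune's exact objective is an assumption, but the sketch shows the core idea of keeping mutually distant tokens rather than individually "important" ones.

```python
# Greedy max-min (farthest-point) selection of a diverse token subset.
# An illustrative formulation, not necessarily DivPrune's exact objective.
import torch

def diverse_subset(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, d). Returns indices of `keep` mutually distant tokens."""
    chosen = [0]
    dists = torch.cdist(tokens[0:1], tokens).squeeze(0)  # distance to chosen set
    for _ in range(keep - 1):
        nxt = int(dists.argmax())                        # farthest from the set
        chosen.append(nxt)
        dists = torch.minimum(dists, torch.cdist(tokens[nxt:nxt + 1], tokens).squeeze(0))
    return torch.tensor(sorted(chosen))

print(diverse_subset(torch.randn(576, 768), keep=58).shape)  # ~90% pruned
```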
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
A benchmark measuring how memory degrades in 20 multimodal AIs including GPT-4o during long conversations — plus simple solutions.
TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents
Splitting the task between two GPT-4 agents, one that first contextualizes the time series as text and one that then predicts from that context, raises F1 by an average of 28.75%.
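A skeleton of the two-stage split, with `llm` standing in for the GPT-4 call; the prompts are illustrative, not the paper's:

```python
# Contextualize-then-predict split across two LLM calls.
# `llm` is a placeholder for the GPT-4 API call.
def predict_event(series: list[float], llm) -> str:
    context = llm(f"Describe the situation implied by these readings: {series}")
    return llm(f"Given this context:\n{context}\nWill the event occur? Answer yes or no.")

# Agent 1 turns raw numbers into a textual situation summary;
# Agent 2 predicts from that summary instead of from raw values.
```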
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
SeeUnsafe: a GPT-4o-based MLLM agent framework that automatically classifies traffic accidents from CCTV footage and identifies involved objects.
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
An iterative frame selection system using GPT-4 as an agent that achieves SOTA on long videos by looking at an average of only 8 frames.
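A sketch of the iterative loop, with `vlm_answer` and `pick_new_frames` as placeholders for the agent's model calls; the key point is that more frames are requested only when the agent is not confident:

```python
# VideoAgent-style iterative frame selection (placeholder model calls).
def video_qa(question: str, num_frames: int, vlm_answer, pick_new_frames,
             max_rounds: int = 3):
    frames = list(range(0, num_frames, max(1, num_frames // 4)))[:4]  # seed frames
    for _ in range(max_rounds):
        answer, confident = vlm_answer(question, frames)
        if confident:
            return answer, frames          # stop early: few frames sufficed
        frames += pick_new_frames(question, frames)  # request targeted extra frames
    return answer, frames
```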