4TB of voice samples just stolen from 40k AI contractors at Mercor
TL;DR Highlight
Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.
Who Should Read
Developers who operate or use AI training data platforms, and backend/security developers whose services rely on voice authentication or biometric data.
Core Mechanics
- On April 4, 2026, the hacking group Lapsus$ exfiltrated approximately 4TB of data from the AI training data platform Mercor and published it on its leak site. More than 40,000 contractors are reported to be affected.
- This breach is particularly dangerous because voice recordings and ID scans (passports/driver's licenses) are linked on the same row within a single database. Previous data breaches typically contained only one or the other, but this incident exposed a 'deepfake-ready kit' combining both.
- The voice recordings of Mercor contractors are script readings recorded in quiet environments, averaging 2-5 minutes in length. According to a WSJ report in February 2026, commercially available voice cloning tools require only 15 seconds of clean speech, far less than the length of each leaked recording.
- Bypassing bank voice authentication is a realistic threat. Some banks in the US and UK still use voice matching as a second factor of authentication, which can be circumvented by reading a challenge phrase with a cloned voice.
- Vishing (voice phishing) attacks that target the HR and finance teams of victims' companies, requesting payroll account changes or wire transfers, have been reported repeatedly. According to the Krebs on Security archive, there have been over 20 confirmed cases since 2023.
- Precedents exist, such as the 2024 Hong Kong Arup incident, where $25 million was stolen using a deepfake video call. The Arup incident used publicly available video and audio, while the Mercor breach exposes studio-quality audio and ID scans, enabling far more precise forgeries.
- Synthesized voice attacks targeting insurance call centers are also surging. A Pindrop report states that synthetic voice attacks on insurance call centers increased by 475% year-over-year in 2025, with auto, life, and disability insurance as primary targets.
Evidence
- Comments pointed out the irony of the original article promoting a free voice analysis service for victims, while the same voice data could be sent to another AI company. The service is operated by ORAVYS, a voice analysis startup, raising suspicions that the article itself is marketing content.
- Many opinions stated that the combination of voice and ID data is fundamentally different from password leaks, framing biometric information as a 'forever password' because passwords can be changed but voices cannot.
- Comments highlighted the problematic practice of centralizing voice biometric data on servers, questioning why browser or on-device processing isn't used in 2026 with tools like Whisper.cpp and WebGPU support, concluding that server-side processing is cheaper but doesn't account for periodic breach costs.
- Several comments referenced the German term 'Datensparsamkeit' ('data minimization'), arguing that the best defense is to avoid collecting data unnecessary for core services.
- Comments criticized the structural problems of AI data collection companies, noting that contractors who label and collect data are the least protected layer of the AI supply chain, making that pipeline an attack surface and labeling it a 'shameful labor issue'.
How to Apply
- If you operate a service that uses voice authentication as a second factor, combine it with liveness detection and challenge-response mechanisms, or add AudioSeal watermarking or AASIST anti-spoofing models to filter out synthetic voice attacks.
- When designing or selecting an AI training data collection pipeline, avoid storing voice recordings and ID scans on the same database row. Keep the two in separate encrypted stores and manage the linking key separately to limit the impact of a breach.
- If you are centralizing voice/biometric data on servers, consider Whisper.cpp or WebGPU-based browser processing. With on-device processing the biometric originals never reach the server, so there is nothing there to exfiltrate.
- If you have worked as a contractor through platforms like Mercor, search for and remove publicly indexed samples of your voice (YouTube, podcasts, Zoom recordings), and replace voice authentication on bank/brokerage accounts with SMS OTP or hardware tokens.
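The advice above about keeping voice recordings and ID scans off the same row can be sketched roughly as follows. This is a minimal illustration, not a description of how Mercor or any real platform stores data. The names `derive_record_id` and `SplitBiometricStore` are hypothetical, the two dictionaries stand in for two physically separate storage backends, and encryption at rest is omitted (the blobs are assumed to be encrypted before they are passed in). Each store is indexed by a different HMAC-derived ID, so someone who dumps both stores but not the linking key cannot pair a voice with an ID document by matching keys.

```python
import hashlib
import hmac
import secrets


def derive_record_id(link_key: bytes, store_name: str, contractor_id: str) -> str:
    """Derive a per-store pseudonymous ID; re-joining stores requires link_key."""
    message = f"{store_name}:{contractor_id}".encode("utf-8")
    return hmac.new(link_key, message, hashlib.sha256).hexdigest()


class SplitBiometricStore:
    """Hypothetical store that keeps voice and ID scans in separate backends."""

    def __init__(self, link_key: bytes) -> None:
        # Assumption: in a real deployment the linking key would live in a
        # separate secret manager or HSM, not next to either data store.
        self._link_key = link_key
        # Stand-ins for two independent storage systems.
        self.voice_store = {}
        self.id_scan_store = {}

    def put(self, contractor_id: str, voice_blob: bytes, id_scan_blob: bytes) -> None:
        """Write each artifact under a different, unlinkable record ID."""
        voice_id = derive_record_id(self._link_key, "voice", contractor_id)
        scan_id = derive_record_id(self._link_key, "id_scan", contractor_id)
        self.voice_store[voice_id] = voice_blob
        self.id_scan_store[scan_id] = id_scan_blob

    def get(self, contractor_id: str) -> tuple:
        """Re-link both artifacts; only possible with access to the linking key."""
        voice_id = derive_record_id(self._link_key, "voice", contractor_id)
        scan_id = derive_record_id(self._link_key, "id_scan", contractor_id)
        return self.voice_store[voice_id], self.id_scan_store[scan_id]


if __name__ == "__main__":
    store = SplitBiometricStore(link_key=secrets.token_bytes(32))
    store.put("contractor-001", b"<encrypted voice>", b"<encrypted id scan>")
    # The record IDs in the two stores share no common value.
    print(set(store.voice_store) & set(store.id_scan_store))
```

The split only helps if the linking key is held and access-controlled separately from both stores. If an attacker obtains it together with the two dumps, the records can be re-joined.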
Related Papers
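MTG Bench: Testing how well LLMs can play Magic
An original benchmark that measures LLMs' ability to reason over complex rules through how well they follow the rules of the card game MTG; gpt-5.5 took first place with a score of 95.4.
ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
A method that restores LLM safety broken by domain fine-tuning by borrowing it from a small safety model at inference time, without retraining.
The iPad was on Tailscale: a WebRTC debugging story
A two-week debugging story tracing a rare bug in which only an iPad failed to receive responses over a WebRTC data channel; it turned out to be a compound bug caused by a hardcoded MTU constant in webrtc-rs and Tailscale dropping IPv6 Fragment packets at the same time.
Can LLMs Beat Classical Hyperparameter Optimization Algorithms?
A study that directly compares LLM-based hyperparameter optimization agents with classical algorithms such as CMA-ES and TPE, concluding that LLMs alone do not beat the classical methods, but 'Centaur', a hybrid combining the two, achieves the best performance.
What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks
Typographic tricks such as bold, highlighting, and whitespace placement can bypass 10 content moderation systems, including GPT-4o and Llama Guard, more than 99% of the time.
Did Claude increase bugs in rsync?
A post that tests the social media claim that bugs increased after Claude AI was introduced to the rsync project, using real data and statistical analysis; it concludes there is no statistical evidence that releases after Claude's introduction are unusually buggy relative to the historical distribution.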