4TB of voice samples just stolen from 40k AI contractors at Mercor

TL;DR Highlight

Mercor data breach exposes voice recordings and ID scans of 40,000 contractors, fueling deepfake and voice fraud risks.

Who Should Read

Developers who operate or use AI training data platforms, and backend or security engineers whose services rely on voice authentication or biometric data.

Core Mechanics

On April 4, 2026, the hacking group Lapsus$ exfiltrated approximately 4TB of data from the AI training data platform Mercor and published it on its leak site. More than 40,000 contractors are reported to be affected.
This breach is particularly dangerous because voice recordings and ID scans (passports/driver's licenses) are linked on the same row within a single database. Previous data breaches typically contained only one or the other, but this incident exposed a 'deepfake-ready kit' combining both.
The voice recordings of Mercor contractors are script readings recorded in quiet environments, averaging 2-5 minutes in length. According to a WSJ report in February 2026, commercially available voice cloning tools require only 15 seconds of clean speech, so a single leaked recording contains roughly 8 to 20 times the audio needed.
Bypassing bank voice authentication is a realistic threat. Some banks in the US and UK still use voice matching as a second factor of authentication, which can be circumvented by reading a challenge phrase with a cloned voice.
Vishing (voice phishing) attacks, in which a caller impersonates an employee to the HR or finance team of the victim’s company and requests a payroll change or a wire transfer, have occurred multiple times. According to the Krebs on Security archive, there have been over 20 confirmed cases since 2023.
Precedents exist, such as the 2024 Hong Kong Arup incident, where $25 million was stolen using a deepfake video call. The Arup incident used publicly available video and audio, while the Mercor breach exposes studio-quality audio and ID scans, enabling far more precise forgeries.
Synthesized voice attacks targeting insurance call centers are also surging. A Pindrop report states that synthetic voice attacks on insurance call centers increased by 475% year-over-year in 2025, with auto, life, and disability insurance as primary targets.
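
For the voice-matching case described above, a randomized, single-use, short-lived challenge phrase at least rules out replaying a leaked recording, although it does not stop a cloned voice that synthesizes the phrase on demand, so it would still need to be paired with liveness or anti-spoofing checks. A minimal Python sketch, assuming a hypothetical word list, an in-memory pending-challenge store, and a transcript produced by some external speech-to-text step (none of these come from the original report):

```python
import hmac
import secrets
import time
from typing import Dict, Optional, Tuple

# Hypothetical word list; a real deployment would use a much larger one.
WORDS = ["amber", "canyon", "drift", "ember", "falcon", "granite", "harbor", "juniper"]
CHALLENGE_TTL_SECONDS = 30.0  # assumed lifetime of a challenge

# session_id -> (expected phrase, expiry time); in-memory for illustration only.
_pending: Dict[str, Tuple[str, float]] = {}


def issue_challenge(session_id: str, n_words: int = 4, now: Optional[float] = None) -> str:
    """Create a random phrase the caller must speak before it expires."""
    now = time.monotonic() if now is None else now
    phrase = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    _pending[session_id] = (phrase, now + CHALLENGE_TTL_SECONDS)
    return phrase


def check_response(session_id: str, transcript: str, now: Optional[float] = None) -> bool:
    """Accept the transcript only once, only before expiry, only on an exact match."""
    now = time.monotonic() if now is None else now
    entry = _pending.pop(session_id, None)  # pop makes the challenge single-use
    if entry is None:
        return False
    phrase, expires_at = entry
    if now > expires_at:
        return False
    normalized = " ".join(transcript.lower().split())
    return hmac.compare_digest(normalized.encode("utf-8"), phrase.encode("utf-8"))
```

Because the phrase is unpredictable and consumed on first use, a pre-recorded script reading cannot satisfy it. The voice-match and anti-spoofing decisions would sit alongside this check rather than be replaced by it.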

Evidence

Comments pointed out the irony that the original article promotes a free voice analysis service for victims, even though using it means sending the same voice data to yet another AI company. The service is operated by ORAVYS, a voice analysis startup, which raised suspicions that the article itself is marketing content.
Many commenters argued that the combination of voice and ID data is fundamentally different from a password leak, describing biometric information as a 'forever password': passwords can be changed, but voices cannot.
Comments criticized the practice of centralizing voice biometric data on servers and asked why browser or on-device processing is not used in 2026, given tools like Whisper.cpp and WebGPU support. Their conclusion was that server-side processing looks cheaper only because the cost of periodic breaches is not counted.
Several comments referenced the German term 'Datensparsamkeit' ('data minimization'), arguing that the best defense is to avoid collecting data that is unnecessary for the core service.
Comments also criticized the structural problems of AI data collection companies: the contractors who label and collect data are the least protected layer of the AI supply chain, which makes that pipeline an attack surface. Commenters called this a 'shameful labor issue'.

How to Apply

If you operate a service that uses voice authentication as a second factor, combine it with liveness detection and challenge-response mechanisms, or add AudioSeal watermark detection or an AASIST anti-spoofing model to filter out synthetic voice attacks.
When designing or selecting an AI training data collection pipeline, avoid storing voice recordings and ID scans on the same database row. Keep them in separate encrypted stores and manage the linking key separately to limit the impact of a breach.
If you are centralizing voice or biometric data on servers, consider Whisper.cpp or WebGPU-based in-browser processing. On-device processing keeps biometric originals off the server, so there is nothing there to exfiltrate.
If you have worked as a contractor through platforms like Mercor, search for and delete publicly indexed samples of your voice (YouTube, podcasts, Zoom recordings), and replace voice authentication on bank and brokerage accounts with SMS OTP or hardware tokens.
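
The storage separation recommended above can be sketched as follows. This is a minimal illustration, not Mercor's or any vendor's actual design: it assumes the blobs are already encrypted, that `linking_key` is held in a separate key management system, and that the two dictionaries stand in for independently hosted stores. All function and variable names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Tuple

# Hypothetical stand-ins for two separately encrypted, separately hosted stores.
voice_store: Dict[str, bytes] = {}
id_scan_store: Dict[str, bytes] = {}


def link_token(linking_key: bytes, contractor_id: str, store_name: str) -> str:
    """Derive a per-store pseudonym that cannot be recomputed without linking_key."""
    message = f"{store_name}:{contractor_id}".encode("utf-8")
    return hmac.new(linking_key, message, hashlib.sha256).hexdigest()


def enroll(linking_key: bytes, contractor_id: str,
           encrypted_voice: bytes, encrypted_id_scan: bytes) -> None:
    """Write each artifact to its own store under a different, unlinkable key."""
    voice_store[link_token(linking_key, contractor_id, "voice")] = encrypted_voice
    id_scan_store[link_token(linking_key, contractor_id, "id-scan")] = encrypted_id_scan


def fetch_pair(linking_key: bytes, contractor_id: str) -> Tuple[bytes, bytes]:
    """Rejoin the two records; only possible for a caller holding linking_key."""
    return (
        voice_store[link_token(linking_key, contractor_id, "voice")],
        id_scan_store[link_token(linking_key, contractor_id, "id-scan")],
    )
```

With this layout, an attacker who copies both stores but not the linking key holds voice samples and ID scans with no shared identifier to pair them. The benefit holds only if neither store carries other joinable fields such as names, emails, or upload timestamps.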

Terminology

Lapsus$: A cybercrime/extortion group known for hacking into large corporations such as Microsoft, Nvidia, and Uber. They use stolen data as leverage for blackmail or direct publication on leak sites.

Voice Cloning: A technology that generates a voice mimicking a specific person's tone, accent, and timbre by learning from a short audio sample. Implementation is possible with commercially available tools using approximately 15 seconds of clean audio.

AudioSeal: A voice watermarking technology developed by Meta that embeds source information into synthesized audio in an inaudible manner, enabling detection of AI generation.

AASIST: A deep learning-based anti-spoofing model for voice spoofing detection. It is used to distinguish between synthesized and real human voices.

Vishing: A portmanteau of Voice + Phishing. A social engineering attack that uses phone calls to impersonate a trustworthy individual and steal money or information.

Datensparsamkeit: German for 'data minimization'. A principle advocating for collecting and storing only the minimum amount of data necessary for service operation, viewing unnecessary data collection as a security risk.