GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers
TL;DR Highlight
GPTZero scanned 4,841 accepted NeurIPS 2025 papers and found 100 fabricated citations (hallucinated references) spread across 53 of them: a serious academic integrity issue.
Who Should Read
Academic researchers, conference organizers, and anyone evaluating whether AI-generated content in scholarly work is detectable and problematic.
Core Mechanics
- GPTZero's AI detection tool scanned the entire NeurIPS 2025 accepted-paper corpus (4,841 papers) and flagged 53 papers that together contain 100 citations that appear to be hallucinated.
- Hallucinated citations are plausible-sounding but nonexistent references — they often have realistic author names, paper titles, and venues but don't correspond to real publications.
- This is a different problem from AI-generated text detection — it's specifically about fabricated scholarly references, which can cascade through the literature when others cite the citing paper.
- Both figures are likely lower bounds: an automated scan can only flag references that clearly match no real publication, so borderline fabrications go uncounted.
- NeurIPS 2025 acceptance rate is around 25% — if accepted papers have this issue, rejected papers likely have higher rates.
- The academic community has no established process for systematically checking citations for hallucination at scale, making this a systemic gap; a sketch of what such a check could look like follows this list.
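A minimal sketch of such a batch check, assuming the public Semantic Scholar search API and a conservative one-request-per-second throttle (illustrative only, not GPTZero's actual method):

import time
import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def batch_verify(titles: list[str], delay: float = 1.0) -> dict[str, bool]:
    """Map each cited title to whether any matching paper turned up."""
    found = {}
    for title in titles:
        resp = requests.get(SEARCH_URL, params={"query": title, "limit": 1}, timeout=10)
        resp.raise_for_status()
        found[title] = resp.json().get("total", 0) > 0
        time.sleep(delay)  # throttle: the unauthenticated API is rate-limited
    return found

A fuzzy-search hit is only a weak signal that a reference is real; a stricter title comparison is sketched under Code Example below.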
Evidence
- GPTZero published their methodology and a list of flagged paper IDs, enabling community verification.
- Several researchers independently verified a sample of flagged citations and confirmed the hallucination pattern.
- The HN discussion was alarmed: academic citation networks are a foundational trust mechanism, and systematic hallucination corrupts that infrastructure.
- Commenters debated whether the authors knew (intentional misconduct) or didn't (accidentally pasted in AI-generated reference lists without checking them); both are problematic for different reasons.
- Conference organizers don't have the capacity to manually verify all citations — this points to a need for automated citation verification at submission time.
How to Apply
- If you're writing academic papers with any AI assistance: run every reference through a citation verifier (Semantic Scholar, CrossRef, Google Scholar) before submission; the Code Example section below sketches checks against both Semantic Scholar and CrossRef.
- For reviewers: spot-check 5-10 citations in every paper you review — hallucinated references are often in the related work section and may not be obvious.
- Conference organizers: consider adding automated citation verification as part of the submission pipeline — tools like GPTZero and Semantic Scholar can flag suspicious references.
- For research teams: establish a policy that every reference must be independently verified before inclusion, regardless of how the draft was generated.
Code Example
# Example of verifying that a cited paper exists via the Semantic Scholar API
import requests

def verify_citation(title: str) -> bool:
    """Return True if Semantic Scholar finds at least one paper matching the title."""
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    resp = requests.get(url, params={"query": title, "limit": 1}, timeout=10)
    resp.raise_for_status()  # surface HTTP errors (e.g., rate limiting) instead of bad JSON
    # The search is fuzzy, so a nonzero match count is only a weak signal
    return resp.json().get("total", 0) > 0

# Usage
print(verify_citation("Attention Is All You Need"))    # True
print(verify_citation("Fake Paper by John Doe 2024"))  # likely False
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically tests LLM-written TLA+ specifications: they mostly pass syntax checks, but their behavioral conformance to the actual system is only around 46%, showing the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric activation vectors inside an LLM into natural language that can be read directly. It marks a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark that measures whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model passed 95%+ of tests on only 3% of all tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split the work into three tickets and even Claude/GPT will go ahead and write security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance differences that schema compliance alone doesn't capture.