Synthesizing scientific literature with retrieval-augmented language models
TL;DR Highlight
A RAG-based scientific literature synthesis model that searches 45 million open-access papers and attaches citation sources.
Who Should Read
Researchers, academics, and R&D teams who need to synthesize large bodies of scientific literature quickly with verifiable citations.
Core Mechanics
- Indexes 45 million open-access papers and enables semantic search across the corpus
- Generates synthesized summaries grounded in retrieved papers with inline citations
- Outperforms general-purpose LLMs on scientific QA tasks where citations are required
- Retrieval pipeline uses dense embeddings + sparse BM25 hybrid search for high recall
- Citation accuracy (correctly attributing claims to the right papers) significantly higher than baseline RAG
Evidence
- Evaluated on scientific QA benchmarks; outperforms GPT-4 + web search on citation accuracy
- 45M paper corpus covers most major open-access repositories (arXiv, PubMed, Semantic Scholar, etc.)
- Human evaluation confirms synthesized summaries are more factually grounded than uncited LLM outputs
How to Apply
- Use this system (or similar RAG pipelines) when you need literature-backed answers rather than LLM hallucinations about research.
- For your own RAG pipeline over scientific corpora, implement hybrid retrieval (dense + BM25) to improve recall on rare terms.
- Always surface citation links to users — grounding claims in actual papers dramatically improves trust and verifiability.
Code Example
# OpenScholar self-feedback loop conceptual prompt example
system_prompt = """
You are a scientific literature synthesis assistant.
Given retrieved passages with citation keys, write a factual answer.
After drafting, review each claim and verify it is directly supported
by at least one cited passage. Remove or correct any unsupported claims.
"""
user_prompt = """
Query: {user_question}
Retrieved passages:
[1] {passage_1} (Source: {paper_1_title}, {paper_1_year})
[2] {passage_2} (Source: {paper_2_title}, {paper_2_year})
...
Step 1: Draft a synthesis answer with inline citations [1], [2], ...
Step 2: Self-check — does every sentence have a supporting citation?
If not, revise or remove that sentence.
Step 3: Output the final answer.
"""Terminology
Related Papers
Show HN: Bible as RAG Database
성경 전체를 RAG(검색 증강 생성) 데이터베이스로 인덱싱해 주제나 키워드로 관련 성경 구절을 의미론적으로 검색할 수 있는 웹 서비스다. 종교 텍스트에 RAG를 적용한 실용적 예시로, 유사한 프로젝트를 만들려는 개발자에게 참고가 된다.
Haystack: Open-Source AI Framework for Production Ready Agents, RAG
deepset이 만든 오픈소스 AI 오케스트레이션 프레임워크로, LangChain의 대안으로 주목받고 있으며 모듈형 파이프라인 방식으로 RAG·Agent·멀티모달 앱을 프로덕션까지 구축할 수 있다.
We built a persistent agent memory layer on Elasticsearch with 0.89 recall
AI 에이전트가 세션이 끝나도 사용자 정보를 기억할 수 있도록 Elasticsearch 위에 구축한 멀티테넌트 장기 메모리 시스템 아키텍처 공개. 168개 질문 기준 R@10 0.89, 테넌트 간 데이터 누출 0건을 달성한 구체적인 구현 방법을 담았다.
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
LLM이 SQL 생성 실패에서 배운 힌트를 재사용 가능한 Hint Bank로 쌓아, 모델 재학습 없이 Snowflake 방언 SQL 정확도를 대폭 끌어올리는 시스템.
Inside FAISS: Billion-Scale Similarity Search
FAISS가 수십억 개 벡터를 빠르게 검색하는 핵심 알고리즘인 IVF(파티셔닝)와 Product Quantization(압축)을 시각적으로 설명한 글로, RAG나 벡터 검색 시스템을 구축하는 개발자에게 내부 동작 원리를 이해시켜 준다.
Show HN: Airbyte Agents – context for agents across multiple data sources
Airbyte가 Slack, Salesforce, Linear 등 여러 SaaS 시스템의 데이터를 미리 인덱싱해서 Agent가 API를 일일이 뒤지지 않아도 되는 Context Store를 출시했다. 기존 MCP 방식보다 토큰을 최대 90%까지 줄이는 효과를 확인했다.
Related Resources
Original Abstract (Expand)
Scientific progress depends on the ability of researchers to synthesize the growing body of literature. Can large language models (LLMs) assist scientists in this task? Here we introduce OpenScholar, a specialized retrieval-augmented language model (LM)1 that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience and biomedicine. Despite being a smaller open model, OpenScholar-8B outperforms GPT-4o by 6.1% and PaperQA2 by 5.5% in correctness on a challenging multi-paper synthesis task from the new ScholarQABench. Although GPT-4o hallucinates citations 78–90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar’s data store, retriever and self-feedback inference loop improve off-the-shelf LMs: for instance, OpenScholar-GPT-4o improves the correctness of GPT-4o by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT-4o responses over expert-written ones 51% and 70% of the time, respectively, compared with 32% for GPT-4o. We open-source all artefacts, including our code, models, data store, datasets and a public demo. A specialized, open-source, retrieval-augmented language model is introduced for answering scientific queries and synthesizing literature, the responses of which are shown to be preferred by human evaluations over expert-written answers.