EZCollegeApp: An LLM-Based Assistant for US College Applications

Large Language Models for Assisting American College Applications

Jan 23, 2026 • Zheng Liu, Weihang You, Peng Shu +14

TL;DR Highlight

A practical LLM system architecture paper: it combines RAG with a human-in-the-loop design to help students complete US college application forms.

Who Should Read

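Backend and full-stack developers building LLM assistants in high-stakes domains (finance, healthcare, legal). Especially useful for developers working on source-reliability management and hallucination prevention in RAG pipelines.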

Core Mechanics

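mapping-first paradigm: Instead of generating an answer directly from a form question, the system first maps the question to a field in a canonical schema and then generates the answer, which keeps answers consistent across portals.
source-separated RAG: Official sites, curated FAQs, and community forums are kept in separate indexes, and retrieval is tiered in the order official > FAQ > community.
strict human-in-the-loop: The system never submits automatically. The AI only suggests, and nothing is applied until the user copies and pastes it in. This is a deliberate "intentional friction" design.
agentic retrieval: Rather than searching on every query, the LLM calls the search_knowledge_base tool through function calling only when it needs to.
structure-aware retrieval: Without vector embeddings, the system identifies each document's tree structure (table of contents, section boundaries), and the LLM navigates the tree to extract the relevant sections. No embedding infrastructure is required.
extraction over generation principle: The system refuses to generate essays. It only extracts and recombines information from uploaded documents, which both prevents AI misuse and protects academic integrity.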
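
The structure-aware retrieval item above can be illustrated with a small sketch. The `Section` type, the `navigate` function, and the sample document are hypothetical and not taken from the paper. In the system described, an LLM decides which branch to follow; the keyword-overlap `relevance` scorer below is only a runnable stand-in for that decision.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of a document tree (hypothetical structure)."""
    title: str
    text: str = ""
    children: list["Section"] = field(default_factory=list)


def relevance(query: str, section: Section) -> int:
    """Stand-in for the LLM's choice: count title words that appear in the query."""
    words = set(query.lower().split())
    return sum(1 for w in section.title.lower().split() if w in words)


def navigate(root: Section, query: str) -> Section:
    """Walk down the tree, following the best-matching child at each level."""
    node = root
    while node.children:
        best = max(node.children, key=lambda child: relevance(query, child))
        if relevance(query, best) == 0:
            break  # no child looks relevant, so stop at the current section
        node = best
    return node


# Placeholder document tree; titles and text are illustrative only.
doc = Section("Admissions Guide", children=[
    Section("Application Deadlines", children=[
        Section("Early Decision Deadlines", text="(section text)"),
        Section("Regular Decision Deadlines", text="(section text)"),
    ]),
    Section("Financial Aid", text="(section text)"),
])

hit = navigate(doc, "early decision deadlines")
print(hit.title)
```

Because the walk only needs titles and section boundaries, no vector index has to be built or kept in sync with the documents, which is the trade-off the item describes.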

Evidence

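84% fill rate on a general application (answers auto-generated for 42 of 50 questions)
92% citation validity: citations in generated answers had a cosine similarity of at least 0.7 with the actual source
82% of human evaluators rated the answers "useful" or "very useful" (4 or 5 on a 5-point scale)
89% of evaluators said the citation mechanism increased their trust; editing a suggested answer took 15 to 20 seconds, versus 2 to 3 minutes to write one from scratch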
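
The fill-rate and citation-validity figures above reduce to a ratio and a similarity threshold. The following is a minimal sketch, assuming the answer and source texts have already been embedded as vectors; the function names and the toy two-dimensional vectors are illustrative and do not come from the paper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def citation_is_valid(answer_vec: list[float], source_vec: list[float],
                      threshold: float = 0.7) -> bool:
    """A citation counts as valid when similarity reaches the threshold."""
    return cosine_similarity(answer_vec, source_vec) >= threshold


# Fill rate as reported: 42 answered questions out of 50.
fill_rate = 42 / 50

# Toy (answer, source) vector pairs; real embeddings would have many dimensions.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
    ([1.0, 1.0], [1.0, 0.0]),  # 45 degrees apart
]
valid = [citation_is_valid(a, s) for a, s in pairs]
validity_rate = sum(valid) / len(valid)
```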

How to Apply

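When indexing RAG sources separately: split official documents, FAQs, and community content into separate collections, then re-rank results by source-type metadata at retrieval time so that answers favor the most trustworthy sources.
When designing a form or document automation pipeline: do not feed the question straight to the LLM. First map it to a canonical field (e.g., user.academics.gpa), then build the query together with type and format constraints. This improves consistency across forms and adherence to the required output format.
To cut retrieval cost in an LLM agent: do not run RAG automatically on every query. Register search_knowledge_base as a function tool so the LLM calls it only when needed, which reduces both context consumption and latency.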
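
The mapping-first advice above can be sketched as follows. Only the field name `user.academics.gpa` comes from the summary; `CanonicalField`, `QUESTION_TO_FIELD`, the question wordings, and the format hint are hypothetical. A dictionary lookup stands in for the mapping step, which a real pipeline would more likely delegate to a model or a classifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanonicalField:
    path: str         # canonical field name, e.g. user.academics.gpa
    value_type: str   # expected type of the answer
    format_hint: str  # format constraint passed along with the query


# Illustrative constraint; a real schema would define its own.
GPA = CanonicalField("user.academics.gpa", "number", "two decimal places")

# Hypothetical mapping from portal-specific wording to canonical fields.
QUESTION_TO_FIELD = {
    "what is your cumulative gpa?": GPA,
    "cumulative grade point average": GPA,
}


def map_question(question: str) -> Optional[CanonicalField]:
    """Step 1: resolve the form question to a canonical field, if known."""
    return QUESTION_TO_FIELD.get(question.strip().lower())


def build_query(question: str) -> str:
    """Step 2: build the LLM query from the field and its constraints."""
    target = map_question(question)
    if target is None:
        return f"Unmapped question, needs human review: {question}"
    return (f"Fill field {target.path} "
            f"(type: {target.value_type}; format: {target.format_hint})")


print(build_query("What is your cumulative GPA?"))
print(build_query("Cumulative Grade Point Average"))
```

Both portal wordings resolve to the same field, so the downstream query and its constraints are identical regardless of how a given form phrases the question.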

Code Example

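# Agentic retrieval pattern example (OpenAI function calling)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Assumed to exist elsewhere: `messages` (the chat history list) and
# `execute_search` (the retrieval backend, returning a string).
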
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search the indexed admissions documents or student profile for relevant information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"},
                    "source_type": {
                        "type": "string",
                        "enum": ["official", "faq", "community", "personal"],
                        "description": "Preferred source type (official > faq > community)"
                    },
                    "top_k": {"type": "integer", "default": 10}
                },
                "required": ["query"]
            }
        }
    }
]

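# The LLM calls the search tool only when it needs to
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
    tool_choice="auto",  # let the LLM decide whether a search is needed
)

# If the model requested tool calls, run the searches and call the model again
assistant_message = response.choices[0].message
if assistant_message.tool_calls:
    # The assistant turn that contains the tool calls must precede the tool results
    messages.append(assistant_message)
    for tool_call in assistant_message.tool_calls:
        results = execute_search(tool_call)
        # Each tool message is tied back to its request through tool_call_id
        messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": results})
    final_response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)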
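
The snippet above calls `execute_search` without defining it. One possible shape for that helper is sketched here, assuming an in-memory `INDEXES` dictionary and plain substring matching, both of which are hypothetical stand-ins for a real retrieval backend. The only API detail relied on is that a tool call carries its arguments as a JSON string in `tool_call.function.arguments`; a `SimpleNamespace` imitates that object so the sketch runs without a network call.

```python
import json
from types import SimpleNamespace

# Hypothetical in-memory indexes, one list of passages per source type.
INDEXES = {
    "official": ["Placeholder passage from an official admissions page about deadlines"],
    "faq": ["Placeholder FAQ entry about deadlines"],
    "community": ["Placeholder forum post about essays"],
    "personal": ["Placeholder line from an uploaded transcript"],
}


def execute_search(tool_call) -> str:
    """Run the search requested by one tool call and return the hits as JSON text."""
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    query = args["query"].lower()
    source_type = args.get("source_type")
    top_k = args.get("top_k", 10)

    # Search only the requested source type if one was given, otherwise every index.
    sources = [source_type] if source_type in INDEXES else list(INDEXES)
    hits = [
        {"source_type": source, "text": passage}
        for source in sources
        for passage in INDEXES[source]
        if query in passage.lower()
    ]
    return json.dumps(hits[:top_k])


# Stand-in for the tool call object returned by the chat completions API.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(
        name="search_knowledge_base",
        arguments=json.dumps({"query": "deadlines", "source_type": "faq"}),
    ),
)
print(execute_search(fake_call))
```

Returning a JSON string keeps the result directly usable as the `content` of a tool message.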

Terminology

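RAG: An approach in which the LLM retrieves external documents and uses them as grounding when it answers. It is like looking something up in a library before answering: the model consults material at query time instead of relying only on what it memorized during training.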

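Human-in-the-loop: A design pattern in which the AI never acts on its own and a person must always confirm. Instead of making an automatic transfer, it asks "Transfer this amount? Press the confirm button."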
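
As a rough illustration of this pattern, the sketch below keeps the AI's proposal and the value that will actually be submitted in separate attributes, so only an explicit user action can change the latter. `FormField`, `propose`, `user_paste`, and the sample value are hypothetical and only model the copy-and-paste step described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormField:
    name: str
    value: str = ""       # what will actually be submitted; written only by the user
    suggestion: str = ""  # AI proposal displayed beside the field


def propose(form_field: FormField, text: str) -> None:
    """AI side: attach a suggestion without touching the submitted value."""
    form_field.suggestion = text


def user_paste(form_field: FormField, edited: Optional[str] = None) -> None:
    """User side: copy the suggestion (optionally edited) into the field."""
    form_field.value = edited if edited is not None else form_field.suggestion


gpa = FormField(name="user.academics.gpa")
propose(gpa, "3.8")  # placeholder value for illustration
value_before_user_action = gpa.value  # still empty: proposing changes nothing
user_paste(gpa)
```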

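canonical schema: An internal standard that converts data arriving in different formats into one common form. Even though every college's application differs, the system manages them internally under unified field names such as user.academics.gpa.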

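hallucination: The phenomenon in which an LLM confidently outputs plausible but false information with no grounding. It is like an overconfident person who answers as if they know when they do not.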

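agentic retrieval: An approach in which the AI does not search automatically for every question but decides for itself and calls the tool only when a search is needed. People likewise do not google every question, only the ones they cannot answer.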

vector embedding: A technique that converts text into arrays of numbers (vectors) so that texts with similar meaning sit close together. "Apple" and "fruit" are close; "apple" and "car" are far apart.

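tiered retrieval: An approach that searches multiple sources in priority order of trustworthiness. If the official site has no answer, check the FAQ; if that fails too, look at community posts, escalating step by step.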
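
The escalation described in this entry can be sketched in a few lines. The tier order follows the summary (official, then FAQ, then community); `keyword_search`, `tiered_search`, and the placeholder passages are hypothetical, and a real system would query its own per-source indexes instead of doing substring matching.

```python
from typing import Optional

TIER_ORDER = ["official", "faq", "community"]  # most trusted first


def keyword_search(index: list[str], query: str) -> list[str]:
    """Toy search: return passages that contain every query word."""
    words = query.lower().split()
    return [doc for doc in index if all(w in doc.lower() for w in words)]


def tiered_search(indexes: dict[str, list[str]],
                  query: str) -> tuple[Optional[str], list[str]]:
    """Return hits from the most trusted tier that has any, plus that tier's name."""
    for tier in TIER_ORDER:
        hits = keyword_search(indexes.get(tier, []), query)
        if hits:
            return tier, hits  # stop escalating at the first tier with results
    return None, []


# Placeholder passages, one separate index per source type.
indexes = {
    "official": ["Placeholder official page about application fees"],
    "faq": ["Placeholder FAQ entry about fee waivers"],
    "community": ["Placeholder forum post about fee waivers and essays"],
}

tier, hits = tiered_search(indexes, "fee waivers")
print(tier, hits)
```

Returning the tier name alongside the hits lets the caller show users how trustworthy the cited source is.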

Original Abstract

American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (https://github.com/ezcollegeapp-public/ezcollegeapp-public) to facilitate the broader impact of this work.