Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants
TL;DR Highlight
A 58-person user study finds that LLM code assistants like GitHub Copilot do not increase the rate of severe security bugs by more than 10% compared with unassisted programmers.
Who Should Read
Dev team leads or security engineers evaluating whether to adopt AI coding assistants like GitHub Copilot or Cursor. Especially relevant for teams assessing security risks in low-level code (C/C++).
Core Mechanics
- Shows via a non-inferiority test (p=0.04) that the AI-assisted group (using an assistant built on OpenAI code-cushman-001) produced severe security bugs at a rate no more than 10% above the unassisted control group
- The AI-assisted group wrote an average of 280.9 LoC vs the control group's 247.5 LoC — a modest productivity gain
- 63% of the bugs found came from human-written code, 16% from LLM suggestions accepted as-is, and 20% from suggestions the user modified
- The most common vulnerability was CWE-476 (NULL Pointer Dereference), occurring at similar rates in both groups — the AI did not inject distinctively dangerous patterns
- Using the LLM as a fully automated 'autopilot' yielded a higher feature-completion rate but a lower unit-test pass rate than human programmers — human judgment still matters
- Participants who most often accepted buggy suggestions had the most bugs in their final code — automation bias (uncritically accepting AI output) is a real risk
Evidence
- Non-inferiority test (margin δ=10%) significant at p=0.04 — the AI group's severe-bug rate is within 10 percentage points of the control's
- The AI group's average CWEs/LoC was up to 22% lower than the control's (severe CWEs, measured on test-passing code)
- Of 564 vulnerabilities in total: 356 (63%) came from human-written code, 92 (16%) from LLM suggestions accepted as-is, and 113 (20%) from modified suggestions
- N=58 participants (29 control, 29 AI-assisted); observed bug density of 0.17 bugs/LoC — higher than the cited industry average of 0.07 but plausible given the time constraints
How to Apply
- Rather than rejecting AI coding assistants over security concerns, adopt them while keeping code-review processes in place — productivity and security can improve together
- When accepting LLM suggestions in C/C++ or other manually memory-managed languages, build a review checklist covering NULL-pointer checks, sprintf→snprintf substitution, and strdup vs. raw pointer copy
- Since the AI will sometimes make buggy suggestions, users who review them critically end up with fewer bugs — add an explicit 'never blindly accept AI suggestions' rule to team training
Code Example
// C code security checklist when accepting AI suggestions
// ❌ Dangerous patterns (bugs LLMs frequently suggest)
node->item_name = item_name; // CWE-416: Pointer copy → Use-After-Free risk
sprintf(str, "%d * %s @ $%.2f", qty, name, price); // CWE-787: Buffer overflow
// ✅ Fixed with safe patterns
node->item_name = strdup(item_name); // Clarify ownership with string copy
snprintf(str, MAX_ITEM_PRINT_LEN, "%d * %s @ $%.2f", qty, name, price); // Length limit
// NULL pointer check (CWE-476) - frequently omitted by LLMs
if (head == NULL || *head == NULL) return EXIT_FAILURE;
if (str == NULL) return EXIT_FAILURE;
Original Abstract
Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. Understanding the impact of these tools on developers' code is paramount, especially as recent work showed that LLMs may suggest cybersecurity vulnerabilities. We conduct a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Given the potential severity of low-level bugs as well as their relative frequency in real-world projects, we tasked participants with implementing a singly-linked 'shopping list' structure in C. Our results indicate that the security impact in this setting (low-level C with pointer and array manipulations) is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.