Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants
TL;DR Highlight
A 58-person user study finds that LLM code assistants like GitHub Copilot do not increase the rate of severe security bugs by more than 10% compared with unassisted programmers.
Who Should Read
Dev team leads or security engineers evaluating whether to adopt AI coding assistants like GitHub Copilot or Cursor. Especially relevant for teams assessing security risks in low-level code (C/C++).
Core Mechanics
- Shows via a non-inferiority test (p=0.04) that the AI-assisted group (using an assistant built on OpenAI code-cushman-001) produced severe security bugs at a rate no more than 10% above the unassisted control group
- The AI-assisted group wrote an average of 280.9 LoC vs the control group's 247.5 LoC — a modest productivity gain
- 63% of the bugs found came from human-written code, 16% from LLM suggestions accepted as-is, and 20% from suggestions the user modified
- The most common vulnerability was CWE-476 (NULL Pointer Dereference), occurring at similar rates in both groups — the AI did not inject distinctively dangerous patterns
- Using the LLM as a fully automated 'autopilot' yielded a higher feature-completion rate but a lower unit-test pass rate than human programmers — human judgment still matters
- Participants who most often accepted buggy suggestions had the most bugs in their final code — automation bias (uncritically accepting AI output) is a real risk
Evidence
- Non-inferiority test (margin δ=10%) significant at p=0.04 — the AI group's severe-bug rate is within 10 percentage points of the control's
- The AI group's average CWEs/LoC was up to 22% lower than the control's (severe CWEs, measured on test-passing code)
- Of 564 vulnerabilities in total: 356 (63%) came from human-written code, 92 (16%) from LLM suggestions accepted as-is, and 113 (20%) from modified suggestions
- N=58 participants (29 control, 29 AI-assisted); observed bug density of 0.17 bugs/LoC — higher than the cited industry average of 0.07 but plausible given the time constraints
How to Apply
- Rather than rejecting AI coding assistants over security concerns, adopt them while keeping code-review processes in place — productivity and security can improve together
- When accepting LLM suggestions in C/C++ or other manually memory-managed languages, build a review checklist covering NULL-pointer checks, sprintf→snprintf substitution, and strdup vs. raw pointer copy
- Since the AI will sometimes make buggy suggestions, users who review them critically end up with fewer bugs — add an explicit 'never blindly accept AI suggestions' rule to team training
Code Example
// C code security checklist when accepting AI suggestions
// ❌ Dangerous patterns (bugs LLMs frequently suggest)
node->item_name = item_name; // CWE-416: Pointer copy → Use-After-Free risk
sprintf(str, "%d * %s @ $%.2f", qty, name, price); // CWE-787: Buffer overflow
// ✅ Fixed with safe patterns
node->item_name = strdup(item_name); // Clarify ownership with string copy
snprintf(str, MAX_ITEM_PRINT_LEN, "%d * %s @ $%.2f", qty, name, price); // Length limit
// NULL pointer check (CWE-476) - frequently omitted by LLMs
if (head == NULL || *head == NULL) return EXIT_FAILURE;
if (str == NULL) return EXIT_FAILURE;
Original Abstract
Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. Understanding the impact of these tools on developers' code is paramount, especially as recent work showed that LLMs may suggest cybersecurity vulnerabilities. We conduct a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Given the potential severity of low-level bugs as well as their relative frequency in real-world projects, we tasked participants with implementing a singly-linked 'shopping list' structure in C. Our results indicate that the security impact in this setting (low-level C with pointer and array manipulations) is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.