When AI Coding Assistants Leak Training Data: Studying Memorization in Code LLMs
Large language models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization in state-of-the-art code LLMs using a two-stage attack pipeline that combines membership inference with data extraction. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of more than 30,000 Python files. Our results reveal memorization rates of 42-64% across models, with code-specialized models exhibiting higher rates than general-purpose ones. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates of up to 70%, while complex code is less susceptible. Notably, realistic code-completion scenarios trigger unintended memorization in 13-14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation. Our findings confirm that memorization persists in modern LLMs and is driven more by training domain and content characteristics than by parameter count alone.
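As a rough illustration of the two-stage pipeline described above, the sketch below pairs a loss-based membership-inference signal with prefix-prompted verbatim extraction against a Hugging Face causal LM. The model name, prefix and continuation lengths, and the LOSS_THRESHOLD value are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices, not the paper's exact setup.
MODEL_NAME = "bigcode/starcoder2-3b"  # one of the four evaluated models
LOSS_THRESHOLD = 1.0                  # placeholder; calibrate on known non-members

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def membership_score(text: str) -> float:
    """Stage 1 (membership inference): mean per-token negative log-likelihood.
    Unusually low loss suggests the sample was seen during training."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

def extracts_verbatim(sample: str, prefix_len: int = 64, cont_len: int = 64) -> bool:
    """Stage 2 (data extraction): prompt with the sample's first prefix_len
    tokens and test whether greedy decoding reproduces the next cont_len
    tokens exactly."""
    ids = tokenizer(sample, return_tensors="pt").input_ids[0]
    if ids.numel() < prefix_len + cont_len:
        return False
    prefix = ids[:prefix_len].unsqueeze(0)
    with torch.no_grad():
        gen = model.generate(prefix, max_new_tokens=cont_len, do_sample=False)
    return torch.equal(gen[0, prefix_len:prefix_len + cont_len],
                       ids[prefix_len:prefix_len + cont_len])

def looks_memorized(sample: str) -> bool:
    # A file counts as memorized only if it passes both stages.
    return membership_score(sample) < LOSS_THRESHOLD and extracts_verbatim(sample)
```

Greedy decoding and exact token match make this a conservative criterion; relaxed variants (sampling, near-verbatim similarity) would flag more content as memorized.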