AIware 2026
Mon 6 - Tue 7 July 2026, Montreal, Canada
co-located with FSE 2026

Large Language Models (LLMs) for code generation risk memorizing and reproducing sensitive training data, including licensed code and proprietary information. We investigate memorization behavior in state-of-the-art code LLMs using a two-stage attack pipeline combining membership inference and data extraction. We evaluate four models (StarCoder2-3B, StarCoder2-7B, Llama3-8B, and DeepSeek-R1-distilled-Llama-8B) on a custom dataset of 30,000+ Python files. Our results reveal memorization rates of 42-64%, with code-specialized models exhibiting higher rates than general-purpose models. Categorical analysis shows that repetitive content (license headers, documentation) is memorized at rates up to 70%, while complex code exhibits lower susceptibility. Notably, realistic code completion scenarios trigger unintentional memorization in 13-14% of cases, posing practical risks for AI coding assistants. We demonstrate that knowledge distillation reduces extraction rates by approximately 19%, offering a cost-effective mitigation approach. Our findings confirm that memorization persists in modern LLMs and is influenced more by training domain and content characteristics than by parameter count alone.
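To make the two-stage pipeline concrete, the following is a minimal sketch of a membership-inference-then-extraction probe, assuming the Hugging Face transformers API and the public bigcode/starcoder2-3b checkpoint. The perplexity threshold, the 50/50 prefix split, and the verbatim-match heuristic are illustrative assumptions, not the paper's exact configuration.

# Sketch of a two-stage memorization probe (assumed setup, not the authors'
# exact implementation). Stage 1: membership inference via perplexity.
# Stage 2: prefix-prompted extraction with a verbatim-reproduction check.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoder2-3b"  # one of the evaluated model families
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def perplexity(text: str) -> float:
    # Stage 1 signal: unusually low perplexity suggests the sample was seen
    # during training (candidate member).
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def extract(prefix: str, true_suffix: str, max_new_tokens: int = 128) -> bool:
    # Stage 2: prompt with the file's prefix, decode greedily, and check
    # whether the model reproduces the opening of the true suffix verbatim.
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return completion.strip().startswith(true_suffix.strip()[:200])

# Flag a file as a memorization candidate only if it passes both stages.
# PPL_THRESHOLD is an illustrative value, not taken from the paper.
PPL_THRESHOLD = 5.0
sample = open("example.py").read()
prefix, suffix = sample[: len(sample) // 2], sample[len(sample) // 2 :]
if perplexity(sample) < PPL_THRESHOLD and extract(prefix, suffix):
    print("candidate memorized sample")

In practice, stage 1 would calibrate its threshold against a reference set of known non-members and stage 2 would use softer similarity metrics than an exact prefix match, but the skeleton above captures the membership-inference-then-extraction flow summarized in the abstract.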