Can LLMs really reason about Code? Studying how well LLMs understand the relation between Input, Code, and Output
In recent years, large language models (LLMs) have made remarkable progress in code generation. However, their ability to reason about program behavior remains an open challenge. This ability matters for applications including reverse engineering, debugging, secure code generation, test-driven synthesis, input reconstruction, reverse fuzzing, behavioral monitoring, and safe execution modeling.
To study this ability, we examine the capacity of LLMs to reason about the semantics of code, specifically their ability to relate code, its inputs, and its outputs to each other. To this end, we investigate whether and how well LLMs can predict one of these three components given the other two, that is,
- predict the input given code and output,
- predict the output given code and input, and
- predict the code given input and output.
This way, we assess how well LLMs can reason about and understand the underlying relationships that govern program execution.
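To make the three settings concrete, the following is a minimal sketch of how a single (code, input, output) triple could be turned into three prediction tasks and checked by execution. The example program, the helper names (`run`, `check_output_prediction`, `check_input_prediction`, `check_code_prediction`), and the acceptance criteria are illustrative assumptions, not the evaluation protocol used in the study.

```python
# Hypothetical sketch: one (code, input, output) triple and a checker
# for each of the three prediction settings. Names and acceptance
# criteria are assumptions made for illustration only.

def run(code: str, arg: str) -> str:
    """Execute `code`, which must define a function f(s), and return f(arg)."""
    namespace: dict = {}
    exec(code, namespace)  # only safe for trusted or sandboxed code
    return namespace["f"](arg)

CODE = "def f(s):\n    return s[::-1].upper()\n"
INPUT = "abc"
OUTPUT = run(CODE, INPUT)  # ground-truth output obtained by execution

def check_output_prediction(predicted_output: str) -> bool:
    # Given code and input: the prediction must equal the real output.
    return predicted_output == OUTPUT

def check_input_prediction(predicted_input: str) -> bool:
    # Given code and output: assume any input that reproduces the
    # output is accepted, since the original input may not be unique.
    return run(CODE, predicted_input) == OUTPUT

def check_code_prediction(predicted_code: str) -> bool:
    # Given input and output: assume any program that maps the input
    # to the output is accepted; programs that crash count as wrong.
    try:
        return run(predicted_code, INPUT) == OUTPUT
    except Exception:
        return False
```

Under these assumed criteria, code prediction is under-constrained by a single example: a program that merely returns the constant output would pass. A stricter check would compare the predicted program against the reference on several inputs.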
We construct four datasets covering string processing, array operations, and coding challenges in JavaScript and Python to evaluate diverse program-understanding capabilities, incorporating various code mutation techniques to increase complexity.
In our evaluation, we find that closed-weight models achieve the strongest performance across all datasets, including perfect input recovery on deterministic string tasks. Across tasks, output prediction is comparatively stable, whereas code prediction remains the hardest setting and often fails for smaller models. Finally, cross-codebase transfer is feasible, especially for input prediction, but highly sensitive to model capacity and fine-tuning strategy.