Researchers introduce AST-PAC, a syntax-aware adaptation of PAC for membership inference attacks on code LLMs. It uses AST-based perturbations to create valid calibration samples, outperforming baselines on larger files but facing limits on small or alphanumeric-rich code. The work calls for syntax-adaptive methods to audit code model training data.
Key Points
1. Evaluates Loss- and PAC-based MIAs on 3B–7B code models
2. PAC falters on complex code because its augmentations produce syntactically invalid calibration samples
3. AST-PAC generates syntactically valid calibration samples via AST perturbations (see the sketch after this list)
4. AST-PAC improves on large, syntactically rich files but underperforms on small or alphanumeric-heavy code
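The perturbation idea in point 3 can be illustrated with Python's standard `ast` module. The `IdentifierRenamer` class and the random renaming scheme below are illustrative assumptions, not the paper's exact perturbation operators; the point is only that transforming the tree and unparsing it guarantees each variant still parses, unlike token-level augmentation.

```python
# Minimal sketch: syntactically valid perturbations via AST identifier renaming.
# Illustrative only; requires Python 3.9+ for ast.unparse.
import ast
import random


class IdentifierRenamer(ast.NodeTransformer):
    """Rename variables and arguments so every perturbed variant stays valid Python."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name: str) -> str:
        # Assign each original identifier a fresh random name, reused consistently.
        if name not in self.mapping:
            self.mapping[name] = f"v{random.randrange(10**6)}"
        return self.mapping[name]

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Crude: renames every Name node, including globals; enough for a sketch.
        node.id = self._rename(node.id)
        return node


def perturb(source: str) -> str:
    """Return a syntactically valid variant of `source` with identifiers renamed."""
    tree = ast.parse(source)
    tree = IdentifierRenamer().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # guaranteed to parse again


print(perturb("def add(a, b):\n    total = a + b\n    return total"))
```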
Impact Analysis
AST-PAC advances auditing of unauthorized code usage in LLMs, supporting data governance and copyright compliance. It also highlights gaps in current MIAs and pushes for domain-specific tools for code provenance.
Technical Details
AST-PAC adapts PAC by perturbing Abstract Syntax Trees (ASTs) so that calibration samples remain syntactically valid. Tested on code models, it scales better with syntactic complexity than syntax-agnostic PAC. Limitations include under-mutation of small files, which offer few AST nodes to perturb, and weaker performance on alphanumeric-heavy code.
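As a rough picture of how such perturbations feed a calibrated membership score, the sketch below compares a target model's loss on the original file against its average loss over AST-perturbed variants; a markedly lower loss on the original is taken as evidence of membership. It assumes a Hugging Face causal LM, the illustrative `perturb()` helper sketched above, and an auditor-chosen decision threshold; none of these details are taken from the paper's exact formulation.

```python
# Sketch of a PAC-style calibrated membership score, under the assumptions above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sequence_loss(model, tokenizer, text: str) -> float:
    """Average token-level cross-entropy of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()


def membership_score(model, tokenizer, code: str, n_perturbations: int = 8) -> float:
    """Loss gap between the original code and its AST-perturbed neighbors.

    More negative scores (the original is far easier for the model than its
    syntactically valid neighbors) suggest the sample was seen in training.
    """
    original_loss = sequence_loss(model, tokenizer, code)
    perturbed_losses = [
        sequence_loss(model, tokenizer, perturb(code))  # perturb() from the sketch above
        for _ in range(n_perturbations)
    ]
    return original_loss - sum(perturbed_losses) / len(perturbed_losses)


# Illustrative choice of a 7B code model; any causal code LM would do.
model_name = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(membership_score(model, tokenizer, "def add(a, b):\n    return a + b"))
```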