Researchers introduce AST-PAC, a syntax-aware adaptation of PAC for membership inference attacks on code LLMs. It uses AST-based perturbations to create valid calibration samples, outperforming baselines on larger files but facing limits on small or alphanumeric-rich code. The work calls for syntax-adaptive methods to audit code model training data.
Key Points
1. Evaluates Loss- and PAC-based MIAs on 3B–7B code models
2. PAC falters on complex code because its augmentations produce syntactically invalid calibration samples
3. AST-PAC generates syntactically valid calibration samples via AST perturbations (see the sketch after this list)
4. AST-PAC improves on large, syntactically rich files but underperforms on small or alphanumeric-heavy code
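The perturbation idea in point 3 can be illustrated with Python's standard `ast` module. The `IdentifierRenamer` class and the random renaming scheme below are illustrative assumptions, not the paper's exact perturbation operators; the point is only that transforming the tree and unparsing it guarantees each variant still parses, unlike token-level augmentation.

```python
# Minimal sketch: syntactically valid perturbations via AST identifier renaming.
# Illustrative only; requires Python 3.9+ for ast.unparse.
import ast
import random


class IdentifierRenamer(ast.NodeTransformer):
    """Rename variables and arguments so every perturbed variant stays valid Python."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name: str) -> str:
        # Assign each original identifier a fresh random name, reused consistently.
        if name not in self.mapping:
            self.mapping[name] = f"v{random.randrange(10**6)}"
        return self.mapping[name]

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Crude: renames every Name node, including globals; enough for a sketch.
        node.id = self._rename(node.id)
        return node


def perturb(source: str) -> str:
    """Return a syntactically valid variant of `source` with identifiers renamed."""
    tree = ast.parse(source)
    tree = IdentifierRenamer().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # guaranteed to parse again


print(perturb("def add(a, b):\n    total = a + b\n    return total"))
```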
Impact Analysis
AST-PAC advances auditing of unauthorized code usage in LLMs, supporting data governance and copyright compliance. It also highlights gaps in current MIAs and pushes for domain-specific tools for code provenance.
Technical Details
AST-PAC adapts PAC by perturbing Abstract Syntax Trees (ASTs) so that calibration samples remain syntactically valid. Tested on code models, it scales better with syntactic complexity than syntax-agnostic PAC. Limitations include under-mutation of small files, which offer few AST nodes to perturb, and weaker performance on alphanumeric-heavy code.
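As a rough picture of how such perturbations feed a calibrated membership score, the sketch below compares a target model's loss on the original file against its average loss over AST-perturbed variants; a markedly lower loss on the original is taken as evidence of membership. It assumes a Hugging Face causal LM, the illustrative `perturb()` helper sketched above, and an auditor-chosen decision threshold; none of these details are taken from the paper's exact formulation.

```python
# Sketch of a PAC-style calibrated membership score, under the assumptions above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sequence_loss(model, tokenizer, text: str) -> float:
    """Average token-level cross-entropy of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()


def membership_score(model, tokenizer, code: str, n_perturbations: int = 8) -> float:
    """Loss gap between the original code and its AST-perturbed neighbors.

    More negative scores (the original is far easier for the model than its
    syntactically valid neighbors) suggest the sample was seen in training.
    """
    original_loss = sequence_loss(model, tokenizer, code)
    perturbed_losses = [
        sequence_loss(model, tokenizer, perturb(code))  # perturb() from the sketch above
        for _ in range(n_perturbations)
    ]
    return original_loss - sum(perturbed_losses) / len(perturbed_losses)


# Illustrative choice of a 7B code model; any causal code LM would do.
model_name = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(membership_score(model, tokenizer, "def add(a, b):\n    return a + b"))
```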