Recovering verbatim finetuning data from LLM logits without weights
New method recovers private training data from LLMs using only logits; major implications for model security.
30-Second TL;DR
What Changed
Contrastive Decoding Diffing (CDD) recovers verbatim finetuning data using only grey-box logit access, with no access to model weights.
Why It Matters
This research highlights a significant privacy vulnerability in finetuned models, suggesting that logit access alone is sufficient to reconstruct sensitive training data. It underscores the risks of using synthetic data from popular LLMs for finetuning.
What To Do Next
Audit your finetuning pipelines to ensure that synthetic training data is sanitized of model-specific artifacts or personas before training.
Deep Insight
AI-generated analysis for this event.
Enhanced Key Takeaways
- CDD leverages the divergence between the output distributions of a base model and its finetuned counterpart to isolate memorized sequences without needing gradient information.
- The method exploits the 'logit drift' phenomenon, where a finetuned model assigns significantly higher confidence to verbatim training tokens than the pre-trained base model does.
- Research indicates that CDD is particularly effective against models finetuned on small, high-quality datasets, where memorization is more prevalent than in large-scale instruction tuning.
- The technique demonstrates that logit-only access is sufficient to reconstruct sensitive personally identifiable information (PII) that was previously thought to be protected by weight-access restrictions.
- CDD's efficiency stems from its ability to perform 'contrastive sampling' in real time, allowing training data to be extracted during inference without expensive backpropagation.
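The 'contrastive sampling' idea in the takeaways above can be illustrated with a minimal sketch. It assumes greedy decoding on the difference between finetuned and base log-probabilities, which is one plausible reading of the description and not necessarily the authors' exact procedure. The function names, the toy vocabulary, and the `MEMORIZED` sequence are all hypothetical stand-ins for real model calls.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_greedy_decode(finetuned_logits_fn, base_logits_fn, prompt, max_new_tokens):
    """Pick, at each step, the token whose log-probability rose most after finetuning.

    Both *_logits_fn arguments are assumed to map a token-id list to a
    next-token logit vector over the same vocabulary.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        ft = log_softmax(finetuned_logits_fn(tokens))
        base = log_softmax(base_logits_fn(tokens))
        drift = [f - b for f, b in zip(ft, base)]
        tokens.append(max(range(len(drift)), key=drift.__getitem__))
    return tokens

# Toy stand-ins (hypothetical): the base model always prefers token 0,
# and finetuning adds a modest boost to a memorized sequence.
MEMORIZED = [3, 1, 4, 2]

def toy_base_logits(tokens):
    return [2.0, 0.0, 0.0, 0.0, 0.0]

def toy_finetuned_logits(tokens):
    logits = toy_base_logits(tokens)
    if len(tokens) < len(MEMORIZED):
        logits[MEMORIZED[len(tokens)]] += 1.5  # boost is too small to win plain greedy decoding
    return logits

recovered = contrastive_greedy_decode(toy_finetuned_logits, toy_base_logits, [], 4)
print(recovered)
```

In this toy setup, plain greedy decoding on the finetuned model alone would keep emitting token 0, because the base preference still dominates. Subtracting the base log-probabilities cancels that preference and surfaces the memorized sequence.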
Competitor Analysis
| Feature | CDD (Contrastive Decoding Diffing) | ADL (Activation Difference Lens) | Training Data Extraction (Gradient-based) |
|---|---|---|---|
| Access Level | Grey-box (Logits only) | White-box (Weights required) | White-box (Gradients required) |
| Computational Cost | Low (Inference-time) | High (Requires backprop) | Very High (Requires training state) |
| Accuracy | High (19/20 benchmarks) | Moderate | Variable |
| Weight Access | Not Required | Required | Required |
Technical Deep Dive
- CDD operates by calculating the difference in logit vectors between a reference base model and the target finetuned model at each token position.
- It utilizes a thresholding mechanism on the logit difference to identify tokens that deviate significantly from the base model's probability distribution.
- The algorithm employs a sliding window approach to reconstruct sequences, effectively filtering out noise by focusing on high-confidence logit spikes.
- Implementation does not require access to the model's hidden states or internal activations, relying solely on the logits the model emits at its output layer.
- The method is agnostic to the specific architecture of the LLM, provided the base model and finetuned model share the same vocabulary and tokenizer.
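The thresholding and sliding-window steps listed above could look roughly like the following sketch. The summary gives no concrete parameters, so `threshold`, `window`, and `min_hits` are hypothetical, as are the function names. The sketch also compares log-probabilities instead of raw logits, an assumption made here because raw logits are only defined up to an additive constant.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_gaps(finetuned_logits, base_logits, token_ids):
    """Per-position gain in log-probability of the observed token after finetuning."""
    gaps = []
    for ft, base, tok in zip(finetuned_logits, base_logits, token_ids):
        gaps.append(log_softmax(ft)[tok] - log_softmax(base)[tok])
    return gaps

def flag_spans(gaps, threshold, window, min_hits):
    """Return merged (start, end) spans, end exclusive, where a window of
    `window` positions holds at least `min_hits` gaps above `threshold`."""
    hits = [g > threshold for g in gaps]
    spans = []
    for start in range(len(hits) - window + 1):
        if sum(hits[start:start + window]) >= min_hits:
            end = start + window
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)  # extend the previous span
            else:
                spans.append((start, end))
    return spans

# Hypothetical per-token gaps: positions 1-3 spike together, position 6 is isolated noise.
example_gaps = [0.1, 2.0, 2.5, 3.0, 0.2, 0.0, 1.8, 0.1]
print(flag_spans(example_gaps, threshold=1.5, window=3, min_hits=3))
```

Requiring several hits inside one window is what filters out isolated spikes such as position 6 in the example, while runs of consecutive high-gap tokens are kept as candidate memorized spans.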
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning