Researchers introduce methods to rewrite a teacher model's reasoning traces, deterring unauthorized knowledge distillation while embedding verifiable watermarks. Techniques include LLM-powered rewriting and gradient-based approaches, both of which preserve answer correctness. Experiments demonstrate strong anti-distillation effects while maintaining or improving teacher performance, along with reliable watermark detection.
Key Points
1. Modifies reasoning traces for anti-distillation and API watermarking
2. Uses LLM-based and gradient-based dynamic rewriting
3. Instruction-based rewriting effectively degrades distillation utility
4. Preserves semantic coherence and teacher performance
5. Achieves reliable watermark detection with no false alarms
Impact Analysis
This technique lets LLM providers protect intellectual property from unauthorized knowledge distillation. It balances security with usability, potentially shifting industry practice toward traceable APIs. Researchers and companies can adopt it to safeguard frontier models.
Technical Details
Rewriting leverages LLMs, steered by instructions or gradients, to alter reasoning traces without changing final answers. Even simple prompting achieves anti-distillation by disrupting the training signal a student model would extract from the traces. Watermarks carried in the rewritten traces are embedded into any student trained on them, enabling verifiable detection.
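The instruction-based variant can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` callable, the rewriting instruction text, and the `"Final answer:"` delimiter are all assumptions chosen to show the key invariant, that only the trace is rewritten while the final answer is reattached verbatim.

```python
# Hedged sketch of instruction-based trace rewriting.
# Assumptions (not from the source): `llm` is any callable mapping a prompt
# string to a completion string; teacher responses end with "Final answer: ...".

REWRITE_INSTRUCTION = (
    "Rewrite the reasoning below so it remains coherent and reaches the same "
    "conclusion, but is unhelpful as supervision for training another model.\n\n"
    "Reasoning:\n{trace}"
)

def split_response(response: str) -> tuple[str, str]:
    """Split a teacher response into (reasoning trace, final answer)."""
    trace, _, answer = response.rpartition("Final answer:")
    return trace.strip(), answer.strip()

def rewrite_trace(response: str, llm) -> str:
    """Rewrite only the reasoning trace; reattach the original answer unchanged."""
    trace, answer = split_response(response)
    new_trace = llm(REWRITE_INSTRUCTION.format(trace=trace))
    return f"{new_trace}\nFinal answer: {answer}"
```

Because the final answer is copied through untouched, the served response stays correct for end users even when the rewritten trace is useless as distillation data.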