AI Updates Aggregator

🐯虎嗅•May 1, 2026Stalecollected in 4m

DeepSeek's Deleted Visual Primitives Paper

#visual-primitives #reference-gap #multimodal-reasoning #image-compressiondeepseek-v4-flashdeepseek deepseek-v4-flash gpt claude gemini

💡DeepSeek's 'finger' trick (points/boxes) fixes top models' visual counting fails at 7056x compression.

⚡ 30-Second TL;DR

What Changed

Introduces points/boxes as visual primitives embedded in CoT to reference image elements precisely

Why It Matters

This approach could outperform GPT/Claude/Gemini in visual tasks with far less compute, democratizing efficient multimodal AI. Highlights DeepSeek's edge in MoE efficiency for vision-language models.

What To Do Next

Test embedding bounding box coordinates in your multimodal CoT prompts for better visual counting.

Who should care:Researchers & Academics

Key Points

•Introduces points/boxes as visual primitives embedded in CoT to reference image elements precisely
•Compresses 756x756 images to 81 tokens (7056x ratio) while retaining counting accuracy
•Trained on 40M samples from detection datasets after rigorous quality filtering
•Uses separate expert models for boxes/points, then merges via imitation learning

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

•The retraction was reportedly prompted by internal concerns regarding the potential for 'coordinate leakage' where the model's reliance on precise pixel-level primitives could lead to overfitting on specific synthetic dataset artifacts rather than generalizable visual reasoning.
•The 'Reference Gap' addressed by the paper refers to the inability of standard Vision-Language Models (VLMs) to maintain spatial consistency when generating long-form Chain-of-Thought (CoT) responses, leading to hallucinations in multi-step visual tasks.
•The model architecture utilized a novel 'Spatial-Token Alignment' layer that forces the transformer's attention heads to map text-based coordinate tokens directly to the latent visual feature map before the final decoding stage.

📊 Competitor Analysis▸ Show

Feature	DeepSeek Visual Primitives	GPT-4o (Vision)	Claude 3.5 Sonnet
Visual Reasoning Approach	Explicit coordinate embedding	Implicit spatial attention	Implicit spatial attention
Image Compression	7056x (Extreme)	Adaptive/Variable	Adaptive/Variable
Object Counting Precision	High (via primitives)	Moderate (prone to hallucination)	Moderate
Training Data	40M curated detection samples	Proprietary/General	Proprietary/General

🛠️ Technical Deep Dive

Coordinate Embedding Mechanism: The model treats [x, y] coordinates as special tokens within the CoT sequence, which are processed by a dedicated 'Spatial Decoder' head separate from the main text-generation head.
Compression Strategy: The 7056x compression is achieved through a hierarchical patch-merging technique that preserves high-frequency edge information necessary for bounding box regression, despite the extreme reduction in token count.
Imitation Learning Phase: The model was fine-tuned using a 'Teacher-Student' framework where a larger, non-compressed vision model provided ground-truth spatial attention maps to guide the compressed model's primitive selection.

🔮 Future ImplicationsAI analysis grounded in cited sources

Future VLM architectures will shift toward explicit spatial grounding tokens.

The industry is moving away from implicit visual attention toward explicit coordinate-based reasoning to solve persistent hallucination issues in complex visual tasks.

Data curation for visual reasoning will prioritize detection-heavy datasets over raw image-text pairs.

The success of the 40M detection sample training set demonstrates that structured spatial data is more effective for reasoning than unstructured web-scraped image captions.

⏳ Timeline

2024-01

DeepSeek begins research into visual-spatial reasoning integration.

2025-09

Initial development of the 284B MoE architecture with visual primitive support.

2026-03

DeepSeek publishes 'Thinking with Visual Primitives' paper.

2026-04

DeepSeek retracts the paper citing internal quality and safety concerns.

🐯Read original article on 虎嗅

📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

Same topic

Explore #visual-primitives

Same product

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅 ↗

⚡ 30-Second TL;DR

Key Points

🧠 Deep Insight

🔑 Enhanced Key Takeaways

🛠️ Technical Deep Dive

🔮 Future ImplicationsAI analysis grounded in cited sources

⏳ Timeline

👉Related Updates

Ubtech's U1 Robot Faces Market Skepticism

The Rise of AI Game Companions

Taobao expands instant retail 'Convenience Store' network

Leapmotor pivots to overseas markets amid domestic competition