SkyReels-V4 Tops Global Text-to-Video Ranking

SkyReels-V4 beats Sora and Veo in text-to-video-with-audio benchmarks and becomes the new global leader.
30-Second TL;DR
What Changed
SkyReels-V4 ranks #1 in the Text-to-Video with Audio category on the Artificial Analysis leaderboard.
Why It Matters
This milestone elevates Kunlun Tech in the competitive text-to-video space, pressuring leaders like OpenAI and Google to innovate faster. It signals advancing capabilities in multimodal AI, benefiting developers seeking state-of-the-art video generation tools.
What To Do Next
Compare SkyReels-V4 with other models on the Artificial Analysis leaderboard before choosing one for your text-to-video projects.
Deep Insight
Web-grounded analysis with 7 cited sources.
Enhanced Key Takeaways
- SkyReels-V4 achieves an Elo score of 1131 on the Artificial Analysis Text-to-Video with Audio leaderboard, ahead of Kling 3.0 1080p (Pro) at 1097.[3]
- The model supports 1080p resolution at 32 FPS for sequences up to 15 seconds, with native audio-visual co-generation including frame-perfect lip-syncing and SFX alignment.[2]
- Kunlun Tech (also known as Skywork AI) employs a dual-stream Multimodal Diffusion Transformer architecture, starting from low-resolution text-to-image training on 3 billion images and scaling to 1080p with multimodal inputs.[4]
Competitor Analysis
| Model | Elo (With Audio) | Elo (No Audio) | Resolution/FPS | Max Length |
|---|---|---|---|---|
| SkyReels-V4 | 1131 | 1244 | 1080p/32 | 15s |
| Kling 3.0 1080p (Pro) | 1097 | 1248 | 1080p | N/A |
| Veo 3.1 Fast | 1086 | N/A | N/A | N/A |
| Sora 2 | Below top 5 | N/A | N/A | N/A |
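
As a rough guide to what the Elo gaps in the table mean, the sketch below converts a rating difference into an expected head-to-head preference rate using the standard logistic Elo formula. The 400-point scale is the conventional chess value and is an assumption here: the source does not state which scale or fitting method the Artificial Analysis leaderboard uses, and `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise comparisons won by A under the logistic Elo model.

    `scale` = 400 is the conventional Elo constant (assumed, not from the source).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# With-audio Elo scores taken from the table above.
skyreels_v4 = 1131
kling_3_pro = 1097
veo_31_fast = 1086

print(f"vs Kling 3.0 1080p (Pro): {expected_win_rate(skyreels_v4, kling_3_pro):.3f}")
print(f"vs Veo 3.1 Fast:          {expected_win_rate(skyreels_v4, veo_31_fast):.3f}")
```

Under that assumption, a 34-point lead corresponds to SkyReels-V4 being preferred in roughly 55% of pairwise comparisons against Kling 3.0 1080p (Pro), so the lead is measurable but modest.
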
Technical Deep Dive
- Utilizes a dual-stream Multimodal Diffusion Transformer (MM-DiT) that unifies video and audio synthesis, inpainting, and editing in a single framework.[4]
- Training pipeline: Stage 1 low-resolution text-to-image on 3 billion images, progressing to stage 6 with 1080p multimodal inputs; integrates sound via upsampling and fusion with video sparse attention.[4]
- Supports multimodal inputs including text, image, and mask references for pixel-level control; generates 1080p at 32 FPS up to 15 seconds with microsecond-level audio-video synchronization.[2]
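
To put the 32 FPS and 15-second figures in perspective, this small sketch works out the frame budget of a maximum-length clip and how many audio samples fall within one video frame. The 48 kHz sample rate is an assumption for illustration only: the source does not state the model's audio sample rate, and `clip_budget` is a hypothetical helper.

```python
FPS = 32                  # from the source
MAX_SECONDS = 15          # from the source
SAMPLE_RATE_HZ = 48_000   # assumed; not stated in the source


def clip_budget(fps: int, seconds: int, sample_rate_hz: int):
    """Return (total frames, milliseconds per frame, audio samples per frame)."""
    total_frames = fps * seconds
    frame_ms = 1000 / fps
    samples_per_frame = sample_rate_hz / fps
    return total_frames, frame_ms, samples_per_frame


frames, frame_ms, samples = clip_budget(FPS, MAX_SECONDS, SAMPLE_RATE_HZ)
print(f"{frames} frames, {frame_ms} ms per frame, {samples} audio samples per frame")
```

A full-length clip is therefore 480 frames, each lasting 31.25 ms, so frame-level lip-sync means aligning audio events to within that window.
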
Sources (7)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: Pandaily