Home
Categories
EXPLORE
True Crime
Comedy
Society & Culture
Business
Sports
Technology
History
About Us
Contact Us
Copyright
© 2024 PodJoint
Podjoint Logo
US
00:00 / 00:00
Sign in

or

Don't have an account?
Sign up
Forgot password
https://is1-ssl.mzstatic.com/image/thumb/Podcasts211/v4/86/0c/75/860c75aa-068a-18b9-1cb5-600f803acdd4/mza_17177667092256625558.jpg/600x600bb.jpg
AI Illuminated
The AI Illuminators
25 episodes
2 days ago
A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.
Show more...
Courses
Education
RSS
All content for AI Illuminated is the property of The AI Illuminators and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.
A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.
Show more...
Courses
Education
https://d3t3ozftmdmh3i.cloudfront.net/staging/podcast_uploaded_episode/42256170/42256170-1729314531700-17f64efc21536.jpg
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
AI Illuminated
9 minutes 21 seconds
1 year ago
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

[00:00] Intro

[00:28] Limitation of existing unified models

[00:57] Janus's decoupled visual encoding solution

[01:18] Advantages of decoupling

[02:03] Janus architecture

[02:50] Three-stage training

[03:41] Ablation studies

[04:23] Extensions for Janus

[05:10] Performance gains

[05:47] Current limitations

[06:31] Impact of simplicity and extensibility

[07:10] Qualitative results

[08:18] Potential applications

[08:52] Key takeaways


Abstract: In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.


Authors: Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo


Affiliations: DeepSeek-AI, The University of Hong Kong, Peking University


Link: https://arxiv.org/abs/2410.13848

AI Illuminated
A new way to keep up with AI research. Delivered to your ears. Illuminated by AI. Part of the GenAI4Good initiative.