[00:00] Introduction
[00:20] Limitations of traditional SfM and SLAM techniques.
[00:57] Shortcomings of existing neural network methods.
[01:07] MegaSaM's approach: balance of accuracy, speed, and robustness.
[01:31] Differentiable bundle adjustment (BA) layer.
[02:03] Integration of monocular depth priors and motion probability maps.
[02:37] Uncertainty-aware global BA scheme.
[03:14] Two-stage training scheme.
[03:45] Consistent video depth estimation without test-time fine-tuning.
[04:16] Key quantitative and qualitative improvements.
[04:49] Limitations of MegaSaM and future research avenues.
[05:15] Synthetic data for training and generalization to real-world videos.
[05:49] Datasets used for evaluation.
[06:26] DepthAnything and UniDepth for monocular depth estimation.
[07:02] Summary of MegaSaM's advancements.
Authors: Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, Noah Snavely
Affiliations: Google DeepMind, UC Berkeley, University of Michigan
Abstract: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network-based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of a deep visual SLAM framework: with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times. See interactive results on our project page: this https URL
Link: https://mega-sam.github.io/
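To make the "motion probability maps" and "uncertainty-aware BA" ideas above concrete, here is a minimal sketch (my own illustration, not MegaSaM's implementation) of a dense reprojection residual that downweights pixels predicted to be dynamic, which is the kind of term a differentiable BA layer would minimize. The function names, the flow-based target, and the weighting scheme are all assumptions for illustration.

```python
# Illustrative weighted bundle-adjustment residual: dynamic pixels (high motion
# probability) contribute little to camera optimization. Not the authors' code.
import torch

def weighted_reprojection_residual(depth_i, T_ij, K, flow_ij, motion_prob):
    """depth_i: (H, W) depth of frame i; T_ij: (4, 4) relative pose i->j;
    K: (3, 3) intrinsics; flow_ij: (H, W, 2) observed 2D correspondences;
    motion_prob: (H, W) probability that a pixel belongs to a moving object."""
    H, W = depth_i.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()      # (H, W, 3)
    # Back-project to 3D in frame i, transform to frame j, re-project.
    cam_i = depth_i[..., None] * (pix @ torch.linalg.inv(K).T)
    cam_j = cam_i @ T_ij[:3, :3].T + T_ij[:3, 3]
    proj = cam_j @ K.T
    proj = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    target = pix[..., :2] + flow_ij                 # where the pixel was observed in j
    residual = proj - target                        # (H, W, 2)
    # Static-scene weight: pixels likely to be moving are suppressed.
    w = (1.0 - motion_prob).clamp(min=0.0)[..., None]
    return (w * residual).reshape(-1)               # flattened weighted residual
```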
[00:00] SVDQuant: 4-bit diffusion model quantization
[00:27] Challenge: Outlier sensitivity in 4-bit quantization
[00:59] Solution: Smoothing + SVD approach
[01:37] Technical: SVD's role in low-rank approximation
[02:08] Nunchaku: New inference engine with kernel fusion
[02:35] Comparison: INT4 vs FP4 quantization methods
[03:00] Results: 3.5x memory reduction on FLUX.1
[03:44] Feature: Seamless LoRA compatibility
[04:06] Study: Validating combined approach effectiveness
[04:40] Future: Hardware compatibility and improvements
[06:12] Methods: Image quality assessment metrics
[06:53] Impact: Open-source deployment benefits
Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
Affiliations: MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, Pika Labs
Abstract: Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases the quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine Nunchaku that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.5×, achieving 3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.
Link: https://hanlab.mit.edu/projects/svdquant
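A rough sketch of the decomposition the abstract describes (assumptions of mine, not the released SVDQuant/Nunchaku code): migrate activation outliers into the weights via a smoothing scale, split the weight into a high-precision low-rank branch obtained with SVD, and quantize only the residual to 4 bits. Activation quantization and the fused kernels are omitted for brevity.

```python
# Sketch of the SVD + low-rank-residual idea; fake (simulated) INT4 only.
import torch

def fake_int4(w):
    """Symmetric per-output-channel fake INT4 quantization (for illustration)."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    return torch.round(w / scale).clamp(-8, 7) * scale

def svdquant_linear(x, weight, smooth, rank=32):
    """x: (B, Cin), weight: (Cout, Cin), smooth: (Cin,) smoothing factors."""
    x_hat = x / smooth                       # smoothed activations (fewer outliers)
    w_hat = weight * smooth                  # outliers migrated into the weights
    U, S, Vh = torch.linalg.svd(w_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]              # (Cout, r)  high-precision branch
    L2 = Vh[:rank]                           # (r, Cin)
    residual = w_hat - L1 @ L2               # what is left after the low-rank part
    w_q = fake_int4(residual)                # low-bit branch
    # Real SVDQuant also quantizes x_hat and fuses both branches in one kernel.
    return x_hat @ (L1 @ L2).T + x_hat @ w_q.T
```

The point of Nunchaku's kernel fusion is precisely that the two matmuls above share the same activation reads, so running the low-rank branch need not add memory traffic.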
[00:00] Intro
[00:18] Current limitations in depth-sensing technology
[00:56] D3RoMa's diffusion model approach to depth estimation
[01:47] Integration of geometric constraints in the model
[02:27] HiSS: New dataset for transparent/specular objects
[03:18] Benchmark results showing major accuracy improvements
[04:02] Current limitations and future development areas
[05:34] Technical details of HiSS dataset creation
[06:30] Real-world testing with robotic systems
[07:15] Why diffusion models outperform GANs
[08:54] Implementation of consistency loss functions
[12:00] Solving simulation-to-real-world transfer
[13:25] Potential expansion to single-camera systems
Authors: Songlin Wei, Haoran Geng, Jiayi Chen, Congyue Deng, Wenbo Cui, Chengyang Zhao, Xiaomeng Fang, Leonidas Guibas, He Wang
Affiliations: Peking University, UC Berkeley, Stanford, Galbot, University of Chinese Academy of Sciences, Beijing Academy of Artificial Intelligence
Abstract: Depth sensing is an important problem for 3D vision-based robotics. Yet, a real-world active stereo or ToF depth camera often produces noisy and incomplete depth which bottlenecks robot performance. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporate a left-right consistency constraint as classifier guidance to the diffusion process. Our framework combines recently advanced learning-based approaches and geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to compensate for existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieve state-of-the-art performance in multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.
Link: https://arxiv.org/abs/2409.14365
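As an illustration of the left-right consistency guidance mentioned in the abstract (not the authors' code), the sketch below warps the right image toward the left using the current disparity estimate and nudges the sample along the gradient of the photometric error at each denoising step. The warping convention and the guidance scale are assumptions.

```python
# One classifier-guidance step driven by left-right photometric consistency.
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """right: (B, C, H, W) image; disparity: (B, 1, H, W) in pixels."""
    B, C, H, W = right.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    # Shift x sampling coordinates left by the disparity (normalized to [-1, 1]).
    shift = 2.0 * disparity.squeeze(1) / max(W - 1, 1)        # (B, H, W)
    grid_x = xs.unsqueeze(0) - shift
    grid_y = ys.unsqueeze(0).expand_as(grid_x)
    grid = torch.stack([grid_x, grid_y], dim=-1)              # (B, H, W, 2)
    return F.grid_sample(right, grid, align_corners=True)

def consistency_guidance(disp_pred, left, right, scale=1.0):
    """Returns the guided disparity estimate for one sampling step."""
    disp = disp_pred.detach().requires_grad_(True)
    loss = F.l1_loss(warp_right_to_left(right, disp), left)
    grad, = torch.autograd.grad(loss, disp)
    return disp_pred - scale * grad          # move the sample toward consistency
```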
[00:00] Intro
[00:21] Key problem: Poor generalization in robotic learning
[00:51] HPT: New transformer architecture for robotics
[00:59] Core components of HPT architecture
[01:44] Scale analysis: Data and model size impacts
[02:16] Training data: Real robots, simulations, human videos
[02:54] Results: 20% improvement on new tasks
[04:04] Real-world testing limitations
[05:18] Future additions: Tactile and 3D data
[05:57] Requirements for better robotics datasets
[06:48] Weight sampling in heterogeneous data
[08:55] Benefits of modular architecture
[10:30] Scaling challenges and trade-offs
Authors: Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He
Affiliations: MIT CSAIL, Meta FAIR
Abstract: One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (this https URL) for code and videos.
Link: https://arxiv.org/abs/2409.20537
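A minimal sketch of the HPT-style layout the abstract describes: a per-embodiment "stem" tokenizes proprioception and vision into a short, fixed-length token sequence, a shared transformer trunk processes the tokens, and a per-task head maps the result to actions. Module names, sizes, and the cross-attention tokenizer are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Embodiment-specific tokenizer: maps raw inputs to n_tokens tokens."""
    def __init__(self, proprio_dim, vision_dim, d_model=256, n_tokens=16):
        super().__init__()
        self.proprio = nn.Linear(proprio_dim, d_model)
        self.vision = nn.Linear(vision_dim, d_model)
        self.queries = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, proprio, vision):
        # Cross-attend fixed queries over the embodiment's inputs, producing a
        # fixed-length token sequence regardless of sensor layout.
        kv = torch.stack([self.proprio(proprio), self.vision(vision)], dim=1)
        q = self.queries.unsqueeze(0).expand(proprio.shape[0], -1, -1)
        tokens, _ = self.attn(q, kv, kv)
        return tokens                                   # (B, n_tokens, d_model)

class HPTPolicy(nn.Module):
    def __init__(self, stems: dict, action_dims: dict, d_model=256):
        super().__init__()
        self.stems = nn.ModuleDict(stems)               # one stem per embodiment
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)   # shared trunk
        self.heads = nn.ModuleDict(
            {k: nn.Linear(d_model, a) for k, a in action_dims.items()})

    def forward(self, embodiment, task, proprio, vision):
        tokens = self.stems[embodiment](proprio, vision)
        feats = self.trunk(tokens).mean(dim=1)          # pool tokens
        return self.heads[task](feats)                  # predicted action
```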
[00:00] Introduction to Hover: Neural Whole Body Controller for Humanoids
[00:15] Problem: Current controllers lack versatility across tasks
[00:50] Human motion imitation as a unified control approach
[01:23] Policy distillation: Learning from an oracle policy
[02:01] Command space: Kinematic, joint angle, and root tracking modes
[02:34] Motion retargeting: From human data to robot movements
[03:09] Performance comparison with specialist policies
[03:43] Real-world testing on Unitree H1 robot
[04:15] Comparison with MHC and Masked Mimic approaches
[04:49] Future work and current limitations
[05:18] Reward function design and components
[06:02] DAgger advantages in policy learning
[06:33] Domain randomization for sim-to-real transfer
[07:06] Conclusions on Hover's contributions
Authors: Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, Linxi Fan, Yuke Zhu
Affiliations: NVIDIA, CMU, UC Berkeley, UT Austin, UC San Diego
Abstract: Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy. HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications.
Link: https://hover-versatile-humanoid.github.io/
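To illustrate the multi-mode command masking idea HOVER builds on, here is a hedged sketch (shapes and names are my assumptions): a single student policy receives the full-body tracking command plus a binary mask selecting which command components (kinematic targets, joint angles, root tracking) are active, and is distilled toward an oracle tracker's actions DAgger-style.

```python
import torch
import torch.nn as nn

class MaskedCommandPolicy(nn.Module):
    def __init__(self, obs_dim, cmd_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * cmd_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, command, mode_mask):
        # Zero out inactive command entries and also feed the mask itself, so the
        # policy knows which tracking targets it is currently asked to follow.
        x = torch.cat([obs, command * mode_mask, mode_mask], dim=-1)
        return self.net(x)

def distillation_loss(student, oracle_action, obs, command, mode_mask):
    """DAgger-style regression of the student onto the oracle policy's action."""
    return ((student(obs, command, mode_mask) - oracle_action) ** 2).mean()
```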
[00:00] Intro
[00:24] Tackles RL challenges using a visual backbone, efficient RL, and human feedback.
[01:20] Pretrained backbone boosts stability and exploration efficiency.
[02:06] RLPD combines offline data and human corrections effectively.
[02:57] Human-guided interventions reduce errors, enabling gradual autonomy.
[03:42] System choices aid spatial generalization and safe exploration.
[04:40] RL outperforms imitation learning in success and speed.
[05:29] Funnel model shows reliable, focused policy improvement.
[06:07] Learns both reactive and predictive tasks, enhancing flexibility.
[06:57] HIL-SERL excels over baselines in integrating human data.
[07:27] Outperforms diffusion policy on reactive tasks.
[08:04] Future work: longer tasks, pretraining, unstructured testing.
[08:57] Key takeaway: human-in-the-loop RL enables adaptable, efficient robotic policies.
Authors: Jianlan Luo, Charles Xu, Jeffrey Wu, Sergey Levine
Affiliations: UC Berkeley
Abstract: Reinforcement learning (RL) holds great promise for enabling autonomous acquisition of complex robotic manipulation skills, but realizing this potential in real-world settings has been challenging. We present a human-in-the-loop vision-based RL system that demonstrates impressive performance on a diverse set of dexterous manipulation tasks, including dynamic manipulation, precision assembly, and dual-arm coordination. Our approach integrates demonstrations and human corrections, efficient RL algorithms, and other system-level design choices to learn policies that achieve near-perfect success rates and fast cycle times within just 1 to 2.5 hours of training. We show that our method significantly outperforms imitation learning baselines and prior RL approaches, with an average 2x improvement in success rate and 1.8x faster execution. Through extensive experiments and analysis, we provide insights into the effectiveness of our approach, demonstrating how it learns robust, adaptive policies for both reactive and predictive control strategies. Our results suggest that RL can indeed learn a wide range of complex vision-based manipulation policies directly in the real world within practical training times. We hope this work will inspire a new generation of learned robotic manipulation techniques, benefiting both industrial applications and research advancements. Videos and code are available at our project website this https URL.
Link: https://hil-serl.github.io/
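A simplified sketch of the buffer logic behind the approach described above (RLPD-style updates with human corrections), under my own assumptions rather than the released code: each gradient batch is drawn half from the demonstration buffer and half from the online buffer, and human interventions are appended to the demonstration buffer so they keep being replayed.

```python
import random

class HILBuffers:
    """Symmetric sampling over demo and online data, with interventions kept as demos."""
    def __init__(self):
        self.demo, self.online = [], []

    def add_online(self, transition, human_intervened=False):
        self.online.append(transition)
        if human_intervened:                 # corrections are treated like demos
            self.demo.append(transition)

    def sample(self, batch_size):
        half = batch_size // 2
        return (random.sample(self.demo, min(half, len(self.demo))) +
                random.sample(self.online, min(batch_size - half, len(self.online))))
```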
[00:00] Paper intro: Zero-shot robotic manipulation via local policies
[00:26] Key challenges: Limited generalization and sim-to-real transfer
[01:03] Local policies: Task decomposition through localized focus regions
[01:38] Foundation models: VLMs for task understanding
[02:07] Training approach: Simulation-based RL + visuomotor policy distillation
[02:46] Implementation: Depth maps and impedance control system
[03:25] Results: 97% simulation success, 76% real-world success
[04:02] Challenges: Vision errors and collision handling
[04:32] Limitations: Issues with reflective objects and complex contacts
[05:48] Impact: Advancing autonomous robotic manipulation
[06:36] Design: Modular system for continuous improvement
[07:21] Dependencies: VLM and motion planner requirements
Authors: Murtaza Dalal, Min Liu, Walter Talbott, Chen Chen, Deepak Pathak, Jian Zhang, Ruslan Salakhutdinov
Affiliations: Carnegie Mellon University, Apple
Abstract: Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method to Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe they can solve unseen long-horizon manipulation tasks with up to 8 stages with significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. Video results at this https URL
Link: https://mihdalal.github.io/manipgen/
[00:00] Introduction to Mentor system for visual RL
[00:29] Problem: Sample inefficiency in robotic learning
[00:59] Innovation: Mixture of Experts (MoE) architecture
[01:55] Results: MoE achieves 100% success in multi-task testing
[02:33] Feature: Task-oriented perturbation for exploration
[03:55] Real-world testing: 83% success in robotic tasks
[04:33] Study: MoE and perturbation each boost performance by 30%
[05:14] Future work: Optimizing MoE implementation
[05:59] Challenge: Bridging simulation-to-real-world gap
[06:45] Impact: Advancing practical robotics applications
Authors: Suning Huang, Zheyu Zhang, Tianhai Liang, Yihan Xu, Zhehao Kou, Chenhao Lu, Guowei Xu, Zhengrong Xue, Huazhe Xu
Affiliations: Tsinghua University, Shanghai Qi Zhi Institute, Shanghai AI Lab
Abstract: Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multi-layer perceptron (MLP) with a mixture-of-experts (MoE) backbone, enhancing the agent's ability to handle complex tasks by leveraging modular expert learning to avoid gradient conflicts. Furthermore, MENTOR introduces a task-oriented perturbation mechanism, which heuristically samples perturbation candidates containing task-relevant information, leading to more targeted and effective optimization. MENTOR outperforms state-of-the-art methods across three simulation domains -- DeepMind Control Suite, Meta-World, and Adroit. Additionally, MENTOR achieves an average of 83% success rate on three challenging real-world robotic manipulation tasks including peg insertion, cable routing, and tabletop golf, which significantly surpasses the success rate of 32% from the current strongest model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at this https URL.
Link: https://arxiv.org/abs/2410.14972
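A minimal sketch of the kind of MoE backbone MENTOR swaps in for the usual MLP: a learned gate mixes the outputs of several small expert MLPs per input, which is the mechanism the abstract credits with avoiding gradient conflicts across tasks. The sizes and the soft (dense) gating are illustrative choices of mine, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MoEBackbone(nn.Module):
    def __init__(self, in_dim, out_dim, n_experts=8, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, out_dim))
            for _ in range(n_experts)])
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):                                # x: (B, in_dim)
        weights = torch.softmax(self.gate(x), dim=-1)    # (B, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, out_dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)
```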
[00:00] SkillGen: AI system for robotic learning and automation.
[00:19] Core Innovation: Automated dataset generation from minimal human input.
[01:12] Skill Segmentation: Smart system for breaking down and adapting complex tasks.
[01:59] Hybrid Skill Policy: Framework for controlling robot actions and task completion.
[02:50] Performance Results: 75.4% success rate, generating 24,000+ demonstrations.
[04:18] Real-World Testing: 35% success in direct simulation-to-reality transfer.
[05:01] Current Limitations: Preset sequences and object tracking requirements.
[06:47] HSP Variants: Different approaches to robot control and motion planning.
[07:43] Practical Applications: Successful implementation in pick-and-place tasks.
Authors: Caelan Garrett, Ajay Mandlekar, Bowen Wen, Dieter Fox
Affiliation: NVIDIA
Abstract: Imitation learning from human demonstrations is an effective paradigm for robot manipulation, but acquiring large datasets is costly and resource-intensive, especially for long-horizon tasks. To address this issue, we propose SkillMimicGen (SkillGen), an automated system for generating demonstration datasets from a few human demos. SkillGen segments human demos into manipulation skills, adapts these skills to new contexts, and stitches them together through free-space transit and transfer motion. We also propose a Hybrid Skill Policy (HSP) framework for learning skill initiation, control, and termination components from SkillGen datasets, enabling skills to be sequenced using motion planning at test-time. We demonstrate that SkillGen greatly improves data generation and policy learning performance over a state-of-the-art data generation framework, resulting in the capability to produce data for large scene variations, including clutter, and agents that are on average 24% more successful. We demonstrate the efficacy of SkillGen by generating over 24K demonstrations across 18 task variants in simulation from just 60 human demonstrations, and training proficient, often near-perfect, HSP agents. Finally, we apply SkillGen to 3 real-world manipulation tasks and also demonstrate zero-shot sim-to-real transfer on a long-horizon assembly task. Videos, and more at this https URL.
Link: https://arxiv.org/abs/2410.18907
[00:00] Intro to LVSM: Novel transformer for view synthesis
[00:14] Problems with existing 3D synthesis methods
[00:59] LVSM architecture: encoder-decoder vs decoder-only
[01:41] Performance trade-offs between architectures
[02:13] Using Plücker rays for implicit 3D geometry
[02:49] Zero-shot capabilities with varying input views
[03:23] Training stability and technical solutions
[03:59] Training & evaluation datasets
[04:23] Insights from architecture ablation studies
[05:00] Achieving SOTA with limited GPU resources
[05:25] Future work and research directions
[06:05] Parallels with language models
[06:38] Limitations in aspect ratio handling
Authors: Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu
Affiliations: Cornell University, The University of Texas at Austin, Adobe Research, Massachusetts Institute of Technology
Abstract: We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details: this https URL .
Link: https://arxiv.org/abs/2410.17242
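The Plücker-ray embedding mentioned in the timeline is how LVSM-style models feed camera geometry to a transformer without explicit 3D representations: each pixel is described by its ray direction d and moment o × d, giving a 6-channel map alongside the image. The sketch below shows the standard construction; the conventions (camera-to-world pose, pixel-center offset) are assumptions for this illustration.

```python
import torch

def plucker_rays(K, c2w, H, W):
    """K: (3, 3) intrinsics, c2w: (4, 4) camera-to-world pose. Returns (6, H, W)."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # pixel centers
    dirs_cam = pix @ torch.linalg.inv(K).T                     # (H, W, 3)
    dirs = dirs_cam @ c2w[:3, :3].T                            # rotate to world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)                 # o x d
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)  # (6, H, W)
```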
[00:00] Introduction to 3D Gaussian tracking for robotic manipulation
[00:26] Limitations of current video prediction methods
[01:11] Advantages of 3D Gaussian representation
[02:04] Graph Neural Networks for modeling object dynamics
[02:54] Control particle implementation and computation reduction
[03:42] Physics-based optimization for prediction stability
[04:25] Integration with real-world robotic systems
[05:12] Performance testing across different materials
[05:58] Advantages over traditional physics-based methods
[09:16] Implementation of object detection systems
[10:02] Data collection and synchronization challenges
[14:39] Long-term prediction capabilities and limitations
Authors: Mingtong Zhang, Kaifeng Zhang, Yunzhu Li
Affiliations: University of Illinois Urbana-Champaign, Columbia University
Abstract: Videos of robots interacting with objects encode rich information about the objects' dynamics. However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at this https URL.
Link: https://arxiv.org/abs/2410.18912
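To illustrate the interpolation step the abstract describes ("3D transformations of Gaussians can be interpolated from the motions of control particles"), here is a simple k-NN inverse-distance stand-in of mine, not the paper's exact scheme: motions predicted for sparse control particles are propagated to the dense Gaussian centers so the predicted state can be rendered.

```python
import torch

def interpolate_gaussian_motion(gaussian_xyz, particles_t0, particles_t1, k=4):
    """gaussian_xyz: (G, 3) Gaussian centers; particles_t0/t1: (P, 3) control
    particle positions before/after one predicted dynamics step."""
    dists = torch.cdist(gaussian_xyz, particles_t0)            # (G, P)
    knn_d, knn_idx = dists.topk(k, dim=1, largest=False)       # nearest particles
    weights = 1.0 / (knn_d + 1e-6)
    weights = weights / weights.sum(dim=1, keepdim=True)       # (G, k)
    motion = particles_t1 - particles_t0                       # (P, 3)
    return gaussian_xyz + (weights.unsqueeze(-1) * motion[knn_idx]).sum(dim=1)
```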
[00:00] Introduction
[00:20] Core limitations in robot manipulation: challenges with RL and IL
[01:08] SPIRE's hybrid approach: combining task planning with learning methods
[01:44] TAMP-gated learning: selective application of learned policies
[02:20] Training innovations: warm-starting RL and KL-divergence implementation
[02:59] Results: 35-50% performance gain, 6x more data efficient
[04:04] Multi-worker framework: improved sampling and distribution
[05:11] Future directions: expanding beyond rigid objects
[05:59] Curriculum learning: sequential training strategies
[07:11] Safety improvements: demonstrated through coffee task example
Authors: Zihan Zhou, Animesh Garg, Dieter Fox, Caelan Garrett, Ajay Mandlekar
Affiliations: NVIDIA, University of Toronto, Vector Institute, Georgia Institute of Technology
Abstract: Robot learning has proven to be a general and effective technique for programming manipulators. Imitation learning is able to teach robots solely from human demonstrations but is bottlenecked by the capabilities of the demonstrations. Reinforcement learning uses exploration to discover better behaviors; however, the space of possible improvements can be too large to start from scratch. And for both techniques, the learning difficulty increases proportional to the length of the manipulation task. Accounting for this, we propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to decompose tasks into smaller learning subproblems and second combines imitation and reinforcement learning to maximize their strengths. We develop novel strategies to train learning agents when deployed in the context of a planning system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot manipulation problems. We find that SPIRE outperforms prior approaches that integrate imitation learning, reinforcement learning, and planning by 35% to 50% in average task performance, is 6 times more data efficient in the number of human demonstrations needed to train proficient agents, and learns to complete tasks nearly twice as efficiently. View this https URL for more details.
Link: https://arxiv.org/abs/2410.18065
[00:00] VILA-U: A unified visual AI model
[00:29] Problem: Inefficiency of separate visual modules
[01:11] Vision tower: Novel quantization approach
[02:09] Training strategy: CLIP-based staged learning
[03:03] RVQ technique: Enhanced visual representation
[03:47] Multi-modal training: Text-image-video fusion
[04:35] Performance: Results and current limitations
[05:23] Impact: Contrastive loss effectiveness
[06:03] Generation: Optimal guidance settings
[06:37] Capabilities: Video, Q&A, and image reasoning
[07:14] Applications: Future use cases and scaling
[08:00] Architecture: LLaMA 2 7B integration
[08:48] Data: Quality vs quantity considerations
[09:35] Impact: Unified framework achievements
Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
Affiliations: Tsinghua University, MIT, NVIDIA, UC Berkeley, UC San Diego
Abstract: VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower, which aligns discrete visual tokens with textual inputs during pretraining and thereby enhances visual perception; and the finding that autoregressive image generation can reach quality similar to diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
Link: https://arxiv.org/abs/2409.04429
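A small sketch of residual vector quantization (RVQ), the technique the timeline highlights for VILA-U's vision tower: each codebook quantizes the residual left by the previous one, so a few discrete codes per patch retain far more detail than a single codebook. Codebook sizes and the nearest-neighbor lookup here are illustrative, not the model's exact settings.

```python
import torch
import torch.nn as nn

class RVQ(nn.Module):
    def __init__(self, dim=256, codebook_size=1024, n_codebooks=4):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(n_codebooks)])

    def forward(self, x):                        # x: (N, dim) patch features
        residual, quantized, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            d = torch.cdist(residual, cb.weight)            # distance to every code
            idx = d.argmin(dim=1)
            q = cb(idx)                                     # nearest code vectors
            quantized = quantized + q                       # accumulate reconstruction
            residual = residual - q                         # next codebook refines this
            codes.append(idx)
        return quantized, torch.stack(codes, dim=1)         # (N, dim), (N, n_codebooks)
```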
[00:00] Intro
[00:31] Challenge: Limited multi-humanoid training data
[00:55] CooHOI's two-phase learning framework
[01:49] Object dynamics as implicit agent communication
[02:25] Bounding box strategy for long objects
[03:07] Results: Superior performance vs baselines
[03:42] Ablation study findings: Key system components
[04:18] Limitation: Basic hand manipulation only
[04:57] Impact: New approach to robot cooperation
Authors: Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang
Affiliations: Shanghai AI Laboratory, Tsinghua University, Beihang University, Nanyang Technological University, Carnegie Mellon University
Abstract: Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character object transport through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi-Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types.
Link: https://arxiv.org/abs/2406.14558v2
[00:00] Introduction to SynFlowNet
[00:29] Problem: AI-generated molecules often can't be synthesized
[01:17] Solution: SynFlowNet - uses real chemical reactions
[02:03] GFlowNets: Enables diverse molecule generation
[02:47] Scalability: Morgan fingerprints handle 200K+ compounds
[03:14] Challenge: Solving backward trajectory issues
[04:14] Results: Better synthesis rates and molecular diversity
[05:30] Scale test: Successfully handled 221K molecules
[06:06] Application: Integration with fragment screening
[06:38] Wrap-up: SynFlowNet advances drug design
Authors: Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien Roy, Emmanuel Bengio, Pietro Liò
Affiliations: University of Cambridge, EPFL, Microsoft Research, Valence Labs
Abstract: Generative models see increasing use in computer-aided drug design. However, while performing well at capturing distributions of molecular motifs, they often produce synthetically inaccessible molecules. To address this, we introduce SynFlowNet, a GFlowNet model whose action space uses chemical reactions and buyable reactants to sequentially build new molecules. By incorporating forward synthesis as an explicit constraint of the generative mechanism, we aim to bridge the gap between in silico molecular generation and real-world synthesis capabilities. We evaluate our approach using synthetic accessibility scores and an independent retrosynthesis tool to assess the synthesizability of our compounds, and motivate the choice of GFlowNets through considerable improvement in sample diversity compared to baselines. Additionally, we identify challenges with reaction encodings that can complicate traversal of the MDP in the backward direction. To address this, we introduce various strategies for learning the GFlowNet backward policy and thus demonstrate how additional constraints can be integrated into the GFlowNet MDP framework. This approach enables our model to successfully identify synthesis pathways for previously unseen molecules.
Link: https://arxiv.org/abs/2405.01155v2
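For readers unfamiliar with GFlowNet training, here is a hedged sketch of the trajectory-balance objective commonly used for models like SynFlowNet, where a trajectory is a sequence of (reactant, reaction) actions building a molecule. The reward and the per-step log-probabilities would come from the reward model and the forward/backward policies; below they are just tensor placeholders, and the exact objective used in the paper may differ.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
    """log_Z: learned log-partition scalar; log_pf_steps/log_pb_steps: (T,) per-step
    log-probabilities of the forward/backward policy; log_reward: scalar log R(x)."""
    lhs = log_Z + log_pf_steps.sum()
    rhs = log_reward + log_pb_steps.sum()
    return (lhs - rhs) ** 2

# Example with dummy values for a 3-step synthesis trajectory:
loss = trajectory_balance_loss(torch.tensor(2.0, requires_grad=True),
                               torch.log(torch.tensor([0.3, 0.5, 0.2])),
                               torch.log(torch.tensor([1.0, 0.5, 0.5])),
                               torch.tensor(1.2))
```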
[00:00] Intro to L3DG for 3D modeling
[00:32] Solving room-sized 3D scene complexity
[01:36] VQ-VAE compresses 3D Gaussian representation
[02:41] Generative sparse transposed convolution
[03:20] Latent diffusion for scene generation
[04:30] Visual improvements over baselines
[05:14] Scalability challenges for room-sized scenes
[06:13] Spherical harmonics for view dependence
[06:58] RGB and perceptual loss in training
[07:59] L1 and SSIM for 3D Gaussian optimization
[08:55] Training pipeline overview
[09:59] Densification in 3D Gaussian optimization
[10:49] Hyperparameter selection impact
[11:45] Future research directions
[12:42] Implementation optimization potential
[13:29] Comparison with GANs and diffusion methods
[14:27] Sparse grid representation trade-offs
[15:26] Evaluation datasets
[16:20] Chamfer distance for geometric analysis
[17:13] Applications
Authors: Barbara Roessle, Norman Müller, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, Angela Dai, Matthias Nießner
Affiliations: Technical University of Munich, Meta Reality Labs Zurich
Abstract: We propose L3DG, the first approach for generative 3D modeling of 3D Gaussians through a latent 3D Gaussian diffusion formulation. This enables effective generative 3D modeling, scaling to generation of entire room-scale scenes which can be very efficiently rendered. To enable effective synthesis of 3D Gaussians, we propose a latent diffusion formulation, operating in a compressed latent space of 3D Gaussians. This compressed latent space is learned by a vector-quantized variational autoencoder (VQ-VAE), for which we employ a sparse convolutional architecture to efficiently operate on room-scale scenes. This way, the complexity of the costly generation process via diffusion is substantially reduced, allowing higher detail on object-level generation, as well as scalability to large scenes. By leveraging the 3D Gaussian representation, the generated scenes can be rendered from arbitrary viewpoints in real-time. We demonstrate that our approach significantly improves visual quality over prior work on unconditional object-level radiance field synthesis and showcase its applicability to room-scale scene generation.
Link: https://arxiv.org/abs/2410.13530
[00:00] Intro
[00:33] Combining transformers & diffusion models
[01:12] Key design: Scalable attention blocks (AdaLN)
[02:30] Efficient observation tokenization
[03:45] DiT Block policy architecture overview
[04:20] BiPlay dataset introduction
[04:53] Performance improvements over baselines
[05:30] Key findings from ablations
[06:08] Generalization to different robot types
[06:43] Simulation vs real-world performance
[07:13] Takeaways and future research directions
Authors: Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine
Affiliations: Carnegie Mellon University, University of California, Berkeley
Abstract: In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, the DiT Block Policy, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: this https URL
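The "AdaLN" scalable attention blocks mentioned in the timeline refer to the adaptive-LayerNorm conditioning used in diffusion transformers. The sketch below shows the common pattern (my assumptions on dimensions and the zero-init, not the paper's exact configuration): the observation/denoising-step embedding produces per-block shift, scale, and gate vectors that modulate the normalized activations around the attention and MLP sublayers.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ada = nn.Linear(d_model, 6 * d_model)
        nn.init.zeros_(self.ada.weight)        # block starts near the identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):                # x: (B, T, d), cond: (B, d)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)
```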
[00:00] Introduction to EgoAllo system
[00:38] Challenges in egocentric motion estimation
[01:20] Importance of spatial/temporal invariance
[02:11] Comparison of conditioning parameterizations
[02:57] Integration of hand observations
[03:50] Global alignment phase
[04:28] Guidance losses in sampling
[05:03] Handling longer sequences
[05:35] Evaluation results
[06:30] System limitations and future work
[07:13] Implications for other egocentric tasks
[08:05] Advantages of diffusion models
[09:07] Use of synthetic datasets
[09:53] Promising research directions
[10:43] Impact on future motion capture systems
[11:41] Comparison to traditional methods
[12:31] Improved hand estimation accuracy
[13:25] SLAM data inaccuracies impact
[14:09] Levenberg-Marquardt optimizer usage
[15:14] Adapting to complex environments
Authors: Brent Yi, Vickie Ye, Maya Zheng, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa
Affiliations: UC Berkeley, UT Austin
Abstract: We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters that capture the wearer's actions in the allocentric coordinate frame of the scene. To achieve this, our key insight is in representation: we propose spatial and temporal invariance criteria for improving model performance, from which we derive a head motion conditioning parameterization that improves estimation by up to 18%. We also show how the bodies estimated by our system can improve the hands: the resulting kinematic and temporal constraints result in over 40% lower hand estimation errors compared to noisy monocular estimates.
Project page: https://egoallo.github.io/
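A minimal sketch, under my own assumptions, of the kind of invariant head-motion conditioning the abstract alludes to: instead of absolute world-frame head poses, the diffusion model is conditioned on per-timestep relative transforms, which are unaffected by a global rotation/translation of the trajectory (spatial invariance) and by shifting the time window (temporal invariance). The exact parameterization in the paper differs in detail.

```python
import torch

def relative_head_conditioning(head_poses):
    """head_poses: (T, 4, 4) world-frame head poses from SLAM.
    Returns (T-1, 4, 4) relative transforms T_{t-1}^{-1} @ T_t."""
    prev = torch.linalg.inv(head_poses[:-1])
    return prev @ head_poses[1:]
```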
[00:00] Intro
[00:28] Limitation of existing unified models
[00:57] Janus's decoupled visual encoding solution
[01:18] Advantages of decoupling
[02:03] Janus architecture
[02:50] Three-stage training
[03:41] Ablation studies
[04:23] Extensions for Janus
[05:10] Performance gains
[05:47] Current limitations
[06:31] Impact of simplicity and extensibility
[07:10] Qualitative results
[08:18] Potential applications
[08:52] Key takeaways
Abstract: In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
Authors: Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
Affiliations: DeepSeek-AI, The University of Hong Kong, Peking University
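A very small sketch of the decoupling Janus describes, with placeholder encoders of my own choosing (a Linear standing in for a continuous understanding encoder, an Embedding standing in for the VQ generation tokenizer): the two visual pathways produce embeddings independently, but both are consumed by the same autoregressive transformer.

```python
import torch
import torch.nn as nn

class DecoupledVisualLM(nn.Module):
    def __init__(self, d_model=512, vq_vocab=8192, text_vocab=32000):
        super().__init__()
        self.und_encoder = nn.Linear(768, d_model)        # placeholder understanding encoder + adaptor
        self.gen_embed = nn.Embedding(vq_vocab, d_model)  # placeholder VQ generation tokenizer
        self.text_embed = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)   # single shared transformer
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, vq_vocab)

    def forward(self, text_ids, image_feats=None, image_codes=None):
        parts = [self.text_embed(text_ids)]
        if image_feats is not None:              # understanding pathway (continuous features)
            parts.insert(0, self.und_encoder(image_feats))
        if image_codes is not None:              # generation pathway (discrete codes)
            parts.append(self.gen_embed(image_codes))
        h = self.backbone(torch.cat(parts, dim=1))
        return self.text_head(h), self.image_head(h)
```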
[00:00] Introduction
[00:23] Computational cost of traditional diffusion models
[00:59] Reducing iterations in image generation
[01:06] Shortcut models
[01:39] Training process and self-consistency property
[02:22] Advantages over other methods
[03:05] Results on image generation benchmarks
[03:45] Application to robotic control
[04:15] Limitations and future work
[04:54] Best practices
Authors: Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel
Affiliations: UC Berkeley
Abstract: Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce shortcut models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.
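To make the self-consistency property concrete, here is a hedged sketch of the training target the abstract describes: the network is conditioned on the step size d, and a single step of size 2d is trained to match the result of two consecutive steps of size d, with the two-step target held fixed. `model(x, t, d)` is a placeholder for a network predicting the update direction; the full method also mixes in a flow-matching loss at the smallest step size.

```python
import torch

def shortcut_consistency_loss(model, x_t, t, d):
    with torch.no_grad():                        # two small steps form the target
        s1 = model(x_t, t, d)
        x_mid = x_t + d * s1
        s2 = model(x_mid, t + d, d)
        target = (s1 + s2) / 2                   # average direction over the 2d jump
    pred = model(x_t, t, 2 * d)                  # one big step must agree with it
    return ((pred - target) ** 2).mean()
```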