Ep. 247 - Part 3 - June 13, 2024

TechcraftingAI Computer Vision

Brad Edwards

315 episodes

TechcraftingAI Computer Vision brings you summaries of the latest arXiv research daily. Research is read by your virtual host, Sage. The podcast is produced by Brad Edwards, an AI Engineer from Vancouver, BC, and a graduate student of computer science studying AI at the University of York. Thank you to arXiv for use of its open access interoperability.

Technology

All content for TechcraftingAI Computer Vision is the property of Brad Edwards and is served directly from their servers with no modification, redirects, or rehosting. The podcast is not affiliated with or endorsed by Podjoint in any way.

51 minutes 59 seconds

ArXiv Computer Vision research for Thursday, June 13, 2024.

00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data

01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth

03:08: GGHead: Fast and Generalizable 3D Gaussian Heads

04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

06:34: Towards Vision-Language Geo-Foundation Model: A Survey

08:11: SimGen: Simulator-conditioned Driving Scene Generation

09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image

15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis

16:29: Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

19:39: Real-Time Deepfake Detection in the Real-World

21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant

24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

33:16: Towards Evaluating the Robustness of Visual State Space Models

34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

41:40: Explore the Limits of Omni-modal Pretraining at Scale

42:46: Interpreting the Weight Space of Customized Diffusion Models

43:58: Depth Anything V2

45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

48:11: Rethinking Score Distillation as a Bridge Between Image Distributions

49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding