Ep. 240 - Part 2 - June 6, 2024

TechcraftingAI Computer Vision

Brad Edwards

315 episodes

TechcraftingAI Computer Vision brings you summaries of the latest arXiv research daily. Research is read by your virtual host, Sage. The podcast is produced by Brad Edwards, an AI Engineer from Vancouver, BC, and a graduate student of computer science studying AI at the University of York. Thank you to arXiv for use of its open access interoperability.

Technology

52 minutes 52 seconds

arXiv Computer Vision research for Thursday, June 6, 2024.

00:20: M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data

02:34: Understanding Information Storage and Transfer in Multi-modal Large Language Models

04:27: Conv-INR: Convolutional Implicit Neural Representation for Multimodal Visual Signals

06:01: Localized Gaussian Point Management

07:59: A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation

09:25: GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions

11:07: MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

13:02: ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

14:39: VideoTetris: Towards Compositional Text-to-Video Generation

16:00: SpectralZoom: Efficient Segmentation with an Adaptive Hyperspectral Camera

17:04: Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

18:51: Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

20:05: Vision-LSTM: xLSTM as Generic Vision Backbone

21:01: ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation

22:03: ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

23:43: Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

25:32: Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

27:23: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

28:33: DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

30:24: SF-V: Single Forward Video Generation Model

31:51: ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

34:06: Parameter-Inverted Image Pyramid Networks

35:50: Coarse-To-Fine Tensor Trains for Compact Visual Representations

37:23: BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

38:37: DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

40:24: Coherent Zero-Shot Visual Instruction Generation

41:17: Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

42:58: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

44:56: GLACE: Global Local Accelerated Coordinate Encoding

46:43: Interpreting the Second-Order Effects of Neurons in CLIP

48:03: Learning 1D Causal Visual Representation with De-focus Attention Networks

49:41: Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

51:14: Stereo-Depth Fusion through Virtual Pattern Projection