Jailbreaks, Collaboration, and Cognitive Shifts

Today in arXiv AI

Scot Bearss

7 episodes

3 days ago

Today in arXiv AI is your daily deep dive into the cutting edge of artificial intelligence. Every morning, we unpack the latest breakthroughs in LLM architectures, agentic AI, multimodal models, scaling strategies, safety research, and more, mixing expert analysis, lively debate, and real‑world use cases. Whether you’re an AI practitioner, tech leader, or just curious about what’s next, we break down complex papers (and what they mean for you) into a fast‑paced, two‑host conversation you’ll actually enjoy. I am an independent creator and not affiliated with arXiv. Sources are linked in the episode descriptions.

Technology

1 hour 2 minutes

3 months ago

Generated by Google NotebookLM.

This episode explores 14 new research papers at the edge of LLM behavior, safety, collaboration, and reasoning:

Beyond passive replies – CollabLLM rethinks how LLMs interact across turns, training them to uncover user intent and proactively collaborate.
Red teaming, automated – RedCoder weaponizes multi-turn attacks against code models, training autonomous agents to probe for unsafe generations.
Synthesis by simulation – CodeEvo builds training data by pairing coder and reviewer agents in feedback loops, automating high-quality instruction-code generation.
Internal deception – Linear probes and SAEs reveal how truthful features flip when models are prompted to lie.
Defense by deflection – SDeflection avoids outright refusal and instead answers malicious prompts with innocuous replies, lowering jailbreak success without hurting helpfulness.
Attack by persona – A genetic algorithm crafts persona prompts that reduce refusal rates and supercharge jailbreaks, especially when stacked with other methods.
Agents with evolving maps – CoEx lets planning agents continually revise their world models, co-adapting structure and strategy over time.
Interfaces for oversight – Magentic-UI powers human-in-the-loop agentic systems with long-term memory, action guards, and collaborative controls.
Measuring long-context reasoning – NeedleChain moves past “needle-in-a-haystack” with tasks that require full semantic integration across long input windows.
Bias as an exploit – CognitiveAttack uncovers how stacking psychological biases in prompts dramatically increases LLM jailbreak success.
Patching with logic – RePaCA guides LLMs to assess bug fixes using chain-of-thought, boosting accuracy and explainability in patch correctness tasks.
Federated fine-tuning at scale – H2Tune handles architectural and task diversity across clients with a novel decomposition and disentanglement scheme.
Multimodal mastery – MoCHA uses sparse MoE connectors and hierarchical attention to align vision with language and reduce hallucinations.
Where demos belong – A detailed analysis of demo position bias finds that demonstration ordering in prompts drastically alters LLM accuracy and stability.
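To make the demo-position idea from the last item concrete, here is a minimal Python sketch that assembles the same instruction, demonstrations, and query with the demo block placed in different spots. The function name build_prompt and the three position labels are hypothetical, chosen only for illustration; they are not taken from the paper, and the sketch does not call any model or measure accuracy.

```python
def build_prompt(instruction, demos, query, position):
    """Assemble a prompt with the demo block at a chosen position.

    `position` is one of "before_instruction", "after_instruction",
    or "after_query". These labels are hypothetical, for illustration only.
    """
    # Render each (question, answer) demonstration as a short Q/A pair.
    demo_block = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    query_block = f"Q: {query}\nA:"

    # Same three pieces every time; only their order changes.
    layouts = {
        "before_instruction": [demo_block, instruction, query_block],
        "after_instruction": [instruction, demo_block, query_block],
        "after_query": [instruction, query_block, demo_block],
    }
    if position not in layouts:
        raise ValueError(f"unknown position: {position!r}")
    return "\n\n".join(layouts[position])


# Toy demonstrations, invented for this sketch.
demos = [("2 + 2", "4"), ("3 + 5", "8")]
variants = {
    pos: build_prompt("Answer the arithmetic question.", demos, "7 + 6", pos)
    for pos in ("before_instruction", "after_instruction", "after_query")
}
```

Each entry in variants carries identical content in a different order, so any difference in a model's answers across them could be attributed to placement rather than to the demonstrations themselves.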

Together, these papers uncover the subtle mechanics that shape LLM trustworthiness, the strategies that make or break jailbreak defenses, and the design patterns emerging in agentic interfaces and federated learning.

Sources:

CollabLLM: arXiv:2406.04425
RedCoder: arXiv:2407.00482
CodeEvo: arXiv:2407.00483
When Truthful Representations Flip Under Deceptive Instructions: arXiv:2407.00495
Strategic Deflection: arXiv:2407.00496
Enhancing Jailbreak Attacks via Persona Prompts: arXiv:2407.00499
CoEx: arXiv:2407.00508
Magentic-UI: arXiv:2407.00510
NeedleChain: arXiv:2407.00518
CognitiveAttack: arXiv:2407.00519
RePaCA: arXiv:2407.00523
H2Tune: arXiv:2407.00529
MoCHA: arXiv:2407.00530
Where to show Demos in Your Prompt: arXiv:2407.00533