Qwen3-Next is a new large-scale language model (LLM) from Alibaba with 80 billion parameters, of which only 3 billion are activated during inference, thanks to a hybrid attention mechanism and a sparse Mixture-of-Experts (MoE) design. It offers outstanding efficiency, with up to 10x higher speed than previous models, while achieving higher accuracy on ultra-long-context tasks and outperforming the Gemini-2.5-Flash-Thinking model on complex reasoning tests.
Why is Qwen3-Next good and what makes it special?
Accessibility and open source:
Qwen3-Next models are available through Hugging Face, ModelScope, Alibaba Cloud Model Studio, and NVIDIA API Catalog. Its open source nature, released under the Apache 2.0 license, encourages innovation and democratizes access to cutting-edge AI technology.
Cost-effectiveness:
- Qwen3-Next not only delivers higher accuracy, but is also significantly more efficient than comparable models
- It can be trained with less than 10% of the computational cost (9.3% to be exact) compared to the Qwen3-32B model. This reduced training cost has the potential to democratize AI development.
Faster inference:
- Only 3 billion (about 3.7%) of its 80 billion parameters are active during the inference phase. This dramatically reduces the FLOPs/token ratio while maintaining model performance
FLOPs stands for floating-point operations, a measure of computational work. For AI models, FLOPs/token indicates how many floating-point operations are required to process a single text "token" (a word or word fragment); a rough calculation is shown below.
- For shorter contexts, it provides up to 7x speedup in the prefill phase (processing the prompt up to the first output token) and 4x speedup in the decode phase (generating the subsequent tokens).
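To make the FLOPs/token point concrete, here is a rough back-of-the-envelope calculation. It uses the common approximation that a forward pass costs about 2 floating-point operations per active parameter per token; the exact figures for Qwen3-Next are not published here, so treat the numbers as illustrative only.

```python
# Rough illustration of why sparse activation cuts FLOPs/token.
# Assumption: a forward pass costs ~2 FLOPs per *active* parameter per token
# (a common rule of thumb, ignoring attention-specific terms).

ACTIVE_PARAMS = 3e9    # ~3B parameters active per token (sparse MoE)
TOTAL_PARAMS = 80e9    # 80B parameters in total

flops_per_token_sparse = 2 * ACTIVE_PARAMS   # ~6 GFLOPs/token
flops_per_token_dense = 2 * TOTAL_PARAMS     # ~160 GFLOPs/token if every weight were used

print(f"sparse: ~{flops_per_token_sparse / 1e9:.0f} GFLOPs/token")
print(f"dense equivalent: ~{flops_per_token_dense / 1e9:.0f} GFLOPs/token")
print(f"reduction: ~{flops_per_token_dense / flops_per_token_sparse:.0f}x fewer FLOPs per token")
```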
Innovative architecture:
- Hybrid attention mechanism, which enables extremely efficient context modeling for ultra-long contexts.
- Sparse Mixture-of-Experts (MoE) system: 512 experts in total, of which 10 routed experts and 1 shared expert are active for each token (a routing sketch follows below).
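To illustrate what "sparse" means in practice, here is a minimal, generic sketch of top-k expert routing with an always-on shared expert. The layer sizes, router, and per-token dispatch loop are illustrative assumptions, not Qwen3-Next's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: route each token to the top-k of many experts,
    plus one always-on shared expert (illustrative, not Qwen3-Next's code)."""

    def __init__(self, d_model=64, n_experts=512, top_k=10):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.shared_expert = nn.Linear(d_model, d_model)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # pick 10 experts per token
        weights = F.softmax(weights, dim=-1)           # normalize over the chosen experts
        outputs = []
        for t in range(x.size(0)):                     # naive per-token dispatch, for clarity
            y = self.shared_expert(x[t])               # shared expert sees every token
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])
            outputs.append(y)
        return torch.stack(outputs)

moe = SparseMoE()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64]); only 10 of 512 experts ran per token
```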
Outstanding performance:
- Outperforms Qwen3-32B-Base in most benchmarks, while using less than 10% of its computational cost
- Very close in performance to Alibaba's flagship 235B parameter model.
- Performs particularly well in handling ultra-long context tasks, up to 256,000 tokens. Furthermore, the context length can be extended to 1 million tokens using the YaRN method.
- Qwen3-Next-80B-A3B-Thinking excels at complex reasoning tasks. It outperforms mid-range Qwen3 variants and even outperforms the closed-source Gemini-2.5-Flash-Thinking in several benchmarks
Multilingual capabilities:
The automatic speech recognition model, Qwen3-ASR-Flash, performs accurate transcription in 11 major languages and several Chinese dialects
Agent capabilities:
Excellent for tool-calling tasks and agent-based workflows
Links
Qwen3-Next: Towards Ultimate Training & Inference Efficiency: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list
Hugging Face model: https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
ModelScope: https://modelscope.cn/models/Qwen/Qwen3-Next-80B-A3B-Thinking
OpenRouter: https://openrouter.ai/qwen
Qwen Chat: https://chat.qwen.ai/
HunyuanImage 2.1 is an open-source text-to-image diffusion model capable of generating ultra-high-resolution (2K) images. It stands out with its dual text encoder, a two-stage architecture that includes a refinement model, and a PromptEnhancer module for automatic prompt rewriting, all of which contribute to text-to-image consistency and more detailed control.
What does HunyuanImage 2.1 image generation model do?
- High resolution: Generates ultra-high resolution (2K) images with cinematic quality composition
- Supports various aesthetics, from photorealism to anime, comics, and vinyl figures, providing outstanding visual appeal and artistic quality.
- Multilingual prompt support: Natively supports both Chinese and English prompts. The multilingual ByT5 text encoder integrated into the model improves text rendering capabilities and text-image alignment.
- Advanced semantics and granular control: It can handle ultra-long and complex prompts, up to 1000 tokens. It precisely controls the generation of multiple objects with different descriptions within a single image, including scene details, character poses, and facial expressions.
- Flexible aspect ratios: It supports various aspect ratios such as 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3
HunyuanImage 2.1 stands out from other models with several technological innovations and unique features:
- Two-stage architecture:
1. Base text-to-image model: The first stage uses two text encoders: a multimodal large language model (MLLM) to improve image-text alignment, and a multilingual, character-aware encoder to improve text rendering across languages. This stage is built around a single- and dual-stream diffusion transformer (DiT) with 17 billion parameters, and uses reinforcement learning from human feedback (RLHF) to optimize aesthetics and structural coherence.
2. Refiner Model: The second stage introduces a refiner model that further improves image quality and clarity while minimizing artifacts.
- High-compression VAE (Variational Autoencoder): The model uses a VAE with a 32x spatial compression ratio, significantly reducing computational cost. This allows it to generate 2K images with the same token length and inference time that other models require for 1K images (see the token-count arithmetic after this list).
- PromptEnhancer module (prompt rewriting model): An innovative module that automatically rewrites user prompts, enriching them with detailed, descriptive information to improve prompt fidelity and visual quality.
- Extensive training data and captioning: It uses an extensive dataset and structured captions that involve multiple expert models to significantly improve text-to-image matching. It also employs an OCR agent and IP RAG to address the shortcomings of VLM captioners in dense texts and world knowledge descriptions, and a two-way verification strategy to ensure caption accuracy.
- Open source model: HunyuanImage 2.1 is open source, and the inference code and pre-trained weights were released on September 8, 2025
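As a quick illustration of why 32x spatial compression matters, the arithmetic below compares latent token counts. It assumes one DiT token per latent spatial position and uses a more typical 16x ratio as the baseline; the exact patching details of HunyuanImage 2.1 may differ.

```python
# Back-of-the-envelope: why 32x spatial compression lets a 2K image cost
# about as many latent tokens as a 1K image does at a more typical 16x ratio.
# (Illustrative arithmetic; HunyuanImage 2.1's exact patching may differ.)

def latent_tokens(height, width, compression):
    """One token per latent spatial position after `compression`-fold downsampling."""
    return (height // compression) * (width // compression)

print(latent_tokens(2048, 2048, 32))  # 4096 tokens for a 2K image at 32x
print(latent_tokens(1024, 1024, 16))  # 4096 tokens for a 1K image at 16x
```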
Links
Twitter: https://x.com/TencentHunyuan/status/1965433678261354563
Blog: https://hunyuan.tencent.com/image/en?tabIndex=0
PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt: https://hunyuan-promptenhancer.github.io/
GitHub PromptEnhancer: https://github.com/Hunyuan-PromptEnhancer/PromptEnhancer
PromptEnhancer Paper: https://www.arxiv.org/pdf/2509.04545
Hugging Face HunyuanImage-2.1: https://huggingface.co/tencent/HunyuanImage-2.1
GitHub: https://github.com/Tencent-Hunyuan/HunyuanImage-2.1
Checkpoints: https://github.com/Tencent-Hunyuan/HunyuanImage-2.1/blob/main/ckpts/checkpoints-download.md
Hugging Face demo: https://huggingface.co/spaces/tencent/HunyuanImage-2.1
RunPod: https://runpod.io?ref=2pdhmpu1
Leaderboard-Image: https://github.com/mp3pintyo/Leaderboard-Image
Google Stitch is an AI-powered tool designed for app developers to generate user interfaces (UIs) for mobile and web applications. It turns ideas into UIs. By default, it uses Google DeepMind's latest large language model, Gemini 2.5 Pro.
What is Google Stitch good for?
- Generate UIs: Easily create UIs using natural language prompts. No coding or design knowledge required.
- Simplify design process: Speed up design iterations and allow you to go from concepts to working UI designs without having to start from scratch. It can create complete app designs in minutes.
- Customization and references: Upload images, wireframes, or files that the AI can use as reference material, giving you more control over the output.
- Export and code: Export designs directly to Figma and generate clean, tidy HTML and CSS front-end code. Themes can be quickly edited, and export to Figma is available in standard mode.
- Versatile: Not just for apps, but also for websites, landing pages, dashboards, and admin panels.
- Business opportunities: Great for rapid prototyping. Web design agencies, freelancers, and app development companies can use it to speed up their workflows, showcase prototypes, or create internal tools.
What’s new?
Google Stitch has received several new updates that make it even better:
- Gemini 2.5 Pro default mode: Stitch now defaults to Gemini 2.5 Pro experimental mode. This mode is almost three times faster than standard mode and provides more creative, easier-to-edit outputs. Users preferred the results of this mode 3x more.
- Larger experimental mode quota: In experimental mode, you can use up to 100 generations per month (previously 50). In standard mode, this limit is 350 generations. It is important to note that these limits are subject to change.
- Canvas update: This is a fundamental new feature that allows you to see your entire user flow at once. Great for tracking the state of components and ensuring design consistency across your project.
- Multi-select: This powerful new feature allows you to edit multiple screens at once with a single command. Simply hold down the SHIFT key, click and select the screens you want to edit, then enter a prompt and it will apply your changes to all selected screens. This is perfect for creating consistent versions or updating your entire user flow in seconds.
- Faster workflows: Suggested responses appear in chat, speeding up the process.
- Better designs: Improved quality and consistency of generated UIs.
- Refreshed interface: The entire product has a new, clean UI.
Why use it?
- Completely free: It’s currently completely free. All you need is a Google account to get started.
- Ease of use: No coding or design skills required, just text commands.
- Speed and efficiency: Accelerates the design process, allowing you to iterate quickly and turn concepts into reality in minutes.
- Quality: Generates high-quality, professional-looking UIs that are creative and easy to edit.
- Consistency: Easily ensure design consistency across multiple screens and throughout the user journey with the new Canvas and Multi-select features.
- Business potential: Free access and rapid prototyping capabilities offer businesses a huge opportunity to make money by providing app design services or quickly validating their own projects.
Links
Twitter Stitch by Google: https://x.com/stitchbygoogle
Blog: https://stitch.withgoogle.com/home
Prompt guide: https://discuss.ai.google.dev/t/stitch-prompt-guide/83844
Stitch: https://stitch.withgoogle.com/
Kimi K2 0905 is the latest update to Moonshot AI’s large-scale Mixture-of-Experts (MoE) language model, which is well-suited for complex agent-like tasks. With its advanced coding and reasoning capabilities, and extended context length, it delivers outstanding performance in the field of artificial intelligence.
- Agent-like intelligence: It doesn’t just answer questions, it also performs actions. This includes advanced tool usage, reasoning, and code synthesis. It automatically understands how to use given tools to complete a task without having to write complex workflows.
- Long-context inference: Supports long-context inference of up to 256k tokens, which has been extended from the previous 128k.
- Coding: It has improved agent-like coding, with higher accuracy and better generalization across frameworks. It also offers advanced front-end coding with more aesthetic and functional outputs for web, 3D and related tasks. It performs well on coding benchmarks such as LiveCodeBench and SWE-bench.
- Reasoning and Knowledge: Achieves state-of-the-art performance in frontier knowledge, mathematics, and coding among non-thinking models. It performs well on reasoning benchmarks such as ZebraLogic and GPQA.
- Tool Usage: Performs well on tool usage benchmarks such as Tau2 and AceBench. To strengthen tool invocation capabilities, the model can independently decide when and how to invoke its tools.
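As a hypothetical illustration of the tool-calling flow, the sketch below uses the standard openai Python client against an OpenAI-compatible endpoint, which Moonshot's platform provides. The base URL, model identifier, and the get_weather tool are assumptions and should be checked against the platform documentation.

```python
# Hypothetical sketch: calling Kimi K2 0905 with a tool definition through an
# OpenAI-compatible endpoint. Base URL and model name are assumptions; check
# the Moonshot platform docs for the exact values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.ai/v1",  # assumed OpenAI-compatible endpoint
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",               # illustrative tool, not provided by the API
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="kimi-k2-0905-preview",            # assumed model identifier
    messages=[{"role": "user", "content": "Do I need an umbrella in Budapest today?"}],
    tools=tools,
)

# The model decides on its own whether to call the tool or answer directly.
message = response.choices[0].message
if message.tool_calls:
    print("Tool requested:", message.tool_calls[0].function.name,
          message.tool_calls[0].function.arguments)
else:
    print(message.content)
```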
Links
Twitter: https://x.com/Kimi_Moonshot/status/1963802687230947698
Kimi-K2: https://moonshotai.github.io/Kimi-K2/
Hugging Face: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905
Tech report: https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf
User Manual: https://platform.moonshot.ai/docs/introduction#text-generation-model
Kimi Chat: https://www.kimi.com/
OpenRouter MoonshotAI: Kimi K2 0905: https://openrouter.ai/moonshotai/kimi-k2-0905
Groq: https://groq.com/blog/introducing-kimi-k2-0905-on-groqcloud
Tencent has unveiled its AI-powered tool called HunyuanWorld-Voyager, which can transform a single image into a directional, 3D-consistent video—providing the thrill of exploration without the need for actual 3D modeling. It’s a clever solution: by blending RGB and depth data, it preserves the position of objects from different angles, creating the illusion of spatial consistency.
The model aims to create 3D-consistent point cloud sequences from a single image with user-defined camera movement for world exploration. The framework also includes a data acquisition mechanism that automates the prediction of camera angles and metric depth for videos, allowing for the creation of large amounts of annotated training data. Voyager has demonstrated outstanding performance in scene video generation and 3D world reconstruction, outperforming previous methods in terms of geometric coherence and visual quality.
The results aren't true 3D models, but they achieve a similar effect: The AI tool generates 2D video images that maintain spatial consistency as if the camera were moving in a real 3D space. Each generation results in just 49 frames—roughly two seconds of video—although Tencent says multiple clips can be strung together to create "multiple-minute" sequences. Objects remain in the same relative position as the camera moves around them, and the perspective changes correctly, as would be expected in a real 3D environment. While the output is video with depth maps rather than true 3D models, this information can be transformed into 3D point clouds for reconstruction purposes. The system accepts a single input image and a user-defined camera trajectory. Users can specify camera movements, such as forward, backward, left, right, or pan, via the provided interface. The system combines image and depth data with a memory-efficient "world cache" to produce video sequences that reflect user-defined camera movements.
Voyager is trained to recognize and reproduce patterns of spatial consistency, but with an added geometric feedback loop. As it creates each frame, it converts the output into 3D points, then projects those points back into 2D to reference subsequent frames.
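The feedback loop rests on standard pinhole-camera geometry: lift each frame's depth map to 3D points, then re-project those points into the next camera pose. The NumPy sketch below illustrates that generic math with an assumed intrinsic matrix and a toy depth map; it is not Voyager's actual code.

```python
import numpy as np

# Generic pinhole-camera math behind the idea: lift a depth map to 3D points,
# then re-project those points under a new camera pose so the next frame can be
# conditioned on already-generated geometry. Intrinsics here are assumed.

H, W = 4, 6
K = np.array([[100.0, 0.0, W / 2],        # assumed focal length / principal point
              [0.0, 100.0, H / 2],
              [0.0, 0.0, 1.0]])

depth = np.full((H, W), 2.0)              # toy depth map: a flat wall 2 m away

# Back-project every pixel (u, v, depth) into camera-space 3D points.
v, u = np.mgrid[0:H, 0:W]
pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
points = np.linalg.inv(K) @ pixels * depth.reshape(-1)                 # 3 x N

# Move the camera 0.1 m to the right and re-project the same points.
t = np.array([[0.1], [0.0], [0.0]])
reprojected = K @ (points - t)
reprojected = reprojected[:2] / reprojected[2]                         # back to pixel coordinates

print(points.T[:2])       # a couple of the lifted 3D points
print(reprojected.T[:2])  # where they land in the shifted view
```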
The model comes with significant licensing restrictions. Like Tencent's other Hunyuan models, the license prohibits use in the European Union, the United Kingdom, and South Korea. In addition, commercial deployments exceeding 100 million monthly active users require separate licensing from Tencent.
Links
HunyuanWorld-Voyager: https://3d-models.hunyuan.tencent.com/world/
Research paper: https://3d-models.hunyuan.tencent.com/voyager/voyager_en/assets/HYWorld_Voyager.pdf
Hugging Face: https://huggingface.co/tencent/HunyuanWorld-Voyager
GitHub: https://github.com/Tencent-Hunyuan/HunyuanWorld-Voyager
RunPod: https://runpod.io?ref=2pdhmpu1
RunPod demo: https://www.youtube.com/watch?v=WudXnf8Gogc
Z.ai introduces its latest flagship models, GLM-4.5 and GLM-4.5-Air, which take the capabilities of intelligent assistants to a new level. These models uniquely combine deep analysis, master-level coding, and autonomous task execution. Their special feature is hybrid operation: with a single click, you can switch between the "Analyze" mode, intended for complex, thoughtful problem solving, and the "Instant" mode, which provides lightning-fast, immediate answers. This versatility, combined with market-leading performance, gives developers and users a more efficient and flexible tool than ever before.
In the most important ranking, which summarizes 12 industry tests, the GLM-4.5 took 3rd place among the world's leading models (OpenAI, Anthropic, Google DeepMind), while the smaller but highly efficient GLM-4.5-Air took 6th place. And in terms of autonomous task execution (agent capabilities), GLM-4.5 is the second best on the market.
Capabilities in detail
🧠 Reasoning and problem solving
GLM-4.5 does not shy away from even the most complex logical, mathematical, or scientific problems. By turning on the "Analyze" mode, the model can think deeply about the task and arrive at the correct solution with impressive accuracy.
It achieved outstanding results on such difficult tests as AIME 24 (91.0%) or MATH 500 (98.2%).
Its performance also surpasses the OpenAI o3 model in several areas.
💻 Master-level coding
- GLM-4.5 is the perfect partner for developers, whether it is building a completely new project or detecting errors in an existing code base.
- It outperforms GPT-4.1 and Gemini-2.5-Pro in the SWE-bench Verified test (which measures real-world software development tasks).
- It is capable of creating complex, full-stack web applications from database management to backend deployment.
- It leads the market with a 90.6% success rate in tool calling, which means it reliably carries out the coding tasks entrusted to it.
🤖 Autonomous task execution (Agent capabilities)
- This model is not just a Q&A assistant. It is capable of independently performing complex tasks: browsing the Internet, collecting data, and even creating presentations or spectacular posters from the information it finds.
- Its huge, 128,000-token context window allows it to handle large amounts of information at once.
- It outperforms Claude-4-Opus in web browsing tests.
Under the hood: Performance and architecture
The secret to GLM-4.5's impressive performance is its modern Mixture-of-Experts (MoE) architecture. This technology allows the model to activate only the relevant "expert" parts depending on the type of task, thus using the computational capacity extremely efficiently. Thanks to this, GLM-4.5 delivers outstanding performance for its size and is much more parameter-efficient than many of its competitors.
Open source
Both GLM-4.5 and GLM-4.5-Air are open source. They are freely available to anyone, even for commercial purposes, under the MIT license. The models are available on the Z.ai platform, via API, and can be downloaded from HuggingFace and ModelScope.
Multilingualism, Translation, and Security
The model has been trained on a large number of multilingual documents, so it performs well not only in English, but also in Chinese and many other languages. It is particularly strong in understanding cultural references and Internet slang, so its translation capabilities often outperform even targeted translation programs.
Links
GLM-4.5: https://z.ai/blog/glm-4.5
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models: https://arxiv.org/pdf/2508.06471
GitHub: https://github.com/zai-org/GLM-4.5
Hugging Face: https://huggingface.co/collections/zai-org/glm-45-687c621d34bda8c9e4bf503b
OpenRouter: https://openrouter.ai/z-ai
Chat Z.ai: https://chat.z.ai/
Gemini 2.5 Flash Image, also known as Nano Banana, is an advanced, multimodal image creation and editing model that can interpret both text and image commands, allowing users to create, edit, and iterate on images in a conversational manner. Its main strengths include maintaining character consistency across scenes, creatively combining multiple images, and fine-tuning details such as backgrounds or objects using natural language commands. The model excels at creating photorealistic images, stylized illustrations, product photos, and even logos with readable text.
Key Capabilities and Uses
Gemini 2.5 Flash Image is a versatile tool that excels in the following key areas:
1. Image creation and editing using natural language:
- Conversational editing: The model allows for an ongoing dialogue with the user, who can refine the image step by step until it is perfect. For example, you can request that a car be changed in color and then converted into a convertible in a subsequent step.
- Detailed Control: You can use simple text commands to modify the details of the image, such as changing the background, replacing an object, correcting a caption, or even changing the time of day (a minimal API sketch follows this capability list).
- Character Consistency: The model can consistently portray the same character in different situations, poses, outfits, or even decades. You can depict the same person as a teacher, a sculptor, or a baker.
2. Creative and Complex Image Manipulation
- Combining Multiple Images (Composition): You can upload up to three images to combine their elements into a new image. For example, you can combine a portrait of a woman and a photo of a dress to create an image where the woman is wearing the dress
- Style and Texture Transfer: You can transfer the style, color scheme, or texture of one image to another while maintaining the form of the original subject. For example, you can recreate a city photo in the style of Vincent van Gogh's "Starry Night"
- Pushing creative boundaries: The model allows you to experiment with different design trends. You can build a visual design from a blueprint, or you can decorate a room in a completely new style based on color samples
3. Professional and specific use cases:
- Accurate text rendering: The model (thanks to Imagen 4 technology) is outstanding at creating readable and aesthetic text within images, such as logos or posters.
- Photorealistic scenes and product photos: Create professional-quality, realistic images with detailed descriptions that include photography terms (e.g. camera angle, lens type, lighting).
- Visual storytelling: With a single prompt, you can generate multiple interconnected images that tell a complete story, such as a comic book or a cinematic sequence.
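As a minimal sketch of conversational image editing through the API, the example below assumes the google-genai Python SDK and the preview model ID "gemini-2.5-flash-image-preview"; the file names are placeholders, and the response handling follows the SDK's documented generate_content pattern, which may change.

```python
# Hypothetical sketch: edit an existing image with a text instruction via the
# google-genai SDK. The model ID and file names are assumptions.
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
car = Image.open("car.png")  # the image you want to edit

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",   # assumed preview model ID
    contents=["Turn this car into a red convertible, keep the background unchanged.", car],
)

# The response can mix text parts and image parts; save any returned image bytes.
for part in response.candidates[0].content.parts:
    if getattr(part, "inline_data", None):
        with open("car_edited.png", "wb") as f:
            f.write(part.inline_data.data)
    elif getattr(part, "text", None):
        print(part.text)
```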
Why use Gemini 2.5 Flash Image?
The model has several advantages:
- User-friendly and intuitive: No image editing skills required; natural language, conversation-based guidance allows anyone to create complex image content.
- Flexibility and iteration: Conversation-based refinement eliminates the need to start the process over every time you want to change a small detail.
- Excellent quality and performance: The model represents state-of-the-art technology and is ranked at the forefront of both text-to-image and image editing categories according to user reviews (e.g. LMArena).
- Responsible operation: Each generated image contains an invisible digital watermark (SynthID) that identifies that the image was created by artificial intelligence. In addition, strict content filtering procedures are used to minimize harmful content.
Links
Gemini 2.5 Flash Image: https://deepmind.google/models/gemini/image/
Gemini: https://gemini.google.com/
Google AI Studio: https://aistudio.google.com/
GitHub Mp3Pintyo aspect-ratio photos: https://github.com/mp3pintyo/NanoBanana
Qwen-Image is a foundation image generation model developed by Alibaba's Qwen team. It has two outstanding capabilities: complex text rendering and precise image editing.
Qwen-Image can render text, even long paragraphs, in images with very high quality. It is particularly good at handling English and Chinese, where it is exceptionally accurate. It preserves the typographic details, layout, and contextual harmony of texts.
Precise image editing: The model allows for style transfer, adding or removing objects, refining details, editing text within images, and even manipulating human poses. This capability makes almost professional-level editing accessible to everyday users.
This is a 20 billion-parameter MMDiT (Multimodal Diffusion Transformer) model. Open source under the Apache 2.0 license.
Availability: Natively supported in ComfyUI, but also available via Hugging Face and ModelScope, and can be tried as a demo on Qwen Chat
Performance: Independently evaluated, it shows outstanding results in both image generation and image editing, and is currently one of the best open source models on the market.
The MMDiT (Multimodal Diffusion Transformer) is the central, fundamental element or "backbone" of the Qwen-Image image generation model. (This approach has also proven effective in other models, such as the FLUX and Seedream series.)
Now let's see what this means exactly:
Imagine that the model works like a sculptor who starts from random noise (like a grainy TV broadcast). The essence of the diffusion model is to gradually remove this noise step by step until a clean and recognizable image is created. This is not done directly with the pixels, but with a compressed, abstract form of the images, which we call the (image) latent space. Qwen-Image uses a special tool, the VAE (Variational AutoEncoder), to transform the original images into such encoded, latent representations.
During the diffusion process, MMDiT learns the complex relationships between noisy image codes and clean, desired image codes. It practically learns the "recipe" of how to transform the noise into some specific visual content.
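To make the sculptor metaphor concrete, here is a toy NumPy loop that mimics the shape of iterative denoising in a latent space. The "denoiser" is a stand-in that nudges the latent toward a fixed target; it is purely illustrative and has nothing to do with Qwen-Image's real MMDiT or weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend latent space: an 8x8 grid of values the VAE would decode into an image.
target_latent = np.zeros((8, 8))          # stands in for "the clean image code"
latent = rng.normal(size=(8, 8))          # start from pure noise

def toy_denoiser(z, step, total):
    """Stand-in for MMDiT: predict a slightly cleaner latent.
    A real model would condition on the text prompt here."""
    blend = (step + 1) / total
    return (1 - blend) * z + blend * target_latent

steps = 20
for step in range(steps):
    latent = toy_denoiser(latent, step, steps)   # gradually remove the noise

print(np.abs(latent - target_latent).mean())     # ~0: the noise has been "sculpted" away
```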
Qwen-Image uses a model called Qwen2.5-VL to extract interpretable "instructions" for MMDiT from text inputs. Thus, the model generates exactly the image we have described.
Qwen-Image has multimodal capabilities. Not only can it generate images from text (Text-to-Image), but it can also edit images based on text instructions (Text-Image-to-Image). It can also perform certain image interpretation tasks, such as object recognition or depth information estimation. This is because MMDiT is designed to process and interpret text and image information simultaneously.
Links
Qwen-Image blog: https://qwenlm.github.io/blog/qwen-image/
Qwen-Image Technical Report: https://arxiv.org/pdf/2508.02324
GitHub: https://github.com/QwenLM/Qwen-Image
Hugging Face: https://huggingface.co/Qwen/Qwen-Image
Qwen Chat: https://chat.qwen.ai/
Hugging Face Demo: https://huggingface.co/spaces/Qwen/Qwen-Image
Image Generator Arena: https://github.com/mp3pintyo/Leaderboard-Image
We’d like to introduce OpenAI’s latest development in open-source AI models: the gpt-oss series. According to OpenAI’s evaluations, these two open-weight language models, gpt-oss-120b and gpt-oss-20b, deliver impressive performance on reasoning tasks, agent capabilities, and developer use cases. Available under the flexible Apache 2.0 license, the gpt-oss models are OpenAI’s first open-weight language models since GPT-2, and are designed to make AI more widely accessible and drive innovation.
Here’s a summary of why you should check out these models:
- Two versions, for different purposes
- gpt-oss-120b: This larger model has 117 billion parameters and is designed to run on a single 80 GB GPU (such as the NVIDIA H100 or AMD MI300X). It is well suited for production environments and for general-purpose, reasoning-heavy tasks.
- gpt-oss-20b: This smaller model has 21 billion parameters and requires only 16 GB of memory, making it ideal for low-latency, local or specialized applications, even on consumer hardware.
- Open-source and permissive license:
- The gpt-oss models are released as open-source models. The Apache 2.0 license allows for free experimentation, customization and commercial use, without copyleft restrictions or patent risks.
- Advanced reasoning capabilities:
- The models support adjustable reasoning effort (low, medium, high), which can be tuned to the task requirements and latency expectations (see the Transformers sketch after this list).
- Full Chain-of-Thought (CoT) access is provided. This allows detailed insight into the thinking process of the model, which helps in debugging and increases confidence in the outputs. It is important to note that the content of the CoT is not guaranteed to be security compliant and should not be shown directly to end users.
- The gpt-oss-120b model outperforms OpenAI o3-mini on most benchmarks and approaches the capabilities of OpenAI o4-mini in areas such as competitive mathematics or health queries.
- Agent capabilities and tool usage:
- Models are natively capable of calling functions, browsing the web, executing Python code, and generating structured output.
- They are able to use built-in browser and Python tools to perform their tasks more efficiently.
- Efficiency and hardware support:
- The use of MXFP4 quantization significantly reduces the memory footprint of the models. This allows gpt-oss-120b to run on a single 80 GB GPU and gpt-oss-20b to run on just 16 GB of memory.
- Wide range of runtime environments supported, including Transformers, vLLM, Ollama, LM Studio, PyTorch, Triton, and Apple Metal.
- Fine-tuning:
- Both gpt-oss models can be fully fine-tuned for specific use cases. gpt-oss-20b can even be fine-tuned on consumer hardware.
- Harmony response format:
- The models were trained exclusively on the OpenAI Harmony response format and only work properly with it. This format defines the structure of conversations, reasoning outputs, and function calls.
- Security focus:
- OpenAI considers security to be of fundamental importance. The models have undergone extensive security training and evaluation, including filtering out harmful data during pre-training, and are resistant to jailbreak attacks.
- Wide availability:
- Weights are freely available for download from Hugging Face.
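Below is a minimal sketch of running the smaller model locally with Hugging Face Transformers, setting the reasoning level in the system prompt as the gpt-oss documentation describes. The hardware assumption (a GPU with roughly 16 GB of memory) and generation settings are illustrative.

```python
# Minimal sketch: run gpt-oss-20b locally with Transformers and request a
# reasoning level via the system prompt (the pattern described in the model card).
# Hardware, dtype, and generation settings here are illustrative assumptions.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Reasoning: high"},   # low / medium / high
    {"role": "user", "content": "Explain why the sky is blue in two sentences."},
]

result = pipe(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])        # the assistant's final answer
```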
Links
Introducing gpt-oss: https://openai.com/index/introducing-gpt-oss/
Technical documentation: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
GitHub: https://github.com/openai/gpt-oss
Hugging Face: https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4
LM Studio: https://lmstudio.ai/blog/gpt-oss
Ollama: https://ollama.com/library/gpt-oss
gpt-oss playground: https://gpt-oss.com/
OpenAI Harmony Response Format: https://cookbook.openai.com/articles/openai-harmony
Today, we will look at an AI model that simplifies image editing: Qwen-Image-Edit. This model builds on the foundation of the original, high-performance Qwen-Image, and brings amazing capabilities in the areas of text rendering and precise image editing.
Qwen-Image-Edit’s capabilities and benefits in brief:
- This model stands out for its ability to precisely edit text within images in a bilingual (Chinese and English) environment. This includes directly adding, deleting, and modifying text while preserving the original text size, font, and style. For example, it can make corrections in calligraphy or modify even the smallest text elements on posters.
- It allows you to modify the content of the image while maintaining the original visual semantics and consistency. This includes creating IP (intellectual property) content (e.g., modifying a mascot to have different personalities), rotating objects (even 90 or 180 degrees to see the back), and style transformation (e.g., transforming a portrait into a Studio Ghibli style).
- Precision Detail Editing: This feature focuses on leaving certain regions of the image completely unchanged while adding, removing, or modifying specific elements. Examples include adding a sign and generating an associated reflection, removing small objects or hair, changing the color of a specific font, or modifying a person's clothing and background.
- Step-by-step editing (chained approach): Qwen-Image-Edit allows users to progressively correct errors in images, such as calligraphy. This means that bounding boxes can be used to mark areas to be corrected and modifications can be made iteratively until the desired result is achieved.
What makes it better than others?
- It not only generates or edits images, but also understands them, making it a comprehensive base model for intelligent visual creation and manipulation, where language, layout and images converge.
- Open source ecosystem. The model is natively supported in ComfyUI and is also available on the HuggingFace and ModelScope platforms, making it widely accessible to developers and users. Optimizations such as low GPU memory requirements, FP8 quantization and acceleration methods further increase its accessibility and efficiency.
Links
Blog: https://qwenlm.github.io/blog/qwen-image-edit/
GitHub: https://github.com/QwenLM/Qwen-Image
System prompt: https://huggingface.co/spaces/Qwen/Qwen-Image-Edit/blob/main/app.py
Hugging Face: https://huggingface.co/Qwen/Qwen-Image-Edit
HF Demo: https://huggingface.co/spaces/Qwen/Qwen-Image-Edit
Qwen Chat: https://chat.qwen.ai/
Qwen-Image-Edit ComfyUI Native Support: https://blog.comfy.org/p/qwen-image-edit-comfyui-support
Qwen-Image-Edit ComfyUI Native Workflow Example: https://docs.comfy.org/tutorials/image/qwen/qwen-image-edit
Lenovo UltraReal: https://civitai.com/models/1662740/lenovo-ultrareal?modelVersionId=2106185
Realism: https://huggingface.co/flymy-ai/qwen-image-realism-lora
Seed-OSS is a set of open-source large-scale language models developed by ByteDance Seed Team, designed to provide powerful capabilities in long-context understanding, reasoning, and agentic tasks. It stands out with its flexible control of the "thinking budget", robust performance on various benchmarks, and research-friendly approach, making it a versatile tool for developers and researchers alike.
- Specifically designed to provide long-context understanding, reasoning, agentic, and general capabilities.
- Primarily optimized for internationalized (i18n) use cases.
- Users can flexibly adjust the length of reasoning as needed
- Seed-OSS is specifically optimized for reasoning tasks
According to ByteDance, the models achieve open-source state-of-the-art (SOTA) results across various categories, including Knowledge (MMLU-Pro, MMLU, TriviaQA for the base model; MMLU-Pro, MMLU for the Instruct model), Mathematics (GSM8K, MATH for the base model; AIME24, AIME25, BeyondAIME for the Instruct model), Coding (MBPP, HumanEval for the base model; LiveCodeBench v6 for the Instruct model), Instruction Following (IFEval), Agent (TAU1-Retail, SWE-Bench, Multi-SWE-Bench), Multilingualism (MMMLU), and Long Context (RULER)
Links
Seed-OSS Open-Source Models Release: https://seed.bytedance.com/en/blog/seed-oss-open-source-models-release?view_from=blog
Hugging Face ByteDance-Seed/Seed-OSS-36B-Instruct: https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct
GitHub: https://github.com/ByteDance-Seed/seed-oss
LM Studio: https://lmstudio.ai/home
VibeVoice is a novel framework designed to generate expressive, emotional, and lifelike long-form, multi-actor audio, such as podcasts, from text. The model aims to solve the significant challenges of traditional text-to-speech (TTS) systems in terms of scalability, speaker consistency, and natural conversational turns.
The capabilities and special features of the VibeVoice model are as follows:
- Capable of synthesizing conversations with up to four different speakers and generating up to 90 minutes of speech, which exceeds the typical limitations of many previous models.
- Excellent for creating podcasts and similar long-form audio content.
- Allows voice cloning from voice samples. This requires clean voice samples with minimal background noise, at least 3-10 seconds long, though 30 seconds is recommended for better quality.
- Text File Loading: Suitable for loading text scripts from .txt files.
- Flexible configuration: Adjustable with parameters such as temperature, sampling, and guidance scale (cfg_scale).
Two model options:
- VibeVoice-1.5B: Provides faster inference and has a download size of approximately 5 GB, ideal for single speakers and rapid prototyping.
- VibeVoice-7B-Preview: Provides higher quality output, especially for multi-actor conversations, has slower inference and has a download size of approximately 17 GB.
- Technological innovation: One of its fundamental innovations is the use of continuous speech tokenizers (acoustic and semantic) that operate at an extremely low frame rate of 7.5 Hz. These tokenizers achieve a 3200x compression ratio while maintaining audio fidelity, drastically increasing computational efficiency when processing long sequences (see the arithmetic below).
- LLM-based next-token diffusion framework: The model uses a large-scale language model (LLM, e.g. Qwen2.5) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
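The 3200x figure follows directly from the frame rate if the tokenizers operate on 24 kHz input audio, which is an assumption here: 24,000 samples per second divided by 7.5 frames per second is 3,200 samples per token.

```python
# Assuming 24 kHz input audio, the 7.5 Hz tokenizer frame rate implies the
# stated 3200x compression: each token summarizes 3,200 audio samples.
sample_rate_hz = 24_000                  # assumed input sample rate
frame_rate_hz = 7.5                      # tokenizer frame rate from the report
print(sample_rate_hz / frame_rate_hz)    # 3200.0 samples per token
```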
Results and performance: The VibeVoice-7B model outperforms most state-of-the-art models in long-form conversational speech generation, both subjectively and objectively, showing better realism, richness, and overall preference.
It is important to note that the model works best primarily with English and Chinese text. The VibeVoice model itself is for research purposes and is subject to Microsoft’s license terms.
Links
Microsoft VibeVoice: https://microsoft.github.io/VibeVoice/
Technical Report: https://arxiv.org/pdf/2508.19205
GitHub: https://github.com/microsoft/VibeVoice
Google Colab: https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/VibeVoice_colab.ipynb
Hugging Face VibeVoice-1.5B: https://huggingface.co/microsoft/VibeVoice-1.5B
Hugging Face VibeVoice-7B-Large: https://huggingface.co/WestZhang/VibeVoice-Large-pt
ComfyUI: https://github.com/Enemyx-net/VibeVoice-ComfyUI
Audacity: https://www.audacityteam.org/
According to its developer, Deep Cogito, Cogito v2 is one of the world's most powerful open-source AI model families, available in sizes ranging from 70B to 671B parameters. Thanks to its unique iterative self-improvement method (IDA), the model solves complex problems by developing its internal "intuition" rather than by searching longer at inference time, which yields shorter and more efficient chains of thought.
• Market-leading performance: The company claims that the performance of the largest 671B-parameter MoE (Mixture of Experts) model competes with the latest DeepSeek models and approaches that of closed models such as o3 and Claude 4 Opus. The models have been trained in over 30 languages and are optimized for coding, STEM tasks, instruction following, and tool calling.
• Innovative Training Method (IDA): The company uses a method called Iterated Distillation & Amplification (IDA), which it describes as a scalable and efficient strategy for achieving superintelligence. The essence of this is that the model internalizes the inference process and improves its own parameters through iterative self-improvement, rather than simply searching for the answer. According to Deep Cogito, this helps the models develop better “intuition.”
• Superior efficiency: The company emphasizes that thanks to the IDA method, their models achieve superior results with shorter “reasoning chains.” For example, their 671B model uses 60% shorter reasoning chains than DeepSeek R1. This approach is also significantly more cost-effective; they claim that training all of the Cogito models cost less than $3.5 million.
• Flexible, hybrid operation: According to the company, Cogito v2 models are hybrid models, which means that they can respond immediately (like a standard LLM), or respond after a self-reflective, “thinking” process. This thinking mode can be manually turned on.
• Size selection and local runnability: Deep Cogito has released the models in four different sizes (70B, 109B, 405B, 671B), so users can choose the model that suits their hardware. The company highlights that, with the help of Unsloth, the models can also be run locally, even in quantized (reduced-size) form, with minimal loss of accuracy.
• Emergent capabilities: The company notes, as an interesting side effect, that although the models were trained only on text data, they can also reason about visual content through transfer learning, thanks to the multimodal base model.
Links
Cogito V2 Preview: https://www.deepcogito.com/research/cogito-v2-preview
Hugging Face Cogito v2 preview - 671B MoE: https://huggingface.co/deepcogito/cogito-v2-preview-deepseek-671B-MoE
Unsloth: https://docs.unsloth.ai/basics/tutorials-how-to-fine-tune-and-run-llms/cogito-v2-how-to-run-locally
OpenRouter: https://openrouter.ai/deepcogito/cogito-v2-preview-deepseek-671b
In this episode, we dive deep into the art of crafting effective prompts for large language models. Join our hosts as they explore essential techniques to optimize outputs, enhance creativity, and improve interaction with AI systems like GPT. They’ll walk you through constructing prompts with the right context, defining tasks, and setting clear expectations. Learn how small adjustments can lead to significant improvements in both the quality and speed of AI responses, and discover practical tips for applying these tricks in real-world scenarios. Whether you're new to prompt engineering or looking to refine your skills, this conversation is packed with insights that will elevate your AI game.
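As a concrete illustration of the context-task-expectations structure the hosts describe, here is one way such a prompt could be assembled; the wording is purely illustrative and not a template from the episode.

```python
# One possible prompt skeleton following the context / task / expectations
# structure discussed in the episode (illustrative wording, not an official template).
context = "You are an assistant helping a small e-commerce team write product copy."
task = "Write a product description for a stainless-steel travel mug."
expectations = (
    "Keep it under 80 words, use a friendly tone, "
    "mention the 12-hour heat retention, and end with a call to action."
)

prompt = f"{context}\n\nTask: {task}\n\nRequirements: {expectations}"
print(prompt)
```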
The rapid development of AI has outpaced the ability of many organisations to adapt. This discrepancy presents both challenges and opportunities. While there is growing pressure to utilize AI for its potential benefits, such as increased efficiency and competitiveness, companies must address the accompanying challenges.

One major concern is the decentralized use of personal devices and the potential risks to data security and knowledge sharing. To mitigate this, it's crucial to establish clear data privacy policies and create a centralized knowledge-sharing platform. Furthermore, organisations should focus on upskilling their workforce and fostering a culture that embraces AI. Highlighting individuals within the organization who are successfully utilizing AI can encourage wider adoption.

Instead of solely focusing on strategic documents, organizations need to react to the immediate impact of AI and encourage proactive adaptation. This shift requires flexibility and a recognition that AI implementation is ultimately a human-centric process, driven by individuals who can leverage it to create value and solve organizational problems. By acknowledging the human element and fostering a culture of learning and adaptation, organizations can navigate the evolving landscape of AI and harness its transformative potential.