
HunyuanImage 2.1 is an open-source text-to-image diffusion model capable of generating ultra-high-resolution (2K) images. It stands out with its dual text encoders, a two-stage architecture that includes a refiner model, and a PromptEnhancer module for automatic prompt rewriting, all of which contribute to stronger text-image alignment and finer-grained control.
What does the HunyuanImage 2.1 image generation model do?
- High resolution: Generates ultra-high resolution (2K) images with cinematic quality composition
- Versatile styles: Supports a wide range of aesthetics, from photorealism to anime, comics, and vinyl figures, delivering strong visual appeal and artistic quality.
- Multilingual prompt support: Natively supports both Chinese and English prompts. The multilingual ByT5 text encoder integrated into the model improves text rendering and text-image alignment.
- Advanced semantics and granular control: It can handle ultra-long and complex prompts, up to 1000 tokens. It precisely controls the generation of multiple objects with different descriptions within a single image, including scene details, character poses, and facial expressions.
- Flexible aspect ratios: Supports various aspect ratios, including 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, and 2:3.
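As a rough sketch of how these aspect ratios map to 2K-class resolutions: pick a pixel budget near 2048×2048 and round each side to a latent-friendly multiple. The ~4.2 MP target area and the multiple-of-32 rounding below are illustrative assumptions, not the model's official resolution table.

```python
# Illustrative sketch: derive 2K-class resolutions for the supported
# aspect ratios. TARGET_AREA and the multiple-of-32 rounding are
# assumptions for illustration, not HunyuanImage 2.1's official table.
import math

TARGET_AREA = 2048 * 2048  # roughly a "2K" pixel budget
MULTIPLE = 32              # round sides to a latent-friendly multiple (assumed)

def resolution_for(ratio_w: int, ratio_h: int) -> tuple[int, int]:
    """Pick a (width, height) near TARGET_AREA matching ratio_w:ratio_h."""
    scale = math.sqrt(TARGET_AREA / (ratio_w * ratio_h))
    w = round(ratio_w * scale / MULTIPLE) * MULTIPLE
    h = round(ratio_h * scale / MULTIPLE) * MULTIPLE
    return w, h

for rw, rh in [(1, 1), (16, 9), (9, 16), (4, 3), (3, 4), (3, 2), (2, 3)]:
    w, h = resolution_for(rw, rh)
    print(f"{rw}:{rh} -> {w}x{h}")  # e.g. 1:1 -> 2048x2048, 16:9 -> 2720x1536
```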
HunyuanImage 2.1 stands out from other models with several technological innovations and unique features:
- Two-stage architecture:
1. Base text-to-image model: The first stage uses two text encoders: a multimodal large language model (MLLM) to improve image-text alignment, and a multilingual, character-aware ByT5 encoder to improve text rendering across languages. This stage is a single- and dual-stream diffusion transformer (DiT) with 17 billion parameters. Reinforcement learning from human feedback (RLHF) is used to optimize aesthetics and structural coherence.
2. Refiner model: The second stage introduces a refiner model that further improves image quality and clarity while minimizing artifacts.
- High-compression VAE (Variational Autoencoder): The model uses a highly expressive VAE with a 32x spatial compression ratio, significantly reducing computational costs. This allows it to generate 2K images with the same token length and inference time as other models require for 1K images.
- PromptEnhancer module (prompt rewriting model): An innovative module that automatically rewrites user prompts, enriching them with detailed, descriptive information to improve descriptive accuracy and visual quality.
- Extensive training data and captioning: It is trained on an extensive dataset with structured captions produced by multiple expert models, which significantly improves image-text alignment. It also employs an OCR agent and IP RAG to address the shortcomings of VLM captioners on dense text and world-knowledge descriptions, along with a two-way verification strategy to ensure caption accuracy.
- Open-source model: HunyuanImage 2.1 is open source; the inference code and pre-trained weights were released on September 8, 2025.
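The 32× compression claim can be checked with quick token arithmetic: a 2K image through a 32× VAE yields the same latent grid as a 1K image through the more common 16× VAE, so the DiT processes the same number of tokens. A minimal sketch (the patch size of 1 is an assumption to keep token count equal to latent grid size):

```python
# Quick arithmetic behind the 32x-compression claim: same latent grid,
# and hence same token count, for 2K @ 32x as for 1K @ 16x.
def latent_tokens(width: int, height: int, compression: int, patch: int = 1) -> int:
    """Number of DiT tokens for an image at the given VAE spatial compression."""
    lw, lh = width // compression, height // compression
    return (lw // patch) * (lh // patch)

tokens_2k_32x = latent_tokens(2048, 2048, compression=32)  # HunyuanImage 2.1
tokens_1k_16x = latent_tokens(1024, 1024, compression=16)  # typical 1K setup
print(tokens_2k_32x, tokens_1k_16x)  # both 64*64 = 4096
```

Since self-attention cost grows quadratically with token count, holding the token count fixed is what lets the model generate 2K images at roughly 1K-image inference cost.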
Links
Twitter: https://x.com/TencentHunyuan/status/1965433678261354563
Blog: https://hunyuan.tencent.com/image/en?tabIndex=0
PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt: https://hunyuan-promptenhancer.github.io/
GitHub PromptEnhancer: https://github.com/Hunyuan-PromptEnhancer/PromptEnhancer
PromptEnhancer Paper: https://www.arxiv.org/pdf/2509.04545
Hugging Face HunyuanImage-2.1: https://huggingface.co/tencent/HunyuanImage-2.1
GitHub: https://github.com/Tencent-Hunyuan/HunyuanImage-2.1
Checkpoints: https://github.com/Tencent-Hunyuan/HunyuanImage-2.1/blob/main/ckpts/checkpoints-download.md
Hugging Face demo: https://huggingface.co/spaces/tencent/HunyuanImage-2.1
RunPod: https://runpod.io?ref=2pdhmpu1
Leaderboard-Image: https://github.com/mp3pintyo/Leaderboard-Image