ByteDance’s Infinity Framework Redefines Text-to-Image Synthesis with Record-Breaking Efficiency and Quality

The research team at Chinese tech company ByteDance published a paper on arXiv on December 5, 2024, introducing Infinity, a new framework aimed at improving the efficiency and quality of text-to-image synthesis.

In the field of image generation, creating high-resolution and realistic images has always posed significant challenges, especially in text-to-image synthesis tasks. Traditional methods largely rely on diffusion models and Visual AutoRegressive (VAR) frameworks.

While these models can produce high-quality images, they demand extensive computational resources, making them less practical for real-time applications. Additionally, VAR models often encounter cumulative errors when handling discrete tokens, leading to a loss of image detail and diminished realism.

Infinity addresses these challenges with a bitwise token prediction framework that replaces traditional index-level tokens, giving finer-grained representations and significantly reducing quantization errors. The framework incorporates an infinite-vocabulary tokenizer and classifier, so the vocabulary size can in theory scale to infinity. According to the paper, this design scales better than standard VAR models and improves image realism and generation capacity.
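To make the contrast between index-level and bitwise tokens concrete, here is a minimal Python sketch. It assumes, for illustration only, that each token is a d-bit code obtained by taking the sign of each feature channel, and that the bitwise classifier uses two logits per bit. The function names and the hidden size of 2048 are hypothetical and are not taken from the Infinity code.

```python
def to_bits(features):
    """Sign-based binary quantization (assumed): one bit per feature channel."""
    return [1 if x > 0 else 0 for x in features]


def bits_to_index(bits):
    """Pack a bit list into the single integer an index-level tokenizer would emit."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_head_weights(hidden_dim, num_bits):
    """Weights in one softmax head over all 2**num_bits possible indices."""
    return hidden_dim * (2 ** num_bits)


def bitwise_head_weights(hidden_dim, num_bits):
    """Weights in num_bits independent binary heads (2 logits per bit, assumed)."""
    return hidden_dim * 2 * num_bits


if __name__ == "__main__":
    bits = to_bits([0.3, -1.2, 0.0, 2.5])
    print(bits, bits_to_index(bits))
    # Compare head sizes for a hypothetical hidden size of 2048.
    for d in (16, 32):
        print(d, index_head_weights(2048, d), bitwise_head_weights(2048, d))
```

Under these assumptions the index-level head grows exponentially with the number of bits, while the bitwise head grows only linearly, which is one way to see why the vocabulary can be enlarged so aggressively. It also shows why bit-level prediction is finer-grained: one wrong bit changes the packed index entirely, but leaves every other bit correct.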

The Infinity architecture comprises three main components: a bit-level multi-scale quantization tokenizer that converts image features into binary tokens for reduced computational overhead; a transformer-based auto-regressive model that predicts residuals based on text prompts and prior outputs; and a bitwise self-correction mechanism that introduces random bit flips during training to improve the model’s robustness to errors.
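The random bit flips used by the self-correction mechanism can be pictured with a short Python sketch. This is only an illustration of the idea as described above, not the authors' implementation: the helper name `flip_bits`, the independent per-bit flipping, and the 25% flip probability are all assumptions.

```python
import random


def flip_bits(bits, flip_prob, rng):
    """Return a copy of `bits` with each bit inverted independently
    with probability `flip_prob` (hypothetical corruption scheme)."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    clean = [1, 0, 0, 1, 1, 0, 1, 0]
    noisy = flip_bits(clean, flip_prob=0.25, rng=rng)
    changed = sum(c != n for c, n in zip(clean, noisy))
    print("clean:", clean)
    print("noisy:", noisy)
    print("bits flipped:", changed)
```

In a training loop, the corrupted bits would stand in for the model's own imperfect earlier predictions, so the model sees the kind of mistakes it will make at inference time and learns to recover from them.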

The framework is trained on large datasets such as LAION and OpenImages, with resolution increased progressively from 256×256 to 1024×1024. In the reported evaluation, Infinity scores 0.73 on the GenEval benchmark (compared to 0.62 for SD3-Medium) and 0.96 on the ImageReward benchmark (compared to 0.87 for SD3-Medium), and achieves a 66% win rate against SD3-Medium.

Infinity also outpaces its peers in efficiency, generating a high-resolution 1024×1024 image in 0.8 seconds, which is 2.6 times faster than SD3-Medium. On the strength of this speed and quality, the authors describe Infinity as the fastest text-to-image model available. ByteDance plans to release the models and code, promoting further exploration of Infinity for visual generation and unified tokenizer modeling.
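As a quick check on what the speed figures imply, the following Python sketch works through the arithmetic. It assumes that "2.6 times faster" means the ratio of the two latencies. The function names are illustrative, and the resulting baseline latency is derived from the article's numbers rather than quoted from the paper.

```python
def implied_baseline_seconds(latency_seconds, speedup):
    """Baseline latency, assuming speedup = baseline latency / model latency."""
    return latency_seconds * speedup


def images_per_minute(latency_seconds):
    """Sequential throughput when generating one image at a time."""
    return 60.0 / latency_seconds


if __name__ == "__main__":
    infinity_latency = 0.8  # seconds per 1024x1024 image, as reported
    speedup = 2.6           # reported speedup over SD3-Medium
    baseline = implied_baseline_seconds(infinity_latency, speedup)
    print(f"implied SD3-Medium latency: {baseline:.2f} s")
    print(f"Infinity throughput: {images_per_minute(infinity_latency):.0f} images/min")
```

Under that reading, SD3-Medium would take roughly 2.1 seconds per image, and Infinity could produce about 75 images per minute when run one at a time.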
