On February 12th, ByteDance's Doubao large model team announced a new sparse model architecture, UltraMem, which effectively solves the problem of high memory access during MoE inference. Compared with the MoE architecture, inference speed improves by 2-6x, and inference cost can be reduced by up to 83%. The research also reveals the Scaling Law of the new architecture, demonstrating that it not only has excellent scaling characteristics but also surpasses MoE in performance.
Experimental results show that an UltraMem model trained with 20 million values achieves industry-leading inference speed and model performance under equal computing resources, opening up a new path toward building models with billions of values or experts.
According to reports, UltraMem is a sparse model architecture that, like MoE, decouples computation from parameters, but it solves the memory-access problem during inference without compromising model quality. Experimental results show that under the same parameter count and activated compute, UltraMem outperforms MoE in model quality while improving inference speed by 2-6x. In addition, at common batch sizes, UltraMem's memory-access cost is almost equivalent to that of a dense model with the same computational load.
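The announcement does not detail UltraMem's internal mechanism, but the "values" it refers to suggest a sparse memory-layer design: a very large table of learned value vectors, of which only a handful are retrieved per token. Below is a minimal conceptual sketch of that idea in PyTorch; the class name, scoring scheme, and dimensions are illustrative assumptions, not the published UltraMem architecture.

```python
import torch
import torch.nn as nn

class SparseValueMemory(nn.Module):
    """Conceptual sketch of a sparse value-memory layer: a large table of
    learned value vectors, of which only top_k are retrieved per token.
    The names and scoring scheme are illustrative assumptions, not the
    published UltraMem design."""

    def __init__(self, d_model: int, num_values: int, top_k: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_values, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(num_values, d_model) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Score every value slot, keep only top_k.
        # A real design would avoid scoring all keys densely (e.g., with
        # product keys); dense scoring keeps this sketch simple.
        scores = x @ self.keys.T                       # (batch, num_values)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)              # (batch, top_k)
        picked = self.values[idx]                      # (batch, top_k, d_model)
        # Weighted sum of the few retrieved values: per-token compute and
        # memory reads stay O(top_k) while parameters grow with num_values.
        return (weights.unsqueeze(-1) * picked).sum(dim=1)

layer = SparseValueMemory(d_model=64, num_values=20_000, top_k=8)
out = layer(torch.randn(4, 64))
```

Because each token reads only top_k of num_values rows, the parameter count can scale far beyond the per-token compute and memory traffic, which is the decoupling the article describes.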
Under the Transformer architecture, model performance is logarithmically related to parameter count and computational cost. As LLMs continue to scale up, inference costs rise sharply and inference speed drops.
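For reference, the relationship alluded to here is usually written as a power law in the parameter count N, which is linear on log-log axes; this form comes from general scaling-law studies (e.g., Kaplan et al., 2020), not from the Doubao announcement itself:

```latex
% Loss as a power law in non-embedding parameter count N;
% N_c and \alpha_N are empirically fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
\iff \log L(N) \approx \alpha_N \,(\log N_c - \log N)
```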
Although the MoE architecture successfully decouples computation from parameters, at inference time even a small batch size can end up activating all of the experts, causing memory access to surge and significantly increasing inference latency.
"MoE" refers to the Mixture of Experts architecture, a design used to improve model capacity and efficiency. In an MoE model, the network consists of multiple sub-models (experts), each responsible for processing a portion of the input. During training and inference, a router selectively activates a few experts per token based on the input's features, thereby decoupling computation from parameters and improving the model's flexibility and efficiency, as the sketch below illustrates.
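To make the routing idea and the small-batch memory problem concrete, here is a minimal top-k MoE sketch in PyTorch. All names and sizes, and the absence of load balancing and capacity limits, are simplifications for illustration; this is not ByteDance's implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer: a router scores all
    experts and each token is processed by only its top_k experts."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token picks its top_k experts.
        gate = self.router(x).softmax(dim=-1)           # (tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)    # (tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique():                # distinct experts used
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k + 1] * self.experts[int(e)](x[mask])
        return out

# The memory-access problem: with num_experts=64 and top_k=2, a batch of
# just 32 tokens can route to up to min(64, 32 * 2) = 64 distinct experts,
# so nearly every expert's weights must be loaded even at small batch sizes.
moe = TinyMoE(d_model=64, num_experts=64, top_k=2)
out = moe(torch.randn(32, 64))
```

The final comment shows why small batches hurt MoE inference: the weights of every activated expert must be read from memory, so memory traffic scales with the number of distinct experts touched rather than with the compute actually performed per token.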
SEE ALSO: ByteDance’s Doubao Team Releases Open-Source VideoWorld Model for Video-Based AI Learning