ByteDance’s Doubao Vision Understanding Model Is Priced 85% Below the Industry Average

On December 18th, at the Force Conference hosted by the Volcano Engine, ByteDance officially launched the Doubao Vision Understanding Model and priced it at 0.003 yuan per 1,000 tokens, 85% below the industry average. At that price, one yuan can process 284 images at 720P resolution.
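The pricing figures can be cross-checked with simple arithmetic. The sketch below uses only the published numbers (0.003 yuan per 1,000 tokens, 284 images per yuan, 85% below average). The per-image token count and the implied industry-average price are derived values, not figures stated by ByteDance, and the average assumes "85% cheaper" means the Doubao price is 15% of the average.

```python
# Published figures from the announcement
PRICE_PER_1K_TOKENS_YUAN = 0.003   # yuan per 1,000 tokens
IMAGES_PER_YUAN = 284              # 720P images processed per yuan
DISCOUNT_VS_AVERAGE = 0.85         # "85% cheaper than the industry average"

# Tokens one yuan buys at the published price
tokens_per_yuan = 1000 / PRICE_PER_1K_TOKENS_YUAN

# Derived (not stated by ByteDance): approximate tokens consumed per 720P image
tokens_per_image = tokens_per_yuan / IMAGES_PER_YUAN

# Derived, assuming Doubao's price = (1 - 0.85) x industry average
implied_avg_price = PRICE_PER_1K_TOKENS_YUAN / (1 - DISCOUNT_VS_AVERAGE)

print(f"{tokens_per_yuan:,.0f} tokens per yuan")
print(f"about {tokens_per_image:,.0f} tokens per 720P image")
print(f"implied industry average: about {implied_avg_price:.3f} yuan per 1,000 tokens")
```

Under these assumptions, one yuan buys roughly 333,000 tokens, which suggests each 720P image is counted as roughly 1,170 tokens, and the implied industry average is about 0.02 yuan per 1,000 tokens.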

The move shows ByteDance repeating in large multimodal models the steep price-cutting strategy it used for general-purpose large models. In May, the company officially released the Doubao Large Model and cut prices by an order of magnitude, triggering a wave of large-model price reductions by companies such as Alibaba and Baidu.

Tan Dai, the President of the Volcano Engine, previously stated that reducing costs is a key factor in driving large models to the “value creation stage.”

More than half a year later, the market performance of the Doubao General Large Model has, to some extent, validated Tan Dai’s judgment. According to data released by ByteDance, as of mid-December the Doubao General Model’s daily token usage exceeded 40 trillion, a 33-fold increase since its release seven months earlier.

Large models are rapidly spreading into a range of industries. According to Chinese media outlet Jiemian, the Doubao Large Model already works with 80% of mainstream automotive brands and has been integrated into smart terminals such as mobile phones and PCs, covering approximately 300 million devices. Calls to the Doubao Large Model from smart terminals grew 100-fold in six months. Over the last three months, call volume grew 39-fold in information processing scenarios, 16-fold in customer service and sales, 13-fold in hardware terminals, and 9-fold in AI tools, with significant growth in learning and education as well.

At this conference, Tan Dai again highlighted the rapid growth of the Doubao Large Model’s market share, attributing it to the Volcano Engine’s development philosophy of “stronger models, lower costs, and easier implementation.”

According to ByteDance, the Doubao Vision Understanding Model not only identifies visual content accurately but also has strong understanding and reasoning capabilities. It can perform complex logical calculations based on image information, analyze charts, understand code, and answer questions across academic subjects, and it offers fine-grained visual description and creative abilities.

For example, it can instantly infer which animal cast a shadow, identify landmarks and unfamiliar everyday objects while providing popular-science explanations, and locate every object in an image with bounding boxes.

Zhou Hao, Head of Doubao Strategic Research, said that Doubao has always aimed to make user input faster and more convenient. The team has focused heavily on multimodal input and its refinement, including speech and vision capabilities, all of which are available to enterprise customers through the Volcano Engine.

Released alongside the Doubao Vision Understanding Model is the Doubao 3D Generation Model. Combined with the Volcano Engine’s digital twin platform veOmniverse, it can reportedly handle intelligent training, data synthesis, and digital asset production efficiently, forming a physical-world simulator that supports AIGC creation.

In spring next year, ByteDance will also release Doubao Video Generation Model 1.5, which will be able to generate longer videos. The Doubao end-to-end real-time speech model is also expected to launch soon, adding capabilities such as multi-character role-play and dialect conversion.

Although the Doubao models were not among the earliest of their kind to reach the market, they have consistently maintained a fast update pace.

ByteDance currently plans to prioritize its Jimeng product, aiming to build a “TikTok” for the AI era through new approaches. This suggests that ByteDance has high expectations for bringing large models to consumer products.
