Moonshot AI Releases Multimodal Image Understanding Model Moonshot-V1-Vision-Preview

On January 15, Moonshot AI, the Beijing-based startup behind the Kimi assistant, launched its new multimodal image understanding model, Moonshot-V1-Vision-Preview (hereafter, the Vision model). The release strengthens the multimodal capabilities of the Moonshot-V1 series.

The Vision model excels in image recognition, accurately identifying intricate details and subtle differences in visual content. From distinguishing between similar-looking foods to telling apart animal breeds, it handles cases that frequently confuse human observers. For example, it can correctly label each picture in a set of 16 highly similar images, such as the often-confused blueberry muffins and Chihuahua dogs.

In addition to general image recognition, the Vision model handles advanced applications such as OCR text extraction and image-based comprehension. Moonshot AI says it outperforms traditional scanning and OCR software, accurately reading even messy handwritten content on receipts and shipping labels.

The Vision model supports advanced features such as multi-turn dialogues, streaming output, tool calling, JSON Mode, and Partial Mode. However, it currently does not support internet search, creating Context Cache with image content, or handling URL-based image inputs. Instead, it processes base64-encoded images and can use pre-existing Context Cache for enhanced interactions.
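In practice, a request to the Vision model looks much like any OpenAI-compatible chat completion, with the image passed inline as base64 data rather than by URL. The sketch below illustrates this flow; the base URL, the model identifier "moonshot-v1-8k-vision-preview", and the file name are illustrative assumptions rather than confirmed values, so the Kimi Open Platform documentation should be checked for the exact names.

```python
# Minimal sketch: sending a base64-encoded image to the Vision model through
# an OpenAI-compatible chat-completions client. Endpoint and model name are
# assumptions based on the Moonshot-V1 naming scheme.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",        # issued by the Kimi Open Platform
    base_url="https://api.moonshot.cn/v1",  # assumed OpenAI-compatible endpoint
)

# URL-based image inputs are not supported, so the image is read locally
# and embedded as a base64 data URL.
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="moonshot-v1-8k-vision-preview",  # hypothetical version identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Extract all text from this receipt as plain text."},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Because the model also supports streaming output, tool calling, JSON Mode, and Partial Mode, the same request shape can be extended with the corresponding standard parameters where the platform documents them.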

Moonshot AI offers pay-as-you-go pricing for the Vision model, ranging from CNY12.00 to CNY60.00 per million tokens depending on the model version. To enhance accessibility, the Kimi Open Platform now provides Context Caching to all users without renewal fees, significantly reducing overall costs.
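As a rough illustration of the quoted rates, a request consuming 100,000 tokens would cost about CNY1.20 at the low-end price of CNY12.00 per million tokens, or about CNY6.00 at the high-end price of CNY60.00 per million tokens, before any savings from Context Caching.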

With the launch of the Vision model, Moonshot AI continues to push the boundaries of multimodal AI capabilities, setting new standards for precision and usability in image recognition.
