Naver Cloud has unveiled two AI models one day ahead of the government’s first public presentation of its national AI initiative. Both models can understand text, voice, and visuals simultaneously, built on “omnimodal” technology, which is regarded as a core technology for future AI that understands real-life contexts beyond plain text. Unlike conventional multimodal approaches, in which a text-based AI incrementally learns to handle images and sounds, the omnimodal structure trains the model on text, image, and audio data together from the start.
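To make that distinction concrete, below is a minimal, hypothetical PyTorch sketch of the idea (not Naver Cloud’s actual architecture, which has not been published in detail): each modality is projected into one shared token space, and a single transformer attends over the mixed sequence from the first training step, rather than grafting image and audio adapters onto an already-trained text model. All names and dimensions here are illustrative assumptions.

```python
# Conceptual sketch of a "native omnimodal" design (illustrative only,
# not Naver Cloud's published architecture): every modality is mapped
# into one shared embedding space, and a single backbone is trained on
# the mixed sequence from the start.
import torch
import torch.nn as nn

class NativeOmnimodalModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=6,
                 image_patch_dim=768, audio_frame_dim=128):
        super().__init__()
        # Each modality gets its own lightweight encoder into the SAME d_model space.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(image_patch_dim, d_model)  # e.g. ViT-style patch features
        self.audio_proj = nn.Linear(audio_frame_dim, d_model)  # e.g. mel-spectrogram frames
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # One shared backbone sees all modalities jointly from step one of training.
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_patches, audio_frames):
        # Concatenate modality tokens into a single sequence; self-attention
        # then integrates information regardless of which modality it came from.
        tokens = torch.cat([
            self.text_embed(text_ids),
            self.image_proj(image_patches),
            self.audio_proj(audio_frames),
        ], dim=1)
        return self.lm_head(self.backbone(tokens))

# Toy forward pass: one sample with 16 text tokens, 4 image patches, 8 audio frames.
model = NativeOmnimodalModel()
logits = model(torch.randint(0, 32000, (1, 16)),
               torch.randn(1, 4, 768),
               torch.randn(1, 8, 128))
print(logits.shape)  # torch.Size([1, 28, 32000])
```

In an adapter-based multimodal design, by contrast, the image and audio projections would typically be trained later against a frozen text backbone; training all input paths jointly from the outset is what “native” refers to here.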
The first model, “HyperCLOVA X SEED 8B Omni,” is a native omnimodal model that can integrate and reason over information regardless of modality. It is designed for practical use in real-world environments where text, visual, and audio information interact. The model also features an omnimodal generation function, allowing it to create or edit images from textual commands.
The other model, “HyperCLOVA X SEED 32B Think,” combines visual understanding, voice interaction, and tool use, retaining its text- and image-reasoning capabilities while adding voice dialogue. It demonstrated high-level performance on university entrance exam questions, reading the problems directly from images and achieving top grades in major subjects.
With this announcement, Naver Cloud says it has validated its native omnimodal AI development methodology and plans further scale-up through structured learning. The move positions the company to roll out specialized omnimodal models for industry and everyday services more efficiently.
The Ministry of Science and ICT and the National IT Industry Promotion Agency (NIPA) are set to hold a public presentation of the first results of the national AI Foundation Model project, featuring the selected elite teams, including Naver Cloud.
