
Domestic AI: is the gap with ChatGPT and Gemini really this wide? Solving CSAT math reveals the difference.

According to a study by a research team led by Professor Jong-rak Kim of the Department of Mathematics at Sogang University, the large language models (LLMs) built by Korean teams vying for the title of national representative artificial intelligence (AI) performed significantly worse than overseas models at solving university entrance exam math and essay questions.

On December 15, the research team said it had tasked the flagship LLMs of five domestic AI teams, along with five overseas models including ChatGPT, with answering 20 College Scholastic Ability Test (CSAT) math questions and 30 essay questions.

For the CSAT portion, the team selected the five most difficult questions from each of the common section, probability and statistics, calculus, and geometry, for a total of 20 questions. For the essay portion, the team assembled 30 questions: past exam questions from 10 domestic universities, 10 entrance exam questions from Indian universities, and 10 math questions from the University of Tokyo's Graduate School of Engineering.

The Korean entrants were Upstage's "Solar Pro 2," LG AI Research's "EXAONE 4.0.1," Naver's "HCX-007," SK Telecom's "A.X 4.0 (72B)," and NCSoft's lightweight model "Llama VARCO 8B Instruct." The overseas models were GPT-5.1, Gemini 3 Pro Preview, Claude Opus 4.5, Grok 4.1 Fast, and DeepSeek V3.2.

On the CSAT questions, the overseas models scored between 76 and 92 points. Among the Korean models, only Solar Pro 2 managed 58 points; the rest scored in the low 20s, and Llama VARCO 8B Instruct scored the lowest with 2 points.

The team highlighted that the five domestic models failed to solve most of the problems through basic reasoning alone, even when they were configured to use Python as a tool to boost their problem-solving ability.
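
For readers unfamiliar with what "using Python as a tool" means in practice, the sketch below shows one minimal form such a grading loop can take. The model_answer() stub, the toy problems, and the scoring rule are hypothetical illustrations, not the Sogang team's actual harness.

```python
# Minimal sketch of a Python-as-tool grading loop (hypothetical; not the
# research team's actual setup). The model returns Python code instead of
# a bare answer, the harness executes it, and the printed output is graded.
import contextlib
import io

def model_answer(question: str) -> str:
    # Hypothetical stand-in for an LLM call that returns Python code.
    return f"print({question})"

def run_python_tool(code: str) -> str:
    # Execute the model-emitted code and capture what it prints.
    # A real harness would sandbox this rather than call exec() directly.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()

problems = [("7 * 8", "56"), ("2 ** 10", "1024")]  # toy stand-ins for exam items
correct = sum(run_python_tool(model_answer(q)) == ans for q, ans in problems)
print(f"score: {100 * correct / len(problems):.0f}")  # -> score: 100
```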

In a separate test on ten questions from a problem set called "EntropyMath," whose difficulty ranges from university coursework to professor-level research papers, the overseas models scored between 82.8 and 90 points, while the domestic models scored between 7.1 and 53.3 points.

Under a scoring method that counted an answer as correct if the model produced it within three attempts, Grok achieved a perfect score and the other overseas models scored 90 points. The Korean models varied: Solar Pro 2 scored 70 points and EXAONE 60, while HCX-007 scored 40, A.X 4.0 scored 30, and Llama VARCO 8B Instruct scored 20.
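
This "correct within three attempts" rule is essentially pass@3-style scoring. One minimal way to express it is sketched below, with a hypothetical attempt() callable standing in for a single model query; none of this is the team's published code.

```python
import random

def pass_at_k(problems, attempt, k: int = 3) -> float:
    # Percentage of problems solved within k attempts
    # (the "correct answer accepted after three attempts" rule).
    solved = sum(
        any(attempt(q) == ans for _ in range(k))
        for q, ans in problems
    )
    return 100 * solved / len(problems)

# Toy usage: a noisy solver that answers correctly only 60% of the time.
problems = [("1 + 1", "2"), ("3 * 3", "9")]
noisy = lambda q: str(eval(q)) if random.random() < 0.6 else "?"
print(pass_at_k(problems, noisy, k=3))
```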

Professor Kim said, "We received many inquiries asking why there were no evaluations of our sovereign AI models on university entrance exam questions. So our team carried out the test, and it revealed a significant lag behind overseas frontier models."

The research team plans to rerun the evaluation with internally developed problem sets when the five domestic teams release new versions of their national representative AI models.
