KAIST Develops Core NPU Technology to Improve ChatGPT Inference Performance by Over 60%

KAIST and HyperExcel Develop Core Technology for High-Performance, Low-Power Neural Processing Units

Recent generative AI models such as OpenAI's GPT-4 and Google's Gemini 2.5 require not only high memory bandwidth but also large memory capacity, and major AI cloud operators like Microsoft and Google deploy thousands of Nvidia GPUs to meet those demands. In response, a Korean research team has developed NPU (Neural Processing Unit) core technology that improves the inference performance of generative AI models by more than 60% on average while consuming 44% less power than current GPUs.

On July 4, KAIST announced that a research team led by Professor Park Jong-se (Computer Science Department), working with HyperExcel (a company founded by Professor Kim Joo-young of the Electrical and Electronics Engineering Department), had developed high-performance, low-power NPU core technology specialized for generative AI clouds such as ChatGPT. The work was accepted at the International Symposium on Computer Architecture (ISCA 2025), a leading conference in the field of computer architecture.

The key focus of the research is to improve the performance of large-scale generative AI services by lightweighting the inference process to minimize accuracy loss while resolving memory bottlenecks. The work is particularly valued for integrating AI semiconductor design with AI system software, both crucial components of AI infrastructure.

Existing GPU-based AI infrastructures require many GPU devices to meet these high memory bandwidth and capacity demands. In contrast, the new technology delivers the same level of AI infrastructure with fewer NPU devices by quantizing the memory-intensive KV (key-value) cache, the store of intermediate attention states that grows with context length and dominates memory use during token generation, significantly reducing the cost of building generative AI clouds.
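For intuition, here is a minimal sketch of what KV-cache quantization looks like in principle. It assumes a simple per-token int8 scheme; the article does not describe the team's actual algorithm, so the details below are illustrative only:

```python
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Quantize a float16 KV tensor to int8 with one scale per token.

    kv: [num_tokens, head_dim] attention keys or values.
    Returns (int8 codes, float16 scales): roughly half the memory of
    float16 storage, at the cost of a small, bounded rounding error.
    """
    x = kv.astype(np.float32)
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero tokens
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float16 KV tensor before attention."""
    return (q.astype(np.float32) * scale.astype(np.float32)).astype(np.float16)

# Example: one attention head over a 4096-token context, 128 dims per head.
kv = np.random.randn(4096, 128).astype(np.float16)
q, scale = quantize_kv(kv)
print(f"{kv.nbytes} bytes fp16 -> {q.nbytes + scale.nbytes} bytes int8+scales")
print("max abs error:", np.abs(dequantize_kv(q, scale) - kv).max())
```

Because the KV cache scales with batch size and context length rather than model size, halving (or better) its footprint directly translates into serving the same workload with fewer devices.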

The team designed the technology to integrate with existing NPU architectures without altering their computational logic, implementing the proposed quantization algorithm, efficient page-level memory management techniques, and a new encoding scheme optimized for quantized KV caches, as sketched below. The result is a cost-effective, power-efficient NPU-based AI cloud infrastructure expected to substantially reduce operating costs by leveraging the NPU's high-performance, low-power characteristics.
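The article does not detail how the page-level memory management works. As an illustrative sketch (all names and the page size are hypothetical), a paged KV cache maps each request's tokens onto fixed-size pages drawn from a shared pool, so quantized cache blocks can be packed without large contiguous allocations:

```python
from dataclasses import dataclass, field

PAGE_TOKENS = 16  # tokens stored per fixed-size page (assumed)

@dataclass
class PagePool:
    """A free list of fixed-size pages backing the quantized KV cache."""
    num_pages: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_pages))

    def alloc(self) -> int:
        return self.free.pop()        # grab any free page (raises if exhausted)

    def release(self, page_id: int):
        self.free.append(page_id)     # pages are recycled across requests

@dataclass
class SequenceKV:
    """Maps one request's token positions onto non-contiguous pages."""
    pool: PagePool
    page_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self):
        if self.num_tokens % PAGE_TOKENS == 0:  # current page is full
            self.page_table.append(self.pool.alloc())
        self.num_tokens += 1

    def free_all(self):
        for p in self.page_table:
            self.pool.release(p)
        self.page_table.clear()

pool = PagePool(num_pages=1024)
seq = SequenceKV(pool)
for _ in range(40):   # 40 generated tokens -> ceil(40/16) = 3 pages
    seq.append_token()
print(len(seq.page_table), "pages used")
seq.free_all()
```

Fixed-size pages avoid fragmentation when many requests of varying lengths share one device's memory, which matters even more once cache entries shrink to quantized, variable-precision blocks.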

Professor Park stated, “This research, done in collaboration with HyperExcel, found solutions in lightweight inference algorithms for generative AI and succeeded in developing NPU core technology that addresses ‘memory issues’. By combining a lightweight approach to reduce memory demands while maintaining inference accuracy with optimized hardware design, we have realized an NPU that enhances performance by over 60% compared to current GPUs.”

He further emphasized, “This technology demonstrates the potential for creating high-performance, low-power infrastructure tailored to generative AI, and is expected to play a critical role not only in generative AI cloud data centers but also in AI transformation environments like agentic AI.”

The research was presented by KAIST Ph.D. student Kim Min-su and HyperExcel's Dr. Hong Seong-min as co-first authors at ISCA 2025, held from June 21 to 25 in Tokyo, Japan. ISCA received 570 submissions this year, of which only 127 were accepted.
