Future processor architecture with 21% performance boost

The near future looks promising for computing performance. After yesterday's article about the breakthrough in HDD read/write speeds, here is another push on the computing side: researchers from North Carolina State University have announced a way to boost chip performance by up to 21% over current architectures, using what they call fused CPU-GPU architectures.

From the name you can tell what the “CPU-Assisted GPGPU on Fused CPU-GPU Architectures” technology is about. The idea of the fused architecture is to put the GPU and CPU on a single chip, letting the GPU cores access data in system memory directly instead of going through the CPU. Both share the same bus to the RAM, so the GPU can execute many simple functions faster than the CPU while leaving the more complex tasks to the CPU cores. The result is an overall performance increase of up to 21.4 percent.

And there is more: “This approach decreases manufacturing costs and makes computers more energy efficient. However, the CPU cores and GPU cores still work almost exclusively on separate functions. They rarely collaborate to execute any given program, so they aren’t as efficient as they could be. That’s the issue we’re trying to resolve,” said Dr. Huiyang Zhou, an associate professor of electrical and computer engineering who co-authored a paper on the research.


Below is the abstract of the paper, to be presented in New Orleans:
Abstract: This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains memory access instructions of the GPU kernel for multiple threadblocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread only contains memory fetch instructions from GPU kernels and not floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2-cache on the CPU side to increase the memory traffic from CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed preexecution improves the performance by up to 113% and 21.4% on average.


If you like this article, please share it on Facebook, tweet it and +1 it to spread it :) !
Reviewed by Maherr Live on Wednesday, February 08, 2012