Chip architecture design for high-performance computing

The evolution towards multi-chip integration and new types of memory processing signifies a paradigm shift where flexibility, efficiency, and optimization for a variety of workloads become crucial.

Leading hyperscale cloud data center companies such as Amazon, Google, Meta, Microsoft, Oracle, and Akamai are introducing heterogeneous multi-core architectures specifically tailored for cloud computing, which has an impact on the development of high-performance CPUs across the entire chip industry.

These chips are unlikely to be sold commercially. They are optimized for specific data types and workloads, and while their design budgets are substantial, they can pay for themselves through higher performance and lower power consumption. The industry's goal is to pack more computing power into a smaller area while reducing cooling costs, and the best way to achieve that is through customized architectures, tightly integrated micro-architectures, and carefully designed data flows.

This trend began nearly a decade ago, when AMD started adopting heterogeneous architectures and accelerated processing units in place of the earlier homogeneous multi-core CPU model, though adoption was slow at first. Since then, heterogeneous architectures have gained momentum, following in the footsteps of designs for mobile consumer devices, which must fit very compact footprints and meet strict power and thermal requirements.

Quadric's Vice President of Marketing, Steve Roddy, said: "Almost every product line from industry giants like Intel now includes an AI NPU on the same piece of silicon. Of course, AI pioneer NVIDIA has long mixed CPUs, shader (CUDA) cores, and tensor cores in its highly successful data center products. The shift toward chiplets in the coming years will cement this transformation, as system buyers will be able to choose the type of compute and interconnect to fit the specific needs of each design socket, thereby determining the mix of chiplets."


This is largely a matter of physics and economics. As the advantages of scaling diminish and advanced packaging technologies mature, allowing designs to add more customized features that previously were limited by die size, the competition for performance per watt and per dollar has reached a fever pitch.

Neil Hand, Marketing Director of Siemens EDA's IC division, said: "Nowadays everyone is building their own architecture, especially the data center companies, and a large part of the processor architecture depends on what the workload looks like. At the same time, these developers are exploring the best paths for acceleration, because there are many ways to accelerate. You can go the parallel-processing route, which is ineffective for some tasks but very effective for others. Meanwhile, applications are increasingly limited by memory bandwidth, so you will find some high-performance computing companies putting all their effort into memory controllers. There are also companies that say, 'This is really a decomposition problem; we are going the accelerator route, with separate cores.' But I don't think there is a one-size-fits-all approach."

Roddy pointed out that the CPU cores inside these new super chips still follow the well-established principles of high-performance CPU design: fast, deep pipelines and extremely efficient pointer chasing. But that is no longer the design team's only focus. He said: "These large CPUs now share space with other programmable engines, such as GPUs and general-purpose programmable NPUs, for accelerating AI workloads. A significant difference from the highly specialized SoCs in high-volume consumer devices is that tasks like video transcoding or matrix acceleration for AI workloads avoid hard-wired logic blocks (accelerators). Devices designed for data centers need to stay programmable to cope with a variety of workloads, rather than the single known function of a consumer device."

However, all of this requires more analysis, and the design community keeps pushing for more steps in the process. Hand said: "Whether through tools, or through simulation or virtual prototypes, you have the means to understand the data. Moreover, the industry has grown and specialized enough to justify the expense. The first part is reducing the risk of building new hardware, because if you have the tools to understand what is going on, you don't have to be conservative. Now the market has begun to segment, and it matters enough to be worth the investment. In addition, there are now ways to achieve this. In the past, when Intel launched processors, it was almost impossible to compete with Intel. Now, through the combined effect of ecosystems, technology, and other factors, competing has become much easier. For high-performance computing companies, the low-hanging fruit at the start was: 'We just need a good platform that lets us dimension it in our own way, and then put in some accelerators.' So we began to see AI accelerators and video accelerators, and then some companies went deeper and began to pursue machine learning. What does that mean? It means they need very high MAC performance. They will focus the processor architecture on that and stand out in this way."
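To give a rough sense of the arithmetic behind that focus on MAC performance, the sketch below converts a workload's per-inference MAC count and a target service rate into the number of parallel MAC units an accelerator would need at a given clock. All of the numbers are illustrative assumptions, not figures from the article.

```c
/* Back-of-the-envelope sketch with assumed numbers: how a workload's MAC
 * demand translates into accelerator sizing. */
#include <stdio.h>

int main(void) {
    double macs_per_inference = 4e9;     /* hypothetical 4-GMAC model       */
    double inferences_per_sec = 1000.0;  /* assumed service rate            */
    double clock_hz           = 1e9;     /* assumed 1 GHz accelerator clock */

    double macs_per_sec = macs_per_inference * inferences_per_sec;  /* 4e12 */
    double mac_units    = macs_per_sec / clock_hz;  /* one MAC per unit per cycle */

    printf("Required throughput: %.1f TMAC/s\n", macs_per_sec / 1e12);
    printf("Parallel MAC units at 1 GHz: %.0f\n", mac_units);
    return 0;
}
```

Under these assumptions the accelerator would need roughly 4,000 MAC units running in parallel, which is why the MAC array ends up dominating the architecture.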

Add RISC-V, reusable chiplets, and hard IP, and the architecture begins to look very different from a few years ago. Hand said: "If you look at today's data centers and the entire software stack in the data center, adding something to the stack is not as difficult as it used to be; you don't have to rebuild the entire data center. What has become important today is the ability to perform system-level analysis, and system-level co-design of the applications has become both very important and easier, because the data center itself is a moving target."

Many believe that new architectures should be developed to overcome the memory challenges faced by several generations of CPUs. Andy Heinig, head of the Efficient Electronics department at the Adaptive Systems Engineering Division of Fraunhofer IIS, said: "The demand for AI/ML will accelerate the development of new application-specific architectures. Traditional CPUs can be part of this revolution if they can provide a better memory interface to solve memory problems. If the CPU can provide this new memory architecture, then AI/ML accelerators can become the best solution for data centers together with the CPU. The CPU is responsible for classic tasks that require flexibility, while the accelerator provides the best performance for specific tasks."

For example, Arm has been working directly with several hyperscale cloud providers to develop Neoverse-based computing solutions to achieve high performance, customized flexibility, and a robust software and hardware ecosystem. This has resulted in publicly released chips, such as AWS's Graviton and Nitro processors, Google's Mount Evans DPU, Microsoft Azure's Cobalt 100, Nvidia's Grace CPU superchip, and Alibaba's Yitian 710.

Brian Jeff, Senior Director of Product Management for Arm's Infrastructure Business Line, said: "We have learned a lot from these and other design partners. One of the main ways we shape high-performance CPUs and platform development is by deeply understanding infrastructure workloads to achieve specific architectural and microarchitectural enhancements, especially enhancements to the CPU pipeline front end and CMN mesh structure."

However, capturing this workload and developing chip architecture for it is not always so simple. This is especially true for AI training and inference, as changes in algorithms can lead to changes in the workload.

Priyank Shukla, Chief Product Manager for Interface IP at Synopsys, said: "Different models are currently being trained, such as Meta's Llama models and OpenAI's GPT models. All of these models have a structure and a certain number of parameters. Take GPT-3, for example. It has 175 billion parameters, and each parameter is 2 bytes wide, that is, 16 bits. Storing 175 billion 2-byte parameters takes roughly 350 billion bytes of memory. That data has to be held by all the accelerators that share the model; the model has to be mapped onto the accelerator fabric, and the parameters placed in the memory attached to each accelerator. So you need a fabric that can accept larger models and process them. You can implement the model in different ways, that is, different ways of implementing the algorithm. Some of the work can be done serially and some in parallel. The work done serially needs to be cache-coherent and to minimize latency, so it is partitioned within a rack. The work done in parallel is distributed across racks through a scale-out network. We see system companies creating the model and algorithm and implementing it in custom hardware."
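The sizing math in that example is easy to reproduce. In the minimal sketch below, the parameter count and 2-byte width come from the GPT-3 figures quoted above, while the number of accelerators sharing the model is an assumed value chosen only for illustration.

```c
/* Minimal sketch of the model-footprint math quoted above. The 175B
 * parameter count and 2-byte width are from the GPT-3 example; the
 * accelerator count is an assumption for illustration. */
#include <stdio.h>

int main(void) {
    double params          = 175e9;  /* GPT-3-class model                 */
    double bytes_per_param = 2.0;    /* 16-bit (2-byte) parameters        */
    int    accelerators    = 64;     /* assumed size of one sharing group */

    double total_bytes = params * bytes_per_param;   /* ~350e9 bytes */
    printf("Model footprint: %.0f GB\n", total_bytes / 1e9);
    printf("Per accelerator: %.2f GB\n", total_bytes / 1e9 / accelerators);
    return 0;
}
```

With these inputs the model occupies about 350 GB, or roughly 5.5 GB per accelerator when spread across 64 devices, which is what drives the memory-capacity and fabric requirements Shukla describes.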

Assembling the various processing components is not easy. Patrick Verbist, Product Manager for ASIP tools at Synopsys, said: "They are heterogeneous multi-core architectures, usually a mix of general-purpose CPUs and GPUs, depending on the type of company, because they tend to prefer one or the other. Then there are fixed-function RTL accelerators mixed in with these heterogeneous multi-core architectures. The application loads these accelerators run generally include data manipulation, matrix multiplication engines, activation functions, parameter compression/decompression, graph weights, and so on. What all of these applications have in common is the need for a large amount of computation. Usually these calculations are done on standard or custom data types. Many processing architectures support int16, but if you only need to process 16-bit data, there is no need to waste a 32-bit data path on it. This has to be customized. So accelerators need to support not only 32-bit floating-point data types but also int8 and/or int16, half-precision floating-point, custom integer, or custom floating-point data types. The functional units and operators are usually a combination of vector adders, vector multipliers, adder trees, and activation functions. The activation functions are usually exponential or hyperbolic functions, square roots, division, and other transcendental functions, but they are all vectorized and have single-cycle throughput requirements, because a new calculation has to be issued every cycle. For these accelerators, in terms of heterogeneity, we see many customers using ASIPs (application-specific instruction-set processors). An ASIP allows the operators to be customized, so the data path and instruction set perform only a limited set of operations, but more efficiently than a conventional DSP."
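A reduced-width MAC of the kind Verbist describes can be sketched in scalar C. The function below is illustrative only: int8 operands accumulate into a wide int32 result, followed by an activation (a simple ReLU here for brevity; the exponential and hyperbolic functions he mentions would take its place in a real accelerator, and in hardware the loop would be flattened into parallel multipliers feeding an adder tree with single-cycle throughput).

```c
/* Illustrative scalar model of a vectorized int8 MAC unit with a wide
 * accumulator and an activation stage. In hardware the loop body becomes
 * parallel multipliers feeding an adder tree. */
#include <stdint.h>

int32_t int8_mac_relu(const int8_t *weights, const int8_t *activations, int n) {
    int32_t acc = 0;                                        /* wide accumulator */
    for (int i = 0; i < n; i++) {
        acc += (int32_t)weights[i] * (int32_t)activations[i];  /* int8 MAC     */
    }
    return acc > 0 ? acc : 0;                               /* ReLU for brevity */
}
```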

General-purpose DSPs are often not a good fit because they are too general, while fixed-function RTL may not be flexible enough, which creates room for "something that is more flexible than fixed-function RTL but less flexible than a general-purpose DSP." GPUs, to some extent, are also general-purpose. They must support a wide range of workloads, but not all workloads. This is where ASIPs come in, offering flexibility and programmability. That flexibility is needed to support a range of computational algorithms and to keep up with the constantly changing requirements of software and AI graphs, as well as of the AI algorithms themselves.

Hand from Siemens believes that accounting for the workload is a daunting challenge. "To address this issue, vertically integrated companies are investing in high-performance computing in this way, because high-performance computing is no different from AI; you can only work from the data patterns you see," said Hand. "If you are a company like Amazon or Microsoft, you have a vast amount of trace data, and without violating anyone's data, you know where your machines are bottlenecked. You can use that information to say, 'We've found we're limited by memory bandwidth and have to do something about it,' or it's a network bandwidth issue, or an AI throughput issue, and we're hitting problems in those areas. This is not much different from the challenges happening at the edge. The edge's goal is different; there we tend to ask, 'What can I get rid of? What do I not need?' or 'Where can I cut the power envelope?' Whereas in the data center you ask, 'How can I push more data through, and do it in a way that won't burn out the equipment? And as the equipment gets bigger, how do I do that in a scalable way?'"

Hand believes the shift toward multi-chip packaging, a technology already being used by companies like AMD and Nvidia, will drive many interesting developments. "Now you can start to provide some interesting plug-and-play components for these high-performance computing applications, and to a large extent you can start to say, 'What interconnect chiplet does this application need? What is the processing chiplet for this application?' It provides a middle ground between building a standard computer and not making many changes. What can I do? I can install different processors, different network cards, different DIMMs. As a cloud service provider, there are limits to what I can do. At the other end, a large cloud provider like Microsoft with Azure will say, 'I can build my own complete SoC and do whatever I want.' But now you can sit in the middle. For example, if you think there's a market for bio-computing data centers, and enough people enter that field, you can make some money. Can you assemble a 3D-IC and make it work in that environment? It will be interesting to see what emerges, because this lowers the barrier to entry. We've already seen companies like Apple, Intel, AMD, and Nvidia using it as a way to speed up product development and provide more diversity without having to test large chips. When you start combining that with things like a full digital twin of the environment, you can start to understand the workloads in the environment, understand the bottlenecks, then try different partitions, and move forward from there."

Jeff from Arm also believes that data center chip architectures are changing to accommodate AI/ML functionalities. "Inference on the CPU is very important, and we see partners leveraging our SVE pipeline and matrix math enhancements, as well as data types, to run inference. We also see that AI accelerators tightly coupled through high-speed coherent interfaces are playing a role, and DPUs are expanding their bandwidth and intelligence to connect nodes together."
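To make the CPU-inference point concrete, here is a hedged sketch of the kind of kernel that runs on a CPU's vector pipeline: a predicated fp32 dot product, the inner loop of a matrix multiply, written with Arm's SVE C intrinsics (ACLE). It assumes an SVE-capable compiler and target (for example, building with -march=armv8-a+sve); it is an illustration under those assumptions, not Arm's reference code or a complete inference kernel.

```c
/* Hedged sketch: fp32 dot product using SVE ACLE intrinsics.
 * Assumes an SVE-capable toolchain and target (e.g. -march=armv8-a+sve). */
#include <arm_sve.h>
#include <stdint.h>

float dot_f32_sve(const float *a, const float *b, int64_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);              /* vector accumulator   */
    for (int64_t i = 0; i < n; i += svcntw()) {       /* step by vector length */
        svbool_t pg = svwhilelt_b32_s64(i, n);        /* predicate incl. tail */
        svfloat32_t va = svld1_f32(pg, a + i);        /* predicated loads     */
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);           /* acc += va * vb       */
    }
    return svaddv_f32(svptrue_b32(), acc);            /* horizontal reduction */
}
```

Because the loop is vector-length agnostic, the same binary scales from a 128-bit to a wider SVE implementation, which is part of why such code maps well across different Neoverse-based designs.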

Multi-chip integration is inevitable

The chip industry is well aware that for many compute-intensive applications, single-chip solutions have become unrealistic. The biggest question of the past decade has been when the shift to multi-chip solutions would become mainstream. "The entire industry is at an inflection point where you can no longer avoid this issue," said Sutirtha Kabir, Director of R&D at Synopsys. "We talk about Moore's Law and 'SysMoore,' but designers have to add more functionality to CPUs and GPUs, and because of die size limits, yield limits, and other constraints, they simply cannot do it. Multi-chip is inevitable here, and it raises some interesting considerations. First, take a piece of paper and fold it. That is essentially what multi-chip is. You take a chip and fold it, and if you design it cleverly, you can shorten timing paths significantly. If you have to go from the top die to the bottom die, you might only traverse a small part of the chip's wiring; most of the path is through the ball or wire bonds between the dies."

Challenges facing multi-chip designs include determining how many paths need to be synchronized, whether timing should be closed across the two dies together or on each die separately, whether the L1 cache should sit on the top die or the bottom die, and whether an L4 cache can be added.

Kabir explained: "From a three-dimensional perspective, floorplanning becomes very interesting. You can turn a single-story house into a three- or four-story building, but that brings other design challenges. You can no longer ignore thermal issues. Thermal management used to be a PCB concern; now chip designers have to account for how hot these dies run. Jensen Huang said recently at SNUG that you put room-temperature water in at one end and hot-spring water comes out the other. He was joking, but the fact is that these chips really do run hot, and if you don't account for that in your floorplan, your processor will burn out. That means you have to start this work much earlier. In three-dimensional floorplanning, when it comes to workloads, how do you make sure you have analyzed the different multi-chip workloads and considered key effects such as IR drop, thermal, and timing, even before a netlist exists? We call this the zero-netlist phase. These considerations become very interesting because you can no longer avoid doing multi-chip, so from the foundry's perspective and the EDA perspective they are front and center for the ecosystem, and designers sit in the middle."

Related to the thermal issues of data center chips is the issue of low-power design.

"Data centers consume a huge amount of power," said Marc Swinnen, Director of Product Marketing at Ansys. "I attended the ISSCC in San Francisco, and our booth was right next to Nvidia, which was showcasing its AI training box - a large box with eight chips, numerous fans, and heat sinks. We asked how much power it consumes, and they said, 'Oh, it's up to 10,000 watts at the highest, but it's also 6,000 watts on average.' The power is getting crazier and crazier."Jeff from Arm also believes that the best way to tackle new challenges in data center chips is to adopt a full-system approach, which includes instruction set architecture, software ecosystem and specific optimizations, CPU microarchitecture, interconnect structures, system memory management and interrupt control, as well as I/O both within and outside the package. "A complete system approach allows us to work with partners to customize SoC designs based on modern workloads and process nodes, while leveraging a chipset-based design methodology," he said.

This customized chip design approach enables data center operators to optimize their power costs and compute efficiency. Jeff said: "The high efficiency of our Neoverse N series allows core counts per socket to reach 128 to 192 or even higher. These same N-series products can scale down to DPUs, 5G L2 designs, and edge servers in a smaller footprint. Our V series is aimed at cloud computing, offering higher single-thread performance and higher vector performance (for workloads such as AI inference and video transcoding) while still providing high efficiency. A wide range of accelerator attach options allows our partners to integrate the right mix of customized processing and cloud-native compute into SoCs tailored to their workloads."

Conclusion

Because high-performance computing keeps evolving, and because data centers can be optimized along so many axes, the end result is almost impossible to predict. Hand from Siemens said: "In the early days of the explosive growth of networking technology, people began to build out north-south and east-west routing within data centers, which changed every network switch architecture because that was a major bottleneck. It led to a complete rethinking of the data center. Similar things have happened with memory, and when you start integrating optical technologies and smarter memory, things will get very interesting."

Hand pointed to an Intel developer conference a few years ago, where the company described using surface-emitting optics and silicon photonics to disaggregate memory within the data center rack. He said: "They have a unified memory fabric that can be shared between servers, and memory can be allocated from different servers. As a result, the topology of the data center starts to become very interesting. Even within the rack, you can see the AI system fabrics from companies like NVIDIA. The biggest change is that people can look at this and, if there is market demand, build it. We always believed the key to the architecture was whether the core was fast. We have moved from 'Is the core fast enough?' to 'Do I have enough cores?' But the issue goes far beyond that. Once you start breaking away from the von Neumann architecture, using different memory flows, and focusing on in-memory computing, it becomes very cool. Then you start to ask, 'What does high-performance computing really mean?'"
