Intel Processor Guide (3/21/2021)
Further down the Xeon road, Intel is promising to add instructions and other features that will provide a step-function improvement in both inference and training natively on the Xeons, but it is not being specific about how it will accomplish this.

But the actual product roadmaps, which insiders and key customers and ODMs and OEMs see, are several inches thick when they are printed out, as Raja Koduri, senior vice president of the Core and Visual Computing Group, general manager of edge computing solutions, and chief architect at Intel, explained in his keynote at the Architecture Day event. That stack of data does not count the substantial number of revisions that this broad portfolio of compute, networking, and storage products undergoes as they make their way to market or, as has happened often in the past couple of years, don't.
But only a fool would ever think for a second that Intel, when it is focused by its legendary paranoia, can't recover from missteps and deliver compelling chippery. It has happened time and time again, as we all know, and the reason why is simple: what Intel, and its rivals, do is exceedingly difficult. Modern CPUs, GPUs, and FPGAs are arguably the most complex devices ever created, and it is important, and fittingly kind, to step back and appreciate what has been accomplished in six decades of computing in the datacenter and the key role that Intel played in making innovation happen, both in the manufacturing and in the evolving architecture of chips. This is a market that is always hard, and that is why the rewards are so great for the victors.

Singhal got his bachelor's and master's degrees in electrical and computer engineering from Carnegie Mellon and went immediately to work at Intel after graduating in 1997, and notably was on the performance teams for the Pentium 4 processor, whose NetBurst architecture was the one that Intel once thought it could push to 10 GHz way back when. (The thermal densities were too high for this to ever work, as the company discovered, to the chagrin of us all.) Singhal led the performance teams for the transformational Nehalem Xeons, which debuted in 2009 with a revamped architecture, and their follow-on Westmere Xeons, and after that led the core development for the Haswell Xeons. These days, Singhal is responsible for the CPU core designs for the Core, Atom, and Xeon families of chips.

Without further ado, here is the roadmap for the Core and Atom cores. This means that the old tick-tock model is officially dead for the Cores and Xeons, a manufacturing and design approach that Intel used effectively for more than a decade to mitigate risk by changing only one thing at a time: either the manufacturing process or the microarchitecture.
But the AMD and Arm competition is picking up the pace, with an annual cadence of design refinements coupled with manufacturing process improvements, so Intel has to quicken its steps and absorb a little more risk. We figure that Intel is hedging a bit these days, and aims to convert the monolithic Cores and Xeons into multichip module designs, mixing chiplets with different functions in appropriate processes, as AMD, Xilinx, and Barefoot Networks have confirmed they are doing with their chips coming in 2019. We will not be surprised at all if the processing cores of the future Ice Lake Xeons are implemented in 10 nanometers while other parts of the socket, probably the memory and I/O controllers, stay in a very refined and mature 14 nanometer process.

Basically, Intel is supporting 8-bit integer (INT8) and 16-bit integer (INT16) data formats in the AVX-512 vector coprocessors on the Xeons, allowing for more data to be crammed in and chewed on for inference workloads. In an INT8 convolutional inner loop for inference, it takes three instructions to do the processing on Skylake; on Cascade Lake it takes one instruction, so the AVX-512 units can process three times as much data per clock. If customers want to use the fatter INT16 format, they can get a 2X speedup over the way it was done on Skylake with FP32. This will be aimed at HPC customers in particular, as far as we know.

The big change with Cooper Lake will be support for the bfloat16 format that Google created for the third generation of its Tensor Processing Units (TPUs) and that is also used in the Nervana neural network processors. The bfloat16 format can express the same range of numbers as an FP32 number, but can do it in the same 16 bits as the official FP16 format: it keeps the sign bit and the full 8-bit FP32 exponent, leaving 7 bits for the mantissa, where FP16 spends only 5 bits on the exponent and 10 on the mantissa. The microarchitecture of the cores in Cooper Lake has to be tweaked to support this bfloat16 format, and we wonder why FP16 half precision wasn't just done this way to begin with.
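That three-to-one instruction fusion can be illustrated with a small sketch. The function names below are our own, and the pure-Python arithmetic only approximates the semantics of the actual AVX-512 instructions (the VPMADDUBSW, VPMADDWD, VPADDD sequence on Skylake, and the fused VNNI dot-product-and-accumulate on Cascade Lake), but it shows why one fused operation can do the same inner-loop work as three:

```python
# Hypothetical sketch of the INT8 inner loop, modeled on one 32-bit lane.
# On Skylake, the INT8 path takes three instructions; Cascade Lake's VNNI
# fuses the dot product and accumulation into a single instruction.

def saturate_i16(x):
    """Clamp an intermediate result to the signed 16-bit range."""
    return max(-32768, min(32767, x))

def skylake_int8_dot(acc, a_u8, b_s8):
    """Three-step path: pairwise u8*s8 multiplies summed to i16 (with
    saturation), pairs widened and added to i32, then accumulated."""
    # Step 1 (VPMADDUBSW-like): multiply adjacent u8/s8 pairs, sum each
    # pair, saturate to i16.
    step1 = [saturate_i16(a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1])
             for i in range(0, 4, 2)]
    # Step 2 (VPMADDWD-like, multiplier of 1): widen and add i16 pairs to i32.
    step2 = step1[0] + step1[1]
    # Step 3 (VPADDD-like): add into the running 32-bit accumulator.
    return acc + step2

def vnni_int8_dot(acc, a_u8, b_s8):
    """Fused path: dot product of four u8/s8 pairs accumulated into i32
    in one operation."""
    return acc + sum(a * b for a, b in zip(a_u8, b_s8))

a = [10, 20, 30, 40]   # unsigned 8-bit activations
b = [1, -2, 3, -4]     # signed 8-bit weights
print(skylake_int8_dot(0, a, b))  # -> -100
print(vnni_int8_dot(0, a, b))     # -> -100
```

Both paths produce the same accumulator value when the intermediates stay in range; the win is that the fused form issues one instruction per group of four byte pairs instead of three.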
Keeping the same exponent keeps the same numerical range, and this seems pretty obvious in hindsight, especially considering the limitations on the range with FP16. A format in which any number higher than 65,520 is rounded to infinity seems like it might play havoc with simulations or machine learning algorithms that would like to be able to express very small or very large numbers. This, says Singhal, will help accelerate machine learning training on Intel Xeons.
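To make the trade-off concrete, here is a small pure-Python sketch (our own helper names, not anything Intel or Google ships) of the simplest FP32-to-bfloat16 conversion, which just keeps the top 16 bits of the FP32 encoding. A magnitude that would overflow FP16, whose largest finite value is 65,504, survives in bfloat16; the price is a coarser 7-bit mantissa:

```python
import struct

def f32_to_bf16_bits(x):
    """Truncate an FP32 value to bfloat16 by keeping the top 16 bits:
    the sign bit, the full 8-bit exponent, and 7 mantissa bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 16

def bf16_bits_to_f32(b):
    """Widen a bfloat16 bit pattern back to FP32 by zero-padding the
    low 16 mantissa bits."""
    return struct.unpack('>f', struct.pack('>I', b << 16))[0]

# 70,000 would round to infinity in FP16, but stays finite in bfloat16
# (rounded down to 69,632 by the truncated mantissa bits).
print(bf16_bits_to_f32(f32_to_bf16_bits(70000.0)))     # -> 69632.0

# The cost is precision: only 7 mantissa bits survive the round trip.
print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159265)))  # -> 3.140625
```

Real hardware typically rounds rather than truncates when narrowing to bfloat16, but the range behavior is the same: the 8-bit exponent is preserved, so nothing representable in FP32 suddenly overflows.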