Today, NVIDIA announced its new Ampere architecture, along with the new A100 GPU built on it. It's a significant improvement over Turing, itself an AI-focused architecture that powers high-end data centers and ML-driven ray tracing in the consumer graphics space.
For a complete rundown of all the highly technical details, read NVIDIA's in-depth architecture overview. Here, we'll break down the most important points.
The New Die Is Absolutely Massive
Right out of the gate, the new chip impresses. The last-generation Tesla V100 die was 815 mm² on TSMC's already mature 12 nm process node, with 21.1 billion transistors. That was already quite big, but the A100 comes in at 826 mm² on TSMC's 7 nm, a much denser process, and packs 54.2 billion transistors. Impressive for a brand-new node.
The new GPU delivers 19.5 teraflops of FP32 performance, with 6,912 CUDA cores, 40 GB of memory, and 1.6 TB/s of memory bandwidth. In one fairly specific workload (sparse INT8), the A100 actually breaks 1 PetaOPS of raw compute. Granted, that's INT8, but it's still a very powerful card.
Then, just as with the V100, NVIDIA has taken eight of these GPUs and built a mini-supercomputer that it sells for $200,000. You'll likely see these arriving at cloud providers like AWS and Google Cloud Platform soon.
But unlike the V100, each A100 doesn't have to act as one monolithic GPU — it can be partitioned into up to seven separate GPU instances that can be virtualized and rented out independently for different tasks, each with its own dedicated slice of memory and compute.
When it comes to putting all those transistors to use, the new chip runs much faster than the V100. For AI training and inference, the A100 offers a 6x speedup for FP32, 3x for FP16, and a 7x speedup in inference when all of these GPUs are used together.
Note that the V100 marked in the second graph is the 8-GPU V100 server, not a single V100.
NVIDIA also promises up to 2x speedup in many HPC workloads:
When it comes to raw TFLOP numbers, the A100's FP64 double-precision performance is 20 TFLOPS, versus 8 for the V100's FP64.
TensorFloat-32: A new number format optimized for tensor cores
With Ampere, NVIDIA introduces a new number format designed to replace FP32 in certain workloads. In essence, FP32 uses 8 bits for the number's range (how big or small it can be) and 23 bits for its precision.
NVIDIA's claim is that these 23 precision bits aren't strictly necessary for many AI workloads, and that you can get similar results and better performance from just 10 of them. The new format is called TensorFloat-32, and this — on top of the die shrink and the increased core count — is how they justify the massive 6x speedup claim in AI training.
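To make the bit layout concrete, here is a small sketch that mimics the effect of TF32's reduced precision in pure Python. The function name `to_tf32` is our own, and real TF32 hardware rounds to the nearest value, while this sketch simply truncates the low mantissa bits — it's an illustration of the format, not NVIDIA's implementation.

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 precision: keep the FP32 sign bit and all
    8 exponent bits (same range as FP32), but only the top 10 of the
    23 mantissa bits. Real TF32 rounds; this sketch truncates."""
    # Reinterpret the value as its 32-bit FP32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Zero out the low 13 mantissa bits (23 - 10 = 13).
    tf32_bits = bits & ~0x1FFF
    # Reinterpret the truncated pattern back as a float.
    return struct.unpack("<f", struct.pack("<I", tf32_bits))[0]

# 1.0 has an all-zero mantissa, so it survives unchanged.
print(to_tf32(1.0))
# 0.1 loses its low mantissa bits, landing within ~1e-4 of the original.
print(to_tf32(0.1))
```

Because the exponent field is untouched, TF32 covers the same numeric range as FP32 — only the fine-grained precision is reduced, which is exactly the trade-off NVIDIA is betting AI workloads can tolerate.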
They claim that "Users don't have to make any code changes, because TF32 only runs inside the A100 GPU. TF32 operates on FP32 inputs and produces results in FP32. Non-tensor operations continue to use FP32." This means the speedup should come with no trade-off for workloads that don't need the extra precision.
Compare FP32 performance on the V100 to TF32 performance on the A100, and you'll see where these massive speedups come from: TF32 is up to ten times faster. Of course, much of that is because Ampere's other improvements make it roughly twice as fast across the board, so it isn't a direct comparison.
They have also introduced a new concept called fine-grained structured sparsity, which speeds up deep neural network computation. In principle, some weights matter less than others, so the weight matrices can be compressed to improve throughput. While throwing away data may not sound like a good idea, NVIDIA claims it doesn't affect the accuracy of the trained network for inference — it simply makes it faster.
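The sparsity pattern Ampere's tensor cores accelerate keeps two non-zero values out of every four consecutive weights. The sketch below shows the pruning step in plain Python; the function name `prune_2_4` is our own, and the real workflow also retrains or fine-tunes the network afterwards to recover accuracy, which this sketch omits.

```python
def prune_2_4(weights):
    """2:4 structured sparsity sketch: in each consecutive group of
    four weights, zero out the two with the smallest magnitude.
    Assumes len(weights) is a multiple of 4."""
    pruned = []
    for i in range(0, len(weights), 4):
        group = list(weights[i:i + 4])
        # Rank this group's indices by absolute value, smallest first.
        order = sorted(range(len(group)), key=lambda j: abs(group[j]))
        # Drop the two least important weights in the group.
        for j in order[:2]:
            group[j] = 0.0
        pruned.extend(group)
    return pruned

# The two smallest-magnitude weights in each group of four become zero.
print(prune_2_4([0.9, -0.1, 0.05, 0.7]))  # → [0.9, 0.0, 0.0, 0.7]
```

Because exactly half the values in every group are guaranteed to be zero, the hardware can store the matrix in compressed form and skip the zeroed multiplications, which is where the claimed throughput gain comes from.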
For sparse INT8 calculations, the peak performance of a single A100 is 1,250 TOPS — an astonishingly high number. Of course, you'll be hard-pressed to find a real workload that's purely sparse INT8, but speedups are speedups.