On September 1, 2020, NVIDIA unveiled its new range of gaming GPUs: the RTX 3000 series, based on its Ampere architecture. We discuss what's new, the AI-powered software that comes with it, and all the details that make this generation truly amazing.
Meet the RTX 3000 Series GPUs
NVIDIA's main announcement was its shiny new graphics cards, all built on a custom 8 nm manufacturing process, and all delivering major speed gains in both rasterization and ray tracing.
At the bottom of the lineup is the RTX 3070, which comes in at $499. It's a bit expensive for the cheapest card NVIDIA presented at the initial announcement, but it's an absolute steal once you understand that it beats the existing RTX 2080 Ti, a top-of-the-line card that regularly sells for over $1,400. Following NVIDIA's announcement, however, third-party prices dropped, with a large number of them being panic-sold on eBay for under $600.
There are no solid benchmarks from the announcement yet, so it's unclear whether the card really is objectively "better" than a 2080 Ti, or whether NVIDIA is bending the marketing a bit. The benchmarks that were run were at 4K and probably had RTX on, which can make the gap look bigger than it will be in purely rasterized games, since the Ampere-based 3000 series performs ray tracing over twice as fast as Turing. But with ray tracing now barely hurting performance, and supported in the latest generation of consoles, getting it to run as fast as the previous generation's flagship for almost a third of the price is a major selling point.
It's also unclear whether the price will stay that way. Third-party designs regularly add at least $50 to the price tag, and with how high demand is likely to be, it won't be surprising to see it selling for $600 come October 2020.
Just above it is the RTX 3080 at $699, which should be up to twice as fast as the RTX 2080, and come in about 25-30% faster than the 3070.
Then, at the top end, is the new flagship, the RTX 3090, which is comically huge. NVIDIA is well aware of this, and referred to it as a "BFGPU," which the company says stands for "Big Ferocious GPU."
NVIDIA didn't show any direct performance metrics, but the company did show it running 8K games at 60 FPS, which is seriously impressive. Granted, NVIDIA is almost certainly using DLSS to reach that mark, but 8K gaming is 8K gaming.
Of course, there will eventually be a 3060 and other variations of more budget-oriented cards, but those usually come later.
To actually keep things cool, NVIDIA needed a revamped cooler design. The 3080 is rated for 320 watts, which is quite high, so NVIDIA has opted for a dual-fan design, but instead of placing both fans on the bottom, NVIDIA has put one fan at the top end, where the backplate usually goes. The fan directs air upward, toward the CPU cooler and the top of the case.
Judging by how much performance can be affected by poor airflow in a case, this makes perfect sense. However, the circuit board is very cramped because of this, which will likely affect third-party sale prices.
DLSS: A software advantage
Ray tracing isn't the only benefit of these new cards. Honestly, it's all a bit of a hack: the RTX 2000 and 3000 series aren't that much better at performing actual ray tracing than older card generations. Fully ray tracing an entire scene in 3D software like Blender usually takes a few seconds or even minutes per frame, so brute-forcing it in under 10 milliseconds is out of the question.
Of course, there is dedicated hardware for running ray calculations, called RT cores, but largely, NVIDIA chose a different approach. NVIDIA improved the denoising algorithms, which let the GPU render a very cheap single pass that looks awful, and somehow, through AI magic, turn it into something a gamer wants to look at. Combined with traditional rasterization-based techniques, it delivers a pleasant experience enhanced by ray-tracing effects.
To do this quickly, however, NVIDIA has added AI-specific processing cores called Tensor cores. These crunch all the math required to run machine-learning models, and do it very quickly. They're a total game changer for AI in the cloud server space, as AI is now widely used by many companies.
Beyond denoising, the major use of the Tensor cores for gamers is called DLSS, or deep learning super sampling. It takes in a low-quality frame and upscales it to full native quality. This essentially means you can play at 1080p render resolution while seeing a 4K-quality image.
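Some quick back-of-the-envelope math shows why that matters. The 1080p-to-4K pairing below is our illustrative assumption, not an official DLSS mode table, but it shows the core idea: 4K has four times the pixels of 1080p, so the GPU only shades a quarter as many pixels per frame.

```python
# DLSS in a nutshell: shade far fewer pixels, then upscale with AI.
# The resolutions below are illustrative assumptions.
render_res = (1920, 1080)   # internal render resolution
output_res = (3840, 2160)   # resolution you see on screen

render_pixels = render_res[0] * render_res[1]
output_pixels = output_res[0] * output_res[1]

# The GPU only has to shade 1 out of every 4 output pixels.
print(output_pixels // render_pixels)  # prints 4
```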
This helps ray-tracing performance quite a bit as well: benchmarks from PCMag show an RTX 2080 Super running Control at ultra quality, with all ray-tracing settings cranked to the max. At 4K it struggles along at only 19 FPS, but with DLSS on it gets a much better 54 FPS. DLSS is free performance for NVIDIA, made possible by the Tensor cores on Turing and Ampere. Any game that supports it and is GPU-limited can see serious speedups from software alone.
DLSS isn't new; it was announced as a feature when the RTX 2000 series launched two years ago. At the time, it was supported by very few games, as it required NVIDIA to train and tune a machine-learning model for each individual game.
But in that time, NVIDIA has rewritten it completely, calling the new version DLSS 2.0. It's a general-purpose API, which means any developer can implement it, and it's already being picked up by most major releases. Instead of working on a single frame, it takes in motion-vector data from the previous frame, much like TAA does. The result is much sharper than DLSS 1.0, and in some cases it actually looks better and sharper than even native resolution, so there's not much reason not to turn it on.
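The motion-vector idea that DLSS 2.0 borrows from TAA can be sketched with a toy one-dimensional example. All names and the blend weight here are our own illustrative assumptions, not NVIDIA's implementation: the point is just that each pixel in the new frame knows where it came from in the previous frame, so detail from frame history can be pulled forward and blended with the current low-quality render.

```python
import numpy as np

# Previous frame: a toy 1-D "image" of 8 pixel values.
prev_frame = np.arange(8, dtype=np.float32)

# Motion vectors: every pixel shifted one step to the right,
# so new pixel i came from old pixel i - 1 (clamped at the edge).
motion = np.ones(8, dtype=np.int64)
src = np.clip(np.arange(8) - motion, 0, 7)

# Reproject history to its new position, then blend it with
# the (here deliberately blank) current low-quality render.
current = np.zeros(8, dtype=np.float32)
reprojected = prev_frame[src]
blended = 0.9 * reprojected + 0.1 * current
```

Real implementations also reject history that no longer matches the scene, which is exactly why a full scene change forces DLSS to start over, as described below.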
There is one catch: when switching scenes entirely, as in cutscenes, DLSS 2.0 must render the very first frame at 50% quality while it waits for motion-vector data. This can result in a small dip in quality for a few milliseconds. But 99% of everything you look at will be rendered correctly, and most people won't notice it in practice.
RELATED: What Is NVIDIA DLSS, and How Will It Make Ray Tracing Faster?
Ampere Architecture: Built for AI
Ampere is fast. Seriously fast, especially at AI calculations. The RT cores are 1.7x faster than Turing's, and the new Tensor cores are 2.7x faster. The combination of the two is a real generational leap in ray-tracing performance.
Back in May, NVIDIA released the Ampere A100 GPU, a data-center GPU designed for running AI. With it, they detailed much of what makes Ampere so much faster. For data-center and high-performance computing workloads, Ampere is in general about 1.7x faster than Turing. For AI training, it's up to 6x faster.
With Ampere, NVIDIA is using a new number format designed to replace the industry-standard "Floating Point 32," or FP32, in certain workloads. Under the hood, every number your computer processes takes up a predefined number of bits in memory, whether that's 8 bits, 16 bits, 32, 64, or even larger. Numbers that are bigger are harder to process, so if you can use a smaller size, you'll have less to crunch.
FP32 stores a 32-bit decimal number, using 8 bits for the range of the number (how big or small it can be) and 23 bits for the precision. NVIDIA's claim is that these 23 precision bits aren't entirely necessary for many AI workloads, and you can get similar results and much better performance out of just 10 of them. Cutting the size down to just 19 bits, instead of 32, makes a big difference in many calculations.
This new format is called Tensor Float 32, and the Tensor cores in the A100 are optimized to handle this oddly-sized format. This, on top of die shrinks and increased core counts, is how they get the massive 6x speedup in AI training.
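A rough sketch of what dropping those precision bits looks like in practice: the helper below emulates TF32-style precision on a regular CPU by zeroing the low 13 of a float32's 23 mantissa bits, leaving 10. The function name and approach are our own illustration, not an NVIDIA API.

```python
import struct

def tf32_round(x: float) -> float:
    """Emulate TF32 precision by truncating a float32's 23-bit
    mantissa down to 10 bits. TF32 keeps FP32's 8-bit exponent,
    so the representable range is unchanged."""
    # Reinterpret the float's 32 bits as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Zero out the lowest 13 mantissa bits (23 - 10 = 13).
    bits &= ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_round(3.14159265))  # prints 3.140625: close, but coarser
```

Values like 1.0 or 2.5, which already fit in 10 mantissa bits, pass through unchanged; everything else lands on a nearby, slightly coarser grid point.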
Besides the new number format, Ampere sees major performance speedups in specific calculations, like FP32 and FP64. These don't translate directly into more FPS for the layman, but they're part of what makes it nearly three times faster overall at Tensor operations.
Then, to speed up calculations even more, they've introduced the concept of fine-grained structured sparsity, which is a very fancy term for a fairly simple concept. Neural networks work with large lists of numbers, called weights, which affect the final output. The more numbers to crunch, the slower it gets.
However, not all of these numbers are actually useful. Some of them are literally just zero, and can basically be tossed out, leading to massive speedups when you can crunch more numbers at the same time. Sparsity essentially compresses the weights, so less effort is needed to do calculations with them. The new "Sparse Tensor Core" hardware is built to operate on this compressed data.
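A minimal sketch of the compression idea, assuming the 2:4 pattern NVIDIA's structured sparsity uses (at most two nonzero weights per group of four). The storage layout below is our own illustration, not the actual hardware format:

```python
import numpy as np

# Toy weights with 2:4 structured sparsity: in every group of
# four weights, at most two are nonzero.
weights = np.array([0.0, 1.5, 0.0, -0.3,
                    2.1, 0.0, 0.7, 0.0], dtype=np.float32)

# Compress: store only the nonzero values, plus a tiny index
# recording where each one lives inside its group of four.
groups = weights.reshape(-1, 4)
values = np.array([g[g != 0] for g in groups])            # half the data
indices = np.array([np.flatnonzero(g) for g in groups])   # 2 slots per group

# Decompressing reproduces the original weights exactly, so a
# sparse multiply only ever touches half the numbers.
restored = np.zeros_like(groups)
np.put_along_axis(restored, indices, values, axis=1)
assert np.array_equal(restored.ravel(), weights)
```

The win is that `values` holds half as many numbers as `weights`, which is exactly the "crunch more numbers at once" speedup described above.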
Despite the changes, NVIDIA says that this should not affect the accuracy of trained models at all.
For sparse INT8 calculations, one of the smallest number formats, the peak performance of a single A100 GPU is over 1.25 PetaFLOPs, a staggeringly high number. Of course, that's only when crunching one specific kind of number, but it's impressive nonetheless.