TensorRT-LLM for Windows speeds up generative AI performance on GeForce RTX GPUs

Alfonso Maruccia

A hot potato: Nvidia has thus far dominated the AI accelerator business within the server and data center market. Now, the company is enhancing its software offerings to deliver an improved AI experience to users of GeForce and other RTX GPUs in desktop and workstation systems.

Nvidia will soon release TensorRT-LLM, a new open-source library designed to accelerate generative AI algorithms on GeForce RTX and professional RTX GPUs. The latest graphics chips from the Santa Clara corporation include dedicated AI processors called Tensor Cores, which are now providing native AI hardware acceleration to more than 100 million Windows PCs and workstations.

On an RTX-equipped system, TensorRT-LLM can reportedly deliver up to 4x faster inference performance for the latest large language models (LLMs), such as Llama 2 and Code Llama. While TensorRT-LLM was initially released for data center applications, it is now coming to Windows PCs equipped with powerful RTX graphics chips.

Modern LLMs drive productivity and are central to AI software, as noted by Nvidia. Thanks to TensorRT-LLM (and an RTX GPU), LLMs can operate more efficiently, resulting in a significantly improved user experience. Chatbots and code assistants can produce multiple unique auto-complete results simultaneously, allowing users to select the best response from the output.
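
For developers, using the library looks roughly like loading a TensorRT-LLM engine and sampling from it. The snippet below is a minimal sketch of that idea using the library's high-level Python API; the class names, parameters, and model path are assumptions drawn from the project's documentation rather than from Nvidia's announcement, and batching the same prompt with sampling enabled is one simple way to get several distinct candidates at once.

```python
# Minimal sketch: generating several candidate completions with TensorRT-LLM's
# high-level Python API. The class names (LLM, SamplingParams), parameters, and
# model path are assumptions based on the project's docs; check the release you
# install for the exact interface.
from tensorrt_llm import LLM, SamplingParams

# Build or load a TensorRT engine for a Llama 2 checkpoint (path is hypothetical).
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Temperature > 0 makes repeated prompts produce distinct completions, so
# batching the same prompt yields several unique "auto-complete" candidates.
prompts = ["def quicksort(arr):"] * 4
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```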

The new open-source library is also beneficial when integrating an LLM algorithm with other technologies, as noted by Nvidia. This is particularly useful in retrieval-augmented generation (RAG) scenarios where an LLM is combined with a vector library or database. RAG solutions enable an LLM to generate responses based on specific datasets (such as user emails or website articles), allowing for more targeted and relevant answers.
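
Conceptually, a RAG pipeline is simple: embed the user's documents, find the passages closest to the query, and prepend them to the prompt before the LLM generates an answer. The sketch below illustrates that general flow; it is not Nvidia's demo, and the embed() helper is a toy stand-in for a real embedding model or vector database.

```python
# Generic RAG flow (not Nvidia's demo): find the stored passages most similar
# to the query and pass them to the LLM as context. embed() is a toy stand-in
# for a real embedding model or vector database.
import numpy as np

def embed(texts):
    # Toy bag-of-words embedding; a real pipeline would use a proper model.
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, sum(ord(c) for c in word) % 256] += 1.0
    return vecs

documents = ["email about next week's flight ...",
             "website article about GPU drivers ...",
             "meeting notes from Monday ..."]
doc_vectors = embed(documents)                      # shape: (n_docs, dim)

def retrieve(query, k=2):
    q = embed([query])[0]
    # Cosine similarity between the query and every stored document.
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "When is my flight next week?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# prompt would then be sent to the TensorRT-LLM-accelerated model.
```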

Nvidia has announced that TensorRT-LLM will soon be available for download from the Nvidia Developer website. The company already provides TensorRT-optimized models and a RAG demo built around GeForce news on ngc.nvidia.com and GitHub.

While TensorRT-LLM is aimed primarily at generative AI professionals and developers, Nvidia is also working on additional AI-based improvements for regular GeForce RTX customers. TensorRT can now accelerate high-quality image generation with Stable Diffusion, thanks to features such as layer fusion, precision calibration, and kernel auto-tuning.
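
Those optimizations are applied when TensorRT compiles a network into an engine. The snippet below is a rough sketch of that build step using TensorRT's Python API on an ONNX export of a model component (for example, Stable Diffusion's UNet); the file names are placeholders, and Nvidia's actual Stable Diffusion demo wraps considerably more than this.

```python
# Rough sketch of building a TensorRT engine from an ONNX model (e.g. an
# exported Stable Diffusion UNet). Layer fusion and kernel auto-tuning happen
# automatically during this build; FP16 is enabled explicitly. File names are
# placeholders, not Nvidia's demo.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("unet.onnx", "rb") as f:           # hypothetical ONNX export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # reduced-precision inference

engine_bytes = builder.build_serialized_network(network, config)
with open("unet.plan", "wb") as f:
    f.write(engine_bytes)
```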

In addition, the Tensor Cores in RTX GPUs are being used to clean up low-quality internet video streams. RTX Video Super Resolution version 1.5, included in the latest GeForce driver release (version 545.84), improves video quality and reduces artifacts in content played at native resolution, thanks to what Nvidia calls advanced "AI pixel processing" technology.


 
I've owned 8 nVidia GPUs and I doubt I'll ever own another one if they keep up with all this RT and AI nonsense. I'm well aware they're making tons of money right now but I think we've hit market saturation. They already had to introduce Ray Tracing as a way to sell more powerful GPUs and I think we're only 2 generations away from native ray tracing. Unless they start pushing 8k or something but I'm honestly happy at 4k. 1080p was always lacking to me, 1440p was nice and I think 4k is perfect.

There has to come a day when there isn't anything else to improve, where everything is "good enough."

The last thing I see coming to games is AI that you can talk to, which generates a realistic voice and a unique response to whatever you say to it. I don't know what kind of hardware requirements that would take, but I think having a dedicated AI GPU might be something we see in the future. I've already seen ChatGPT mods for games that do this; I'm curious what it would take to do it in real time on a PC.

I'm sure AMD and nVidia would love for people to start buying 2 GPUs, one for rendering and one for AI.
 
Yes, I agree. That worked so well with PhysX. ;)
I'm familiar with PhysX. The reason I feel a dedicated AI card is different from a PhysX card is the relatively tame resources it had. A PhysX card had about half as much VRAM as a high-end GPU at the time. Now, if you want to do generative AI stuff, you need several GPUs and over half a terabyte of VRAM backed by terabytes of system RAM. To do real-time ray tracing you need a 4090. To do real-time generative AI you need a rack of GPUs with petabytes of high-speed storage to draw from. If generative AI dialog in games ever becomes a reality, I think that is one thing that might actually be a cloud service because of how cost prohibitive it would be for an individual.
 
I'm familiar with PhysX. The reason I feel a dedicated AI card is different from a PhysX card is the relatively tame resources it had. A PhysX card had about half as much VRAM as a high-end GPU at the time. Now, if you want to do generative AI stuff, you need several GPUs and over half a terabyte of VRAM backed by terabytes of system RAM. To do real-time ray tracing you need a 4090. To do real-time generative AI you need a rack of GPUs with petabytes of high-speed storage to draw from. If generative AI dialog in games ever becomes a reality, I think that is one thing that might actually be a cloud service because of how cost prohibitive it would be for an individual.
Which all might mean something along the lines of an "AINow" subscription platform similar to "GeForce Now". I'm not so sure people are going to buy into something like that, though. There is some indication that interest in AI is waning - https://www.washingtonpost.com/technology/2023/07/07/chatgpt-users-decline-future-ai-openai/ (it's from July).
 
I'm sure AMD and nVidia would love for people to start buying 2 GPUs, one for rendering and one for AI.

I used to buy 2 GPUs, even up to Maxwell. Ran SLI setups for many years, from the 7600 GT through the 8800 GTS 640MB (Step-Up with EVGA to the awesome G92 version), 8800 GTS 512MB, GTX 280, GTX 570, and GTX 980 Ti.

Now I just buy one since SLI is a thing of the past.
 
I'm familiar with PhysX. The reason I feel a dedicated AI card is different from a PhysX card is the relatively tame resources it had. A PhysX card had about half as much VRAM as a high-end GPU at the time. Now, if you want to do generative AI stuff, you need several GPUs and over half a terabyte of VRAM backed by terabytes of system RAM. To do real-time ray tracing you need a 4090. To do real-time generative AI you need a rack of GPUs with petabytes of high-speed storage to draw from. If generative AI dialog in games ever becomes a reality, I think that is one thing that might actually be a cloud service because of how cost prohibitive it would be for an individual.
Yes, the 4090 can do real-time ray tracing, any card can, but it's still really far away from handling fully ray-traced games, and that should be the future of gaming.
 
The death of science and technological innovation? I hope not.

Yes, the 4090 can do real-time ray tracing, any card can, but it's still really far away from handling fully ray-traced games, and that should be the future of gaming.
The only time I like ray tracing is not when it's used to make things look realistic, but when it's used stylistically. Don't get me wrong, I think ray tracing adds all kinds of artistic opportunities for developers. My problem with it is that many devs just say "hey, look, we have ray tracing" without actually using it to its full potential, while still adding the full performance hit.

There are plenty of games I've played that have foregone realism in favor of style and that, in my opinion, look better than games that add ray tracing "just because."

If you look at the Crysis 1 RT mod, you can hardly tell the difference between RT and raster.

Hopefully the new Unreal Engine will solve those problems or at least streamline how RT is used. I know I often come off as anti-ray-tracing, but that's incorrect. I'm anti-laziness.
 
The only time I like ray tracing is not when it's used to make things look realistic, but when it's used stylistically. Don't get me wrong, I think ray tracing adds all kinds of artistic opportunities for developers. My problem with it is that many devs just say "hey, look, we have ray tracing" without actually using it to its full potential, while still adding the full performance hit.

There are plenty of games I've played that have foregone realism in favor of style and that, in my opinion, look better than games that add ray tracing "just because."

If you look at the Crysis 1 RT mod, you can hardly tell the difference between RT and raster.

Hopefully the new Unreal Engine will solve those problems or at least streamline how RT is used. I know I often come off as anti-ray-tracing, but that's incorrect. I'm anti-laziness.
I'm definitely in agreement that art style is more important overall. Not all games need things to look real (some may even be better for it). That said, in games where looking realistic is a goal or a benefit, I'm all for OPTIONAL shiny bells and whistles that make the end result pop. I say 'optional' in caps because I also think being flexible about different people's hardware requirements and aesthetic tastes is important, but I do support their existence and working toward making them better and more performant.
 