The Nvidia CUDA Toolkit provides a development environment for creating high performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.

Using built-in capabilities for distributing computations across multi-GPU configurations, scientists and researchers can develop applications that scale from single GPU workstations to cloud installations with thousands of GPUs.

Features

  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation

Highlights:

  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • Nvidia Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and MacOS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.

What's New

  • CUDA 12 introduces support for the NVIDIA Hopper and Ada Lovelace architectures, Arm server processors, Lazy Module and Kernel Loading, revamped Dynamic Parallelism APIs, enhancements to the CUDA graphs API, performance-optimized libraries, and new developer tool capabilities.
  • Support for the NVIDIA Hopper architecture includes next generation Tensor Cores and Transformer Engine, hi-speed NVLink Switch system, mixed precision modes, 2nd generation Multi-Instance GPU (MIG), advanced memory management, and standard C++/Fortran/Python parallel language constructs.

Complete release notes can be found here.