Features:

  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation

Highlights:

  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and MacOS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features

What's New:

  • Added support for NVIDIA Ampere GPU architecture based GA10x GPUs (compute capability 8.6), including the GeForce RTX-30 series.
  • Enhanced compatibility across minor releases of CUDA allows CUDA applications to be compatible with all versions of a particular CUDA major release.
  • CUDA 11.1 adds a new PTX Compiler static library that allows compilation of PTX programs using a set of APIs provided by the library. See https://docs.nvidia.com/cuda/ptx-compiler-api/index.html for details.
  • Added the 7.1 version of the Parallel Thread Execution instruction set architecture (ISA). For more details on new (sm_86 target, mma.sp) and deprecated instructions, see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-1 in the PTX documentation.
  • Added support for Fedora 32 and Debian 10.3 Buster on x86_64 platforms.
  • Unified programming model for:
    • async-copy
    • async-pipeline
    • async-barrier (cuda::barrier)
  • Added hardware accelerated sparse texture support.
  • Added support for read-only mapping for cudaHostRegister.
  • Multi-threaded launch to different CUDA streams is supported.
  • CUDA Graphs enhancements:
    • improved graphExec update
    • external dependencies
    • extended memcopy APIs
    • presubmit
  • Introduced a new system-level interface using /dev-based capabilities for cgroups-style isolation with MIG.
  • Improved MPS error handling when using multiple GPUs.
  • A fatal GPU exception generated by a Volta+ MPS client will be contained within the devices affected by it and other clients using those devices. Clients running on the other devices managed by the same MPS server can continue running as normal.
  • Users can now configure and query the per-context time slice duration for a GPU via nvidia-smi. Configuring the time slice will require administrator privileges and the allowed settings are default, short, medium and long. The time slice will only be applicable to CUDA applications that are executed after the configuration is applied.
  • Improved detection and reporting of unsupported configurations.
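
The minor-release compatibility item above can be checked programmatically. The following is a minimal sketch, assuming the usual CUDA version encoding of 1000 * major + 10 * minor (so 11.1 is 11010), which is the form reported by cudaRuntimeGetVersion() and cudaDriverGetVersion(). The CudaVersion type and both helper functions are hypothetical names introduced for illustration, and the same-major test is a simplification: the actual compatibility rules also depend on a minimum driver version.

```cpp
// Hypothetical helper type: a CUDA version split into its parts.
struct CudaVersion {
    int major;
    int minor;
};

// Decode the integer encoding 1000 * major + 10 * minor (e.g. 11010 -> 11.1).
constexpr CudaVersion decode_cuda_version(int encoded) {
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// Simplified check (an assumption, not the full rule): two versions are
// treated as compatible when they belong to the same major release.
constexpr bool same_major_release(int built_against, int installed) {
    return decode_cuda_version(built_against).major ==
           decode_cuda_version(installed).major;
}

// Compile-time sanity checks using literal stand-ins for real query results.
static_assert(decode_cuda_version(11010).major == 11, "11010 decodes to major 11");
static_assert(decode_cuda_version(11010).minor == 1, "11010 decodes to minor 1");
static_assert(same_major_release(11010, 11000), "11.1 and 11.0 share a major release");
static_assert(!same_major_release(11010, 10020), "11.1 and 10.2 do not");
```

In a real application, the integer arguments would come from the version queries named above rather than from literals.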

Complete release notes can be found here.