  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation


  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and MacOS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 for Windows developers, with new debugging and profiling features

What's New:

This section summarizes the changes in CUDA 11.2.1 (11.2 Update 1) since the 11.2.0 GA release.

CUDA Compiler

Resolved Issues:

  • Previously, with recent versions of the VS 2019 host compiler, a call to pow(double, int) or pow(float, int) in host or device code sometimes caused build failures. This issue has been resolved.


cuSOLVER

New Features:

  • A new singular value decomposition routine (GESVDR) has been added. GESVDR computes a partial spectrum using random sampling and is an order of magnitude faster than GESVD.
  • libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so but breaks backward compatibility: users must link libcusolver.so against the matching version of libcublas.so.


cuSPARSE

New Features:

  • New Tensor Core-accelerated Block Sparse Matrix - Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
  • New algorithms for CSR/COO Sparse Matrix - Vector Multiplication (cusparseSpMV) with better performance.
  • Extended functionality for cusparseSpMV:
    • Support for the CSC format.
    • Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
    • Support for mixed regular-complex data type computation.
    • Support for deterministic and non-deterministic computation.
  • New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix - Matrix Multiplication (cusparseSpMM) with better performance, especially for small matrices.
  • New routine for Sampled Dense Matrix - Dense Matrix Multiplication (cusparseSDDMM), which deprecates cusparseConstrainedGeMM and provides better performance.
  • Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
  • All routines support NVTX annotation for enhancing the profiler time line on complex applications.
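To make the Blocked-Ellpack idea more concrete, here is a minimal CPU reference sketch of a block-sparse matrix times dense matrix product. It is not the cuSPARSE implementation and uses no Tensor Cores. The `BlockedEll` struct, its field names, the row-major layout, and the function `spmmBlockedEll` are hypothetical and chosen only for illustration. The assumption is that every block-row stores the same number of square blocks, with a block-column index of -1 marking padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for a Blocked-ELL matrix (row-major).
struct BlockedEll {
    int rows;       // matrix rows, a multiple of `block`
    int cols;       // matrix columns, a multiple of `block`
    int block;      // edge length of each square block
    int ellBlocks;  // blocks stored per block-row (fixed, padded)
    std::vector<int> blockCol;  // (rows/block) x ellBlocks block-column indices, -1 = padding
    std::vector<float> values;  // rows x (ellBlocks*block) stored values, row-major
};

// Computes C = A * B, where B is a dense cols x n matrix and C is rows x n (both row-major).
std::vector<float> spmmBlockedEll(const BlockedEll& A, const std::vector<float>& B, int n) {
    assert(A.block > 0 && A.rows % A.block == 0 && A.cols % A.block == 0);
    assert(B.size() == static_cast<std::size_t>(A.cols) * n);
    const int ellCols = A.ellBlocks * A.block;  // width of the stored value array
    std::vector<float> C(static_cast<std::size_t>(A.rows) * n, 0.0f);
    for (int i = 0; i < A.rows; ++i) {
        const int blockRow = i / A.block;
        for (int s = 0; s < A.ellBlocks; ++s) {
            const int bc = A.blockCol[blockRow * A.ellBlocks + s];
            if (bc < 0) continue;  // padding slot: contributes nothing
            for (int k = 0; k < A.block; ++k) {
                const float a = A.values[static_cast<std::size_t>(i) * ellCols + s * A.block + k];
                const int col = bc * A.block + k;  // column in the full matrix
                for (int j = 0; j < n; ++j)
                    C[static_cast<std::size_t>(i) * n + j] += a * B[static_cast<std::size_t>(col) * n + j];
            }
        }
    }
    return C;
}
```

Because every block-row has the same stored width, the value array is a regular dense array. That regularity is the property that makes this kind of format attractive for block-oriented hardware.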


Deprecations:

  • cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
  • cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
  • The COO Array of Structures (CooAoS) format has been deprecated, including cusparseCreateCooAoS, cusparseCooAoSGet, and its support in cusparseSpMV.
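For readers migrating to cusparseSDDMM, the operation can be sketched on the CPU as a dense product that is evaluated only at the positions already present in a sparse output matrix. The sketch below is a reference under stated assumptions, not the cuSPARSE API: the `Csr` struct, the function `sddmmReference`, the row-major dense layout, and the formula C = alpha * (A * B) sampled at C's pattern + beta * C are illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container; only the sparsity pattern and values of C are needed.
struct Csr {
    int rows, cols;
    std::vector<int> rowPtr;  // size rows + 1
    std::vector<int> colInd;  // size nnz
    std::vector<float> val;   // size nnz
};

// Updates C in place: for each stored (i, j), C(i,j) = alpha * dot(A[i,:], B[:,j]) + beta * C(i,j).
// A is rows x k and B is k x cols, both dense and row-major.
void sddmmReference(float alpha, const std::vector<float>& A, const std::vector<float>& B,
                    int k, float beta, Csr& C) {
    assert(A.size() == static_cast<std::size_t>(C.rows) * k);
    assert(B.size() == static_cast<std::size_t>(k) * C.cols);
    assert(C.rowPtr.size() == static_cast<std::size_t>(C.rows) + 1);
    for (int i = 0; i < C.rows; ++i) {
        for (int p = C.rowPtr[i]; p < C.rowPtr[i + 1]; ++p) {
            const int j = C.colInd[p];
            float dot = 0.0f;
            for (int t = 0; t < k; ++t)
                dot += A[static_cast<std::size_t>(i) * k + t] * B[static_cast<std::size_t>(t) * C.cols + j];
            C.val[p] = alpha * dot + beta * C.val[p];
        }
    }
}
```

Only the stored entries of C are computed, so the cost scales with the number of nonzeros times k rather than with the full dense product.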

Known Issues:

  • Calling cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, or cusparseDestroy with a NULL argument could cause a segmentation fault on Windows.

Resolved Issues:

  • cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
  • cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
  • cusparseXcsr2csr_compress now uses 2-norm for the comparison of complex values instead of only the real part.
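The difference between comparing only the real part and comparing the 2-norm (magnitude) of a complex value can be illustrated with a small host-side sketch. This is not the cuSPARSE code. The helper names and the exact rule used here (keep a value when the compared quantity is strictly greater than the tolerance) are assumptions made for illustration.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using Complex = std::complex<float>;

// Old-style test (as described above): looks only at the real part.
bool keepByRealPart(Complex v, float tol) { return std::abs(v.real()) > tol; }

// New-style test: uses the magnitude sqrt(re^2 + im^2).
bool keepByNorm(Complex v, float tol) { return std::abs(v) > tol; }

// Returns the values that pass the given keep-test; everything else is dropped.
std::vector<Complex> compressValues(const std::vector<Complex>& vals, float tol,
                                    bool (*keep)(Complex, float)) {
    assert(tol >= 0.0f);
    std::vector<Complex> out;
    for (const Complex& v : vals)
        if (keep(v, tol)) out.push_back(v);
    return out;
}
```

With a tolerance of 1, a purely imaginary entry such as 5i is dropped by the real-part test but kept by the magnitude test, which is the kind of case the fix addresses.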

NPP

New Features:

  • New APIs added to compute the distance transform using the Parallel Banding Algorithm (PBA):
    • nppiDistanceTransformPBA_xxxxx_C1R_Ctx(), where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
    • nppiSignedDistanceTransformPBA_32f_C1R_Ctx()
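As a point of reference for what a distance transform produces, the following sketch computes an exact Euclidean distance transform by brute force on the CPU. It is not the Parallel Banding Algorithm and does not call NPP. The function name and the convention that zero-valued pixels are the "sites" are assumptions made for illustration. A brute-force version like this is mainly useful for checking GPU results on small images.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// For every pixel, returns the Euclidean distance to the nearest site pixel.
// Assumption: a pixel is a site when its value is 0. The image is row-major, width x height.
// If the image contains no site, every distance is +infinity.
std::vector<float> distanceTransformBruteForce(const std::vector<std::uint8_t>& img,
                                               int width, int height) {
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    const float inf = std::numeric_limits<float>::infinity();
    std::vector<float> dist(img.size(), inf);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float bestSq = inf;  // smallest squared distance found so far
            for (int sy = 0; sy < height; ++sy) {
                for (int sx = 0; sx < width; ++sx) {
                    if (img[static_cast<std::size_t>(sy) * width + sx] != 0) continue;  // not a site
                    const float dx = static_cast<float>(x - sx);
                    const float dy = static_cast<float>(y - sy);
                    bestSq = std::min(bestSq, dx * dx + dy * dy);
                }
            }
            dist[static_cast<std::size_t>(y) * width + x] = std::sqrt(bestSq);
        }
    }
    return dist;
}
```

The cost grows with the square of the pixel count, so this is only practical as a test oracle.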

Resolved Issues:

  • Fixed an issue in which Label Markers added zero-valued pixels as an object region.

nvJPEG

New Features:

  • The nvJPEG decoder adds new APIs to support region of interest (ROI) based decoding for the batched hardware decoder:
    • nvjpegDecodeBatchedEx()
    • nvjpegDecodeBatchedSupportedEx()

cuFFT

Known Issues:

  • cuFFT planning and plan estimation functions may not restore the correct context, which affects CUDA driver API applications.
  • Plans that combine strides, prime factors larger than 127 in the FFT size decomposition, and a total transform size (including strides) larger than 32GB produce incorrect results.
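An application that wants to guard against the second known issue could screen its transform sizes before creating a plan. The sketch below is a hypothetical host-side helper, not part of cuFFT. It assumes that the issue requires all three conditions at once, and that "32GB" means 32 * 2^30 bytes. Both are interpretations of the note above, not confirmed details.

```cpp
#include <cassert>
#include <cstdint>

// Returns the largest prime factor of n (1 when n == 1), using trial division.
std::uint64_t largestPrimeFactor(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t largest = 1;
    for (std::uint64_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {  // strip every power of p
            largest = p;
            n /= p;
        }
    }
    if (n > 1) largest = n;  // the remainder is a prime larger than any factor found so far
    return largest;
}

// Hypothetical screen for the known issue: strided plan, a prime factor above 127,
// and a total footprint (including strides) above 32 GiB (assumed unit).
bool mayHitStridedLargePrimeIssue(bool usesStrides, std::uint64_t fftSize,
                                  std::uint64_t totalBytesIncludingStrides) {
    const std::uint64_t limitBytes = 32ull << 30;
    return usesStrides && largestPrimeFactor(fftSize) > 127 &&
           totalBytesIncludingStrides > limitBytes;
}
```

Trial division is adequate here because FFT sizes are small relative to the 64-bit range. The helper is meant to flag plans for extra validation, not to predict cuFFT behavior.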

Resolved Issues:

  • Previously, power-of-2 single-precision FFTs showed reduced performance on GPUs with the sm_86 architecture. This issue has been resolved.
  • Large prime factors in the size decomposition, combined with real-to-complex or complex-to-real FFT types, no longer cause cuFFT plan functions to fail.

CUPTI

Deprecations early notice:

  • The following functions are scheduled to be deprecated in 11.3 and will be removed in a future release:
    • NVPW_MetricsContext_RunScript and NVPW_MetricsContext_ExecScript_Begin from the header nvperf_host.h
    • cuptiDeviceGetTimestamp from the header cupti_events.h

Complete release notes can be found here.