- C/C++ compiler
- Visual Profiler
- GPU-accelerated BLAS library
- GPU-accelerated FFT library
- GPU-accelerated Sparse Matrix library
- GPU-accelerated RNG library
- Additional tools and documentation
- Easier Application Porting
- Share GPUs across multiple threads
- Use all GPUs in the system concurrently from a single host thread
- No-copy pinning of system memory, a faster alternative to cudaMallocHost()
- C++ new/delete and support for virtual functions
- Support for inline PTX assembly
- Thrust library of templated performance primitives such as sort, reduce, etc.
- Nvidia Performance Primitives (NPP) library for image/video processing
- Layered Textures for working with same size/format textures at larger sizes and higher performance
- Faster Multi-GPU Programming
- Unified Virtual Addressing
- GPUDirect v2.0 support for Peer-to-Peer Communication
- New & Improved Developer Tools
- Automated Performance Analysis in Visual Profiler
- C++ debugging in CUDA-GDB for Linux and MacOS
- GPU binary disassembler for Fermi architecture (cuobjdump)
- Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.
This section summarizes the changes in CUDA 11.2.1 (11.2 Update 1) since the 11.2.0 GA release.
- Previously, when using recent versions of VS 2019 host compiler, a call to pow(double, int) or pow(float, int) in host or device code sometimes caused build failures. This issue has been resolved.
- New singular value decomposition (GESVDR) is added. GESVDR computes partial spectrum with random sampling, an order of magnitude faster than GESVD.
- libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so.
- New Tensor Core-accelerated Block Sparse Matrix - Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
- New algorithms for CSR/COO Sparse Matrix - Vector Multiplication (cusparseSpMV) with better performance.
- Extended functionalities for cusparseSpMV:
- Support for the CSC format.
- Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
- Support for mixed regular-complex data type computation.
- Support for deterministic and non-deterministic computation.
- New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix - Matrix Multiplication (cusparseSpMM) with better performance especially for small matrices.
- New routine for Sampled Dense Matrix - Dense Matrix Multiplication (cusparseSDDMM) which deprecated cusparseConstrainedGeMM and provides better performance.
- Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
- All routines support NVTX annotation for enhancing the profiler time line on complex applications.
- cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
- cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
- COO Array of Structure (CooAoS) format has been deprecated including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.
- cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with NULL argument could cause segmentation fault on Windows.
- cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
- cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
- cusparseXcsr2csr_compress now uses 2-norm for the comparison of complex values instead of only the real part.
- NPPNew features:New APIs added to compute Distance Transform using Parallel Banding Algorithm (PBA):
- nppiDistanceTransformPBA_xxxxx_C1R_Ctx() – where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
- Fixed the issue in which Label Markers adds zero pixel as object region.
- nvJPEG decoder added a new API to support region of interest (ROI) based decoding for batched hardware decoder:
- cuFFT planning and plan estimation functions may not restore correct context affecting CUDA driver API applications.
- Plans with strides, primes larger than 127 in FFT size decomposition and total size of transform including strides bigger than 32GB produce incorrect results.
- Previously, reduced performance of power-of-2 single precision FFTs was observed on GPUs with sm_86 architecture. This issue has been resolved.
- Large prime factors in size decomposition and real to complex or complex to real FFT type no longer cause cuFFT plan functions to fail.
- CUPTIDeprecations early notice:The following functions are scheduled to be deprecated in 11.3 and will be removed in a future release:
- NVPW_MetricsContext_RunScript and NVPW_MetricsContext_ExecScript_Begin from the header nvperf_host.h.
- cuptiDeviceGetTimestamp from the header cupti_events.h
Complete release notes can be found here.