Nvidia CUDA Toolkit

The CUDA Installers include the CUDA Toolkit, SDK code samples, and developer drivers.



  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation


  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • Nvidia Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with larger numbers of same-size/format textures, with higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and Mac OS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features
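
The Thrust primitives mentioned above can be sketched briefly. This is an illustrative example, not taken from the release notes; it assumes a CUDA-capable system and compilation with nvcc:

```cuda
// Sketch: using the bundled Thrust library's templated primitives
// (sort and reduce) on data living in GPU memory.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    // Fill a device vector with a few unsorted values.
    thrust::device_vector<int> d(4);
    d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 1;

    thrust::sort(d.begin(), d.end());             // parallel sort on the GPU
    int sum = thrust::reduce(d.begin(), d.end(),  // parallel reduction
                             0, thrust::plus<int>());
    printf("sum = %d\n", sum);  // 3 + 1 + 4 + 1 = 9
    return 0;
}
```

Because Thrust mirrors the C++ standard library's iterator style, the same calls work on host vectors as well, which makes porting CPU code incremental.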

What's New:

cuBLAS Library

This update contains performance enhancements and bug fixes for the cuBLAS library in CUDA Toolkit 8. Deep learning applications based on Recurrent Neural Networks (RNNs) and Fully Connected Networks (FCNs) will benefit from new GEMM kernels and improved heuristics in this release.

This update supports the x86_64 architecture on Linux, Windows, and Mac OS operating systems, and the ppc64le architecture on Linux only.

The highlights of this update are as follows:

  • Performance enhancements for GEMM matrices used in speech and natural language processing
  • Integration of OpenAI GEMM kernels
  • Improved GEMM heuristics to select optimized algorithms for given input sizes
  • Heuristic fixes for batched GEMMs
  • GEMM performance bug fixes for Pascal and Kepler platforms

CUDA Tools

  • CUDA Compilers. The CUDA compiler now supports Xcode 8.2.1.
  • NVRTC. NVRTC is no longer considered a preview feature.

CUDA Libraries

  • cuBLAS. The cuBLAS library added a new function, cublasGemmEx(), which is an extension of cublas<t>gemm(). It allows the user to specify the algorithm, as well as the precision of the computation and of the input and output matrices. The function can be used to perform matrix-matrix multiplication at lower precision.
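
A rough sketch of how cublasGemmEx() might be called follows. This is an assumption-laden illustration (the matrix shapes, column-major leading dimensions, and the choice of FP16 inputs with FP32 accumulation are this example's, not the release notes'):

```cuda
// Sketch: mixed-precision GEMM via cublasGemmEx() -- FP16 input matrices,
// FP32 output and FP32 compute, with the algorithm left to cuBLAS.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (m x n) = A (m x k) * B (k x n), all matrices column-major on the device.
cublasStatus_t gemm_fp16_fp32(cublasHandle_t handle, int m, int n, int k,
                              const __half *A, const __half *B, float *C) {
    const float alpha = 1.0f, beta = 0.0f;
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, m,   // A: FP16, leading dimension m
                        B, CUDA_R_16F, k,   // B: FP16, leading dimension k
                        &beta,
                        C, CUDA_R_32F, m,   // C: FP32, leading dimension m
                        CUDA_R_32F,         // precision of the computation
                        CUBLAS_GEMM_DFALT); // let cuBLAS choose the algorithm
}
```

Passing an explicit cublasGemmAlgo_t value instead of the default lets the caller pin a specific kernel when the built-in heuristics pick poorly for a given size.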

Resolved Issues

General CUDA

  • CUDA Installer. On some SLES or openSUSE system configurations, the NVIDIA GL library package may need to be locked before the steps for a GL-less installation are performed. The NVIDIA GL library package can be locked with this command:
    sudo zypper addlock nvidia-glG04
  • Unified memory. On GP10x systems, applications that use cudaMallocManaged() and attempt to use cuda-gdb will incur random spurious MMU faults that will take down the application.
  • Unified memory. Functions cudaMallocHost() and cudaHostRegister() don't work correctly on multi-GPU systems with the IOMMU enabled on Linux. The only workaround is to disable unified memory support with the CUDA_DISABLE_UNIFIED_MEMORY=1 environment variable.
  • Unified memory. Fixed an issue where cuda-gdb or cuda-memcheck would crash when used on an application that calls cudaMemPrefetchAsync().
  • Unified memory. Fixed a potential issue that can cause an application to hang when using cudaMemPrefetchAsync().

CUDA Tools

  • CUDA Compilers. Fixed an issue with wrong code generation for computing the address of an array when using a 64-bit index.
  • CUDA Compilers. When a program is compiled with whole program optimization, applying launch bounds to recursive functions or to indirect function calls may have unpredictable results.
  • CUDA Profiler. The PC sampling warp state counts were incorrect in some cases.
  • CUDA Profiler. Profiling applications using nvprof or Visual Profiler on systems without an NVIDIA driver resulted in an error. This is now reported as a warning.
  • cuSOLVER. Fixed an issue with the cuSOLVER library where some of its functions were not exposed, resulting in link errors.
  • NVTX. The NVIDIA Tools Extension SDK (NVTX) function nvtxGetExportTable() was missing from the export table list.

CUDA Libraries

  • cuBLAS. Fixed GEMM performance issues on Kepler and Pascal for different matrix sizes, including small batches. Note that this fix is available only in the cuBLAS packages on the CUDA network repository.
  • cuBLAS. Updated the cuBLAS headers to use comments that are in compliance with ANSI C standards.
  • cuBLAS. Made optimizations for mixed-precision (FP16, INT8) matrix-matrix multiplication of matrices with a small number of columns (n).
  • cuBLAS. Fixed an issue with the trsm() function for large-sized matrices.

Known Issues

General CUDA

  • CUDA library. The function cuDeviceGetP2PAttribute() was not published in the CUDA driver library (libcuda.so). Until a new build of the toolkit is issued, users can either use the runtime version, cudaDeviceGetP2PAttribute(), or link against libcuda directly instead of the stub (usually by adding -L/usr/lib64).

CUDA Tools

  • CUDA Profiler. When a device is in the "exclusive" process compute mode, the profiler may fail to collect events or metrics in "application replay" mode. In this case, use "kernel replay" mode.
  • CUDA Profiler. In the Visual Profiler, the Run > Configure Metrics and Events... dialog does not work for the device that has NVLink support. It's suggested to collect all metrics and events using nvprof and then import into nvvp.
  • CUDA Profiler, CUPTI. Some devices with compute capability 6.1 don't support multi-context scope collection for metrics. This issue affects nvprof, Visual Profiler, and CUPTI.
