  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation


  • Easier Application Porting
    • Share GPUs across multiple threads
    • Use all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with sets of same size/format textures, at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automated Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and Mac OS X
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features

What's New:

16-bit floating point (FP16) data format

  • Store up to 2x larger datasets in GPU memory
  • Reduce memory bandwidth requirements by up to 2x
  • New mixed-precision cublasSgemmEx() routine supports 2x larger matrices
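
The "up to 2x" figures follow from the format itself: an FP16 value occupies 16 bits where a single-precision float occupies 32. The host-side sketch below illustrates this by packing floats into 16-bit storage with round-to-nearest-even. It is not toolkit code; the names float_to_half, half_to_float and to_half are hypothetical helpers introduced for illustration, and it assumes a 32-bit IEEE 754 float. GPU code would normally rely on the toolkit's own FP16 type and conversions instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit IEEE 754 float");

// Convert a 32-bit float to FP16 bits, rounding to nearest even.
// (Hypothetical helper, not part of the CUDA Toolkit.)
inline std::uint16_t float_to_half(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t mag = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) return static_cast<std::uint16_t>(sign | 0x7E00u);   // NaN
    if (mag >= 0x47800000u) return static_cast<std::uint16_t>(sign | 0x7C00u);  // Inf, or >= 65536
    if (mag < 0x33000000u) return static_cast<std::uint16_t>(sign);             // below 2^-25: zero

    std::uint32_t h, rem, halfway;
    if (mag < 0x38800000u) {
        // Below 2^-14: the result is an FP16 subnormal (units of 2^-24).
        const std::uint32_t shift = 126u - (mag >> 23);            // 14..24
        const std::uint32_t mant = (mag & 0x7FFFFFu) | 0x800000u;  // restore implicit 1
        h = mant >> shift;
        rem = mant & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1u);
    } else {
        // Normal range: rebias the exponent (127 -> 15), keep the top 10 mantissa bits.
        h = (((mag >> 23) - 112u) << 10) | ((mag >> 13) & 0x3FFu);
        rem = mag & 0x1FFFu;
        halfway = 0x1000u;
    }
    if (rem > halfway || (rem == halfway && (h & 1u))) ++h;  // ties go to even
    assert(h <= 0x7C00u);  // rounding can carry at most into the Inf encoding
    return static_cast<std::uint16_t>(sign | h);
}

// Expand FP16 bits back to a 32-bit float (exact, since FP16 is a subset).
inline float half_to_float(std::uint16_t h) {
    const int exp = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float v;
    if (exp == 0x1F) {
        v = mant ? std::numeric_limits<float>::quiet_NaN()
                 : std::numeric_limits<float>::infinity();
    } else if (exp == 0) {
        v = std::ldexp(static_cast<float>(mant), -24);              // zero or subnormal
    } else {
        v = std::ldexp(static_cast<float>(1024 + mant), exp - 25);  // normal
    }
    return (h & 0x8000u) ? -v : v;
}

// Pack a float array into FP16 storage: same element count, half the bytes.
inline std::vector<std::uint16_t> to_half(const std::vector<float>& src) {
    std::vector<std::uint16_t> dst;
    dst.reserve(src.size());
    for (float f : src) dst.push_back(float_to_half(f));
    return dst;
}
```

The trade-off is precision and range: FP16 keeps roughly three decimal digits and saturates to infinity at 65536 and above, so it suits data that tolerates that loss.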

New cuSPARSE GEMVI routines

  • Optimized dense matrix × sparse vector routines, well suited to Natural Language Processing
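
For reference, the operation behind a dense matrix × sparse vector product is y = alpha * A * x + beta * y, where only the nonzero entries of x are stored. The plain C++ sketch below shows that arithmetic on the CPU. It is not the cuSPARSE API: the SparseVector struct, the gemvi_reference name and the row-major layout are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse vector in coordinate form: value[k] is the entry at position index[k].
// (Hypothetical type for this sketch, not a cuSPARSE structure.)
struct SparseVector {
    std::vector<std::size_t> index;
    std::vector<float> value;
};

// Reference computation of y = alpha * A * x + beta * y.
// A is dense, rows x cols, stored row-major (an assumption of this sketch).
void gemvi_reference(std::size_t rows, std::size_t cols, float alpha,
                     const std::vector<float>& A, const SparseVector& x,
                     float beta, std::vector<float>& y) {
    assert(A.size() == rows * cols);
    assert(y.size() == rows);
    assert(x.index.size() == x.value.size());

    for (std::size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        // Only the stored nonzeros of x contribute, so the work per row is
        // proportional to the number of nonzeros rather than to cols.
        for (std::size_t k = 0; k < x.index.size(); ++k) {
            assert(x.index[k] < cols);
            acc += A[i * cols + x.index[k]] * x.value[k];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```

In a language-processing setting, x could be a bag-of-words vector in which a handful of vocabulary entries are nonzero, and A a dense weight matrix; skipping the zero entries is where the saving comes from.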

Instruction-level profiling helps pinpoint performance bottlenecks

  • Quickly identify the specific lines of source code limiting the performance of GPU code

Apply advanced performance optimizations more easily

Previous version 6.5.14: