Optimising black-box CUDA kernels

Prabindh Sundareson
2 min read · Nov 24, 2023

Optimising a black-box CUDA kernel, without knowledge of the algorithm it implements, is not an easy task. But sometimes the task comes up anyway. Here are some tips on how to go about optimising such a kernel.

  1. Start with the right profiling tools: Profiling the existing kernel to identify its bottlenecks is the most important task. Start with Nsight Compute (ncu on the command line, or ncu-ui) to obtain metrics and bottleneck areas for each kernel. If the target workload consists of multiple dependent kernels, they have to be partitioned for profiling using replay modes, which are described in the documentation: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#range-replay
  2. Start with the right input data and benchmark data: Identify the target use-cases and the benchmark datasets that each use-case requires.
  3. Identify the target GPUs and their capabilities, for example Tensor Cores and other hardware engines. The CUDA capabilities of a GPU can be obtained from the documentation or from tools like https://prabindh.github.io/mygpu/
  4. Implement and benchmark iteratively: Depending on the bottleneck, apply optimisations one at a time and re-profile after each. These can include, for example, moving to asynchronous / concurrent operation using multiple CUDA streams (e.g. https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf ). Other options are to use the latest warp-level primitives (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/) and to tune the grid/block assignments depending on how heavily the SMs are loaded.
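
Tips 3 and 4 can be sketched in one small program. Since the black-box kernel itself is unknown, this uses a stand-in: a simple sum reduction. The names sum_kernel, warp_reduce_sum, sum_on_two_streams and the CHECK macro are hypothetical and chosen only for illustration; the input is assumed to be non-empty. The sketch queries the device capabilities, reduces within each warp using __shfl_down_sync, and splits the work across two CUDA streams.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            std::exit(1);                                                  \
        }                                                                  \
    } while (0)

// Warp-level sum: after the loop, lane 0 holds the sum of all 32 lanes.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Stand-in kernel (an assumption, not the black-box kernel): each warp sums
// its elements with shuffles, then lane 0 adds the warp total to *out.
__global__ void sum_kernel(const float* in, float* out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads contribute 0 but still take part in the shuffle.
    float v = (i < n) ? in[i] : 0.0f;
    v = warp_reduce_sum(v);
    if ((threadIdx.x & (warpSize - 1)) == 0) atomicAdd(out, v);
}

// Splits the input into two halves and processes each half on its own stream.
// Note: host is pageable memory, so the copies are not truly asynchronous;
// pinned memory (cudaMallocHost) would be needed for real copy/compute overlap.
float sum_on_two_streams(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());  // assumed > 0
    const int offsets[2] = {0, n / 2};
    const int counts[2] = {n / 2, n - n / 2};
    const int threads = 256;  // multiple of the warp size, so every warp is full

    float *d_in = nullptr, *d_out = nullptr;
    CHECK(cudaMalloc(&d_in, n * sizeof(float)));
    CHECK(cudaMalloc(&d_out, 2 * sizeof(float)));

    cudaStream_t streams[2];
    for (int k = 0; k < 2; ++k) {
        CHECK(cudaStreamCreate(&streams[k]));
        CHECK(cudaMemsetAsync(d_out + k, 0, sizeof(float), streams[k]));
        if (counts[k] == 0) continue;
        CHECK(cudaMemcpyAsync(d_in + offsets[k], host.data() + offsets[k],
                              counts[k] * sizeof(float),
                              cudaMemcpyHostToDevice, streams[k]));
        const int blocks = (counts[k] + threads - 1) / threads;
        sum_kernel<<<blocks, threads, 0, streams[k]>>>(
            d_in + offsets[k], d_out + k, counts[k]);
        CHECK(cudaGetLastError());
    }

    float partial[2] = {0.0f, 0.0f};
    for (int k = 0; k < 2; ++k) {
        CHECK(cudaMemcpyAsync(&partial[k], d_out + k, sizeof(float),
                              cudaMemcpyDeviceToHost, streams[k]));
        CHECK(cudaStreamSynchronize(streams[k]));
        CHECK(cudaStreamDestroy(streams[k]));
    }
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    return partial[0] + partial[1];
}

int main() {
    // Tip 3: query what the target GPU offers before choosing optimisations.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("%s: compute capability %d.%d, %d SMs\n", prop.name,
                prop.major, prop.minor, prop.multiProcessorCount);

    // Tip 4: warp primitives plus two streams, checked against a known answer.
    const std::vector<float> ones(1 << 20, 1.0f);
    const float total = sum_on_two_streams(ones);
    std::printf("sum = %.0f\n", total);
    assert(total == 1048576.0f);  // exact: every partial sum is an integer below 2^24
    return 0;
}
```

Assuming the file is saved as sum_streams.cu, it can be built with `nvcc -o sum_streams sum_streams.cu` and then profiled as in tip 1, for instance with `ncu --set full -o report ./sum_streams`. Whether two streams actually help depends on the workload, so each such change should be re-measured against the benchmark data from tip 2.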
