Optimising black-box CUDA kernels
Nov 24, 2023

Optimising a black-box CUDA kernel is not an easy task without knowledge of the algorithm the kernel implements, but the task does come up from time to time. Here are some tips on how to go about optimising such a kernel.
- Start with the right profiling tools — Profiling the existing kernel to identify its bottlenecks is the most important task. Start with Nsight Compute (ncu or ncu-ui) to obtain metrics and locate the bottleneck areas for the various kernels. If the target kernel consists of multiple dependent kernels, they have to be partitioned for profiling using replay modes, which are described in the documentation: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#range-replay (see the range-marking sketch after this list).
- Start with the right input data and benchmark data — Identify the target use-cases and the benchmark datasets required for each of them.
- Identify the target GPUs and their capabilities — for example, tensor cores and HW engines. The CUDA capabilities of a GPU can be obtained from documentation or tools like https://prabindh.github.io/mygpu/, or queried at runtime (see the device-query sketch after this list).
- Implement and benchmark iteratively — Depending on the bottleneck, perform iterative optimisations. These can include, for example, shifting to asynchronous/concurrent execution using multiple CUDA streams (see https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf and the streams sketch after this list). Other areas include using the latest warp-level primitives (https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/) and tuning the grid/block configuration according to how the SMs are loaded.
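
For the profiling tip above, here is a minimal sketch of bracketing a group of dependent kernels so they can be captured as one range. The kernels (kernelA, kernelB), data size and launch configuration are hypothetical placeholders; the relevant part is the cudaProfilerStart/cudaProfilerStop pair, which the linked profiling guide describes as one way to delimit a range for Nsight Compute's range replay.

```cuda
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the dependent kernels of the black-box pipeline.
__global__ void kernelA(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void kernelB(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // Everything between cudaProfilerStart/Stop forms a single range that
    // Nsight Compute's range replay can profile as one unit, so the two
    // dependent kernels are measured together rather than replayed one by one.
    cudaProfilerStart();
    kernelA<<<(n + 255) / 256, 256>>>(d_data, n);
    kernelB<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaProfilerStop();

    cudaFree(d_data);
    return 0;
}
```

The application is then profiled with a range-based replay mode selected via the ncu command line, as per the replay-mode section of the profiling guide linked above.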
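For the GPU-capability tip, a small sketch that queries the installed devices at runtime via cudaGetDeviceProperties, instead of (or in addition to) consulting documentation. The fields printed here are only a sample of what the properties struct exposes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // Compute capability indicates which hardware features (e.g. tensor
        // cores) the optimisation can rely on for this device.
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  SMs: %d, shared mem per block: %zu bytes, warp size: %d\n",
               prop.multiProcessorCount, prop.sharedMemPerBlock, prop.warpSize);
    }
    return 0;
}
```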
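For the streams suggestion in the last tip, a minimal sketch of splitting work across multiple CUDA streams so that host-device copies and kernel launches from different streams can overlap. The stage kernel, chunk size and stream count are hypothetical; a real black-box kernel would be launched in place of stage.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one stage of the black-box pipeline.
__global__ void stage(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const int nStreams = 4;
    const int chunk = 1 << 20;

    // Pinned host memory is needed for cudaMemcpyAsync to actually overlap
    // transfers with kernel execution.
    float* h_data = nullptr;
    cudaMallocHost(&h_data, nStreams * chunk * sizeof(float));
    float* d_data = nullptr;
    cudaMalloc(&d_data, nStreams * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    // Each stream copies its chunk in, runs the kernel, and copies the result
    // back. Work issued to different streams can overlap, hiding transfer
    // latency behind computation.
    for (int i = 0; i < nStreams; ++i) {
        float* h = h_data + i * chunk;
        float* d = d_data + i * chunk;
        cudaMemcpyAsync(d, h, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        stage<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Whether this helps depends on the bottleneck found during profiling: overlapping streams mainly pays off when transfers or independent kernels can run concurrently, which is exactly what the iterative benchmarking step should confirm.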