Python Edition: Fundamentals of Accelerated Computing with Modern CUDA

In this course, you’ll learn how to make Python fly with accelerated computing! Building on proven curricula from CUDA Python and modern CUDA C++ workshops, the tutorial uses CuPy for drop‑in NumPy acceleration, Numba CUDA for handcrafted kernels, nvmath‑python for fast math primitives, and the new cuda.cooperative APIs for cross‑block collaboration. Participants will explore GPU thread hierarchies, shared‑memory tiling, memory‑coalescing strategies, and other fundamentals that underlie high‑performance GPU code, all delivered through a Python‑first lens that preserves the language’s renowned readability and popularity.
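As a taste of the drop‑in acceleration style described above, the snippet below computes a squared‑distance matrix with NumPy broadcasting. With a GPU available, importing CuPy in place of NumPy runs the identical code on the device; the function name here is illustrative, not taken from the course material.

```python
import numpy as np

# With a GPU and CuPy installed you could instead write:
#   import cupy as xp
# and the identical code below would execute on the device.
xp = np

def pairwise_sq_dists(points):
    """Squared Euclidean distance matrix via broadcasting."""
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Three 2-D points: (0, 1), (2, 3), (4, 5)
pts = xp.arange(6.0).reshape(3, 2)
d = pairwise_sq_dists(pts)
```

Because CuPy mirrors the NumPy API, this "swap the import" pattern is often the first optimization step before reaching for custom kernels.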

  • Sept 23
    Magazinet Kongsberg
    1 day
    07:00 - 15:00 UTC
    Bryce Adelstein Lelbach

The learning happens entirely inside interactive Jupyter notebooks, where you can tweak parameters, rerun cells, and visualize results in real time. Step‑by‑step labs culminate in profile‑driven tuning sessions that capture execution traces with NVIDIA Nsight Systems, spotlight memory bottlenecks, and quantify the speed‑ups your optimizations unlock, mirroring the disciplined workflow championed in advanced CUDA C++ training. By the end, you’ll walk away with a practical toolkit for transforming everyday Python scripts into GPU‑powered engines and a systematic approach to squeezing every last flop from modern accelerators.

In this class, you'll learn to:

- Spot workloads ripe for GPU speed‑ups and explain the CUDA thread-block-grid model.
- Swap in CuPy or Numba to accelerate NumPy code with minimal changes.
- Write and coordinate custom CUDA kernels, including cuda.cooperative launches, entirely in Python.
- Maximize throughput via coalesced memory access, shared‑memory tiling, and lean host/device transfers.
- Profile, diagnose, and iterate on performance using Nsight Systems directly from Jupyter notebooks.
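To make the thread-block-grid model from the first objective concrete, here is a plain‑Python sketch (no GPU required) of the global‑index arithmetic every 1‑D CUDA kernel relies on: each thread computes `i = blockIdx * blockDim + threadIdx` and guards against running past the end of the array. The function name and the sequential loops are illustrative; on real hardware every (block, thread) pair runs in parallel.

```python
def emulate_1d_launch(n, threads_per_block=4):
    """Emulate a 1-D grid: record which (block, thread) pair covers each index."""
    # Ceiling division, as in a typical CUDA launch configuration.
    blocks = (n + threads_per_block - 1) // threads_per_block
    coverage = {}
    for block_idx in range(blocks):                   # hardware: blocks run in parallel
        for thread_idx in range(threads_per_block):   # hardware: so do threads
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # bounds guard, essential when n % threads_per_block != 0
                coverage[i] = (block_idx, thread_idx)
    return coverage

cov = emulate_1d_launch(10)
```

Note that 10 elements with 4 threads per block need 3 blocks, and the last block leaves two of its threads idle behind the bounds guard; this is exactly the pattern a Numba CUDA kernel expresses with `cuda.grid(1)`.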

The course material is open source and can be found here: https://github.com/NVIDIA/accelerated-computing-hub/tree/main/gpu-python-tutorial
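The shared‑memory tiling strategy listed among the objectives can be sketched in NumPy as blocked matrix multiplication: each output tile is accumulated from sub‑blocks of the inputs, mirroring how a CUDA thread block stages tiles in shared memory for reuse. This is a CPU‑side sketch under that analogy, not the course's actual kernel code.

```python
import numpy as np

def tiled_matmul(a, b, tile=2):
    """Blocked matrix multiply. On a GPU, each (i, j) output tile maps to a
    thread block, and the a/b sub-blocks would be staged in shared memory."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each sub-block is loaded once and reused across the whole
                # tile, cutting redundant global-memory traffic.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.arange(16.0).reshape(4, 4)
c = tiled_matmul(a, np.eye(4))
```

The payoff on a GPU is that each input element is read from global memory once per tile rather than once per output element, which is the memory‑traffic reduction the course quantifies with Nsight Systems.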

Bryce Adelstein Lelbach
Principal Architect at NVIDIA

Bryce Adelstein Lelbach has spent over a decade developing programming languages, compilers, and software libraries. He is passionate about parallel programming and strives to make it more accessible for everyone.

Bryce is a Principal Architect at NVIDIA, where he leads programming language efforts and drives the technical roadmap for NVIDIA's compute compilers and libraries.

He is one of the leaders of the systems programming language community, having served as chair of the Standard C++ Library Evolution group and the US standards committee for programming languages (INCITS/PL22). He has been an organizer and program chair for many conferences over the years.

On the C++ Committee, he has personally worked on concurrency primitives, parallel algorithms, executors, and multidimensional arrays. He is one of the founding developers of the HPX parallel runtime system.

Outside of work, Bryce is passionate about airplanes and watches. He lives in Midtown Manhattan with his girlfriend and dog.
