Julia performance - Threads & GPU

Thu, Mar 25, 2021 3-minute read

This article gives real-life examples of how to improve performance by using multithreaded execution and GPU compute.

First, some scientific background

I have recently played around with scientific simulations for bond-resolved atomic force microscopy (AFM). If you do not know what this is, just a one-liner explanation: local force measurements can be used to image the bond-structure of single molecules. More information can be found in the seminal paper by Leo Gross et al.

Now the whole imaging mechanism is a little complicated, but very useful models have been developed to simulate and analyze such images. One mile stone was the introduction of the Probe Particle Model, developed by Prokop Hapala, Pavel Jelinek et al., see here and here.

AFM simulation of an olympicene-like molecule
Probe Particle Model simulation of an AFM image of a single olympicene-like molecule (image width: 1.6 nm)

In this model a force field is calculated (Lennard Jones and Coulomb forces) and the probe particle (which is attached to the atomic force microscopy tip) is relaxed within this force field, i.e. it can respond to the forces and shift out of its starting position. The original code that has been made open source (highly appreciated!) is programmed in Python and C.

Ok, so finally we talk about code

I have implemented some basic parts of it in Julia and want to talk a little bit about performance. The original code runs relatively slow for a 100x100x100 grid of points - so the first most obvious thing was to use multi-threading. This is actually very easy to add!

In the simple code there is some main loop that looks like this:

for i in eachindex(grid)
    # computation for each point in the grid
    # ...
end

All I had to do was to add the @Threads.threads macro:

@Threads.threads for i in eachindex(grid)
    # computation for each point in the grid
    # ...
end

That’s it! Now just run julia with threading support: julia --threads 4.

On my old rusty i5-3570K with four threads this approach will lead to a speedup by a factor of 3.5! Isn’t that awesome?

The grand finale: GPUs for the win!

The loop above can also be written in one function that broadcasts over grid. You will see in a moment why this is a convenient way for GPU computation.

forcefield = force_at_point.(grid, (parameter1, ), (parameter2, ))

The brackets around parameter1 and parameter2 are there to avoid broadcasting over those variables.

But now let’s get to the pralellization on the GPU. We will make use of CUDA arrays that work with NVIDIA cards - I have a NVIDIA GeForce GTX 1080 in my PC.

using CUDA
grid_cu = cu(grid)  # creates a CUDA array
forcefield = force_at_point.(grid_cu, (parameter1, ), (parameter2, ))

Broadcasting will automatically work on CUDA arrays. It can be as easy as that! In some cases you might want to write a GPU kernel function. OK, so how much faster did we get? A whopping 50-fold increase in performance compared to the single-threaded calculation.