Accelerating SPM & fMRI with GPUs for Neuroimaging
The field of neuroimaging is progressing rapidly with state-of-the-art medical image acquisition devices, a plethora of advanced brain imaging techniques, and computationally fast hardware to process and analyze these images. However, a full understanding of how the brain works remains elusive. Neuroimaging studies demand extensive analysis of brain images and simulations over multiple subjects with high-resolution images. Bringing down the time for these simulations would enable neuroimagers to make faster deductions and attempt still larger simulations. The authors studied the performance of a popular MATLAB-based neuroimaging software package called SPM (Statistical Parametric Mapping).
SPM is a popular software package for the analysis of brain imaging data sequences. Statistical Parametric Mapping refers to the construction and assessment of spatially extended statistical processes used to test hypotheses about functional imaging data. SPM is generally used to identify functionally specialized brain responses and is the most prevalent approach to characterizing anatomy and disease-related changes. It is a voxel-based approach that employs classical inference to make some comment about regionally specific responses in the brain to experimental factors. SPM can process a sequence of images from multiple patients together or analyze the time series of images of a single person. The current version of SPM, and the one used in this study, is SPM8. SPM8 is capable of handling the analysis of fMRI (functional Magnetic Resonance Imaging), PET, SPECT, EEG and MEG images.
SPM is a MATLAB-based software package consisting of an exhaustive collection of m-files implementing many medical image processing algorithms such as realignment, spatial normalization, segmentation, and smoothing. The package also includes a significant number of C-based MEX files. Though MATLAB provides an excellent environment for numerical computations and image processing, it can be slow, especially given the large scale of simulations involved in SPM. The authors attempted to accelerate SPM by parallelizing some computationally intensive modules using NVIDIA CUDA and running these modules on the GPU. The module selected for the study is bspline interpolation, given its time-consuming nature, extensive usage in SPM workflows, and good scope for parallelization. Jacket, a GPU computing platform for MATLAB, was used to further expedite bspline interpolation, since Jacket efficiently manages CUDA runtime overheads. The authors also created CPU-based multithreaded bspline interpolation modules for performance comparison.
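The "good scope for parallelization" comes from the fact that each resampled output value depends only on the input image and its own sample position, so the output positions can be divided among workers. The sketch below illustrates that decomposition in Python and is not the authors' implementation. It uses linear interpolation (a first-degree B-spline) for brevity, and the function names and the chunking scheme are assumptions made for illustration. Python threads are used only to show the structure; they are not expected to reproduce the speedups reported in the study.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def interp_linear(samples, x):
    """Linearly interpolate `samples` (values at integer positions) at position x."""
    x = min(max(x, 0.0), len(samples) - 1.0)      # clamp to the valid range
    i = min(int(math.floor(x)), len(samples) - 2)  # left neighbour index
    t = x - i                                      # fractional offset in [0, 1]
    return (1.0 - t) * samples[i] + t * samples[i + 1]


def resample_serial(samples, positions):
    """Reference version: evaluate every output position one after another."""
    return [interp_linear(samples, x) for x in positions]


def resample_chunked(samples, positions, workers=4):
    """Split the output positions into contiguous chunks, one task per chunk.

    No chunk reads another chunk's output, so the tasks are independent.
    """
    size = max(1, math.ceil(len(positions) / workers))
    chunks = [positions[k:k + size] for k in range(0, len(positions), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: resample_serial(samples, c), chunks))
    return [value for part in parts for value in part]


if __name__ == "__main__":
    image_row = [0.0, 10.0, 20.0, 15.0]
    where = [0.0, 0.5, 1.25, 2.5, 3.0]
    print(resample_chunked(image_row, where, workers=3))
```

The chunked version returns the same values in the same order as the serial one, which is the property that lets the work be mapped onto CPU threads or GPU threads.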
The MEX facilities in this study were provided by Jacket. Jacket enables a MATLAB programmer to write MATLAB-like code that runs on the GPU instead of the CPU, and it allows GPU data structures to be created just like CPU data structures. For example, A = gsingle(B) casts a MATLAB matrix B to a single-precision floating-point GPU matrix A. Once the GPU data structure has been created, operations on it take place on the GPU instead of the CPU. To turn off GPU computation, simply cast the data back to the CPU using one of the MATLAB data types, e.g. B = double(A).
With the help of these new GPU-based data types and a function library mirroring the important functions provided by MATLAB, Jacket enables MATLAB code to run on the GPU. It provides a full runtime system that optimizes GPU-specific programming aspects such as memory transfers, kernel configurations, and execution launches for the MATLAB user. Jacket allows data to remain on the GPU between successive function calls rather than resorting to a round-trip memory transfer for each call. It also provides an open interface for writing custom CUDA code and linking the CUDA functions into the optimized Jacket runtime, which alleviates the overhead of repeated memory transfers between the CPU and GPU. This interface is the Jacket SDK.
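The benefit of keeping data on the GPU between calls can be seen by counting copies. The following is a toy model in Python, not Jacket code: the ToyDevice class and both helper functions are hypothetical and exist only to count host-to-device and device-to-host copies under the two strategies described above.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU that only counts host<->device copies."""

    def __init__(self):
        self.transfers = 0

    def upload(self, host_data):
        self.transfers += 1          # host -> device copy
        return list(host_data)

    def download(self, device_data):
        self.transfers += 1          # device -> host copy
        return list(device_data)

    def scale(self, device_data, factor):
        # Stands in for a kernel launch: works on device data, no copy counted.
        return [factor * v for v in device_data]


def round_trip_per_call(dev, data, factors):
    """Copy the data to the device and back again for every operation."""
    for f in factors:
        on_device = dev.upload(data)
        on_device = dev.scale(on_device, f)
        data = dev.download(on_device)
    return data


def resident_between_calls(dev, data, factors):
    """Copy once, run every operation on the device, copy the result back once."""
    on_device = dev.upload(data)
    for f in factors:
        on_device = dev.scale(on_device, f)
    return dev.download(on_device)


if __name__ == "__main__":
    a, b = ToyDevice(), ToyDevice()
    round_trip_per_call(a, [1.0, 2.0], [2.0, 0.5, 3.0])
    resident_between_calls(b, [1.0, 2.0], [2.0, 0.5, 3.0])
    print(a.transfers, b.transfers)
```

In this model the round-trip strategy needs two copies per operation, while the resident strategy needs two copies in total regardless of how many operations run in between.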
The Jacket SDK provides "jkt" functions which mimic some standard MEX API functions and help in integrating custom CUDA kernels into the Jacket runtime. Instead of the conventional "mexFunction", the Jacket SDK provides "jktFunction" as the entry point into the MEX file. Within a jktFunction, several jkt API functions are available for tasks such as getting MATLAB input, allocating device memory, casting the input CPU mxArrays to GPU arrays, and finding the dimensions of the mxArrays. Explicitly allocating memory on the GPU using cudaMalloc and then freeing it using cudaFree is not necessary with the Jacket runtime. The CUDA kernel itself remains the same, which reduces coding effort. The Jacket SDK also provides an easy-to-use utility, based on the "nvmex" script by NVIDIA, for compiling the Jacket MEX file.
Bspline interpolation is used in several workflows in SPM. Bspline interpolation allows images to be redrawn based on realignment with a reference image or segmented with respect to model maps. The New Segmentation routine of the fMRI workflow is useful for neuroimaging and spends significant time in bspline interpolation.
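To make the computation concrete, here is a minimal one-dimensional sketch of the evaluation step of cubic bspline interpolation in Python. It is an illustration under stated assumptions, not SPM's code: the function names are hypothetical, borders are handled by clamping the index (one choice among several), and the bspline coefficients are assumed to be already available. In a full interpolation scheme those coefficients are first derived from the image samples by a prefiltering step, which is not shown here.

```python
import math


def cubic_bspline_weights(t):
    """Weights of the four cubic bspline basis functions for offset t in [0, 1)."""
    w0 = (1.0 - t) ** 3 / 6.0
    w1 = (3.0 * t ** 3 - 6.0 * t ** 2 + 4.0) / 6.0
    w2 = (-3.0 * t ** 3 + 3.0 * t ** 2 + 3.0 * t + 1.0) / 6.0
    w3 = t ** 3 / 6.0
    return (w0, w1, w2, w3)


def eval_cubic_bspline(coeffs, x):
    """Evaluate the spline defined by `coeffs` at real-valued position x.

    The value is a weighted sum of the four coefficients at i-1, i, i+1, i+2,
    where i = floor(x). Out-of-range indices are clamped (an assumption).
    """
    i = math.floor(x)
    t = x - i
    last = len(coeffs) - 1
    total = 0.0
    for k, w in enumerate(cubic_bspline_weights(t)):
        idx = min(max(i - 1 + k, 0), last)
        total += w * coeffs[idx]
    return total


if __name__ == "__main__":
    ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    print(eval_cubic_bspline(ramp, 2.3))
```

Each output sample needs only a handful of multiply-adds over a small neighbourhood, and in three dimensions the same pattern is applied along each axis for every voxel, which is why the routine is both time-consuming and well suited to one GPU thread per output voxel.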
The New Segmentation routine is an advanced version of the Segmentation process in the fMRI workflow. This function performs segmentation, bias correction, and spatial normalization all in the same routine. It uses an extended set of prior tissue probability maps which includes white matter, grey matter, cerebrospinal fluid, bone, soft tissue, and air/background and accepts multiple channel and multiple subject structural images. Like the normal segmentation routine, the new segmentation makes intensive use of bspline interpolation.
The following figure shows that, for the New Segmentation routine, the best-case CPU multithreaded version achieves a speedup of 1.2X, the raw CUDA MEX file improves performance by 3X, and the Jacket-optimized CUDA version runs 3.57X faster than the unoptimized original baseline. This demonstrates that significant benefits can be obtained by running MEX files on the GPU. Productivity enhancements were also gained by leveraging the Jacket GPU computing platform.
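The reported speedups can be restated as fractions of the baseline runtime with a short calculation. The Python snippet below uses only the three speedup figures quoted above; the variable and function names are illustrative.

```python
# Speedups over the unoptimized baseline, as reported for New Segmentation.
speedups = {
    "CPU multithreaded": 1.2,
    "raw CUDA MEX": 3.0,
    "Jacket-optimized CUDA": 3.57,
}


def relative_runtime(speedup):
    """Fraction of the baseline runtime that remains after a given speedup."""
    return 1.0 / speedup


# Additional gain of the Jacket-optimized version over the raw CUDA MEX file.
jacket_vs_raw = speedups["Jacket-optimized CUDA"] / speedups["raw CUDA MEX"]

if __name__ == "__main__":
    for name, s in speedups.items():
        print(f"{name}: {relative_runtime(s):.0%} of baseline runtime")
    print(f"Jacket over raw CUDA: {jacket_vs_raw:.2f}X")
```

In other words, the three versions take roughly 83%, 33%, and 28% of the baseline runtime, and the Jacket-optimized version is about 1.19X faster than the raw CUDA MEX file.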
The study was done on a desktop workstation containing a quad-core Intel Xeon processor, with each core running at 2.67 GHz and the processor capable of running 8 threads in total using Intel Hyper-Threading Technology, along with 8 MB of L2 cache and 6 GB of DDR3 memory.
The GPU used was an NVIDIA Tesla C1060 consisting of 240 streaming processor cores running at 1.3 GHz, with 4 GB of GDDR3 memory and support for double-precision floating-point operations. The operating system was Red Hat Enterprise Linux Client release 5.4. MATLAB R2009a and SPM8 were used for the study. The CUDA version was 2.3 and the Jacket release was v1.2.1.
Georgia Institute of Technology
- Aniruddha Dasgupta - School of Electrical and Computer Engineering
- Hyesoon Kim - School of Computer Science
- Chris Rorden - School of Psychology