Mutation modeling for HCV on the GPU
The Centers for Disease Control and Prevention (CDC) is currently performing Research and Development (R&D) in computational science in order to accelerate research and reduce the costs of the massive parallelization required by some biological and statistical algorithms. Relevant factors in platform decisions are: time, security, costs, processors required, scalability, memory model, instruction and data models, visualization, repeatability and sustainability. A case study of interest is the biological research regarding coordinated mutations of the Hepatitis C Virus (HCV). AccelerEyes provided collaborative R&D resources and greatly improved the speed of this HCV research with the use of parallelization, reducing the computing time from 40 days to less than 1 day. This vast improvement was achieved with a low-cost Personal Supercomputer with NVIDIA GPU's, Mathworks MATLAB®, and AccelerEyes Jacket. Moderately faster execution times were achieved with a high-cost supercomputer at Cornell University, a TeraGrid computing resource with 512 dedicated cores running MATLAB.
Hepatitis C virus (HCV) is a major cause of liver disease worldwide (Alter, 2007). It has been previously shown that despite its extensive heterogeneity, HCV evolution of Hepatitis C virus is shaped by numerous coordinated substitutions between pairs of amino acid (aa) sites, which can be organized into a scale-free network (Campo et al 2008). This finding creates a bridge between molecular evolution and the science of complex networks. This scientific discipline studies the topological features of networks, which highly influence the dynamics of the processes executed on them. The HCV network of coordinated substitutions has a topology with some key characteristics identified in other biological, technological, and social complex networks. The topological structure and hierarchical organization of this network suggest that a small number of aa sites exert extensive impact on hepatitis C virus evolution. It was also found that HCV is highly resistant to disadvantageous mutations at random sites, but it is very sensitive to mutations at the sites with the highest number of links (hubs). The HCV network topology may help devise novel molecular intervention strategies for disrupting viral functions or impeding compensatory changes for vaccine escape or drug resistance mutations.
There are three different sources of covariation in related biological sequences (such as the dataset of HCV sequences): (i) chance, (ii) common ancestry and (iii) structural or functional constraints. Effectively discriminating among these underlying causes is a task with many statistical and computational difficulties. In oder to rule out the effect of the first source of covariation (chances), the statistical significance of each correlation was established with a permutation test whereby the aa at each site in the sequence alignment was vertically shuffled. Ten thousand random alignments were created this way, simulating the distribution of correlation values under the null hypothesis that substitutions of aa at two sites are statistically independent. For physiochemical factors, a pair of sites was considered significantly correlated if its correlation value in the observed dataset was higher than the correlation value for those two sites in any of the random datasets (p = 0.0001). This is a simple but computationally intensive process that was greatly benefited by parallelization. However, the confounding effects of common ancestry still need to be addressed and a new method was developed. A phylogenetic tree was built and the ancestral sequences were estimated by maximum likelihood. For each pair of sites, a new sample of mutation events is obtained by crawling from the tips to the root of the tree (Willis, unpublished data). This new subset is used to calculate the physicochemical correlation between the two sites, which is then compared with the distribution of 10,000 random correlation values obtained by permutation of the same subset. This method was intractable on regular scientific workstations but the usage of the TeraGrid Computing resource greatly reduced the analysis time via deployment of a computationally efficient parallel algorithm .
The Solution: Personal Supercomputer
AccelerEyes transformed the existing HCV covariation algorithm to GPUs leveraging the Jacket software platform. Several steps were taken to migrate the existing algorithm to GPUs and Jacket. First and foremost, the code was profiled using the MATLAB profiler to identify the most expensive regions of code that would be targets for optimization.
Upon completion of the profiling step, it was clear that the code would benefit most significantly, for both CPUs and GPUs, by moving the computationally expensive regions of code to vectorized implementations within MATLAB. As a result, step two was the vectorization process.
Here is a quick snapshot of vectorization on a small piece of the code:
The code is expanded a bit in the following example and is typical of many MATLAB algorithms in that the use of "for loops" is prevalent. By removing the "for loop" and flattening the code, performance improvements will result.
Vectorization of the "for loop" in the above example results in MATLAB code that looks like the following code snapshot:
Vectorization of MATLAB code provides several benefits even though it requires time and thought. First and foremost, vectorization provides substantial performance improvements in most codes. Furthermore, vectorization is likely a better approach then porting to low level programming languages such as C, C++, or Fortran in that vectorized MATLAB code is a hardware independent implementation of performance improvements and therefore can be easily deployed to other users without recompilation or optimization for different computing architectures.
During the vectorization process, the AccelerEyes and CDC team also incorporated Jacket specific functions in order to GPU enable the code along the way. You will notice through the Vectorized code example that the use of commands such as "gsingle" and "gzeros" are used. These are GPU datatypes established with the Jacket software platform and signal to the MATLAB code that this data is to be directed to the GPU for processing. Once data is in GPU memory, Jacket's just-in-time compiler and run-time system take over in order to execute the computationally intensive portions of the code on the GPU. Use of Jacket on GPU datatypes in this simplest form of GPU enabling algorithms and is made possible due to the vectorization effort.
To implement this algorithm for GPUs in less than 2 weeks from start to finish, the team first profiled the MATLAB code then vectorized the code for optimization and applied Jacket datatypes to move the code to GPUs.
A significant result of this project is that less than 2 weeks were spent preparing the code for performance improvements for both CPU and GPU architectures. It is not uncommon for months or even years to be spent migrating high level algorithms written in MATLAB or other very high level languages to C, C++, or Fortran along with parallelization labor to get codes working with MPI or OpenMP. It is possible that these multi-month or year long efforts can result in significant speed ups, but the trade off of labor time with computing resources must be considered.
The primary objective of the project was to evaluate the performance possibilities of GPU technology and the Jacket platform for the MATLAB based algorithm. The following results were achieved and compared with alternative computing architectures. Any comparative data associated with labor time to run the code on each architecture was not contemplated.
The base line for running the algorithm was 10,000 permutations for each computing architecture. Some modifications to the algorithm for each architecture were assumed to be made but the output was only considered valid if consistent with the original CPU based code.
- Scientific Workstation - CPU Only
To understand the significant run time of the 10,000 values only an estimate was made of the runtime using the original MATLAB code for CPU only. The CPU used for estimating was Intel's Nehalem CPU with 4 cores (8 threads) running at 2.66 GHz.
Cost of the workstation = ~$2,000US
Estimated Run time = ~39 days
- Scientific Workstation - CPU + GPU hybrid
An initial estimate of the runtime for the hybrid workstation configuration with the same Nehalem CPU and 4 Nvidia C1060 GPUs was approximately 24 hours. As a result, the team ran the algorithm in total for 10,000 permutations to get the final data output and run time for the vectorized and Jacket-based code.
Cost of the hybrid workstation = ~$5,000US
Actual Run time = 22.5 hours
- TeraGrid - 512 core CPU cluster
The same base line code used for this architectural work was also used to run the algorithm on Cornell's 512 core Teragrid resource. The same 10,000 permutations were used and as stated in the opening. Some algorithmic modifications were made to the code to maximize the potential of the cluster.
Cost of cluster = >$250,000
Actual Run time = ~5 hours
The performance and speed up results prove that various architectures yield different performance and, depending on the urgency of the results and the users access to computing resources, there are ways to solve computationally intense problems based on these circumstances. Most importantly the price/ performance ratios tell a strong story about what is possible in technical computing in 2010 depending on the user and the type of budget or access available. The economics of technical computing change dramatically with GPUs when using traditional measures of price/performance. In the following chart, run time speed up (X speed up) and amount of dollars spent (X $ spent) are calculated as usual when evaluating price/performance. However, with GPUs delivering substantial performance gain at such a low relative cost to traditional high end computing resources, it is necessary to calculate a relative price/performance (Relative P-P) figure to understand how much more performance an application gains given a small increment of dollars spent.
Although absolute performance may be the goal on any given algorithm or
application and any cost could be expended to get the absolute lowest run
times, when calculating relative price to performance the higher number
will always represent the best value for any given application. Any
relative price performance number less than 100% means that more dollars
are being spent than the number of times an algorithm is sped up from the
base line computing configuration.
- CDC Research and Development Team
- Gallagher Pryor, AccelerEyes
- Sandeep Maddipatla, AccelerEyes
ReferencesAlter M. 2007. Epidemiology of hepatitis C virus infection. World J Gastroenterol 13:2436-2441.
Campo, D. Dimitrova, Z. Mitchell, R. Lara, J. Khudyakov, Y. 2008. Coordinated evolution of the hepatitis C virus. Proc Natl Acad Sci USA.105(28):9685-9690.
Dimitrova, Z. Campo, D. Neuhaus, E. Khudyakov, Y. Woody, N. Lifka, D. Seamless Scaling of a Desktop MATLAB Application to an Experimental TeraGrid Resource. Poster. 21st Supercomputing Conference, November 12-14, Portland, Oregon. 
« Back to Case Studies