Once a system with a hybrid architecture has been chosen and the existing software is to be ported to a platform with graphics accelerators, the following steps should be carried out:
- Step 1. Profile the application. Only the bottlenecks, rather than the full application, need to be ported to the GPU. Once identified, these bottlenecks define the general scope of the project and make it possible to forecast the achievable performance gain (see the timing sketch after this list).
- Step 2. Adapt the algorithm and data structures. If the application has already been parallelized, the data structures and classes, as well as the algorithm itself, usually need only minor changes, e.g. switching from the Array-of-Structures layout to the Structure-of-Arrays one (illustrated after this list). If the degree of parallelism is insufficient, however, you will need to consult a domain expert and select a different but related algorithm.
- Step 3. Port to the GPU all application bottlenecks as well as a small amount of 'general' glue code, since in most cases calls to the heavy kernels are interleaved with data preparation. If that preparation stays entirely on the CPU, it causes numerous transfers between the CPU and the GPUs, degrading application performance by 20 to 50 percent or even more (a sketch of keeping data resident on the device follows this list).
- Step 4. Perform a 'deep' optimization of the GPGPU kernels that remain the major bottlenecks. When the achieved performance gain is not high enough, the flexibility of the CUDA architecture provides developers with a broad range of techniques for improving it, including asynchronous copies, tuning kernels to a particular GPU, the use of dynamically allocated shared memory, etc. (see the shared-memory sketch after this list). Such methods should, however, be applied only to the problematic kernels that have become the new application bottlenecks.
- Step 5. Analyze the efficiency of the ported application. This step is pivotal when deciding on hardware purchases for new deployments. All pros and cons should be weighed when, for example, the customer has to choose between two server nodes with three Tesla cards each and four nodes equipped with only two Tesla cards each. In such a case, some additional benchmarking is inevitable, but it is easily justified by the ability to save thousands or even tens of thousands of dollars.
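As a minimal illustration of the forecast in Step 1, the sketch below times a candidate hotspot and applies Amdahl's law to estimate the overall speedup if only that hotspot is accelerated. The stage functions and the assumed 20x kernel speedup are hypothetical placeholders, not measurements from a real project; in practice a profiler would supply these numbers.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-ins for the real application stages.
void heavy_stage() { volatile double x = 0; for (long i = 0; i < 100000000; ++i) x = x + i; }
void rest_of_app() { volatile double x = 0; for (long i = 0; i < 20000000;  ++i) x = x + i; }

int main() {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    heavy_stage();
    auto t1 = clock::now();
    rest_of_app();
    auto t2 = clock::now();

    double hot   = std::chrono::duration<double>(t1 - t0).count();
    double total = std::chrono::duration<double>(t2 - t0).count();
    double p = hot / total;  // fraction of run time spent in the hotspot
    double s = 20.0;         // assumed GPU speedup of the hotspot alone

    // Amdahl's law: overall speedup when only the hotspot is accelerated.
    printf("hotspot share: %.0f%%, forecast speedup: %.2fx\n",
           100.0 * p, 1.0 / ((1.0 - p) + p / s));
    return 0;
}
```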
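The Array-of-Structures to Structure-of-Arrays change mentioned in Step 2 can be shown in a few lines. The `Particle` type and the scaling kernels are hypothetical, but the layout difference is exactly what makes GPU memory accesses coalesce.

```cpp
// Array-of-Structures: neighbouring threads reading p[i].x touch
// memory locations 12 bytes apart, so loads are strided.
struct Particle { float x, y, z; };

// Structure-of-Arrays: neighbouring threads reading x[i] touch
// consecutive floats, so a warp's loads coalesce into wide transactions.
struct Particles { float *x, *y, *z; };

__global__ void scale_aos(Particle *p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x *= s; p[i].y *= s; p[i].z *= s; }  // strided
}

__global__ void scale_soa(Particles p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] *= s; p.y[i] *= s; p.z[i] *= s; }  // coalesced
}
```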
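Step 3's point about also porting the glue code can be sketched as follows. The kernels are hypothetical; the idea is simply that moving the small preparation step onto the GPU keeps the data resident on the device between the heavy kernels instead of round-tripping it through the host.

```cpp
__global__ void heavy_kernel_a(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              // stand-in for real work
}
__global__ void prepare(float *d, int n) {  // small "glue" step, ported too
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;              // e.g. re-normalize inputs
}
__global__ void heavy_kernel_b(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] -= 3.0f;              // stand-in for real work
}

void pipeline(float *d_data, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    heavy_kernel_a<<<blocks, threads>>>(d_data, n);
    // Had prepare() stayed on the CPU, the data would round-trip here:
    //   cudaMemcpy(..., cudaMemcpyDeviceToHost); prepare on host;
    //   cudaMemcpy(..., cudaMemcpyHostToDevice);
    prepare<<<blocks, threads>>>(d_data, n);
    heavy_kernel_b<<<blocks, threads>>>(d_data, n);
}
```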
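Of the Step 4 techniques, dynamically allocated shared memory is the easiest to show in isolation. This block-sum reduction is a generic textbook sketch, not a ttgLabs kernel; the point is that the shared-memory size is supplied as the third `<<<>>>` launch argument, so the same kernel can be tuned to a particular GPU without recompilation.

```cpp
__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float buf[];        // size chosen at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

// Launch: the third parameter sets the dynamic shared-memory size in bytes,
// letting the block size (and thus shared usage) be tuned per GPU:
//   block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
```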
The experts at ttgLabs will readily carry out the complete set of these activities, or only some of the above steps, for instance when a previous contractor proved underqualified. Our extensive experience in software optimization for GPGPU is embodied in a set of auxiliary libraries, which allow us to offer our customers additional unique services free of charge.
For instance, after Step 1 the customer receives a software build which, once the application exits, generates an HTML report estimating the performance gain of the GPU version over the existing one. At Step 4, we embed the ttgLib library into the customer's software; its dynamic optimization of the application and tuning to the processed data automatically increase application performance by 10 to 50 percent.