A New Method of Co-Scheduling CUDA Kernels

Abstract

Merging CUDA kernels is an effective way to improve overall GPU performance. In CUDA, a kernel running on the GPU cannot be suspended until it finishes. As a result, pending kernels should be co-scheduled so that GPU resources are fully utilized. In this paper, we make two contributions to kernel co-scheduling. First, we provide a model to predict the running time of a merged kernel; according to our experiments, the model is accurate in most situations. Second, we propose a new method for selecting kernels from the pending list to be merged. Compared with simply merging neighboring kernels, the co-scheduling method presented in this paper utilizes GPU resources more fully and improves overall performance.

1. Introduction

With the rapid advance of IC technology and the bottleneck in performance improvement of the traditional single-processor architecture, the chip multiprocessor (CMP) was proposed as a significant approach to make better use of the increasing number of transistors. With its advantages in exploiting higher-level parallelism (DLP and TLP) and its lower power consumption compared with the single-processor architecture, the multicore processor has become the mainstream architecture. However, performance cannot always be improved by adding cores to a chip, because of the memory bandwidth limit. Performance is especially poor for data-intensive programs such as informatics applications. Moreover, the function of a multicore processor is fixed, so its flexibility is not sufficient for many applications. A heterogeneous multicore architecture with specialized accelerators is a more flexible way to build high-performance computers. Reconfigurable computers and graphics processing units (GPUs) are considered two promising approaches in this situation.

The nature of the GPU, which is designed particularly for applications with large computational requirements, provides huge parallel speedups over traditional multicore processors. Moreover, the GPU's advantages in both programmability and capability have made it popular not only as a graphics processor but also as a general-purpose parallel processor accessible to any application [1]. This is known as general-purpose computing on graphics processing units (GPGPU). Today, GPUs are integrated into mainstream commercial computing systems. CUDA was introduced by NVIDIA as a general-purpose parallel computing architecture and is enabled on its GeForce, Quadro, and Tesla products. Even though the GPU offers high parallelism and large memory bandwidth [2], independent kernels from the same process cannot be loaded onto the GPU to work concurrently, which prevents full use of GPU resources.

In this paper, we propose a novel method of merging CUDA kernels to exploit GPU resources efficiently. Based on a model for predicting the running time of a merged kernel, which turns out to be accurate in most situations, our new method for selecting kernels to be merged from the pending list achieves a great improvement in GPU resource utilization.

The rest of the paper is organized as follows. Section 2 describes the background of the CUDA architecture. Section 3 presents our method and experimental implementation in detail. Section 4 describes related work. Sections 5 and 6 present the conclusion and future work, respectively.

2. Background of CUDA

With the rapid advance of parallel computing, modern GPUs have been at the leading edge of increasing chip-level parallelism for some time [3]. CUDA was introduced by NVIDIA in November 2006 as a general-purpose parallel computing architecture with a new parallel programming model and instruction set architecture that leverages the parallel compute engine in NVIDIA GPUs, and it can therefore solve many complex computational problems more efficiently than a CPU [4]. It has achieved great success because of its easy programmability and its flexibility for writing scalable parallel programs. The architecture provides programmers with multi-level task partitioning: fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. Programmers can partition a problem into coarse sub-problems that are solved independently in parallel by blocks, as illustrated by Figure 1. Each sub-problem can in turn be partitioned into finer pieces that are solved cooperatively in parallel by all threads within a block. As shown in Figure 1, a compiled CUDA program can execute on different numbers of processor cores.

Figure 1, The Reuse of Stream Multiprocessors

Our experimental platform is NVIDIA's Tesla GPU architecture, which delivers high computational throughput on massively parallel problems [3]. As a typical CUDA device, the Tesla C1060 contains 30 stream multiprocessors (SMs), and each block executes on a single SM. Each stream multiprocessor contains 8 stream processors and supports at most 256 threads executing at the same time. As illustrated in Figure 1, a multiprocessor will be reused by different blocks if the number of blocks exceeds the number of multiprocessors.
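The SM count that determines how blocks are folded onto the hardware can be queried at run time. The following host-side sketch (our own illustrative example, not code from the original experiments) prints it for device 0:

/* Minimal sketch: query the number of stream multiprocessors on device 0.
   On a Tesla C1060 this prints 30. Compile with nvcc. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}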

3. Approach & Experiments

3.1 Merging Performance

3.1.1 The Influence of Block Number on Running Time

As mentioned in Section 2, a kernel may contain more blocks than the GPU has SMs, so several blocks will be mapped onto the same SM. As a result, some SMs may have a heavier workload and ultimately delay the overall running time. Keeping the computing complexity of each kernel the same, we simply increase the block number and record the corresponding running time. The result is shown in the figure below.

Figure 2, Running Time of Kernels Whose Computing Complexity and Threads per Block Are Unchanged While the Block Number Increases

As the block number increases, the running time grows in a stair-step pattern. Assuming the kernel contains A blocks and the GPU contains B SMs, the running time is actually determined by (A-1)/B+1 (integer division), which is the largest number of blocks mapped onto the same SM. For example, with 30 SMs, a kernel of 65 blocks gives (65-1)/30+1 = 3, matching Table 1.

Figure 3, Workload Imbalance

As Figure 3 shows, the first SM is assigned three blocks, while the other SMs are assigned only two. As a result, the overall running time is actually the running time of the first SM. At the end of the run, the other SMs simply stall and wait for the first SM to finish its work. If the kernel contained three additional blocks, the running time would stay the same, so the throughput of the GPU would increase.

Figure 4, The Throughput of the GPU When the Block Number Increases

The figure above illustrates two important points about the influence of block number on GPU throughput. First, as the block number increases, a single SM runs more blocks concurrently than before; the SM can therefore switch between blocks and overlap the memory access penalty. In addition, more blocks make fuller use of GPU resources such as memory bandwidth and stream processors. Second, throughput does not always increase as the block number increases. As Figure 3 shows, workload imbalance can degrade throughput. When the block number is exactly a multiple of the SM count, GPU throughput reaches a local peak. In contrast, when only one SM has a heavier workload than the others, GPU throughput drops to a local bottom. We define the LEVEL NUMBER as the largest number of blocks mapped onto the same SM; in Figure 3, the level number is 3. Assuming the kernel's computing complexity is stable, its running time and the level number are linearly related, as shown in Table 1.

BLOCK NUMBER LEVEL NUMBER RUNNING TIME
15 1 0.36
25 1 0.38
35 2 0.54
45 2 0.55
55 2 0.55
65 3 0.73
75 3 0.73
85 3 0.74
95 4 0.92
105 4 0.92
115 4 0.92

Table 1, The Relationship between Level Number and Running Time

According to Table 1, if two kernels' level numbers are the same, their running times are also very close. It is the level number, rather than the block number, that exerts a direct influence on a kernel's running time.

3.1.2 The Influence of Additional Branch Instructions on the Running Time of a Merged Kernel

The behavior of a merged kernel is more complex than that of a single kernel. First, the performance of a merged kernel is strongly affected by the characteristics of its sub-kernels; for example, if both sub-kernels put heavy pressure on global memory access, the GPU's bandwidth becomes saturated and the running time is extended. Moreover, a merged kernel adds branch instructions to distinguish the different sub-kernels, and these branch instructions also increase the execution time. Finally, merging kernels enlarges the kernel code and therefore extends the code transfer time. We compare the running time of a merged kernel with that of a single kernel of the same size; our main purpose is to observe the influence of the additional branch instructions in the merged kernel. The result is shown in Figure 5.

Figure 5, The Overhead of Merging Kernels

In Figure 5, the big kernel, the merged kernel, and the series kernels all have the same size. The difference is that the merged kernel consists of two single kernels, so the length of the merged kernel's code is nearly twice that of a single kernel. The big kernel is a single kernel with the same block number as the merged kernel. The series kernels are the sub-kernels of the merged kernel, executed serially on the GPU. The series kernels take the longest time in this group for three reasons. First, fewer blocks may underutilize GPU resources; for example, more blocks can overlap the memory access penalty and make full use of the stream processors. Second, starting a kernel requires transferring its code from the CPU to the GPU, and more kernels consume more time in this process. Third, the level number of the merged kernel may be less than the sum of the sub-kernels' level numbers. According to our experiment, the merged kernel and the big kernel consume nearly the same amount of time, which shows that the additional branch instructions and the longer kernel code do not play an important role in the running time.
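To make the merging mechanism concrete, the sketch below shows one common way of fusing two kernels (an illustrative example of ours, not necessarily the exact code used in the experiments): the merged kernel gives the first sub-kernel the low block indices and dispatches on blockIdx.x with a branch, which is exactly the additional branch instruction measured above.

#include <cuda_runtime.h>

/* Illustrative merged kernel: blocks [0, blocksA) run sub-kernel A's work,
   the remaining blocks run sub-kernel B's work. The bodies are stand-ins. */
__global__ void merged_kernel(float *a, float *b, int blocksA) {
    if (blockIdx.x < blocksA) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;              /* index within A */
        a[i] *= 2.0f;                                               /* stand-in for A's body */
    } else {
        int i = (blockIdx.x - blocksA) * blockDim.x + threadIdx.x;  /* re-based index within B */
        b[i] += 1.0f;                                               /* stand-in for B's body */
    }
}
/* Host side, launched over the combined grid so both sub-kernels share the GPU:
   merged_kernel<<<blocksA + blocksB, threadsPerBlock>>>(d_a, d_b, blocksA);   */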

3.1.3 Prediction Model

Based on the analysis in Sections 3.1.1 and 3.1.2, we put forward a model to predict the running time of a merged kernel accurately; such a prediction model is important for the kernel merging decision, which we discuss in Section 3.2. Our model considers two factors that, according to the previous experiments, greatly affect merging performance. The first is the number of kernels, because more kernels take more time to transfer and start. The second is the level number of the merged kernel; according to Table 1, a larger level number usually indicates a longer running time.

Figure 6, The Actual Running Time of the Merged Kernel and the Series Kernels

Figure 6 shows the running time of two single kernels and of their merged kernel. We can conclude that merging kernels consistently reduces the running time. As mentioned in the last paragraph, the two factors we consider are the kernel number and the level number; their influence on the running time is shown in the figure below.

Figure 7, The Influence of Kernel Number and Level Number on Running Time

The start time of a kernel is related to the size of the kernel's code, and the running time of one level is determined by the kernel's computing complexity. To simplify our model, we assume that the start time of a kernel equals the running time of one level. As a result, we can roughly claim that the running time corresponds to the number of layers under the SMs shown in Figure 7. In the left part of Figure 7 there are 6 layers in total, two of which are "Time to Start a Kernel" while the other four are kernel execution. In the right part of Figure 7 there are 4 layers under the SMs. Thus, the estimated ratio of their running times is 4/6.

Figure 8, The Y Axis Indicates the Sum of Level Number and Kernel Number; the X Axis Is the Block Number

In Figure 8, the X axis indicates the block number and the Y axis is the sum of the level number and the kernel number, which we called the layer number in the last paragraph. Figure 8 is very similar to Figure 6, which means we can use the sum of the level number and the kernel number to estimate the real running time. The estimation model is as follows.

Running Time ∝ Level Number + Kernel Number

(Merged Kernel Running Time) / (Series Kernels Running Time) = (1 + Level Number of Merged Kernel) / (2 + Level Number of Kernel 1 + Level Number of Kernel 2)

Level Number = (Block Number - 1) / (SM Number) + 1, using integer division

To estimate the running time of a merged kernel, we only need the block numbers of kernel 1 and kernel 2, from which we calculate their respective level numbers. The block number of a kernel is easy to obtain and store from the program. We then use the equations above to calculate the estimated running time of the merged kernel. In Figure 9 we compare the real running time with the time estimated by our model.
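The estimation can be coded directly. The sketch below (the helper names level_number and merged_vs_series_ratio are ours, introduced only for illustration) computes the level numbers from the block numbers and evaluates the ratio above, obtaining the SM count with cudaGetDeviceProperties():

#include <cuda_runtime.h>
#include <stdio.h>

/* Level Number = (Block Number - 1) / (SM Number) + 1, integer division:
   the largest number of blocks mapped onto any single SM. */
static int level_number(int block_count, int sm_count) {
    return (block_count - 1) / sm_count + 1;
}

/* Estimated (merged kernel time) / (series kernels time), following
   Running Time proportional to (Level Number + Kernel Number). */
static double merged_vs_series_ratio(int blocks1, int blocks2, int sm_count) {
    int merged = level_number(blocks1 + blocks2, sm_count);
    return (1.0 + merged) /
           (2.0 + level_number(blocks1, sm_count) + level_number(blocks2, sm_count));
}

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* 30 SMs on a Tesla C1060 */
    printf("estimated merged/series time ratio for 45 + 25 blocks: %.2f\n",
           merged_vs_series_ratio(45, 25, prop.multiProcessorCount));
    return 0;
}

For instance, with 30 SMs, merging kernels of 45 and 25 blocks gives level numbers 2 and 1, a merged level number of 3, and an estimated ratio of (1+3)/(2+2+1) = 0.80.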

Figure 9, The Actual Running Time and Estimated Running Time of the Merged Kernel

As Figure 9 shows, the running time predicted by our model is very close to the actual running time of the merged kernel. However, if the block number becomes very large, the GPU may hit a resource bottleneck that degrades the performance of the merged kernel; as a result, the accuracy of our model is also limited by the GPU's resource bottleneck.

3.2 New Co-Scheduling Method

3.2.1 Implementation

As mentioned in the related work, our design also constructs an "issue queue". Every kernel launched by the user is first saved in this queue; through the issue queue, we can make real-time decisions about merging kernels. Our co-scheduling method needs the block number of each kernel, which is saved as soon as the kernel is loaded into the issue queue. We also call cudaGetDeviceProperties() to obtain the number of SMs. When the GPU is available to launch a new kernel, our scheduler selects a kernel other than the first one from the issue queue to merge with the first kernel. Unlike the traditional merging of neighboring kernels, our merging aims to give every SM a balanced workload. We find the most suitable kernel to merge with the first kernel as follows. First, we sum the block numbers of the first kernel and the candidate kernel. Second, we calculate the remainder of this sum divided by the number of SMs. Finally, we select the candidate kernel that produces the largest remainder and merge it with the first kernel. The pseudo-code is shown below.

Algorithm 1, Find the Merging Pair for the First Kernel
best_remainder ← -1
for i ← 2 to kernel number in issue queue
{
    remainder ← (first kernel block number + ith kernel block number) % SM number;
    if (remainder > best_remainder) { best_remainder ← remainder; result ← i; }
}
Merge the result-th kernel with the first kernel
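A possible host-side realization of Algorithm 1 is sketched below (function and variable names are ours; this is not the paper's actual scheduler code). A remainder close to the SM count means that, after merging, the last level of blocks occupies almost every SM, which is the workload balance argued for in Section 3.1.1.

/* block_counts[0] is the head kernel of the issue queue; entries 1 .. queue_len-1
   are the pending kernels. Returns the index of the best merge partner,
   or -1 if the queue holds only one kernel. */
static int pick_merge_partner(const int *block_counts, int queue_len, int sm_count) {
    int best_index = -1;
    int best_remainder = -1;
    for (int i = 1; i < queue_len; ++i) {
        int remainder = (block_counts[0] + block_counts[i]) % sm_count;
        if (remainder > best_remainder) {   /* larger remainder: fuller last level */
            best_remainder = remainder;
            best_index = i;
        }
    }
    return best_index;
}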

3.2.2 Experiment Results

We use matrix multiplication as the application for the co-scheduling experiments. The block number of each application is generated randomly between 1 and 100. The result is shown in Figure 10.

Figure 10, Running Time of Series Mode, Neighbor Merging Mode, and Alignment Merging Mode

The X axis gives the number of kernels in the issue queue and the Y axis gives the running time per kernel. The experimental results show that the new co-scheduling method achieves a 1.23x speedup compared with the series execution mode and a 1.11x speedup compared with the neighbor merging mode.

4. Related Work

Originally, GPUs were used merely to speed up graphics operations [5], such as ray casting and trimmed NURBS surfaces. However, because of the rapid development of GPUs and their special structure, researchers realized that GPUs are also effective for many general-purpose applications [6]. GPUs are generally considered suitable only for data-parallel jobs [7]. Current GPU programming platforms such as CUDA do not support running multiple kernels concurrently on the same device [8]; once a kernel is launched, it cannot be stopped until it finishes. These properties limit the development of task parallelism on the GPU, so the GPU cannot be as flexible and effective as the CPU. For GPUs to develop further, they need a high-efficiency scheduling strategy like the CPU's. Some works have presented scheduling methods for heterogeneous environments [9]. To realize a complete GPU scheduler, some researchers have created an issue queue for CUDA kernels [10]; the kernels in the issue queue can then be co-scheduled after an overall consideration, and the scheduler can also decide to merge kernels to increase throughput. However, that work did not analyze the running time of merging in detail, and it ignored workload balance after merging. In fact, whether the workload is balanced among SMs greatly affects the final running time of the merged kernel. Based on these previous works, our emphasis is on predicting the running time of a merged kernel accurately. In addition, we improve the method of selecting kernels to be merged from the issue queue.

5. Conclusion

In this paper, we first discuss the workload imbalance between different SMs. According to the experiments, the running time of a kernel is determined by the SM with the heaviest workload; as a result, as the block number increases, the running time grows in a stair-step pattern. In addition to the block number, the kernel number also exerts a large influence on running time. Based on this analysis, and without needing to know the kernel's computing complexity, we put forward a model for predicting a merged kernel's running time based only on the block numbers of the two sub-kernels. Experimental verification shows that the running time predicted by our model is very close to the real running time. Moreover, the paper presents a new kernel co-scheduling method that pays attention to the workload balance among SMs after merging. Compared with executing every kernel serially, our co-scheduling method achieves a 1.23x speedup; compared with the neighbor kernel merging method, it achieves a 1.11x speedup.

6. Future Work

This paper has two obvious shortcomings. First, in order to highlight the influence of block number on running time, we assume that all kernels have the same computing complexity. Second, we have not taken the bottleneck of GPU resources into consideration. In the future, we will first undertake deeper research on merging kernels with different computing complexities. According to our current knowledge, kernels with similar computing complexity achieve a higher speedup after merging; if their computing complexities differ greatly, the workload among SMs is more likely to be unbalanced after merging. Based on this research, we will refine our prediction model and co-scheduling method. We will also pay attention to the bottleneck of GPU resources; in fact, as shown in Figure 9, this bottleneck has already affected our prediction. Finally, we hope to build an accurate, stable, and robust model to predict the running time of a merged kernel and, based on this prediction model, to design an effective co-scheduling method. With the help of the kernel issue queue, we look forward to increasing the throughput of the GPU and extending the range of GPU applications.

References

[1] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. GPU computing. Proceedings of the IEEE, 96(5), 2008.
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370-1380, 2008.
[3] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov. Parallel computing experiences with CUDA. IEEE Micro, 28(4), 2008.
[4] NVIDIA. CUDA C Programming Guide.
[5] A. Krishnamurthy, R. Khardekar, and S. McMains. Direct evaluation of NURBS curves and surfaces on the GPU. In Proceedings of the ACM Symposium on Solid and Physical Modeling (SPM), Beijing, China, June 2007.
[6] B. S. H. Michel. General-purpose GPU computing: practice and experience. SC'06, New York, NY, USA, 2006.
[7] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: using data parallelism to program GPUs for general-purpose uses. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, USA, 2006.
[8] M. McCool. Scalable programming models for massively multicore processors. Proceedings of the IEEE, 96(5):816-831, May 2008.
[9] G. F. Diamos and S. Yalamanchili. Harmony: an execution model and runtime for heterogeneous many core systems. In HPDC '08: Proceedings of the 17th International Symposium on High-Performance Distributed Computing, pages 197-200, New York, NY, USA, 2008. ACM.
[10] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. Enabling task parallelism in the CUDA scheduler. In Proceedings of the Workshop on Programming Models for Emerging Architectures (PMEA), pages 69-76, Raleigh, NC, September 2009.
