This is the report on a OpenMP/MPI hybrid parallelization project that I did during my internship at Sun more than 10 years ago. The report is "as-is".
Parallelize a representative scientific computing code using OpenMP/MPI.
Introduction
This project was carried out as a part of my 3 month internship at Sun Microsystems during the summer of 1999. The major objective of the project was to parallelize a representative scientific computing code using OpenMP/MPI. The major objective includes- Parallelize a representative scientific computing code using OpenMP
- Parallelize a representative scientific computing code using MPI
- Parallelize a representative scientific computing code using a hybrid MPI/OpenMP model
- Evaluate the different approaches from the standpoints of ease of implementation, functionality and performance on UltraSparc/Solaris SMP systems
The representative scientific application that was chosen is a public-domain code form the University of Maryland in the field of semiconductor device simulation. This code is a f77 program around 7K
in size. The code solves the Boltzmann Transport Equations using a combination of Spherical Harmonic expansions and finite-volume discretization. Two electron energy bands (lower: 12, upper: 34) are modeled and the Gummel-Scharfetter block iteration method is used to decouple the partial differential equations. The decoupled transport equations (which after discretization reduce to solving a 3-d problem: 2d in space and 1d in energy) are solved iteratively using the LSOR method. The computation domain is Cartesian in space and curvilinear in energy, which adds to the computational complexity in terms of slowing down the convergence.
The basic structure of the program can be described by the following figure
Figure 1: Gummel loop figureEach Band can then be solved iteratively using any of the SOR techniques.
Figure 2: solveC.f structure
Profiling the program
The program was profiled using the tools provided in Workshop 5.0. the results were as expected. the majority of the time was spent in solving the bands and the LSOR routines. Thus to successfullyparallelize the code, we have to parallelize the LSOR routines and possibly the Gummel loop.
Analysis of the program showed us that there are two levels of parallelism present in this program. Band Level and Task Level. Since multiple bands are being solved, a natural way to parallelize the program is to solve the bands in parallel. This is complicated by the presence of data dependencies between the two bands. The data dependencies between the bands can be represented by the following
equations. In the serial case, Band 12 is solved and then Band 34 is solved. This is stated as follows
Eqn #1 C12(i) uses C12(i-1) and C34(i-1) C34(i) uses C12(i) and C34(i-1)
When we applied band parallelization, we effectively changed the above dependence to
Egn #2 C12(i) uses C12(i-1) and C34(i-1) C34(i) uses C12(i-1) and C34(i-1)
This lead to an increase in the number of iterations for convergence. Serial case -- 22 iterations for the normal data set. Parallel case -- 38 iterations for the normal data set
The increase in iterations can be attributed to (as explained by Surinder) inter-band scattering (scattering from band 12 to 34, and vice versa). Each band uses the C() from the other band. This is a large scattering, and has a strong effect on the solution. So to make the solution converge faster, it needs the best values that it can get (latest Gummel).
In the OpenMP version, if we parallelize the bands and synchronize them at the end of Gummel only, then it is possible to replicate the behavior of the serial case. In this case band 34 C34(i) uses some values in between C12(i) and C12(i-1). This leads to convergence in fewer number of iterations. This was successfully replicated over the three data sets.
To simplify our analysis, we modified the serial program to use Eqn 2 instead of Eqn 1. All the numbers presented in this report is with respect to the modified serial version. The timings for the serial version are shown below in table 1.
Data Set | LSOR | JSOR | HSOR |
---|---|---|---|
Sparse | 1.30 | 4.14 | 3.26 |
Normal | 32.30 | 115 | |
Dense | 164.5 | 667 |
Band parallelism using OpenMP is pretty easy. We used the sections
pragma to parallelize the Gummel loop. Each call to solve the bands
was enclosed in a sections pragma. The threads that are executing the
sections are synchronized in each Gummel loop and after
initialization of matrix in solveC.f. An important thing to be
mentioned is that the procedure that solves the bands has lots of
local variables that are allocated on the stack. The stack usage by
the routine is approximately 55MB (found out by running assureview).
So we need to set the KMP_STACKSIZE environmental variable higher
than that else the program SEGVs.
The MPI version is implemented by each process solving one band ,
exchanging values for C and proceeding to the next Gummel. The size
of C transfered is approximately 50MB. We tried optimizing the size
of the data being sent, but then the whole C array is required by the
other band. The program also carries out some computations in the
initialization phase. We decided to make both the MPI processes do
the computation rather than make one MPI process do the computation
and send it to the other process. Synchronization is carried out by
using MPI_BARRIER calls and synchronous Send-Recv calls. The
processes after synchronized at the start of the Gummel loop. Data
values that are exchanged are the C array and error values (so that
we can synchronize the exit).
The bands are heavily imbalanced. Band 12 takes approximately 4-5
times more time than the other band (band 34). Thus the speedup
achieved cannot exceed more than 1.2. This was also demonstrated in
the parallel versions that we developed. the results are tabulated
in table 2.
Method | Total Time |
---|---|
Serial | 1920 sec |
OpenMP | 1625 sec |
MPI | 1700 sec |
Each band is solved by iteratively calling a SOR routine. The author
chooses the Line SOR method due to its superior serial performance.
Line SOR theoretically cannot be parallelized. So Jacobi LSOR (which
can be parallelized) was used. Jacobi LSOR is very slow with respect
to the Line SOR and takes 4 times more time than Line SOR. Jacobi
LSOR was successfully parallelized and the results are tabulated in
table 3.
Numbers of procs | Time (sec) | Speedup |
---|---|---|
1 | 8400 | 1 |
2 | 4325 | 1.94 |
4 | 2267 | 3.7 |
8 | 1263 | 6.6 |
16 | 814 | 10.31 |
"hybrid SOR". Hybrid SOR is is a hybrid of Jacobi-Gauss-Seidel
iteration methods.
Essentially, HSOR does following
x-sweep: Jacobi in y, GS updates in H
y-sweep: Jacobi in x, GS updates in H
H-sweeps: Jacobi in x, GS updates in y
This method is better than the Jacobi SOR, but takes more time with
respect to the Line SOR. the results are tabulated in table 4
Numbers of procs | Time (sec) | Speedup |
---|---|---|
1 | 8121 | 1 |
2 | 4221 | 1.92 |
4 | 2192 | 3.7 |
8 | 1218 | 6.6 |
16 | 793 | 10.24 |
32 | 651 | 12.5 |
experimenting with parallelizing Line SOR and the results were
amazing. The solution still converged in the same number of Gummel
and the time taken for convergence is superior to the Jacobi LSOR
method. The idea here was to break down each sweep into chunks and
then solve these chunks in parallel hoping that the the chunk
boundaries fell in regions where the solution was weak. The results
are tabulated in table 5.
Another possible reason could be that the parallelization leads to
more # of inner iterations. This could in some way be contributing to
the convergence in the same number of gummels
Numbers of procs | Time (sec) | Speedup |
---|---|---|
1 | 2277 | 1 |
2 | 1186 | 1.91 |
4 | 638 | 3.56 |
8 | 352 | 6.4 |
16 | 226 | 10 |
32 | 173 | 13.16 |
The author provided us with an implementation of Zebra LSOR. Zebra
LSOR carries out odd-even sweeps in x, y, and H directions. The
results obtained with Zebra LSOR are tabulated in table 6.
Numbers of procs | Time (sec) |
---|---|
1 | 2300 |
2 | 1389.38 |
3 | 1335.66 |
4 | 1762.04 |
5 | 1489.9 |
10 | 850.48 |
15 | 856.48 |
20 | 899.5 |
25 | 1025.9 |
use OpenMP for task level parallelism. This decision was motivated by
the need to achieve greater parallelism by using band parallelism as
well as task level parallelism. OpenMP currently does not support
nested parallelism. Thus MPI was chosen to achieve band parallelism.
OpenMP was was used to achieve task level parallelism. The results
although preliminary, look promising. They are tabulated in table 7.
We still have not been able to beat straight parallelization of LSOR,
but as the number of bands increases, the hybrid model should be
better.
Load balancing is required in the hybrid model. This is because the
bands are heavily imbalanced. Band 34 takes 1/4 time of band 12. This
is further complicated by the fluctuating ratios of times for the
bands from Gummel to Gummel. Three heuristics were chosen to perform
load balancing.
- Allocate a fixed number of threads to each band.
- Increase (Decrease) the number of threads by a fixed number (in
this case 2) for each band after each Gummel based on the ratio of
times. - Dynamically increase or decrease the number of threads after each
Gummel. The amount by which the threads are increased/decreased
depends on the ratio of the times for the bands.
Testing of the above 3 options showed us that option #2 is the best
option for this problem (normal data set).
Number of procs | Parallel LSOR | Hybrid |
---|---|---|
40 | 177.3 | 225.8 |
32 | 235.1 | 216 |
25 | 231.8 | 220 |
20 | 234.3 | 321 |
16 | 316.35 | 345.2 |
10 | 385.95 | 496.22 |
8 | 485.3 | 763.3 |
Implementation Details
It is very hard to find the parallelism in a program until youactually know what the program is doing. One of main benefits of
OpenMP that I found is that it is possible to parallelize the program
by having a fair knowledge of what the program does.
Band parallelism was accomplished through the use of the sections
directive. The threads were synchronized by the implicit barrier call
at the end of a parallel section pragma. Parallelization of LSOR was
other SOR techniques was very similar. We used parallel, do,
critical, master, single pragmas in the code to accomplish this.
The MPI version was developed by introducing a barrier call at the
end of each Gummel. Synchronous sends & receives were used. The
approach was to simulate the OpenMP band parallel version.
Compilation of the hybrid model was straightforward. Used -lmpi with
guidef77
Implementation difficulties.
I made the classic mistake of parallelizing inner loops in LSOR. Theoverhead of creating and destroying threads at each outer iteration
was too high and so had to go back and try parallelizing the outer
most loops. Another amusing problem that we faced was stack-sizes.
There are a lot of stack-allocated variables in the program and the
default stack-size of 32K was too small. The program failed with a
segmentation fault in some cases and in the other cases, failed by
giving a message "Abort: Out of memory".
Estimating stack-sizes is a tricky exercise. One option is to
manually calculate the sizes of all the stack-allocated (automatic)
variables. Another option is to use "trial and error" philosophy.
Assureview is an excellent tool to find errors in a parallel program.
It has a lot of facilities for finding write-write, read-write, etc
errors in a parallel program. it can also be used to estimate
stack-sizes required by each thread. But one of the main drawback is
that it is 10-20 times slower than the normal run and cannot be used
to validate programs calling any of the OpenMP functions
(omp_get_thread_num, etc..)
Signal handling by KAI complied program needs to be improved. The
program in some cases refuses to accept SIGKILL and in some cases,
handles SIGKILL by generating a SIGSEGV!
There are a lot of scheduling options available that programs can use
to tune their program. Our experiments with these options showed us
that some of the options are best suited for programs with less
threads and some are good for programs with a larger number of
threads.
Another problem while developing OpenMP programs is whether to make
variables private or shared. This ultimately depends on the problem
and the region that needs to be parallelized. Making variables
private increases the stack usage, and since these are made private
at each instance, more time is required.
Some program may depend on the serial semantics of loops and use the
variables after the do loop. for example, a program may use the last
value of the loop index after the loop to make some computations. In
such cases, the lastprivate pragma can be used.
Implementation differences.
MPI and OpenMP are two different parallel programming constructs. Our(my) belief is that OpenMP is easier to use to parallelize programs
that are already been developed. MPI can only be used to do
domain-decomposition and cannot be used to parallelize just loops.
Another feature of OpenMP is that we just need to insert pragmas. No
major modification in the flow of the program has to be made.
Tuning
KMP_LIBRARY set to gang gives us the best performance.Stack-size plays a role in the execution times of the program. Our
experiments with FT kernel of NAS parallel benchmarks showed us that
stack-size can effect performance by as much as 10%
Making the DO loops nowait helped improve performance by as mush as
5%. This can be used when there is unequal amount of work for each
thread and the free threads can be put to work on the next loop or
parallel construct.
Conclusions
We have demonstrated that- OpenMP to be an effective method to parallelize
programs. This is reflected in the good scaling achieved. - SOR was parallelized and we obtained good speedup.
- We have also demonstrated that OpenMP and MPI can be used in the
same programming model effectively.
Some notes about the project
- The maximum speedup that can be obtained by Band Parallelism is
1.1 to 1.2. this is due to the fact that Band 34 currently takes 4-5
times more time than Band 12. Max speedup is 1.x where x is the ratio
of times for the bands - The best serial performance is obtained with the Line SOR.
- Line SOR cannot be mathematically parallelized. But our results
with parallelizing it with breaking it down into chunks and then
solving in parallel showed that this is possible. One possible
explanation provided by the author was that these chunk boundaries
lies in weak regions or in regions where the solution does not change
much. But when we used a chunk size of 1, (i.e. each grid point is
solved simultaneously, the solution still converged in the same
number of Gummel iterations. This is something that has to be
examined in greater detail. - In the original serial case, EQN #1 is used. By using EQN #2, it
was shown that the number of iterations increases by almost 2x. Thus
the availability of newer data is crucial for faster convergence. - In the MPI version, we tried experimenting with sending only data
that is required for the optical band interchange. The other region
where the data form the other band is required is Impact Ionization.
Thus we need to send the whole C array since we cannot ignore impact
ionization. - The goal of load balancing was to make the bands take equal
amounts of time. There are a lot of indeterminate variables in the
problem that make paper analysis difficult. - The bands take unequal amounts of time. This is highly evident in
the sparse data set, where Band 12 takes almost 10 times more time
than Band 34. This ratio comes down to 4 in the normal data set. - The times for the bands gets equal as the Gummel loop progresses.
Makes load balancing difficult. - The problem is sensitive to the values of omega. Currently for
LSOR, omega is 1.35 for band 12. Experimenting with the normal data
set showed that the problem converges in the same number of
iterations for omega=0.80. We also experimented with varying the
value for omega based on the iteration number. Using omega=0.8 for
odd numbered gummels and 1.35 for even numbered gummels makes the
solution go hay-wire. Decreasing the value of omega by 0.01 for each
Gummel iteration showed some promise. - The total number of iterations (gummels and inner) varies
according to the SOR method used. These can be summarized by the
following table
Normal data set Method
Gummel
inner
LSOR
21
264
Jacobi SOR
61
851
Hybrid SOR
64
746
Sparse data set Method
Gummel
inner
LSOR
28
148
Zebra
28
263
- Reorganization of loops to ensure cache hits increased the serial
time by 2 seconds for the sparse data set. - In the SOR routines being used, the outer index is nx.
(approximately 40 for normal data set). If the outer loop were to be
over k, then the max outer index would be k_max (400 for the normal
data set). With OpenMP, we could get better scaling over larger
number of processors. - Interchanging the bans had no effect on the number of Gummel
iterations. - Solving one band before the Gummel loop has no effect on the
number of Gummel iterations. - The values for errorfo keep fluctuating during the first few
iterations, - KMP_LIBRARY set to gang gives us the best performance
possible. - Putting a barrier pragma in the OpenMP version of the
original
serial program increases the number of iterations by 2x. If we don't
put the barrier, then newer data is available to the bands and thus
they converge faster even if we break the Gummel dependence. - Making the OpenMP do loops no-wait, decreases the time by a
minute amount. this time depends on the amount of work done by each
thread. This option will help if the threads take unequal amount of
time to execute. - Surinder had an idea where if we break the Gummel loop concept,
and let the bands go on executing until they converge. - why wavefront parallelism wont work? In our case with this
approach is (1) very few inner iterations (10)
to parallelize on (2) we do x,y,H sweeps...that will cause problems
in the wavefront as the directions change. - Zebra LSOR does odd-even sweeps. This effectively decreases the
outer loop maximum by half. So it hard to obtain speedup with greater
number of processors. There is also an increase in the number of
parallel constructs. 3 more implicit barrier calls.