SQL in HBase Thursday, January 31, 2013

Salesforce (the company I work for) recently released a SQL layer on top of HBase. As James Taylor says, "We put the SQL back in the NoSQL!" SQL is a nice language for asking questions about your data. Having it available with HBase means you don't need to write code to answer many kinds of questions.

As a performance engineer and one of the early users of Phoenix, I am impressed by its performance. It uses several techniques like parallel scans, filtering on the region servers, aggregate pushdown, hash joins, etc. Even if you don't use Phoenix, you can learn quite a bit about HBase performance just by seeing how Phoenix goes about executing queries.

Epson GT S50 & Linux (Ubuntu 10.10) Thursday, December 30, 2010

It works! It was quite easy to get working. I installed the Linux drivers from AvaSys and used SimpleScan. SimpleScan is a really nice tool. Wish there was a way to create a searchable PDF now :-)

Sun; the end of an era Thursday, January 21, 2010

I read the EU's approval of the Sun/Oracle merger with mixed feelings: glad that it is approved, and sad that Sun as a company is ending. I am sure the essence of Sun will continue inside Oracle. Ever since I saw a SunOS login prompt, I wanted to work for Sun. I worked there for 10 years; probably the best time of my life. I made a lot of good friends and learnt a whole lot of stuff. Thank you Sun! I wish the very best for its employees.

Diff - Solaris, Linux and Ruby Tuesday, December 15, 2009

I am looking for diff functionality in Ruby for a project I am involved in. There seem to be a few libraries available. The one I liked is differ. It allows you to do comparisons line by line, word by word, or character by character.

This got me thinking about the diff(1) command. With a little searching, I found that GNU diff and Solaris diff implement two different diff algorithms.

Solaris diff is based on some old code from AT&T (cfront?) and uses the algorithm described in Hunt, J.W. and McIlroy, M.D., "An Algorithm for Differential File Comparison," Computing Science Technical Report 41, Bell Laboratories (1975).

The GNU diff utility (the one used by Linux) uses the algorithm described in "An O(ND) Difference Algorithm and Its Variations," Eugene Myers, Algorithmica Vol. 1 No. 2, 1986. This is a faster algorithm than the one used by Solaris, although it can be slower in cases where the input is large and the differences are many.

Very interesting. I plan to read more about this.

Scala? Friday, December 11, 2009

I attended a Scala talk at work today. Seems like an interesting language. I was a little disappointed that the talk was only an hour long and the speaker did not have a chance to delve into the concurrency aspects of Scala like Actors, etc. While it does seem to speed up writing code, I am not sure how it will perform when it's deployed. Yes, I know Twitter uses it, and they are a poster child for Scala, but I would like to understand the reasons why Scala would be faster once deployed. Now time to plug my favorite language: Ruby Rocks!!

Hybrid OpenMP/MPI - A blast from the past Saturday, November 07, 2009

This is the report on an OpenMP/MPI hybrid parallelization project that I did during my internship at Sun more than 10 years ago. The report is "as-is".

Parallelize a representative scientific computing code using OpenMP/MPI.


Introduction

This project was carried out as a part of my 3-month internship at Sun Microsystems during the summer of 1999. The overall objective was to parallelize a representative scientific computing code using OpenMP/MPI. Specifically, the objectives were to:

  • Parallelize a representative scientific computing code using OpenMP
  • Parallelize a representative scientific computing code using MPI
  • Parallelize a representative scientific computing code using a hybrid MPI/OpenMP model
  • Evaluate the different approaches from the standpoints of ease of implementation, functionality and performance on UltraSparc/Solaris SMP systems

The representative scientific application chosen is a public-domain code from the University of Maryland in the field of semiconductor device simulation. The code is an f77 program around 7K in size. It solves the Boltzmann Transport Equations using a combination of spherical harmonic expansions and finite-volume discretization. Two electron energy bands (lower: 12, upper: 34) are modeled, and the Gummel-Scharfetter block iteration method is used to decouple the partial differential equations. The decoupled transport equations (which after discretization reduce to solving a 3-D problem: 2-D in space and 1-D in energy) are solved iteratively using the LSOR method. The computational domain is Cartesian in space and curvilinear in energy, which adds to the computational complexity by slowing down the convergence.


The basic structure of the program can be described by the following figures:


Figure 1: Gummel loop
Each band can then be solved iteratively using any of the SOR techniques.
Figure 2: solveC.f structure

Profiling the program

The program was profiled using the tools provided in Workshop 5.0. The results were as expected: the majority of the time was spent in solving the bands and in the LSOR routines. Thus, to successfully parallelize the code, we have to parallelize the LSOR routines and possibly the Gummel loop.

Analysis of the program showed us that there are two levels of parallelism present in this program: band level and task level. Since multiple bands are being solved, a natural way to parallelize the program is to solve the bands in parallel. This is complicated by the presence of data dependencies between the two bands, which can be represented by the following equations. In the serial case, Band 12 is solved and then Band 34 is solved. This is stated as follows:


Eqn #1
C12(i) uses C12(i-1) and C34(i-1)
C34(i) uses C12(i) and C34(i-1)

When we applied band parallelization, we effectively changed the above dependence to

Eqn #2
C12(i) uses C12(i-1) and C34(i-1)
C34(i) uses C12(i-1) and C34(i-1)

This led to an increase in the number of iterations needed for convergence: 22 iterations for the normal data set in the serial case versus 38 iterations in the parallel case.

The increase in iterations can be attributed (as explained by Surinder) to inter-band scattering (scattering from band 12 to 34, and vice versa). Each band uses the C() from the other band. This scattering is large and has a strong effect on the solution, so to converge faster each band needs the best (latest Gummel) values it can get.

In the OpenMP version, if we parallelize the bands and synchronize them only at the end of the Gummel loop, then it is possible to replicate the behavior of the serial case: C34(i) uses values somewhere between C12(i) and C12(i-1), which leads to convergence in fewer iterations. This was observed consistently over the three data sets.

To simplify our analysis, we modified the serial program to use Eqn #2 instead of Eqn #1. All the numbers presented in this report are with respect to the modified serial version. The timings for the serial version are shown below in Table 1.

Table 1 (version 1, time in minutes)
Data Set    LSOR     JSOR    HSOR
Sparse      1.30     4.14    3.26
Normal      32.30    115
Dense       164.5    667

Band parallelism using OpenMP is pretty easy. We used the sections pragma to parallelize the Gummel loop: each call to solve a band was enclosed in a section. The threads executing the sections are synchronized in each Gummel loop and after initialization of the matrix in solveC.f. One thing worth mentioning is that the procedure that solves the bands has a lot of local variables that are allocated on the stack; its stack usage is approximately 55MB (found by running Assureview). So we need to set the KMP_STACKSIZE environment variable higher than that, else the program SEGVs.
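
To make the structure concrete, here is a minimal free-form Fortran sketch of the sections-based band parallelism. The names (solve_band, c12, c34) and the toy band solve are placeholders for illustration, not the identifiers from the actual code, and the convergence test is left out.

  program band_sections
    implicit none
    integer, parameter :: n = 1000
    real(8) :: c12(n), c34(n), c12_old(n), c34_old(n)
    integer :: ig

    c12_old = 0.0d0
    c34_old = 0.0d0
    do ig = 1, 10                                 ! Gummel loop
  !$omp parallel sections
  !$omp section
       call solve_band(c12, c12_old, c34_old)     ! band 12 on one thread
  !$omp section
       call solve_band(c34, c12_old, c34_old)     ! band 34 on another thread
  !$omp end parallel sections
       ! implicit barrier above: both bands are done before we move on
       c12_old = c12
       c34_old = c34
    end do
    print *, c12(1), c34(1)
  contains
    subroutine solve_band(c, a, b)                ! stand-in for the real LSOR band solve
      real(8), intent(out) :: c(:)
      real(8), intent(in)  :: a(:), b(:)
      c = 0.5d0*(a + b) + 1.0d0
    end subroutine solve_band
  end program band_sections

For the real band solver, KMP_STACKSIZE would still have to be raised above the ~55MB stack usage noted above, since each section runs that routine on its own thread.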

The MPI version is implemented with each process solving one band, exchanging values for C, and proceeding to the next Gummel. The size of the C data transferred is approximately 50MB. We tried to reduce the amount of data being sent, but the whole C array is required by the other band. The program also carries out some computations in the initialization phase; we decided to make both MPI processes do this computation rather than have one process compute it and send it to the other. Synchronization is carried out using MPI_BARRIER calls and synchronous Send-Recv calls. The processes are synchronized at the start of the Gummel loop. The data values exchanged are the C array and the error values (so that we can synchronize the exit).
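
The shape of that exchange, as a sketch: two ranks, one per band, a barrier at the top of each Gummel, a synchronous send-receive of the C array, and a reduction of the error so both ranks exit together. The names, array sizes, the toy band solve, and the use of MPI_Allreduce for the error are my illustrative assumptions, not the report's actual code (which used barrier and synchronous Send-Recv calls throughout).

  program band_mpi
    use mpi
    implicit none
    integer, parameter :: n = 1000
    real(8) :: c_mine(n), c_other(n), my_err, glob_err
    integer :: rank, other, ierr, ig
    integer :: status(MPI_STATUS_SIZE)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    other = 1 - rank                   ! rank 0 solves band 12, rank 1 solves band 34

    c_mine  = real(rank + 1, 8)
    c_other = 0.0d0
    do ig = 1, 50                      ! Gummel loop
       call MPI_Barrier(MPI_COMM_WORLD, ierr)    ! line up at the start of each Gummel
       call solve_band(c_mine, c_other)          ! stand-in for the real band solve

       ! synchronous exchange of the (full) C array with the other band
       call MPI_Sendrecv(c_mine,  n, MPI_DOUBLE_PRECISION, other, 0, &
                         c_other, n, MPI_DOUBLE_PRECISION, other, 0, &
                         MPI_COMM_WORLD, status, ierr)

       ! agree on a global error so both ranks take the exit branch together
       my_err = maxval(abs(c_mine - c_other))
       call MPI_Allreduce(my_err, glob_err, 1, MPI_DOUBLE_PRECISION, &
                          MPI_MAX, MPI_COMM_WORLD, ierr)
       if (glob_err < 1.0d-6) exit
    end do
    if (rank == 0) print *, 'stopped after', ig, 'Gummel iterations'
    call MPI_Finalize(ierr)
  contains
    subroutine solve_band(c, cother)   ! trivial stand-in for the LSOR band solve
      real(8), intent(inout) :: c(:)
      real(8), intent(in)    :: cother(:)
      c = 0.5d0*(c + cother)
    end subroutine solve_band
  end program band_mpi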


The bands are heavily imbalanced: band 12 takes approximately 4-5 times more time than band 34. Thus the speedup achievable by band parallelism cannot exceed about 1.2. This was also demonstrated in the parallel versions that we developed. The results are tabulated in Table 2.



Table 2
Method    Total Time
Serial    1920 sec
OpenMP    1625 sec
MPI       1700 sec

Each band is solved by iteratively calling an SOR routine. The author chose the Line SOR method due to its superior serial performance. Line SOR, in theory, cannot be parallelized, so Jacobi LSOR (which can be parallelized) was used. Jacobi LSOR is much slower, taking about 4 times as long as Line SOR. Jacobi LSOR was successfully parallelized and the results are tabulated in Table 3.



Table 3
Number of procs    Time (sec)    Speedup
1                  8400          1
2                  4325          1.94
4                  2267          3.7
8                  1263          6.6
16                 814           10.31
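
The reason Jacobi LSOR parallelizes so directly is that every line in a sweep is computed from the previous iterate only, so the loop over lines has no cross-iteration dependence. A toy sketch of that structure (the grid, the relaxation factor, and solve_line are placeholders; the real routine does a tridiagonal solve per line):

  program jacobi_lsor_sketch
    implicit none
    integer, parameter :: nx = 64, ny = 64
    real(8), parameter :: w = 0.8d0            ! damping factor for this toy update only
    real(8) :: u(nx, 0:ny+1), uold(nx, 0:ny+1)
    integer :: j, it

    u = 0.0d0
    u(:, 0) = 1.0d0
    u(:, ny+1) = 1.0d0                         ! toy boundary data
    do it = 1, 200
       uold = u
       ! Jacobi sweep: each line j reads only uold, so the lines are
       ! independent and the loop over lines parallelizes directly
  !$omp parallel do private(j)
       do j = 1, ny
          call solve_line(u(:, j), uold(:, j-1), uold(:, j+1), w)
       end do
  !$omp end parallel do
    end do
    print *, 'u at the centre:', u(nx/2, ny/2)
  contains
    subroutine solve_line(line, below, above, wfac)  ! stand-in for the tridiagonal line solve
      real(8), intent(inout) :: line(:)
      real(8), intent(in)    :: below(:), above(:), wfac
      line = (1.0d0 - wfac)*line + wfac*0.5d0*(below + above)
    end subroutine solve_line
  end program jacobi_lsor_sketch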
Rajat developed a new solution technique appropriately labeled "hybrid SOR". Hybrid SOR is a hybrid of the Jacobi and Gauss-Seidel iteration methods. Essentially, HSOR does the following:

x-sweep: Jacobi in y, GS updates in H
y-sweep: Jacobi in x, GS updates in H
H-sweep: Jacobi in x, GS updates in y


This method is better than Jacobi SOR, but takes more time than Line SOR. The results are tabulated in Table 4.


Table 4
Number of procs    Time (sec)    Speedup
1                  8121          1
2                  4221          1.92
4                  2192          3.7
8                  1218          6.6
16                 793           10.24
32                 651           12.5
During the parallelization of Jacobi SOR and hybrid SOR, we tried experimenting with parallelizing Line SOR, and the results were amazing: the solution still converged in the same number of Gummel iterations, and the time to convergence is better than with the Jacobi LSOR method. The idea here was to break each sweep into chunks and then solve these chunks in parallel, hoping that the chunk boundaries fell in regions where the solution was weak. The results are tabulated in Table 5.

Another possible reason could be that the parallelization leads to more inner iterations, which could in some way be contributing to the convergence in the same number of Gummel iterations.



Table 5
Number of procs    Time (sec)    Speedup
1                  2277          1
2                  1186          1.91
4                  638           3.56
8                  352           6.4
16                 226           10
32                 173           13.16
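
One way to write the chunked sweep so that it stays well-defined is to let each chunk read the previous iterate at its own boundaries and do the usual sequential line sweep inside; serial LSOR is recovered with a single chunk. A fragment, reusing the placeholder names from the Jacobi sketch above (chunk and nchunks are assumed to be set so that the chunks cover 1..ny):

     uold = u                           ! previous iterate, used at chunk boundaries
  !$omp parallel do private(ic, jlo, jhi, j)
     do ic = 1, nchunks
        jlo = (ic - 1)*chunk + 1
        jhi = min(ic*chunk, ny)
        if (jlo > ny) cycle
        ! first line of the chunk: both neighbours come from the previous iterate
        call solve_line(u(:, jlo), uold(:, jlo-1), uold(:, jlo+1), w)
        do j = jlo + 1, jhi
           ! lower neighbour already updated within this chunk (Gauss-Seidel style),
           ! upper neighbour is still the previous iterate, as in the serial sweep
           call solve_line(u(:, j), u(:, j-1), uold(:, j+1), w)
        end do
     end do
  !$omp end parallel do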

The author provided us with an implementation of Zebra LSOR, which carries out odd-even sweeps in the x, y, and H directions. The results obtained with Zebra LSOR are tabulated in Table 6.



Table 6
Number of procs    Time (sec)
1                  2300
2                  1389.38
3                  1335.66
4                  1762.04
5                  1489.9
10                 850.48
15                 856.48
20                 899.5
25                 1025.9
The next approach was to use MPI to achieve band parallelism and OpenMP for task-level parallelism, the motivation being to exploit both levels of parallelism at once. OpenMP currently does not support nested parallelism, so MPI was chosen for the band parallelism and OpenMP was used for task-level parallelism within each band. The results, although preliminary, look promising; they are tabulated in Table 7. We still have not been able to beat straight parallelization of LSOR, but as the number of bands increases, the hybrid model should do better.
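
Structurally, the hybrid version is just the two earlier sketches combined: one MPI rank per band, and inside each rank the sweep over lines is an OpenMP parallel do. A fragment with the same placeholder names (uother, other, status, ierr and max_gummel are assumed to be declared as in the MPI sketch; the real code also exchanges the error values inside the loop):

     do ig = 1, max_gummel
        call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! the two band processes line up per Gummel
        uold = u
        ! task-level parallelism inside this band: OpenMP over the lines
  !$omp parallel do private(j)
        do j = 1, ny
           call solve_line(u(:, j), uold(:, j-1), uold(:, j+1), w)
        end do
  !$omp end parallel do
        ! band-level parallelism across ranks: exchange this band's C data
        call MPI_Sendrecv(u,      nx*(ny+2), MPI_DOUBLE_PRECISION, other, 0, &
                          uother, nx*(ny+2), MPI_DOUBLE_PRECISION, other, 0, &
                          MPI_COMM_WORLD, status, ierr)
     end do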

Load balancing is required in the hybrid model because the bands are heavily imbalanced: band 34 takes about 1/4 of the time of band 12. This is further complicated by the fact that the ratio of the band times fluctuates from Gummel to Gummel. Three heuristics were chosen to perform load balancing:

  1. Allocate a fixed number of threads to each band.
  2. Increase (Decrease) the number of threads by a fixed number (in
    this case 2) for each band after each Gummel based on the ratio of
    times.
  3. Dynamically increase or decrease the number of threads after each
    Gummel. The amount by which the threads are increased/decreased
    depends on the ratio of the times for the bands.

Testing of the above three options showed us that option #2 works best for this problem (normal data set).
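
A sketch of heuristic #2 as it would sit at the bottom of each Gummel iteration in one band's process. The names (t_mine, t_other, nthreads, max_threads), the step of 2, and the bounds are placeholders; the two ranks are assumed to exchange their band times the same way they exchange the error values.

     ! assumes "use omp_lib" in this routine; t0, t_mine, t_other are real(8),
     ! nthreads and max_threads are integers kept across Gummel iterations
     t0 = omp_get_wtime()
     ! ... this band's solve for the current Gummel ...
     t_mine = omp_get_wtime() - t0
     ! exchange t_mine / t_other with the other rank (e.g. MPI_Sendrecv), then:
     if (t_mine > t_other .and. nthreads + 2 <= max_threads) then
        nthreads = nthreads + 2          ! the slower band gets two more threads
     else if (t_mine < t_other .and. nthreads - 2 >= 1) then
        nthreads = nthreads - 2          ! the faster band gives two threads back
     end if
     call omp_set_num_threads(nthreads)  ! takes effect at the next parallel region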


Table 7
Number of procs    Parallel LSOR    Hybrid
40                 177.3            225.8
32                 235.1            216
25                 231.8            220
20                 234.3            321
16                 316.35           345.2
10                 385.95           496.22
8                  485.3            763.3


Implementation Details

It is very hard to find the parallelism in a program until you actually know what the program is doing. One of the main benefits of OpenMP that I found is that it is possible to parallelize a program with only a fair knowledge of what the program does. Band parallelism was accomplished through the use of the sections directive; the threads were synchronized by the implicit barrier at the end of the parallel sections construct. Parallelization of LSOR and the other SOR techniques was very similar: we used the parallel, do, critical, master, and single pragmas in the code to accomplish this. The MPI version was developed by introducing a barrier call at the end of each Gummel, using synchronous sends and receives; the approach was to mimic the OpenMP band-parallel version. Compilation of the hybrid model was straightforward: we used -lmpi with guidef77.


Implementation difficulties.

I made the classic mistake of parallelizing the inner loops in LSOR. The overhead of creating and destroying threads at each outer iteration was too high, so I had to go back and parallelize the outermost loops instead. Another amusing problem we faced was stack sizes. There are a lot of stack-allocated variables in the program, and the default stack size of 32K was too small; the program failed with a segmentation fault in some cases, and in other cases failed with the message "Abort: Out of memory". Estimating stack sizes is a tricky exercise. One option is to manually calculate the sizes of all the stack-allocated (automatic) variables; another is trial and error. Assureview is an excellent tool for finding errors in a parallel program. It has a lot of facilities for finding write-write, read-write, etc. errors, and it can also be used to estimate the stack size required by each thread. But one of its main drawbacks is that it is 10-20 times slower than a normal run, and it cannot be used to validate programs that call any of the OpenMP functions (omp_get_thread_num, etc.).


Signal handling by KAI-compiled programs needs to be improved. The program in some cases refuses to accept SIGKILL, and in some cases handles SIGKILL by generating a SIGSEGV!


There are a lot of scheduling options available for tuning a program. Our experiments with these options showed us that some of them are best suited for programs with a small number of threads, while others are better for programs with a larger number of threads.


Another problem while developing OpenMP programs is deciding whether to make variables private or shared. This ultimately depends on the problem and the region that needs to be parallelized. Making variables private increases the stack usage, and since the private copies are created at each instance, more time is required.


Some programs depend on the serial semantics of loops and use the loop variables after the do loop; for example, a program may use the last value of the loop index after the loop in some computation. In such cases, the lastprivate clause can be used.
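
A small self-contained example of the clause (illustrative names; lastprivate copies back the values from the sequentially last iteration, so the code after the loop sees the same values it would see in a serial run):

  program lastprivate_demo
    implicit none
    integer, parameter :: n = 100
    integer :: i
    real(8) :: x(n), last_x

  !$omp parallel do lastprivate(i, last_x)
    do i = 1, n
       x(i) = sqrt(real(i, 8))
       last_x = x(i)
    end do
  !$omp end parallel do

    ! i and last_x now have the same values they would have after a serial
    ! run of the loop, so code that uses them afterwards keeps working
    print *, 'i =', i, ' last_x =', last_x
  end program lastprivate_demo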


Implementation differences.

MPI and OpenMP are two different parallel programming models. Our (my) belief is that OpenMP is easier to use for parallelizing programs that have already been developed. MPI can only be used to do domain decomposition and cannot be used to parallelize just loops. Another attraction of OpenMP is that we just need to insert pragmas; no major modification to the flow of the program has to be made.

Tuning

KMP_LIBRARY set to gang gives us the best performance.
Stack-size plays a role in the execution times of the program. Our
experiments with FT kernel of NAS parallel benchmarks showed us that
stack-size can effect performance by as much as 10%
Making the DO loops nowait helped improve performance by as mush as
5%. This can be used when there is unequal amount of work for each
thread and the free threads can be put to work on the next loop or
parallel construct.
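
A minimal example of the nowait usage (illustrative arrays; it is safe here because the two loops touch different data, so a thread that finishes the first loop early can start on the second without waiting at the implicit barrier):

  program nowait_demo
    implicit none
    integer, parameter :: n = 100000
    real(8) :: a(n), b(n)
    integer :: i

    a = 1.0d0
    b = 1.0d0
  !$omp parallel private(i)
  !$omp do
    do i = 1, n
       a(i) = 2.0d0*a(i)
    end do
  !$omp end do nowait            ! no barrier: fast threads move straight on to b
  !$omp do
    do i = 1, n
       b(i) = b(i) + 3.0d0
    end do
  !$omp end do                   ! implicit barrier kept: results are used below
  !$omp end parallel
    print *, a(1) + b(n)         ! 2.0 + 4.0 = 6.0
  end program nowait_demo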


Conclusions

We have demonstrated that:
  1. OpenMP is an effective method to parallelize programs. This is reflected in the good scaling achieved.
  2. SOR was parallelized and we obtained good speedup.
  3. OpenMP and MPI can be used together effectively in the same program.

Some notes about the project

  1. The maximum speedup that can be obtained by band parallelism is 1.1 to 1.2. This is due to the fact that Band 12 currently takes 4-5 times more time than Band 34; the maximum speedup is 1 + r, where r is the ratio of the shorter band time to the longer one.
  2. The best serial performance is obtained with the Line SOR.
  3. Line SOR cannot be mathematically parallelized. But our results with parallelizing it by breaking it down into chunks and solving the chunks in parallel showed that this is possible in practice. One possible explanation provided by the author was that the chunk boundaries lie in weak regions, i.e. regions where the solution does not change much. But even when we used a chunk size of 1 (i.e. each grid point is solved simultaneously), the solution still converged in the same number of Gummel iterations. This is something that has to be examined in greater detail.
  4. In the original serial case, Eqn #1 is used. By using Eqn #2, it was shown that the number of iterations increases by almost 2x. Thus the availability of newer data is crucial for faster convergence.
  5. In the MPI version, we tried experimenting with sending only the data required for the optical band interchange. The other place where data from the other band is required is impact ionization, and since we cannot ignore impact ionization we need to send the whole C array.
  6. The goal of load balancing was to make the bands take equal
    amounts of time. There are a lot of indeterminate variables in the
    problem that make paper analysis difficult.
  7. The bands take unequal amounts of time. This is most evident in the sparse data set, where Band 12 takes almost 10 times more time than Band 34. This ratio comes down to 4 for the normal data set.
  8. The times for the bands become more equal as the Gummel loop progresses, which makes load balancing difficult.
  9. The problem is sensitive to the values of omega. Currently for LSOR, omega is 1.35 for band 12. Experimenting with the normal data set showed that the problem converges in the same number of iterations for omega=0.80. We also experimented with varying the value of omega based on the iteration number: using omega=0.8 for odd-numbered Gummels and 1.35 for even-numbered Gummels makes the solution go haywire, while decreasing omega by 0.01 at each Gummel iteration showed some promise.
  10. The total number of iterations (Gummel and inner) varies according to the SOR method used. This is summarized in the following tables:



    Normal data set
    Method        Gummel    Inner
    LSOR          21        264
    Jacobi SOR    61        851
    Hybrid SOR    64        746

    Sparse data set
    Method        Gummel    Inner
    LSOR          28        148
    Zebra         28        263


  11. Reorganization of loops to ensure cache hits increased the serial
    time by 2 seconds for the sparse data set.
  12. In the SOR routines being used, the outer index is nx (approximately 40 for the normal data set). If the outer loop were instead over k, the maximum outer index would be k_max (400 for the normal data set), and with OpenMP we could get better scaling over a larger number of processors.
  13. Interchanging the bands had no effect on the number of Gummel iterations.
  14. Solving one band before the Gummel loop has no effect on the
    number of Gummel iterations.
  15. The values of errorfo keep fluctuating during the first few iterations.
  16. KMP_LIBRARY set to gang gives us the best performance
    possible.
  17. Putting a barrier pragma in the OpenMP version of the original serial program increases the number of iterations by 2x. If we don't put in the barrier, then newer data is available to the bands and they converge faster even though we break the Gummel dependence.
  18. Making the OpenMP do loops nowait decreases the time by a small amount. The saving depends on the amount of work done by each thread; this option helps when the threads take unequal amounts of time to execute.
  19. Surinder had an idea: break the Gummel loop concept altogether and let the bands keep executing until they converge.
  20. Why won't wavefront parallelism work? In our case, (1) there are very few inner iterations (about 10) to parallelize over, and (2) we do x, y, and H sweeps, which causes problems in the wavefront as the directions change.
  21. Zebra LSOR does odd-even sweeps, which effectively halves the maximum of the outer loop, so it is hard to obtain speedup with a larger number of processors. There is also an increase in the number of parallel constructs: 3 more implicit barrier calls.

Sun/Oracle/MySQL Friday, October 23, 2009

I am not sure why people are making such a big deal about the whole Sun/Oracle/MySQL issue. To say that MySQL competes with Oracle is incorrect. Oracle has "owned" InnoDB for some time now, and InnoDB has added many new features. So it's not like they are out to kill MySQL.