Neelakanth Nadgir's blog

SQL in HBase (2013-01-31)

<div dir="ltr" style="text-align: left;" trbidi="on">
Salesforce (the company I work for) recently released a <a href="http://phoenix-hbase.blogspot.com/2013/01/announcing-phoenix-sql-layer-over-hbase.html">SQL layer on top of HBase</a>. As James Taylor says, <i>"We put the SQL back in the NoSQL!"</i> SQL is a nice language for asking questions about your data, and having it available with HBase means you need not write code to answer many kinds of questions.
</div>
<P>
<div dir="ltr" style="text-align: left;" trbidi="on">
As a performance engineer and one of the early users of Phoenix, I am impressed by its performance. It uses several techniques such as parallel scans, filtering on the region servers, aggregate pushdown, and hash joins. Even if you don't use Phoenix, you can learn quite a bit about HBase performance just by seeing how Phoenix goes about executing queries.
</div>
Epson GT S50 & Linux (Ubuntu 10.10) (2010-12-30)

It works! It was quite easy to get it working. I installed the Linux drivers from AvaSys and used SimpleScan. SimpleScan is a really nice tool. I wish there were a way to create a searchable PDF now :-)

Sun; the end of an era (2010-01-21)

I read the EU's approval of the Sun/Oracle merger with mixed feelings: glad that it is approved, and sad that Sun as a company is ending. I am sure the essence of Sun will continue inside Oracle. Ever since I saw a SunOS login prompt, I wanted to work for Sun. I worked there for 10 years, probably the best time of my life. I made a lot of good friends and learnt a whole lot of stuff. Thank you, Sun! I wish the very best for its employees.

Diff - Solaris, Linux and Ruby (2009-12-15)

I am looking for diff functionality in Ruby for a project I am involved in. There seem to be a few libraries available. The one I liked is <a href="http://github.com/pvande/differ">differ</a>. It allows you to do comparisons line by line, word by word, or character by character.
<p>
This got me thinking about the <TT>diff(1)</TT> command. With a little searching, I found that GNU diff and Solaris diff implement two different diff algorithms.
<P>
Solaris diff is based on some old code from AT&amp;T (cfront?) and uses the algorithm described in <a href="http://www.cs.dartmouth.edu/~doug/diff.ps">Hunt, J.W. and McIlroy, M.D., "An Algorithm for Differential File Comparison", Computing Science Technical Report 41, Bell Laboratories (1975)</a>.
<P>
The GNU diff utility (the one used by Linux) uses the algorithm described in <a href="http://www.xmailserver.org/diff2.pdf">An O(ND) Difference Algorithm and Its Variations - Eugene Myers, Algorithmica Vol. 1 No. 2, 1986</a>. This is a faster algorithm than the one used in Solaris, although it can be slower when the input is large and the differences are many.
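For what it's worth, Python's standard difflib module implements yet another approach (a Ratcliff/Obershelp-style matcher, as far as I know, so neither of the two algorithms above), but it is handy for a quick line-by-line comparison; the file contents here are made up:

```python
import difflib

# Two hypothetical versions of a file, as lists of lines.
old = ["the quick brown fox", "jumps over", "the lazy dog"]
new = ["the quick brown fox", "leaps over", "the lazy dog"]

# unified_diff yields hunk headers followed by context (" "),
# removed ("-"), and added ("+") lines.
diff = list(difflib.unified_diff(old, new, fromfile="a", tofile="b", lineterm=""))
for line in diff:
    print(line)
```

This prints a familiar unified diff with `-jumps over` / `+leaps over` as the changed pair.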
<P>
Very interesting. I plan to read more about this.

Scala? (2009-12-11)

I attended a Scala talk at work today. It seems like an interesting language. I was a little disappointed that the talk was only an hour long and the speaker did not have a chance to delve into the concurrency aspects of Scala, like Actors. While Scala does seem to speed up writing code, I am not sure how it will perform when deployed. Yes, I know Twitter uses it, and they are a poster child for Scala, but I would like to understand the reasons why Scala would be faster in production.
Now time to plug my favorite language: Ruby rocks!

Hybrid OpenMP/MPI - A blast from the past (2009-11-07)

<div>This is the report on an OpenMP/MPI hybrid parallelization project that I did during my internship at Sun more than 10 years ago. The report is "as-is".<br />
</div><div class="contentnnn"><h1>Parallelize a representative scientific computing code using OpenMP/MPI.</h1><br />
<h2>Introduction</h2>This project was carried out as part of my 3-month internship at Sun Microsystems during the summer of 1999. The major objective was to parallelize a representative scientific computing code using OpenMP/MPI. Specifically: <br />
<br />
<ul><li>Parallelize a representative scientific computing code using OpenMP</li>
<li>Parallelize a representative scientific computing code using MPI</li>
<li>Parallelize a representative scientific computing code using a hybrid MPI/OpenMP model</li>
<li>Evaluate the different approaches from the standpoints of ease of implementation, functionality, and performance on UltraSPARC/Solaris SMP systems</li>
</ul><br />
The representative scientific application chosen is a public-domain code from the University of Maryland in the field of semiconductor device simulation. This code is an f77 program around 7K in size. The code solves the Boltzmann Transport Equations using a combination of spherical harmonic expansions and finite-volume discretization. Two electron energy bands (lower: 12, upper: 34) are modeled, and the Gummel-Scharfetter block iteration method is used to decouple the partial differential equations. The decoupled transport equations (which after discretization reduce to a 3-d problem: 2-d in space and 1-d in energy) are solved iteratively using the LSOR method. The computation domain is Cartesian in space and curvilinear in energy, which adds to the computational complexity by slowing down convergence. <br />
<br />
<br />
The basic structure of the program can be described by the following figure<br />
<br />
<br />
<pre>Figure 1: Gummel loop figure
</pre>Each Band can then be solved iteratively using any of the SOR techniques.<br />
<pre>Figure 2: solveC.f structure
</pre><br />
<h2>Profiling the program</h2>The program was profiled using the tools provided in Workshop 5.0. The results were as expected: the majority of the time was spent in solving the bands and in the LSOR routines. Thus to successfully parallelize the code, we have to parallelize the LSOR routines and possibly the Gummel loop.<br />
<br />
Analysis of the program showed us that there are two levels of parallelism present in this program. Band Level and Task Level. Since multiple bands are being solved, a natural way to parallelize the program is to solve the bands in parallel. This is complicated by the presence of data dependencies between the two bands. The data dependencies between the bands can be represented by the following <br />
equations. In the serial case, Band 12 is solved and then Band 34 is solved. This is stated as follows <br />
<br />
<br />
<pre>Eqn #1
C12(i) uses C12(i-1) and C34(i-1)
C34(i) uses C12(i) and C34(i-1)
</pre><br />
When we applied band parallelization, we effectively changed the above dependence to <br />
<br />
<pre>Eqn #2
C12(i) uses C12(i-1) and C34(i-1)
C34(i) uses C12(i-1) and C34(i-1)
</pre><br />
This led to an increase in the number of iterations for convergence: the serial case took 22 iterations for the normal data set, while the parallel case took 38.<br />
<br />
The increase in iterations can be attributed to (as explained by Surinder) inter-band scattering (scattering from band 12 to 34, and vice versa). Each band uses the C() from the other band. This is a large scattering, and has a strong effect on the solution. So to make the solution converge faster, it needs the best values that it can get (latest Gummel).<br />
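The effect of stale versus fresh band values is easy to reproduce with a toy fixed point. The coupling constants below are made up and have nothing to do with the actual simulation; x and y merely play the roles of C12 and C34:

```python
def band_iterations(use_fresh_values, tol=1e-10, max_iters=1000):
    """Iterations to converge x = 0.5*y + 1, y = 0.5*x + 1 (solution x = y = 2).

    use_fresh_values=True mimics Eqn #1 (serial: band 34 sees band 12's
    just-computed value); False mimics Eqn #2 (band-parallel: both bands
    use the previous iteration's values)."""
    x = y = 0.0
    for i in range(1, max_iters + 1):
        x_new = 0.5 * y + 1.0
        y_new = 0.5 * (x_new if use_fresh_values else x) + 1.0
        if abs(x_new - x) < tol and abs(y_new - y) < tol:
            return i
        x, y = x_new, y_new
    return max_iters

print(band_iterations(True), band_iterations(False))
```

The Gauss-Seidel-style update (Eqn #1) converges in noticeably fewer iterations than the Jacobi-style update (Eqn #2), which is exactly the 22-vs-38 pattern seen above.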
<br />
In the OpenMP version, if we parallelize the bands and synchronize them only at the end of the Gummel loop, then it is possible to replicate the behavior of the serial case. In this case C34(i) uses values somewhere between C12(i) and C12(i-1). This leads to convergence in fewer iterations, and was successfully replicated over the three data sets.<br />
<br />
To simplify our analysis, we modified the serial program to use Eqn 2 instead of Eqn 1. All the numbers presented in this report are with respect to the modified serial version. The timings for the serial version are shown below in table 1.<br />
<br />
<table><caption>Table 1 (version 1, time in minutes)</caption> <tbody>
<tr><th>Data Set<br />
</th><th>LSOR<br />
</th><th>JSOR<br />
</th><th>HSOR<br />
</th></tr>
<tr><td>Sparse<br />
</td><td>1.30<br />
</td><td>4.14<br />
</td><td>3.26<br />
</td></tr>
<tr><td>Normal<br />
</td><td>32.30<br />
</td><td>115<br />
</td><td>n/a<br />
</td></tr>
<tr><td>Dense<br />
</td><td>164.5<br />
</td><td>667<br />
</td><td>n/a<br />
</td></tr>
</tbody></table><br />
Band parallelism using OpenMP is pretty easy. We used the sections pragma to parallelize the Gummel loop: each call to solve a band was enclosed in a sections pragma. The threads executing the sections are synchronized in each Gummel loop and after initialization of the matrix in solveC.f. One thing worth mentioning is that the procedure that solves the bands has many local variables allocated on the stack; the stack usage of the routine is approximately 55MB (found by running assureview). So the <tt>KMP_STACKSIZE</tt> environment variable needs to be set higher than that, or the program SEGVs.<br />
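The structure of the band-parallel version can be sketched with Python threads standing in for the OpenMP sections; the band "solvers" here are placeholders, and the barrier plays the role of the implicit synchronization at the end of each Gummel:

```python
import threading

NUM_GUMMEL = 3
NUM_BANDS = 2

# One thread per band, mirroring the OpenMP sections pragma.
barrier = threading.Barrier(NUM_BANDS)
log = []  # (gummel, band) records; list.append is atomic in CPython

def solve_band(band):
    for gummel in range(NUM_GUMMEL):
        log.append((gummel, band))  # stand-in for the real band solve
        barrier.wait()              # no band starts Gummel g+1 until both finish g

threads = [threading.Thread(target=solve_band, args=(b,))
           for b in ("band12", "band34")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join, every Gummel iteration has seen both bands exactly once, which is the invariant the sections-plus-barrier structure enforces.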
<br />
The MPI version is implemented by each process solving one band, exchanging values of C, and proceeding to the next Gummel. The size of the C array transferred is approximately 50MB. We tried optimizing the size of the data being sent, but the whole C array is required by the other band. The program also carries out some computations in the initialization phase; we decided to make both MPI processes do this computation rather than have one process compute and send the result to the other. Synchronization is carried out using <tt>MPI_BARRIER</tt> calls and synchronous Send-Recv calls. The processes are synchronized at the start of the Gummel loop. The data values exchanged are the C array and the error values (so that we can synchronize the exit).<br />
<br />
<br />
The bands are heavily imbalanced: band 12 takes approximately 4-5 times as long as band 34. Thus the speedup achieved cannot exceed about 1.2. This was also demonstrated in the parallel versions that we developed. The results are tabulated in table 2.<br />
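The 1.2 bound follows directly: with the two bands running concurrently, the best possible time is that of the slower band. A small sketch of the arithmetic:

```python
def band_speedup(t12, t34):
    """Upper bound on speedup from solving two bands concurrently:
    serial time (t12 + t34) over the time of the slower band."""
    return (t12 + t34) / max(t12, t34)

# With band 12 taking 4-5x as long as band 34, the bound is 1.2-1.25.
print(band_speedup(5.0, 1.0))  # 1.2
print(band_speedup(4.0, 1.0))  # 1.25
```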
<br />
<br />
<br />
<table><caption>Table 2</caption> <tbody>
<tr><th>Method<br />
</th><th>Total Time<br />
</th></tr>
<tr><td>Serial<br />
</td><td>1920 sec<br />
</td></tr>
<tr><td>OpenMP<br />
</td><td>1625 sec<br />
</td></tr>
<tr><td>MPI<br />
</td><td>1700 sec<br />
</td></tr>
</tbody></table><br />
Each band is solved by iteratively calling an SOR routine. The author chose the Line SOR method for its superior serial performance. Line SOR theoretically cannot be parallelized, so Jacobi LSOR (which can be parallelized) was used. Jacobi LSOR is much slower, taking about 4 times as long as Line SOR. Jacobi LSOR was successfully parallelized and the results are tabulated in table 3.<br />
<br />
<br />
<br />
<table><caption>Table 3</caption> <tbody>
<tr><th>Number of procs<br />
</th><th>Time (sec)<br />
</th><th>Speedup<br />
</th></tr>
<tr><td>1<br />
</td><td>8400<br />
</td><td>1<br />
</td></tr>
<tr><td>2<br />
</td><td>4325<br />
</td><td>1.94<br />
</td></tr>
<tr><td>4<br />
</td><td>2267 <br />
</td><td>3.7<br />
</td></tr>
<tr><td>8<br />
</td><td>1263<br />
</td><td>6.6<br />
</td></tr>
<tr><td>16<br />
</td><td>814<br />
</td><td>10.31<br />
</td></tr>
</tbody></table>Rajat developed a new solution technique appropriately labeled "hybrid SOR" (HSOR), a hybrid of the Jacobi and Gauss-Seidel iteration methods. Essentially, HSOR does the following:<br />
<br />
<pre>x-sweep: Jacobi in y, GS updates in H
y-sweep: Jacobi in x, GS updates in H
H-sweep: Jacobi in x, GS updates in y
</pre>
<br />
<br />
This method is better than Jacobi SOR, but still slower than Line SOR. The results are tabulated in table 4.<br />
<br />
<br />
<table><caption>Table 4</caption> <tbody>
<tr><th>Number of procs<br />
</th><th>Time (sec)<br />
</th><th>Speedup<br />
</th></tr>
<tr><td>1<br />
</td><td>8121<br />
</td><td>1<br />
</td></tr>
<tr><td>2<br />
</td><td>4221<br />
</td><td>1.92<br />
</td></tr>
<tr><td>4<br />
</td><td>2192<br />
</td><td>3.7<br />
</td></tr>
<tr><td>8<br />
</td><td>1218<br />
</td><td>6.6<br />
</td></tr>
<tr><td>16<br />
</td><td>793<br />
</td><td>10.24<br />
</td></tr>
<tr><td>32<br />
</td><td>651<br />
</td><td>12.5<br />
</td></tr>
</tbody></table>During the parallelization of Jacobi SOR and hybrid SOR, we tried experimenting with parallelizing Line SOR, and the results were amazing. The solution still converged in the same number of Gummel iterations, and the time to convergence is superior to the Jacobi LSOR method. The idea here was to break each sweep into chunks and solve the chunks in parallel, hoping that the chunk boundaries fell in regions where the solution was weak. The results are tabulated in table 5.<br />
<br />
Another possible reason could be that the parallelization leads to more inner iterations, which could in some way contribute to convergence in the same number of Gummel iterations.<br />
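A 1-D toy version of the chunked experiment shows similar behavior; the model problem below is mine, not from the simulation code. Gauss-Seidel inside each chunk with stale values at the chunk boundaries still converges:

```python
def sor_sweeps(n=32, chunks=1, tol=1e-8, max_sweeps=100000):
    """Sweeps to converge a Gauss-Seidel iteration on the 1-D model problem
    2*x[i] - x[i-1] - x[i+1] = 1 with zero boundary values.

    chunks > 1 mimics the chunked Line SOR experiment: inside a chunk the
    sweep uses freshly updated values (Gauss-Seidel), but each chunk reads
    the values its neighbours had at the start of the sweep (Jacobi at the
    chunk boundaries), as if the chunks ran in parallel."""
    x = [0.0] * n
    size = -(-n // chunks)  # ceiling division: chunk width
    for sweep in range(1, max_sweeps + 1):
        old = x[:]   # what the other chunks see during this sweep
        new = x[:]
        for c in range(chunks):
            lo, hi = c * size, min((c + 1) * size, n)
            for i in range(lo, hi):
                left = new[i - 1] if i > lo else (old[i - 1] if i > 0 else 0.0)
                right = old[i + 1] if i + 1 < n else 0.0
                new[i] = 0.5 * (1.0 + left + right)
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return sweep
        x = new
    return None

print(sor_sweeps(chunks=1), sor_sweeps(chunks=4))
```

On this toy problem the chunked variant needs somewhat more sweeps (it drifts toward Jacobi behavior as the chunks shrink), but it converges to the same solution, which is the surprising property observed above.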
<br />
<br />
<br />
<table><caption>Table 5</caption> <tbody>
<tr><th>Number of procs<br />
</th><th>Time (sec)<br />
</th><th>Speedup<br />
</th></tr>
<tr><td>1<br />
</td><td>2277<br />
</td><td>1<br />
</td></tr>
<tr><td>2<br />
</td><td>1186<br />
</td><td>1.91<br />
</td></tr>
<tr><td>4<br />
</td><td>638<br />
</td><td>3.56<br />
</td></tr>
<tr><td>8<br />
</td><td>352<br />
</td><td>6.4<br />
</td></tr>
<tr><td>16<br />
</td><td>226<br />
</td><td>10<br />
</td></tr>
<tr><td>32<br />
</td><td>173<br />
</td><td>13.16<br />
</td></tr>
</tbody></table><br />
The author provided us with an implementation of Zebra LSOR. Zebra <br />
LSOR carries out odd-even sweeps in x, y, and H directions. The <br />
results obtained with Zebra LSOR are tabulated in table 6.<br />
<br />
<br />
<br />
<table><caption>Table 6</caption> <tbody>
<tr><th>Number of procs<br />
</th><th>Time (sec)<br />
</th></tr>
<tr><td>1<br />
</td><td>2300<br />
</td></tr>
<tr><td>2<br />
</td><td>1389.38<br />
</td></tr>
<tr><td>3<br />
</td><td>1335.66<br />
</td></tr>
<tr><td>4<br />
</td><td>1762.04<br />
</td></tr>
<tr><td>5<br />
</td><td>1489.9<br />
</td></tr>
<tr><td>10<br />
</td><td>850.48<br />
</td></tr>
<tr><td>15<br />
</td><td>856.48<br />
</td></tr>
<tr><td>20<br />
</td><td>899.5<br />
</td></tr>
<tr><td>25<br />
</td><td>1025.9<br />
</td></tr>
</tbody></table>The next approach was to use MPI for band parallelism and OpenMP for task-level parallelism, motivated by the desire to exploit both levels of parallelism at once. OpenMP currently does not support nested parallelism, so MPI was chosen for band parallelism and OpenMP for task-level parallelism. The results, although preliminary, look promising; they are tabulated in table 7. We still have not been able to beat straight parallelization of LSOR, but as the number of bands increases, the hybrid model should do better.<br />
<br />
Load balancing is required in the hybrid model because the bands are heavily imbalanced: band 34 takes about 1/4 the time of band 12. This is further complicated by the band-time ratios fluctuating from Gummel to Gummel. Three heuristics were chosen to perform load balancing.<br />
<br />
<ol><li>Allocate a fixed number of threads to each band.</li>
<li>Increase or decrease the number of threads for each band by a fixed number (in this case 2) after each Gummel, based on the ratio of times.</li>
<li>Dynamically increase or decrease the number of threads after each Gummel, by an amount that depends on the ratio of the band times.</li>
</ol><br />
Testing of the above 3 options showed that option #2 is the best for this problem (normal data set).<br />
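Heuristic #2 can be sketched as follows; the cost model (band 12 has 4x the work of band 34 and each band scales perfectly with its thread count) and the thread counts are hypothetical:

```python
def rebalance(n12, n34, t12, t34, step=2):
    """Heuristic #2: after each Gummel, move a fixed number of threads
    (step) from the faster band to the slower one, keeping more than
    `step` threads on each band."""
    if t12 > t34 and n34 > step:
        return n12 + step, n34 - step
    if t34 > t12 and n12 > step:
        return n12 - step, n34 + step
    return n12, n34

# Hypothetical cost model: band 12 has ~4x the work of band 34.
n12, n34 = 8, 8  # start with an even split of 16 threads
for gummel in range(6):
    t12, t34 = 4.0 / n12, 1.0 / n34  # simulated per-band times
    n12, n34 = rebalance(n12, n34, t12, t34)
print(n12, n34)  # the split oscillates around the ideal ~4:1 ratio
```

Note that with a fixed step the split never settles exactly on the ideal ratio; it oscillates around it, which echoes the fluctuating band-time ratios mentioned above.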
<br />
<br />
<table><caption>Table 7</caption> <tbody>
<tr><th>Number of procs<br />
</th><th>Parallel LSOR<br />
</th><th>Hybrid<br />
</th></tr>
<tr><td>40<br />
</td><td>177.3<br />
</td><td>225.8<br />
</td></tr>
<tr><td>32<br />
</td><td>235.1<br />
</td><td>216<br />
</td></tr>
<tr><td>25<br />
</td><td>231.8<br />
</td><td>220<br />
</td></tr>
<tr><td>20<br />
</td><td>234.3<br />
</td><td>321<br />
</td></tr>
<tr><td>16<br />
</td><td>316.35<br />
</td><td>345.2<br />
</td></tr>
<tr><td>10<br />
</td><td>385.95<br />
</td><td>496.22<br />
</td></tr>
<tr><td>8<br />
</td><td>485.3<br />
</td><td>763.3<br />
</td></tr>
</tbody></table><br />
<br />
<h2>Implementation Details</h2>It is very hard to find the parallelism in a program until you actually know what the program is doing. One of the main benefits of OpenMP that I found is that it is possible to parallelize a program with only a fair knowledge of what it does. Band parallelism was accomplished through the sections directive; the threads were synchronized by the implicit barrier at the end of a parallel sections pragma. Parallelization of LSOR and the other SOR techniques was very similar: we used the parallel, do, critical, master, and single pragmas in the code to accomplish this. The MPI version was developed by introducing a barrier call at the end of each Gummel; synchronous sends & receives were used, and the approach was to simulate the OpenMP band-parallel version. Compilation of the hybrid model was straightforward: we used -lmpi with <tt>guidef77</tt>.<br />
<br />
<br />
<h2>Implementation difficulties</h2>I made the classic mistake of parallelizing the inner loops in LSOR. The overhead of creating and destroying threads at each outer iteration was too high, so I had to go back and parallelize the outermost loops instead. Another amusing problem we faced was stack sizes: there are a lot of stack-allocated variables in the program, and the default stack size of 32K was too small. The program failed with a segmentation fault in some cases, and in other cases failed with the message "Abort: Out of memory". Estimating stack sizes is a tricky exercise. One option is to manually calculate the sizes of all the stack-allocated (automatic) variables; another is trial and error. Assureview is an excellent tool for finding errors in a parallel program: it has facilities for finding write-write, read-write, etc. errors, and it can also be used to estimate the stack size required by each thread. But one of its main drawbacks is that it is 10-20 times slower than a normal run, and it cannot be used to validate programs that call any of the OpenMP functions (<tt>omp_get_thread_num</tt>, etc.).<br />
<br />
<br />
Signal handling by KAI-compiled programs needs to be improved: the program in some cases refuses to accept <tt>SIGKILL</tt>, and in some cases handles <tt>SIGKILL</tt> by generating a <tt>SIGSEGV</tt>!<br />
<br />
<br />
There are a lot of scheduling options available for tuning a program. Our experiments with these options showed that some are best suited for programs with fewer threads and some for programs with a larger number of threads.<br />
<br />
<br />
Another question when developing OpenMP programs is whether to make variables private or shared. This ultimately depends on the problem and the region being parallelized. Making variables private increases stack usage, and since private copies are created at each instance, more time is required. <br />
<br />
<br />
Some programs may depend on the serial semantics of loops and use loop variables after the do loop. For example, a program may use the last value of the loop index after the loop in further computations. In such cases, the lastprivate clause can be used.<br />
<br />
<br />
<h2>Implementation differences</h2>MPI and OpenMP are two different parallel programming models. Our (my) belief is that OpenMP is easier to use for parallelizing programs that have already been developed. MPI can only be used for domain decomposition and cannot be used to parallelize just loops. Another advantage of OpenMP is that we just need to insert pragmas; no major modification to the flow of the program is needed.<br />
<br />
<h2>Tuning</h2><tt>KMP_LIBRARY</tt> set to gang gives us the best performance. Stack size plays a role in the execution times of the program: our experiments with the FT kernel of the NAS parallel benchmarks showed that stack size can affect performance by as much as 10%. Making the DO loops <tt>nowait</tt> helped improve performance by as much as 5%; this can be used when there is an unequal amount of work per thread, so free threads can move on to the next loop or parallel construct.<br />
<br />
<br />
<h2>Conclusions</h2>We have demonstrated that<br />
<ol><li>OpenMP is an effective method for parallelizing programs, as reflected in the good scaling achieved.</li>
<li>SOR can be parallelized with good speedup.</li>
<li>OpenMP and MPI can be used together effectively in the same program.</li>
</ol><br />
<h2>Some notes about the project</h2><ol><li>The maximum speedup that can be obtained by band parallelism is 1.1 to 1.2. This is because band 12 currently takes 4-5 times as long as band 34; with two concurrent bands, the speedup is bounded by 1 + 1/r, where r is the ratio of the band times.</li>
<li> The best serial performance is obtained with the Line SOR.</li>
<li> Line SOR cannot mathematically be parallelized. But our results from breaking it into chunks and solving the chunks in parallel showed that it is possible in practice. One possible explanation provided by the author was that the chunk boundaries lie in weak regions, where the solution does not change much. But even with a chunk size of 1 (i.e. each grid point solved simultaneously), the solution still converged in the same number of Gummel iterations. This is something that has to be examined in greater detail.</li>
<li> In the original serial case, Eqn #1 is used. By using Eqn #2, it was shown that the number of iterations increases by almost 2x. Thus the availability of newer data is crucial for faster convergence.</li>
<li> In the MPI version, we tried experimenting with sending only the data required for the optical band interchange. But data from the other band is also required for impact ionization, so we need to send the whole C array, since we cannot ignore impact ionization.</li>
<li> The goal of load balancing was to make the bands take equal <br />
amounts of time. There are a lot of indeterminate variables in the <br />
problem that make paper analysis difficult.</li>
<li> The bands take unequal amounts of time. This is highly evident in <br />
the sparse data set, where Band 12 takes almost 10 times more time <br />
than Band 34. This ratio comes down to 4 in the normal data set.</li>
<li> The band times become more equal as the Gummel loop progresses, which makes load balancing difficult.</li>
<li> The problem is sensitive to the values of omega. Currently for LSOR, omega is 1.35 for band 12. Experimenting with the normal data set showed that the problem converges in the same number of iterations for omega=0.80. We also experimented with varying the value of omega with the iteration number: using omega=0.8 for odd-numbered Gummels and 1.35 for even-numbered Gummels makes the solution go haywire, while decreasing omega by 0.01 each Gummel iteration showed some promise.</li>
<li> The total number of iterations (Gummel and inner) varies according to the SOR method used, as summarized in the following tables:<br />
<br />
<br />
<br />
<table><caption>Normal data set</caption> <tbody>
<tr><th>Method<br />
</th><th>Gummel<br />
</th><th>inner<br />
</th></tr>
<tr><td>LSOR<br />
</td><td>21<br />
</td><td>264<br />
</td></tr>
<tr><td>Jacobi SOR<br />
</td><td>61<br />
</td><td>851<br />
</td></tr>
<tr><td>Hybrid SOR<br />
</td><td>64<br />
</td><td>746<br />
</td></tr>
</tbody></table><table><caption>Sparse data set</caption> <tbody>
<tr><th>Method<br />
</th><th>Gummel<br />
</th><th>inner<br />
</th></tr>
<tr><td>LSOR<br />
</td><td>28<br />
</td><td>148<br />
</td></tr>
<tr><td>Zebra<br />
</td><td>28<br />
</td><td>263<br />
</td></tr>
</tbody></table><br />
<br />
</li>
<li> Reorganization of loops to ensure cache hits increased the serial <br />
time by 2 seconds for the sparse data set.</li>
<li> In the SOR routines being used, the outer loop index runs over nx (approximately 40 for the normal data set). If the outer loop were over k, the maximum outer index would be k_max (400 for the normal data set), and with OpenMP we could get better scaling over a larger number of processors.</li>
<li> Interchanging the bands had no effect on the number of Gummel iterations.</li>
<li> Solving one band before the Gummel loop has no effect on the <br />
number of Gummel iterations.</li>
<li> The values of errorfo keep fluctuating during the first few iterations.</li>
<li><tt>KMP_LIBRARY</tt> set to gang gives us the best performance<br />
possible.</li>
<li> Putting a barrier pragma in the OpenMP version of the original serial program increases the number of iterations by 2x. If we don't put in the barrier, newer data is available to the bands, and they converge faster even though we break the Gummel dependence.</li>
<li> Making the OpenMP do loops nowait decreases the time by a small amount. The saving depends on the amount of work done by each thread; this option helps when the threads take unequal amounts of time to execute.</li>
<li> Surinder had an idea: break the Gummel loop concept and let the bands keep executing until they converge.</li>
<li> Why wavefront parallelism won't work: (1) there are very few inner iterations (about 10) to parallelize over, and (2) we do x, y, and H sweeps, which causes problems for the wavefront as the directions change.</li>
<li> Zebra LSOR does odd-even sweeps, which effectively halves the outer loop maximum, so it is hard to obtain speedup with a larger number of processors. There is also an increase in the number of parallel constructs: 3 more implicit barrier calls.</li>
</ol></div><style type="text/css">
/* Thanks smashingmagazine.com */
.contentnnn table
{
margin: 5px;
border-collapse: collapse;
text-align: left;
}
.contentnnn th
{
font-size: 1.2em;
font-weight: normal;
color: #039;
padding: 5px 4px;
border-bottom: 2px solid #6678b1;
}
.contentnnn td
{
/* color: #669;
line-height:1.1em;*/
font-size:0.9em;
padding: 9px 8px 0px 8px;
}
.contentnnn tbody tr:hover td
{
color: #009;
}
.contentnnn td.col1 { font-family:Monospace;}
pre { background:#eee;}
tt {color:#395d76; font-size:110%; font-weight:bold;}
</style>

Sun/Oracle/MySQL (2009-10-23)

I am not sure why people are making such a big deal about the whole Sun/Oracle/MySQL issue. To say that MySQL competes with Oracle is incorrect. Oracle has "owned" InnoDB for some time now, and InnoDB has added many new features. So it's not as if they are out to kill MySQL.

First blog (2004-06-18)

My first blog...