Homework 2 (9/24 - 10/15): Sharks and Fish Particle Simulation

[ Intro ] [ Details ] [ Submit ] [ Grades ] [ References ]
[ Discussion ] [ FAQ ]

[ Millennium Essentials ] [ NERSC Essentials ]

Intro

The Sharks and Fish Particle Simulation (SFPS) is meant to represent an abstraction of natural phenomena that you may find in biology, chemistry, physics, astronomy, etc. and eventually want to model on a parallel machine. We intend to give you a taste of creating parallel algorithms, using parallel programming tools, libraries and compilers, and modelling performance and scalability on this simplified problem.

We give you an implementation of SFPS in its simplest form -- essentially a simulation over time of equal mass particles obeying gravitational-like forces. To start, the processor's calculate an initial configuration of the particles (either uniform distribution or circular distribution). At every time step, the forces on each particle are summed up due to all of the other particles. Then using Euler's method, the new position of the particles are calculated and distributed to all of the processors. You will analyze and tune this code, and then extend it to your liking as described below.

Read a discussion of the mathematics, time discretization, and parallelization of Sharks and Fish.

Details

You may work in groups of up to 3.

Run, tune, and profile performance of the Sharks and Fish mpi, pthreads, and titanium implementations (tgzdirect) on Millennium and seaborg (and optionally other machines of your choice). Compilation details may be found on the Millennium and NERSC essentials pages. A Python + Tk viewer is provided (e.g. fish -o filename.out; aquarium filename.out). The Sharks and Fish we give you have no difference and are just point masses (or charges) behaving according to inverse square of distance laws (e.g. gravitational or coulombic depending on your scale).

Use the performance characteristics of our naive code to direct your optimizations. Explain your performance characteristics as well as you understand them and compare your performance with the naive implementations. Possible optimization ideas are:

Try creating a performance model of both implementations that fits the performance curves you see. From this model, try to say something about the memory and speed scalability of the parallel implementations as you are presented with more fish and more processors.

Choose an implementation (mpi, pthreads, titanium, or your own choice) and extend Sharks and Fish in the following ways according to your interests (some of these may not make sense to combine):

What kind of implementation challenges do you find while making your extensions? What ways have you found to overcome these difficulties?

What kind of parallel performance losses/gains do these different extensions cause? Perhaps try to model the performance of your extensions.

Submit

Email your group's tarred write-up and code, or create a webpage that has them. Tell us what you did and why you did it. Tell us about the modifications you made to the code; but more importantly, tell us about any profiling, experiments, models, or other considerations that lead you to make your changes.

Grades

Grades will be based more on your explanations and problem-solving attempts than the end implementation results. This code is probably the first parallel code that most people in the class have written, and as such, your implementations are not expected to have astounding performance and scalability.

References

FAQ

[ Assignment ] [ Compilation ] [ Execution ] [ Miscellaneous ]

Assignment Questions

Do I have to optimize all three implementations?

To familiarize yourself with the coding style of each implementation, we would like you to try some light optimizations (those optimizations may be the same for each implementation, but will vary in their implementation style and performance results). A few ideas are listed above. You do not have to have great performance for each implementation, but rather describe why, how, and to what effect you optimized the implementations.

Compilation Problems

Why am I getting "stray '\' in program" errors when I run make?

You need to make sure the code was transferred as plaintext if you are editing it on a DOS machine and transfering it to a unix machine. If you transfer as binary, ^M's will be added to the end-of-lines and the \ will be modifying this expression instead of adding a newline to the #define macros.

Why does the Titanium version not compile on seaborg?

Try using "tcbuild --cc cc --backend sp3 <program>".

Execution Problems

How do I run the viewer?

Make sure you have xhosted the machine you are running it on, and the machine has python installed.
Run the compiled fish program with the -o option specified:
e.g. fish -o filename.out
Run the viewer (make sure you specify the pathname correctly): aquarium filename.out

Why does Millennium say that it cannot chdir to a directory?

The filesystem is having problems on some nodes (mm* machines).
Try restricting the set of nodes to those that work.
Which ones work? It is hard to tell until you try to run something.

REXEC_SVRS on Millennium?

No, apparently rexec was the old way of doing things and Titanium compiled with udp-* backend still relies on its methodology as it has not been updated on the Millennium machines.

The preferred way to compile on Millennium machines is with the mpi-* backends which run gexec.

How do I limit the number of processes per node in Titanium?

setenv TI_PFORP processes/processors    -- Aaron Wemhoff pointed out this answer.

See Dan Bonachea's start page.

What if I receive loadlevel errors on Seaborg?

Try running "poe ./fish -retrycount 20".

Miscellaneous

Are there GUI profilers for Millennium like VAMPIR for Seaborg?

Unfortunately, no. VAMPIR is not set up for Millennium. If we have a copy somewhere, I am willing to configure it for Millennium.

Is there a system hook to find which exact processor on which my process is running? [ still looking ]

On Millennium, try using the sysinfo command and fetch the number of pocesses running on the processor.

Timing? It seems that clock() is not giving accurate results.

On seaborg, it seems that clock() increases the scaling of the time based on the number of threads. You will have to divide through by the number of threads to get the actual CPU time for a given thread (also with timing barriers, etc.).                                  -- José González pointed out this problem.

Did you remember to subtract of the initial value given by clock()? For processes taking longer than 72 minutes, you may have to adjust for wrap-around. You may also try switching to using the times() command which returns more of a breakdown of the time your process is taking. Note these commands return time in clock ticks -- according to POSIX, you can just divide through by 1000000. I would not recommend using wall clock time commands like gettimeofday() as these values may not reflect the actual process time.

If you still do not feel like you are getting accurate timings, send me email with the code that is giving you problems and the machine on which you are running it.