Homework 2 (9/24 - 10/15): Sharks and Fish Particle Simulation
[ Intro
] [ Details
] [ Submit ] [ Grades ] [ References
]
[ Discussion ] [ FAQ ]
[ Millennium Essentials ] [ NERSC Essentials ]
The Sharks and Fish Particle Simulation (SFPS) is meant to represent an abstraction of natural phenomena that you may find in biology, chemistry, physics, astronomy, etc. and eventually want to model on a parallel machine. We intend to give you a taste of creating parallel algorithms, using parallel programming tools, libraries and compilers, and modelling performance and scalability on this simplified problem.
We give you an implementation of SFPS in its simplest form -- essentially a simulation over time of equal mass particles obeying gravitational-like forces. To start, the processor's calculate an initial configuration of the particles (either uniform distribution or circular distribution). At every time step, the forces on each particle are summed up due to all of the other particles. Then using Euler's method, the new position of the particles are calculated and distributed to all of the processors. You will analyze and tune this code, and then extend it to your liking as described below.
Read a discussion of the mathematics, time discretization, and parallelization of Sharks and Fish.
You may work in groups of up to 3.
Run, tune, and profile performance of the Sharks and Fish mpi, pthreads, and titanium implementations (tgz, direct) on Millennium and seaborg (and optionally other machines of your choice). Compilation details may be found on the Millennium and NERSC essentials pages. A Python + Tk viewer is provided (e.g. fish -o filename.out; aquarium filename.out). The Sharks and Fish we give you have no difference and are just point masses (or charges) behaving according to inverse square of distance laws (e.g. gravitational or coulombic depending on your scale).
Use the performance characteristics of our naive code to direct your optimizations. Explain your performance characteristics as well as you understand them and compare your performance with the naive implementations. Possible optimization ideas are:
Cyclic copying to improve scalability (mpi, pthreads?, titanium?).
Modify communication methods (blocking/non-blocking methods, synchronization, etc...).
Interleave communication with computation (when is this interleaving possible?).
Try creating a performance model of both implementations that fits the performance curves you see. From this model, try to say something about the memory and speed scalability of the parallel implementations as you are presented with more fish and more processors.
Choose an implementation (mpi, pthreads, titanium, or your own choice) and extend Sharks and Fish in the following ways according to your interests (some of these may not make sense to combine):
Add an external current or force (possibly as a function of position) that pushes the fish.
Experiment with boundary conditions (how can you store the geometry?, does a spatial tiling over processors make sense?).
Make the fish and sharks have some size (instead of points) and experiment with impacts between fish (e.g. elastic / inelastic collisions).
Add a local force according to van der Waals or Lennard-Jones.
Filter or approximate negligable forces of fish/sharks that are far away (may lead to hierarchical methods such as Barnes-Hut).
Have the fish and sharks reproduce if they are "close" to each other (what challenges are associated with adding fish?) and die if they are too lonely (aka Game of Life type simulation). You may want to discretize how fast and the directions they can move in order to be realistic to the Game of Life.
Disallow fish/sharks to share a region (if a shark moves into the same region as a fish, then the fish is eaten; if a shark/fish moves into the same region as a shark/fish (respectively) have some sort of rule system of where to place the sharks/fish).
Have the sharks chase the fish and die if they have not eaten in enough time.
What kind of implementation challenges do you find while making your extensions? What ways have you found to overcome these difficulties?
What kind of parallel performance losses/gains do these different extensions cause? Perhaps try to model the performance of your extensions.
Email your group's tarred write-up and code, or create a webpage that has them. Tell us what you did and why you did it. Tell us about the modifications you made to the code; but more importantly, tell us about any profiling, experiments, models, or other considerations that lead you to make your changes.
Grades will be based more on your explanations and problem-solving attempts than the end implementation results. This code is probably the first parallel code that most people in the class have written, and as such, your implementations are not expected to have astounding performance and scalability.
- See the MPI on Millennium page for a description of how to compile and run MPI programs. You can simply use rexec (on one node) to run the pthreads version.
- You will almost certainly want to look at the MPI resources on the resources page. MPI is a good deal more obscure than pthreads, but you will probably want to look at the pthreads stuff, too.
- Visit the Titanium homepage for information about the compiler and possible project ideas.
- There is a page of tutorials at LLNL which includes lots of good information on pthreads, MPI, OpenMP, optimization, etc.
- Again, thanks to David Bindel for copying done from last year's homework.
[ Assignment ] [ Compilation ] [ Execution ] [ Miscellaneous ]
Do I have to optimize all three implementations?
To familiarize yourself with the coding style of each implementation, we would like you to try some light optimizations (those optimizations may be the same for each implementation, but will vary in their implementation style and performance results). A few ideas are listed above. You do not have to have great performance for each implementation, but rather describe why, how, and to what effect you optimized the implementations.
Why am I getting "stray '\' in program" errors when I run make?
You need to make sure the code was transferred as plaintext if you are editing it on a DOS machine and transfering it to a unix machine. If you transfer as binary, ^M's will be added to the end-of-lines and the \ will be modifying this expression instead of adding a newline to the #define macros.
Why does the Titanium version not compile on seaborg?
Try using "tcbuild --cc cc --backend sp3 <program>".
How do I run the viewer?
Make sure you have xhosted the machine you are running it on, and the machine has python installed.
Run the compiled fish program with the -o option specified: e.g. fish -o filename.out
Run the viewer (make sure you specify the pathname correctly): aquarium filename.outWhy does Millennium say that it cannot chdir to a directory?
The filesystem is having problems on some nodes (mm* machines).
Try restricting the set of nodes to those that work.
Which ones work? It is hard to tell until you try to run something.REXEC_SVRS on Millennium?
No, apparently rexec was the old way of doing things and Titanium compiled with udp-* backend still relies on its methodology as it has not been updated on the Millennium machines.
The preferred way to compile on Millennium machines is with the mpi-* backends which run gexec.
How do I limit the number of processes per node in Titanium?
setenv TI_PFORP processes/processors -- Aaron Wemhoff pointed out this answer.
See Dan Bonachea's start page.
What if I receive loadlevel errors on Seaborg?
Try running "poe ./fish -retrycount 20".
Are there GUI profilers for Millennium like VAMPIR for Seaborg?
Unfortunately, no. VAMPIR is not set up for Millennium. If we have a copy somewhere, I am willing to configure it for Millennium.
Is there a system hook to find which exact processor on which my process is running? [ still looking ]
On Millennium, try using the sysinfo command and fetch the number of pocesses running on the processor.
Timing? It seems that clock() is not giving accurate results.
On seaborg, it seems that clock() increases the scaling of the time based on the number of threads. You will have to divide through by the number of threads to get the actual CPU time for a given thread (also with timing barriers, etc.). -- José González pointed out this problem.
Did you remember to subtract of the initial value given by clock()? For processes taking longer than 72 minutes, you may have to adjust for wrap-around. You may also try switching to using the times() command which returns more of a breakdown of the time your process is taking. Note these commands return time in clock ticks -- according to POSIX, you can just divide through by 1000000. I would not recommend using wall clock time commands like gettimeofday() as these values may not reflect the actual process time.
If you still do not feel like you are getting accurate timings, send me email with the code that is giving you problems and the machine on which you are running it.