Debugging Parallel GENESIS Scripts

First, take a look at an explanation of some common errors.

Source Level Debugging

There is no source-level debugger for GENESIS scripts. Instead, you can set the debug level to control how much detail is printed about what a GENESIS script is executing. PGENESIS follows the same model: a debug level can be specified in the paron statement to control how much debugging information is printed during a run.

Currently, it is possible to run each worker node inside its own xterm window. To do so, pass the -debug tty flag to the pgenesis shell script, which controls how PGENESIS is run. In this case, the paron command in the GENESIS script must not be given the -output flag, because that flag redirects worker output to a file instead of stdout.

For those who need to debug C code (either GENESIS/PGENESIS source code or custom user-written libraries), it is also possible on some platforms to run the workers and the master under a C source-level debugger such as gdb or dbx. For dbx, the master and each worker run in their own windows, as with the -debug tty option, but each runs inside dbx. For gdb, the master and each worker run in their own windows, each running emacs with gdb inside it. These options are passed to the pgenesis shell script as -debug dbx and -debug gdb respectively.

Script Modifications for Debugging

In addition to adding more echo statements to the scripts, the following ideas may be helpful.

  • Timeout:    The timeout period is set by default to 120 seconds. You can modify this with the command

    setfield /post msg_hang_time n

    where n is the number of seconds to wait before timing out on barriers, responses to remote commands, etc.

  • Barriers:    Many errors in parallel programming are due to incorrect synchronization of the executing processes. Inserting extra barrier and barrierall commands can help ensure that the synchronization you expect is in fact occurring.

  • Asynchronous remote function calls:    Asynchronous function calls increase the potential degree of parallelism in a parallel script, and therefore increase the risk of deadlock (no process can continue because each is waiting for a message from another) or other program error. If your scripts use the async command, you can turn all these calls into synchronous calls by globally replacing the string "async" with "//async \", effectively commenting out the "async".
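
The global replacement described in the last item can be done in any text editor. As a convenience, the following is a minimal Python sketch of the same literal substitution, not part of PGENESIS. The function names and the file names in the usage note are hypothetical. It writes the result to a separate file so the original script is left untouched, and it should only be applied to the original script, since the replacement text itself contains the string "async".

```python
from pathlib import Path

ASYNC_TOKEN = "async"
# The literal replacement text from the documentation: //async \
SYNC_REPLACEMENT = "//async \\"


def make_synchronous(script_text: str) -> str:
    """Replace every occurrence of the string async with the replacement text.

    This is a plain substring replacement, exactly as the documentation
    describes. It also rewrites "async" inside comments or longer
    identifiers, so review the converted script before running it.
    """
    return script_text.replace(ASYNC_TOKEN, SYNC_REPLACEMENT)


def convert_file(src: Path, dst: Path) -> int:
    """Write a converted copy of src to dst and return the replacement count.

    The source file is only read, never modified, so the asynchronous
    version of the script can be restored simply by using src again.
    """
    text = src.read_text()
    dst.write_text(make_synchronous(text))
    return text.count(ASYNC_TOKEN)
```

For example, convert_file(Path("sim.g"), Path("sim_sync.g")) would write a synchronous copy of a hypothetical script sim.g to sim_sync.g and return the number of occurrences it replaced.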