Laplace Benchmark
The core of the Laplace test code is a simple two-dimensional central
differencing scheme. Laplace was developed as a vehicle for motivating
an introductory MPI course and as such, little attention was paid to
performance-related issues.
A rectangular grid is defined at run time by the user. A single layer
of ghost cells surrounds each subdomain. Boundary conditions are set
on the ghost region surrounding the global domain. Boundary values on
the global domain are never updated (read only).
The problem solved by the Laplace benchmark is trivial. The solution is
a plane located at z=1. Boundary variables are set to 1. Interior ghost
cells and the subdomains themselves are initialized to 0. Because the exact
solution is known, a simple estimated error norm can be computed without
additional storage. The feature was added for validation purposes.
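The setup and error check described above can be sketched in a few lines of Python. This is a single-process illustration, not the benchmark's Fortran code: the names make_grid and error_norm are hypothetical, and the choice of a max norm is an assumption, since the text does not say which norm Laplace uses.

```python
def make_grid(rows, cols, boundary=1.0, interior=0.0):
    """Return a (rows+2) x (cols+2) grid: the interior plus one ghost layer."""
    grid = [[interior] * (cols + 2) for _ in range(rows + 2)]
    for j in range(cols + 2):          # top and bottom ghost rows
        grid[0][j] = boundary
        grid[rows + 1][j] = boundary
    for i in range(rows + 2):          # left and right ghost columns
        grid[i][0] = boundary
        grid[i][cols + 1] = boundary
    return grid


def error_norm(grid):
    """Max-norm distance of the interior from the exact solution u = 1.

    No reference array is stored: the exact value is the constant 1.
    """
    return max(abs(1.0 - v) for row in grid[1:-1] for v in row[1:-1])


grid = make_grid(3, 4)
print(error_norm(grid))  # 1.0 before any iteration, since the interior is 0
```

The error starts at 1 and should fall towards 0 as the iteration converges, which is what makes it usable as a validation check.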
Laplace consists of a few hundred lines of Fortran code. The differencing
pattern is a 4-point star. As such, the number of iterations required to
propagate boundary values into the center-of-gravity of the domain is a
function of the mesh dimension along the axis that features the fewest
node points.
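The dependence on the shorter mesh axis can be seen in a small experiment. The sketch below is an illustration in Python rather than the benchmark's Fortran; it assumes an explicit update in which every point reads only values from the previous sweep, and the function names are hypothetical.

```python
def make_grid(rows, cols):
    """Interior of zeros wrapped in one layer of boundary cells set to 1."""
    grid = [[1.0] * (cols + 2)]
    grid += [[1.0] + [0.0] * cols + [1.0] for _ in range(rows)]
    grid.append([1.0] * (cols + 2))
    return grid


def explicit_sweep(grid):
    """One explicit sweep: every interior point reads only old values."""
    new = [row[:] for row in grid]
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def sweeps_to_reach_center(rows, cols):
    """Count sweeps until the central interior point becomes nonzero."""
    grid = make_grid(rows, cols)
    ci, cj = (rows + 1) // 2, (cols + 1) // 2
    sweeps = 0
    while grid[ci][cj] == 0.0:
        grid = explicit_sweep(grid)
        sweeps += 1
    return sweeps


print(sweeps_to_reach_center(5, 9))   # 3
print(sweeps_to_reach_center(5, 99))  # 3
```

Each sweep moves boundary information one cell inward, so stretching the long axis from 9 to 99 points does not change the count; only the 5-point axis matters.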
The original benchmark code used an explicit scheme. In version 4, the
computational kernel was revised. A Gauss-Seidel variant of the central
differencing scheme was employed for two reasons. It accelerates the
propagation of boundary conditions, and it needs the least storage
because the grid is updated in place.
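Both properties can be illustrated with a minimal single-process sketch. It is written in Python with hypothetical function names and is not the version 4 kernel itself; in particular, how the parallel code treats ghost cells between subdomains is not shown here.

```python
def make_grid(rows, cols):
    """Interior of zeros wrapped in one layer of boundary cells set to 1."""
    grid = [[1.0] * (cols + 2)]
    grid += [[1.0] + [0.0] * cols + [1.0] for _ in range(rows)]
    grid.append([1.0] * (cols + 2))
    return grid


def gauss_seidel_sweep(grid):
    """Update the interior in place, so no second array is needed.

    Neighbours above and to the left were already updated in this sweep,
    so their new values are used immediately.
    """
    for i in range(1, len(grid) - 1):
        for j in range(1, len(grid[0]) - 1):
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1])


grid = make_grid(5, 9)
gauss_seidel_sweep(grid)
center_after_one = grid[3][5]
print(center_after_one > 0.0)  # True: the center is reached in one sweep
```

With the sweep order used here, boundary values travel across the whole subdomain within a single pass, whereas an explicit update needs one sweep per cell of distance from the boundary.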
All floating point data is explicitly declared as REAL*8. Integer sizes
were not specified because no integers are communicated; the compiler
default was used (32 bits on all test platforms).
Working from the Tru64/PBS version, we created AIX/LoadLeveler, Linux/IA-32,
and Linux/IA-64 ports. A timer library based on the gettimeofday() and
getrusage() functions was also ported to each platform to support internal
measurements including the floating point rate of the computational kernel
and the transfer rate for the communications steps.
NCSA Benchmarks
Dimensioning criteria and iteration counts for the Laplace code are specified
at run time. For the NCSA tests, an effort was made to fill at least 80% of
the memory of the smallest single node to be tested (IA-32 at ~1.25 GB available).
This case was dubbed Small, while Medium and Large cases were defined by doubling
and quadrupling this amount across 2 and 4 nodes respectively. In other words,
the Medium and Large cases were sized to fill 80% of multiple IA-32 node memories.
Iteration counts were varied by platform to cause each version of the code to
complete a "Small" simulation in approximately 1 hour.
The row count was fixed across all cases at 8000. Other run-time parameters that
were used are as follows.
Number of columns:
Processors     1     2     4     8    16    21    32    41    64   128
----------------------------------------------------------------------
Small       8000  4000  2000  1000   500   381   250   195   125    63
Medium         -  8000  4000  2000  1000   762   500   390   250   125
Large          -     -  8000  4000  2000  1524  1000   780   500   250
Notes
-----
Cases involving 21 processors were only tested on Copper.
The extent of Copper testing was limited to 32 processors (node size).
Cases involving 41 processors were used to test the Linux clusters.
Column counts are per-process.
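The per-process column counts in the table are consistent with dividing a total column count (8000, 16000, and 32000 for Small, Medium, and Large) by the process count and rounding to the nearest integer. Whether the values were actually derived this way is an assumption; the Python sketch below, with hypothetical names, only reproduces the table.

```python
# Total column counts inferred from the table (an assumption, not a quoted figure).
TOTAL_COLUMNS = {"Small": 8000, "Medium": 16000, "Large": 32000}


def columns_per_process(total_columns, processes):
    """Divide and round half up, using integer arithmetic only."""
    return (2 * total_columns + processes) // (2 * processes)


for case, total in TOTAL_COLUMNS.items():
    counts = [columns_per_process(total, p) for p in (16, 21, 32, 41, 64, 128)]
    print(case, counts)
```

Integer arithmetic is used instead of round() because Python rounds halves to the nearest even number, which would give 62 rather than 63 for 8000 / 128.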
Number of Iterations (fixed across all process counts):
--------------------
p690     Copper     AIX      LoadLeveler     2500
IA-32    Platinum   Linux    PBS              500
IA-64    Titan      Linux    PBS             2000
All Copper tests were performed on a dedicated node. All Linux cluster tests were
performed with the process-per-node count set to 1. See the original Word document
for additional software and hardware details.
Tables and Charts
o Parameters and Tables (Word)
o Charts (Excel)
Analysis of Results
The Laplace code can be configured to test numerous machine characteristics. The
configuration used for the NCSA Benchmarks might be considered 'typical.'
By reducing the column count and increasing the row count, timings can be skewed
towards message passing rates. Conversely, increasing the column count and
reducing the row count skews timings towards floating point performance.
Since the problem size was fixed for these experiments, the per-process column
count decreases as the process count increases. This creates the potential for
increases in performance due to caching effects. It also means that network
timings will likely increase.
The communications pattern amounts to simple shifts on the subdomain ghost cells.
Since the row count was held fixed, the message size also remains fixed. It is
small, though, at 8,000 x 8 bytes = 64 Kbytes. For a sustainable bandwidth in
the neighborhood of 100 MBytes/s, transmission times will be very small, and so
latency is expected to dominate communications timings.
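The message size and the corresponding transmission time can be worked out directly. The Python sketch below uses hypothetical names and takes 1 MByte as 10^6 bytes, which is an assumption about the units in the text; it covers only the bandwidth term and says nothing about the latency of any particular network.

```python
ROWS = 8000           # fixed row count from the run-time parameters
BYTES_PER_VALUE = 8   # REAL*8


def message_bytes(rows=ROWS, bytes_per_value=BYTES_PER_VALUE):
    """Size of one ghost-cell message: one value per row."""
    return rows * bytes_per_value


def transmission_seconds(nbytes, bandwidth_bytes_per_s):
    """Time on the wire at a given sustained bandwidth, excluding latency."""
    return nbytes / bandwidth_bytes_per_s


nbytes = message_bytes()
seconds = transmission_seconds(nbytes, 100e6)  # 100 MBytes/s, as in the text
print(nbytes, "bytes")                         # 64000 bytes
print(round(seconds * 1e3, 3), "ms")
```

At 100 MBytes/s the 64,000-byte message takes about 0.64 ms to transmit, the figure to compare against a measured latency on each platform.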
Floating point performance rates were measured externally with hpmcount on Copper
and with the psrun utility on Platinum and Titan. Internal timers report elapsed
time and usage. Corresponding rates are computed for the computational and
communications parts of the code.
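As an illustration of how such rates follow from the timers, the Python sketch below converts elapsed times into a floating point rate and a transfer rate. The function names are hypothetical, the count of 4 floating point operations per grid point (three additions and one multiplication) is an assumption about the kernel, and the timings in the example are placeholders rather than measured results.

```python
FLOPS_PER_POINT = 4  # assumption: 3 additions + 1 multiplication per update


def kernel_flop_rate(rows, cols, iterations, kernel_seconds,
                     flops_per_point=FLOPS_PER_POINT):
    """Floating point operations per second for the computational kernel."""
    return flops_per_point * rows * cols * iterations / kernel_seconds


def transfer_rate(message_bytes, messages, comm_seconds):
    """Bytes per second moved during the communication steps."""
    return message_bytes * messages / comm_seconds


# Placeholder timings, chosen only to make the arithmetic easy to follow.
print(kernel_flop_rate(8000, 8000, 500, kernel_seconds=1000.0))  # 128000000.0
print(transfer_rate(64000, messages=1000, comm_seconds=2.0))     # 32000000.0
```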
Since Laplace is a streaming memory code, we expect floating point rates and
efficiencies to be consistent with the STREAM benchmark.