The message-passing version of TPHOT was implemented on Lawrence Livermore National Laboratory's 128-processor BBN Butterfly TC2000 using the Livermore Message-Passing System (LMPS), a library of message-passing routines. Each processor of the BBN has 16 MBytes of memory that can be "shared" by all nodes via a "butterfly switch". Under LMPS, however, each node's memory is private to that node from the perspective of the application program. The code yielded identical results for the test problem run with 8 tasks on both the BBN and the Cray. Many different runs were made on the BBN, varying the number of processors from 1 to 116 and the number of particles (i.e., the workload W) from 2400 to 24,000,000.
Table 5.1 gives the simulation times for the Butterfly as a function of the number of processors N and the workload W. We have arbitrarily assigned W=1.0 to the case with approximately 240,000 particles. Blanks appear in the table for two reasons: (1) large workloads are prohibitively expensive on few processors, and (2) small workloads on a large number of processors yield chaotic timings.
Number of processors | W=0.01 time (sec) | W=0.01 S_N | W=0.1 time (sec) | W=0.1 S_N | W=1.0 time (sec) | W=1.0 S_N | W=10.0 time (sec) | W=10.0 S_N | W=100.0 time (sec) | W=100.0 S_N |
1 | 17 | -- | 144 | -- | 1407 | -- | -- | -- | -- | -- |
4 | 6 | 2.83 | 38 | 3.79 | 357 | 3.94 | -- | -- | -- | -- |
8 | 5 | 3.40 | 22 | 6.55 | 181 | 7.77 | 1769 | 7.95 | -- | -- |
9 | 5 | 3.40 | 20 | 7.20 | 161 | 8.74 | 1595 | 8.82 | -- | -- |
10 | 5 | 3.40 | 18 | 8.00 | 145 | 9.70 | 1416 | 9.94 | -- | -- |
16 | 7 | 2.43 | 13 | 11.08 | 94 | 14.97 | 888 | 15.84 | -- | -- |
32 | 15 | 1.13 | 15 | 9.60 | 54 | 26.06 | 450 | 31.27 | -- | -- |
64 | -- | -- | 31 | 4.65 | 53 | 26.55 | 251 | 56.06 | 2364 | 59.52 |
80 | -- | -- | -- | -- | -- | -- | 223 | 63.09 | 1813 | 77.61 |
100 | -- | -- | -- | -- | -- | -- | 215 | 65.44 | 1493 | 94.24 |
116 | -- | -- | -- | -- | -- | -- | 224 | 62.81 | 1366 | 103.0 |
The speedups for each case in Table 5.1 are computed using equation (5.1), with the N=1 case for each workload as the reference serial case. This is not strictly correct, because the N=1 run is not the optimal serial code. The effect is probably small, but it tends to make the speedups appear better than they should be.
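The speedup calculation described above can be sketched in a few lines. This is a minimal illustration that assumes equation (5.1) is the usual ratio of the reference single-processor time to the N-processor time. The form of the equation is not reproduced in this section, so that ratio is an assumption, and the names `times_w1` and `speedup` are hypothetical. The times are the whole-second W=1.0 entries of Table 5.1 (1407 sec on 1 processor, 357 sec on 4, and so on).

```python
# Simulation times (sec) for workload W = 1.0, keyed by number of processors N.
# Values are the W = 1.0 times from Table 5.1.
times_w1 = {1: 1407, 4: 357, 8: 181, 16: 94, 32: 54, 64: 53}


def speedup(times, n, t_ref=None):
    """Return t_ref / times[n].

    Assumed form of equation (5.1): reference serial time divided by the
    N-processor time. By default the N = 1 entry is the reference.
    """
    if t_ref is None:
        t_ref = times[1]
    return t_ref / times[n]


# Speedup for every multi-processor run, rounded to two decimals.
speedups_w1 = {n: round(speedup(times_w1, n), 2) for n in times_w1 if n != 1}

for n, s in speedups_w1.items():
    print(f"N = {n:3d}: speedup = {s:.2f}")
```

Under this assumption the computed values agree with the speedups 3.94, 7.77, 14.97, 26.06 and 26.55 reported for W=1.0 in Table 5.1.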
Workload (W) | Number of histories (Nh) | Model single-processor execution time (sec) | Observed single-processor execution time (sec) | Serial fraction (f) |
0.01 | 2347 | 17.2 | 17 | 0.19 |
0.10 | 23843 | 143.8 | 144 | 0.023 |
1.00 | 238232 | 1407 | 1407 | 0.0024 |
10.0 | 2382320 | 14070 | -- | 0.00024 |
100.0 | 23823200 | 140700 | -- | 0.000024 |
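One way to read the serial-fraction column is to convert it back into seconds. The sketch below assumes that f is the fraction of the model single-processor execution time spent in serial work. That interpretation is not stated in this section, and the names `rows` and `implied_serial_time` are hypothetical.

```python
# (serial fraction f, model single-processor time in sec) for each workload W,
# copied from the table above.
rows = {
    0.01: (0.19, 17.2),
    0.10: (0.023, 143.8),
    1.00: (0.0024, 1407.0),
    10.0: (0.00024, 14070.0),
    100.0: (0.000024, 140700.0),
}


def implied_serial_time(f, t_single):
    """Serial time in sec, assuming f is a fraction of the single-processor time."""
    return f * t_single


serial_times = {w: implied_serial_time(f, t) for w, (f, t) in rows.items()}

for w, t in serial_times.items():
    print(f"W = {w:g}: about {t:.1f} sec of serial work")
```

Under that assumption the implied serial time is nearly the same for every workload, roughly 3.3 sec, which would explain why f falls by about a factor of ten each time the workload grows tenfold.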