The message-passing version of TPHOT was implemented on Lawrence Livermore National Laboratory's 128-processor BBN Butterfly TC2000 using the Livermore Message-Passing System (LMPS), a library of message-passing routines. Each processor of the BBN has 16 MBytes of memory that can be ``shared'' by all nodes via a ``butterfly switch''. Under LMPS, however, each node's memory is private to that node from the perspective of the application program. The code yielded identical results for the test problem run with 8 tasks on both the BBN and the Cray. Many runs were made on the BBN, varying the number of processors from 1 to 116 and the number of particles (i.e., the workload W) from 2400 to 24,000,000.
Table 5.1 gives the simulation times for the Butterfly as a function of the number of processors N and the workload W. We have arbitrarily assigned W=1.0 to the case with approximately 240,000 particles. Blanks appear in the table for two reasons: (1) large workloads are prohibitively expensive on few processors, and (2) small workloads on a large number of processors yield chaotic timings.
Number of processors (N) | time (sec), W=0.01 | time (sec), W=0.1 | time (sec), W=1.0 | time (sec), W=10.0 | time (sec), W=100.0 |
1 | 17 | 144 | 1407 | - | - |
4 | 6 | 38 | 357 | - | - |
8 | 5 | 22 | 181 | 1769 | - |
9 | 5 | 20 | 161 | 1595 | - |
10 | 5 | 18 | 145 | 1416 | - |
16 | 7 | 13 | 94 | 888 | - |
32 | 15 | 15 | 54 | 450 | - |
64 | - | 31 | 53 | 251 | 2364 |
80 | - | - | - | 223 | 1813 |
100 | - | - | - | 215 | 1493 |
116 | - | - | - | 224 | 1366 |
The speedups for each case in Table 5.1 are computed using equation (5.1), with the N=1 case for each workload as the reference serial case (for ). This is not quite correct, because the parallel code run on one processor is not the optimal serial code. The effect is probably small, but it tends to make the speedups appear better than they should be.
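The speedup calculation described above can be sketched in a few lines. This is only an illustration: equation (5.1) is not reproduced in this section, so the sketch assumes it has the usual form S(N) = T(1) / T(N), and the names `TIMES` and `speedup` are hypothetical. Only the workloads with a measured N=1 time in Table 5.1 are included.

```python
# Simulation times (sec) from Table 5.1, keyed by workload W, then by processor count N.
# Entries shown as "-" in the table are omitted. Only workloads with a measured
# N=1 run are listed, since that run is the reference serial case.
TIMES = {
    0.01: {1: 17, 4: 6, 8: 5, 9: 5, 10: 5, 16: 7, 32: 15},
    0.1: {1: 144, 4: 38, 8: 22, 9: 20, 10: 18, 16: 13, 32: 15, 64: 31},
    1.0: {1: 1407, 4: 357, 8: 181, 9: 161, 10: 145, 16: 94, 32: 54, 64: 53},
}


def speedup(times_for_w, n):
    """Assumed form of equation (5.1): S(N) = T(1) / T(N)."""
    return times_for_w[1] / times_for_w[n]


# Print the speedup for every measured (W, N) pair.
for w, row in TIMES.items():
    for n in sorted(row):
        print(f"W={w:<5} N={n:<3} speedup={speedup(row, n):6.2f}")
```

Under this assumed definition, the W=1.0 case on 64 processors gives 1407/53, a speedup of roughly 26.5, while the W=0.01 case never exceeds 17/5 = 3.4.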
Workload (W) | # of histories (Nh) | model single-processor execution time (sec) | observed single-processor execution time (sec) | serial fraction (f) |
0.01 | 2347 | 17.2 | 17 | 0.19 |
0.10 | 23843 | 143.8 | 144 | 0.023 |
1.00 | 238232 | 1407 | 1407 | 0.0024 |
10.0 | 2382320 | 14070 | - | 0.00024 |
100.0 | 23823200 | 140700 | - | 0.000024 |
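One way to read the serial fraction column is sketched below. The definition of f is not given in this section, so the sketch assumes f is an Amdahl-style serial fraction, i.e. the serial part of the single-processor time divided by the whole; the names `ROWS` and `amdahl_speedup` are hypothetical. Under that assumption, f times the model single-processor time is the implied serial time, and 1/f is the limiting speedup.

```python
# (W, model single-processor time in sec, serial fraction f), copied from the table above.
ROWS = [
    (0.01, 17.2, 0.19),
    (0.10, 143.8, 0.023),
    (1.00, 1407.0, 0.0024),
    (10.0, 14070.0, 0.00024),
    (100.0, 140700.0, 0.000024),
]


def amdahl_speedup(f, n):
    """Amdahl-style speedup bound for serial fraction f on n processors.

    Assumes the only obstacle to speedup is the serial fraction; communication
    and synchronization costs are ignored.
    """
    return 1.0 / (f + (1.0 - f) / n)


for w, t1, f in ROWS:
    serial_time = f * t1  # implied serial time in seconds (assumed meaning of f)
    print(
        f"W={w:<6} implied serial time={serial_time:5.2f} s  "
        f"bound at N=64: {amdahl_speedup(f, 64):6.1f}  limit 1/f: {1.0 / f:9.1f}"
    )
```

With the tabulated values, the implied serial time comes out at roughly 3.3 s for every workload, which is consistent with f falling by a factor of ten each time the workload grows by ten. The bound is optimistic because it leaves out message-passing overhead.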