
Parallelization on the BBN Butterfly

The message-passing version of TPHOT was implemented on Lawrence Livermore National Laboratory's 128-processor BBN Butterfly TC2000 using the Livermore Message Passing System (LMPS), a library of message-passing routines. Each processor of the BBN has 16 MBytes of memory that can be "shared" by all nodes via a "butterfly switch". Under LMPS, however, each node's memory is private to that node from the perspective of the application program. The code yielded identical results for the test problem run with 8 tasks on both the BBN and the Cray. Many different runs were made on the BBN, varying the number of processors from 1 to 116 and the number of particles (i.e., the workload W) from 2400 to 24,000,000.

Table 5.1 gives the simulation times for the Butterfly as a function of the number of processors N and the workload W. We have arbitrarily assigned W=1.0 to the case with approximately 240,000 particles. Blanks appear in the table for two reasons: (1) large workloads are prohibitively expensive on few processors, and (2) small workloads on a large number of processors yield chaotic timings.

Table 5.1: Observed TPHOT Execution Times and Speedups for the BBN.
(execution times in seconds; SN is the speedup on N processors; "-" denotes a case that was not run)

 Number of |  W = 0.01   |  W = 0.1    |  W = 1.0    |  W = 10.0   |  W = 100.0
 processors| time   SN   | time   SN   | time   SN   | time   SN   | time    SN
 ----------|-------------|-------------|-------------|-------------|--------------
      1    |   17    -   |  144    -   | 1407    -   |    -    -   |    -     -
      4    |    6  2.83  |   38  3.79  |  357  3.94  |    -    -   |    -     -
      8    |    5  3.40  |   22  6.55  |  181  7.77  | 1769  7.95  |    -     -
      9    |    5  3.40  |   20  7.20  |  161  8.74  | 1595  8.82  |    -     -
     10    |    5  3.40  |   18  8.00  |  145  9.70  | 1416  9.94  |    -     -
     16    |    7  2.43  |   13 11.08  |   94 14.97  |  888 15.84  |    -     -
     32    |   15  1.13  |   15  9.60  |   54 26.06  |  450 31.27  |    -     -
     64    |    -    -   |   31  4.65  |   53 26.55  |  251 56.06  | 2364  59.52
     80    |    -    -   |    -    -   |    -    -   |  223 63.09  | 1813  77.61
    100    |    -    -   |    -    -   |    -    -   |  215 65.44  | 1493  94.24
    116    |    -    -   |    -    -   |    -    -   |  224 62.81  | 1366 103.0

The speedups for each case in Table 5.1 are computed using equation (5.1), with the N=1 run for each workload serving as the serial reference case (for $\tau_1$). This is not strictly correct, since the single-task message-passing code is not the optimal serial code. The effect is probably small, but it tends to make the speedups appear somewhat better than they actually are.
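As a concrete check, the speedup column can be reproduced directly from the timing column. The sketch below is a minimal Python illustration (not part of the original TPHOT code); the timing values are copied from the W = 1.0 column of Table 5.1, and it computes $S_N = \tau_1 / \tau_N$ as in equation (5.1):

```python
# Minimal sketch of the speedup computation S_N = tau_1 / tau_N
# (equation (5.1)), using the N = 1 run as the serial reference.
# The timings below are data copied from Table 5.1 for W = 1.0.

tau_1 = 1407.0  # observed single-processor time (sec) for W = 1.0

# tau_N for selected processor counts N, from Table 5.1
timings = {4: 357.0, 8: 181.0, 16: 94.0, 32: 54.0, 64: 53.0}

for n, tau_n in sorted(timings.items()):
    speedup = tau_1 / tau_n
    print(f"N = {n:2d}: S_N = {speedup:5.2f}")
# e.g. N = 4 gives S_N = 3.94, matching the table entry
```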


Table 5.2: Parameters of BBN Linear Model.

 Workload | # of histories | model single-processor        | observed single-processor     | serial
   (W)    |     (Nh)       | execution time $\tau_1$ (sec) | execution time $\tau_1$ (sec) | fraction (f)
 ---------|----------------|-------------------------------|-------------------------------|-------------
    0.01  |        2347    |      17.2                     |      17                       |  0.19
    0.10  |       23843    |     143.8                     |     144                       |  0.023
    1.00  |      238232    |    1407                       |    1407                       |  0.0024
   10.0   |     2382320    |   14070                       |       -                       |  0.00024
  100.0   |    23823200    |  140700                       |       -                       |  0.000024
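The serial fractions in Table 5.2 are consistent with a linear single-processor model in which a fixed serial time $\tau_s$ is added to a per-history cost: $\tau_1(N_h) \approx \tau_s + N_h \tau_h$, with $f = \tau_s / \tau_1$. The sketch below illustrates this; the constants $\tau_s \approx 3.38$ s and $\tau_h \approx 0.005893$ s are back-fitted here from the table entries and are assumptions, not values quoted in the text:

```python
# Sketch of the linear single-processor model implied by Table 5.2:
#     tau_1(Nh) ~= tau_s + Nh * tau_h,   f = tau_s / tau_1
# The constants below are back-fitted from the table (assumptions,
# not values stated in the text).

tau_s = 3.38      # fixed serial time, sec (assumed fit)
tau_h = 0.005893  # time per photon history, sec (assumed fit)

def model_time(n_histories):
    """Predicted single-processor execution time (sec)."""
    return tau_s + n_histories * tau_h

def serial_fraction(n_histories):
    """Serial fraction f for a given number of histories."""
    return tau_s / model_time(n_histories)

for nh in (2347, 23843, 238232, 2382320, 23823200):
    print(f"Nh = {nh:9d}: tau_1 = {model_time(nh):10.1f} s, "
          f"f = {serial_fraction(nh):.2g}")
```

Note how f falls by roughly a factor of ten for each tenfold increase in workload: the serial time stays fixed while the parallelizable per-history work grows.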


Amitava Majumdar
9/20/1999