Parallelization on the BBN Butterfly

The message-passing version of TPHOT was implemented on Lawrence Livermore National Laboratory's 128-processor BBN Butterfly TC2000 using the Livermore Message-Passing System (LMPS), a library of message-passing routines. Each processor of the BBN has 16 MBytes of memory that can be ``shared'' by all nodes via a ``butterfly switch''. Under LMPS, however, each node's memory is private to that node from the perspective of the application program. The code yielded identical results for the test problem run with 8 tasks on both the BBN and the Cray. Many runs were made on the BBN, varying the number of processors from 1 to 116 and the number of particles (i.e., the workload W) from 2400 to 24,000,000.

Table 5.1 gives the simulation times for the Butterfly as a function of the number of processors N and the workload W. We have arbitrarily assigned W=1.0 to the case with approximately 240,000 particles. Missing entries (shown as dashes) appear in the table for two reasons: (1) large workloads are prohibitively expensive on few processors, and (2) small workloads on a large number of processors yield chaotic timings.

**Table:** Observed TPHOT Execution Times and Speedups for BBN.

| Number of processors (N) | W=0.01 time (sec) | W=0.01 $S_N$ | W=0.1 time (sec) | W=0.1 $S_N$ | W=1.0 time (sec) | W=1.0 $S_N$ | W=10.0 time (sec) | W=10.0 $S_N$ | W=100.0 time (sec) | W=100.0 $S_N$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 17 | - | 144 | - | 1407 | - | - | - | - | - |
| 4 | 6 | 2.83 | 38 | 3.79 | 357 | 3.94 | - | - | - | - |
| 8 | 5 | 3.40 | 22 | 6.55 | 181 | 7.77 | 1769 | 7.95 | - | - |
| 9 | 5 | 3.40 | 20 | 7.20 | 161 | 8.74 | 1595 | 8.82 | - | - |
| 10 | 5 | 3.40 | 18 | 8.00 | 145 | 9.70 | 1416 | 9.94 | - | - |
| 16 | 7 | 2.43 | 13 | 11.08 | 94 | 14.97 | 888 | 15.84 | - | - |
| 32 | 15 | 1.13 | 15 | 9.60 | 54 | 26.06 | 450 | 31.27 | - | - |
| 64 | - | - | 31 | 4.65 | 53 | 26.55 | 251 | 56.06 | 2364 | 59.52 |
| 80 | - | - | - | - | - | - | 223 | 63.09 | 1813 | 77.61 |
| 100 | - | - | - | - | - | - | 215 | 65.44 | 1493 | 94.24 |
| 116 | - | - | - | - | - | - | 224 | 62.81 | 1366 | 103.0 |

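The speedups for each case in Table 5.1 are computed using equation (5.1), with the N=1 case for each workload serving as the reference serial case (for $\tau_1$ ). For W=10.0 and W=100.0, where no N=1 run was made, the tabulated speedups are consistent with the model values of $\tau_1$ (14070 sec and 140700 sec) listed in the table of linear-model parameters below. Using the N=1 run as the reference is not quite correct, because the parallel code run on one processor is not the optimal serial code. This is probably not a large effect, but it will tend to make the speedups appear better than they should be.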
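As a check on how the speedup columns are derived, the following minimal sketch recomputes $S_N$ for the W=1.0 column of Table 5.1. It assumes equation (5.1) has the usual form $S_N = \tau_1 / \tau_N$, which is not restated in this section but reproduces the tabulated values. The `efficiency` helper ($S_N / N$) is not part of the original table and is included only for illustration.

```python
# Observed times (sec) for the W = 1.0 column of Table 5.1, keyed by N.
times_w1 = {1: 1407, 4: 357, 8: 181, 9: 161, 10: 145, 16: 94, 32: 54, 64: 53}


def speedup(tau_1, tau_n):
    """Assumed form of equation (5.1): S_N = tau_1 / tau_N."""
    return tau_1 / tau_n


def efficiency(tau_1, tau_n, n):
    """Parallel efficiency S_N / N (illustrative; not reported in the table)."""
    return speedup(tau_1, tau_n) / n


tau_1 = times_w1[1]  # N = 1 run used as the reference serial case
rows = [
    (n, round(speedup(tau_1, t), 2), round(efficiency(tau_1, t, n), 2))
    for n, t in sorted(times_w1.items())
]

for n, s, e in rows:
    print(f"N={n:3d}  S_N={s:6.2f}  efficiency={e:4.2f}")
```

For this workload the speedup stops improving between N=32 and N=64 (54 sec versus 53 sec), which is the kind of saturation the larger workloads push out to higher processor counts.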

**Table:** Parameters of BBN Linear Model.

| Workload (W) | Number of histories ($N_h$) | Model single-processor execution time $\tau_1$ (sec) | Observed single-processor execution time $\tau_1$ (sec) | Serial fraction (f) |
|---|---|---|---|---|
| 0.01 | 2347 | 17.2 | 17 | 0.19 |
| 0.10 | 23843 | 143.8 | 144 | 0.023 |
| 1.00 | 238232 | 1407 | 1407 | 0.0024 |
| 10.0 | 2382320 | 14070 | - | 0.00024 |
| 100.0 | 23823200 | 140700 | - | 0.000024 |
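
One way to read the serial-fraction column is sketched below. If f is interpreted as the fraction of the single-processor time $\tau_1$ spent in non-parallelizable work (an assumption; the definition of f is not restated in this section), then the product $f \tau_1$ gives an implied serial time for each workload. The variable names are illustrative only.

```python
# (workload W, model tau_1 in sec, serial fraction f) from the table above.
model_rows = [
    (0.01, 17.2, 0.19),
    (0.10, 143.8, 0.023),
    (1.00, 1407.0, 0.0024),
    (10.0, 14070.0, 0.00024),
    (100.0, 140700.0, 0.000024),
]

# Implied serial time, ASSUMING f is the serial share of tau_1.
serial_times = [f * tau_1 for _, tau_1, f in model_rows]

for (w, tau_1, f), t_s in zip(model_rows, serial_times):
    print(f"W={w:6.2f}  tau_1={tau_1:9.1f} sec  f={f:.6f}  f*tau_1={t_s:.2f} sec")
```

Under that reading, the implied serial time stays between roughly 3.2 and 3.4 sec across all five workloads, so the falling serial fraction reflects a nearly fixed serial cost divided by a growing total run time.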