UNIVERSITY OF CALIFORNIA, SAN DIEGOA Performance Model and Load Balancer for a ParallelMonte-Carlo Cellular Microphysiology SimulatorA thesis submitted in partial satisfaction of therequirements for the degree Master of Sciencesin Computer SciencebyUrvashi Rao VenkataCommittee in charge:Professor Scott B. Baden, ChairProfessor Keith MarzulloProfessor Henri Casanova2004

The thesis of Urvashi Rao Venkata is approved.ChairUniversity of California, San Diego2004iii

TABLE OF CONTENTSSignature Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .ivList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiIIntroduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1IIMCell and MCell-K . . . . . . . . .A. MCell . . . . . . . . . . . . . . .1. Model Description Language2. Simulation Characteristics . .3. Implementation . . . . . . .4. Running Times . . . . . . . .B. MCell-K . . . . . . . . . . . . .1. Parallelization Strategy . . .2. Results . . . . . . . . . . . .444567889III Performance Model . . . . . . . . . . .A. MCell workload statistics . . . . . .1. Workload Measure . . . . . . . .2. Observations . . . . . . . . . . .3. Analysis . . . . . . . . . . . . .B. Instantaneous model . . . . . . . . .C. Predictive model . . . . . . . . . . .1. Temporal spatial load variations2. Overall running time trends . . .3. MDL input . . . . . . . . . . . .12131316171821222224IV Load Balancing . . . . . . . . . . . . .A. Domain decomposition . . . . . . .1. Recursive Co-ordinate Bisection2. Hilbert Space Filling Curve . . .3. Optimality . . . . . . . . . . . .B. Adaptive Load Balancing . . . . . . 26. 26. 26. 28. 29. 31V.MCell Parallel Simulator . . . . . . . . . . . . . . . . . . . . . . . . . .A. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B. Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .iv343538

VI Simulations and Results . . . . . . . . . . . .A. Test Case : Simple Unbounded Diffusion1. Performance model . . . . . . . . . .2. Load Balancing . . . . . . . . . . . .B. Test Case : Chick Ciliary Ganglion . . . .1. Performance Model . . . . . . . . . .2. Load Balancing . . . . . . . . . . . . 41. 41. 42. 42. 49. 49. 50VII Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59v

LIST OF FIGURESII.1 3D Visualization of the Chick Ciliary Ganglion MCell simulation . .II.2 Dynamic Load Imbalance in MCell-K . . . . . . . . . . . . . . . . 141515151721222324IV.1 Schematic diagram for Recursive Co-ordinate Bisection . . . . . . .IV.2 Schematic diagram for Hilbert Space Filling Curve Partitioning. . .2728V.1 Validation of the parallel simulator . . . . . . . . . . . . . . . . . .V.2 Validation of the load balancer . . . . . . . . . . . . . . . . . . . . .3839VI.1 Unbounded Diffusion : Total Running time and Workload estimateVI.2 Unbounded Diffusion - SFC load balancing . . . . . . . . . . . . . .VI.3 Load Balancing Communication Counts - Unbounded Diffusion . .VI.4 Estimate for Unbounded Diffusion with Periodic Ligand Releases .VI.5 Times for Unbounded Diffusion with Periodic Ligand Releases . . .VI.6 Chick Ciliary Ganglion : Running Time and Estimated Time . . . .VI.7 Chick Ciliary Ganglion - SFC Load Balancing . . . . . . . . . . . .VI.8 Load Balancing Communication Counts - Chick Ciliary Ganglion .VI.9 3D Visualization of Irregular Distribution of Ligands . . . . . . . .VI.10Load Balanced Parallel Running times on 8,16,32,64 processors . . .42434647485052545556MCell core computation loop . . . . . . . . . . . . . . . . . . .Counts and Times for all ligands and free ligands . . . . . . . .Number of SSV walls checked and hit and corresponding timesNumber of Mesh walls checked and hit and corresponding timesTotal, Ray Trace, Wall Hit and Estimated Running times . . . .Total, Ray Trace, Wall Hit and Estimated Running times . . . .Unbounded Diffusion - Gaussian Ligand Density Profiles . . . .Processor workload variation for Unbounded Diffusion . . . . . .Chick Ciliary Ganglion : Ligand Count and Running time . . .vi.

LIST OF TABLESIII.1 Measured Operation counts and timings. . . . . . . . . . . . . .14VI.1 Unbounded Diffusion - Parallel Running times and speedups . . . . 44VI.2 Chick Ciliary Ganglion - Parallel Running times . . . . . . . . . . . 51VI.3 Running times and speedups on 8,16,32 and 64 simulated processors. 56vii

ABSTRACT OF THE THESISA Performance Model and Load Balancer for a ParallelMonte-Carlo Cellular Microphysiology SimulatorbyUrvashi Rao VenkataMaster of Sciences in Computer ScienceUniversity of California, San Diego, 2004Professor Scott B. Baden, ChairThis thesis presents a performance model for a parallel cellular microphysiology simulator and discusses dynamic load balancing techniques that employ it.These simulations are characterized by highly irregular spatial workload concentrations that are determined by the initial data configurations and their timedependent evolution. A performance model giving reliable spatial and temporalworkload estimates is useful in facilitating the adaptive distribution of workloadacross processors during the course of the computation.A hybrid performance model was derived and then tuned to accuratelyestimate and predict the characteristics of each simulation. One component ofthe performance model is the classical instantaneous model that estimates theworkload based on retrospective information about the state of the system. Theother is an evolutionary model that augments the instantaneous model by usinginformation about the initial state of the system and the mechanisms that driveit to predict performance trends before they occur. Workload estimates obtainedwith the hybrid model were used to drive a partitioner that adaptively partitionedthe workload and improved the parallel runtime performance by a factor of two on16 processors.viii

Chapter IIntroductionA Monte-Carlo cellular microphysiology simulation is typically characterized by a non-uniform spatial workload distribution. As a result, a parallelimplementation using a regular domain decomposition is prone to a severe loadimbalance. This leads to a parallel running time that is considerably high compared to an implementation in which all processors are equally loaded. An irregulardomain decomposition that partitions regions of higher workload more finely thanothers, is therefore required. In addition, the workload distribution across thecomputational domain can vary with time, leading to the need for dynamic loadbalancing techniques that respond to changes or shifts in workload concentrationsby adaptively partitioning the domain. Such load balancing techniques requireworkload estimates obtained from a reliable performance model.Performance modeling consists of estimating the running time of an application based on parameters and conditions that drive the computation. Oneinstance where a performance model is useful is efficient run-time load balancingof a parallel application in which varied data dependent behaviour can result indynamic and irregular workload concentrations. The design of an accurate performance model becomes challenging when there are a variety of data dependentconditions that evolve dynamically through the course of the computation. The direct measurement of the running time per iteration may not always provide reliable1

and stable estimates, given the overhead of performing fine grained timing, limited timer resolution and inherently noisy running time variations. A performancemodel based on operation counts and simulation variables is therefore desired.MCell-K1 is a parallel simulator used to study sub-cellular communication via signaling pathways and neurotransmitters. It is based on MCell2 , a serialapplication developed at the Salk Institute for Biological Studies and the Pittsburgh Supercomputer Center. Spatial workload concentrations in a typical MCellrun are highly irregular and determined by the initial data configurations andtheir time-dependent evolution. The use of a static regular domain decompositioncan result in the maximally loaded processor holding more than twice the averageworkload. A performance model giving reliable spatial and temporal workload estimates would be useful in adaptively distributing the workload across processorsduring the course of the computation. MCell-K simulations are described by anelaborate model description and a single performance model is unlikely to applyunder all circumstances. However a good model can enable a computation to scalewell, thereby allowing simulations of large proportions to be run in a reasonableamount of time.This thesis proposes a hybrid performance model for MCell-K, and discusses dynamic load balancing techniques that employ it. Two kinds of cost modelsare explored. One is an empirically derived short term instantaneous model thatestimates the workload based on retrospective information about the state of thesystem [3, 9, 13, 14, 5]. Some MCell-K events can induce a sudden load imbalanceand a purely instantaneous model cannot be used to predict and avoid the resultingimbalance. There is therefore the need for a predictive component that estimatesthe workload in advance, given information about the initial state of the systemand the mechanisms that drive it. The hybrid performance model is validated byits ability to generate reliable and accurate workload estimates for dynamic loadbalancing algorithms that are based on adaptive irregular domain decompositions.12 tilman/index.php?file home.html

The performance models and load balancing algorithms were designedand validated using the serial MCell code, instrumented to simulate a parallel run.Simulations were performed on sample problems with characteristics typical of astandard MCell run. Load balancing was tested using two domain decomposition techniques. A recursive co-ordinate bisection algorithm was compared to adecomposition based on a Hilbert space filling curve. The load balancing testsrevealed that both domain decomposition algorithms give similar results in termsof workload distribution but that the space filling curve algorithm incurred lessload balancing computation and communication costs. The use of these adaptiveload balancing techniques improved the performance of the simulated parallel runby a factor of two on 16 processors. The varied nature of MCell-K simulationsare specified by a complex model description language and warrant a method ofcreating custom performance models. An analysis of the speedups obtained on8,16,32 and 64 processors showed that with an accurate workload estimate, theseload balancing algorithms lead to scalable computations.The organization of the material in this thesis is as follows. Chapter IIgives an overview of the design of MCell and MCell-K with an emphasis on factorsthat can affect runtime performance. Chapter III describes the instantaneousand predictive performance models for MCell-K. Chapter IV describes dynamicload balancing algorithms based on these models. Chapter V deals with detailsconcerning the implementation and validation of the parallel simulator used todesign the models. Chapter VI discusses the accuracy, stability and efficiency ofthe performance model and load balancing techniques for a set of sample MCell-Kruns. Chapter VII concludes with a summary of the results and a short descriptionof how the production MCell-K code will incorporate the load balancer into itsstructure.

Chapter IIMCell and MCell-KII.AMCellMCell [2] is a cellular microphysiology simulator that uses Monte Carlomethods to trace the movement and behaviour of a large number of reactivemolecules called ligands as they diffuse through space in a three dimensional computational domain. These ligands can undergo unimolecular or bimolecular statetransitions as they interact with reactive elements called effector sites, located onsurfaces and membranes that represent sub-cellular biophysical structures 1 . MCelluses a Model Description Language (MDL) as a high level user interface and linkbetween the steps of model design, simulation and output of results.II.A.1Model Description LanguageMCell’s Model Description Language (MDL) is an extremely versatileway of representing the properties of diffusing molecules and arbitrary surfacestructures along with diffusion and reaction mechanisms typical of any desiredphysical system which involves particle diffusion. Parameterized descriptions of themicro-physiological environment along with diffusion and reaction mechanisms areused to drive an MCell simulation. Types of ligand molecules and their properties1Currently ligands interact only with surfaces and not with each other. A future version of MCellwill incorporate bi-ligand interactions.4

are specified along with the locations of ligand release sites and correspondingrelease mechanisms. Complex reactive surfaces and membranes are representedusing polygonal meshes. Mesh elements are provided with reactive (effector sites),reflective or absorptive properties and reaction mechanisms between the ligandmolecules and reactive mesh elements are explicitly defined.II.A.2Simulation CharacteristicsA typical MCell simulation begins with the release of numerous ligandsfrom release sites according to specified release mechanisms. Computations areorganized into iteration timesteps. Within an iteration timestep, each ligand performs a Brownian-dynamics random walk. A maximum diffusion distance specifiedin the MDL input controls the amount each ligand can move during the courseof one iteration timestep. For each ligand the direction and distance to be movedare determined probabilistically, and a three dimensional ray trace operation isperformed to compute the new location. In each iteration, ligands take severalsteps until they reach their maximum allowed diffusion distance. During each stepthe ray trace operation includes checks for intersections with all walls and meshelements in the immediate vicinity of the ligand. If a mesh wall is intersected, theligand may undergo a state change as dictated by the type of mesh element it hasencountered. As iterations proceed, ligands diffuse outward from their release sitesand can reflect off mesh walls, react with effector sites present on them, be absorbed by binding sites located on mesh walls, or interact in any other way definedin the MDL input. Ligand binding sites maintain ligand saturation counts andmay undergo variations in reactive properties based on the number of bound ligands associated with them. Output can be obtained from MCell in various forms.Counts and statistics can be obtained with respect to any class of participatingelements of the simulation. Output can also be generated in a format compatiblewith the DReAMM2 three dimensional visualization tool (Figure II.1.A).2

Figure II.1: (A (left)) Three dimensional visualization of ligands diffusing outfrom release sites in the chick ciliary ganglion MCell simulation (T. Bartol and T.Sejnowski) (B (right))Spatial subvolumes in MCell with planes along the X,Y andZ axes along with a spherical polygon mesh surface(J.Stiles and T. Bartol).II.A.3ImplementationThe three dimensional computational domain is divided into a large num-ber of spatial sub-volumes (SSVs) formed by the tensor product of a set of partitions along each dimension (Figure II.1.B). This limits the computation requiredto track individual ligands by requiring that a ligand check for intersections onlywith the mesh walls physically located within the SSV enclosing the ligand. During the ray trace operation, if a ligand encounters a subvolume wall, its subvolumeidentifier is updated and the ligand takes its next step 3 .Ligands are represented as data structures holding state and locationinformation about a ligand. There is a global linked list of ligand structures thatis traversed through during each iteration timestep and ligand movements andreactions are recorded as state changes. For the purpose of statistical accuracythere is no correlation between the ordering of ligands in this list and the location3In the current implementation the SSVs are represented on a cartesian grid, whereas the mesh wallsand surfaces are based on a triangulated three dimensional mesh. An alternate grid representationis being implemented in which both the SSVs and the mesh walls will be represented using a threedimensional tetrahedral mesh. This will allow some mesh tiles and SSV walls to coincide, thus reducingthe number of intersections to be checked for during ray tracing.

of the ligands in the computational domain. Surface mesh walls are also representedas linked lists of mesh elements holding information of their location, orientation,reactive properties and saturation counts. During each ray trace step along aligand’s trajectory, the part of the mesh wall linked list representing mesh elementscontained in the current SSV is traversed to check for intersections.The radial direction and step length for the movements of each ligand arecontrolled by a set of random numbers that correspond to Monte Carlo probabilitiesderived from bulk solution rate constants of a Brownian-dynamics random walkalgorithm. MCell uses a 64-bit random number generator and splits each numberto obtain random numbers for the direction and length of a ligands step. MCell alsoallows for checkpointing wherein the state of the system at a particular iterationcan be saved in a log file for the current state of the simulation to be examined. Thesimulation can then resume along with additional of changed MDL information.II.A.4Running TimesThe main determinants of the running time of a typical serial MCell runare the number of release sites, ligands and mesh walls in the computational domainand the number of iterations required for the simulation to run to completion. Aserial MCell run with 20 release sites and 100000 ligands, simulated with the microstructure of a chick ciliary ganglion takes approximately 2 hours to complete 3000iterations on a single 2.2GHz processor, and uses approximately 500MB of memory.In a typical run the number of free ligands reduces considerably after about 2500iteration timesteps as they bind to mesh walls and stop diffusing. A full productionrun involving all 550 release sites would run for over 2 days. These runs usuallyhave to be repeated several times to ensure statistical accuracy of the results, andthe total running time required soon becomes prohibitive.

II.BMCell-KMCell simulations possess complex spatial and temporal dynamics thatcan potentially limit scalability in a parallel environment. Also due to the dynamicnature of migrating particles and the requirement of numerical and physical accuracy in the simulation, MCell is not embarrassingly parallel and requires carefulsynchronization. A parallel variant of MCell, called MCell-K[1], has been implemented using the KeLP parallel programming infrastructure. KeLP enables themanagement of distributed pointer-based structures, has mechanisms to handledata migration among processors, and can facilitate load balancing via overdecomposition. These parallel programming paradigms make KeLP a good choicefor implementing a parallel variant of MCell.II.B.1Parallelization StrategyA regular domain decomposition is used to partition the computationaldomain in a one-to-one mapping of partitions to processors. Initially, all processorsread the input file and create local copies of geometry and other simulation datapertaining to their local regions of space in the computational domain. Meshelements that cross processor boundaries are duplicated on multiple processors.Each processor maintains local linked lists for ligand structures and mesh elements.The ligands in the local processor lists are arbitrarily ordered with respect to theirphysical location within the processor’s computational domain.Processor boundaries are introduced into the MCell computational domain as walls of a particular type, and regular MCell ray trace operations areused to detect when a ligand steps across a processor boundary. Each iterationtimestep is split into several sub-iteration timesteps. Ligands whose movementsdo not result in intersections with processor walls are completely updated in thefirst sub-iteration timestep. Ligands that record intersections with processor walls,are transfered to a communication list. At the end of each sub-iteration step, if

any ligand that has not yet been completely updated has registered a hit with aprocessor boundary, there is a synchronization step where all ligands flagged forcommunication are sent to their destination processors. An iteration terminatesonly when all ligands in the entire computational domain have been updated.II.B.2ResultsSpeedupsMCell-K was tested with sample MDL input data and produced numericaland statistical results in agreement with those produced by the serial code [1].A large MCell simulation of ligand diffusion through the chick ciliary ganglionmircostructure was performed with 192 release sites and 5000 ligands per releasesite on the NPACI Blue Horizon [1]. Under typical conditions the ligands decayas iterations proceed, but this run was modified to simulate the case of all ligandsremaining active in the system at each timestep. The simulation using MCell-Kwas run for 2500 iteration timesteps and took 81.7 minutes to complete on 16processors and 44.2 minutes on 32 processors. (A serial run of this size wouldtake approximately 6 hours on a 2.2GHz processor.) Most ligands were completelyupdated in the first sub-iteration timestep and the additional time required forsub-iteration synchronization was found to be negligible compared to the totalrunning time.Communication OverheadCommunication events in MCell-K are frequent but involve message sizesof only a few kilobytes. Communication costs per iteration were found to be negligible in comparison to the running time per iteration. The maximum number ofligands (over all processors) communicated per timestep was found to be inverselyproportional to the number of processors used. This, along with the small messagesizes involved, suggests that communication costs may not be significant even inlarge runs with a large number of processors.

Ligands per ProcessorMaximum and Average Ligands per Processorheavily loaded processors - persistent ligand casepersistent ligand case3000020000100000051015epoch (100 time-steps)2016 processors, maximum16 processors, average32 processors, maximum32 processors, average64 processors, maximum64 processors, average80000processor 17processor 20processor 25processor 27processor 33ligands per processorligands per processor40000256000040000200000051015epoch (100 time-steps)2025Figure II.2: Dynamic Load Imbalance in MCell-K (A (left)) Ligands per processorfor each of the most heavily loaded processors of a 64 processor MCell-K run(Gregory T. Balls [1]). (B (right)) Maximum and average ligand counts over allprocessor for 16,32 and 64 processors (Gregory T. Balls [1]). Time is measured inepochs, each epoch corresponding to 100 iterations.Load ImbalanceLoad imbalance was found to be a major bottleneck in the system. In arun of the chick ciliary ganglion simulation the initial load imbalance was limitedby initializing the simulation such that the distribution of ligand release sites overprocessors was within a factor of two of the average number of release sites perprocessor.Figure II.2.A shows the number of ligands per processor for the most heavily loaded processors as they vary as a function of iteration number. (Reportedtimes are averaged over epochs, each corresponding to 100 iteration timesteps).Ligands were observed to rapidly diffuse across processor boundaries resulting ina high load imbalance that persisted throughout the rest of the simulation. Plotsof the maximum load over all processors for each iteration timestep also showedthat the total running time is affected by a load imbalance with the maximallyloaded processor carrying approximately twice the average (ideal) load in the system (Figure II.2.B).

It can therefore be seen that the dynamic nature of the diffusing liganddistributions and the static but irregular concentrations of mesh elements resultsin a considerable load imbalance in a parallel implementation that uses a staticregular domain decomposition. This necessitates the use of irregular domain decompositions that divide some regions of high workload more finely than others,as well as dynamic load balancing strategies that can detect changes or shifts inworkload concentration and then perform an adaptive domain decomposition andworkload redistribution.

Chapter IIIPerformance ModelThe first step in analyzing the performance of an application for the purpose of designing a performance model, is to identify the operations that dominatethe running time. These operation counts can then be used to obtain an estimateof the workload and predict running times. The running time of a parallel application is determined by the time taken by the most heavily loaded processor. Inapplications like MCell-K, since the work is divided among processors by partitioning the discrete three dimensional computational domain into subvolumes ofspace, the running time of the parallel simulation will depend on spatially dependent counts such as free ligand density and the number of mesh walls that ligandscan intersect with, in each partition subvolume. An instantaneous model can usethe operation counts of one iteration to predict the spatial workload distributionfor the next iteration. This model can detect an increase in running time due toa load imbalance in the system only after it occurs, but it can be combined witha predictive model in which workload estimates that could indicate the need forload balancing, may be obtained prior to the actual computation that could causethe load imbalance.This chapter describes the main computational components of a typicalMCell run and discusses the formulation of instantaneous and supporting predictivemodels for a parallel MCell-K run.12

III.AMCell workload statisticsThe workload in a typical MCell run depends on the number and densityof diffusing ligands and their proximity to walls and effector sites. An irregular distribution of effector sites on (irregular) mesh surfaces results in irregular workloadconcentrations. An irregular distribution of release sites results in highly concentrated workloads in some regions in the computational domain. This is especiallysignificant during the first few hundred iterations which represent a large fractionof MCell’s total running time (Figure III.2.B).III.A.1Workload MeasureTo obtain a workload estimate, various counts and timing statistics arecollected over a spatial grid of Workload Sample Boxes (WSB) in the computationaldomain. MCell uses the SSV grid to speed up searches for encounters with meshsurfaces. For the purpose of load balancing the WSB grid used is a coarsenedversion of the MCell SSV grid and load redistribution is carried out in terms ofunits of work determined by workload estimates per WSB.A practical choice of the resolution of the WSB grid depends on theaverage diffusion rate of the ligands as specified in the MDL input. To ensure thenumerical accuracy of the simulation and to minimize the movement of ligandsacross processor boundaries within an iteration, the distance along each edge of asample box must be greater than the maximum distance a ligand can diffuse duringone iteration timestep. The workload sample boxes chosen for the experimentsdescribed in this thesis each enclose 8 8 8 SSVs.The overhead of collecting these count statistics is low. For each ligand, alinear index mapping between the SSV index and the corresponding WSB index isused to accumulate counts for each sample box. The performance model is basedon various counts and timings obtained by treating each sample box as a unit ofwork.

1. for (each timestep) {2.for (each ligand) {3.4.if isbound (ligand) give it a chance to unbind5.6.if isfree (ligand) { ray trace (ligand) {8.while (ligand has not diffused far enough) {9.move the ligand10.check for intersection with nearest SSV walls11.check for intersection with SSV’s mesh wall list12.}13.if (wall hit) {14.process subvol wall hit15.process mesh wall hit16.}17.}18.}19.}20. }Figure III.1: Code structure of the core MCell computation loop.The core of the MCell computation consists of tracking each ligand in thesystem for a series of iteration timesteps. The processing of each ligand depends onits state (bound or free) and its proximity to mesh walls and effector sites (collisionsand reactions). To identify the operations that dominate the running time, variousoperation counts were monitored and timings obtained for corresponding codesegments. The code structure

A Performance Model and Load Balancer for a Parallel Monte-Carlo Cellular Microphysiology Simulator A thesis submitted in partial satisfaction of the requirements for the degree Master of Sciences in Computer Science by Urvashi Rao Venkata Committee in charge: Professor Scott B. Baden, Chair Professor Keith Marzullo Professor Henri Casanova 2004