The Montage Architecture for Grid-Enabled Science Processing of Large, Distributed Datasets

Joseph C. Jacob, Daniel S. Katz, and Thomas Prince
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Drive, Pasadena, CA 91109-8099

G. Bruce Berriman, John C. Good, and Anastasia C. Laity
Infrared Processing and Analysis Center, California Institute of Technology
770 South Wilson Avenue, Pasadena, CA 91125

Ewa Deelman, Gurmeet Singh, and Mei-Hui Su
Information Sciences Institute, University of Southern California
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292

Abstract – Montage is an Earth Science Technology Office (ESTO) Computational Technologies (CT) Round III Grand Challenge project that will deploy a portable, compute-intensive, custom astronomical image mosaicking service for the National Virtual Observatory (NVO). Although Montage is developing a compute- and data-intensive service for the astronomy community, we are also helping to address a problem that spans both Earth and space science: how to efficiently access and process multi-terabyte, distributed datasets. In both communities, the datasets are massive, and are stored in distributed archives that are, in most cases, remote with respect to the available computational resources. Therefore, use of state-of-the-art computational grid technologies is a key element of the Montage portal architecture. This paper describes the aspects of the Montage design that are applicable to both the Earth and space science communities.

I. INTRODUCTION

Montage is an effort to deploy a portable, compute-intensive, custom image mosaicking service for the astronomy community [1, 2, 3]. The Earth and space science communities each are faced with their own unique challenges, but they also share a number of technical requirements and can mutually benefit by tracking some of the information technology developments and lessons learned from both communities. Although Montage is developing a compute- and data-intensive service for the astronomy community, we are also helping to address a problem that spans both Earth and space science: how to efficiently access and process multi-terabyte, distributed datasets. Both communities have recognized the necessity of image re-projection and mosaicking as tools for visualizing medium- and large-scale phenomena and for enabling multi-wavelength science.

Like Earth science datasets, sky survey data are stored in distributed archives that are, in most cases, remote with respect to the available computational resources. Therefore, state-of-the-art computational grid technologies are a key element of the Montage portal architecture. The Montage project is contracted to deploy a science-grade custom mosaic service on the TeraGrid. TeraGrid is a distributed infrastructure, sponsored by the National Science Foundation (NSF), that is expected to deliver 20 teraflops of performance, with 1 petabyte of data storage and 40 gigabits per second of network connectivity between the multiple sites. A second project at JPL also plans to use Montage to construct large-scale mosaics, in this case on the Information Power Grid (IPG), NASA's computational grid infrastructure.

Astronomical images are almost universally stored in Flexible Image Transport System (FITS) format.
The FITS format encapsulates the image data with a metadata header containing keyword-value pairs that, at a minimum, describe the image dimensions and how the pixels map to the sky. Montage uses FITS for both the input and output data format. The World Coordinate System (WCS) specifies image coordinate to sky coordinate transformations for a number of different coordinate systems and projections useful in astronomy (some directly analogous to projections popular in the Earth science community).

Two Spitzer Space Telescope Legacy Program teams, GLIMPSE and SWIRE, are actively using Montage to generate science image products, and to support data simulation and quality assurance. Montage is designed to be applicable to a wide range of astronomical data, but is explicitly contracted to support mosaics constructed from images captured by three prominent sky surveys spanning multiple wavelengths: the Two Micron All Sky Survey (2MASS), the Digitized Palomar Observatory Sky Survey (DPOSS), and the Sloan Digital Sky Survey (SDSS). 2MASS has roughly 10 terabytes of images and catalogs, covering nearly the entire sky at 1-arc-second sampling in three near-infrared wavelengths. DPOSS has roughly 3 terabytes of images, covering nearly the entire northern sky in one near-infrared wavelength and two visible wavelengths. The SDSS second data release (DR2) contains roughly 6 terabytes of images and catalogs covering 3,324 square degrees of the northern sky in five visible wavelengths.
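Returning to the FITS and WCS metadata described above, the sketch below shows how a handful of keyword-value pairs fully determine the pixel-to-sky mapping. It is only an illustration, not part of Montage: it uses the third-party astropy package, and the file name and keyword values are hypothetical.

```python
# Minimal illustration of FITS keyword-value metadata and a WCS evaluation.
# Requires the third-party astropy package; "image.fits" is a hypothetical file.
from astropy.io import fits
from astropy.wcs import WCS

header = fits.getheader("image.fits")

# A FITS header is a flat list of keyword-value pairs; the WCS keywords below
# would declare, for example, a gnomonic (TAN) projection in equatorial coordinates.
for key in ("NAXIS1", "NAXIS2",          # image dimensions in pixels
            "CTYPE1", "CTYPE2",          # axis types, e.g. RA---TAN / DEC--TAN
            "CRVAL1", "CRVAL2",          # sky coordinates of the reference pixel
            "CRPIX1", "CRPIX2",          # pixel coordinates of the reference pixel
            "CDELT1", "CDELT2"):         # pixel scale in degrees per pixel
    print(key, "=", header.get(key))

# The WCS machinery turns those keywords into a pixel -> sky transformation.
wcs = WCS(header)
ra, dec = wcs.wcs_pix2world(512.0, 512.0, 1)   # 1 = FITS (1-based) pixel convention
print("pixel (512, 512) maps to RA = %.6f deg, Dec = %.6f deg" % (ra, dec))
```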

This paper discusses the aspects of the Montage design that are applicable to both the Earth and space science communities. The remainder of the paper is organized as follows. Section II describes how Montage is designed as a modular toolkit. Section III describes techniques that are employed in Montage to dramatically expedite the calculation of mappings from one projection to another. We expect that these techniques could be beneficial for other mosaicking or re-projection applications in both Earth and space science. Section IV describes the architecture of the Montage TeraGrid portal. Performance on the TeraGrid is discussed in Section V. A summary and description of future plans is provided in Section VI.

II. MONTAGE MODULAR DESIGN

Montage has the broad goal of providing astronomers with software for the computation of custom science grade image mosaics in FITS format. Custom refers to user specification of the parameters describing the mosaic, including WCS projection, coordinate system, mosaic size, image rotation, and spatial sampling. Science grade mosaics preserve the calibration and astrometric (spatial) fidelity of the input images.

Fig. 1. The high-level design of Montage.

TABLE I
THE DESIGN COMPONENTS OF MONTAGE

  Component    Description

  Mosaic Engine Components
  mImgtbl      Extract geometry information from a set of FITS headers and create a metadata table from it.
  mProject     Re-project a FITS image.
  mProjExec    A simple executive that runs mProject for each image in an image metadata table.
  mAdd         Co-add the re-projected images to produce an output mosaic.

  Background Rectification Components
  mOverlaps    Analyze an image metadata table to determine which images overlap on the sky.
  mDiff        Perform a simple image difference between a pair of overlapping images. This is meant for use on re-projected images where the pixels already line up exactly.
  mDiffExec    Run mDiff on all the overlap pairs identified by mOverlaps.
  mFitplane    Fit a plane (excluding outlier pixels) to an image. Meant for use on the difference images generated by mDiff.
  mFitExec     Run mFitplane on all the mOverlaps pairs. Creates a table of image-to-image difference parameters.
  mBgModel     Modeling/fitting program which uses the image-to-image difference parameter table to interactively determine a set of corrections to apply to each image to achieve a "best" global fit.
  mBackground  Remove a background from a single image (a planar correction has proven to be adequate for the images we have dealt with).
  mBgExec      Run mBackground on all the images in the metadata table.

Montage constructs an image mosaic in four stages:

1. Re-projection of input images to a common spatial scale, coordinate system, and WCS projection,
2. Modeling of background radiation in images to achieve common flux scales and background levels by minimizing the inter-image differences,
3. Rectification of images to a common flux scale and background level, and
4. Co-addition of re-projected, background-matched images into a final mosaic.

Montage accomplishes these steps in independent modules, written in ANSI C for portability. This "toolkit" approach helps limit testing and maintenance costs, and provides considerable flexibility to users. They can, for example, use Montage simply to re-project sets of images and co-register them on the sky, implement a custom background matching algorithm without impact on the other steps, or define a specific processing flow through custom scripts. Table I gives a brief description of the main Montage modules and Fig. 1 illustrates how they may be used together to produce a mosaic.
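The module ordering implied by Table I and Fig. 1 can be summarized schematically as follows. This sketch only encodes the four-stage flow described above as data; it does not reproduce the actual command-line arguments of the Montage executables.

```python
# Schematic of the Montage four-stage flow from Table I, expressed as data.
# This is an outline of the module ordering described in the text; it does not
# reproduce the real argument lists of the Montage programs.
from collections import namedtuple

Step = namedtuple("Step", ["stage", "module", "purpose"])

PLAN = [
    Step(1, "mImgtbl",   "build a metadata table from the input FITS headers"),
    Step(1, "mProjExec", "run mProject on every image to re-project it to the output WCS"),
    Step(2, "mOverlaps", "find which re-projected images overlap on the sky"),
    Step(2, "mDiffExec", "run mDiff on each overlapping pair"),
    Step(2, "mFitExec",  "run mFitplane on each difference image"),
    Step(2, "mBgModel",  "fit a globally consistent set of background corrections"),
    Step(3, "mBgExec",   "run mBackground to apply the planar correction to each image"),
    Step(4, "mAdd",      "co-add the corrected, re-projected images into the mosaic"),
]

STAGES = {1: "re-projection", 2: "background modeling",
          3: "background rectification", 4: "co-addition"}

for step in PLAN:
    print("stage %d (%s): %-9s - %s" % (step.stage, STAGES[step.stage],
                                        step.module, step.purpose))
```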

Three usage scenarios for Montage are as follows: the modules listed in Table I may be run as stand-alone programs; the executive programs listed in the table (i.e., mProjExec, mDiffExec, mFitExec, and mBgExec) may be used to sequentially process multiple input images; or the grid portal mechanism described in Section IV may be used to process a mosaic in parallel on computational grids. The modular design of Montage permits the same set of core compute modules to be used regardless of the computational environment.

III. TECHNIQUES FOR RAPID RE-PROJECTION

As described in Section II, the first stage of mosaic construction is to re-project each input image to the spatial scale, coordinate system, and projection of the output mosaic. Traditionally, this is by far the most compute-intensive part of the processing because it is done in two steps: first, input image coordinates are mapped to sky coordinates (i.e., right ascension and declination, analogous to longitude and latitude on the Earth); and second, those sky coordinates are mapped to output image coordinates. All of the mappings from one projection to another are compute-intensive, but some require more costly trigonometric operations than others and a few require even more costly iterative algorithms. The first public release of Montage employed this two-step process to map the flux from input space to output space. Because the time required for this process stood as a serious obstacle to using Montage for large-scale image mosaics of the sky, a novel algorithm that is about 30 times faster was devised for the second code release.

The new approach uses an optimized "plane-to-plane" re-projection algorithm, modeled after a similar algorithm developed by the Spitzer Space Telescope project, for those projection mappings that can be computed without the intermediate step of mapping to the sky. The simplest of these is the mapping from one tangent plane projection to another. To generalize this to arbitrary input and output projections, we approximate the actual projection with a tangent plane projection with a polynomial warp. The fast plane-to-plane projection can then be done rapidly on these tangent plane approximations.

The error introduced by the Spitzer plane-to-plane re-projection is negligible on arbitrary spatial scales in the case where the transformation is between two tangent planes. For other projections, the tangent plane approximation introduces additional errors in astrometry, but early indications are that these errors can be kept below around 1% of a pixel width over a few degrees on the sky for most projections. Exceptions are the Aitoff and similar projections, where this approach is only applicable over a degree or two. The accuracy of this approach is well within acceptable tolerance levels and at a scale that is suitable for most scientific research. In situations where greater accuracy is necessary, the projection should be done using the intermediate step of mapping to the celestial sphere, as in the Montage first code release.
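To make the cost of the traditional two-step mapping concrete, the sketch below implements the forward and inverse gnomonic (TAN) projection and re-projects a pixel from one tangent plane to another by going through the sky. It is only an illustration of the geometry (simple linear pixel scales, no rotation or distortion terms, hypothetical header values), not the Montage implementation; the plane-to-plane algorithm described above replaces this trigonometric round trip with a direct, polynomial-warped mapping between the two planes.

```python
# Two-step re-projection between two gnomonic (TAN) tangent planes:
# input pixel -> sky (RA, Dec) -> output pixel. Illustrative only; the real
# Montage code handles full WCS headers, rotation, and flux redistribution.
from math import radians, degrees, sin, cos, atan2, sqrt

def tan_sky_to_plane(ra, dec, ra0, dec0):
    """Gnomonic projection: sky (deg) -> standard coordinates (deg) about (ra0, dec0)."""
    ra, dec, ra0, dec0 = map(radians, (ra, dec, ra0, dec0))
    d = sin(dec0) * sin(dec) + cos(dec0) * cos(dec) * cos(ra - ra0)
    xi = cos(dec) * sin(ra - ra0) / d
    eta = (cos(dec0) * sin(dec) - sin(dec0) * cos(dec) * cos(ra - ra0)) / d
    return degrees(xi), degrees(eta)

def tan_plane_to_sky(xi, eta, ra0, dec0):
    """Inverse gnomonic projection: standard coordinates (deg) -> sky (deg)."""
    xi, eta, ra0, dec0 = map(radians, (xi, eta, ra0, dec0))
    x = cos(dec0) - eta * sin(dec0)
    ra = ra0 + atan2(xi, x)
    dec = atan2(sin(dec0) + eta * cos(dec0), sqrt(x * x + xi * xi))
    return degrees(ra), degrees(dec)

# Hypothetical headers: tangent points and pixel scales (deg/pixel) of two images.
IN_CENTER, IN_SCALE = (266.40, -28.90), 2.8e-4      # roughly 1-arcsecond pixels
OUT_CENTER, OUT_SCALE = (266.55, -29.00), 2.8e-4

def reproject_pixel(x, y):
    """Map an input pixel offset (from its reference pixel) to an output pixel offset."""
    xi_in, eta_in = x * IN_SCALE, y * IN_SCALE                 # pixel -> input plane
    ra, dec = tan_plane_to_sky(xi_in, eta_in, *IN_CENTER)      # input plane -> sky
    xi_out, eta_out = tan_sky_to_plane(ra, dec, *OUT_CENTER)   # sky -> output plane
    return xi_out / OUT_SCALE, eta_out / OUT_SCALE             # output plane -> pixel

print("input pixel (100, 200) lands at output pixel (%.2f, %.2f)" % reproject_pixel(100, 200))
```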
IV. MONTAGE GRID PORTAL ARCHITECTURE

The Montage TeraGrid portal has a distributed architecture, as illustrated in Fig. 2. The portal comprises the following five main components, each having a client and server: (i) User Portal, (ii) Abstract Workflow Service, (iii) 2MASS Image List Service, (iv) Grid Scheduling and Execution Service, and (v) User Notification Service. These components are described in more detail below.

A usage scenario is as follows. Users on the Internet submit mosaic requests by filling in a simple web form with parameters that describe the mosaic to be constructed, including an object name or location, mosaic size, coordinate system, projection, and spatial sampling. A service at JPL is contacted to generate an abstract workflow, which specifies: the processing jobs to be executed; input, output, and intermediate files to be read or written during the processing; and dependencies between the jobs. A 2MASS image list service at IPAC is contacted to generate a list of the 2MASS images required to fulfill the mosaic request. The abstract workflow is passed to a service at ISI, which runs software called Pegasus (Planning for Execution in Grids) [4, 5, 6]. Pegasus schedules the workflow on the TeraGrid (and possibly other resources), using grid information services to find information about data and software locations. The resulting "concrete workflow" includes information about specific file locations on the grid and specific grid computers to be used for the processing. The workflow is then executed on the remote TeraGrid clusters using Condor DAGMan [7]. The last step in the mosaic processing is to contact a user notification service at IPAC, which currently simply sends an email to the user containing the URL of the Montage output.

Fig. 2. The distributed architecture of the Montage TeraGrid Portal.
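The distinction between the abstract and concrete workflows just described can be sketched as follows. This is a schematic only, not the actual Pegasus XML workflow format; the job names, file names, site name, and replica catalog entries are all hypothetical.

```python
# Abstract workflow: logical jobs, logical files, and dependencies only.
# Names and sites are hypothetical; Pegasus actually consumes an XML DAG.
abstract_workflow = {
    "jobs": {
        "proj_1": {"exec": "mProject", "in": ["img_1.fits"], "out": ["p_1.fits"]},
        "proj_2": {"exec": "mProject", "in": ["img_2.fits"], "out": ["p_2.fits"]},
        "add":    {"exec": "mAdd", "in": ["p_1.fits", "p_2.fits"], "out": ["mosaic.fits"]},
    },
    "deps": [("proj_1", "add"), ("proj_2", "add")],   # parent -> child edges
}

def concretize(wf, site, replica_catalog):
    """Toy 'planner': bind every job to a site and add a stage-in job for any
    input file whose physical location is known only to the replica catalog."""
    concrete = {"jobs": {}, "deps": list(wf["deps"])}
    for name, job in wf["jobs"].items():
        concrete["jobs"][name] = dict(job, site=site)
        for f in job["in"]:
            if f in replica_catalog:                  # file lives elsewhere
                stage = "stage_in_" + f
                concrete["jobs"].setdefault(
                    stage, {"exec": "transfer", "from": replica_catalog[f],
                            "to": site, "out": [f], "site": site})
                concrete["deps"].append((stage, name))
    return concrete

replica_catalog = {"img_1.fits": "gsiftp://archive.example.org/2mass/img_1.fits",
                   "img_2.fits": "gsiftp://archive.example.org/2mass/img_2.fits"}
cw = concretize(abstract_workflow, site="teragrid-ncsa", replica_catalog=replica_catalog)
print(len(cw["jobs"]), "concrete jobs,", len(cw["deps"]), "dependencies")
```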

This design exploits the parallelization inherent in the Montage architecture. The Montage grid portal is flexible enough to run a mosaic job on a number of different cluster and grid computing environments, including Condor pools and TeraGrid clusters. We have demonstrated processing on both a single cluster configuration and on multiple clusters at different sites having no shared disk storage.

A. USER PORTAL

Users on the Internet submit mosaic requests by filling in a simple web form with parameters that describe the mosaic to be constructed, including an object name or location, mosaic size, coordinate system, projection, and spatial sampling. Fig. 3 shows a screen capture of the web form interface accessible at http://montage.jpl.nasa.gov/. After request submission, the remainder of the data access and mosaic processing is fully automated with no user intervention.

Fig. 3. Montage grid portal web form interface.

The server side of the user portal includes a CGI program that receives the user input via the web server, checks that all values are valid, and stores the validated requests to disk for later processing. A separate daemon program with no direct connection to the web server runs continuously to process incoming mosaic requests. The processing for a request is done in two main steps:

1. Call the abstract workflow service client code.
2. Call the grid scheduling and execution service client code and pass to it the output from the abstract workflow service client code.

B. ABSTRACT WORKFLOW SERVICE

The abstract workflow service takes as input a celestial object name or location on the sky and a mosaic size, and returns a zip archive file containing the abstract workflow as a directed acyclic graph (DAG) in XML and a number of input files needed at various stages of the Montage mosaic processing. The abstract workflow specifies the jobs and files to be encountered during the mosaic processing, and the dependencies between the jobs. These dependencies are used to determine which jobs can be run in parallel on multiprocessor systems. A pictorial representation of an abstract workflow for a mosaic with three input images is shown in Fig. 4.

Fig. 4. Example abstract workflow.

C. 2MASS IMAGE LIST SERVICE

The 2MASS Image List Service takes as input a celestial object name or location on the sky (which must be specified as a single argument string) and a mosaic size. The 2MASS images that intersect the specified location on the sky are returned in a table, with columns that include the filenames and other attributes associated with the images.

D. GRID SCHEDULING AND EXECUTION SERVICE

The Grid Scheduling and Execution Service takes as input the zip archive generated by the Abstract Workflow Service, which contains the abstract workflow and all of the input files needed to construct the mosaic. The service authenticates users, schedules the job on the grid using a program called Pegasus, and then executes the job using Condor DAGMan.

Users are authenticated on the TeraGrid using their Grid security credentials. The user first needs to save their proxy credential in the MyProxy server. MyProxy is a credential repository for the Grid that allows a trusted server (such as our Grid Scheduling and Execution Service) to access grid credentials on the user's behalf. This allows these credentials to be retrieved by the portal using the user's username and password.

Once authentication is completed, Pegasus schedules the Montage workflow onto the TeraGrid or other clusters managed by PBS and Condor. Pegasus is a workflow management system designed to map abstract workflows onto grid resources to produce concrete (executable) workflows. Pegasus consults various Grid information services, such as the Globus Monitoring and Discovery Service (MDS) [8], the Globus Replica Location Service (RLS) [9], the Metadata Catalog Service (MCS) [10], and the Transformation Catalog to determine what grid resources and data are available. If any of the data products described in the abstract workflow have already been computed and registered in the RLS, Pegasus removes the jobs that generate them from the workflow. In this way, the RLS can effectively be used as a data cache to prune the workflow. The executable workflow generated by Pegasus specifies the grid computers to be used, the data movement for staging data in and out of the computation, and the data products to be registered in the RLS and MCS, as illustrated in Fig. 5.

Fig. 5. Example concrete (executable) workflow for a 10-input-file job on a single cluster. In addition to the computation nodes, the concrete workflow includes nodes for staging data into and out of the computation and for registering the data products for later retrieval.

The executable workflow is submitted to Condor DAGMan for execution. DAGMan is a scheduler that submits jobs to Condor in an order specified by the concrete workflow. Condor queues the jobs for execution on the TeraGrid. Upon completion, the final mosaic is delivered to a user-specified location and the User Notification Service, described below, is contacted.
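The workflow pruning that Pegasus performs against the RLS can be illustrated with a small sketch (hypothetical job and file names; a simplified stand-in for Pegasus's actual planning logic, not its implementation): a job is dropped when nothing it produces is still needed, either because the product is already registered or because the consumer of that product has itself been dropped, and the prune is repeated until the workflow stops shrinking.

```python
# Simplified illustration of replica-catalog-based workflow pruning: jobs whose
# products already exist need not be re-run, and neither do upstream jobs whose
# outputs are then no longer needed. Hypothetical names throughout.

# job -> (inputs, outputs)
jobs = {
    "proj_1": (["img_1.fits"], ["p_1.fits"]),
    "proj_2": (["img_2.fits"], ["p_2.fits"]),
    "diff":   (["p_1.fits", "p_2.fits"], ["d_12.fits"]),
    "fit":    (["d_12.fits"], ["fit_12.tbl"]),
    "add":    (["p_1.fits", "p_2.fits"], ["mosaic.fits"]),
}
registered = {"p_1.fits", "p_2.fits", "fit_12.tbl"}   # already in the replica catalog
targets = {"mosaic.fits"}                             # what the user asked for

def prune(jobs, registered, targets):
    remaining = dict(jobs)
    changed = True
    while changed:
        changed = False
        # A file is needed if it is a requested target or an unregistered input
        # of some job that is still in the workflow.
        needed = set(targets)
        for _, (ins, _outs) in remaining.items():
            needed.update(f for f in ins if f not in registered)
        for name, (_ins, outs) in list(remaining.items()):
            if not any(f in needed for f in outs):    # nothing it makes is needed
                del remaining[name]
                changed = True
    return remaining

print(sorted(prune(jobs, registered, targets)))   # only 'add' survives here
```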

E. USER NOTIFICATION SERVICE

The last step in the grid processing is to notify the user with the URL where the mosaic may be downloaded. This notification is done by a remote user notification service at IPAC so that a new notification mechanism can be used later without having to modify the Grid Scheduling and Execution Service. Currently the user notification is done with a simple email, but a later version will use the Request Object Management Environment (ROME), being developed separately for the National Virtual Observatory. ROME will extend our portal with more sophisticated job monitoring, query, and notification capabilities.

V. PERFORMANCE

We have run the Pegasus-enabled Montage on a variety of resources: Condor pools, LSF- and PBS-managed clusters, and the TeraGrid (through PBS). Table II gives the runtimes of the individual workflow components to summarize the results of running a 2-degree M16 mosaic on the NCSA TeraGrid cluster. These performance figures are for the first release of Montage, which does not include the algorithmic optimizations described in Section III. The total runtime of the workflow was 107 minutes and the workflow contained 1,515 individual jobs.

TABLE II
TERAGRID PERFORMANCE OF MONTAGE

  Number of Jobs   Job Name                  Average Run-Time
  1                mAdd                      94.00 seconds
  180              mBackground               2.64 seconds
  1                mBgModel                  11 seconds
  1                mConcatFit                9 seconds
  482              mDiff                     2.89 seconds
  483              mFitplane                 2.55 seconds
  180              mProject                  130.52 seconds
  183              Transfer of data in       Between 5-30 seconds each
  1                Transfer of mosaic out    18:03 minutes

To this point, our main goal was to demonstrate the feasibility of running the Montage workflow in an automated fashion on the TeraGrid with some amount of performance improvement over the sequential version. Currently, Pegasus schedules the workflow as a set of small jobs. As seen in the table, some of these jobs run only a few seconds, which is suboptimal because scheduling too many little jobs suffers from large overheads. In fact, if this processing had been run on a single TeraGrid processor, it would have taken 445 minutes, so we are not taking very much advantage of the TeraGrid's parallelism. However, initially structuring the workflow in this way allows us to expose the highest degree of parallelism.

We will improve this performance by optimizing both the Montage algorithms and the grid scheduling techniques. We expect about a 30 times speedup without sacrificing accuracy by using the algorithmic techniques described in Section III. We will address TeraGrid performance in three ways: making Pegasus aggregate nodes in the workflow in a way that would reduce the overheads for given target systems; encouraging the Condor developers to reduce the per-job overhead; and examining alternate methods for distributing the work on the grid. Each option has advantages and disadvantages that will be weighed as we go forward.
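The single-processor figure quoted above can be recovered directly from Table II, and comparing it with the 107-minute wall-clock time gives a rough measure of how much of the TeraGrid's parallelism the current job-per-module scheduling actually exploits. The following back-of-the-envelope check uses only the published numbers and ignores the data-transfer rows, whose durations are given only as ranges.

```python
# Back-of-the-envelope check on Table II: aggregate compute time vs. wall time.
# Transfer rows are omitted because their durations are given only as ranges.
table_ii = [            # (number of jobs, average runtime in seconds)
    (1,   94.00),       # mAdd
    (180, 2.64),        # mBackground
    (1,   11.0),        # mBgModel
    (1,   9.0),         # mConcatFit
    (482, 2.89),        # mDiff
    (483, 2.55),        # mFitplane
    (180, 130.52),      # mProject
]

total_cpu_s = sum(n * t for n, t in table_ii)
wall_clock_min = 107.0                       # measured end-to-end runtime

print("aggregate compute time: %.0f minutes" % (total_cpu_s / 60))       # ~445 minutes
print("observed speedup: %.1fx" % (total_cpu_s / 60 / wall_clock_min))   # ~4.2x

# With 1,515 jobs in the workflow, even a modest per-job scheduling overhead adds
# up, which is why aggregating small jobs is one of the planned optimizations.
```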
VI. CONCLUSION

Montage is a project to design and develop high science quality astronomical image mosaicking software. The software will be made accessible to the science community using two mechanisms: (i) a toolkit that can be directly downloaded and run manually on a local computer, and (ii) a fully automated grid portal with a simple web-form interface. A number of characteristics of the Montage design are applicable to both the Earth and space science communities, including fast image re-projection techniques and grid portal mechanisms. Montage incorporates a tangent plane approximation and fast plane-to-plane mapping technique to optimize the compute-intensive re-projection calculations.

A Montage mosaic job can be described in terms of an abstract workflow so that a planning tool such as Pegasus can derive an executable workflow that can be run in a grid environment. The execution of the workflow is performed by the workflow manager DAGMan and the associated Condor-G.

This design exploits the parallelization inherent in the Montage architecture. The Montage grid portal is flexible enough to run a mosaic job on a number of different cluster and grid computing environments, including Condor pools and TeraGrid clusters. We have demonstrated processing on both a single cluster configuration and on multiple clusters at different sites having no shared disk storage.

Our current and future work includes optimizing the grid scheduling to better account for a priori knowledge about the size of the computation required for different parts of the Montage processing. This information will be used to aggregate appropriate parts of the computation in order to lessen the impact of the overhead of scheduling smaller chunks of computation on grid computers. Also, the portal will be integrated with ROME for improved job monitoring, query, and notification capabilities.

ACKNOWLEDGMENTS

Montage is funded by NASA's Earth Science Technology Office, Computational Technologies Project, under Cooperative Agreement Number NCC5-626 between NASA and the California Institute of Technology. Pegasus is supported by NSF under grants ITR-0086044 (GriPhyN) and ITR AST0122449 (NVO).

Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

REFERENCES

1. B. Berriman, D. Curkendall, J. C. Good, L. Husman, J. C. Jacob, J. M. Mazzarella, R. Moore, T. A. Prince, and R. E. Williams, Architecture for Access to Compute-Intensive Image Mosaic and Cross-Identification Services in the NVO, SPIE Astronomical Telescopes and Instrumentation: Virtual Observatories Conference, August 2002.

2. G. B. Berriman, D. Curkendall, J. Good, J. Jacob, D. S. Katz, T. Prince, and R. Williams, Montage: An On-Demand Image Mosaic Service for the NVO, Astronomical Data Analysis Software and Systems (ADASS) XII, October 2002, Astronomical Society of the Pacific Conference Series, eds. H. Payne, R. Jedrzejewski, and R. Hook.

3. B. Berriman, A. Bergou, E. Deelman, J. Good, J. Jacob, D. Katz, C. Kesselman, A. Laity, G. Singh, M.-H. Su, and R. Williams, Montage: A Grid-Enabled Image Mosaic Service for the NVO, Astronomical Data Analysis Software and Systems (ADASS) XIII, October 2003.

4. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, and M. Livny, Pegasus: Mapping Scientific Workflows onto the Grid, Across Grids Conference 2004, Nicosia, Cyprus.

5. Y. Gil, E. Deelman, J. Blythe, C. Kesselman, and H. Tangmunarunkit, Artificial Intelligence and Grids: Workflow Planning and Beyond, IEEE Intelligent Systems, January 2004.

6. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda, Mapping Abstract Complex Workflows onto Grid Environments, Journal of Grid Computing, vol. 1, no. 1, 2003, pp. 25-39.

7. Condor and DAGMan, http://www.cs.wisc.edu/condor/.

8. K. Czajkowski, et al., Grid Information Services for Distributed Resource Sharing, Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, 2001.

9. A. Chervenak, et al., Giggle: A Framework for Constructing Scalable Replica Location Services, Proceedings of Supercomputing (SC) 2002.

10. G. Singh, et al., A Metadata Catalog Service for Data Intensive Applications, Proceedings of Supercomputing (SC) 2003.
