
Paper SAS6422-2016
Deep Dive with SAS Studio into SAS Grid Manager 9.4
Edoardo Riva, SAS Institute Inc.

ABSTRACT
Do you know how many different ways SAS Studio can run your programs with SAS Grid Manager? SAS Studio is the latest and coolest interface to SAS software. As such, we want to use it in most situations, including with sites that leverage SAS Grid Manager. Are you new to SAS and want to be guided by a modern UI? SAS Studio is here to help you. Are you a code fanatic, who wants total control of how your program runs to harness the full power of SAS Grid Manager? Sure, SAS Studio is for you, too. This paper covers all SAS Studio editions. You learn how to connect each of them to SAS Grid Manager and discover best practices for harnessing a high-performance SAS analytics environment, while avoiding potential pitfalls.

INTRODUCTION
With SAS Studio, you can access your data files, libraries, and existing programs, and you can write new programs. You can also use the predefined tasks in SAS Studio to generate SAS code for you.

SAS Grid Manager lets you balance workloads. It also gives you faster parallel processing, high availability, and enterprise scheduling, all in a flexible and centrally managed grid computing environment.

They seem such different technologies, yet they can provide great benefits when used together. This paper concentrates on three grid features, showing how SAS Studio can help you take advantage of workload balancing and parallel computing, while using grid central-management capabilities.

SAS STUDIO
SAS Studio is a multi-functional application for leveraging the power of SAS through your web browser. When you run a program or task, SAS Studio connects to a SAS server to process the SAS code. The SAS server can be a hosted server in a cloud environment, a server in your local environment, or a copy of SAS on your local machine.
For server-based environments, adding SAS Grid Manager provides the extra capabilities discussed in this paper.
After the code is processed, the results are returned to SAS Studio in your browser.

SAS STUDIO EDITIONS
SAS Studio is available in three deployments: SAS Studio Single-User, SAS Studio Basic, and SAS Studio Enterprise Edition.

The single-user edition of SAS Studio is delivered with every copy of Base SAS and runs on Windows operating environments. All the software components of SAS Studio are installed on the same machine, and only one user identity is allowed access. You can think of it as the web version of the traditional SAS Display Manager System.

Figure 1. SAS Studio Single-User Edition High-Level Architecture

The basic edition of SAS Studio is delivered with Base SAS and runs on Windows and UNIX operating environments. This edition includes the SAS Web Application Server and the SAS Object Spawner. Any user who has an operating system account on the Windows or UNIX server machine can log on through a web browser over the network.

Figure 2. SAS Studio Basic Edition, High-Level Architecture

The enterprise edition of SAS Studio is available with the SAS Integration Technologies license, which is included in most SAS solutions. This edition includes the SAS Metadata Server, the SAS Web Application Server, the SAS Web Server, and the SAS Web Infrastructure Platform services, applications, and data server.

Figure 3. SAS Studio Enterprise Edition, High-Level Architecture

Why bother with all these versions? Because all editions of SAS Studio can take advantage of the processing capabilities of a SAS grid, but the approach to take depends on the edition of SAS Studio that you are using. All this is described in detail in the paper.

Note: SAS University Edition includes SAS Studio together with Base SAS, SAS/STAT software, SAS/IML software, SAS/ACCESS software, and several time series forecasting procedures from SAS/ETS software. This paper does not include a discussion of SAS University Edition because it does not have the components required to connect to a grid environment.

SAS STUDIO RELEASES
SAS Studio was initially released in 2014 with the first maintenance release of SAS 9.4. It has been updated several times since then, as documented in Table 1.

SAS Studio Release    Supported SAS Release
SAS Studio 3.1        SAS 9.4 TS1M1, ship event 14W11 (March 2014)
SAS Studio 3.2        SAS 9.4 TS1M2, ship event 14W32 (August 2014)
SAS Studio 3.3        SAS 9.4 TS1M2, ship event 15W08 (February 2015)
SAS Studio 3.4        SAS 9.4 TS1M3, ship event 15W29 (July 2015)
SAS Studio 3.5        SAS 9.4 TS1M3, ship event 16W08 (February 2016)

Table 1. SAS Studio Releases

Many new features have been added since it was initially released. If you want to harness all its power, be sure to run the most current release. Otherwise, it is time for an upgrade!

SAS GRID MANAGER
SAS Grid Manager provides a modern, flexible infrastructure that turbo-charges SAS performance with distributed and parallel computing techniques, all under the automatic monitoring, resource management, and orchestration of a grid controller. When you leverage a SAS grid computing environment, you can automatically distribute SAS computing tasks among multiple computers on a network under the control of SAS Grid Manager.

Starting with the third maintenance release of SAS 9.4, released in July 2015, when you license SAS Grid Manager you can choose one of the two available flavors. SAS Studio works seamlessly with either edition.

SAS GRID MANAGER WITH PLATFORM SUITE FOR SAS
Platform Suite for SAS is a set of components, provided by IBM Platform Computing, that provide efficient resource allocation, policy management, and load balancing of SAS workload requests. SAS Grid Manager with Platform Suite for SAS includes all of the required grid software components, both from Platform Computing and SAS.

SAS GRID MANAGER FOR HADOOP
SAS Grid Manager for Hadoop provides workload management, accelerated processing, and scheduling of SAS analytics co-located on a Hadoop cluster. SAS Grid Manager for Hadoop leverages YARN to manage resources and distribute SAS analytics to a Hadoop cluster running multiple applications. It integrates with Oozie, which provides scheduling capability for SAS workflows. SAS Grid Manager for Hadoop supports all of the existing SAS Grid syntax, submission modes, and integration with other SAS products and solutions.

SAS Grid Manager for Hadoop does not include a Hadoop distribution or any of the Hadoop components. Instead, you must have one of the supported enterprise Hadoop distributions already installed and configured.

HOW DOES SAS STUDIO LEVERAGE A GRID ENVIRONMENT?
As you can see in the figures in the previous sections, every edition of SAS Studio uses workspace server sessions to run SAS code. To run that code on a grid, you must have SAS Grid Manager licensed on the same machines where the SAS Studio workspace servers are started. These machines will be your grid entry point.

With all editions of SAS Studio, you can use in your code any of the functions included with the SAS Grid Manager license, such as GRDSVC_ENABLE, to send the code to the grid for remote execution. In this way, you can start multiple sessions in parallel to run your code faster and get results quickly. For example, Figure 4 shows a SAS user running three parallel grid sessions, all started by one of the workspace servers used by SAS Studio.

Figure 4. Parallel Code Execution Leveraging SAS/CONNECT

If you are using SAS Studio Enterprise Edition together with SAS Grid Manager, your administrator can configure the SAS Studio workspace servers to use load balancing and then select to have the grid launch the workspace servers. After that, without any code change or end-user intervention, the grid control server automatically starts new server sessions on the best available grid node, as defined by the current load on the grid hosts and the policies set by the administrator. Figure 5 shows two workspace server sessions, each servicing a different end user, automatically load-balanced by SAS Grid Manager and started on different grid nodes.

Figure 5. Multi-User Workload Balancing with Grid-Launched Workspace Servers

Let’s see both use cases in more detail. We will actually start with the second one; it is simply so amazing that you will find yourself using the grid without even knowing it!

MULTI-USER LOAD BALANCING
SAS Workspace Servers are capable of performing load balancing across multiple machines. These servers can be configured to use one of the default algorithms to provide load balancing. However, with SAS Grid Manager installed, an administrator can configure workspace servers and other SAS servers to use SAS Grid Manager to provide load balancing. This is a one-time configuration done using SAS Management Console, shown in Display 1. Any SAS product or solution that uses workspace servers, including SAS Studio Enterprise Edition, will benefit from using SAS Grid Manager to provide load balancing. However, using the grid to provide load balancing also increases overhead, so each session might take a few seconds longer to start.

Display 1. Server Load-Balancing Options Dialog Box in SAS Management Console

For you, as the end user, SAS Studio behavior is unchanged. Simply sign on and, even before pressing a button or writing any code, you get two workspace server sessions running on your grid hosts. On a side note, SAS Studio always starts at least two sessions. One is used to run the user code, while the other is used for file I/O and other internal operations.

As soon as another user signs on, two additional sessions appear on the grid, and so on for each additional user. The grid controller takes care of starting each session on the best available server, resulting in evenly spread resource utilization.

Figure 6 shows two end users who simply signed on to SAS Studio Enterprise Edition. SAS Grid Manager took care of starting their sessions and load balancing them on the grid nodes.

Figure 6. Workspace Servers Load Balanced across Grid Nodes

So far, we have only opened the SAS Studio web interface. What happens when we start using it? Remember that all code that is generated by SAS Studio, and every action that requires querying the back-end SAS session, executes in one of the workspace servers. This means that everything that runs is automatically using the grid.

REMOTE CONNECT TO GRID SESSIONS
With all editions of SAS Studio, you can use SAS/CONNECT statements in your code to send the code to the grid. It is really simple to convert a program so that it runs on a grid. All you do is add five extra lines of code:

GRDSVC_ENABLE
SIGNON
RSUBMIT
ENDRSUBMIT
SIGNOFF

These statements are not described in detail in this paper. You can find a very good explanation in Doug Haigh’s paper from SAS Global Forum 2015 listed in the references section.

The result is that the workspace server session, which is currently running your SAS Studio code, launches one or more additional remote grid sessions. All the code between the RSUBMIT and ENDRSUBMIT statements is forwarded there for execution on the grid. Some SAS programs contain multiple independent subtasks that can be executed in parallel.
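As a minimal sketch of how the five statements listed above fit together, the following program sends one step to a single grid session. The session name task1, the logical server name SASApp, and the PROC MEANS step are illustrative assumptions, not values from a specific site.

```sas
/* Route SAS/CONNECT sign-ons to the grid.                        */
/* SASApp is an assumed logical server name; use your site's own. */
%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));

/* Start one remote session on a grid node.                       */
/* "task1" is an arbitrary session name chosen for this sketch.   */
signon task1;

/* Everything between RSUBMIT and ENDRSUBMIT runs in the remote   */
/* grid session; its log and output are returned to the caller.   */
rsubmit task1;
   proc means data=sashelp.cars mean min max;
      var MSRP;
   run;
endrsubmit;

/* End the remote session and release its grid resources. */
signoff task1;
```

Because RSUBMIT is synchronous unless WAIT=NO is specified, the calling session in this sketch waits for the remote step to finish before it signs off.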
Just add RSUBMIT and ENDRSUBMIT statements around each subtask, and SAS Grid Manager automatically assigns each identified subtask to a grid node.

Figure 7 shows an example of SAS Studio single-user edition in which the end user has submitted the statements to launch two remote parallel grid sessions. For simplicity, only one of the two workspace server sessions that SAS Studio always uses is shown. The dotted lines represent the connections instantiated by the SAS/CONNECT protocol. Also, note that since the workspace server session is acting as a grid client, the client components of the grid software must be installed and configured on the end-user client. These client components are the Platform Load Sharing Facility (LSF) client for SAS Grid Manager with Platform Suite for SAS, or the Hadoop client JAR files and XML configuration files for SAS Grid Manager for Hadoop.

Figure 7. SAS Studio Single-User Edition Remotely Connected to a Grid

DIFFERENT MODES OF EXECUTION INSIDE SAS STUDIO
SAS Studio wants to be your interface of choice, whatever your habits and your programming style are. That is why it includes two different perspectives: the SAS Programmer perspective and the Visual Programmer perspective. It also supports different code submission modes: noninteractive, interactive, and batch. You can find all the details of what these are and how to use them in the official documentation, but it is interesting to explore here how some of these interact with a grid environment.

BATCH SUBMIT
Starting with SAS Studio 3.5, you can submit SAS programs in batch. As you can imagine, this lets you submit a program, close your browser, and then come back later to check the results.

If you are familiar with SAS architecture, it will be interesting to know that this batch submission does not use the SAS DATA Step Batch Server. Just as with the other code submission modes, SAS Studio starts a workspace server; the difference here is that it keeps the server running on your behalf even after you log off from the web interface. This is useful to know while configuring or monitoring the grid back end: from the grid point of view, these batch sessions are exactly the same as the interactive ones!

One interesting aspect of batch SAS Studio submissions is documented in the product Administrator’s Guide.
The guide explains how to set some properties in the SAS Studio configuration file to limit the maximum number of active batch sessions per user or across all users. Table 2 lists these properties with their default values.

Property: (per-user limit; the property name is missing from this copy of the paper)
Description: Specifies the maximum number of active batch jobs for the current SAS Studio user. The default value depends on your edition of SAS Studio. For the SAS Studio Enterprise Edition and SAS Studio Basic Edition, the default value is 3. For the SAS Studio Single-User Edition, the default value is 5.

Property: webdms.maxNumActiveBatchSubmissionsSystem
Description: Specifies the maximum number of batch jobs that can be submitted for a given instance of SAS Studio across all users. The default value depends on your edition of SAS Studio. For the SAS Studio Mid-Tier (Enterprise) Edition and SAS Studio Basic Edition, the default value is 24. For the SAS Studio Single-User Edition, the default value is 5.

Table 2. SAS Studio Properties to Limit the Number of Concurrent Batch Submissions

In your grid environment, these default values might be too small, especially if you have many cores. You can ask your administrator to increase them if you run out of available sessions.

PARALLEL PROCESS FLOWS
One of the hidden gems of SAS Studio that really shines when used in a grid environment is the capability of running process flows in parallel. However, let’s take one step at a time: what are process flows?

When working in the Visual Programmer perspective, you have access to process flows. A process flow is a graphical representation of a process, where each object, be it a SAS program, a SAS Studio task, a query, and so on, is represented by a node. Nodes are connected by links that instruct SAS Studio how to move from one node to the next one.

Display 2 shows a simple SAS Studio process flow.

Display 2. SAS Studio Process Flow

On the Properties tab of the current process flow, you can set the execution mode of the nodes. With the default setting, SAS Studio runs the nodes in the order in which they are added to the process flow. If node 2 depends on node 1, node 1 must run completely before node 2 runs.

You can change the execution mode to Parallel as shown in Display 3. When this value is set, SAS Studio uses multiple workspace servers to run the nodes concurrently, always enforcing the correct dependencies.

Display 3. Setting the Execution Mode to Parallel

When you use this feature in a grid environment, you can achieve the benefits and the performance improvements of multi-machine parallel load balancing, without having to code any SAS/CONNECT statement. It’s a real point-and-click parallel execution engine!

Display 4 shows the process flow presented in Display 2 while it is running in parallel execution mode. The pane is grayed out because it is not possible to interact with it until the execution is complete. We can see that the List Data node is still running, while the Partition Data node has already finished. Thus, the two Filter Data nodes were able to start in parallel.

In this scenario, we would guess that three workspace server sessions are concurrently running our code in the grid. However, if we monitor what is happening on the back-end hosts, we notice something unexpected. There are actually five workspace server sessions running. Why? If you remember, when you sign in to SAS Studio, it starts two SAS sessions. These are used only for the default execution mode. If a process flow is run in parallel mode, up to three additional SAS sessions are started, for a total of five. Once the process flow is finished, the three additional SAS processes terminate if there is no further activity for 30 seconds, in order to release resources.

Just as with batch processing, an administrator can use a configuration property, webdms.maxParallelWorkspaces, to specify the maximum number of workspaces that can be used when SAS is running in parallel mode. The default value is 3. The maximum value is 8.
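As a rough illustration, an administrator's overrides for the session limits discussed above might look like the following fragment. It assumes that the SAS Studio configuration file uses plain key=value lines; the values shown are arbitrary examples, not recommendations.

```properties
# Illustrative values only (assumption: key=value syntax in the SAS Studio configuration file).
# Raise the system-wide batch limit above the Enterprise/Basic default of 24.
webdms.maxNumActiveBatchSubmissionsSystem=48
# Allow up to 6 additional workspaces for parallel process flows (default 3, maximum 8).
webdms.maxParallelWorkspaces=6
```

Each extra workspace is another SAS session on the grid, so any increase should be weighed against the number of concurrent users and the job slots available on the grid nodes.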

Display 4. Tasks Running in Parallel in a Process Flow

PERSISTENCE OF SESSIONS
Parallel processing can speed up your projects by an incredible factor, especially when programs consist of subtasks that are independent units of work and can be distributed across a grid and executed in parallel. However, when these parallel execution environments are not kept in sync, it can also introduce unforeseen problems. The most common is the “disappearance” of temporary tables. This can happen with different client interfaces, because the problem does not depend on a particular piece of software, but rather on the business logic that is implemented. Let’s discuss this problem with a practical example using SAS Studio.

WORK AND OTHER LIBRARIES
In this example, we want to run an analysis (here a simple PROC PRINT) on two independent subsets of the same table. We decide to use two parallel grid sessions to partition the data, and then we run the analysis in the parent session. The code we submit in SAS Studio could be similar to the following:

    /* Enable grid routing and start two remote grid sessions */
    %let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));
    signon grid1;
    signon grid2;

    /* Clean up any leftovers in the parent session's WORK library */
    proc datasets library=work noprint;
       delete sedan SUV;
    run;

    /* Each subset is built asynchronously in its own grid session */
    rsubmit grid1 wait=no;
       data sedan;
          set sashelp.cars;
          where Type = "Sedan";
       run;
    endrsubmit;

    rsubmit grid2 wait=no;
       data SUV;
          set sashelp.cars;
          where Type = "SUV";
       run;
    endrsubmit;

    /* Wait for both remote tasks, then print in the parent session */
    waitfor _ALL_ grid1 grid2;

    proc print data=sedan;
    run;
    proc print data=SUV;
    run;

After submitting the code by pressing F3 or clicking Run, we do not get the expected Result window, and the Log window shows some errors, as shown in Display 5.

Display 5. SAS Studio Log Window with Errors

We get these results because we are using the WORK library, the temporary library that is automatically defined by SAS at the beginning of each SAS session or job. The WORK library stores temporary SAS files that are written by a DATA step or a procedure and then read as input of subsequent steps. When we request parallel execution on the grid, tasks run in multiple SAS sessions, and each grid session has its own dedicated WORK library that is shared neither with any other grid session, nor with the parent session that started it.

In the above example, the DATA steps output their results (the SEDAN and SUV tables) in the WORK library of a SAS session. Then, PROC PRINT tries to read those tables from the WORK library of a different SAS session. Figure 8 shows that the desired tables are not where we expect them, and the task fails.

Figure 8. Incorrect Use of the WORK Library between Multiple Sessions

This issue is quite common when dealing with multiple sessions, even without a grid. One simple solution is to avoid using the WORK library and any other non-shared resources. It is possible to assign a common library in many ways, such as in autoexec files or in metadata.

Figure 9 shows how a common library solves the issue.
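One way to apply this fix to the example above is sketched below. It assumes a directory, here the hypothetical path /shared/sasdata/project1, that is visible under the same name from the parent session and from every grid node. The libref name shared and the logical server name SASApp are also assumptions made for illustration.

```sas
%let rc = %sysfunc(grdsvc_enable(_all_, server=SASApp));

/* Parent session: assign the common library.                   */
/* The path is hypothetical; it must exist on a shared file     */
/* system that every grid node can reach.                       */
libname shared "/shared/sasdata/project1";

signon grid1;
signon grid2;

/* Each remote session assigns the same physical location, so   */
/* the tables it writes are visible to every other session.     */
rsubmit grid1 wait=no;
   libname shared "/shared/sasdata/project1";
   data shared.sedan;
      set sashelp.cars;
      where Type = "Sedan";
   run;
endrsubmit;

rsubmit grid2 wait=no;
   libname shared "/shared/sasdata/project1";
   data shared.SUV;
      set sashelp.cars;
      where Type = "SUV";
   run;
endrsubmit;

/* Wait for both asynchronous tasks before reading their output */
waitfor _all_ grid1 grid2;

/* The parent session now finds both tables in the shared library */
proc print data=shared.sedan;
run;
proc print data=shared.SUV;
run;

signoff grid1;
signoff grid2;
```

Repeating the LIBNAME statement in each RSUBMIT block keeps the sketch self-contained. At a real site, the same library would more likely be assigned once in an autoexec file or in metadata, as mentioned above.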

Figure 9. Correct Use of a Shared Library

When submitting process flows in the Visual Programmer perspective in parallel, SAS Studio can help us avoid this problem. We can save our intermediate results in the WEBWORK library. This special library is automatically assigned at start-up and is shared across all workspace server sessions.

As you might have guessed, libraries are not the only objects that should be shared across sessions. Every local setting, be it the value of an option, a macro, or a format, has to be shared across all parallel sessions. It is not difficult, but we have to remember to do it!

WHERE ARE MY PREFERENCES?
Even when we use all due diligence in coding and use shared locations for all SAS artifacts, we can still run into sharing issues when using SAS Studio in a grid environment.

SAS Studio has a Preferences window that enables you to customize several options that change the behavior of different features of the software. By default, these preferences are stored under the end-user home directory on the server where the workspace server session is running (%AppData%/SAS/SASStudio/preferences in Windows or ~/.sasstudio/preferences in UNIX). Does this sentence ring any alarm bells? With SAS Studio Enterprise Edition running in a grid environment, there is no such thing as “the server where the workspace server session is running”! One time it runs on one grid node, a few minutes later it runs on another one. For this reason, a preference that we just set to a custom value might revert to its default value on the next sign-in. This issue can become worse because SAS Studio follows the same approach to store code snippets, tasks, autosave files, and the WEBWORK library.

Until SAS Studio 3.4, the only solution to this uncertainty was to have end users’ home directories shared across all the grid nodes. SAS Studio 3.5 removes this requirement by providing administrators with a new configuration option, webdms.studioDataParentDirectory.
This option specifies the location of SAS Studio preferences, snippets, my tasks, and more. The default value is blank, which means that the behavior is the same as in previous releases. An administrator can point it to any shared location to access all of this common data from any workspace server session.
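For illustration only, such a setting could look like the line below. The fragment assumes key=value syntax in the SAS Studio configuration file, and the directory shown is a hypothetical path that would have to be mounted on every grid node.

```properties
# Hypothetical shared directory, reachable from all grid nodes (assumption).
webdms.studioDataParentDirectory=/shared/sasstudio/userdata
```

With a shared location like this, preferences, snippets, and tasks would no longer depend on which grid node happens to host the workspace server session.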

MONITORING SAS STUDIO SESSIONS
If your grid uses Platform Suite for SAS, SAS Environment Manager provides a management module that enables administrators and end users to monitor a SAS grid cluster. While a grid administrator can see and manage the whole grid, including jobs submitted by any user, end users can take advantage of this module to monitor their own sessions. When using SAS Grid Manager for Hadoop, similar monitoring is provided by Hadoop through the YARN Resource Manager Web User Interface.

Display 6 shows an end user monitoring the SAS processes running on the grid after submitting the process flow shown in Display 2 and Display 4. We can recognize the two initial sessions, with IDs 785 and 786 and a status of RUNNING. The three additional sessions used for parallel execution have IDs 787, 788, and 789 and a status of DONE, because SAS Studio has already terminated them after 30 seconds of inactivity.

Display 6. Monitoring Job Execution Using the SAS Environment Manager Grid Module

CONCLUSION
In this paper, you have seen how SAS Studio can leverage a grid environment, thanks to multi-user load balancing and remote connect to grid sessions. These capabilities empower all SAS Studio users to benefit from distributed and parallel computing techniques, under the automatic monitoring, resource management, and orchestration of a grid controller.

REFERENCES
Haigh, Doug. 2015. “Divide and Conquer—Writing Parallel SAS Code to Speed Up Your SAS Program.” Proceedings of the SAS Global Forum 2015 Conference. Cary, NC: SAS Institute Inc. Available at s15/SAS1935-2015.pdf.

RECOMMENDED READING
Grid Computing in SAS 9.4. Available at dmgr/index.html.
SAS Studio: User’s Guide. Available at studio/index.html.
SAS Studio: Administrator’s Guide. Available at studio/index.html.
SAS Grid Computing Overview. Available at https://www.youtube.com/watch?v=BIK8JzDsSQg.
Working in SAS Studio. Available at https://www.youtube.com/watch?v=usnucvpnGLM&list=PLVBcK IpFVi9cajJtRel2uBLbtcLzWIN&index=4.
Riva, Edoardo. “SAS Studio on Grid.” Available at https://www.youtube.com/watch?v=ax-AbgjZs2M.
Riva, Edoardo. “Avoid the pitfalls of parallel jobs.” Available at the-pitfalls-of-parallel-jobs/.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Edoardo Riva
SAS Institute
1 919 531

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
