Dell Ready Bundle for Cloudera Hadoop
Architecture Guide
Version 5.10
Dell Converged Platforms and Solutions

Contents

List of Figures
List of Tables
Trademarks
Glossary
Notes, Cautions, and Warnings

Chapter 1: Dell Ready Bundle for Cloudera Hadoop Overview
- Introducing the Dell Ready Bundle for Cloudera Hadoop
- Solution Use Case Summary
- Solution Components
- ETL Solution Components
- Software Support
- Cloudera Enterprise Software Overview
- Hadoop for the Enterprise
- Rethink Data Management
- What's Inside?
- Cloudera Enterprise Data Hub
- Syncsort DMX-h Overview
- Hadoop for Data Transformation

Chapter 2: Cluster Architecture
- High-Level Node Architecture
- Node Definitions
- Network Fabric Architecture
- Network Definitions
- Cluster Sizing
- Rack
- Pod
- Cluster
- Sizing Summary
- High Availability
- Hadoop Redundancy
- Network Redundancy
- HDFS Highly Available NameNodes
- Resource Manager High Availability
- Database Server High Availability

Chapter 3: Hardware Architecture
- Server Infrastructure Options
- Dell PowerEdge R730xd Server
- Dell PowerEdge FX Architecture

Chapter 4: Network Architecture
- Cluster Networks
- Physical Network Components
- Server Node Connections
- Pod Switches
- Cluster Aggregation Switches
- Core Network
- Layer 2 and Layer 3 Separation
- iDRAC Management Network
- Network Equipment Summary

Chapter 5: Cloudera Enterprise Software
- Cloudera Manager
- Cloudera RTQ (Impala)
- Cloudera Search
- Cloudera BDR
- Cloudera Navigator
- Cloudera Support

Chapter 6: Syncsort Software
- Syncsort DMX-h Engine
- Syncsort DMX-h Service
- Syncsort DMX-h Client
- Syncsort SILQ

Chapter 7: Deployment Methodology

Appendix A: Physical Rack Configuration - Dell PowerEdge R730xd
- Dell PowerEdge R730xd Single Rack Configuration
- Dell PowerEdge R730xd Initial Rack Configuration
- Dell PowerEdge R730xd Additional Pod Rack Configuration

Appendix B: Physical Chassis Configuration - Dell PowerEdge FX2
- Physical Chassis Configuration - Dell PowerEdge FX2

Appendix C: Physical Rack Configuration - Dell PowerEdge FX2
- Dell PowerEdge FX2 Single Rack Configuration
- Dell PowerEdge FX2 Initial and Second Pod Rack Configuration

Appendix D: Update History
- Changes in Version 5.10

Appendix E: References
- About Cloudera
- About Syncsort
- To Learn More

List of Figures

Figure 1: Dell Ready Bundle for Cloudera Hadoop Components
Figure 2: ETL Solution Components
Figure 3: Cluster Architecture
Figure 4: Cluster Network Fabric Architecture
Figure 5: Dell PowerEdge R730xd Servers – 2.5” and 3.5” Chassis Options
Figure 6: Dell PowerEdge FX2 Components
Figure 7: Dell PowerEdge FX2 Server Chassis with Dual Dell PowerEdge FC630 and Dual Dell PowerEdge FD332
Figure 8: Hadoop Network Connections
Figure 9: PowerEdge R730xd Node Network Ports
Figure 10: Dell PowerEdge FX2 Infrastructure Chassis Network Ports
Figure 11: Dell PowerEdge FX2 Worker Chassis Network Ports
Figure 12: Single Pod Networking Equipment
Figure 13: Dell Networking S6000-ON Multi-pod Networking Equipment
Figure 14: Multi-Pod View Using Dell Networking Z9100 Switches (Based on Layer-3 ECMP)

List of Tables

Table 1: Big Data Solution Use Cases
Table 2: ETL Solution Use Cases
Table 3: Data processing and access components
Table 4: Dell Ready Bundle for Cloudera Hadoop Support Matrix
Table 5: Included Products
Table 6: Advanced Components
Table 7: DMX-h ETL Edition Key Features
Table 8: Cluster Node Roles
Table 9: Service Locations
Table 10: Networks
Table 11: Cloudera Distribution for Apache Hadoop Network Definitions
Table 12: Recommended Cluster Size
Table 13: Alternative Cluster Sizes
Table 14: Rack and Pod Density Scenarios
Table 15: Hardware Configurations – Dell PowerEdge R730xd Infrastructure Nodes
Table 16: Hardware Configurations – Dell PowerEdge R730xd Worker Nodes
Table 17: Hardware Configurations – Dell PowerEdge FX2 FC630 Infrastructure Nodes
Table 18: Hardware Configurations – Dell PowerEdge FX2 FC630 Worker Nodes
Table 19: Cluster Networks
Table 20: Bond / Interface Cross Reference
Table 21: Per Rack Network Equipment
Table 22: Per Pod Network Equipment
Table 23: Per Cluster Aggregation Network Switches for Multiple Pods
Table 24: Per Node Network Cables Required – 10GbE Configurations
Table 25: Cloudera Support Features
Table 26: Single Rack Configuration – Dell PowerEdge R730xd
Table 27: Initial Pod Rack Configuration – Dell PowerEdge R730xd
Table 28: Additional Pod Rack Configuration – Dell PowerEdge R730xd
Table 29: Infrastructure Chassis Configuration – Dell PowerEdge FX2
Table 30: Worker Chassis Configuration – Dell PowerEdge FX2
Table 31: Single Rack Configuration – Dell PowerEdge FX2
Table 32: Initial and Second Pod Rack Configuration – Dell PowerEdge FX2

Trademarks

Copyright 2011-2017 Dell Inc. or its subsidiaries. All rights reserved. Dell and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.

This document is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as-is and without expressed or implied warranties of any kind.

Glossary

ASCII: American Standard Code for Information Interchange, a binary code for alphanumeric characters developed by ANSI.
BMC: Baseboard Management Controller
BMP: Bare Metal Provisioning
CDH: Cloudera Distribution for Apache Hadoop
Clos: A multi-stage, non-blocking network switch architecture. It reduces the number of required ports within a network switch fabric.
CMC: Chassis Management Controller
DBMS: Database Management System
DTK: Dell OpenManage Deployment Toolkit
EBCDIC: Extended Binary Coded Decimal Interchange Code, a binary code for alphanumeric characters developed by IBM.
ECMP: Equal Cost Multi-Path
EDW: Enterprise Data Warehouse
EoR: End-of-Row Switch/Router
ETL: Extract, Transform, Load. A process for extracting data from various data sources, transforming the data into the proper structure for storage, and then loading the data into a data store.
HBA: Host Bus Adapter
HDFS: Hadoop Distributed File System
HVE: Hadoop Virtualization Extensions
IPMI: Intelligent Platform Management Interface
JBOD: Just a Bunch of Disks
LACP: Link Aggregation Control Protocol
LAG: Link Aggregation Group
LOM: Local Area Network on Motherboard
NIC: Network Interface Card
NTP: Network Time Protocol
OS: Operating System
PAM: Pluggable Authentication Modules, a centralized authentication method for Linux systems.
RPM: Red Hat Package Manager
RSTP: Rapid Spanning Tree Protocol
RTO: Recovery Time Objectives
SIEM: Security Information and Event Management
SLA: Service Level Agreement
THP: Transparent Huge Pages
ToR: Top-of-Rack Switch/Router
VLT: Virtual Link Trunking
VRRP: Virtual Router Redundancy Protocol
YARN: Yet Another Resource Negotiator

Notes, Cautions, and Warnings

Note: A Note indicates important information that helps you make better use of your system.

Caution: A Caution indicates potential damage to hardware or loss of data if instructions are not followed.

Warning: A Warning indicates a potential for property damage, personal injury, or death.

This document is for informational purposes only and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.

Chapter 1: Dell Ready Bundle for Cloudera Hadoop Overview

Topics:
- Introducing the Dell Ready Bundle for Cloudera Hadoop
- Solution Use Case Summary
- Solution Components
- Cloudera Enterprise Software Overview
- Syncsort DMX-h Overview

The Dell Ready Bundle for Cloudera Hadoop lowers the barrier to adoption for organizations intending to use Apache Hadoop in production.

Hadoop is an Apache project built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to this project, and uses Apache Hadoop extensively across its businesses. Core committers on the Hadoop project include employees from Cloudera, eBay, Facebook, Getopt, Hortonworks, Huawei, IBM, InMobi, INRIA, LinkedIn, MapR, Microsoft, Pivotal, Twitter, UC Berkeley, VMware, WANdisco, and Yahoo!, with contributions from many more individuals and organizations.

Introducing the Dell Ready Bundle for Cloudera Hadoop

Although Hadoop is popular and widely used, installing, configuring, and running a production Hadoop cluster involves multiple considerations, including:
- The appropriate Hadoop software distribution and extensions
- Monitoring and management software
- Allocation of Hadoop services to physical nodes
- Selection of appropriate server hardware
- Design of the network fabric
- Sizing and scalability
- Performance

These considerations are complicated by the need to understand the type of workloads that will run on the cluster, the fast-moving pace of the core Hadoop project, and the challenges of managing a system designed to scale to thousands of nodes in a single cluster.

Dell’s customer-centered approach is to create rapidly deployable and highly optimized end-to-end Hadoop solutions running on hyperscale hardware. Dell listened to its customers and designed a Hadoop solution that combines optimized hardware, software, and services to streamline deployment and improve the customer experience.

The Dell Ready Bundle for Cloudera Hadoop was jointly designed by Dell and Cloudera, and embodies all the hardware, software, resources, and services needed to run Hadoop in a production environment. This end-to-end solution approach means that you can be in production with Hadoop in a shorter time than is typically possible with homegrown solutions.

The solution is based on Cloudera Enterprise software and on Dell PowerEdge and Dell Networking hardware. This solution includes components that span the entire solution stack:
- Dell Ready Bundle for Cloudera Hadoop Architecture Guide and best practices
- Optimized server configurations
- Optimized network infrastructure
- Cloudera Enterprise

Solution Use Case Summary

The Dell Ready Bundle for Cloudera Hadoop is designed to address the use cases described in Table 1: Big Data Solution Use Cases.

Table 1: Big Data Solution Use Cases

| Use case | Description |
|---|---|
| Big data analytics | Ability to query petabyte-scale unstructured and semi-structured data in real time using HBase and Hive. |
| Data storage | Collect and store unstructured and semi-structured data in a secure, fault-resilient, scalable data store that can be organized and sorted for indexing and analysis. |
| Batch processing of unstructured data | Ability to batch-process (index, analyze, etc.) tens to hundreds of petabytes of unstructured and semi-structured data. |
| Data archive | Active archival of medium-term (12–36 months) data from EDW/DBMS to expedite access, increase data retention time, or meet data retention policies or compliance requirements. |
| Big data visualization | Capture, index, and visualize unstructured and semi-structured big data in real time. |
| Search and predictive analytics | Crawl, extract, index, and transform semi-structured and unstructured data for search and predictive analytics. |

The Dell Ready Bundle for Cloudera Hadoop with Syncsort is designed to address the use cases described in Table 2: ETL Solution Use Cases.

Table 2: ETL Solution Use Cases

| Use case | Description |
|---|---|
| ETL offload | Offload ETL processing from an RDBMS or enterprise data warehouse into a Hadoop cluster. |
| Data warehouse optimization | Augment the traditional relational database management system or enterprise data warehouse with Hadoop. Hadoop acts as a single data hub for all data types. |
| Integration with data warehouse | Extract, transfer, and load data into and out of Hadoop into a separate DBMS for advanced analytics. |
| High-performance data transformations | Includes high-performance sort, joins, aggregations, multi-key lookup, advanced text processing, hashing functions, and source/record/field-level operations. |
| Mainframe data ingestion & translation | Reads files directly from the mainframe, then parses and transforms the data (packed decimal, OCCURS DEPENDING ON, EBCDIC/ASCII, multi-format records, and more) without installing any software on the mainframe and without writing any code. |

Solution Components

Figure 1: Dell Ready Bundle for Cloudera Hadoop Components illustrates the primary components in the Dell Ready Bundle for Cloudera Hadoop.

Figure 1: Dell Ready Bundle for Cloudera Hadoop Components

The Dell PowerEdge servers, Dell Networking switches, and the operating system make up the foundation on which the solution software stack runs.

The Store layer components provide multiple layers of functionality on top of this foundation. The Hadoop Distributed File System (HDFS) provides the core storage for data files in the system. HDFS is a distributed, scalable, reliable, and portable file system. Apache Kudu provides a columnar relational storage option, while Apache HBase provides NoSQL access to storage. Object storage is also available.

The Integrate layer shows the components that can be used to move data in and out of the Hadoop system. Apache Sqoop provides data transfer to and from relational databases, while Apache Flume and Apache Kafka are optimized for real-time processing of event and log data. The HDFS API and tools can also be used to move data files to and from the Hadoop system.

YARN provides a resource management framework for running distributed applications under Hadoop. The most popular distributed application is Hadoop’s MapReduce, but other applications also run under YARN, such as Apache Spark, Apache Hive, and Apache Pig. Enterprise-grade security services are provided by Apache Sentry and RecordService.

The right side of the diagram shows the data management capabilities that are integrated across the entire system, while the left side shows the operational components for Hadoop administration and management provided by Cloudera Manager and Cloudera Director.

Sitting atop the Cloudera Enterprise core are multiple complementary processing and access alternatives:
- Batch data processing
- Stream data processing
- SQL query
- Data search
- Access API

All of these layers can be used simultaneously or independently, depending on the workload and problems being solved.
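The layering described above can be summarized as a simple lookup structure. The following Python sketch is illustrative only: the `LAYERS` dictionary and the `layer_of` helper are hypothetical names introduced here, not part of any Cloudera or Dell tooling, and the groupings simply restate the text of this section.

```python
from typing import Optional

# Component-to-layer mapping restated from the section text.
# LAYERS and layer_of are hypothetical names used for illustration only.
LAYERS = {
    "Store": ["HDFS", "Apache Kudu", "Apache HBase", "Object storage"],
    "Integrate": ["Apache Sqoop", "Apache Flume", "Apache Kafka", "HDFS API"],
    "Resource management": ["YARN"],
    "Security": ["Apache Sentry", "RecordService"],
    "Operations": ["Cloudera Manager", "Cloudera Director"],
}


def layer_of(component: str) -> Optional[str]:
    """Return the layer that lists the component, or None if it is not listed."""
    wanted = component.casefold()
    for layer, components in LAYERS.items():
        if any(name.casefold() == wanted for name in components):
            return layer
    return None
```

For example, `layer_of("Apache Kafka")` returns `"Integrate"`, matching the description of Kafka as a component for moving event and log data into the system.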

Table 3: Data processing and access components

| Access Layer | Description |
|---|---|
| Batch data processing | Spark, Hive, Pig, and MapReduce provide access to the massively parallel Hadoop data processing framework. |
| Stream data processing | Spark can also be used for stream processing. |
| SQL query | Impala provides SQL query access to data. |
| Data search | Apache Solr provides real-time search of indexed data. |
| General Access | Kite provides a high-level data API for Hadoop. |

ETL Solution Components

The ETL Solution is a variation of the architecture that adds Syncsort DMX-h to the installation. Figure 2: ETL Solution Components illustrates this variation.

Figure 2: ETL Solution Components

The Syncsort DMX-h ETL engine runs under the YARN resource management framework, and adds scalable ETL processing to the cluster. DMX-h can access mainframe and RDBMS data, while events can be handled using either Flume or Kafka. DMX-h coexists with all other available Hadoop components, and can be used in conjunction with them.

Software Support

Table 4: Dell Ready Bundle for Cloudera Hadoop Support Matrix describes where you can obtain technical support for the various components of the Dell Ready Bundle for Cloudera Hadoop.

Table 4: Dell Ready Bundle for Cloudera Hadoop Support Matrix

| Category | Component | Version | Available Support |
|---|---|---|---|
| Operating System | Red Hat Enterprise Linux Server | 7.3 | Red Hat Linux support |
| Operating System | CentOS | 7.3 | Dell Hardware support |
| Java Virtual Machine | Sun Oracle JVM | Java 7 (1.7.0_67), Java 8 (1.8.0_60) | N/A |
| Hadoop | Cloudera Enterprise | 5.10 | Cloudera support |
| Hadoop | Cloudera Manager | 5.10 | Cloudera support |
| Hadoop | Cloudera Navigator | 2.9 | Cloudera support |
| ETL Engine | Syncsort DMX-h | 9.2 | Syncsort support |

Cloudera Enterprise Software Overview

Cloudera Enterprise helps you become information-driven by leveraging the best of the open source community with the enterprise capabilities you need to succeed with Apache Hadoop in your organization.

Hadoop for the Enterprise

Designed specifically for mission-critical environments, Cloudera Enterprise includes CDH, the world’s most popular open source Hadoop-based platform, as well as advanced system management and data management tools plus dedicated support and community advocacy from our world-class team of Hadoop developers and experts. Cloudera is your partner on the path to big data.

Cloudera Enterprise, with Apache Hadoop at the core, is:
- Unified: one integrated system, bringing diverse users and application workloads to one pool of data on common infrastructure; no data movement required
- Secure: perimeter security, authentication, granular authorization, and data protection
- Governed: enterprise-grade data auditing, data lineage, and data discovery
- Managed: native high-availability, fault-tolerance and self-healing storage, automated backup and disaster recovery, and advanced system and data management
- Open: Apache-licensed open source to ensure your data and applications remain yours, and an open platform to connect with all of your existing investments in technology and skills

Rethink Data Management

- One massively scalable platform to store any amount or type of data, in its original form, for as long as desired or required
- Integrated with your existing infrastructure and tools
- Flexible to run a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search, and advanced analytics
- Robust security, governance, data protection, and management that enterprises require

With Cloudera Enterprise, today’s leading organizations put their data at the center of their operations, to increase business visibility and reduce costs, while successfully managing risk and compliance requirements.

What's Inside?

Table 5: Included Products

| Product | Description |
|---|---|
| CDH | At the core of Cloudera Enterprise is CDH, which combines Apache Hadoop with a number of other open source projects to create a single, massively scalable system where you can unite storage with an array of powerful processing and analytic frameworks. |
| Automated Cluster Management - Cloudera Manager | Cloudera Enterprise includes Cloudera Manager to help you easily deploy, manage, monitor, and diagnose issues with your cluster. Cloudera Manager is critical for operating clusters at scale. |
| Cloudera Support | Get the industry’s best technical support for Hadoop. With Cloudera Support, you’ll experience more uptime, faster issue resolution, better performance to support your mission critical applications, and faster delivery of the platform features you care about. |

Cloudera Enterprise Data Hub

Cloudera Enterprise also offers support for several advanced components that extend and complement the value of Apache Hadoop:

Table 6: Advanced Components

| Component | Description |
|---|---|
| Online NoSQL – HBase | HBase is a distributed key-value store that helps you build real-time applications on massive tables (billions of rows, millions of columns) with fast, random access. |
| Analytic SQL – Impala | Impala is the industry’s leading massively-parallel processing (MPP) SQL engine built for Hadoop. |
| Search – Cloudera Search | Cloudera Search, based on Apache Solr, lets your users query and browse data in Hadoop just as they would search Google or their favorite e-commerce site. |
| In-Memory Machine Learning and Stream Processing – Apache Spark | Spark delivers fast, in-memory analytics and real-time stream processing for Hadoop. |
| Data Management – Cloudera Navigator | Cloudera Navigator provides critical enterprise data audit, lineage, and data discovery capabilities that enterprises require. Cloudera Navigator includes Active Data Optimization (Cloudera Navigator Optimizer), Governance & Data Manage |
