
EMC ISILON BEST PRACTICES GUIDE FOR HADOOP DATA STORAGE

ABSTRACT
This white paper describes the best practices for setting up and managing the HDFS service on an EMC Isilon cluster to optimize data storage for Hadoop analytics.

August 2017

The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

DELL EMC2, DELL EMC, and the DELL EMC logo are registered trademarks or trademarks of DELL EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners. Copyright 2017 EMC Corporation. All rights reserved. Published in the USA.

EMC believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

DELL EMC is now part of Dell Technologies.

Contents

INTRODUCTION
  Overview of Isilon for Big Data
  How Hadoop works with Scale-Out Isilon NAS
  NameNode Redundancy
  The HDFS Architecture of OneFS
HDFS INTEGRATION
  Supported Hadoop Distributions and Projects
  Isilon Cluster Integration with Hadoop
  Isilon OneFS Version
  Ambari and Hortonworks HDP
  Cloudera Manager CDH
  Access Zones
WORKING WITH HADOOP DISTRIBUTIONS
  HDFS Directories and Permissions
  UID and GID Parity
  Isilon Hadoop Tools
  Virtual Rack Implementations – Rack Configurations
  NameNode and DataNode Requests
  DataNode Load Balancing
  Pipeline Write Recovery
  SmartConnect
  IP Pools
    Static
    Dynamic
  HDFS Pool Usage and Assignments
    Racks Not Required – No Data Locality in Use
      Single Pool
    Racks Required – Data Locality to be Implemented
      Multiple Pools
      Multi-Use Pools
  Node Allocation in SmartConnect Pools
  IP Address Allocation by Interface
  LACP and Bonding
  Jumbo Frames – MTU 9000
ONEFS HDFS SETTINGS
  Checksum Type
  HDFS Blocksize
  HDFS Threads
  NANON with Ambari
  Improving HDFS Write Throughput – Isilon Write Coalescer
    Example output
  L1/L2 Cache Size
  Global Namespace Enabled
HADOOP CLUSTER CONFIGURATION
  Hadoop Compute Settings
    mapred.local.dir
    mapred.compress.map.output
    dfs.replication
    dfs.permissions
HDFS OPERATION OPTIMIZATIONS
  Data Protection – OneFS Protection Levels
  Aligning Workloads with Data Access Patterns
  Write Caching with SmartCache
  SSD Usage
ALIGN DATASETS WITH STORAGE POOLS
  SmartPools for Analytics Data
  Guidelines for File Pool Management
  Storage Pools – NodePools for Different Data Sets
  Storing Intermediate Jobs on an Isilon Cluster
  Dealing with Space Bottlenecks
  Review Data Settings on Files and Directories
HADOOP AND ONEFS KERBEROS INTEGRATION
  Implementation Approaches
  Ambari and Hortonworks HDP
  Cloudera Manager CDH
  DNS – SmartConnect
  Time
  Isilon SPN and ID Management
  hadoop.security.token.service.use_ip
MONITORING AND PERFORMANCE WITH INSIGHTIQ
CONCLUSION
CONTACTING DELL EMC ISILON TECHNICAL SUPPORT

INTRODUCTION
The Dell EMC Isilon scale-out network-attached storage (NAS) platform provides Hadoop clients with direct access to big data through a Hadoop Distributed File System (HDFS) interface. Powered by the distributed Dell EMC Isilon OneFS operating system, a Dell EMC Isilon cluster delivers a scalable pool of storage with a global namespace.

Hadoop compute clients access the data that is stored in an Isilon cluster by connecting over the HDFS protocol. Every node in the cluster can act as a NameNode and a DataNode. Each node boosts performance and expands the cluster's capacity. For Hadoop analytics, the Isilon scale-out distributed architecture minimizes bottlenecks, rapidly serves big data, and optimizes performance for analytic jobs.

An Isilon cluster fosters data analytics without ingesting data into an HDFS-based file system. With a Dell EMC Isilon cluster, you can store data on an enterprise storage platform with your existing workflows and standard protocols, including SMB, HTTP, FTP, REST, and NFS as well as HDFS. Regardless of whether you write the data with SMB or NFS, you can analyze it with a Hadoop compute cluster through HDFS. There is no need to set up an HDFS file system and then load data into it with tedious HDFS copy commands or specialized Hadoop connectors.

An Isilon cluster simplifies data management while cost-effectively maximizing the value of data. Although high-performance computing with Hadoop has traditionally stored data locally in the compute cluster's HDFS file system, the following use cases make a compelling case for coupling Hadoop-based analytics with Isilon scale-out NAS:

• Store data in a POSIX-compliant file system with SMB and NFS workflows and then access it through HDFS
• Scale storage independently of compute as your data sets grow
• Protect data more reliably and efficiently instead of replicating it with HDFS 3X mirror replication
• Eliminate HDFS copy operations to ingest data and Hadoop file system commands to manage data
• Implement distributed, fault-tolerant NameNode services
• Manage data with enterprise storage features such as deduplication and snapshots

Storing data in an Isilon scale-out NAS cluster instead of on the Hadoop compute clients streamlines the entire analytics workflow. Isilon's HDFS interface eliminates extracting the data from a storage system and loading it into an HDFS file system. Isilon's multiprotocol data access with SMB and NFS eliminates exporting the data after you analyze it. The result is that you can not only increase the ease and flexibility with which you analyze data, but also reduce capital expenditures and operating expenses.

This white paper describes the best practices for managing an Isilon cluster for Hadoop data analytics.

Overview of Isilon for Big Data
The Isilon scale-out platform combines modular hardware with unified software to provide the storage foundation for data analysis. Isilon scale-out NAS is a fully distributed system that consists of nodes of modular hardware arranged in a cluster. The distributed Isilon OneFS operating system combines the memory, I/O, CPUs, and disks of the nodes into a cohesive storage unit to present a global namespace as a single file system.

The nodes work together as peers in a shared-nothing hardware architecture with no single point of failure. Every node adds capacity, performance, and resiliency to the cluster, and each node acts as a Hadoop NameNode and DataNode. The NameNode daemon is a distributed process that runs on all the nodes in the cluster. A compute client can connect to any node in the cluster to access NameNode services.

As nodes are added, the file system expands dynamically and redistributes data, eliminating the work of partitioning disks and creating volumes. The result is a highly efficient and resilient storage architecture that brings all the advantages of an enterprise scale-out NAS system to storing data for analysis.

Unlike traditional storage, Hadoop's ratio of CPU, RAM, and disk space depends on the workload, which makes it difficult to size a Hadoop cluster before you have had a chance to measure your MapReduce workload. Expanding data sets also make upfront sizing decisions problematic. Isilon scale-out NAS lends itself perfectly to this scenario: it lets you increase CPUs, RAM, and disk space by adding nodes to dynamically match storage capacity and performance with the demands of a dynamic Hadoop workload.
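The multiprotocol workflow described above can be sketched with a short example. The commands below are illustrative only and assume a hypothetical NFS export of the Isilon HDFS root directory mounted at /mnt/isilon and a Hadoop client already configured against the cluster; adjust the mount point, paths, and file names to match your environment.

    # Write a data file to the Isilon cluster over NFS (an SMB copy works equally well)
    cp weblogs-2017-08.csv /mnt/isilon/data/weblogs/

    # The same file is immediately visible to Hadoop clients over HDFS; no ingest or DistCp step is required
    hdfs dfs -ls /data/weblogs/
    hdfs dfs -cat /data/weblogs/weblogs-2017-08.csv | head

Because OneFS presents a single file system over every protocol, data written over NFS or SMB can be analyzed over HDFS without any copy into a separate HDFS file system.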

An Isilon cluster optimizes data protection. OneFS protects data more efficiently and reliably than HDFS. The HDFS file system, by default, replicates a block of data three times. In contrast, OneFS stripes the data across the cluster and protects the data with forward error correction (FEC) codes, which consume less space than replication while providing better protection.

An Isilon cluster also includes enterprise features to back up your data and to provide high availability. For example, in managing your DataNode data, a best practice with a traditional Hadoop system is to back up your data to another system—an operation that must be performed with brute force by using a tool like DistCp. OneFS includes clones, NDMP backups, synchronization, geo-replication, snapshots, a file system journal, virtual hot spare, antivirus, IntegrityScan, dynamic sector repair (DSR), and accelerated drive rebuilds. For complete information about the data availability features of OneFS, see the white paper titled HIGH AVAILABILITY AND DATA PROTECTION WITH DELL EMC ISILON SCALE-OUT NAS.

The enterprise features of OneFS ease data management. OneFS includes storage pools, deduplication, automated tiering, quotas, high-performing SSDs, capacity-optimized HDDs, and monitoring with InsightIQ. SmartPools, for example, provides tiered storage so that you can store current data in a high-performance storage pool while storing older data in a lower, more cost-effective pool in case you need to analyze it again later. For more information about the enterprise features of OneFS, see the white paper titled Isilon OneFS Enterprise Features for Hadoop.

For security, OneFS can authenticate HDFS connections with Kerberos and LDAP providers to provide central authentication and identity management.

How Hadoop works with Scale-Out Isilon NAS
An Isilon cluster separates data from compute. As Hadoop clients execute jobs, the clients access the data stored on an Isilon cluster over HDFS. OneFS becomes the HDFS file system for Hadoop clients.

OneFS implements the server-side operations of the HDFS protocol on every node, and each node functions as both a NameNode and a DataNode. An Isilon node, however, does not act as a job tracker or a task tracker; those functions remain the purview of Hadoop clients. OneFS contains no concept of a secondary NameNode: since every Isilon node functions as a NameNode, the function of the secondary NameNode—checkpointing the internal NameNode transaction log—is unnecessary.

The cluster load balances HDFS connections across all the nodes in the cluster. Because OneFS stripes Hadoop data across the cluster and protects it with parity blocks at the file level, any node can simultaneously serve DataNode traffic as well as NameNode requests for file blocks.

A virtual rack feature mimics data locality. You can, for example, create a virtual rack of nodes to assign compute clients to the nodes that are closest to a client's network switch if doing so is necessary to work with your network topology or to optimize performance.

Client computers can access any node in the cluster through dual 10 GigE network interfaces (or 40 GigE on newer nodes). A SmartConnect license adds additional network resilience with IP address pools that support multiple DNS zones in a subnet as well as IP failover. In OneFS, an IP address pool appears as groupnet:subnet:poolname. Networking topologies and implementation best practices are discussed later in this paper.

NameNode Redundancy
Since every node runs the OneFS HDFS service, every node can simultaneously serve NameNode requests for file blocks and DataNode traffic. A cluster thus inherently provides NameNode redundancy as long as you follow the standard Isilon practice of setting your clients to connect to the cluster's SmartConnect zone DNS entry. The result: there is no single point of failure.

SmartConnect uses a round-robin algorithm to distribute NameNode sessions. When a Hadoop client first tries to connect to a NameNode, OneFS SmartConnect routes the traffic to a node, which will serve as the client's NameNode. The client's subsequent NameNode requests go to the same node. When a second Hadoop client connects to the cluster's SmartConnect DNS entry, OneFS balances the traffic, by default with round robin, routing the connection to a different node than the one used by the previous client. In this way, OneFS evenly distributes NameNode connections across the cluster to make maximum use of node interfaces.

The HDFS Architecture of OneFS
When an HDFS client connects to a NameNode to query or modify data or metadata in the OneFS file system, every node has access to that data and metadata. The metadata includes the logical location of data for the file stream—that is, the addresses of the Isilon nodes on which the blocks reside. An HDFS client can modify the metadata through the NameNode's RPC interface. OneFS protects HDFS metadata at the same protection level as HDFS file data. In fact, OneFS handles all the metadata; you do not need to worry about managing it or backing it up.
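Hadoop clients reach this NameNode RPC interface by connecting to the cluster's SmartConnect zone name rather than to an individual node, as described in the NameNode Redundancy section above. The fragment below is a minimal sketch of the corresponding client-side setting in core-site.xml; the zone name subnet0-pool0.isilon.example.com is a hypothetical placeholder, and 8020 is the typical HDFS RPC port on OneFS, so verify both values against your own cluster, Access Zone, and distribution before using them.

    <!-- core-site.xml fragment (sketch): point the default file system at the SmartConnect zone -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://subnet0-pool0.isilon.example.com:8020</value>
    </property>

In Ambari or Cloudera Manager, the same value is usually entered through the management interface rather than by editing the file directly.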

With Isilon, the NameNode daemon translates HDFS semantics and data layout into OneFS semantics and file layout. For example, the NameNode translates a file's path, offset, and length into lists of block IDs and generation stamps.

A DataNode stores blocks of files. More specifically, the DataNode maps blocks to block data. With HDFS, a block is an inode-offset pair that refers to a part of a file. With OneFS, you can set the size of an HDFS block to optimize performance. A Hadoop client can connect to a DataNode to read or write a block, but the client may not write a block twice or delete a block. To transfer a block to a DataNode, the HDFS client encapsulates the block in packets and sends them over a TCP/IP connection. The Isilon HDFS daemon performs zero-copy system calls to read and write blocks to the file system. On OneFS, the DataNode reads packets from and writes packets to disk.

To manage writes, OneFS implements the same write semantics as the Apache implementation of HDFS: files are append-only and may be written to by only one client at a time. Concurrent writes are permitted only to different files. As with the Apache implementation, OneFS permits one lock per file and provides a mechanism for releasing the locks or leases of expired clients.

HDFS Integration
This section describes the best practices that can help set up an Isilon cluster to work with Hadoop distributions to solve real-world problems in data analysis. This section does not, however, cover how to perform many of the tasks that it refers to, such as creating an Access Zone or configuring SmartConnect zones; those instructions can be found in the referenced implementation guides and Isilon documentation.

Supported Hadoop Distributions and Projects
A Dell EMC Isilon cluster is platform agnostic for compute, and there is no vendor lock-in: you can run most of the common Hadoop distributions with an Isilon cluster, including Apache Hadoop, Hortonworks HDP, Cloudera CDH, IBM BigInsights, and others.

Clients running different Hadoop distributions or versions can connect to the cluster simultaneously. For example, you can point both Cloudera CDH and Hortonworks HDP at the same data on your Isilon cluster and run MapReduce jobs from both distributions at the same time.

Since the pace of development and release of Hadoop distributions and projects is rapid, the most up-to-date list of distributions and applications supported by OneFS can be found here: https://community.emc.com/docs/DOC-37101.

Isilon Cluster Integration with Hadoop
The process of installing a Hadoop distribution and integrating it with an Isilon cluster varies by distribution, requirements, objectives, network topology, security policies, and many other factors. The following overview assumes that you have installed a supported distribution of Hadoop and verified that it works.

Throughout this paper, the instructions and examples for configuring Hadoop abstract away from the specific Hadoop distribution; you might have to adapt the instructions and commands to your distribution, including translating the code in a configuration file to its corresponding field in a distribution's graphical user interface, as illustrated in the sketch below.
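As a small illustration of that translation, the fragment below shows how one core-site.xml property is expressed in XML form. The property hadoop.security.token.service.use_ip and the value false are shown only as a format example; whether to change this property is covered in the Kerberos integration section later in this paper and depends on your environment. In Ambari the same property is typically added under HDFS > Configs > Custom core-site, and in Cloudera Manager under the core-site.xml Advanced Configuration Snippet (Safety Valve); exact field names vary by product version.

    <!-- core-site.xml fragment (sketch): the same property appears as a GUI field in Ambari or Cloudera Manager -->
    <property>
      <name>hadoop.security.token.service.use_ip</name>
      <value>false</value>
    </property>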
For instructions on how to install and configure Hadoop on a compute client, see the documentation for your Hadoop distribution.

Best Practice: Review release compatibility and supportability with your OneFS version: https://community.emc.com/docs/DOC-37101

Isilon OneFS Version
Best Practice: It is recommended to implement at least OneFS version 8.0.1.x for pure HDFS workflows because of significant feature enhancements made in this version of OneFS, mainly DataNode load balancing and pipeline write recovery, which are discussed later in this paper.

Additional requirements and workflows may dictate that a different version of OneFS is used, but note that versions prior to 8.0.1.0 do not contain these major feature enhancements.
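Because the recommendation above is tied to a specific OneFS release train, it is worth confirming which release a cluster is actually running before planning an HDFS deployment. A minimal check from any node's command line (for example, over SSH) is sketched below; the exact output format varies by release.

    # Report the installed OneFS release on the cluster
    isi version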

We recommend that you always follow the Isilon Hadoop implementation guides below for installation and configuration, depending on your distribution:

Ambari and Hortonworks HDP
• EMC Isilon OneFS with Hadoop and Hortonworks Installation Guide
• EMC Isilon OneFS with Hadoop and Hortonworks for Kerberos Installation Guide
