
Reference Architecture: Splunk Enterprise with ThinkSystem Servers

Last update: 30 July 2018
Version 1.0

• Describes reference architecture for Splunk Enterprise
• Contains sizing recommendations
• Includes four different deployment models from department to large enterprise
• Contains detailed bill of materials for Lenovo servers and networking

Mike Perks
Kenny Bain

Table of Contents
1 Introduction
2 Business problem and business value
  2.1 Business problem
  2.2 Business value
3 Requirements
  3.1 Functional requirements
  3.2 Non-functional requirements
4 Architectural overview
5 Component Model
6 Operational model
  6.1 Operational model scenarios
  6.2 Hardware components
  6.3 Servers
  6.4 Systems management
  6.5 Networking
  6.6 Racks
  6.7 Operating Systems
7 Appendix: Bill of Materials
  7.1 Server BOM
  7.2 Networking BOM
  7.3 Rack BOM
Resources

Reference Architecture: Splunk Enterprise with ThinkSystem Servers, version 1.0

1 Introduction
This document describes the reference architecture for Splunk Enterprise using Lenovo ThinkSystem servers and networking. It is intended for IT professionals, technical architects, sales engineers, and consultants who plan, design, and implement Splunk Enterprise 7.1.1.

This document provides an overview of the business problem and business value that is addressed by Splunk Enterprise. A description of customer requirements is followed by an architectural overview of the solution and a description of the logical components. The operational model describes the recommended operational architecture of Splunk Enterprise and four different deployment scenarios using Lenovo ThinkSystem servers and network switches. The appendix features detailed Bill of Materials configurations that are used in the solution.

2 Business problem and business value
The following section provides a summary of the business problems that this reference architecture is intended to help address, and the value that this solution can provide.

2.1 Business problem
The advent of mobile data, social streams, clouds and interconnected everything signifies the "Transformation of Information" with a huge shift in data usage. It delivers on the promise of analysis of big data to identify patterns in statistical populations vs. traditional reliance on data modeling tools, queries, spreadsheet dashboards and charts.

Global enterprises are under competitive pressure to expand into new markets, to find clients and build customer loyalty. To yield real-time insights, they now leverage technology to sift through their data instantaneously, rather than relying on after-the-fact data processing on a monthly, quarterly, or yearly basis, which typically results in a potential loss of competitive advantage. Agility, security, cost-effectiveness, flexibility and efficiency are key priorities for their IT. Picture a bank sifting through its enormous data to recognize fraud, with a response time of a few microseconds, during an ATM transaction, or an auto insurer receiving real-time updates on driving habits from sensors installed in clients' vehicles.

While customers are faced with many business challenges, this solution highlights two specific Big Data challenges that represent significant opportunities. The first challenge focuses on real-time identification and mitigation of advanced organizational security threats to the Enterprise by leveraging vigilant analysis and response capabilities. The second challenge is the complexity of managing the abundance of systems prevalent in a data center, and ensuring high performance and availability of these systems every day.

2.1.1 Vigilant enterprise security intelligence
Organizational security threats are no longer just a story line for spy thrillers. Global newsfeeds abound daily with compromised websites, stolen credit card data, abnormal HTTP traffic, financial fraud, and malware presence. Detecting advanced Enterprise Security threats requires a new approach, enabled by a smart and scalable security intelligence platform (SIP). SIP makes any data security relevant, scales to tens of terabytes of data per day and provides real-time analysis and response capabilities.

2.1.2 Operations analysis of machine data in data centers
It is an extremely complex effort to efficiently manage the abundance of systems deployed in a typical data center. On a daily basis, several systems experience outages, performance issues, or missed SLAs. To ensure high performance and availability, Enterprise IT administration teams waste valuable resources accessing several management consoles and running home-grown scripts to serially trace the valuable data they need from failed systems. This is machine data, a form of Big Data.

2.2 Business value
Splunk Enterprise provides an end-to-end, real-time solution for both of these business problems by delivering the following core capabilities:
• Universal collection and indexing of machine data and security data, from virtually any source
• Powerful search processing language (SPL) to search and analyze real-time and historical data
• Real-time monitoring for patterns and thresholds; real-time alerts when specific conditions arise
• Powerful reporting and analysis
• Custom dashboards and views for different roles
• Resilience and horizontal scalability
• Granular role-based security and access controls
• Support for multi-tenancy and flexible, distributed deployments on-premises or in the cloud
• Robust, flexible platform for big data apps

In addition, the Lenovo XClarity Administrator App for Splunk enables collection, visual representation, and analysis of Lenovo hardware events from the Splunk platform. Here are some examples of the critical insights that can be gained from the XClarity Administrator App for Splunk:
• The volume and types of events generated over time from all monitored hardware. This will help administrators quickly identify problem hardware and take actions.
• Percentage of total events being surfaced by each end point type such as the chassis management module (CMM), switch module, server, etc.
• Number of times when a power threshold has been exceeded for any XClarity-managed resource, over time. This can help identify environmental issues in the data center. If exceeding power thresholds caused power capping, this could also explain performance slowdowns.
• Number of user accounts that were created on XClarity instances over time. Spikes in the number of new accounts could help identify uncommon security activities for audit purposes.
• User IDs that attempted to authenticate to XClarity, but failed. Seeing which unauthorized user IDs were used to attempt access would be useful in system audits.
• Number of login attempts made outside of normal business hours. This may help identify uncommon user account activity, like a large number of login attempts in the middle of the night or on a weekend.

3 Requirements
This section describes the functional and non-functional requirements for this reference architecture.

3.1 Functional requirements
The key functional requirements for the Splunk Enterprise solution include:
• Support for collecting, indexing and searching data
• Support for real-time processing of data
• Support for a variety of data and data types, including security data and machine data
• Support for large volumes of data

In addressing the functional requirements, the reference architecture and sizing for the Splunk Enterprise solution must consider the following data requirements:
• The amount of incoming data.
• The amount of indexed data in the datastore.
• Data placement in relevant storage tiers (in accordance with Splunk Indexer Data Retirement & Archiving Policies).
• Data indexing performance, which is influenced by the choice of searches and the number of concurrent users.
• Deployment and execution of Splunk ecosystem applications such as Lenovo XClarity App for Splunk and Splunk App for Enterprise Security.
• Storage IO capabilities with the performance, scalability, and availability required to support the creation of extremely large, compressed data indexes and to run storage IO-intensive sparse searches against this data.

3.2 Non-functional requirements
The key non-functional requirement is to provide superior performance with both indexing data and searching data. The following shows the minimum performance requirements for Splunk Enterprise:
• Minimum performance for each Indexing Server
  o Up to 5.8 megabytes per second (or 500 GB per day) of raw indexing performance, provided no other Splunk activity is occurring.
• Minimum performance for each Search Server
  o Up to 50,000 events per second for dense searches
  o Up to 5,000 events per second for sparse searches
  o Up to 2 seconds per index bucket for super-sparse searches
  o From 10 to 50 buckets per second for rare searches with bloom filters

In addition, the Splunk infrastructure needs to support both scale up and scale out as well as high availability and resilience to a single point of failure.
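The indexing requirement above pairs 5.8 megabytes per second with 500 GB per day. As a rough check of that arithmetic, the following sketch converts a sustained rate into a daily volume. The function name and the use of decimal units (1 GB = 1000 MB) are assumptions made for illustration and are not part of the Splunk guidance.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def mb_per_second_to_gb_per_day(rate_mb_s: float) -> float:
    """Convert a sustained rate in MB/s to GB/day (decimal units assumed)."""
    return rate_mb_s * SECONDS_PER_DAY / 1000


if __name__ == "__main__":
    daily_gb = mb_per_second_to_gb_per_day(5.8)
    print(f"5.8 MB/s is about {daily_gb:.0f} GB per day")
```

Under these assumptions, 5.8 MB/s works out to roughly 501 GB per day, which is consistent with the rounded 500 GB per day figure.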

4 Architectural overview
Splunk Enterprise provides an application platform for real-time operational intelligence. It facilitates easy, fast and secure collection, analysis, and search of data from massive data streams generated by devices, applications, transactions, timed events, systems and technologies.

Figure 1 below shows the architectural overview of Splunk Enterprise. Users can access one or more search head servers through a load balancer. The search head(s) provide access to information that is collected by forwarders from a variety of data sources, possibly across multiple data centers.

[Figure 1: Architectural Overview of Splunk Enterprise. Diagram labels: Clients, Load Balancer, Search Head Cluster, Indexers, Forwarders, Deployment and License Server, Applications, Web Servers, App Servers, Databases, Hypervisors, OS, 3rd Party, Networks, Servers, Storage, Cloud Services]

5 Component Model
This section describes the component model for Splunk Enterprise. Figure 2 shows an overview of the major components.

[Figure 2: Component Model of Splunk Enterprise. Diagram labels: Web Browser, Splunk CLI, HTTP Protocol, Splunk Web Server, Splunk Deployment Monitor App, Lenovo XClarity App, Other Apps, REST Protocol, Deployment Server, Search Head, License Server, Indexer, Data Routing, Cloning and Load Balancing, Forwarder]

5.1.1 Forwarders
Forwarders collect data and send it to a Splunk deployment for indexing and searching. A particular environment could have thousands of forwarders executing on all different types of hardware. A forwarder represents a more robust solution than raw network feeds, with capabilities to:
• Tag metadata
• Buffer, compress and secure data
• Run local scripts to collect or massage the data
• Use any available network ports on the remote device

5.1.2 Indexers
The indexer is the Splunk Enterprise component that creates and manages indexes. The primary functions of an indexer are:
• Indexing incoming data.
• Searching the indexed data.

5.1.3 Search heads
For large amounts of indexed data and numerous users concurrently searching on the data, it can make sense to distribute the indexing load across several indexers, while offloading the search query function to a separate machine. In this type of scenario, known as distributed search, one or more Splunk Enterprise components called search heads distribute search requests across multiple indexers.

5.1.4 Deployment server
The Splunk Enterprise deployment server is used to update a distributed deployment. The deployment server pushes out configurations and content to sets of Splunk Enterprise instances (referred to, in this context, as deployment clients), grouped according to any useful criteria, such as OS, machine type, application area, location, and so on. The deployment clients are usually forwarders or indexers. For example, after an updated configuration has been tested on a local Linux forwarder, it can be pushed to all of the Linux forwarders.

For small deployments, the deployment server can share a Splunk Enterprise instance with another Splunk Enterprise component, either a search head or an indexer. For larger deployments it should run on its own Splunk Enterprise instance.

5.1.5 License server
The license server manages Splunk Enterprise licenses. It often runs in the same Splunk Enterprise instance as the deployment server.

5.1.6 Splunk Webserver
Splunk provides a web user interface using a Python-based application server. It allows users to search and navigate data stored by Splunk servers and to manage the Splunk deployment.

5.1.7 Deployment monitor
Although it is actually an app, not a Splunk Enterprise component, the deployment monitor has an important role to play in distributed environments. Distributed deployments can scale to forwarders numbering into the thousands, sending data to many indexers, which feed multiple search heads. The deployment monitor can be used to view and troubleshoot these distributed deployments and it provides numerous views into the state of the forwarders and indexers.

5.1.8 Lenovo XClarity app
The Lenovo XClarity app for Splunk allows events to be forwarded from XClarity to the Splunk server listener. History and trends for different events can be viewed using the built-in user interface.

5.1.9 Other apps
Because Splunk provides a rich RESTful interface into its data and functionality, there are a large number of Splunk and third-party provided applications and add-ons.
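As a minimal sketch of how an app might use that RESTful interface, the example below builds (but does not send) an HTTP request that would create a search job. It assumes the `/services/search/jobs` endpoint on the Splunk management port; the host name, the port value 8089, the example search string, and the function name are illustrative assumptions, and authentication is deliberately left out.

```python
import urllib.parse
import urllib.request


def build_search_job_request(base_url: str, spl: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create a search job."""
    # Form-encode the search string and ask for a JSON response.
    body = urllib.parse.urlencode({"search": spl, "output_mode": "json"})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/services/search/jobs",
        data=body.encode("utf-8"),
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host; 8089 is assumed to be the management port in use.
    req = build_search_job_request(
        "https://splunk.example.com:8089", "search index=_internal | head 5"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A real client would add credentials or a session token and handle TLS verification before sending the request, for example with `urllib.request.urlopen`.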

6 Operational model
This section describes the options for mapping the logical components of Splunk Enterprise onto Lenovo ThinkSystem servers and Lenovo network switches. The “Operational model scenarios” section gives an overview of the examples and has pointers into the other sections for the related hardware. The BOM configurations are described in the appendix (section 7).

6.1 Operational model scenarios
The following scenarios are considered in this chapter:
• Departmental server
• Small enterprise (1/4 rack)
• Medium enterprise (1/2 rack)
• Large enterprise (full rack)

Below is a list of items that can have a significant impact on Splunk Enterprise performance.
• Amount of incoming data: increases processing time
• Amount of indexed data: increases I/O bandwidth needed to store and search on data
• Number of concurrent users performing searches, creating reports, or viewing dashboards
• Number and types of searches
• Number and unique performance, deployment, and configuration considerations for each Splunk app

Table 1 below gives sizing information for Splunk Enterprise and shows how many search heads and indexers are needed for different combinations of incoming data size and number of concurrent users. This table is taken from the Splunk Capacity Planning Manual.

Table 1: Splunk Performance Recommendations
Users       | Less than 2 GB per day | 2 to 300 GB per day  | 300 to 600 GB per day | 600 GB to 1 TB per day | 1 to 2 TB per day     | 2 to 3 TB per day
Less than 4 | 1 combined instance    | 1 combined instance  | 1 Search, 2 Indexers  | 1 Search, 3 Indexers   | 1 Search, 7 Indexers  | 1 Search, 10 Indexers
Max 8       | 1 combined instance    | 1 Search, 1 Indexer  | 1 Search, 2 Indexers  | 1 Search, 3 Indexers   | 1 Search, 8 Indexers  | 1 Search, 12 Indexers
Max 16      | 1 Search, 1 Indexer    | 1 Search, 1 Indexer  | 1 Search, 3 Indexers  | 2 Search, 4 Indexers   | 2 Search, 10 Indexers | 2 Search, 15 Indexers
Max 24      | 1 Search, 1 Indexer    | 1 Search, 2 Indexers | 2 Search, 3 Indexers  | 2 Search, 6 Indexers   | 2 Search, 12 Indexers | 3 Search, 18 Indexers
Max 48      | 1 Search, 2 Indexers   | 1 Search, 2 Indexers | 2 Search, 4 Indexers  | 2 Search, 7 Indexers   | 3 Search, 14 Indexers | 3 Search, 21 Indexers
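Table 1 can be read as a two-dimensional lookup keyed by daily data volume and concurrent users. The sketch below encodes it that way. The names `Sizing`, `cell`, and `recommend` are hypothetical, and two details are assumptions rather than statements from the table: a volume that falls exactly on a column boundary is assigned to the smaller column, and 1 TB is treated as 1000 GB.

```python
from typing import NamedTuple


class Sizing(NamedTuple):
    search_heads: int
    indexers: int
    combined: bool = False  # True: one instance acts as search head and indexer


COMBINED = Sizing(1, 1, combined=True)  # "1 combined instance" in Table 1


def cell(search_heads: int, indexers: int) -> Sizing:
    """Shorthand for a Table 1 cell with separate search heads and indexers."""
    return Sizing(search_heads, indexers)


# Upper bounds of the user rows ("Less than 4" is stored as 3).
USER_LIMITS = [3, 8, 16, 24, 48]
# Upper bounds in GB per day of columns 2 to 6 (column 1 is "less than 2 GB").
VOLUME_LIMITS_GB = [300, 600, 1000, 2000, 3000]

TABLE_1 = [
    [COMBINED, COMBINED, cell(1, 2), cell(1, 3), cell(1, 7), cell(1, 10)],
    [COMBINED, cell(1, 1), cell(1, 2), cell(1, 3), cell(1, 8), cell(1, 12)],
    [cell(1, 1), cell(1, 1), cell(1, 3), cell(2, 4), cell(2, 10), cell(2, 15)],
    [cell(1, 1), cell(1, 2), cell(2, 3), cell(2, 6), cell(2, 12), cell(3, 18)],
    [cell(1, 2), cell(1, 2), cell(2, 4), cell(2, 7), cell(3, 14), cell(3, 21)],
]


def recommend(gb_per_day: float, concurrent_users: int) -> Sizing:
    """Return the Table 1 cell for a daily volume and a concurrent user count."""
    row = next(
        (i for i, limit in enumerate(USER_LIMITS) if concurrent_users <= limit),
        None,
    )
    if gb_per_day < 2:
        col = 0
    else:
        col = next(
            (i for i, limit in enumerate(VOLUME_LIMITS_GB, start=1)
             if gb_per_day <= limit),
            None,
        )
    if row is None or col is None:
        raise ValueError("outside the range covered by Table 1")
    return TABLE_1[row][col]


if __name__ == "__main__":
    print(recommend(500, 16))   # 300 to 600 GB per day, at most 16 users
    print(recommend(1500, 48))  # 1 to 2 TB per day, at most 48 users
```

Inputs beyond 3 TB per day or 48 concurrent users raise an error instead of extrapolating, because the table gives no guidance for them.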

More data and more users can be supported by adding more search heads and indexers using the Splunk Enterprise scale-out architecture.

Table 2 shows how each of the four deployment scenarios is mapped to a specific data and user combination.

Table 2: Mapping of Deployment Scenarios
Attribute             | Departmental   | Small Enterprise | Medium Enterprise | Large Enterprise
Incoming data per day | Less than 2 GB | 300 to 600 GB    | 600GB to 1TB      | 1 to 2TB
Concurrent users      | Less than 4    | Maximum 16       | Maximum 24        | Maximum 48
Search [...]t server  | N/A [...]

The following sections give more details for each of the deployment areas:
• 6.2 Hardware components
• 6.3 Servers
• 6.4 Systems management
• 6.5 Networking
• 6.6 Racks
• 6.7 Operating Systems

6.2 Hardware components
The following section describes the hardware components that can be used for Splunk Enterprise.

6.2.1 Rack servers
You can use various rack-based Lenovo ThinkSystem server platforms to run Splunk Enterprise.

Lenovo ThinkSystem SR630
Lenovo ThinkSystem SR630 (as shown in Figure 3) is an ideal 2-socket 1U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximizing performance and flexibility for future growth. The SR630 server is designed to handle a wide range of workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), infrastructure security, systems management, enterprise applications, collaboration/email, streaming media, web, and HPC. The ThinkSystem SR630 offers up to twelve 2.5-inch or four 3.5-inch hot-swappable SAS/SATA HDDs or SSDs together with up to 10 on-board NVMe PCIe ports that allow direct connections to the U.2 NVMe PCIe SSDs.

Figure 3: Lenovo ThinkSystem SR630

For more information, see this website: lenovopress.com/lp0643

Lenovo ThinkSystem SR650
Lenovo ThinkSystem SR650 (as shown in Figure 4) is similar to the SR630 but in a 2U form factor.

Figure 4: Lenovo ThinkSystem SR650

The key differences compared to the SR630 server are more expansion slots and a chassis that supports up to twenty-four 2.5-inch or fourteen 3.5-inch hot-swappable SAS/SATA HDDs or SSDs together with up to 8 on-board NVMe PCIe ports that allow direct connections to the U.2 NVMe PCIe SSDs. The ThinkSystem SR650 server also supports up to two NVIDIA GRID cards for graphics acceleration.

For more information, see this website: lenovopress.com/lp0644

6.2.2 10 GbE networking
The standard network for Splunk Enterprise is 10 GbE. The following Lenovo 10GbE ToR switches are recommended:
• Lenovo ThinkSystem NE1032 RackSwitch
• Lenovo RackSwitch G8272

Lenovo ThinkSystem NE1032 RackSwitch
The Lenovo ThinkSystem NE1032 RackSwitch (as shown in Figure 5) is a 1U rack-mount 10 Gb Ethernet switch that delivers lossless, low-latency performance with a feature-rich design that supports virtualization, Converged Enhanced Ethernet (CEE), high availability, and enterprise class Layer 2 and Layer 3 functionality. The switch delivers line-rate, high-bandwidth switching, filtering, and traffic queuing without delaying data.

The NE1032 RackSwitch has 32x SFP ports that support 1 GbE and 10 GbE optical transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables. The switch helps consolidate server and storage networks into a single fabric, and it is an ideal choice for virtualization, cloud, and enterprise workload solutions.

Figure 5: Lenovo ThinkSystem NE1032 RackSwitch

For more information, see this website: lenovopress.com/lp0605

Lenovo RackSwitch G8272
The Lenovo RackSwitch G8272 uses 10Gb SFP and 40Gb QSFP Ethernet technology and is specifically designed for the data center. It is an enterprise class Layer 2 and Layer 3 full featured switch that delivers line-rate, high-bandwidth switching, filtering, and traffic queuing without delaying data. Large data center grade buffers help keep traffic moving, while the hot-swap redundant power supplies and fans (along with numerous high-availability features) help provide high availability for business sensitive traffic.

The RackSwitch G8272 (shown in Figure 6) is ideal for latency sensitive applications, such as high performance computing clusters and financial applications. In addition to the 10 Gb Ethernet (GbE) and 40 GbE connections, the G8272 can use 1 GbE connections.

Figure 6: Lenovo RackSwitch G8272

For more information, see this website: lenovopress.com/tips1267

6.2.3 1 GbE networking
The following Lenovo 1GbE ToR switches are recommended for use with Splunk Enterprise:
• Lenovo RackSwitch G7028
• Lenovo RackSwitch G8052

Lenovo RackSwitch G7028
The Lenovo RackSwitch G7028 (as shown in Figure 7) is a 1 Gb top-of-rack switch that delivers line-rate Layer 2 performance at an attractive price. The G7028 has 24 10/100/1000BASE-T RJ45 ports and four 10 Gb Ethernet SFP ports. It typically uses only 45 W of power, which helps improve energy efficiency.

Figure 7: Lenovo RackSwitch G7028

For more information, see this website: lenovopress.com/tips1268.

Lenovo RackSwitch G8052
The Lenovo System Networking RackSwitch G8052 (as shown in Figure 8) is an Ethernet switch that is designed for the data center and provides a virtualized, cooler, and simpler network solution. The Lenovo RackSwitch G8052 offers up to 48 1 GbE ports and up to four 10 GbE ports in a 1U footprint. The G8052 switch is always available for business-sensitive traffic by using redundant power supplies, fans, and numerous high-availability features.

Figure 8: Lenovo RackSwitch G8052

For more information, see this website: lenovopress.com/tips1270.

6.3 Servers
Splunk Enterprise runs best on bare-metal servers, as compared to virtual hardware. If Splunk is run in a virtual machine (VM) on any platform, performance degrades. This is because virtualization abstracts the physical system hardware into resource pools from which defined virtual machines draw as needed. Splunk needs sustained access to a number of resources, particularly disk I/O, for indexing operations. Running Splunk in a VM or alongside other VMs can cause reduced performance.

There are three kinds of servers for Splunk:
• Indexer
• Search head
• Deployment server

For very small deployments the search head can be combined into the indexer. For medium to large deployments a separate deployment server is needed, which can also support license management for the Splunk system. Each section below explores the Lenovo recommended configuration for the three kinds of compute servers.

See “Server BOM” (section 7.1) for the server bill of materials.

6.3.1 Indexer
An indexer needs to store a large amount of local data and each indexer can handle roughly 300GB of data per day. The Lenovo ThinkSystem SR650 is recommended with up to fourteen 3.5” drives. The hot and warm data should be stored on solid state drives (SSD) that have a high endurance and the cold data can be stored on 3.5” large capacity hard disk drives (HDD). NVMe drives are not used in the configuration.

The enterprise performance “HUSMM32” SSDs have 800GB and 1.6TB capacities and a 3.5” form factor. The sweet spot for HDD price/performance is 8TB. Lenovo also recommends 4TB drives for smaller storage capacities. Larger storage capacities will usually require more indexers and therefore it may not be necessary to use 10TB or larger HDDs.

The processor and memory depend on the customer environment and Lenovo recommends the following:

• Two Intel Xeon scalable 6140 processors (each 18 cores @2.3 GHz 140W TDP)
• 192 GB of RAM (using twelve 16GB DIMMs)

The operating system is stored on two mirrored M.2 480GB boot drives. Two mirrored hot swap 3.5” SSDs could be used but that would reduce the total number of drives available for indexing.

Table 3 lists the recommended Indexer SSD configurations for each of the 4 deployment scenarios to store hot and warm data.

Table 3: Indexer SSD configurations
Attribute              | Departmental | Small Enterprise | Medium Enterprise | Large Enterprise
Indexers               | 1            | 3                | 6                 | 14
Required storage       | 1TB          | 2.1TB            | 7.9TB             | 15.7TB
Required storage + 20% | [...]        | [...]            | [...]             | [...]
Storage per indexer    | [...]        | [...]            | [...]             | [...]
SSD raw capacity       | 3 x 800GB    | 3 x 800GB        | 4 x 800GB         | 3 x 800GB
RAID configuration     | RAID 5       | RAID 5           | RAID 5            | RAID 5
SSD actual capacity    | 1.47TB       | 1.47TB           | 2.18TB            | 1.47TB

For those cases that use only 3 SSDs, an extra SSD could be added as a hot spare.

Table 4 lists the recommended Indexer HDD configurations for each of the 4 deployment scenarios to store cold and archived data.

Table 4: Indexer HDD configurations
Attribute             | Departmental | Small Enterprise | Medium Enterprise | Large Enterprise
Indexers              | 1            | 3                | 6                 | 14
Required cold storage | 4.4TB        | 8.7TB            | 33.7TB            | 67.4TB
Archived storage      | 14.5TB       | 29.1TB           | 145TB             | 290TB
Total storage + 20%   | 23.9TB       | 45.4TB           | 214TB             | 429TB
Storage per indexer   | 23.9TB       | 15.1TB           | 35.7TB            | 30.6TB
HDD raw capacity      | 10 x 6TB     | 10 x 4TB         | 10 x 8TB          | 10 x 8TB
RAID configuration    | RAID 10      | RAID 10          | RAID 10           | RAID 10
HDD actual capacity   | 27.3TB       | 18.2TB           | 36.4TB            | 36.4TB

For optimum performance, disk availability, bandwidth and space should be maintained on the indexers. Ensure that the HDD volumes have 20% or more free space at all times, as HDD performance decreases proportionally to available space because disk seek times increase. This affects how fast Splunk indexes data, and can also determine how quickly search results, reports and alerts are returned. In a default Splunk installation, the drive(s) that contain your indexes must have at least 5GB of free disk space, or indexing will pause.

6.3.2 Search head
Because there is no local storage, a search head can use a 1U SR630. The recommended configuration is:
• Two Intel Xeon scalable 5118 processors (each 12 cores @2.3 GHz 105W TDP)
• 96 GB of RAM (using twelve 8GB DIMMs)

The operating system is stored on two mirrored M.2 480GB boot drives. As an alternative, two mirrored hot swap SSDs could be used.

6.3.3 Deployment server
The deployment and license server can use low performance processors. In order to provide redundancy for search heads, it is recommended to simply use the same configuration as a search head.

6.4 Systems management
Lenovo XClarity is used to manage Lenovo hardware. This section describes both Lenovo XClarity and the Lenovo XClarity Administrator App for Splunk. The combination provides scalable systems management and monitoring, and integrated analytics on top of the monitored data.

6.4.1 Lenovo XClarity Administrator
Lenovo XClarity Administrator is a centralized resource management solution that reduces complexity, speeds up response, and enhances the availability of Lenovo server systems and solutions.

The Lenovo XClarity Administrator provides agent-free hardware management for Lenovo’s ThinkSystem rack servers, System x rack servers, and Flex System compute nodes and components, including the Chassis Management Module (CMM) and Flex System I/O modules. Figure 9 shows the Lenovo XClarity administrator interface, in which Flex System components and rack servers are managed and are seen on the dashboard. Lenovo XClarity Administrator is a virtual appliance that is quickly imported into a virtualized environment server configuration.

Figure 9: XClarity Administrator interface

6.4.2 Lenovo XClarity Administrator App for Splunk
XClarity continuously listens for events from all the resources it manages. Most of these are received via standard protocols such as CIM (common information model) or SNMP (simple network management protocol). Users can either view a log of all these events in the XClarity GUI console, or configure “event forwarders”, which enable them to forward events to another external
