Analytics for NetApp E-Series AutoSupport Data Using Big Data Technologies

Jialiang Zhang

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. TechRpts/2016/EECS-2016-24.html
May 1, 2016

Copyright 2016, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Analytics for NetApp E-Series AutoSupport Data Using Big Data Technologies

by Jialiang Zhang

Masters Project Paper

Presented to the Faculty of the Graduate Division of The University of California at Berkeley in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Sciences

The University of California at Berkeley
May 2014

DO NOT CIRCULATE

Acknowledgements

Many thanks to my faculty committee members Professor Lee Fleming and Professor Michael Franklin, my industry advisors Jordan Hambleton and Mittha Manjunath from NetApp, and my teammates Achal Pandey and Huisi Tong.

Abstract

Analytics for NetApp E-Series AutoSupport Data Using Big Data Technologies

Jialiang Zhang
University of California at Berkeley, 2014
Supervisor: Lee Fleming

Our capstone project used novel Big Data technology to help NetApp Inc. develop the AutoSupport (ASUP) Ecosystem for its E-series products [1]. With this software framework, NetApp Inc. was able to collect normalized data, perform predictive analytics and generate effective solutions for its E-series customers. We used the Star Schema as the data warehousing structure and built seven dimension tables and two fact tables to handle the large volume of E-series ASUP data. To refine our decision and eliminate unsuitable technologies, we compared many eligible Big Data technologies with respect to their technical strengths and weaknesses. We used the latest Spark/Shark Big Data technology developed by Berkeley AMPLab [2] to construct the software framework. Additionally, to perform the featured predictive analytics, we applied K-means clustering and K-fold cross-validation machine learning techniques to the normalized data set.

My main contribution to this project was developing a parser that converts the majority of the E-series products’ daily/weekly and event-based ASUP logs into the normalized data format. After multiple trials and an overall assessment of both the difficulty and feasibility of different data parsing approaches, I recommended parsing the text-based data in the raw ASUP data set. Based on the normalized data I generated, we then successfully built a prototype. We expected that with our ASUP framework and predictive data analysis function, NetApp would have more power and efficiency in resolving E-series product issues for its customers. At the same time, our work on the ASUP framework would revolutionize NetApp’s data storage and customer support business and help the company exploit its niche market in the Big Data industry.

Table of Contents

List of Tables
List of Figures
Chapter 1 Introduction
    1.1 Company and Products
    1.2 Project Overview
    1.3 My Contributions
Chapter 2 Literature Review
    2.1 Competitors' Strategy
    2.2 NetApp's Strategy
Chapter 3 Methodology
    3.1 ASUP Environment
    3.2 Dataset
    3.3 Technology Comparison
    3.4 Data Storage Mechanism
    3.5 Data Parsing and Storing
    3.6 Data Querying and Insights
Chapter 4 Discussion
    4.1 Technology Comparison Matrix
    4.2 Data Parsing
    4.3 Star Schema Data Structure
Chapter 5 Conclusion
Appendix Data Parser Presentation Slides
Bibliography

List of Tables

Table 1: Existing Landscape of Data Storage and Analysis Market
Table 2: Technology Comparison Matrix

List of Figures

Figure 1: NetApp E2600 Storage System
Figure 2: NetApp AutoSupport Infrastructure
Figure 3: ASUP Data Processing Using Binary - XML - Tabular Format Approach
Figure 4: Star Schema Structure

Chapter 1: Introduction

1.1 COMPANY AND PRODUCTS

NetApp Inc. is a traditional computer storage and data management company. According to International Data Corporation (IDC), in the second quarter of 2013, NetApp Inc. held a 13.3% market share in external disk storage systems [3]. Its major competitors are EMC Corporation, International Business Machines Corporation (IBM), Seagate Technology PLC and Western Digital Corporation (WD).

E-series is NetApp’s new product line of conventional storage arrays, which has received much attention in the storage market. The E-series is composed of the E2600, E2700, E5400 and E5500 models, with storage capacity ranging from 768TB to 1536TB [1].

Figure 1: NetApp E2600 Storage System [1]

NetApp Inc. integrates AutoSupport (ASUP) technology into each individual E-series product in order to efficiently check the health of the system “on a continual basis” [4].

Continual monitoring generated a huge amount of AutoSupport data. In this project, our team focused on NetApp’s E-series AutoSupport raw data that had already been collected on the company’s server in Sunnyvale, California.

1.2 PROJECT OVERVIEW

“Big Data” refers to data that is “large or fast moving” and that current “conventional databases and technologies” are not sufficient to analyze.1 The advent of Big Data technologies, such as distributed systems, in-memory computing, data repositories with SQL compatibility and various machine learning algorithms, has successfully “facilitated easier analysis of large amounts of data”.1 Our capstone project, utilizing novel Big Data technology, is to help NetApp Inc. develop an AutoSupport (ASUP) Ecosystem for their E-series products.

At the customer end, a plethora of daily/weekly E-series log files is generated worldwide every day. What is more, when an E-series storage system encounters an abnormal event, for example a system-level warning or a failure due to disk malfunction, an event-based log is filed immediately. With this software framework, NetApp is able to capture the significant root cause from multiple reported warnings or failures, perform predictive analysis based on them and generate effective solutions for its customers.

1.3 MY CONTRIBUTION

Working together with two other Master of Engineering students, my major contributions to the capstone project were as follows:

1) Helped to investigate and understand the hardware configuration of NetApp’s E-series product and how ASUP worked.
2) Participated in designing the evaluation matrix for different Big Data technologies.
3) Researched one of the Big Data technologies – Phoenix from Inc.
4) Participated in building the Star Schema data structure for ASUP data.
5) Accomplished the parsing and cleaning of ASUP raw log file data.
6) Generated tables containing the necessary information in a normalized format for the data repository, and had the data cleaned for the team to analyze.

1 Referencing “NetApp Capstone Team Strategy Paper” in Jan., 2014
2 Using NetApp internal AutoSupport data search engine

Chapter 2: Literature Review

Admittedly, there are many data storage service providers in the market who are advocates of Big Data technologies. Other than NetApp, EMC Corporation, Cisco Systems, International Business Machines Corporation (IBM), Seagate Technology PLC and Western Digital Corporation (WD) are all storage array solution companies and potential competitors to NetApp. Table 1 below illustrates their key technologies, product trend, target and user group, and whether they are equipped with predictive ability or not.

Table 1: Existing Landscape of Data Storage and Analysis Market
Competitors: NetApp, EMC, Cisco, IBM, Seagate, WD
Row labels: Product; Core Process; Product User; Predictive Analysis & Solution
Entries: Fully Automated Storage Tiering (FAST); All Product Lines; Storage imply; RAID Hard Drive Storage; NAS Storage; My ries; “SP Collect”; Engineers / Customers; Engineers; No; Yes; Network Storage; Receive – Confirm – Solve – Prevent; Seagate technology; No; WD SmartWare; No

2.1 COMPETITORS’ STRATEGY

As the NetApp Project Strategy Paper emphasizes, a huge amount of data requires fast-paced analysis and efficient management, especially in this Big Data era.1 To promote “Big Data analytics”, EMC Corporation developed the “Pivotal HD Solution”. In their marketing literature, the “pivotal” solution referred to their use of the Apache Hadoop distribution, which was advertised as revolutionary in “Hadoop analytics for unstructured Big Data” [5]. Similarly, as a worldwide leader in networking, Cisco IT chose Hadoop to deliver on its commitment that “Enterprise Hadoop architecture, built on Cisco UCS (Unified Computing System) Common Platform Architecture (CPA) for Big Data, unlocks hidden business intelligence” [6]. What is more, in their promotional material, IBM emphasized its “Big Data platform”, whose key capabilities included “Hadoop-based analytics”, “Streaming Computing” and “Data Warehousing”, with prominence on the analytic applications of “Business Intelligence” and “Predictive Analytics” [7].

Unwilling to lag behind, traditional storage solution companies were dedicated to building their own Big Data technology. As Mike Crump, VP of Seagate, and Harrie Netel, director of Seagate, noted, “Seagate puts Big Data in action” with the “automated ODT (Outgoing DPPM Test)” and eCube technologies based on its own “Seagate’s Enterprise Data Warehouse (EDW)” [8]. WD (Western Digital), another major disk drive manufacturer, announced that they used Hadoop and Hortonworks to “optimize manufacturing with longer retention of sensor data” [9]. This market can be expected to evolve rapidly, and in order to survive, our ASUP ecosystem development for NetApp needs to exploit a niche market in this industry.

2.2 NETAPP’S STRATEGY

For NetApp Inc., the proper use of Big Data technology in our project will have a positive impact on its future business, because the successful deployment of Big Data technology on E-series products “necessitates secure, robust and low-cost solutions for data storage and management”, as emphasized in the NetApp Strategy Paper.1 When AutoSupport was first introduced in a NetApp white paper in 2007, it was highlighted that NetApp would have a more than “65% chance of resolving a customer case in less than one day” instead of only “35% [chances] without AutoSupport data” [10].

On the other hand, as database structures have become increasingly complex, our strategy for NetApp in this project is a radical evolution in the industry. MapReduce was a milestone in data mining, processing and management. As Dr. Jeff Ullman claimed in his book Mining of Massive Datasets, “Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently” [11]. Later, the MapReduce methodology was integrated with Hadoop Hive, specifically through HiveQL queries, “which are compiled into map-reduce jobs executed on Hadoop”, as demonstrated by Ashish Thusoo et al. in the paper entitled Hive - A Warehousing Solution Over a Map-Reduce Framework in 2009 [12]. Since then, the tool was tailored to handle large data sets and was very powerful, and many companies still relied on it.

However, we chose to use Berkeley Shark, which was Spark on top of Hadoop Hive with SQL compatibility. One special feature of Shark was that it could run MapReduce functions approximately a hundred times faster [2], which made it an ideal choice for fast-paced big data analysis. As illustrated in Table 1 above, with the help of Berkeley Shark technology, our data analysis function, which required predictive and real-time capability over large-scale data sets, became feasible. This was innovative and would dramatically improve the user experience of NetApp’s customers.

Actually, for all IT companies in this Big Data era, the key to success is whether the company can master advanced technology and seize the opportunity in a niche market. Our project on the E-series ASUP framework will revolutionize NetApp’s data storage and customer support business and help the company exploit its niche market in the Big Data industry.

Chapter 3: Methodology

One of our tasks in this project was to gain extensive knowledge by researching, analyzing and testing various Big Data technologies for the E-series ASUP framework. Initially, we made our technology selection list with Spark/Shark from Berkeley AMPLab [2], Impala [13] and Parquet [14] from Cloudera, Phoenix from [15] and Clydesdale from Google and IBM [16]. We then set up various benchmarks to compare these technologies in order to narrow down our list. After we finalized the decision to use the latest Berkeley Spark/Shark as our key technology, we developed the data storage schema, constructed the data repository and parsed the ASUP raw log files into tabular data for the repository. At the same time, we made progress on configuring Berkeley Shark on NetApp’s computing clusters, with which we could store the large-scale parsed data, perform analysis and offer predictive solutions using machine learning techniques. Since my work was mainly focused on data parsing, this paper centers on data processing accordingly.

3.1 ASUP ENVIRONMENT

Figure 2 is a demonstration of the ASUP infrastructure from NetApp’s online AutoSupport documents [4]. NetApp developed this technology many years ago and integrated it with several branded product lines in order to continuously and efficiently monitor the health of storage systems. This is achieved by constantly sending ASUP reports back to NetApp headquarters and the “My AutoSupport” online platform. As an effective troubleshooting tool, AutoSupport targets both NetApp support engineers and product customers.

Figure 2: NetApp AutoSupport Infrastructure [4]

Although AutoSupport was already deployed in many other NetApp products, it had not been integrated with NetApp’s E-series product line. Since E-series products are becoming one of NetApp’s featured products, the company wants to accomplish this integration soon. That is the goal of our capstone project.

3.2 DATASET

Since one storage system can generate multiple AutoSupport reports continuously in just a short period of time, data cleaning and analysis are demanding. Such a burst of reports is likely due to a hardware failure or a system warning that occurred earlier. But within each AutoSupport report, most of the contents are duplicated. Therefore, efficiently extracting the root cause of the problem becomes significant.

The size of an ASUP dataset varies greatly, from a few megabytes to several hundred megabytes in total, depending on how large the storage system is and whether the AutoSupport data is a daily log or a system warning type.

These are the raw datasets that we used for our capstone project. With access to NetApp’s repository of AutoSupport raw data, we can continuously collect these data globally. However, processing and integrating such a huge dataset demands novel Big Data technologies rather than traditional database and data management solutions.

3.3 TECHNOLOGY COMPARISON

We made a technical comparison of five eligible Big Data technologies, namely Berkeley Spark/Shark, Cloudera Impala and Parquet, Phoenix and Google Clydesdale. They all have various advantages and disadvantages. One of our tasks in this project was to narrow down this list and make a final decision on which technology to use to construct the framework. To achieve that goal, we researched their hardware limitations and computing constraints one by one, and listed our evaluation standards and results to examine each technology.

3.4 DATA STORAGE MECHANISM

In order to efficiently organize and store all of the normalized data, we utilized the Star Schema data structure. The Star Schema consisted of fact tables and dimension tables: fact tables stored the central metrics and information, whereas dimension tables held the descriptive data linked to the fact tables.

3.5 DATA PARSING AND STORING

After choosing Berkeley Spark/Shark technology, it was important to install and configure it properly on NetApp’s computing cluster. Our computing cluster consisted of one master node and three worker nodes, and we installed the latest release of Berkeley Spark/Shark, as of February 2014, on all of the cluster nodes. With that accomplished, I began to work on the data parser, which converts the ASUP raw data into tabular format for storage in the data repository.

3.6 DATA QUERYING AND INSIGHTS

Last but not least, we spent time and effort on identifying example use cases for NetApp’s E-series products and generating insightful data queries. Because this was one of our key tasks for the project, we wanted to offer valuable and predictive solutions for our customer.

A simple use case would be to collect all drive errors from one system, analyze its system configuration, repair record, capacity usage, device running time, etc., aggregate similar errors, identify the root cause, and predict when the next potential failure would occur. We applied K-means clustering and K-fold cross-validation machine learning algorithms to our dataset and drew conclusions accordingly.
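To make the two techniques named above concrete, the following is a minimal sketch in plain Python, not the code we ran on the cluster. It assumes each drive record has already been reduced to a numeric tuple; the function names (`sq_dist`, `kmeans`, `k_fold_indices`, `cross_validated_cost`) and the idea of scoring each fold by the mean squared distance of held-out points to their nearest centroid are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


def k_fold_indices(n, folds):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    for f in range(folds):
        test = indices[f::folds]
        train = [i for i in indices if i % folds != f]
        yield train, test


def cross_validated_cost(points, k, folds=3):
    """Average held-out distance to the nearest centroid over all folds."""
    costs = []
    for train, test in k_fold_indices(len(points), folds):
        centroids = kmeans([points[i] for i in train], k)
        held_out = [points[i] for i in test]
        total = sum(min(sq_dist(p, c) for c in centroids) for p in held_out)
        costs.append(total / len(held_out))
    return sum(costs) / len(costs)
```

Comparing `cross_validated_cost` for several values of `k` is one simple way to choose the number of clusters before inspecting what the members of each cluster have in common.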

Chapter 4: Discussion

4.1 TECHNOLOGY COMPARISON MATRIX

Table 2 below presents the technology comparison results we concluded for five major advanced Big Data technologies.

Table 2: Technology Comparison Matrix [17]

Name            Spark (Our Choice)   Impala     Phoenix      Parquet            Clydesdale
Company         UCB/Apache           Cloudera   Salesforce   Cloudera/Twitter   Google/IBM
Ease of Setup   Easy                 Easy       Easy         Medium             Hard
Compatibility   Hive                 Hive       HBase        Hadoop
Available APIs  Java, Python, Scala  Java       Java         Java               Java
Maturity        High                 Medium     Medium       Medium             Low

Note:
    Spark: In-memory computing, faster data queries, ideally suited for machine learning
    Impala: Best integration with Parquet
    Phoenix: Table Join function not available in Phoenix Version 2.1.2
    Parquet: Requires extensive configuration, query dependent, not suitable for multiple queries
    Clydesdale: Still a research prototype. Performance varies depending on query type

Users:
    Spark: IBM, Yahoo!, Intel, Groupon
    Impala: Cloudera
    Phoenix: Salesforce
    Parquet: Salesforce, Cloudera, Twitter
    Clydesdale: Google, IBM

[Rows SQL Like, Star Schema, Unstructured Data, Accelerated Storage Format, Bulk Data Load, In-memory, UDF and Predictive Analytics: per-technology marks not recoverable. Surviving entries: “Not for Nonscalar Data”; “Columnar” for three technologies.]

The evaluation standards are:

1. Company: The entity that developed and supported the technology
2. Ease of Setup: To measure how easy it is for users to set up and configure the technology
3. Compatibility: To examine whether the technology is compatible with Hive/HBase/Hadoop
4. SQL Like: To examine whether the technology has a SQL skin, which is easy to develop with
5. Star Schema: To examine whether the technology supports “Star Schema”
6. Unstructured Data: To survey how well the technology handles unstructured data such as txt files
7. Accelerated Storage Format: To identify whether the technology utilizes a columnar data format
8. Bulk Data Load: To examine whether the technology supports bulk loading of large data
9. In-memory: To observe whether the technology supports in-memory computation
10. UDF: To examine whether the technology has User Defined Function features
11. Predictive Analytics: To survey whether the technology has a predictive analytics function
12. Available APIs: To examine which APIs it supports, such as Java, Python or Scala
13. Maturity: To measure how mature the technology is, at the level of High, Medium or Low
14. Extra Note: Other significant features, functions, release or development notes
15. Users: Examples of companies/entities who utilize or deploy the technology

4.2 DATA PARSING

One of my major tasks in this capstone project was to parse the raw log files and extract valuable data from them. At the early stage of our project, we discovered that there was a binary file in the log file jar. Utilizing an internal Java-based parser, we could convert these binary files into semi-structured XML files for preliminary analysis. However, the XML files were not directly valuable to us, because this data format was not compatible with our databases and none of the machine learning algorithms could be applied to it.
We needed to further normalize these data and convert them into tabular format, store them in our databases residing on powerful computing clusters, aggregate them further to perform predictive analysis using modern machine learning algorithms and finally generate insightful solutions. To achieve these goals, we devoted our effort to creating a new Python-based parser to convert these XML files into CSV (comma-separated values) format with organized data in it.

However, the internal binary-to-XML parser was just a preliminary version. As we parsed different ASUP log files later on, it failed several times. In addition, the binary-to-XML parser was inefficient in data processing: every time we used it, we had to convert the raw log files into XML format first and then transform them into tabular format using the parser developed by our team. This is illustrated in Figure 3.

Figure 3: ASUP Data Processing Using Binary - XML - Tabular Format Approach

Hence, in consideration of the efficiency of our AutoSupport ecosystem, we needed to explore an alternative approach. We found that there were many text files within the same ASUP log jar. Though they are all unstructured data, those text-based data contain almost as much valuable information as the binary files. So we decided to set aside the previous approach and began to develop a new parser aimed at these text-based data. This process is illustrated in the appendix.

I wrote a Python-based parser that took in the text files in the ASUP log file jar and generated all the tables automatically. The parser extracted all the key words in the text file, such as the IDs of various physical components, the generation date of the ASUP report, the date of manufacturing, etc., as column names in the table, and stored the values or descriptions associated with those key words in tabular format in a CSV file.

As discussed in section 3.2, the same storage system could continuously generate multiple ASUP reports in a short period of time. These datasets were mostly alike, so simply bulk loading them into the data repository without proper processing might cause overwriting problems. To deal with this issue, I utilized the “Partitioning” function in Hive, which is also available in Shark (because Shark was Spark on top of Hive). Since the ASUP generation date had sufficient precision, we decided to use it as the partition field to differentiate distinct ASUP data, or the data generated by the same ASUP in different time periods.
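The text-to-CSV step and the date-based partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the project parser: it assumes each report is a block of `Key: value` lines, the field name `Generation Date` is hypothetical, and the Hive partitions are only imitated in memory by grouping rows per date.

```python
import csv
import io
from collections import defaultdict


def parse_report(text):
    """Turn the 'Key: value' lines of one report into a dict.

    Only the first colon splits a line, so values such as timestamps
    that contain colons are kept whole.
    """
    row = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            row[key.strip()] = value.strip()
    return row


def partition_by_date(rows, date_field="Generation Date"):
    """Group rows by the (assumed) generation-date field."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row.get(date_field, "unknown")].append(row)
    return dict(partitions)


def to_csv(rows):
    """Serialize rows to CSV text; missing fields become empty cells."""
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because two reports from the same system land in different date groups, loading one group cannot overwrite the other, which is the property the partition field was chosen for.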

4.3 STAR SCHEMA DATA STRUCTURE

I also participated in designing the data structure for the repository. Figure 4 is the sample Star Schema we created for ASUP data warehousing.

Figure 4: Star Schema Structure [17]

As discussed in previous sections, all large-scale ASUP data were stored following this structure on the computing cluster. For the different dimension tables, we used the IDs of various components as the primary keys to link to the fact tables. And as noted above, each table contained a partition field when stored in the data repository.
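A reduced version of this layout can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample values below are hypothetical and are not the schema of Figure 4; SQLite also has no Hive-style partitions, so the partition field appears here as an ordinary `asup_date` column.

```python
import sqlite3


def build_star_schema():
    """Create one dimension table and one fact table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_drive (
            drive_id  TEXT PRIMARY KEY,   -- component ID used as the link key
            model     TEXT,
            asup_date TEXT                -- stands in for the partition field
        );
        CREATE TABLE fact_drive_event (
            drive_id    TEXT REFERENCES dim_drive(drive_id),
            asup_date   TEXT,             -- stands in for the partition field
            error_count INTEGER
        );
    """)
    conn.executemany(
        "INSERT INTO dim_drive VALUES (?, ?, ?)",
        [("D-01", "E2600-HDD", "2014-02-01"), ("D-02", "E5400-HDD", "2014-02-01")],
    )
    conn.executemany(
        "INSERT INTO fact_drive_event VALUES (?, ?, ?)",
        [("D-01", "2014-02-01", 2), ("D-01", "2014-02-02", 3), ("D-02", "2014-02-01", 1)],
    )
    return conn


def errors_per_model(conn):
    """Join the fact table to its dimension and total the errors per model."""
    query = """
        SELECT d.model, SUM(f.error_count)
        FROM fact_drive_event AS f
        JOIN dim_drive AS d ON f.drive_id = d.drive_id
        GROUP BY d.model
        ORDER BY d.model
    """
    return conn.execute(query).fetchall()
```

The query shows the usual star-schema pattern: metrics are aggregated from the fact table, while the grouping attribute comes from a dimension table reached through the component ID.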

Chapter 5: Conclusion

At this time, we have successfully designed the data structure, configured Spark/Shark on the computing clusters, parsed and cleaned the majority of the ASUP data for use, and generated several use case insights based on machine learning algorithms.

To review my part of the work, the biggest difficulty I encountered was parsing the unstructured text-based AutoSupport data. Yet at the same time, I realized that it was important to have a clean and normalized dataset for machine learning applications, user interface development and future predictive analysis. The difficulty lay in the fact that my parser might work well on one version of AutoSupport but fail completely on other AutoSupport versions, simply because of new lines added in those versions. To overcome this difficulty, I had to run and test my parser on multiple AutoSupport versions one by one.

Fortunately, I found that all of the AutoSupport reports were in a “normalized” format to some degree. The total amount of information, or the set of column names in the tabular data format after conversion, was fixed. The only difference was that some AutoSupport versions tended to omit certain hardware information, which might not be configured in the storage system. Therefore, I drew the conclusion that to develop a parser for such unstructured text data, it was important to aggregate all of the possible information first, no matter whether it existed in the current dataset or not. Furthermore, parsing techniques such as the look-up table mechanism should be used to parse the dataset completely, instead of parsing the dataset line by line or by searching for key words in it.
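One possible reading of the look-up table mechanism is sketched below. The raw labels, canonical column names and report format are all assumptions made for illustration. The fixed column list is declared up front, so a report version that omits a field still yields the same columns, and an unknown extra line is ignored instead of breaking the parse.

```python
# Hypothetical look-up table: raw label seen in any report version -> canonical column.
LOOKUP = {
    "Tray ID": "tray_id",
    "Tray": "tray_id",
    "Drive ID": "drive_id",
    "Date of manufacture": "manufacture_date",
    "Manufacture date": "manufacture_date",
}

# The full, fixed set of output columns, aggregated ahead of time.
COLUMNS = ["tray_id", "drive_id", "manufacture_date"]


def normalize(report_text):
    """Map one report onto the fixed column set using the look-up table."""
    row = dict.fromkeys(COLUMNS, "")  # every column exists, even if absent in the report
    for line in report_text.splitlines():
        label, sep, value = line.partition(":")
        column = LOOKUP.get(label.strip())
        if sep and column:  # lines with unknown labels are skipped
            row[column] = value.strip()
    return row
```

Supporting a new AutoSupport version then means adding entries to `LOOKUP` rather than changing the parsing logic.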

In the future, work can continue in the areas of implementing more machine learning algorithms on the whole set of E-series AutoSupport data, constructing a friendly user interface for potential customers, and making the E-series AutoSupport ecosystem more efficient and robust.

Appendix A: Data Parser Presentation Slides (see footnote 3)

3 Referencing “NetApp Capstone Team Final Presentation” in May, 2014



Bibliography

[1] -systems/e5400/e5400-productcomparison.aspx
[2] [Online]
[3] [Online] prUS24302513
[4] [Online] ort.aspx
[5] plorer/index.htm#content/which/analytics/, last access Jan. 30, 2014.
[6] ateral/ns340/ns1176/datacenter/BigData Case Study-1.html/, last access Jan. 31, 2014.
[7] [Online]., last access Jan. 31, 2014.
[8] [Online]. ion-a-case-study/, last access Feb. 1, 2014.
[9] igital/, last access Feb. 2, 2014.
[10] "Proactive Health Management with AutoSupport" NetApp White Paper. Network Appliance, Inc. Technical Report. WP-7027-0907. Sept. 2007.
[11] J. Ullman, "Mining of Massive Datasets" pp.19, Cambridge University Press, December 30, 2011.
[12] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff and R. Murthy, “Hive - A Warehousing Solution Over a Map-Reduce Framework”. Proceedings of the VLDB Endowment. Vol. 2 Issue 2. Pp. 1626-1629. August ] [Online]
[15] [Online]
[16] T. Kaldewey, E. Shekita, S. Tata, “Clyd
