NetFlow Analysis with MapReduce
Wonchul Kang, Yeonhee Lee, Youngseok Lee
Chungnam National University
{teshi85, yhlee06, lee}
Based on "An Internet Traffic Analysis Method with MapReduce", Cloudman workshop, April 2010

Introduction
 Flow-based traffic monitoring
– Volume of processed data is reduced
– Popular flow statistics tools : Cisco NetFlow [1]
 Traditional flow-based traffic monitoring
– Runs on a high-performance central server
[Figure: routers send flow data to storage attached to a high-performance server]

Motivation
 A huge amount of flow data
– Long-term collection of flow data

Flow data in our campus network ( /16 prefix )
# of Routers    1 Day     1 Month    1 Year
1               1.2 GB    13 GB      156 GB
5               6 GB      65 GB      780 GB
10              12 GB     130 GB     1.5 TB
200             240 GB    2.6 TB     30 TB

– Short-term period of flow data
   Massive flow data from anomalous traffic such as Internet worms and DDoS
 Cluster file system and cloud computing platform
– Google's programming model, MapReduce, big table [8]
– Open-source system, Hadoop [9]

MapReduce
 MapReduce is a programming model for large data sets
 First suggested by Google
– J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004 [8]
 Users specify only a map and a reduce function
– Automatically parallelized and executed on a large cluster

MapReduce
[Figure: input splits 1 to 4 → Map → Shuffle & Sort → Reduce → Result]
Map : ( k1, v1 ) → list ( k2, v2 )
Reduce : ( k2, list ( v2 ) ) → list ( v3 )
 Map : returns a list containing zero or more ( k, v ) pairs
– Output key can differ from the input key
– Several outputs can share the same key
 Reduce : returns a new list of reduced output from its input
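The map, shuffle & sort, and reduce stages above can be illustrated with a minimal single-process sketch. This is not how Hadoop executes a job (there is no distribution or fault handling here); the names `run_mapreduce`, `word_map`, and `sum_reduce` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process illustration of map -> shuffle & sort -> reduce."""
    # Map phase: each input (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Shuffle: group every value under its intermediate key.
            intermediate[k2].append(v2)

    # Sort by key, then reduce each (k2, list(v2)) to a list of outputs (v3).
    results = {}
    for k2 in sorted(intermediate):
        results[k2] = list(reduce_fn(k2, intermediate[k2]))
    return results


def word_map(_line_no, line):
    # Output key (a word) differs from the input key (a line number),
    # and several outputs may share the same key.
    for word in line.split():
        yield word.lower(), 1


def sum_reduce(_key, values):
    yield sum(values)


if __name__ == "__main__":
    lines = [(0, "map reduce map"), (1, "reduce shuffle")]
    print(run_mapreduce(lines, word_map, sum_reduce))
```

In a real cluster the map calls run in parallel on separate input splits and the grouping step moves data between nodes; the sketch only shows the data flow between the three stages.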

Hadoop
 Open-source framework for running applications on large clusters built of commodity hardware
 Implementation of MapReduce and HDFS
– MapReduce : computational paradigm
– HDFS : distributed file system
 Node failures are automatically handled by the framework
 Hadoop
– Amazon : EC2, S3 service
– Facebook : analyze the web log data

Related Work
 Widely used tools for flow statistics
– Flow-tools, FlowScan, or CoralReef [5]
 P2P-based distributed analysis of flow data
– DIPStorage : each storage tank associated with a rule [11]
 MapReduce software
– Snort log analysis : NCHC cloud computing research group [16]

Contribution
 A flow analysis method with MapReduce
– Process flow data in a cloud computing platform, Hadoop
 Implementation of flow analysis programs with Hadoop
– Decrease flow computation time
– Enhance fault tolerance of flow analysis jobs

Architecture of Flow Measurement and Analysis System
 Each router exports flow data to a cluster node
 Cluster master manages cluster nodes

Components of Cluster Node
 Flow file input processor
 Flow analysis map/reduce
 Flow-tools
 Hadoop
– HDFS
– MapReduce
 Java VM
 OS : Linux
[Figure: software stack of a cluster node, top to bottom: Flow File Input Processor (flow-tools) and Flow Analysis Map / Reduce; Cluster File System ( HDFS ) and MapReduce Library (Hadoop); Java Virtual Machine; Operating System ( Linux ); Hardware ( CPU, HDD, Memory, NIC )]

Flow File Input Processor
 Save : NetFlow v5 data is saved on the local disk as a binary flow file
 Convert : the binary flow file is converted into a text flow file
 Copy : the text flow file is copied to HDFS
[Figure: on the cluster master, NetFlow v5 → Flow File ( Binary Format ) on local disk → Flow File ( Text Format ) → HDFS on the cluster nodes]
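The convert and copy steps could be scripted roughly as follows. This is a sketch under assumptions: the slides do not say which tool performs the text conversion, so piping `flow-cat` into `flow-print` (both part of flow-tools) is a guess, `hadoop fs -put` is used for the HDFS copy, and all paths and function names are hypothetical. By default the sketch only prints the commands instead of running them.

```python
import shlex
import subprocess


def build_ingest_commands(binary_file, text_file, hdfs_dir):
    """Return shell commands for the convert and copy steps (assumed tools)."""
    # Assumption: flow-cat | flow-print renders a binary flow file as text.
    convert = "flow-cat {src} | flow-print > {dst}".format(
        src=shlex.quote(binary_file), dst=shlex.quote(text_file))
    # Copy the resulting text file into HDFS.
    copy = "hadoop fs -put {src} {dst}".format(
        src=shlex.quote(text_file), dst=shlex.quote(hdfs_dir))
    return [convert, copy]


def ingest(binary_file, text_file, hdfs_dir, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    commands = build_ingest_commands(binary_file, text_file, hdfs_dir)
    for command in commands:
        if dry_run:
            print(command)
        else:
            subprocess.run(command, shell=True, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    ingest("/data/flows/sample.ft", "/data/flows/sample.txt", "/flows/text")
```

Keeping command construction separate from execution makes it easy to inspect what would run before touching the cluster.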

Flow Analysis Map/Reduce
 Read text flow files
 Run map tasks
– Read each line (Validation Check)
– Parse flow data
– Save result into temporary files (key, value)
 Run reduce tasks
– Read temporary files (Key, List[Value])
– Run sum process
 Write results to a file
[Figure: example with Dst Port as key and Octet as value: two flows to port 53 with 64 and 128 octets are grouped as (53, [64, 128]) and reduced to (53, 192)]
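The map and reduce steps above can be sketched in plain Python for the byte-count-per-destination-port case. The column layout in `FIELDS` is an assumption (the slides do not show the text flow file format), and the function names are hypothetical; the real jobs run as Hadoop map and reduce tasks rather than in one process.

```python
from collections import defaultdict

# Hypothetical column layout of one text flow line (an assumption).
FIELDS = ("src_ip", "dst_ip", "proto", "src_port", "dst_port", "octets", "packets")


def map_flow_line(line):
    """Validate and parse one line; yield (dst_port, octets) or nothing."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return  # validation check: skip malformed lines
    record = dict(zip(FIELDS, parts))
    if not (record["dst_port"].isdigit() and record["octets"].isdigit()):
        return  # validation check: skip headers and non-numeric fields
    yield int(record["dst_port"]), int(record["octets"])


def reduce_port(dst_port, octet_list):
    """Sum all octet counts collected for one destination port."""
    return dst_port, sum(octet_list)


def port_breakdown(lines):
    """Run map over every line, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for port, octets in map_flow_line(line):
            grouped[port].append(octets)
    return dict(reduce_port(port, values) for port, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = [
        "10.0.0.1 10.0.0.2 17 40000 53 64 1",
        "10.0.0.3 10.0.0.2 17 40001 53 128 2",
    ]
    print(port_breakdown(sample))
```

With the two sample flows, port 53 collects the values 64 and 128, which the reduce step sums to 192, mirroring the example on the slide.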

Performance Evaluation Environment
 Data: flow data from /24 subnet

Duration    Flow count (million)    Flow file count    Total binary file size (GB)    Total text file size (GB)
1 day       3.2                     228                0.2                            1.2
1 week      19.0                    1596               0.3                            2.3
1 month     109.1                   7068               2.0                            13.1

 Compared methods : computing byte count per destination port
– flow-tools : flow-cat [flow data folder] | flow-stat -f 5
– Our implementation with Hadoop
 Performance metric
– Flow statistics computation time
– Fault recovery against map/reduce tasks

Our Testbed
 Hadoop 0.18.3
 Cluster master x 1
– Core 2 Duo 2.33 GHz
– Memory 2GB
– 1 GE
 Cluster node x 4
– Core 2 Quad 2.83 GHz
– Memory 4GB
– HDD 1.5 TB
– 1 GE
[Figure: the router at Chungnam National University, connected to the Internet, exports NetFlow v5 data over Gigabit Ethernet to the cluster master and cluster nodes]

Flow Statistics Computation Time
[Figure: "Port-breakdown Computation Time"; y-axis: port breakdown running time (sec), 0 to 18000; x-axis: number of flows (duration): 3.2 million (One Day), 19 million (One Week), 109.1 million (One Month); series: flow-tools, MR (1), MR (2), MR (3), MR (4); annotations: flow-tools : 4h 30m 23s, MR(4) : 1h 15m 49s]
 Port breakdown computation time
– 72% decrease with MR(4) on Hadoop
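The 72% figure can be checked from the two running times annotated on the chart, assuming those are the values being compared (flow-tools at 4h 30m 23s against MR(4) at 1h 15m 49s). The helper name `to_seconds` is only for this illustration.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration into seconds."""
    return hours * 3600 + minutes * 60 + seconds


flow_tools = to_seconds(4, 30, 23)  # 16223 s
mr4 = to_seconds(1, 15, 49)         # 4549 s

decrease = 1 - mr4 / flow_tools     # fraction of running time saved
speedup = flow_tools / mr4          # how many times faster MR(4) is

print(f"{decrease:.0%} decrease, {speedup:.2f}x speedup")
# prints "72% decrease, 3.57x speedup"
```

A 72% decrease in running time is therefore the same result as a speedup of roughly 3.6 times on four cluster nodes.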

Single Node Failure : Map Task
 Under 4 cluster nodes
 Map task fail time
– 4 sec (M : 9% R : 0%)
 Map task recover time
– 266 sec (M : 99% R : 32%)
[Figure labels: Fail time 4 sec; Recover time 266 sec]

Single Node Failure : Reduce Task
 Under 4 cluster nodes
 Reduce task fail time
– 29 sec (M : 41% R : 10%)
 Reduce task recover time
– 320 sec (M : 99% R : 32%)
[Figure labels: Fail time 29 sec; Recover time 320 sec]

Text vs. Binary NetFlow Files
[Figure: flow analysis with text files: binary flow file → text converter → text flow file → HDFS → flow analyzer (TextInputFormat → Map → Reduce → TextOutputFormat), with K : Text, V : LongWritable]
[Figure: flow analysis with binary files: binary flow file → HDFS → flow analyzer (BinaryInputFormat → Map → Reduce → BinaryOutputFormat), with K : BytesWritable, V : BytesWritable]

Binary Input in Hadoop
 Currently developing BinaryInputFormat module for Hadoop
 Small storage by binary NetFlow files
– Reduces # of Map tasks → increasing performance
 Decreasing computation time
– By 18% ~ 55% for a single flow analysis job
– By 58% ~ 75% for two flow analysis jobs
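One property that makes binary input workable is that flow records have a fixed length, so a reader can cut the input on record boundaries without scanning for newlines. The sketch below assumes the raw NetFlow v5 export record layout (48 bytes per flow, network byte order); it does not model the flow-tools file format or the authors' BinaryInputFormat, and the function names are hypothetical.

```python
import struct

# One NetFlow v5 flow record: 48 bytes, network byte order.
# Field order: srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets,
# first, last, srcport, dstport, pad1, tcp_flags, prot, tos,
# src_as, dst_as, src_mask, dst_mask, pad2.
RECORD = struct.Struct("!IIIHHIIIIHHBBBBHHBBH")


def iter_records(payload):
    """Yield (dst_port, octets) from a buffer of back-to-back records."""
    usable = len(payload) - len(payload) % RECORD.size  # ignore a trailing partial record
    for offset in range(0, usable, RECORD.size):
        fields = RECORD.unpack_from(payload, offset)
        yield fields[10], fields[6]  # dstport, dOctets


def split_offsets(total_bytes, records_per_split):
    """Byte ranges for input splits that always start on a record boundary."""
    step = RECORD.size * records_per_split
    return [(start, min(start + step, total_bytes))
            for start in range(0, total_bytes, step)]


if __name__ == "__main__":
    # Two synthetic records to port 53 carrying 64 and 128 octets.
    first = RECORD.pack(0, 0, 0, 0, 0, 1, 64, 0, 0, 40000, 53, 0, 0, 17, 0, 0, 0, 0, 0, 0)
    second = RECORD.pack(0, 0, 0, 0, 0, 2, 128, 0, 0, 40001, 53, 0, 0, 17, 0, 0, 0, 0, 0, 0)
    print(list(iter_records(first + second)))
    print(split_offsets(len(first + second), records_per_split=1))
```

Because binary records are also more compact than their text rendering, the same data occupies fewer HDFS blocks, which is consistent with the slide's point about fewer map tasks.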


Summary
 NetFlow data analysis with MapReduce
– Easy management of big flow data
– Decreasing computation time
– Fault-tolerant service against a single machine failure
 Ongoing work
– Supporting binary NetFlow files
– Enhancing fast processing of NetFlow files

References
[1] Cisco NetFlow,
[2] L. Deri, nProbe: an Open Source NetFlow Probe for Gigabit Networks, TERENA Networking Conference, May 2003.
[3] J. Quittek, T. Zseby, B. Claise, and S. Zander, Requirements for IP Flow Information Export (IPFIX), IETF RFC 3917, October 2004.
[4] tcpdump,
[5] CAIDA CoralReef Software Suite,
[6] M. Fullmer and S. Romig, The OSU Flow-tools Package and Cisco NetFlow Logs, USENIX LISA, 2000.
[7] D. Plonka, FlowScan: a Network Traffic Flow Reporting and Visualizing Tool, USENIX Conference on System Administration, 2000.
[8] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.
[9] Hadoop,
[10] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K. Lee, Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices, ACM CoNEXT, 2008.
[11] C. Morariu, T. Kramis, and B. Stiller, DIPStorage: Distributed Architecture for Storage of IP Flow Records, 16th Workshop on Local and Metropolitan Area Networks, September 2008.
[12] M. Roesch, Snort - Lightweight Intrusion Detection for Networks, USENIX LISA, 1999.
[13] W. Chen and J. Wang, Building a Cloud Computing Analysis System for Intrusion Detection System, CloudSlam 2009.
[14] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy, Hive: a warehousing solution over a map-reduce framework, Proceedings of the VLDB Endowment, Volume 2, Issue 2 (August 2009), Pages 1626-1629.
[15] HBase, http://hadoop org/hbase
[16] Wei-Yu Chen and Jazz Wang, Building a Cloud Computing Analysis System for Intrusion Detection System, CloudSlam'09.
