
Guidelines for Hadoop and Spark Cluster Usage
Procedure to create an account in CSX:
If you are taking a CS-prefix course, you already have an account. To get an initial password created:
1. Login to https://cs.okstate.edu/pwreset to set the password for CSX.

Putty Download:
Windows users can use the link below to download PuTTY:
https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html

Connecting from outside the university:
1. Login to https://osuvpn.okstate.edu/+CSCOE+/logon.html with your OKEY username and password.
2. You will be redirected to the page shown in the screenshot below; go to the download section and download AnyConnect VPN.
3. Connect to the VPN every time you want to use the servers in the CS Department.
4. After connecting, open PuTTY, in Hostname give csh.cs.okstate.edu for the CSH login cluster, set the port as 22 and the connection type as SSH.
Login Procedure to Hadoop1:
1. Login with your user name (CSX user name) to csh.cs.okstate.edu.
2. MAC/LINUX users type ssh username@csh.cs.okstate.edu from your terminal. Windows users can use PuTTY to login to CSH.
3. Enter your CSX password when asked.
4. Then login into bbod by typing ssh user@bbod.
5. Enter your CSX password when asked.
6. After logging onto bbod, log on to the master node of our Hadoop cluster by typing ssh user@hadoop1.
7. Enter your CSX password when asked.
8. You should access only the folder /username in HDFS.
9. You can now use Hadoop in the cluster. A sample login session is shown below.
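For example, a typical session from a Mac/Linux terminal looks like the following (replace username with your CSX user name; you will be prompted for your CSX password at each hop):

ssh username@csh.cs.okstate.edu    # CSH login cluster
ssh username@bbod                  # from CSH to bbod
ssh username@hadoop1               # from bbod to the Hadoop master node
hdfs dfs -ls /username             # confirm your HDFS folder is reachable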
Process to collect Twitter Data:
Apache Flume is open-source software that helps store streaming data into HDFS. A Flume agent should be created through which we can stream the data. The following steps describe the process to collect Twitter data.
1. Create an account in Twitter and login with the credentials.
2. Navigate to https://apps.twitter.com/ and create a new app.
3. Give the name and a description of the data you want to collect, and set the website URL to https://twitter.com/.
4. Get the consumer key, consumer secret, access token and access token secret for your application:
(a) Go to Keys and Access Tokens to get the consumer key and consumer secret, then go to Your Access Tokens to get the access token and access token secret.
5. There are 3 components for a Twitter agent, namely source, sink and channel.
6. The Flume source connects to the Twitter API and receives data in JSON format, which in turn is stored into HDFS.
7. Now, create a configuration file for the Flume agent by specifying the consumer key, consumer secret, access token, access token secret, keywords and HDFS path.
A sample configuration file with file extension .conf is shown below. It shows all the keys and keywords to be used to collect the Twitter data.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = <keywords to be specified here>
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:9000/rramine/Food_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

# Command to start the flume agent:
nohup $FLUME_HOME/bin/flume-ng agent -n TwitterAgent -f <configuration file path> &
Example: nohup $FLUME_HOME/bin/flume-ng agent -n TwitterAgent -f /autohome/rramine/bitcoin.conf &

nohup makes sure the data collection process keeps running in the background. The nohup.out file is the log file that is created when we start the process. The data collected will be in JSON format.

Command to check the count of the files:
hdfs dfs -count /username/<folder name>
There is a limit on the number of files; no more data is put into files once the limit is reached.

Reference: https://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
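Once the agent is started, the collection can be watched with standard commands, for example (the folder name is illustrative):

tail -f nohup.out                       # follow the Flume log
hdfs dfs -ls /username/folder_name      # list the files written so far
hdfs dfs -count /username/folder_name   # compare the file count against the limit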
Map Reduce Example Program:
MapReduce is a framework using which we can process applications with a huge amount of data in parallel.
An example word count program in MapReduce:
Program: wordcount.java

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class wordcount {

  // Mapper: emits (WORD, 1) for every word in each input line
  public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] words = line.split(" ");
      for (String word : words) {
        Text outputKey = new Text(word.toUpperCase().trim());
        IntWritable outputValue = new IntWritable(1);
        context.write(outputKey, outputValue);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    // wire the job to the mapper/reducer classes and output types
    job.setJarByClass(wordcount.class);
    job.setMapperClass(MapForWordCount.class);
    job.setReducerClass(ReduceForWordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Steps to execute MapReduce Programs:
1. Login to hadoop1 and create a java file named programname.java.
2. Then compile the java program to get the class file.
Command: javac programname.java
3. Create the jar file.
Command: jar -cvf programname.jar programname*.class
4. Command to run the Hadoop program:
Command: hadoop jar programname.jar programname hdfs://hadoop1:9000/inputpath hdfs://hadoop1:9000/outputpath
5. Command to check the output: hdfs dfs -cat /outputpath/part-r-00000
The input file should be in HDFS. A full worked session is shown after the screenshots below.
Screenshot of input data:
Screenshot of output data:
Reference: …nt-hello-word-program-in-mapreduce
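Putting the execution steps together, a complete session for the word count program above might look like this (paths are illustrative; depending on the environment, the Hadoop jars may need to be added to the compile classpath as shown):

javac -classpath $(hadoop classpath) wordcount.java   # compile against the Hadoop jars
jar -cvf wordcount.jar wordcount*.class                # package the classes
hdfs dfs -put input.txt /username/input.txt            # copy the input file into HDFS
hadoop jar wordcount.jar wordcount hdfs://hadoop1:9000/username/input.txt hdfs://hadoop1:9000/username/output
hdfs dfs -cat /username/output/part-r-00000            # view the result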
Apache Hive
Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL dialect, called Hive Query Language (HQL), for querying data stored in a Hadoop cluster.

Feature                  Description
Familiar                 Query data with a SQL-based language
Fast                     Interactive response times, even over huge datasets
Scalable and Extensible  As data variety and volume grow, more commodity machines can be added, without a corresponding reduction in performance
Compatible               Works with traditional data integration and data analytics tools

Please follow the steps below to use Hive in the Hadoop cluster:
First login to the Hadoop cluster (hadoop1) and type 'hive' on the console. A Hive console will be initialized as shown in the screenshot below.
To verify the version of Hive, we can use the command below:
hive --version

Load Data to a Table in Hive:
We can load the data into Hive either from a local file or from HDFS. The examples below show how to load data into Hive from both a local file and HDFS.
To load the data from a local file, the following query can be used:
LOAD DATA LOCAL INPATH 'MOCK_DATA.txt' INTO TABLE local_table;
The screenshots below show how to create a table and load data from a local file in Hive; a sketch of the corresponding HiveQL follows.
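As an illustration of what the screenshots show, a sketch of the HiveQL for creating and loading a table (the table name and columns are hypothetical, assuming a comma-delimited MOCK_DATA.txt):

CREATE TABLE local_table (id INT, name STRING, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'MOCK_DATA.txt' INTO TABLE local_table;
SELECT * FROM local_table LIMIT 10;   -- quick check that the rows arrived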
The steps below show how to load data from HDFS into Hive; a sketch is shown below. Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
Reference: https://www.tutorialspoint.com/hive/hive_create_table.htm
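A minimal sketch of the corresponding HDFS load (path, file and table names are illustrative):

# from the shell: copy the file into HDFS first
hdfs dfs -put MOCK_DATA.txt /username/MOCK_DATA.txt
-- then, inside the hive console:
LOAD DATA INPATH '/username/MOCK_DATA.txt' INTO TABLE local_table;

Note that, unlike the LOCAL variant, loading from HDFS moves the file into the table's warehouse directory rather than copying it.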
Login Procedure to Spark Cluster:
1. Login with your user name (CSX user name) to csh.cs.okstate.edu.
2. MAC/LINUX users type ssh username@csh.cs.okstate.edu from your terminal. Windows users can use PuTTY to login to CSH.
3. Enter your CSX password when asked.
4. Then login into cshvm27 by typing ssh user@cshvm27.
5. Enter your CSX password when asked.
There are 2 Spark versions available in the cluster:
1. 2.0.1
2. 2.3.0
Upon deciding which version of Spark to use, set the path of that Spark version as follows:
export SPARK_HOME=/root/spark-2.3.0
export PATH=$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin
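Once SPARK_HOME and PATH are set, the selected version can be confirmed before starting any jobs, for example:

spark-submit --version   # prints the version banner of the selected Spark installation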
Once logged into the Spark cluster, Spark's API can be used through the interactive shell or through programs written in Java, Scala and Python.
Steps to invoke Spark Shell:
1. After logging into the Spark cluster and following the steps mentioned above, type spark-shell at the command prompt to start Spark's interactive shell. It will start the shell as shown below.
2. Give the command :q or press Ctrl+D to exit the Spark shell.
NOTE: Only commands written in Scala will work in spark-shell. To use commands written in Python, type pyspark to start the PySpark shell. Use the exit() command to exit from pyspark.
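For instance, once the Scala shell is up, a quick sanity check can be run before any real work (sc is the SparkContext that spark-shell creates for you):

scala> val nums = sc.parallelize(1 to 100)   // distribute a small collection as an RDD
scala> nums.sum()                            // should return 5050.0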
Steps to Run a Spark Application written in Scala:
Once the code for the application is written, the application needs to be built using sbt.
1. Create a dependency file called build.sbt (or a build.xml file).
2. An application with import statements such as the sample file will have the following contents:

name := "SparkAssignment"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.1",
  "org.apache.spark" % "spark-graphx_2.10" % "2.1.0",
  "org.apache.spark" % "spark-streaming_2.10" % "2.1.0"
)

3. name should be the name of the application.
4. version is the version of your application build.
5. scalaVersion is the version of Scala present in the cluster.
6. libraryDependencies lists the libraries used while developing the Spark application. The format is as follows:
libraryDependencies ++= Seq("org.apache.spark" %% "<Spark library name>" % "<Spark version>" % "provided")
7. Every Spark application written in Scala should have the following dependencies imported:
a. import org.apache.spark.{SparkConf, SparkContext}
b. import org.apache.spark.SparkContext._
8. SparkConf is for creating the Spark configuration. SparkContext is for creating the Spark context object which is used to run the Spark application.
9. Hence the build.sbt file will have at least one entry in the library dependencies:
libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "2.0.1" % "provided")
10. Once the build.sbt file is created, create the folder structure below in the folder where build.sbt is saved:
a. ./src/main/scala
11. Copy the Spark application to the ./src/main/scala folder.
12. Go to the folder where build.sbt is present.
13. Run the following commands to build the application using all included dependencies:
a. sbt clean
b. sbt package
This command will create the folder structure shown below:
c. The project folder will contain the class files.
d. The target folder will contain the jar file created.
14. Once the build is successful, the application can be run using the following command (an example invocation is sketched after the note below):
Ex: spark-submit --class <class name> <path to jar file>
NOTE: Before running the application in the cluster, it is advisable to run the application as a series of commands in spark-shell to test whether the commands work as intended.
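As a concrete illustration of step 14, assuming the sample build.sbt above and a main object called wordCount (the exact jar name under target/ depends on the Scala and application versions sbt uses):

spark-submit --class wordCount target/scala-2.10/sparkassignment_2.10-1.0.jar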
Implementing Map Reduce in Spark using Scala:
Scala has map() and reduce() functions that can be applied on an RDD, but they cannot be used directly to run the MapReduce paradigm of Hadoop.
An example word count program is explained below to understand how to write a map reduce program in Scala.

import org.apache.spark.{SparkConf, SparkContext}

object wordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Word Count Demo")
    val sc = new SparkContext(conf)
    val inputpath = "WordCount.txt"
    // read in text file and split each document into words
    val textFile = sc.textFile(inputpath)
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
    counts.foreach(println)
  }
}

1. Import both the SparkConf and SparkContext libraries.
2. Create a wordCount object, which is like a class in Java:
object wordCount { }
3. Inside the object write the main method:
def main(args: Array[String]) { }
4. Create a SparkConf() object and set the name of the application in setAppName:
val conf = new SparkConf().setAppName("Word Count Demo")
5. Create the SparkContext object by passing the conf object saved before as a parameter.
6. Provide the location of the input path.
7. Read the file as a text file.
8. Split each line using the space delimiter: line.split(" ").
9. Apply the flatMap() transformation to get an iterator over the words.
10. Create an RDD pair with the word as key and 1 as value for all the words using the following transformation:
map(word => (word, 1))
11. Combine the values with the same key to generate a (key, value) pair where the key is the word and the value is the count of that word in the given file:
reduceByKey(_ + _)
12. Print the output using the following command:
counts.foreach(println)
13. Save the program as wordCount.scala. Make sure that the object name and the file name are the same.
14. Follow the steps given in the section above to build and run the program.
15. Sample output is as follows:
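For instance, with a hypothetical WordCount.txt containing the single line "hello world hello", the printed (word, count) pairs would be, in no particular order:

(hello,2)
(world,1)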