Introduction to RapidMiner
Universität Mannheim - Paulheim: Data Mining I

Organisational Topics
- Exercise procedure
  - Presentation/discussion of tasks from the previous week
  - Recap/deepening of concepts from the previous lecture
  - Introduction to the new tasks
- Exercises will not be recorded
- I will not be talking for 1.5 hours straight! You:
  - present your solutions to the tasks
  - ask questions about lecture content, exercise tasks, RapidMiner, ...

RapidMiner
- A very comprehensive open-source data mining tool
- The data mining process is visually modeled as an operator chain
  - RapidMiner has over 400 built-in data mining operators
- RapidMiner provides a broad collection of charts for visualising data
- Project started in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the University of Dortmund, Germany
- Today: maintained by a commercial company plus open-source developers
- RapidMiner editions
  - Community Edition: free
  - Educational Edition: free for students and instructors
  - Enterprise Edition: commercial

Gartner: Data Science Platforms

Let’s have a look at RapidMiner
[Screenshot of the RapidMiner interface. Labels: Change Perspective, Execute Process, Process View, List of Operators, Operators Parameter View, Repository, Help View]
But let’s take it step by step.

How does it work?
- You visually design a data mining process
- A process is like a flow chart for mining operators
[Flow chart: Load data -> Do smart preprocessing -> Learn awesome model -> Evaluate performance -> Good?
  Yes ("It works!"): Apply model -> Get rich
  No ("Does not even beat flipping a coin, try again!"): try again]

Specifying a Process by Chaining Operators
[Figure: operators connected via their ports]
Common Port Names
Name   Meaning
out    Output
exa    Example Set
ori    Original Input
tra    Training Data
mod    Model
unl    Unlabelled Data
lab    Labelled Data
per    Performance

RapidMiner Operators: Loading Data
- Many operators to read data from files
- Output port labelled “out”
- Creates an Example Set
- An Example Set contains your data!
- The records are called Examples

Data in RapidMiner
- All data that you load will be contained in an example set
- Each example is described by Attributes (a.k.a. features)
- Attributes have Value Types
- Attributes have Roles
[Screenshot labels: Attribute Names, Value Types, Roles]

Data in RapidMiner
- Value types define how data is treated
  - Numeric data has an order (2 is closer to 1 than to 5)
  - Nominal data has no order (red is as different from green as from blue)
Value Type    Description
binominal     Only two different values are permitted
polynominal   More than two different values are permitted
numeric       For numerical values in general
integer       Whole numbers, positive and negative
real          Real numbers, positive and negative
date time     Date as well as time
date          Only date
time          Only time
text          Random free text without structure
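The difference between numeric and nominal value types can be sketched in a few lines of Python. This is only an illustration of the idea, not RapidMiner code; the function names numeric_distance and nominal_distance are made up for this example.

```python
def numeric_distance(a: float, b: float) -> float:
    """Numeric values have an order, so the size of the difference is meaningful."""
    return abs(a - b)


def nominal_distance(a: str, b: str) -> int:
    """Nominal values have no order: two values are either equal (0) or different (1)."""
    return 0 if a == b else 1


# Numeric: 2 is closer to 1 than to 5.
closer_to_one = numeric_distance(2, 1) < numeric_distance(2, 5)
print("2 is closer to 1 than to 5:", closer_to_one)

# Nominal: red is as different from green as from blue.
equally_different = nominal_distance("red", "green") == nominal_distance("red", "blue")
print("red is as different from green as from blue:", equally_different)
```

Treating a nominal attribute as numeric (for example, coding red=1, green=2, blue=3) would silently introduce an order that is not in the data, which is why the value type should be checked after loading.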

Data in RapidMiner
- Roles define how the attribute is treated by the Operators
Role         Description
Id           A unique identifier; no two examples in an example set can have the same value
Attribute    Regular attribute that contains data
Label        The target attribute for classification tasks
Cluster      Created by RapidMiner as the result of a clustering task
Prediction   Created by RapidMiner as the result of a classification task

The Repository
- This is where you store your data and processes
- Stores data and its meta data (!)
- RapidMiner can show you which attributes exist only if you load data from the repository
- Add data via the “Import Data” button or the “Store” operator
- Load data via drag ‘n’ drop or the “Retrieve” operator
If you have a question starting with
“Why does RapidMiner not show me ...?”
then the answer most likely is
“Because you did not load your data into the Repository!”

RapidMiner Operators: Pre-Processing
- Type and role conversions
  - “TypeA to TypeB”: change the type
  - “Set Role”: change the role
- Attribute set transformation
  - “Select Attributes”: remove attributes
  - “Generate Attributes”: create new attributes
- Value transformation
  - “Normalize”: transform all values to a certain range
- Filtering
  - “Filter Examples”: remove examples
- Aggregation
  - “Aggregate”: SQL-like aggregation (count, sum)
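What these pre-processing steps do to an example set can be sketched in plain Python. This is not how RapidMiner implements its operators; the toy data, the attribute names, and the choice of min-max scaling to [0, 1] for the normalisation step are all assumptions made for illustration.

```python
from collections import defaultdict

# Made-up toy example set: each dict is one example, each key one attribute.
examples = [
    {"id": 1, "course": "A", "points": 10.0, "bonus": 2.0},
    {"id": 2, "course": "A", "points": 30.0, "bonus": 0.0},
    {"id": 3, "course": "B", "points": 20.0, "bonus": 1.0},
]

# Like "Generate Attributes": create a new attribute from existing ones.
generated = [{**e, "total": e["points"] + e["bonus"]} for e in examples]

# Like "Select Attributes": keep only the attributes we want to use.
selected = [{k: e[k] for k in ("id", "course", "total")} for e in generated]

# Like "Normalize" (assumed here: min-max scaling to the range [0, 1]).
lo = min(e["total"] for e in selected)
hi = max(e["total"] for e in selected)
normalized = [{**e, "total_norm": (e["total"] - lo) / (hi - lo)} for e in selected]

# Like "Filter Examples": keep only the examples that satisfy a condition.
filtered = [e for e in normalized if e["total_norm"] >= 0.5]

# Like "Aggregate": SQL-like count and sum, grouped by course.
aggregated = defaultdict(lambda: {"count": 0, "sum": 0.0})
for e in selected:
    aggregated[e["course"]]["count"] += 1
    aggregated[e["course"]]["sum"] += e["total"]

print([e["id"] for e in filtered])
print(dict(aggregated))
```

Each step takes an example set and returns a new one, which mirrors how operators are chained in a process.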

How to find Operators
- The Operators Panel lets you browse all available operators
- You can search for operators by typing in the search bar
- You add operators by double-clicking or by dragging them onto the process view
Frequently Asked Questions – And their surprising answers
How can I ...?                       Type into the search bar!
Select which attributes to use?      Select Attributes
Filter out examples?                 Filter Examples
Read a CSV file?                     Read CSV
Learn a decision tree?               Decision Tree

How to use RapidMiner
- Use the “Design Perspective” to create your process
  - See your current process: “Process”
  - Access your data and processes: “Repository”
  - Add operators to the process: “Operators”
  - Configure the operators: “Parameters”
  - Learn about operators: “Help”
- Use the “Results Perspective” to inspect the output
  - The “Data View” shows your example set
  - The “Statistics View” contains meta data and statistics
  - The “Charts View” allows you to visualise the data

The Design View
[Screenshot of the Design View. Labels: Change View, Execute Process, Process View, List of Operators, Operators Parameter View, Repository, Help View]

The Results View - Data

The Results View - Statistics

The Results View - Charts

Data Visualisation
- Visualisation of data is one of the most powerful and appealing techniques for data exploration
- Humans have a well-developed ability to analyse large amounts of information that is presented visually
  - Can detect general patterns and trends
  - Can detect outliers and unusual patterns
Visualisation is the conversion of data into a visual format so that the characteristics of the data and the relationships among data items or attributes can be analysed.

Visualisation Techniques: Histogram
- Usually used to display the distribution of values of a single attribute
- Divide the values into bins and show a bar plot of the number of objects in each bin
- The height of each bar indicates the number of objects per bin
- Shape of histogram depends on the number of bins
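The binning behind a histogram can be sketched as follows. This is a minimal illustration assuming equal-width bins; the function name histogram_counts and the sample values are made up, and RapidMiner's own charts may bin differently.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    if width == 0:
        # All values are identical: everything lands in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Made-up attribute values, used only for illustration.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# The same data with a different number of bins gives a different shape.
for num_bins in (2, 4):
    counts = histogram_counts(values, num_bins)
    print(f"{num_bins} bins:")
    for i, count in enumerate(counts):
        print(f"  bin {i}: {'#' * count}")
```

Running the loop with two and with four bins shows why the number of bins is worth varying while exploring: with few bins, structure such as the isolated large value is hidden inside a wide bin.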

Visualisation Techniques: Scatter Charts
- Two-dimensional scatter charts are most commonly used
- Often additional attributes/dimensions are displayed by using the size, shape, and colour of the markers that represent the objects
- It is useful to have arrays of scatter charts that can compactly summarise the relationships of several pairs of attributes
- RapidMiner scatter charts
  - Scatter (single chart)
  - Scatter Multiple
  - Scatter Matrix
  - Scatter 3D

RapidMiner Chart: Scatter Matrix

RapidMiner Resources
- Download RapidMiner Studio: #downloads
- RapidMiner User Manuals:
- Open Access Book covering RapidMiner
  - Matthew North: Data Mining For The Masses, iningForTheMasses.pdf
- RapidMiner Forum and Discussion Groups:
- Video Tutorials
  - by Rapid-I:
  - by Neutral Market Trends:

Hands-on!
- Now start RapidMiner
- Load your first dataset
- Start exploring the data!

Examples for Data Profiling
Students Data Set
Course                 Taught in   # Students   Grade Range   Max. Attend
Algorithms I           HWS2010     5            1.7 – 5.0     12
Database Systems I     FSS2010     10           1.3 – 5.0     13
Database Systems II    HWS2010     7            1.0 – 5.0     13
Electronic Markets     FSS2010     10           1.0 – 3.0     13
Software Engineering   FSS2010     9            1.3 – 4.0     13
- Scatter Chart
  - Y-Axis: Course
  - X-Axis: try!
