
Yugoslav Journal of Operations Research 15 (2005), Number 1, 125-145

DATA WAREHOUSING AND DATA MINING - A CASE STUDY

Milija SUKNOVIĆ, Milutin ČUPIĆ, Milan MARTIĆ
Faculty of Organizational Sciences, University of Belgrade, Belgrade, Serbia and Montenegro
[email protected], [email protected], [email protected]

Darko KRULJ
Trizon Group, Belgrade, Serbia and Montenegro
[email protected]

Received: August 2004 / Accepted: February 2005

Abstract: This paper presents the design and implementation of a data warehouse, as well as the application of data mining algorithms for knowledge discovery as a basic resource for an adequate business decision-making process. The project was realized for the needs of the Student's Service Department of the Faculty of Organizational Sciences (FOS), University of Belgrade, Serbia and Montenegro. The system represents a good basis for analyses and predictions in the coming period, in support of quality business decision-making by top management. The first part of the paper presents the steps in designing and developing the data warehouse of the mentioned business system. The second part presents the application of data mining algorithms for deducing rules, patterns, and knowledge as a resource for decision-making support.

Keywords: Decision support systems, data mining, data warehouse, MOLAP, regression trees, CART.

1. PREFACE

The permanently decreasing time available to react quickly and efficiently to new market trends is caused by increasing competition on the market. Companies become overloaded with complicated data, and those able to transform such data into useful information gain a competitive advantage.

It is well known that the strategic level of decision-making usually does not use business information on a daily basis, but rather cumulative and derived data from a specific time period. Since the problems solved in strategic decision-making are mostly non-structured, the decision-making process must consider large amounts of data from the elapsed period in order to ensure the quality of decisions. Therefore, the data warehouse and data mining concepts impose themselves as a good basis for business decision-making.

Moreover, the strategic level of business decision-making is usually accompanied by unstructured problems, which is why the data warehouse has become a basis for the development of business decision-making tools such as decision support systems.

The data warehouse, as a modern technological concept, has the role of integrating related data from the vital functions of a company in a form appropriate for carrying out various analyses.

2. DATA WAREHOUSE IMPLEMENTATION PHASES

The basic data warehouse (DW) implementation phases are [1]:

- Current situation analysis
- Selecting data interesting for analysis out of the existing database
- Filtering and reducing data
- Extracting data into a staging database
- Selecting the fact table, dimensional tables, and appropriate schemas
- Selecting measurements, percentages of aggregation, and storage modes
- Creating and using the cube

The description and thorough explanation of these phases follows.

2.1. Current situation analysis

The computer system of the FOS Student's Service Department was implemented at the beginning of the nineties, but it has been improved several times since then in order to adapt it to up-to-date requirements. This system fully satisfies the complex quality requirements of an OLTP system, but it shows significant OLAP shortcomings: data are not adequately prepared for forming complex reports. The system uses a dBASE V database, which cannot provide a broad range of possibilities for creating complex reports; dBASE V has no special tools for creating user-defined queries. Design documentation is the most important input when selecting the system information and data to be used for analysis. All vital information needed for warehouse implementation can often be found in the design documentation of the OLTP system. This phase is the one most neglected by designers of OLTP systems, which is why their solutions do not give users good possibilities for data analysis.

Since at this phase the feasibility of the solution can already be assessed, it represents a very important phase in warehouse design. Users often know

the problems better than the system designers, so their opinion is often crucial for good warehouse implementation.

2.2. Selecting data interesting for analysis out of the existing database

It is truly rare for the entire OLTP database to be used for warehouse implementation. The more frequent case is choosing a data subset which includes all the data relevant to the subject of the analysis. The first step in data filtering is detecting incorrect, wrongly entered, and incomplete data. Once such data are located, they must be corrected if possible or eliminated from further analysis.

2.3. Filtering and reducing data

The next step is searching for inappropriately formatted data. If such data exist, they have to be corrected and given the appropriate form. Data analysis does not need all the data, but only those related to a certain time period or some specific area. That is why data reduction is often applied.

2.4. Extracting data into the staging database

After reducing and filtering, the data are extracted into a staging database from which the data warehouse is built (Figure 1). If the OLTP database is already designed to support OLAP solutions, this step can be skipped.

The DTS package is written in the Data Transformation Services of SQL Server 2000. Package design is very important in DW implementation because packages can be scheduled to run automatically, so that DW system users always get fresh data.

Figure 1: DTS package based on [12]
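As a hedged illustration of this extraction step, the Python sketch below copies reduced and filtered exam-application rows from the operational database into a staging table. The DSNs, table names, and column names are hypothetical; the actual system used SQL Server 2000 DTS packages, for which this script is only a stand-in.

```python
# A minimal sketch of the extract/filter/reduce step. The data source
# names and the table/column names are assumptions, not the real schema;
# the paper's system used scheduled SQL Server 2000 DTS packages.
import pyodbc

OLTP_DSN = "DSN=StudentOLTP"        # assumed ODBC data sources
STAGING_DSN = "DSN=StudentStaging"

def extract_applications(school_year: int) -> None:
    src = pyodbc.connect(OLTP_DSN)
    dst = pyodbc.connect(STAGING_DSN)
    # Reduce: take only one school year; filter: drop rows with no mark.
    rows = src.cursor().execute(
        "SELECT student_id, exam_id, attempt_no, mark, exam_date "
        "FROM exam_application "
        "WHERE school_year = ? AND mark IS NOT NULL",
        school_year,
    ).fetchall()
    cur = dst.cursor()
    cur.executemany(
        "INSERT INTO stg_exam_application "
        "(student_id, exam_id, attempt_no, mark, exam_date) "
        "VALUES (?, ?, ?, ?, ?)",
        [tuple(r) for r in rows],
    )
    dst.commit()
```

In a production setting this logic would live inside the DTS package itself, so that the refresh runs automatically on a schedule, as described above.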

2.5. Selecting the fact table, dimensional tables, and appropriate schemas

The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. Figure 2 shows the schemas used in the implementation of data warehouse systems.

Figure 2: Data warehouse schemas based on [20]

The simplest scheme is the single-table scheme, which consists of a redundant fact table. The most common modeling paradigm according to [10] is the star schema, in which the data warehouse contains a large central fact table holding the bulk of the data with no redundancy, and a set of smaller attendant tables (dimension tables), one for each dimension. The snowflake schema is a variant of the star schema in which some dimension tables are normalized, thereby further splitting the data into additional tables. The galaxy schema is the most sophisticated one; it combines star and snowflake schemas.

Figure 3: Snowflake scheme from the Student's Service based on [19]
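To make the distinction concrete, the following Python sketch creates a miniature snowflake schema in SQLite. The table and column names are illustrative, not the actual Student's Service schema; normalizing the department out of the student dimension is what turns a star into a snowflake here.

```python
# Miniature snowflake schema in SQLite; all names are illustrative
# stand-ins, not the actual FOS Student's Service schema.
import sqlite3

ddl = """
CREATE TABLE dim_department (            -- normalized out of dim_student:
    department_id INTEGER PRIMARY KEY,   -- this extra split is what makes
    department_name TEXT);               -- the schema a snowflake

CREATE TABLE dim_student (
    student_id INTEGER PRIMARY KEY,
    gender TEXT,
    department_id INTEGER REFERENCES dim_department(department_id));

CREATE TABLE dim_exam (
    exam_id INTEGER PRIMARY KEY,
    exam_name TEXT,
    examiner TEXT);

CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,
    school_year TEXT,
    exam_period TEXT);

CREATE TABLE fact_exam_application (     -- the most detailed table
    student_id INTEGER REFERENCES dim_student(student_id),
    exam_id    INTEGER REFERENCES dim_exam(exam_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    attempt_no INTEGER,
    mark       INTEGER);                 -- source of the measures
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```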

Only the table that contains the most detailed data should be chosen as the fact table. The most detailed table in this project is the one with students' exam applications. The tables directly related to it can be treated as dimensional tables. Because of the complex structure of the data warehouse, the snowflake scheme shown in Figure 3 represents the best solution.

2.6. Selecting measurements, percentage of aggregation, and storage modes

The next step in designing the data warehouse is selecting measurements. In this case, two measurements are used: the total number of passed exams and the average mark achieved on passed exams.

In data warehouse implementation there is very often a need for calculated measurements, obtained through various arithmetic operations on other measurements. This system uses such an average, calculated as the ratio of the total of marks achieved on passed exams to the number of passed exams.

Data warehouse solutions use aggregations as results prepared in advance for user queries, which allows the queries to be answered very fast. The selection of an optimal percentage of aggregation is not simple for the designer of an OLAP system: increasing the percentage of aggregated data speeds up user-defined queries, but it also increases the memory space used.

From Figure 4 we can conclude that the optimal solution is 75% aggregation, which takes 50 MB of space.

Figure 4: The selection of the optimal percentage of aggregation
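The two measurements and the calculated average can be expressed directly over the fact table. The following pandas sketch, on made-up rows, computes the number of passed exams and the average mark as the ratio of the total of marks on passed exams to their count, exactly as the calculated measure is defined above.

```python
# Computing the two measures from a toy fact table; the rows are made up.
import pandas as pd

fact = pd.DataFrame({
    "department": ["IS", "IS", "MN", "MN", "MN"],
    "mark":       [8, 10, 7, 9, 5],      # 5 = failed, 6..10 = passed
})
passed = fact[fact["mark"] >= 6]
agg = passed.groupby("department")["mark"].agg(
    passed_exams="count",                # measure 1: number of passed exams
    total_mark="sum",
)
# Calculated measure: average mark = total of marks / number of passed exams.
agg["avg_mark"] = agg["total_mark"] / agg["passed_exams"]
print(agg)
```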

The most important factors that influence the choice of the storage mode are:

- the size of the OLAP base,
- the capacity of the storage facilities, and
- the frequency of data access.

The storage modes, shown in Figure 5, are:

- ROLAP (Relational OLAP),
- HOLAP (Hybrid OLAP), and
- MOLAP (Multidimensional OLAP).

ROLAP stores both data and aggregations in a relational system; it takes the least disk space but has the worst performance. HOLAP stores the data in a relational system and the aggregations in a multidimensional cube; it takes a little more space than ROLAP but performs better. MOLAP stores both data and aggregations in a multidimensional cube; it takes a lot of space but has the best performance. Since very complex queries are used in the analysis, it is rational to use MOLAP here.

Figure 5: Modes of data storing based on [15]

2.7. Creating and using the cube

The cube is created either on the client or on the server computer. The fundamental factors that influence the choice of the place for storing the cube are: the size of the cube, the number of cube users, the performance of the client and server computers, and the throughput of the system. The created cube can be used through various client tools.

Figure 6: Multidimensional cube

Figure 6 shows a three-dimensional cube that represents average grades by Department, Exam, and School year. In contrast to OLTP systems, which have to calculate the average grade on each request, data warehouse systems have the results prepared in advance and stored in multidimensional format.

Figure 7 shows the basic screen form, which offers the possibility of creating complex reports with the aim of making appropriate prognoses. The application provides for the creation of various user reports. The work of the faculty's management is made notably easier with respect to analyses of exam passing, e.g. by subject, examination period, examiner, etc.

An example of using the cube from MS Excel is shown in Figure 8, where we can see the average grade and the total number of passed exams by Professor, Direction, and Exam.

Figure 7: Data analysis through WEB application

Figure 8: Data analysis through Microsoft Excel application
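An everyday analogue of such a cube is a pivot table. As an illustration with invented data, the pandas sketch below precomputes average grades by Department x Exam x School year, the same aggregation that Figure 6 stores in multidimensional format.

```python
# A toy stand-in for the cube of Figure 6: average grade by
# Department x Exam x School year. All rows below are invented.
import pandas as pd

apps = pd.DataFrame({
    "department":  ["IS", "IS", "MN", "MN", "IS", "MN"],
    "exam":        ["Math", "Math", "Math", "Stats", "Stats", "Stats"],
    "school_year": ["2003/04", "2004/05", "2003/04",
                    "2004/05", "2003/04", "2003/04"],
    "mark":        [8, 9, 7, 10, 6, 8],
})
cube = apps.pivot_table(
    values="mark",
    index=["department", "exam"],   # two dimensions on the rows
    columns="school_year",          # third dimension on the columns
    aggfunc="mean",                 # the "average grade" measure
)
print(cube)
```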

3. FROM DATA WAREHOUSE TO DATA MINING

The previous part of the paper elaborated the design methodology and development of a data warehouse for a concrete business system. In order to make the data warehouse more useful, it is necessary to choose adequate data mining algorithms. These algorithms are described further in the paper, with the aim of describing the procedure of transforming data into business information, i.e. into discovered patterns that improve the decision-making process.

DM is a set of methods for data analysis created with the aim of finding specific dependences, relations, and rules in data, and rendering them as new, higher-quality information [2]. As distinguished from the data warehouse, which offers uniform access to data, DM gives results that show the relations and interdependences of data. The mentioned dependences are mostly based on various mathematical and statistical relations [3]. Figure 9 represents the process of knowledge discovery in data.

Figure 9: Process of knowledge discovery in data based on [10]

Data for this concrete research were collected from the internal database system of the Student's Service Department, and from external sources in the form of various technological documents, decisions, reports, lists, etc. After the selection of data for analysis, a DM method is applied, leading to appropriate rules of behavior and appropriate patterns. Knowledge about the observed features is presented by the discovered pattern. DM is also known in the literature as "knowledge extraction", "pattern analysis", or "data archaeology" [3].

3.1. Regression trees

A regression tree is, following [7], a nonparametric model which looks for the best local prediction, or explanation, of a continuous response through recursive partitioning of the predictor variables' space. The fitted model is usually displayed in a graph having the format of a binary decision tree, which grows from the root node to the terminal nodes, also called leaves.

Figure 10: Graphical display of a regression tree based on [7]

Figure 10 illustrates the features of a model provided by a regression tree that explains the relationship between a response variable $y$ and a set of two explanatory variables $x_1$ and $x_2$. In the example above, the predicted values for $y$ are obtained through a chain of logical statements that split the data into four subsets.
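Once such a tree has been fitted, prediction is just the evaluation of that chain of logical statements. The following minimal Python sketch, with invented thresholds and leaf values, mirrors a four-leaf tree of the kind shown in Figure 10:

```python
# Prediction with a fitted regression tree is a chain of logical
# statements; the split points and leaf constants below are invented.
def predict(x1: float, x2: float) -> float:
    if x1 <= 4.0:                          # root split on x1
        return 2.5 if x2 <= 1.0 else 4.1   # leaves 1 and 2
    return 6.0 if x2 <= 3.0 else 8.7       # leaves 3 and 4
```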

Mathematical formulation

Let $x_t = (x_{1t}, x_{2t}, \ldots, x_{mt})'$ be a vector which contains $m$ explanatory variables for a continuous univariate response $y_t$. The relationship between $y_t$ and $x_t$ follows the regression model:

$$y_t = f(x_t) + \varepsilon_t \qquad (3.1)$$

where the functional form $f$ is unknown and there are no assumptions about the random term $\varepsilon_t$. Following [14], a regression tree model with $k$ leaves is a recursive partitioning model that approximates $f$ by fitting functions in subregions of some domain $D \subseteq \mathbb{R}^m$, so that:

$$\hat{f}(x_t) = \sum_{i=1}^{k} \hat{f}_i(x_t) \, I_i(x_t) \qquad (3.2)$$

where $I_j(x_t)$ indicates the membership of the $t$-th observation in the $j$-th leaf, which constitutes a subregion of $D$. The functional form of $\hat{f}_i$ is usually taken to be a constant and, conditionally on the knowledge of the subregions, the relationship between $y$ and $x$ in (3.1) is approximated by a linear regression on a set of $k$ dummy variables. The authors in [5] argue, based on evidence, that there is not much gain in choosing $\hat{f}_i$ to be a linear or a more complex function of $x_t$.

3.1.1. CART regression tree approach

The most important reference for regression tree models is the CART (Classification and Regression Trees) approach of [6], so the discussion from now on is entirely based on this work. The top-down method of growing the tree implies first specifying a model associated with the simplest tree, as in Figure 11.

Figure 11: Simplest tree structure based on [7]

To locate the model parameters in the tree structure above, we adopt a labelling scheme similar to the one used in [8]. Here, the root node is at position 0, and a parent node at position $i$ generates the left-child and right-child nodes at positions $2i+1$ and $2i+2$, respectively. Therefore, a tree model equation for Figure 11, which fits a constant model in each leaf, may be written as:

$$y_t = \beta_{01} I_{01}(x_t, w_0, c_0) + \beta_{02} I_{02}(x_t, w_0, c_0) + \varepsilon_t \qquad (3.3)$$

$$I_{01}(\cdot) = \begin{cases} 1, & \text{if } w_0' x_t \le c_0 \\ 0, & \text{if } w_0' x_t > c_0 \end{cases} \qquad (3.4)$$

$$I_{02}(\cdot) = 1 - I_{01}(\cdot) \qquad (3.5)$$

In (3.4), the hyperplane $w_0' x_t = c_0$ which defines the binary partition of the variable space is obtained from a vector of weights $w_0 = (w_{01}, w_{02}, \ldots, w_{0m})'$ and a scalar threshold $c_0$.

The constant $\beta_{01}$ is the local model for the observations assigned to the left child node, i.e. those for which $I_{01}(\cdot) = 1$. An analogous interpretation is given to the parameter $\beta_{02}$; hence the model in (3.3) approximates the unknown function $f$ in (3.1) by a non-smooth, discontinuous function. In the CART context, it is usual to obtain the recursive partitioning by using hyperplanes that are perpendicular to the predictor variables' axes, which means that the vector $w_0$ is composed of $m-1$ zeros and a unity element in the $S_0$-th position. This simplifies the specification of the model, allowing a less expensive search in the continuous space of hyperplanes. For practical purposes, this means replacing the vector of parameters $w_0$ in (3.3) by a scalar parameter $S_0$ (the index of the splitting variable), which takes values in the set of integers $S = \{1, 2, \ldots, m\}$, so that:

$$I_{01}(\cdot) = \begin{cases} 1, & \text{if } x_{S_0 t} \le c_0 \\ 0, & \text{if } x_{S_0 t} > c_0 \end{cases} \qquad (3.6)$$

$$I_{02}(\cdot) = 1 - I_{01}(\cdot) \qquad (3.7)$$

By omitting the arguments of the indicator functions, more complex tree structures are represented by the equations that accompany Figure 12:

$$y_t = \beta_{02} I_{02} + (\beta_{11} I_{11} + \beta_{12} I_{12}) I_{01} + u_t$$

$$y_t = (\beta_{11} I_{11} + \beta_{12} I_{12}) I_{01} + (\beta_{21} I_{21} + \beta_{22} I_{22}) I_{02} + u_t$$

Figure 12: Regression trees with 3 and 4 terminal nodes based on [7]

Here $\beta_{ij}$ and $I_{ij}$ denote, respectively, the constant model and the indicator function for the subset of observations located at the node at position $i$ that are assigned to child node $j$ ($j = 1$ left, $j = 2$ right).

3.1.2. Growing the tree by the CART algorithm

The tree growing induction can be seen as an iterative and recursive nonparametric modeling cycle that at each iteration specifies a tree architecture by selecting a node to be split, a splitting variable, and a threshold. Then, local models are

estimated for the observations assigned to the generated nodes. This procedure is repeated recursively in the created nodes until a point is reached where there is no gain in partitioning the observed data. The final model can be evaluated through cost-complexity measures and re-specified by cutting some branches of the tree. The cycle starts with the specification of the simplest model (Figure 11) by selecting a splitting variable $x_{S_0}$ and a threshold $c_0$ and estimating the parameters $\beta_{01}$ and $\beta_{02}$. The selection and estimation procedures are carried out simultaneously, by exhaustively searching for the pair $(S_0, c_0)$ that minimizes the sum of squared errors or the sum of absolute deviations for the model in equation (3.3). The tree obtained in the former case is called the LS (Least Squares) regression tree. Conditionally on $(S_0, c_0)$, it is straightforward to show that the best estimators for $\beta_{01}$ and $\beta_{02}$, in the least squares sense, are the sample means of the response variable within the child nodes. The tree grown by minimizing the sum of absolute deviations is called the LAD (Least Absolute Deviations) regression tree, and it can be shown that in this case the best estimator is the sample median of the response variable within the terminal nodes. More about absolute deviation regression can be found in [16] and [4]. After splitting the root node (node 0), and if we use the least squares criterion, the tree model may be re-specified by selecting the tree architecture on the left side of Figure 12.

The recursive partitioning implies that all parameters not involved in the current split are kept fixed. Thus, in the equation on the left side of Figure 12, the estimates of $\beta_{11}$ and $\beta_{12}$ are found by selecting the pair $(S_1, c_1)$ that minimizes the overall sum of squared errors conditionally on $S_0$, $c_0$, and $\beta_{02}$. The specification/estimation process continues by maximizing the decrease in the sum of squared errors when adding a terminal node. This procedure naturally forces the sum of squared errors to decrease at each iteration, so it becomes necessary to establish a stopping rule that decides whether a generated node shall be split recursively or declared terminal; otherwise the procedure leads to overfitting. The authors in [6] suggest that a generated node containing five or fewer observations shall be declared terminal. To reduce the complexity of the tree, a final diagnostic check can be performed through a pruning technique.

We present in Figures 13 and 14 a version of the CART algorithm for fitting a regression tree model. The tree growing algorithm is shown in Figure 13, while Figure 14 highlights the node splitting procedure. In the node at the $i$-th position, the algorithm executes an exhaustive search to find the index $S_i$ of the univariate splitting variable and the threshold $c_i$ that provide the minimum value of a loss function $L$, usually the sum of squared errors or of absolute deviations. A node is declared terminal if its number of observations is five or fewer. According to the adopted labelling scheme, we control the status of the $i$-th node by assigning 0 (test node) or 1 (terminal node). During the execution of the algorithm, the status of a node can change from "terminal" (1) to "test" (0), and node positions that have to be skipped are labelled with -1.
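As a hedged illustration of this procedure, the following Python sketch implements the LS variant described above: an exhaustive search over the pair $(S_i, c_i)$ that minimizes the sum of squared errors, sample means as the local leaf models, and the rule that a node with five or fewer observations is declared terminal. It is a compact reading of Figures 13 and 14, not the authors' original code, and it omits pruning and the node status bookkeeping.

```python
# A simplified LS regression tree grower in the spirit of Figures 13-14:
# exhaustive search over (splitting variable, threshold), sample means as
# leaf models, and the "<= 5 observations => terminal" stopping rule.
import numpy as np

def grow(X: np.ndarray, y: np.ndarray, min_obs: int = 5):
    """Return a nested dict tree for an (n, m) design matrix X."""
    if len(y) <= min_obs:
        return {"leaf": True, "value": float(y.mean())}  # LS leaf = mean
    best = None
    base_sse = float(((y - y.mean()) ** 2).sum())
    for s in range(X.shape[1]):                # candidate split variables
        for c in np.unique(X[:, s])[:-1]:      # candidate thresholds
            left = X[:, s] <= c
            yl, yr = y[left], y[~left]
            sse = float(((yl - yl.mean()) ** 2).sum()
                        + ((yr - yr.mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, s, float(c), left)
    if best is None or best[0] >= base_sse:    # no gain in partitioning
        return {"leaf": True, "value": float(y.mean())}
    sse, s, c, left = best
    return {"leaf": False, "var": s, "threshold": c,
            "left": grow(X[left], y[left], min_obs),
            "right": grow(X[~left], y[~left], min_obs)}

def predict(tree, x):
    while not tree["leaf"]:
        branch = "left" if x[tree["var"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]
```

Given a design matrix X of shape (n, m) and a response y, `tree = grow(X, y)` fits the model and `predict(tree, x)` returns the constant of the leaf into which x falls.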

Figure 13: General form of the CART algorithm based on [7]

Figure 14: CART splitting algorithm based on [7]

3.2. Illustrated examples of DM

More than one hundred cluster models and decision tree models have been implemented in this project. For practical reasons, only some representative examples are presented here. In the implementation we used Microsoft Decision Trees (MDT), which is based on the CART algorithm [17] described in Section 3.1.

The first example is the evaluation of a student's mark based on the following criteria: exam, department, number of attempts, and gender of the student. The MDT, represented in Figure 15, chooses the number of attempts as the most significant attribute for the first level. After that, the algorithm chooses department and gender as the next attributes by importance. At the end, there are evaluations of the expected marks depending on the values of the analyzed attributes.

Figure 15: MDT for evaluation of marks

Based on the decision tree, the following can be concluded: students who pass an exam in a smaller number of attempts get better marks, and students from the first and second year of studies (Common Bases Department, CB) get worse marks than students from the third and fourth year.

Figure 16 represents the correlation between the criteria for evaluating the attained mark. By moving the cursor to the left-hand side of the figure, the intensity of the correlation between the observed criteria is displayed. It is thus easily perceived that the obtained mark depends least on the student's gender, somewhat more on the student's department, and most on the number of attempts.
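MDT itself is part of Analysis Services and is not scriptable here; as a rough, hedged equivalent of this first example, a CART-style regressor can be fitted on the same four attributes. In the sketch below the rows are invented stand-ins for the warehouse data, and scikit-learn's DecisionTreeRegressor replaces MDT.

```python
# Approximating the mark-evaluation example with a CART regressor;
# the rows below are invented stand-ins for the warehouse data.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

data = pd.DataFrame({
    "attempts":   [1, 1, 2, 3, 4, 2, 1, 5],
    "department": ["CB", "IS", "CB", "CB", "MN", "IS", "MN", "CB"],
    "gender":     ["F", "M", "F", "M", "M", "F", "M", "F"],
    "mark":       [10, 9, 8, 7, 6, 8, 9, 6],
})
X = pd.get_dummies(data[["attempts", "department", "gender"]])
model = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2)
model.fit(X, data["mark"])
print(export_text(model, feature_names=list(X.columns)))
```

On data of this shape, the printed tree makes the attribute ordering visible in the same way Figure 15 does: the variable chosen at the root is the one the predicted mark depends on most.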

Figure 16: Analysis of correlation between criteria for evaluation of the attained mark

Figure 17: MDT for evaluation of the length of the studying period based on [13]

The second example represents an MDT for evaluating the length of the studying period (Figure 17).

The relevant criteria for decision-making are: first, the year of studies observed; second, the observed average student's mark; and then, for one category, the student's department. The total population observed in this analysis is 5524 students (Figure 17).

Table 1: Interval limits of the time of studying [13]

Interval | Span (years) | Number of students | %
I | up to 0.856 | 1627 | 29.45
II | 0.86-2.758 | 1495 | 27.06
III | 2.76-4.760 | 1248 | 22.59
IV | (not recoverable from the source) | 1152 | 20.85
V | noise | 2 | 0.04

Accordingly, out of the total number of observed students, 1627 students or 29.45% belong to the first interval, 1495 students or 27.06% to the second, 1248 students or 22.59% to the third, 1152 students or 20.85% to the fourth, and 2 students or 0.04% to the fifth interval. The two students with wrongly entered data belong to the fifth interval.

For example, if it is necessary to evaluate the length of studies of a second-year student whose average mark is over 7.70, it can be claimed that his length of studies will be up to 2.758 years, with a probability of 0.2592, i.e. 25.92%.

Figure 18 shows the correlation between the criteria for evaluating the length of studies. Moving the cursor to the left-hand side of the figure reveals the intensity of the correlation of the observed criteria. It is easily perceived that the length of studies depends least on the student's department and gender, while it depends more on the average mark and the current student's status.

Figure 18: Analysis of dependence between criteria for evaluation of the time of studying based on [18]

The next example shows clustering by the criterion of average mark for the students whose data are in the data warehouse.

Table 2: Average student's mark based on [18]

Student type | Cluster No. | Avg. mark | Number of students | Min. avg | Max. avg
The worst | 1 | without passed exams | 1176 | - | -
Poor | 2 | 6.54 | 1760 | 6.00 | 7.093
Average | 3 | 7.32 | 1140 | 7.095 | 7.556
Very good | 4 | 7.77 | 1080 | 7.558 | 8.000
The best | 5 | | 1121 | 8.023 | 10.000

From Table 2 we can see that 1176 students belong to the first category, without passed exams and therefore without the possibility of having an average mark calculated. Students with an average mark of 6.54, i.e. within the interval from 6.00 to 7.093, belong to the second category, 1760 of them in total. 1140 students with a 7.32 average mark, i.e. within the interval from 7.095 to 7.556, belong to the third category. 1080 students with a 7.77 average mark, i.e. within the interval from 7.558 to 8.000, belong to the fourth category. The fifth category comprises the students with the best average marks, within the interval from 8.023 to 10.00, 1121 of them in total.

The last example is clustering students by the criterion of length of studies.

Table 3: The results of clustering by time of studying for bachelor students based on [18]

Length of studies | Cluster No. | Count of students | Avg. length of studying (years) | Min. time (years) | Max. time (years) | %
Short | 1 | 84 | 4.67 | 4.03 | 5.30 | 25.15
Medium | 2 | 83 | 5.69 | 5.31 | 6.09 | 24.85
Long | 3 | 84 | 6.44 | 6.091 | 6.79 | 25.15
Very long | 4 | 83 | 8.17 | 6.80 | 9.55 | 24.85

The result of this clustering comprises four categories. The first category consists of the students with the shortest length of studies: the average is 4.67 years, and the number of these students, within the interval of average length of studies from 4.03 to 5.30 years, is 84, or 25.15%. The second category comprises students with an average length of studies of 5.69 years, within the interval from 5.31 to 6.09 years, 83 of them or 24.85% in total, etc. It should be noted that the observed data refer to bachelor students over a period of not more than 10 years, so the interval of the last category is from 6.80 years to approximately 10 years.
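As a rough open-source stand-in for the clustering behind Table 2, k-means can be run on the single average-mark attribute. The values below are invented, and students without passed exams (cluster 1 in Table 2) would be set aside before clustering.

```python
# Rough stand-in for the clustering of Table 2: k-means on average
# marks; values are invented, and students without passed exams are
# assumed to be excluded beforehand, as in cluster 1.
import numpy as np
from sklearn.cluster import KMeans

avg_marks = np.array([6.1, 6.5, 7.0, 7.2, 7.4, 7.6, 7.9, 8.4, 9.1, 9.8])
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(avg_marks.reshape(-1, 1))
for c in range(4):
    members = avg_marks[labels == c]
    # min/max delimit the cluster interval, mean gives the cluster average,
    # mirroring the Min. avg / Max. avg / Avg. mark columns of Table 2
    print(c, members.min(), members.max(), round(members.mean(), 2))
```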

4. IMPLEMENTATION

In the implementation of DW and DM we used MS SQL Server 2000, SQL Server 2000 DTS services, OLAP Services 8.0, Excel 2000, ProClarity, and a Web application. Figure 19 shows the schema of the data warehouse and data mining solution for the student data service.

Figure 19: Scheme of the data warehouse and data mining solution based on [11]

In conclusion, the data warehouse is a very flexible solution fitting end-user purposes: with everyday tools (e.g. Microsoft Excel) the end user can explore the database more efficiently than with any tool from the OLTP environment. A significant advantage of this approach to discovering knowledge and information in databases is that the user does not have to possess knowledge of the relational model and complex query languages.

It is obvious that MDT algorithms give remarkable results in analyses aimed at good prediction. MDT is based on the probabilities of the various attributes and is used when prediction of cases is necessary. The MDT algorithm also generates rules: every node in the tree can be expressed as a set of rules that describes it, as well as the nodes that led to it. The clusters, on the other hand, are based on statistics of the various attributes; they provide for the creation of a model that is not used for predictions but is very efficient in finding records with common attributes so that they can be put in the same group.

MDT enables the user to analyze a great number of various DM problems with the aim of timely prediction. Using the good elements that follow from the prediction, it is possible to make good business plans and lead the business system toward the benchmark. It can be said with satisfaction that this method of data analysis and decision-making is becoming more and more popular in solving new problems.

5. COMPARATIVE ANALYSIS

The following table contains comparative features of the existing and the new solution.

Table 4: Comparative features of the existing and the new solution based on [11]

Feature | Existing system | New system
Memory space | 100 MB | 150 MB
Avg. query time | 3 sec | 0.5 sec
User-defined queries | NO | YES
Detection of faulty data | NO | YES
Sophisticated analysis | NO | YES
Used for | Data processing and data analysis | Data analysis
Data security | NO | YES
Data granularity | Detailed data | Aggregated data
Dependency analysis | NO | YES
Complex statistical analysis | NO | YES
Current technique | Relational databases | Data warehouse and data mining

Referring to Table 4, we can conclude that the features of the new system are much better in the sense of data analysis; the only disadvantage is that it takes up more memory space for storing the aggregated data.
