Transcription

Data Warehousing and OLAP

Decision Support Systems Decision-support systems are used to makebusiness decisions, often based on data collectedby online transaction-processing systems Examples of business decisions:– What items to stock?– What insurance premium to change?– To whom to send advertisements? Examples of data used for making decisions– Retail sales transaction details– Customer profiles (income, age, gender, etc.)CMPT 354: Database I -- Data Warehousing and OLAP2

Data and Statistical Analysis Data analysis tasks are simplified byspecialized tools and SQL extensions– Example tasks For each product category and each region, whatwere the total sales in the last quarter and how dothey compare with the same quarter last year As above, for each product category and eachcustomer category Statistical analysis packages (e.g., SAS)can be interfaced with databasesCMPT 354: Database I -- Data Warehousing and OLAP3

Data Analysis and OLAP Online Analytical Processing (OLAP)– Interactive analysis of data, allowing data to besummarized and viewed in different ways in an onlinefashion (with negligible delay) Multidimensional data: data modeled as dimensionattributes and measure attributes– Dimension attributes: define the dimensions on whichmeasure attributes (or aggregates thereof) are viewed,e.g. the attributes item name, color, and size of thesales relation– Measure attributes: can be aggregated upon, e.g., theattribute number of the sales relationCMPT 354: Database I -- Data Warehousing and OLAP4

Pivot Table Values for one of the dimension attributes form the row headersValues for another dimension attribute form the column headersOther dimension attributes are listed on topValues in individual cells are (aggregates of) the values of thedimension attributes that specify the cellCMPT 354: Database I -- Data Warehousing and OLAP5

Relational Representation Cross-tabs can berepresented asrelations– The value all is used torepresent aggregates– All represents a set– The SQL:1999standard uses nullvalues in place of alldespite confusion withregular null valuesCMPT 354: Database I -- Data Warehousing and OLAP6

Data Cubes A data cube is a multidimensionalgeneralization of a cross-tab Can have n dimensions Cross-tabs can be used as views on a datacubeCMPT 354: Database I -- Data Warehousing and OLAP7

Online Analytical Processing Pivoting: changing the dimensions used in across-tab is called Slicing: creating a cross-tab for fixed values only– Sometimes called dicing, particularly when values formultiple dimensions are fixed Rollup: moving from finer-granularity data to acoarser granularity Drill down: The opposite operation - that ofmoving from coarser-granularity data to finergranularity dataCMPT 354: Database I -- Data Warehousing and OLAP8

Hierarchies on Dimensions Enable dimensions be viewed at different levels ofdetail– Dimension DateTime can be used to aggregate by hourof day, date, day of week, month, quarter or yearCMPT 354: Database I -- Data Warehousing and OLAP9

Cross Tabulation With HierarchyCMPT 354: Database I -- Data Warehousing and OLAP10

OLAP Implementation Multidimensional OLAP (MOLAP) systems– Multidimensional arrays in memory to store datacubes Relational OLAP (ROLAP) systems– Relational tables to store data cubes Hybrid OLAP (HOLAP) systems– Store some summaries in memory and store thebase data and other summaries in a relationaldatabaseCMPT 354: Database I -- Data Warehousing and OLAP11

Extended Aggregation inSQL:1999 The cube operation computes union of group by’son every subset of the specified attributesselect item-name, color, size, sum(number)from salesgroup by cube(item-name, color, size) Compute the union of eight different groupings ofthe sales relation: { (item-name, color, size), (itemname, color), (item-name, size), (color, size),(item-name), (color), (size), ( ) } For each grouping, the result contains the nullvalue for attributes not present in the groupingCMPT 354: Database I -- Data Warehousing and OLAP12

OLTP Versus OLAPOLTPOLAPusersclerk, IT professionalknowledge workerfunctionday to day operationsdecision supportDB ent, up-to-date, detailed, flatrelational Isolatedhistorical, summarized, multidimensionalintegrated, consolidatedusagerepetitivead-hocaccessread/write, index/hash on prim.keylots of scansunit of workshort, simple transactioncomplex query# DB size100MB-GB100GB-TBmetrictransaction throughputquery throughput, responseCMPT 354: Database I -- Data Warehousing and OLAP13

What Is a Data Warehouse? “A data warehouse is a subject-oriented,integrated, time-variant, and nonvolatilecollection of data in support ofmanagement’s decision-making process.”– W. H. Inmon Data warehousing: the process ofconstructing and using data warehousesCMPT 354: Database I -- Data Warehousing and OLAP14

Subject-Oriented Organized around major subjects, such ascustomer, product, sales Focusing on the modeling and analysis ofdata for decision makers, not on dailyoperations or transaction processing Providing a simple and concise view aroundparticular subject issues by excluding datathat are not useful in the decision supportprocessCMPT 354: Database I -- Data Warehousing and OLAP15

Integrated Integrating multiple, heterogeneous data sources– Relational databases, flat files, on-line transactionrecords Data cleaning and data integration– Ensure consistency in naming conventions, encodingstructures, attribute measures, etc. among different datasources E.g., Hotel price: currency, tax, breakfast covered, etc.– When data is moved to the warehouse, it is convertedCMPT 354: Database I -- Data Warehousing and OLAP16

Time Variant The time horizon for the data warehouse issignificantly longer than that of operationalsystems– Operational database: current value data– Data warehouse data: provide information from ahistorical perspective (e.g., past 5-10 years) Every key structure in the data warehouse– Contains an element of time, explicitly or implicitly– But the key of operational data may or may not contain“time element”CMPT 354: Database I -- Data Warehousing and OLAP17

Nonvolatile A physically separate store of datatransformed from the operationalenvironment Operational update of data does not occur inthe data warehouse environment– Does not require transaction processing,recovery, and concurrency control mechanisms– Requires only two operations in data accessing: initial loading of data and access of dataCMPT 354: Database I -- Data Warehousing and OLAP18

Data WarehousingCMPT 354: Database I -- Data Warehousing and OLAP19

Collecting Data Source driven architecture: data sources transmitnew information to a warehouse, eithercontinuously or periodically (e.g. at night) Destination driven architecture: a warehouseperiodically requests new information from datasources Keeping warehouse exactly synchronized withdata sources (e.g. using two-phase commit) is tooexpensive– Usually OK to have slightly out-of-date data atwarehouse– Data/updates are periodically downloaded form onlinetransaction processing (OLTP) systemsCMPT 354: Database I -- Data Warehousing and OLAP20

Design Issues Data cleansing– Correct mistakes in addresses (misspellings, zip codeerrors), and merge address lists from different sourcesand purge duplicates Update propagating– Warehouse schema may be a (materialized) view ofschema from data sources Summarizing data– Raw data may be too large to store on-line– Aggregate values (totals/subtotals) often suffice– Queries on raw data can often be transformed by queryoptimizer to use aggregate valuesCMPT 354: Database I -- Data Warehousing and OLAP21

Warehouse Schemas Dimension values are usually encodedusing small integers and mapped to fullvalues via dimension tables Resultant schema is called a star schema– More complicated schema structures Snowflake schema: multiple levels of dimensiontables Constellation: multiple fact tablesCMPT 354: Database I -- Data Warehousing and OLAP22

Data Warehouse SchemaCMPT 354: Database I -- Data Warehousing and OLAP23

Picture from publib.boulder.ibm.comSnowflake SchemaA star schema is asnowflake schemawhere eachdimension has onlyone singledimension tableCMPT 354: Database I -- Data Warehousing and OLAP24

Why Data Mining? Evolution of database technology– To collect a large amount of data Æ primitivefile processing– To store and query data efficiently Æ DBMS New challenges: huge amount of data, howto analyze and understand?– Data miningCMPT 354: Database I -- Data Warehousing and OLAP25

What Is Data Mining? Mining data – mining knowledge Data mining is the non-trivial process ofidentifying valid, novel, potentially useful,and ultimately understandable patterns indataCMPT 354: Database I -- Data Warehousing and OLAP26

The KDD erpretation/Patterns evaluationData miningTransformationPreprocessingSelectionTarget dataDataCMPT 354: Database I -- Data Warehousing and OLAP27

KDD Process Steps Preprocessing– Data cleaning– Data integration Data selectionData transformationData miningPattern evaluationKnowledge presentationCMPT 354: Database I -- Data Warehousing and OLAP28

Summary OLAP Data warehousing– Star schema– Snowflake schema Data mining and KDD processCMPT 354: Database I -- Data Warehousing and OLAP29

Statistical analysis packages (e.g., SAS) can be interfaced with databases. CMPT 354: Database I -- Data Warehousing and OLAP 4 Data Analysis and OLAP Online Analytical Processing (OLAP) . CMPT 354: Database I -- Data Warehousing and OLAP 7 Data Cubes A data cube is a multid