
Data Mining and the Case for Sampling
Solving Business Problems Using SAS Enterprise Miner Software
A SAS Institute Best Practices Paper

Table of Contents

ABSTRACT
THE OVERABUNDANCE OF DATA
DATA MINING AND THE BUSINESS INTELLIGENCE CYCLE
THE SEMMA METHODOLOGY
HOW LARGE IS “A LARGE DATABASE?”
PROCESSING THE ENTIRE DATABASE
PROCESSING A SAMPLE
THE STATISTICAL VALIDITY OF SAMPLING
SIZE AND QUALITY DETERMINE THE VALIDITY OF A SAMPLE
RANDOMNESS: THE KEY TO QUALITY SAMPLES
CURRENT USES OF SAMPLING
WHEN SAMPLING SHOULD NOT BE USED
MYTHS ABOUT SAMPLING
SAMPLING AS A BEST PRACTICE IN DATA MINING
PREPARING THE DATA FOR SAMPLING
COMMON TYPES OF SAMPLING
DETERMINING THE SAMPLE SIZE
GENERAL SAMPLING STRATEGIES
USING SAMPLE DATA FOR TRAINING, VALIDATION, AND TESTING
SAMPLING AND SMALL DATA TABLES
CASE STUDY: USING SAMPLING IN CHURN ANALYSIS
STEP 1: ACCESS THE DATA
STEP 2: SAMPLE THE DATA
STEP 3: PARTITION THE DATA
STEP 4: DEVELOP A MODEL
STEP 5: ASSESS THE RESULTS
SUMMARY OF CASE STUDY RESULTS
SAS INSTITUTE: A LEADER IN DATA MINING SOLUTIONS
REFERENCES
RECOMMENDED READING
DATA MINING
DATA WAREHOUSING
STATISTICS
CREDITS

Figures

Figure 1: The Data Mining Process and the Business Intelligence Cycle
Figure 2: Steps in the SEMMA Methodology
Figure 3: How Sampling Size Affects Validity
Figure 4: How Samples Reveal the Distribution of Data
Figure 5: Example Surface Plots for Fitted Models; Regression, Decision Tree, and Neural Network
Figure 6: Churn Analysis — Steps, Actions, and Nodes
Figure 7: Process Flow Diagram for the Customer Churn Project
Figure 8: Input Data - Interval Variable
Figure 9: Input Data - Class Variables
Figure 10: Percentage of Churn and No Churn
Figure 11: General Dialog Page
Figure 12: Stratification Variables Dialog Page
Figure 13: Stratification Criteria Dialog Page
Figure 14: Sampling Results Browser
Figure 15: Data Partition Dialog Page
Figure 16: Regression Results
Figure 17: Diagnostic Chart for Validation Data
Figure 18: Diagnostic Chart for Test Data
Figure 19: Incremental Sample Size and the Correct Classification Rates
Figure 20: Comparison of Sample Sizes

Abstract

Industry analysts expect the use of data mining to sustain double-digit growth into the 21st century. One recent study, for example, predicts the worldwide statistical and data mining software market to grow at a compound annual growth rate of 16.1 percent over the next five years, reaching 1.13 billion in the year 2002 (International Data Corporation 1998, #15932).

Many large- to mid-sized organizations in the mainstream of business, industry, and the public sector already rely heavily on the use of data mining as a way to search for relationships that would otherwise be “hidden” in their transaction data. However, even with powerful data mining techniques, it is possible for relationships in data to remain hidden due to the presence of one or more of the following conditions:

• data are not properly aggregated
• data are not prepared for analysis
• relationships in the data are too complex to be seen readily via human observation
• databases are too large to be processed economically as a whole.

All of these conditions are complex problems that present their own unique challenges. For example, organizing data by subject into data warehouses or data marts can solve problems associated with aggregation.1 Data that contain errors, missing values, or other problems can be cleaned in preparation for analysis.2 Relationships that are counter-intuitive or highly complex can be revealed by applying predictive modeling techniques such as neural networks, regression analysis, and decision trees, as well as exploratory techniques like clustering, associations, and sequencing. However, processing large databases en masse is another story — one that carries with it its own unique set of problems.

This paper discusses the use of sampling as a statistically valid practice for processing large databases by exploring the following topics:

• data mining as a part of the “Business Intelligence Cycle”
• sampling as a valid and frequently used practice for statistical analyses
• sampling as a best practice in data mining
• a data mining case study that relies on sampling.

For those who want to study further the topics of data mining and the use of sampling to process large amounts of data, this paper also provides references and a list of recommended reading material.

1 Accessing, aggregating, and transforming data are primary functions of data warehousing. For more information on data warehousing, see the “Recommended Reading” section in this paper.
2 Unscrubbed data and similar terms refer to data that are not prepared for analysis. Unscrubbed data should be cleaned (scrubbed, transformed) to correct errors such as missing values, inconsistent variable names, and inconsequential outliers before being analyzed.

The Overabundance of Data

In the past, many businesses and other organizations were unable or unwilling to store their historical data. Online transaction processing (OLTP) systems, rather than decision support systems, were key to business. A primary reason for not storing historical data was the fact that disk space was comparatively more expensive than it is now. Even if the storage space was available, IT resources often could not be spared to implement and maintain enterprise-wide endeavors like decision support systems.

Times have changed. As disk storage has become increasingly affordable, businesses have realized that their data can, in fact, be used as a corporate asset for competitive advantage. For example, customers’ previous buying patterns often are good predictors of their future buying patterns. As a result, many businesses now search their data to reveal those historical patterns.

To benefit from the assets bound up in their data, organizations have invested numerous resources to develop data warehouses and data marts. The result has been substantial returns on these kinds of investments. However, now that affordable systems exist for storing and organizing large amounts of data, businesses face new challenges. For example, how can hardware and software systems sift through vast warehouses of data efficiently? What process leads from data, to information, to competitive advantage?

While data storage has become cheaper, CPU, throughput, memory management, and network bandwidth continue to be constraints when it comes to processing large quantities of data. Many IT managers and business analysts are so overwhelmed with the sheer volume, they do not know where to start. Given these massive amounts of data, many ask, “How can we even begin to move from data to information?” The answer is in a data mining process that relies on sampling, visual representations for data exploration, statistical analysis and modeling, and assessment of the results.

Data Mining and the Business Intelligence Cycle

During 1995, SAS Institute Inc. began research, development, and testing of a data mining solution based on our world-renowned statistical analysis and reporting software — the SAS System. That work, which resulted in the 1998 release of SAS Enterprise Miner software, taught us some important lessons.3 One lesson we learned is that data mining is a process that must itself be integrated within the larger process of business intelligence. Figure 1 illustrates the role data mining plays in the business intelligence cycle.

[Figure 1: The Data Mining Process and the Business Intelligence Cycle. Diagram elements: Business Questions; Data Warehouse/DBMS; Transform Data Into Information; Data Mining Process; EIS, Business Reporting, Graphics; Act on Information.]

Integrating data mining activities with the organization’s data warehouse and business reporting systems enables the technology to fit within the existing IT infrastructure while supporting the organization’s larger goals of identifying business problems, transforming data into information, acting on the information, and assessing the results.

3 According to the META Group, “The SAS Data Mining approach provides an end-to-end solution, in both the sense of integrating data mining into the SAS Data Warehouse, and in supporting the data mining process. Here, SAS is the leader” (META Group 1997, file #594).

The SEMMA Methodology

SAS Institute defines data mining as the process used to reveal valuable information and complex relationships that exist in large amounts of data. Data mining is an iterative process — answers to one set of questions often lead to more interesting and more specific questions. To provide a methodology in which the process can operate, SAS Institute further divides data mining into five stages that are represented by the acronym SEMMA.

Beginning with a statistically representative sample of data, the SEMMA methodology — which stands for Sample, Explore, Modify, Model, and Assess — makes it easy for business analysts to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model’s accuracy. Here is an overview of each step in the SEMMA methodology:

• Sample the data by creating one or more data tables.4 The samples should be big enough to contain the significant information, yet small enough to process quickly.
• Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
• Modify the data by creating, selecting, and transforming the variables to focus the model selection process.
• Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.
• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.

SEMMA is itself a cycle; the internal steps can be performed iteratively as needed. Figure 2 illustrates the tasks of a data mining project and maps those tasks to the five stages of the SEMMA methodology.

[Figure 2: Steps in the SEMMA Methodology]

Projects that follow SEMMA can sift through millions of records5 and reveal patterns that enable businesses to meet data mining objectives such as:

• Segmenting customers accurately into groups with similar buying patterns
• Profiling customers for individual relationship management
• Dramatically increasing response rate from direct mail campaigns
• Identifying the most profitable customers and the underlying reasons
• Understanding why customers leave for competitors (attrition, churn analysis)
• Uncovering factors affecting purchasing patterns, payments and response rates
• Increasing profits by marketing to those most likely to purchase
• Decreasing costs by filtering out those least likely to purchase
• Detecting patterns to uncover non-compliance.

4 The terms data table and data set are synonymous.
5 Record refers to an entire row of data in a data table. Synonyms for the term record include observation, case, and event. Row refers to the way data are arranged horizontally in a data table structure.
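The five stages read naturally as a pipeline. The sketch below is a minimal illustration in Python with pandas and scikit-learn, not the SAS Enterprise Miner implementation; the data table, the variable names (tenure, monthly_spend, churn), the 2 percent sampling rate, and the logistic regression model are hypothetical choices made only for this example. The case study later in this paper carries out the same stages with Enterprise Miner nodes.

    # Illustrative only: a rough Python walk-through of the five SEMMA stages.
    # Table and variable names are hypothetical.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1_000_000
    full = pd.DataFrame({
        "tenure": rng.integers(1, 120, n),            # months as a customer
        "monthly_spend": rng.gamma(2.0, 30.0, n),     # skewed spending amounts
        "churn": (rng.random(n) < 0.10).astype(int),  # binary target
    })

    # Sample: keep enough records to hold the significant information.
    sample = full.sample(frac=0.02, random_state=42)

    # Explore: distributions and class proportions guide the later steps.
    print(sample.describe())
    print(sample["churn"].value_counts(normalize=True))

    # Modify: create or transform variables to focus the model.
    sample = sample.assign(log_spend=np.log1p(sample["monthly_spend"]))

    # Model: fit a predictive model on the sampled, prepared data.
    X = sample[["tenure", "log_spend"]]
    y = sample["churn"]
    X_train, X_holdout, y_train, y_holdout = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Assess: judge usefulness and reliability on data the model has not seen.
    print("holdout accuracy:", accuracy_score(y_holdout, model.predict(X_holdout)))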

How Large Is “A Large Database?”

To find patterns in data such as identifying the most profitable customers and the underlying reasons for their profitability, a solution must be able to process large amounts of data. However, defining “large” is like trying to hit a moving target; the definition of “a large database” is changing as fast as the enabling technology itself is changing. For example, the prefix tera, which comes from the Greek teras meaning “monster,” is used routinely to identify databases that contain approximately one trillion bytes of data. Many statisticians would consider a database of 100,000 records to be very large, but data warehouses filled with a terabyte or more of data such as credit card transactions with associated demographics are not uncommon. Performing routine statistical analyses on even a terabyte of data can be extremely expensive and time consuming. Perhaps the need to work with even more massive amounts of data such as a petabyte (2^50 bytes) is not that far off in the future (Potts 1997, p. 10).

So what can be done when the volume of data grows to such massive proportions? The answer is deceptively simple: either try to process the entire database, or process only a sample of it.

Processing the Entire Database

To move from massive amounts of data to business intelligence, some practitioners argue that automated algorithms with faster and faster processing times justify processing the entire database. However, no single approach solves all data mining problems. Instead, processing the entire database offers both advantages and disadvantages depending on the data mining project.

Advantages

Although data mining often presupposes the need to process very large databases, some data mining projects can be performed successfully when the databases are small. For example, all of the data could be processed when there are more variables6 than there are records. In such a situation, there are statistical techniques that can help ensure valid results, in which case an advantage of processing the entire database is that enough richness can be maintained in the limited, existing data to ensure a more precise fit.

In other cases, the underlying process that generates the data may be rapidly changing, and records are comparable only over a relatively short time period. As the records age, they might lose value. Older data can become essentially worthless. For example, the value of sales transaction data associated with clothing fads is often short lived. Data generated by such rapidly changing processes must be analyzed often to produce even short-term forecasts.

Processing the entire database also can be advantageous in sophisticated exception reporting systems that find anomalies in the database or highlight values above or below some threshold level that meet the selected criteria.

6 Variable refers to a characteristic that defines records in a data table such as a variable B DATE, which would contain customers’ birth dates. Column refers to the way data are arranged vertically within a data table structure.
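As a rough sketch of the exception-reporting idea just described, the following Python fragment computes a summary performance measure per group across every record and highlights the groups that fall beyond an acceptable threshold. The table, column names, and threshold are hypothetical.

    # Illustrative only: flag groups whose summary performance measure falls
    # outside an acceptable range. Table, columns, and threshold are hypothetical.
    import pandas as pd

    transactions = pd.DataFrame({
        "store_id": [101, 101, 101, 102, 102, 102, 103, 103],
        "returned": [0,   1,   0,   0,   0,   0,   1,   1],
    })

    # Summarize the measure per store across every record in the table.
    return_rate = transactions.groupby("store_id")["returned"].mean()

    # Highlight values above the selected threshold.
    THRESHOLD = 0.25
    print(return_rate[return_rate > THRESHOLD])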

If the solution to the business problem is tied to one record or a few records, then to find that subset, it may be optimal to process the complete database. For example, suppose a chain of retail paint stores discovers that too many customers are returning paint. Paint pigments used to mix the paints are obtained from several outside suppliers. Where is the problem? With retailers? With customers? With suppliers? What actions should be taken to correct the problem? The company maintains many databases consisting of various kinds of information about customers, retailers, suppliers, and products. Routine anomaly detection (processing that is designed to detect whether a summary performance measure is beyond an acceptable range) might find that a specific store has a high percentage of returned paint. A subsequent investigation discovers that employees at that store mix pigments improperly. Clearer instructions could eliminate the problem. In a case like this one, the results are definitive and tied to a single record. If the data had been sampled for analysis, then that single, important record might not have been included in the sample.

Disadvantages

Processing the entire database affects various aspects of the data mining process, including the following:

Inference/Generalization

The goal of inference and predictive modeling is to successfully apply findings from a data mining system to new records. Data mining systems that exhaustively search the databases often leave no data from which to develop inferences. Processing all of the data also leaves no holdout data with which to test the model for explanatory power on new events. In addition, using all of the data leaves no way to validate findings on data unseen by the model. In other words, there is no room left to accomplish the goal of inference.

Instead, holdout samples must be available to ensure confidence in data mining results. According to Elder and Pregibon, the true goal of most empirical modeling activities is “to employ simplifying constraints alongside accuracy measures during model formulation in order to best generalize to new cases” (1996, p. 95).

Occasionally, concerns arise about the way sampling might affect inference. This concern is often expressed as a belief that a sample might miss some subtle but important niches — those “hidden nuggets” in the database. However, if a niche is so tiny that it is not represented in a sample and yet so important as to influence the big picture, the niche can be discovered, either by automated anomaly detection or by using appropriate sampling methods. If there are pockets of important information hidden in the database, application of the appropriate sampling technique will reveal them and will process much faster than processing the whole database.

Quality of the Findings

Using exhaustive methods when developing predictive models may actually create more work by revealing spurious relationships. For example, exhaustive searches may “discover” such intuitive relationships as the fact that people with large account balances tend to have higher incomes, that residential electric power usage is low between 2 a.m. and 4 a.m., or that travel-related expenditures increase during the holidays. Spending time and money to arrive at such obvious conclusions can be avoided by working with more relevant subsets of data. More findings do not necessarily mean quality findings, and sifting through the findings to determine which are valid takes time and effort.
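The holdout idea discussed under Inference/Generalization above can be sketched in a few lines. The following Python fragment is illustrative only and is not the Enterprise Miner partitioning step; the data table and the 60/20/20 split are assumptions made for the example.

    # Illustrative only: reserve holdout partitions so findings can be checked
    # on records the model never saw. Data and the split are hypothetical.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    mining_view = pd.DataFrame({
        "x1": rng.normal(size=10_000),
        "x2": rng.normal(size=10_000),
        "target": rng.integers(0, 2, size=10_000),
    })

    shuffled = mining_view.sample(frac=1.0, random_state=42)  # random order
    n = len(shuffled)
    train = shuffled.iloc[: int(0.6 * n)]                     # fit models here
    validation = shuffled.iloc[int(0.6 * n): int(0.8 * n)]    # compare and tune models
    test = shuffled.iloc[int(0.8 * n):]                       # one-time final check
    print(len(train), len(validation), len(test))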

In addition to unreliable forecasts, exhaustive searches tend to produce several independent findings, each of which needs a corresponding set of records on which to base inferences. Faster search algorithms on more data can produce more findings but with less confidence in any of them.

Speed and Efficiency

Perhaps the most persistent problems concerning the processing of large databases are speed and cost. Analytical routines required for exploration and modeling run faster on samples than on the entire database. Even the fastest hardware and software combinations have difficulty performing complex analyses such as fitting a stepwise logistic regression with millions of records and hundreds of input variables. For most business problems, there comes a point when the time and money spent on processing the entire database produces diminishing returns wherein any potential modest gains are simply not worth the cost. In fact, even if a business were to ignore the advantages of statistical sampling and instead choose to process data in its entirety (assuming multiple terabytes of data), no time-efficient and cost-effective hardware/software solution that excludes statistical sampling yet exists.

Within a database, there can be huge variations across individual records. A few data values far from the main cluster can overly influence the analysis and result in larger forecast errors and higher misclassification rates. These data values may have been miscoded values, they may be old data, or they may be outlying records. Had the sample been taken from the main cluster, these outlying records would not have been overly influential.

Alternatively, little variation might exist in the data for many of the variables; the records are very similar in many ways. Performing computationally intensive processing on the entire database might provide no additional information beyond what can be obtained from processing a small, well-chosen sample. Moreover, when the entire database is processed, the benefits that might have been obtained from a pilot study are lost.

For some business problems, the analysis involves the destruction of an item. For example, to test the quality of a new automobile, it is torn apart or run until parts fail. Many products are tested in this way. Typically, only a sample of a batch is analyzed using this approach. If the analysis involves the destruction of an item, then processing the entire database is rarely viable.
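To make the earlier point about values far from the main cluster concrete, the short Python sketch below uses entirely hypothetical data to show how a handful of extreme records can pull a fitted least-squares slope well away from the slope that describes the main cluster.

    # Illustrative only: three extreme records shift the least-squares slope
    # far from the slope that fits the main cluster.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 500)
    y = 2.0 * x + rng.normal(0.0, 1.0, 500)             # main cluster, true slope ~2

    x_all = np.concatenate([x, [9.0, 9.5, 10.0]])       # a few outlying records
    y_all = np.concatenate([y, [200.0, 220.0, 250.0]])  # e.g. miscoded values

    print("slope, main cluster only:", round(np.polyfit(x, y, 1)[0], 2))
    print("slope, with 3 outliers:  ", round(np.polyfit(x_all, y_all, 1)[0], 2))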

Processing a Sample

Corporations that have achieved significant return on investment (ROI) in data mining have done so by performing predictive data mining. Predictive data mining requires the development of accurate predictive models that typically rely on sampling in one or more forms. ROI is the final justification for data mining, and most often, the return begins with a relatively small sample.7

Advantages

Exploring a representative sample is easier, more efficient, and can be as accurate as exploring the entire database. After the initial sample is explored, some preliminary models can be fitted and assessed. If the preliminary models perform well, then perhaps the data mining project can continue to the next phase. However, it is likely that the initial modeling generates additional, more specific questions, and more data exploration is required.

In most cases, a database is logically a subset or a sample of some larger population.8 For example, a database that contains sales records must be delimited in some way such as by the month in which the items were sold. Thus, next month’s records will represent a different sample from this month’s. The same logic would apply to longer time frames such as years, decades, and so on. Additionally, databases can at best hold only a fraction of the information required to fully describe customers, suppliers, and distributors. In the extreme, the largest possible database would be all transactions of all types over the longest possible time frame that fully describes the enterprise.

Speed and Efficiency

A major benefit of sampling is the speed and efficiency of working with a smaller data table that still contains the essence of the entire database. Ideally, one uses enough data to reveal the important findings, and no more. Sufficient quantity depends mostly on the modeling technique, and that in turn depends on the problem. “A sample survey costs less than a complete enumeration, is usually less time consuming, and may even be more accurate than a complete enumeration,” as Saerndal, Swensson, and Wretman (1992, p. 3) point out. Sampling enables analysts to spend relatively more time fitting models and thereby less time waiting for modeling results.

The speed and efficiency of a process can be measured in various ways. Throughput is a common measure; however, when business intelligence is the goal, business-oriented measurements are more useful. In the context of business intelligence, it makes more sense to ask big-picture questions about the speed and efficiency of the entire business intelligence cycle than it does to dwell on smaller measurements that merely contribute to the whole such as query/response times or CPU cycles.

Perhaps the most encompassing business-oriented measurement is one that seeks to determine the cost in time and money to go from the recording of transactions to a plan of action. That path — from OLTP to taking action — includes formulating the business questions, getting the data in a form to be mined, analyzing it, evaluating and disseminating the results, and finally, taking action.

Visualization

Data visualization and exploration facilitate understanding of the data.9 To better understand a variable, univariate plots of the distribution of values are useful. To examine relationships among variables, bar charts and scatter plots (2-dimensional and 3-dimensional) are helpful. To understand the relationships among large numbers of variables, correlation tables are useful. However, huge quantities of data require more resources and more time to plot and manipulate (as in rotating data cubes). Even with the many recent developments in data visualization, one cannot effectively view huge quantities of data in a meaningful way. A representative sample gives visual order to the data and allows the analyst to gain insights that speed the modeling process.

Generalization

Samples obtained by appropriate sampling methods are representative of the entire database, and therefore little (if any) information is lost. Sampling is statistically dependable. It is a mathematical science based on the demonstrable laws of probability, upon which a large part of statistics is built.10

7 Sampling also is effective when using exploratory or descriptive data mining techniques; however, the goals and benefits (and hence the ROI) of using these techniques are less well defined.
8 Population refers to the entire collection of data from which samples are taken such as the entire database or data warehouse.
9 Using data visualization techniques for exploration is Step 2 in the SEMMA process for data mining.
For more information, see the sources listed for data mining in the “Recommended Reading” section of this paper.
10 For more information on the statistical bases of sampling, see the section “The Statistical Validity of Sampling” in this paper.
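As a small illustration of the generalization point made above, the following Python sketch draws a simple random sample from a large table and compares its summary statistics with those of the full table. The variables, proportions, and sample size are hypothetical; the point is only that a well-drawn sample reproduces the essential summaries of the whole.

    # Illustrative only: a simple random sample of about 1 percent closely
    # reproduces the summaries of the full table. Variables are hypothetical.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    n = 1_000_000
    population = pd.DataFrame({
        "balance": rng.lognormal(mean=7.0, sigma=1.0, size=n),
        "region": rng.choice(["N", "S", "E", "W"], size=n, p=[0.4, 0.3, 0.2, 0.1]),
    })

    sample = population.sample(n=10_000, random_state=42)

    # Compare key summaries estimated from the sample with the full-table values.
    print(population["balance"].mean(), sample["balance"].mean())
    print(population["region"].value_counts(normalize=True))
    print(sample["region"].value_counts(normalize=True))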

Economy

Data cleansing (detecting, investigating, and correcting errors, outliers, missing values, and so on) can be very time-consuming. To cleanse the entire database might be a very difficult and frustrating task. To the extent that a well-designed data warehou
