Generation of Synthetic Data to Conform to Constraints Derived from Data Mining Applications
Josh Eno
MS Computer Science
Committee: Dr. Craig Thompson, Dr. Brahendra Panda, Dr. David Douglas

Agenda
Introduction
Background
PMML Conversion Software
PMML Model Handlers
Analysis
Conclusion

Introduction: Problem
Synthetic data is useful in several areas
Two basic approaches
– Multivariate distribution
– System model
There is a need for a hybrid model, combining the power and flexibility of both

Introduction: Objective
The objective of this thesis is to demonstrate that the decision tree data mining technique can discover patterns that can be reverse mapped back into synthetic data sets of any size that will faithfully exhibit the same patterns.

Background: PMML
Need some way to store data mining models
Used Predictive Model Markup Language (PMML)
– Industry standard supported by Oracle, SPSS, IBM, Microsoft, etc. through the Data Mining Group
– XML format with vendor extensions

Background: PMML
Consists of a Data Dictionary and 1 or more Mining Models
Data Dictionary
– Contains field name and type information
Mining Model
– 11 types of model, including trees, regression, neural net
– XML Schema varies by model type

Background: SDDL
Need some way to specify data
Synthetic Data Definition Language
– XML language to describe data sets
– Developed by Joe Hoag and Craig Thompson
– Contains a database element with multiple pool and table elements

Background: SDDL
Pool Elements
– A type of data dictionary with weighted pool choices
– Each choice can have multiple auxiliary attributes as well as nested sub-pools
Table Elements
– Define a set of fields to generate
– Contain variable and field elements

Related Work
Multivariate Distribution Simulation
– Simulates data as a statistical distribution
– Defined by sufficient statistics (mean vector, covariance matrix, etc.)
– Limitations on the kinds of data that can be simulated

Related Work
Synthetic Data Generators
– Commercial and experimental software with a common goal of simulating large data sets with database-like relationships
– Approaches vary in data definition mechanism and generation framework
– Each approach lacks either the flexibility of SDDL or the efficiency of the Parallel Synthetic Data Generator

PMML Conversion Software
Software demonstrates the ability to create simulated data based on a decision tree model
Parses a PMML 3.0 file containing a decision tree mining model and creates an SDDL file
Extensible architecture for future mining models

PMML Software Architecture
Three Tier Architecture
– Top layer provides an interface, currently CLI
– Middle layer handles PMML common to all files
– Bottom layer handles specific models, with varying XML schema
[Figure: Interface/Driver Layer on top of the Generic PMML Parser Layer, on top of the Model Handler Layer (Tree Model Handler, Rule Set Model Handler, Neural Network Model Handler)]

PMML Software: Driver Layer
Command Line Interface
– Specify input, output, and properties files
– Properties file may contain input/output files as well as options for specific models
– Decision tree model allows the specification of global minimum and maximum field values
– Other options include random number generator seed value, database name, and number of rows to create
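A minimal sketch of how a driver layer like this could combine command line arguments with a properties file. The flag names, the `key=value` properties format, and the rule that command line values override the file are all assumptions for illustration, not the thesis software's actual interface.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI; every flag name here is an illustrative assumption."""
    p = argparse.ArgumentParser(description="Convert a PMML tree model to SDDL")
    p.add_argument("--input", help="PMML input file")
    p.add_argument("--output", help="SDDL output file")
    p.add_argument("--properties", help="optional properties file")
    p.add_argument("--seed", type=int, help="random number generator seed")
    p.add_argument("--database", help="database name")
    p.add_argument("--rows", type=int, help="number of rows to create")
    return p


def load_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def merge_options(args: argparse.Namespace, props: dict) -> dict:
    """Start from the properties file, then let explicit CLI values win (assumed policy)."""
    merged = dict(props)
    for key, value in vars(args).items():
        if value is not None:
            merged[key] = value
    return merged
```

Global field limits for the decision tree model could then live in the properties file as entries such as `petal_length.min = 1.0`, again a hypothetical naming scheme.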

PMML Software: Parser Layer
Implements a SAX XML Parser
Receives tags sequentially, rather than as a full XML tree
Handles tags that are not model specific, which primarily concern the data dictionary
Also launches model handlers when a mining model tag is encountered
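The routing described above can be sketched with Python's standard `xml.sax` module. This is an illustration under assumptions, not the thesis code: the class names, the `handle_tag` method, and the factory dictionary are hypothetical, and `RecordingHandler` is only a stand-in for a real model handler.

```python
import xml.sax


class RecordingHandler:
    """Stand-in model handler (hypothetical) that records the tags it receives."""

    def __init__(self, fields):
        self.fields = list(fields)  # field list built from the data dictionary
        self.tags = []

    def handle_tag(self, name, attrs):
        self.tags.append(name)
        return True


class GenericPmmlParser(xml.sax.ContentHandler):
    """Handles generic tags itself and launches a handler per mining model tag."""

    def __init__(self, handler_factories):
        super().__init__()
        self.handler_factories = handler_factories  # model tag name -> factory
        self.fields = []
        self.models = []
        self._active = None
        self._active_tag = None

    def startElement(self, name, attrs):
        if self._active is not None:
            # Inside a mining model: pass the tag down to the model handler.
            self._active.handle_tag(name, dict(attrs.items()))
        elif name in self.handler_factories:
            # A mining model tag launches the matching model handler.
            self._active = self.handler_factories[name](self.fields)
            self._active_tag = name
        elif name == "DataField":
            # Generic, non-model-specific tag from the data dictionary.
            self.fields.append((attrs.get("name"), attrs.get("dataType")))

    def endElement(self, name):
        if self._active is not None and name == self._active_tag:
            self.models.append(self._active)
            self._active = None
            self._active_tag = None
```

A call such as `xml.sax.parseString(pmml_bytes, GenericPmmlParser({"TreeModel": RecordingHandler}))` would then stream the document through the parser one tag at a time.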

PMML Software: Parser Layer
Data Dictionary Parsing
– Creates a field list for use by individual mining models
– Classifies fields as integer, real, or string
– For categorical string values, the data dictionary also includes information on valid values
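A small sketch of the field classification step. For brevity it uses `xml.etree.ElementTree` on a namespace-free fragment rather than the SAX parser the slides describe, and the function names and the dictionary layout of each field are assumptions. The `DataField`, `dataType`, and `Value` names are standard PMML.

```python
import xml.etree.ElementTree as ET

REAL_TYPES = {"float", "double"}


def classify(data_type: str) -> str:
    """Map a PMML dataType onto the three classes named on the slide."""
    if data_type == "integer":
        return "integer"
    if data_type in REAL_TYPES:
        return "real"
    return "string"


def parse_data_dictionary(xml_text: str) -> list:
    """Return one entry per DataField, with valid values for categorical fields."""
    root = ET.fromstring(xml_text)
    fields = []
    for data_field in root.iter("DataField"):
        fields.append({
            "name": data_field.get("name"),
            "kind": classify(data_field.get("dataType")),
            # Value children list the valid categories; empty for numeric fields.
            "values": [v.get("value") for v in data_field.findall("Value")],
        })
    return fields
```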

PMML Software: Model Handlers
Model Handler Interface
– Each model handler must implement a model handler interface
– Allows parser layer to pass tags to model handlers
– Model handlers return a Boolean value to indicate whether the tag was handled
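One way to express such an interface, sketched as a Python abstract base class. The method name `handle_tag`, the `dispatch` helper, and the set of tags the tree handler claims are assumptions for illustration; only the Boolean "was this tag handled" contract comes from the slide.

```python
from abc import ABC, abstractmethod


class ModelHandler(ABC):
    """Hypothetical interface every model handler would implement."""

    @abstractmethod
    def handle_tag(self, name: str, attrs: dict) -> bool:
        """Return True if the tag was handled, False otherwise."""


class TreeModelHandler(ModelHandler):
    # Element names from the PMML tree model schema that this sketch accepts.
    KNOWN = {"TreeModel", "MiningSchema", "MiningField", "Node",
             "SimplePredicate", "CompoundPredicate", "SimpleSetPredicate",
             "ScoreDistribution", "True", "False"}

    def __init__(self):
        self.seen = []

    def handle_tag(self, name, attrs):
        if name not in self.KNOWN:
            return False
        self.seen.append(name)
        return True


def dispatch(handlers, name, attrs):
    """Offer the tag to each handler in turn; return the one that accepted it."""
    for handler in handlers:
        if handler.handle_tag(name, attrs):
            return handler
    return None
```

The Boolean return lets the parser layer notice unhandled tags without knowing anything about a model's schema.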

PMML Software: Tree Handler
Creates SDDL based on a decision tree classification model
Nodes in the tree have predicates which must be satisfied for a record to fall within the node
These predicates are used to constrain the generated data

PMML Tree Model Structure
Mining Schema
– Classifies fields as active, predicted, or supplementary
  – Active fields determine the path of a record through the tree
  – Predicted field determines the category of a row that reaches a leaf node
  – Supplementary fields are for internal use by the data mining software and are ignored for SDDL purposes

PMML Tree Model Structure
Nodes define the actual tree
– Node score determines the value of the predicted field
– Predicate is used to determine the record's path through the tree
– Score distribution tells how many records of each category pass through the node

PMML Tree Model Structure
<Node score="Iris-virginica" recordCount="6" id="5">
  <CompoundPredicate booleanOperator="surrogate">
    <SimplePredicate field="petal length" operator="greaterThan" value="4.95"/>
  </CompoundPredicate>
  <ScoreDistribution value="Iris-setosa" recordCount="0"></ScoreDistribution>
  <Node score="Iris-virginica" recordCount="3" id="6"></Node>
  <Node score="Iris-versicolor" recordCount="3" id="7"></Node>
</Node>

PMML Tree Model Structure
Surrogate Compound Predicates
– SPSS Clementine creates compound predicates with surrogate simple predicates
– Only the first predicate with available data applies
– Leaves a question about how to handle lower level predicates that are not applied
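The "first predicate with available data applies" rule can be illustrated with a short sketch. The dictionary representation of predicates and records is an assumption; the operator names are the standard PMML comparison operators. The direction of the sepal length surrogate in the usage note is also assumed, since the slides only show the petal length predicate in full.

```python
import operator

# Standard PMML SimplePredicate comparison operators.
OPS = {
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
    "equal": operator.eq,
    "notEqual": operator.ne,
}


def eval_simple(pred: dict, record: dict):
    """Return True/False, or None when the record has no value for the field."""
    value = record.get(pred["field"])
    if value is None:
        return None
    return OPS[pred["operator"]](value, pred["value"])


def eval_surrogate(preds: list, record: dict):
    """Use the first predicate whose field has data; later ones are fallbacks."""
    for pred in preds:
        result = eval_simple(pred, record)
        if result is not None:
            return result
    return None  # no predicate could be evaluated
```

For example, with `[{"field": "petal length", "operator": "greaterThan", "value": 4.95}, {"field": "sepal length", "operator": "greaterThan", "value": 7.1}]`, a record with petal length 5.0 is decided by the first predicate alone, and the sepal length surrogate is consulted only when petal length is missing.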

Tree Scanning Algorithm
Performs a depth first search of the tree
Constraints are propagated down the tree to leaf nodes
Leaf node scores, record counts, score distributions, and data constraints are stored in a list for the table build phase
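A minimal sketch of a depth first scan that propagates numeric constraints to the leaves. The `Node` class, the bounds dictionary, and the handling of only simple numeric predicates are simplifying assumptions; set predicates, score distributions, and predicate conflicts are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: str
    score: str
    record_count: int
    predicate: Optional[tuple] = None  # (field, operator, value), assumed shape
    children: list = field(default_factory=list)


def scan(node: Node, bounds: dict, leaves: list) -> list:
    """Depth first search; each leaf is stored with the bounds inherited on its path."""
    bounds = {f: list(b) for f, b in bounds.items()}  # copy so siblings stay independent
    if node.predicate is not None:
        fld, op, value = node.predicate
        lo, hi = bounds[fld]
        if op in ("greaterThan", "greaterOrEqual"):
            bounds[fld] = [max(lo, value), hi]
        elif op in ("lessThan", "lessOrEqual"):
            bounds[fld] = [lo, min(hi, value)]
    if not node.children:
        leaves.append((node.node_id, node.score, node.record_count, bounds))
    for child in node.children:
        scan(child, bounds, leaves)
    return leaves
```

Calling `scan(root, global_bounds, [])` with the global minimum and maximum of each field returns the leaf list that a later table build step would consume.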

Tree Scanning Algorithm
In the event of a predicate conflict, the lower level node takes precedence
In this case, node 5 would have a PL minimum of 4.95, and the maximum of 4.75 from node 3 would be loosened to the global maximum
[Figure: nodes 3, 4, and 5 of the Iris tree with their predicate thresholds. Node 3: PW 1.75, PL 4.75, SL 6.15, SW 2.95. Nodes 4 and 5: PL 4.95, SL 7.1]
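The conflict rule can be sketched as a small bounds update. The function name and the tuple representation are assumptions, and the global petal length range of 1.0 to 6.9 in the usage note is an assumed value supplied for illustration.

```python
def apply_predicate(bounds: dict, fld: str, op: str, value: float,
                    global_bounds: dict) -> dict:
    """Tighten one field's (min, max); on conflict the lower level node wins."""
    lo, hi = bounds[fld]
    g_lo, g_hi = global_bounds[fld]
    if op in ("greaterThan", "greaterOrEqual"):
        lo = max(lo, value)
        if lo > hi:
            # New minimum exceeds the inherited maximum: loosen the maximum.
            hi = g_hi
    elif op in ("lessThan", "lessOrEqual"):
        hi = min(hi, value)
        if hi < lo:
            # New maximum falls below the inherited minimum: loosen the minimum.
            lo = g_lo
    updated = dict(bounds)
    updated[fld] = (lo, hi)
    return updated
```

With an inherited petal length range of `(1.0, 4.75)` and a node 5 predicate of petal length greater than 4.95, the result is `(4.95, 6.9)`: the node's own minimum is kept and the inherited maximum is replaced by the global maximum.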

SDDL Build Algorithm
Each leaf node is used to build a choice in a pool in the database
The weight is the record count of the node
Field minimums and maximums are stored as auxiliary values
Categorical values are stored as a sub-pool
– Predicted field weights are set based on score distribution
– Active field weights are equal unless excluded from the set by a set predicate
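A sketch of how one leaf could be turned into a pool choice using `xml.etree.ElementTree`. The `choice`, `pool`, and `weight` element names follow the SDDL pool example in these slides; the leaf dictionary layout, the function name, and the `<field>_min` / `<field>_max` naming of auxiliary elements are assumptions.

```python
import xml.etree.ElementTree as ET


def build_choice(leaf: dict) -> ET.Element:
    """Build one <choice> element for the node pool from a scanned leaf."""
    choice = ET.Element("choice", name=str(leaf["id"]))
    # Numeric constraints become auxiliary min/max values.
    for fld, (lo, hi) in leaf["bounds"].items():
        ET.SubElement(choice, f"{fld}_max").text = str(hi)
        ET.SubElement(choice, f"{fld}_min").text = str(lo)
    # Each categorical field becomes a weighted sub-pool.
    for pool_name, weights in leaf["pools"].items():
        pool = ET.SubElement(choice, "pool", name=pool_name)
        for value, weight in weights.items():
            sub_choice = ET.SubElement(pool, "choice", name=value)
            ET.SubElement(sub_choice, "weight").text = str(weight)
    # The choice weight is the node's record count.
    ET.SubElement(choice, "weight").text = str(float(leaf["record_count"]))
    return choice
```

`ET.tostring(build_choice(leaf), encoding="unicode")` then yields the XML text for that choice.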

SDDL Pool Example
<choice name="3">
  <chol_max>405</chol_max>
  <chol_min>126</chol_min>
  <pool name="sex">
    <choice name="female"><weight>1</weight></choice>
    <choice name="male"><weight>0</weight></choice>
  </pool>
  <pool name="R num">
    <choice name="lt 50"><weight>103</weight></choice>
    <choice name="gt 50 1"><weight>10</weight></choice>
  </pool>
  <weight>113.0</weight>
</choice>

SDDL Build Algorithm
Table Build Phase
– Contains one variable field to select the node pool choice
– Numerical fields have minimum and maximum constrained by choice auxiliary fields
  – NodePool[nodeId].chol_min
– Categorical fields choose a value from a sub-pool
  – NodePool[nodeId].R num
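The generation step that this table definition implies can be imitated in a few lines. This sketch is not the SDDL generator: the `node_pool` dictionary layout is an assumption, and drawing numeric values uniformly between the minimum and maximum is an assumed distribution chosen for simplicity.

```python
import random


def generate_rows(node_pool: dict, n_rows: int, seed: int = 0) -> list:
    """Pick a leaf by weight, then fill each field from that leaf's constraints."""
    rng = random.Random(seed)
    node_ids = list(node_pool)
    weights = [node_pool[n]["weight"] for n in node_ids]
    rows = []
    for _ in range(n_rows):
        # Variable field: select the node pool choice in proportion to its weight.
        node_id = rng.choices(node_ids, weights=weights, k=1)[0]
        node = node_pool[node_id]
        row = {"node": node_id}
        # Numeric fields: constrained by the choice's auxiliary min/max.
        for fld, (lo, hi) in node["bounds"].items():
            row[fld] = rng.uniform(lo, hi)
        # Categorical fields: chosen from the weighted sub-pool.
        for fld, sub_pool in node["pools"].items():
            values = list(sub_pool)
            row[fld] = rng.choices(values, weights=[sub_pool[v] for v in values], k=1)[0]
        rows.append(row)
    return rows
```

A sub-pool value with weight 0, such as `male` in the pool example, is never selected, which is how a set predicate's exclusions carry through to the generated rows.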

Analysis
The PMML file was used to create an SDDL file.
A large data set was generated based on the SDDL file.
The generated data was loaded into a relational database for analysis.
The data was analyzed through a series of SQL queries.
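As an illustration of the kind of SQL query such an analysis might use, the sketch below counts generated rows per node and class and converts the counts to probabilities. The use of SQLite, the table name `generated`, and the column names are assumptions; the slides do not name the database or the schema.

```python
import sqlite3


def node_class_counts(rows: list) -> list:
    """Load (node_id, species) rows and return count and probability per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE generated (node_id INTEGER, species TEXT)")
    conn.executemany("INSERT INTO generated VALUES (?, ?)", rows)
    cursor = conn.execute(
        """
        SELECT node_id,
               species,
               COUNT(*) AS n,
               COUNT(*) * 1.0 / (SELECT COUNT(*) FROM generated) AS probability
        FROM generated
        GROUP BY node_id, species
        ORDER BY node_id, species
        """
    )
    result = cursor.fetchall()
    conn.close()
    return result
```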

Iris Data Analysis
Simple data set first used by R. Fisher in 1936 [1], commonly used to evaluate machine learning algorithms
150 records, 50 each from 3 species of Iris, measuring length and width of sepal and petal
One species, Setosa, is linearly separable from the others, but Virginica and Versicolor are not linearly separable from each other

[Figure: Iris decision tree, listed by node]
Key: RC = Record Count, PL = Petal Length, PW = Petal Width, SL = Sepal Length, SW = Sepal Width

ID  Score       RC   Virginica  Versicolor  Setosa  Predicate thresholds
0                    50         50          50
1   Setosa      50   0          0           50      PL 2.45, PW .8, SL 5.45, SW 3.35
2   Versicolor  100  50         50          0       PL 2.45, PW .8, SL 5.45, SW 3.35
3   Versicolor  54   5          49          0       PW 1.75, PL 4.75, SL 6.15, SW 2.95
4   Versicolor  48   1          47          0       PL 4.95, SL 7.1
5   Virginica   6    4          2           0       PL 4.95, SL 7.1
6   Virginica   3    3          0           0       PW 1.55, SW 2.65, SL 6.5, PL 5.7
7   Versicolor  3    1          2           0       PW 1.55, SW 2.65, SL 6.5, PL 5.7
8   Virginica   46   45         1           0       PW 1.75, PL 4.75, SL 6.15, SW 2.95

Iris Data Analysis: Node Record Counts

         Training Data                          Generated Data
Node ID  Total  Setosa  Virginica  Versicolor   Total   Setosa  Virginica  Versicolor
0        150    50      50         50           150000  50015   50091      49894
1        50     50      0          0            50015   50015   0          0
2        100    0       50         50           99985   0       50091      49894
3        54     0       5          49           53905   0       5030       48875
4        48     0       1          47           47890   0       993        46897
5        6      0       4          2            6015    0       4037       1978
6        3      0       3          0            3047    0       3047       0
7        3      0       1          2            2968    0       990        1978
8        46     0       45         1            46080   0       45061      1019

Iris Data Analysis: Node Probabilities

         Training Data                          Generated Data
Node ID  Total  Setosa  Virginica  Versicolor   Total  Setosa  Virginica  Versicolor
0        1.000  0.333   0.333      0.333        1.000  0.333   0.334      0.333
1        0.333  0.333   0.000      0.000        0.333  0.333   0.000      0.000
2        0.667  0.000   0.333      0.333        0.667  0.000   0.334      0.333
3        0.360  0.000   0.033      0.327        0.359  0.000   0.034      0.326
4        0.320  0.000   0.007      0.313        0.319  0.000   0.007      0.313
5        0.040  0.000   0.027      0.013        0.040  0.000   0.027      0.013
6        0.020  0.000   0.020      0.000        0.020  0.000   0.020      0.000
7        0.020  0.000   0.007      0.013        0.020  0.000   0.007      0.013
8        0.307  0.000   0.300      0.007        0.307  0.000   0.300      0.007
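The Total probabilities can be recomputed from the Iris node record counts as a quick check. The sketch below takes the per-node totals for the training and generated data and reports the node with the largest probability difference; the helper names are hypothetical.

```python
# Per-node record totals for the Iris tree (node ID -> record count).
training_totals = {0: 150, 1: 50, 2: 100, 3: 54, 4: 48, 5: 6, 6: 3, 7: 3, 8: 46}
generated_totals = {0: 150000, 1: 50015, 2: 99985, 3: 53905, 4: 47890,
                    5: 6015, 6: 3047, 7: 2968, 8: 46080}


def node_probabilities(totals: dict) -> dict:
    """Probability of reaching each node: node count divided by root count."""
    root = totals[0]
    return {node: count / root for node, count in totals.items()}


def max_difference(a: dict, b: dict) -> tuple:
    """Return (node, difference) for the largest absolute probability gap."""
    diffs = {node: abs(a[node] - b[node]) for node in a}
    worst = max(diffs, key=diffs.get)
    return worst, diffs[worst]


worst_node, worst_diff = max_difference(
    node_probabilities(training_totals),
    node_probabilities(generated_totals),
)
```

On these totals the largest gap is at node 4 (48/150 versus 47890/150000), about 0.0007, which stays under the .001 threshold used in the summary.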

Heart Data Analysis
Heart disease indicators from the Cleveland Clinic [2]
13 active fields, with a mix of categorical and numeric values
303 records, classified by whether artery narrowing is less than or greater than 50%

[Figure: Heart disease decision tree, listed by node]
Key: RC = Record Count; lt 50 and gt 50 1 are the two predicted classes

ID 0: RC 303 (lt 50: 165, gt 50 1: 138)
ID 1: Score lt 50, RC 167 (lt 50: 130, gt 50 1: 37). Predicate: Thal isIn {Normal}; Thalach 150.5; Cp isIn {atyp angina, non anginal, typ angina}; Exang isIn {no}; Oldpeak 1.55; Sex isIn {female}
ID 2: Score lt 50, RC 118 (lt 50: 105, gt 50 1: 13). Predicate: Ca .5; Age 66.5; Thalach 134; Chol 405.5; Oldpeak 3.55; Cp isIn {atyp angina, non anginal, asympt}
ID 3: Score lt 50, RC 113 (lt 50: 103, gt 50 1: 10). Predicate: Trestbps 158
ID 4: Score gt 50 1, RC 5 (lt 50: 2, gt 50 1: 3). Predicate: Trestbps 158
ID 5: Score lt 50, RC 49 (lt 50: 25, gt 50 1: 24). Predicate: Ca .5; Age 66.5; Thalach 134; Chol 405.5; Oldpeak 3.55; Cp isIn {typ angina}
ID 6: Score gt 50 1, RC 20 (lt 50: 3, gt 50 1: 17). Predicate: Cp isIn {asympt}; Exang isIn {yes}; Thalach 125.5; Slope isIn {down, flat}; Oldpeak 0.85; Trestbps 115
ID 7: Score lt 50, RC 29 (lt 50: 22, gt 50 1: 7). Predicate: Cp isIn {atyp angina, non anginal, typ angina}; Exang isIn {no}; Thalach 125.5; Slope isIn {up}; Oldpeak 0.85; Trestbps 115
ID 8: Score lt 50, RC 11 (lt 50: 6, gt 50 1: 5). Predicate: Chol 237.5; Cp isIn {atyp angina, typ angina}; Trestbps 153; Oldpeak 2.25; Sex isIn {male}; Age 41.5
ID 9: Score lt 50, RC 5 (lt 50: 4, gt 50 1: 1). Predicate: Age 55.5; Chol 202.5; Slope isIn {up}; Cp isIn {non anginal}; Thalach 128.5
ID 10: Score gt 50 1, RC 6 (lt 50: 2, gt 50 1: 4). Predicate: Age 55.5; Chol 202.5; Slope isIn {flat}; Cp isIn {typ angina, atyp angina}; Thalach 128.5
ID 11: Score lt 50, RC 18 (lt 50: 16, gt 50 1: 2). Predicate: Chol 237.5; Cp isIn {non anginal}; Trestbps 153; Oldpeak 2.25; Sex isIn {female}; Age 41.5
ID 12: Score gt 50 1, RC 136 (lt 50: 35, gt 50 1: 101). Predicate: Thal isIn {fixed defect, reversable defect}; Thalach 150.5; Cp isIn {asympt}; Exang isIn {yes}; Oldpeak 1.55; Sex isIn {male}
ID 13: Score gt 50 1, RC 90 (lt 50: 10, gt 50 1: 80). Predicate: Cp isIn {asympt}; Thalach 172; Exang isIn {yes}; Trestbps 106.5; Age 66.5; Chol 128.5
ID 14: Score gt 50 1, RC 22 (lt 50: 8, gt 50 1: 14). Predicate: Oldpeak .55; Thalach 146.5; Slope isIn {up}; Trestbps 109
ID 15: Score lt 50, RC 12 (lt 50: 7, gt 50 1: 5). Predicate: Chol 237.5; Restecg isIn {normal}; Exang isIn {no}; Slope isIn {up}; Ca 1.5; Trestbps 122
ID 16: Score lt 50, RC 4 (lt 50: 4, gt 50 1: 0). Predicate: Thalach 152; Thal isIn {fixed defect}; Oldpeak .05; Age 63; Fbs isIn {t}; Ca 2
ID 17: Score gt 50 1, RC 8 (lt 50: 3, gt 50 1: 5). Predicate: Thalach 152; Thal isIn {reversable defect}; Oldpeak .05; Age 63; Fbs isIn {f}; Ca 2
ID 18: Score gt 50 1, RC 10 (lt 50: 1, gt 50 1: 9). Predicate: Chol 237.5; Restecg isIn {left vent hyper}; Exang isIn {yes}; Slope isIn {flat}; Ca 1.5; Trestbps 122
ID 19: Score gt 50 1, RC 68 (lt 50: 2, gt 50 1: 66). Predicate: Oldpeak .55; Thalach 146.5; Slope isIn {down, flat}; Trestbps 109
ID 20: Score lt 50, RC 46 (lt 50: 25, gt 50 1: 21). Predicate: Cp isIn {atyp angina, non anginal, typ angina}; Thalach 172; Exang isIn {no}; Trestbps 106.5; Age 66.5; Chol 128.5
ID 21: Score lt 50, RC 29 (lt 50: 21, gt 50 1: 8). Predicate: Ca .5; Thalach 113.5; Oldpeak 1.95; Cp isIn {atyp angina, typ angina}; Age 67.5; Trestbps 97.5
ID 22: Score gt 50 1, RC 17 (lt 50: 4, gt 50 1: 13). Predicate: Ca .5; Thalach 113.5; Oldpeak 1.95; Cp isIn {non anginal}; Age 67.5; Trestbps 97.5
ID 23: Score gt 50 1, RC 12 (lt 50: 1, gt 50 1: 11). Predicate: Slope isIn {flat}; Thalach 150.5; Oldpeak .3; Chol 157
ID 24: Score lt 50, RC 5 (lt 50: 3, gt 50 1: 2). Predicate: Slope isIn {up}; Thalach 150.5; Oldpeak .3; Chol 157

Heart Data Analysis: Heart Node Record Counts
[Table: node ID with total and per-class record counts, Training Data versus Generated Data]

Heart Data Analysis: Heart Node Probabilities
[Table: node IDs 0 to 24 with total and per-class probabilities, Training Data versus Generated Data]

Data Analysis: Summary
150,000 rows in the Iris set, 303,000 rows in the Heart Disease set
Of 68 probability measures, only one differed between training and generated data by more than .001
Data sets are indistinguishable based on the decision tree model

Conclusion: Summary
Goal: Simulate data based on a decision tree mining model
Created software to convert a decision tree stored as PMML to an SDDL data specification
Generated data sets 1000 times as large as the training data and tested them for similarity

Conclusion: Contributions
Demonstrated the viability of using a mining model as a data description
Established an architecture that can expand to other mining models
Created a way to rapidly simulate data that does not conform to standard distributions

Conclusion: Future Work
Implementation Work
– Interactive interface
– Additional mining models
– Different PMML versions and software
Theoretical Work
– Simulating multiple models simultaneously
– Theoretical limits of this approach
