
ITERATIVELY LEARNING DATA TRANSFORMATION PROGRAMS FROM EXAMPLES

by

Bo Wu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2015

Copyright 2015 Bo Wu

Dedication

To my father Yongwei Wu and my mother Rufen Liu, for their love and support

Acknowledgments

First and foremost, thanks to my parents. Their unconditional support allowed me to finish this long endeavor. Thanks to Chiao-Yu. Her love and company gave me the strength to get through all those hard times.

I would like to thank my advisor, Craig A. Knoblock, who taught me how to conduct research. During my PhD study, I met many difficulties, such as identifying research problems, writing papers, giving presentations and so on. I was deeply frustrated by my research progress at the beginning. Craig's encouragement allowed me to continue my research. Talking with Craig is always very helpful. Once I leave ISI, I think one of the things I will miss most is talking with Craig. Beyond research, Craig also sets a good example for me in many aspects of life, which I will try my best to emulate in the future.

Many thanks to my committee members: Daniel O'Leary, Cyrus Shahabi, Yan Liu and Jose Luis Ambite. Their guidance kept me on the right track so that I could successfully finish my thesis in time. Their patience and kindness allowed my thesis proposal and defense to go smoothly.

I want to thank Pedro Szekely. He led me into the user interface area and patiently taught me how to develop a research prototype within a large team. These experiences opened a new door for me and led me to pay more attention to the user side. I also want to thank Yao-Yi Chiang, from whom I learned about a lot of interesting work and tools in

geospatial information integration. He showed me the interesting applications in that field.

Finally, I want to express my thanks to my fellow students of the Information Integration group at ISI: Mohsen, Jeon-Hyung, Suradej, Yoon-Sik, Jason and Hao. You are always ready to help. Hanging out with you guys is fun, which makes this long PhD journey more joyful. Talking with you also helps me to learn many interesting things and appreciate the diversity of this world.

This research is based upon work supported in part by the National Science Foundation under Grant No. 1117913. This research was also supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The U.S. Government is authorized to reproduce and distribute reports for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Motivation and problem statement
  1.2 Proposed approach
  1.3 Thesis Statement
  1.4 Contributions of the Research
  1.5 Outline of the Thesis

2 Previous Work
  2.1 Generating the transformation program
    2.1.1 Synthesizing the branch transformation program
    2.1.2 Learning the Conditional Statement

3 Learning Conditional Statements
  3.1 Construct Conditional Transformations
    3.1.1 Data Preprocessing
    3.1.2 Partition Algorithm
    3.1.3 Distance Metric Learning
    3.1.4 Learning the Classifier
  3.2 Evaluation
    3.2.1 Datasets
    3.2.2 Experiment Setup
    3.2.3 Metrics
    3.2.4 Experimental Results

4 Adapting Branch Programs
  4.1 Iteratively Learning Programs by Example
    4.1.1 P and T have the same number of segments
    4.1.2 P and T have a different number of segments
    4.1.3 Adapting Incorrect Programs
    4.1.4 Soundness and Completeness
    4.1.5 Performance Optimizations
  4.2 Evaluation
    4.2.1 Dataset
    4.2.2 Experiment Setup
    4.2.3 Real-World Scenario Results
    4.2.4 Synthetic Scenarios Results

5 Maximizing Correctness with Minimal User Effort
  5.1 Verifying the transformed data
  5.2 Sampling records
  5.3 Recommending records
    5.3.1 Finding the records with runtime errors
    5.3.2 Building a meta-classifier for detecting questionable records
      5.3.2.1 Classifiers based on distance
      5.3.2.2 Classifiers based on the agreement of programs
      5.3.2.3 Classifiers based on the format ambiguity
      5.3.2.4 Sorting the recommended records
    5.3.3 Minimal test set
  5.4 Evaluation
    5.4.1 Simulated experiment
      5.4.1.1 Dataset
      5.4.1.2 Experiment setup
      5.4.1.3 Results
    5.4.2 User study
      5.4.2.1 Dataset
      5.4.2.2 Experiment setup
      5.4.2.3 Results

6 Related Work
  6.1 Data Transformation Approaches
    6.1.1 Specifying the transformation steps
    6.1.2 Specifying the transformation results
  6.2 PBE approaches for data transformation
    6.2.1 Learning Conditional Statements
    6.2.2 Adapting Program With New Examples
    6.2.3 User interface

7 Conclusion
  7.1 Limitations and future work
    7.1.1 Managing the user expectation
    7.1.2 Incorporating external functions
    7.1.3 Handling user errors

Bibliography

A Appendix

List of Tables

3.1 Data profile
3.2 Success rates for different approaches on all scenarios
3.3 Classifier accuracy
3.4 The partitions of example of NPIC

4.1 Data transformation scenario
4.2 Synthetic scenario for generating the first 7 columns
4.3 Results of real-world scenarios

5.1 One typical example of a failed iteration
5.2 Scenarios used in user study
5.3 User study results

List of Figures

1.1 An example for a data transformation scenario
1.2 Interaction between users and PBE systems
1.3 Approach overview

2.1 An example transformation program
2.2 One trace example of an input-output pair
2.3 An example of the hypothesis space

3.1 Token Sequence and Feature Vector for a String
3.2 Weights of Different Features Change Over Iterations
3.3 The examples and the unlabeled data in each partition
3.4 Comparing DPICED with 3 other groups of approaches: (G1) DPICED vs DPIC vs DP, (G2) DPICED vs SPIC vs SP and (G3) DPICED vs NPIC and NP
3.5 DPIC used fewer seconds per iteration compared to SPIC and NPIC
3.6 Time per iteration increases as example number grows
3.7 The number of conditional branches in the final program generated by different approaches

4.1 Program changes as more examples are added
4.2 P and T have the same number of segments
4.3 P and T have different number of segments
4.4 Synthesizing time rises as column number increases

5.1 Different rules for recognizing incorrect records
5.2 User interface
5.3 An example transformation program
5.4 Candidate position programs for one segment program
5.5 Comparison results

Abstract

Data transformation is an essential preprocessing step in most data analysis applications. It often requires users to write many trivial and task-dependent programs, which is time consuming and requires the users to have certain programming skills. Recently, programming-by-example (PBE) approaches enable users to generate data transformation programs without coding. The user provides the PBE approaches with examples (input-output pairs). These approaches then synthesize the programs that are consistent with the given examples.

However, real-world datasets often contain thousands of records with various formats. To correctly transform these datasets, existing PBE approaches typically require users to provide multiple examples to generate the correct transformation programs. These approaches' time complexity grows exponentially with the number of examples and in a high polynomial degree with the length of the examples. Users have to wait a long time to see any response from the systems when they work on moderately complicated datasets. Moreover, existing PBE approaches also lack the support for users to verify the correctness of the transformed results so that they can determine whether they should stop providing more examples.

To address the challenges of existing approaches, we propose an approach that generates programs iteratively, which exploits the fact that users often provide multiple examples iteratively to refine programs learned from previous iterations. By collecting

and accumulating key information across iterations, our approach can efficiently generate the new transformation programs by avoiding redundant computation. Our approach can also recommend potentially incorrect records for users to examine, which saves users effort in verifying the correctness of the transformation results.

To validate the approach in this thesis, we evaluated IPBE, the implementation of our iterative programming-by-example approach, against several state-of-the-art alternatives on various transformation scenarios. The results show that users of our approach spent less time and achieved higher correctness compared to the alternative approaches.

Chapter 1

Introduction

This chapter introduces the key challenges of applying PBE approaches to perform data transformation. I first motivate and define the problem. I then briefly describe our approach to the problem and summarize our contributions. Finally, I present the outline of the thesis.

1.1 Motivation and problem statement

Data transformation converts data from a source format to a target format, an essential step before utilizing the data. As the transformation is often task dependent, data practitioners usually write scripts for different tasks, which can be error-prone and labor intensive.

Recently, programming-by-example (PBE) approaches [Lau et al., 2003; Huynh et al., 2008; Kandel et al., 2011; Gulwani, 2011] have proven effective in generating transformation programs for simple scenarios without coding. These approaches are based on a domain-specific language (DSL) designed to express common data transformation operations. They require only examples (input-output pairs). They can then generate programs that are consistent with the examples, which means these programs generate the expected outputs for the corresponding inputs specified in the examples. FlashFill [Gulwani, 2011], an example PBE system, is already integrated into Excel 2013 to help users transform data. For example, the dimensions of the artworks in Figure 1.1 mix the values for all dimensions into one cell. The user wants to put each dimension

in its own column. Taking width for instance, the target values are shown in the right column in Figure 1.1. To transform the data, the user enters the target value (width) "9.375" for the first entry. The system generates a program and applies that program to the rest of the data. If the user finds any entry transformed incorrectly, she provides a new example for that entry to regenerate the program. For instance, as the 4th entry has both the painting width and the frame width, the program learned from the first example fails to transform this input of an unseen format. The user inputs "19.5" to teach the system that she only wants the frame width, which is shown after the vertical bar. The system regenerates the program and applies it to the rest of the entries. The user keeps providing examples until she determines the results are correct.

Figure 1.1: An example for a data transformation scenario

However, there are 3 challenges that need to be addressed to make PBE approaches more practical. First, a dataset can have various formats and the user is only willing

to provide a few examples, as seen in Figure 1.1. The approach should learn to recognize these formats accurately from few examples and then perform the transformation accordingly.

Second, users need to see the results on the fly so that they can provide additional examples if necessary. The approach should therefore generate the programs in real time. However, the current approaches' computational complexity grows exponentially in the number of examples and as a high-degree polynomial in the size of each example [Raza et al., 2014]. In previous work, users have to wait a long time when they work on scenarios that require long or many examples.

Third, users are often overconfident about the correctness of the results, especially on large datasets [Ko et al., 2011]. There are usually thousands of entries being transformed. It is hard for users to manually examine whether all the entries are transformed correctly. Users also do not like to invest much time in examining the results.

Given the challenges presented by current PBE approaches, this thesis focuses on solving the problem of how to efficiently generate correct transformation programs for data with heterogeneous formats using minimal user effort.

1.2 Proposed approach

To address the problem, we exploit the fact that users interact with the PBE system iteratively. The interaction between the user and a PBE system often contains multiple iterations, as shown in Figure 1.2. In each iteration, the user examines the records and provides examples for the records with incorrect results. A record here consists of the raw input and the transformed result. After obtaining the examples from the user, the PBE system synthesizes a transformation program and applies this program to transform all records. The transformed records are presented to the user for examination. If the user

Figure 1.2: Interaction between users and PBE systems

identifies any record transformed incorrectly, she can provide a new example and start a new iteration. For any non-trivial dataset, users often provide more than 3 examples before finishing the transformation. For example, in Figure 1.1, the user keeps providing new examples for the entries with incorrect results to regenerate a transformation program that is consistent with all given examples.

The intuition behind our approach is to identify and collect certain key information from previous iterations and utilize that information to benefit the current iteration. To understand how the information from previous iterations can improve the performance of the existing PBE system [Gulwani, 2011], I briefly describe the architecture of our iterative programming-by-example approach (IPBE), shown in Figure 1.3, and its key modules. These key modules are based on the previous approach [Gulwani, 2011], which will be described in Chapter 2 in more detail. In our approach, a user is asked to provide

Figure 1.3: Approach overview

examples in the GUI. The approach first clusters the examples into multiple clusters, where each cluster corresponds to a specific format. It learns a classifier (conditional statement) to recognize these different formats. It then synthesizes a branch transformation program for each cluster to handle the specific format. Finally, the PBE approach combines the conditional statement with the branch programs to generate the final data transformation program.

The three key components are (1) clustering examples and learning conditional statements, (2) synthesizing branch programs and (3) a user interface for providing examples. The approaches that we used to improve these modules' performance by utilizing information from previous iterations are described below.

Efficiently clustering examples and accurately learning the classifiers: Learning conditional statements (classifiers) requires clustering the examples into compatible clusters. A compatible cluster means there is at least one branch program that is consistent with all the examples in that cluster. To know whether a group of examples is compatible, the previous approach [Gulwani, 2011] tries to generate programs for the group to see whether a consistent program exists. Thus, verifying the compatibility of the examples in a cluster is computationally expensive. Worse, the previous approach needs to verify the compatibility of a large number of clusters before identifying the final clustering. We developed an approach that learns a distance metric from the cluster information of previous iterations. This distance metric puts compatible examples closer together so that they can form a cluster, while putting incompatible examples far apart. Using the distance metric, our approach can put the compatible examples into one cluster without verifying the compatibility of a large number of clusters. The distance metric can also be used to incorporate unlabeled data (the records that are not used as examples) as training data for learning a classifier as the conditional statement. By expanding the training data using the unlabeled data, we can learn a more accurate classifier than by relying on just a few examples.

Efficiently synthesizing the branch transformation program: Only a small portion of the learned program changes after the user provides a new example to refine the program. Our approach can identify the incorrect subprograms that require modification after users provide new examples. Our approach decomposes the new example to generate the expected input-output pairs for the subprograms. It then compares the execution results of the subprograms of the current program with the expected outputs to identify the subprograms that cannot generate the expected outputs.
After identifying the incorrect subprograms and their expected outputs, our approach uses the new example to refine the

hypothesis space that is used to generate the incorrect subprogram. It generates the correct subprograms from the refined hypothesis space. It then replaces the previous incorrect subprograms with the new ones to generate the new program.

Guiding users to obtain correct results with minimal user effort: To handle large datasets, our approach allows users to focus on a small sample. When the records in the small sample are all transformed correctly, the entire dataset reaches a correctness satisfying the user's requirement. My approach also learns from past transformation results to recommend the records that potentially have incorrect results. Users examine these recommended records. They can enter examples for incorrectly transformed records or confirm the correctness of certain recommended records to refine the recommendation. The approach also requires users to examine a minimal set of the records to ensure that users do not grow so confident in their results that they stop checking them carefully.

1.3 Thesis Statement

Through collecting and leveraging the information generated across iterations, our programming-by-example approach can efficiently generate programs that can correctly transform data with heterogeneous formats using minimal user input.

1.4 Contributions of the Research

The goal of this research is to develop a practical programming-by-example data transformation system. Specifically, our research makes the following contributions:

- efficiently learning accurate conditional statements by exploiting information from previous iterations

- incrementally synthesizing branch transformation programs efficiently by adapting programs from the previous iteration

- maximizing the correctness with minimal user effort on large datasets by recommending potentially incorrect records

1.5 Outline of the Thesis

The rest of this thesis is organized as follows. Chapter 2 describes the previous work and related basic concepts. Chapter 3 describes the approach to learning conditional statements iteratively. Chapter 4 presents the approach that iteratively generates new programs by reusing previous subprograms. Chapter 5 discusses the approach to helping users verify the correctness of the results with minimal effort on large datasets. Chapter 6 reviews all the related work. Finally, Chapter 7 concludes this research and identifies areas for future research.
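Before turning to the technical background, the running scenario from Figure 1.1 can be made concrete with a small sketch of the kind of conditional program that the user's two examples ("9.375" and "19.5") would induce. This is an illustrative Python sketch under assumed cell formats, not the system's DSL or implementation; the cell values and the function name are hypothetical.

```python
import re

def extract_width(value: str) -> str:
    """Extract the width from a dimension cell.

    Branch 1: if the cell also lists frame dimensions after a vertical
    bar, the user wants the frame width (the part after the bar).
    Branch 2: otherwise, the first number in the cell is the width.
    """
    if "|" in value:
        # Keep only the frame dimensions after the vertical bar.
        value = value.split("|", 1)[1]
    # Both branches: the width is the first decimal number remaining.
    match = re.search(r"\d+(?:\.\d+)?", value)
    return match.group(0) if match else ""

# Hypothetical cell values modeled on the Figure 1.1 scenario:
print(extract_width("9.375 x 12.25 in."))                    # → 9.375
print(extract_width("13.75 x 17.25 in. | 19.5 x 21.5 in."))  # → 19.5
```

A hand-written script like this is exactly what PBE approaches spare the user from writing: the conditional (the `if`) corresponds to the learned conditional statement, and each arm corresponds to a branch transformation program.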

Chapter 2

Previous Work

Our approach is built on a state-of-the-art PBE system [Gulwani, 2011], which exploits version space algebra like many other program induction approaches [Lau et al., 2003]. It defines a domain-specific transformation language (DSL), which supports a restricted but expressive form of regular expressions that also includes conditionals and loops. The approach synthesizes transformation programs in this language from examples.

To better understand the structure of the generated transformation program, we use a representation of the transformation program that differs from Gulwani's original notation without changing its meaning. The transformation program learned from the examples in Figure 1.1 is shown in Figure 2.1.

The program recognizes the format of the input using the classify function, as seen in Figure 2.1. It then performs the conditional transformation using the switch function based on the recognized format. It has a branch transformation program for each specific format. A branch program is essentially a concatenation of several segment programs, as in branch := substr1 · substr2 · ... A segment program outputs a substring and is defined as substr := const | substr(p_s, p_e) | loop(w, branch). The segment program can (1) be a constant string (const), (2) extract a substring (substr) between two positions specified by position programs p_s and p_e respectively, or (3) be a loop program (loop) that executes a branch program iteratively, where w controls the start point of the loop. The w is passed as an offset from the beginning into the position programs in the loop body, as shown in an example later. The position program locating

Figure 2.1: An example transformation program

a position in the input is defined as p := indexOf(leftcxt, rightcxt, c) | number. It can be specified using (1) an absolute offset (number), or (2) a position with the context specified by (leftcxt, rightcxt, c). Both leftcxt and rightcxt are sequences of tokens, which specify the left and right context of a position correspondingly. c refers to the c-th occurrence of the position with the required context. The set of tokens used in the approach is as follows. 'ANY' can represent any token. 'WORD' ([a-zA-Z]+) represents a sequence of alphabetical letters. 'UWRD' ([A-Z]) stands for a single upper case letter. 'LWRD' ([a-z]+) means a sequence of lowercase letters. 'NUM' ([0-9]+) represents a sequence of digits. 'BNK' is a blank space. Besides these tokens, all punctuation marks are also tokens. For example, in Figure 2.1, pos1 is specified as (BNK, NUM, -1). "-1" means the first appearance of a position with the required context when scanning backwards from the end of the input. Hence, (BNK,

NUM, -1) refers to the first position whose left is a blank space and whose right is a number, when scanning from the end of the input. An example of a loop program is loop(2, substr(indexOf(WORD, ANY, 2+i), indexOf(ANY, WORD, 2+i)) · ", "), i ∈ [0, 1, ...]. The body of the loop is a concatenation of two segment programs: (1) a segment program whose position programs start matching at the (2+i)-th occurrence and (2) a constant string ", ". The loop program essentially extracts all WORD tokens except the first one and appends a comma after each word. Here w is 2. The w+i becomes 2+i, which specifies that the position programs should start locating positions from the second occurrence and continue searching for the next occurrence with the desired context until the position program fails to locate a position.

2.1 Generating the transformation program

Synthesizing the transformation program, as seen in Figure 1.3, has two essential steps: (1) clustering the examples into partitions to learn conditional statements and (2) synthesizing the branch program for each partition. A partition is a hypothesis space of branch transformation programs derived from the examples belonging to the partition.

2.1.1 Synthesizing the branch transformation program

To generate a branch transformation program, Gulwani's [Gulwani, 2011] approach follows these steps.

First, it creates all traces of the examples. To create the traces, it segments the outputs in all possible ways. From the original input, it then generates these segments, independently of each other, by either copying substrings from the original values or inserting constant strings. Two trace examples are shown in Figure 2.2. The outside rectangle shows that the "9.375" can be directly extracted from the input. The other trace, depicted

Figure 2.2: One trace example of an input-output pair

using three small rectangles, shows that the output is a concatenation of three substrings, where the period is extracted from the first period in the input.

Definition 1 (Traces) Traces are the computational steps executed by a program to yield an output from a particular input [Kitzelmann and Schmid, 2006]. A trace here defines how the output string is constructed from a specific set of substrings of the input string.

Second, it derives hypothesis spaces from the traces. A hypothesis space defines the set of possible programs in the DSL that can transform the inputs to the outputs. The hypothesis space is stored as a directed acyclic graph (DAG), where the nodes in the DAG correspond to the position hypothesis spaces that contain the position programs. The edges correspond to the segment hypothesis spaces that contain the segment programs. An example of the hypothesis space derived from the traces in Figure 2.2 is shown in Figure 2.3. There are four nodes corresponding to 4 position hypothesis spaces, as the output in Figure 2.2 can be split into at most 3 different segments. S1 and S4 are the start and end positions. Each path from S1 to S4 is a concatenation of multiple edges representing the programs consisting of different segments; for example, the path (S1 → S2 → S3 → S4) dictates that the program consists of three segment programs. The second segment hypothesis space (S2 → S3) contains all segment programs generated by filling the p_s and p_e in substr(p_s, p_e) with any pair of position programs from the S2 and S3 position hypothesis spaces respectively. The segment space also

Figure 2.3: An example of the hypothesis space

contains constant text, such as the "." shown in Figure 2.3. There is no guarantee that all programs in the hypothesis space are consistent with the examples. For example, (NUM, '.', 2) from S2 and ('.', NUM, 1) from S3 cannot form a valid segment program. The output of the start position program would be larger than the output of the end position program, as the program from S2 locates the beginning of the second period whereas the program from S3 identifies the end of the first period. Finally, if there are multiple examples, the approach creates a hypothesis space for each example and merges these hypothesis spaces to generate a common hypothesis space for all examples.

Finally, the programs in the hypothesis space are partially ordered based on their simplicity, which is measured using a set of pre-defined heuristics. The simpler programs are generated first, as they are more likely to be correct based on Occam's razor. For example, the approach generates the programs with fewer segments earlier. As mentioned earlier, since not all programs in the hypothesis space are consistent with the examples, the approach uses a generate-and-test method. The approach keeps generating programs in order of their simplicity and returns the first program that is consistent with all examples.
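The trace-creation step described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the system's implementation: it enumerates every way to build the output as a concatenation of segments that are either substrings of the input or constant strings (restricted here to single characters for brevity); the function name and the segment cap are hypothetical.

```python
def enumerate_traces(inp, out, max_segments=3):
    """Enumerate every way to build `out` as a concatenation of segments,
    where each segment is either a substring copied from `inp` ('substr')
    or a constant string ('const', limited here to one character)."""
    results = []

    def extend(pos, segments):
        if len(segments) > max_segments:
            return  # prune overly fragmented traces
        if pos == len(out):
            results.append(tuple(segments))
            return
        # Try every non-empty prefix of the remaining output as a segment
        # copied from the input.
        for end in range(pos + 1, len(out) + 1):
            piece = out[pos:end]
            if piece in inp:
                extend(end, segments + [("substr", piece)])
        # Alternatively, insert the next character as a constant string.
        extend(pos + 1, segments + [("const", out[pos])])

    extend(0, [])
    return results

# The whole output can be copied at once (the outside rectangle of
# Figure 2.2), or stitched from pieces such as "9" + "." + "375"
# (the three-rectangle trace).
traces = enumerate_traces("9.375 x 12.25 in.", "9.375")
```

Each resulting trace then seeds one path in the hypothesis-space DAG: every 'substr' piece expands into the set of position-program pairs that could have located it, and the generate-and-test step searches these paths in order of simplicity.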
