Title: Desire Lines in Big Data
Name: Wil M.P. van der Aalst
Affil./Addr.: Eindhoven University of Technology
Department of Mathematics and Computer Science
PO Box 513, NL-5600 MB, Eindhoven, The Netherlands
E-mail: [email protected]

Synonyms

process mining, business process intelligence, distributed process mining, process discovery

Glossary

Event log: multiset of traces.
Trace: sequence of events.
Event: occurrence of some discrete incident (e.g., completion of an activity).
Process mining: collection of techniques to discover, monitor and improve real processes by extracting knowledge from event data.
Process discovery: extracting process models from an event log.
Conformance checking: monitoring deviations by comparing model and log.

Definition

Processes leave footprints in information systems just like people leave footprints in grassy spaces. Desire lines, i.e., the tracks formed by erosion showing where people really walk, may be very different from the formal pathways. When people deviate from the official path there is often a good reason and room for improvement. The goal of process mining is to extract desire lines from event logs, e.g., to automatically infer a process model from raw events recorded by some information system.

Process mining techniques and tools should be able to deal with huge heterogeneous event logs. For example, the increasing ability to record events (cf. sensor data, internet of things, remote monitoring, and service orientation) may make it infeasible to store all events over an extended period. Therefore, on-the-fly discovery techniques have been developed, i.e., techniques to learn process models without storing excessive amounts of events. Moreover, techniques to distribute process mining over a network consisting of many computing nodes are being developed. These techniques exploit modern computing infrastructures and make process mining scalable. This way it is possible to discover desire lines in Big Data.

Introduction

Desire lines refer to tracks worn across grassy spaces, where people naturally walk, regardless of formal pathways (see Figure 1). A desire line emerges through erosion caused by footsteps of humans (or animals), and the width and degree of erosion of the path indicate how frequently the path is used. Typically, the desire line follows the shortest or most convenient path between two points. Moreover, as the path emerges more people are encouraged to use it, thus stimulating further erosion. Dwight Eisenhower is often mentioned as one of the people who noted this emerging group behavior. Before becoming the 34th president of the United States, he was the president of Columbia University. When he was asked how the university should arrange the sidewalks to best interconnect the campus buildings, he suggested letting the grass grow between buildings and delaying the creation of sidewalks. After some time the desire lines revealed themselves. The places where the grass was most worn by people's footsteps were turned into sidewalks.

[Fig. 1: Desire lines reveal the actual and not the assumed behavior of people, machines, and organizations. Labels in the figure: "normative or expected path" and "desire line".]

The term "desire line" has been used for decades in urban planning. A desire line shows where people naturally walk. The width and degree of erosion of such an informal path indicate how frequently the path is used. Often the desire line is very different from the formal pathway. Therefore, some planners simply let erosion tell where the paths need to be. For example, the paths across Central Park in New York were reconstructed using this approach [24, 26].

Good information systems do not show signs of erosion. Nevertheless, they often contain a wealth of event data providing clues about the paths followed by the users of the system. Therefore, it is possible to determine desire lines in organizations, systems, and products. Besides visualizing such desire lines, we can also investigate how these desire lines change over time, characterize the people following a particular desire line, etc. There may also be desire lines that are "undesirable" (unsafe, inefficient, unfair, etc.). Uncovering such phenomena is a prerequisite for process and product improvement.

The potential value of desire lines in "big data" (say event logs containing millions of events) is enormous. The identification of such information can be used to redesign procedures and systems ("reconstructing the formal pathways"), to recommend that people take the right path ("adding signposts where needed"), or to build in safeguards ("building fences to avoid dangerous situations").

More and more information about (business) processes is recorded by information systems in the form of so-called "event logs". IT systems are becoming more and more intertwined with these processes, resulting in an "explosion" of available data that can be used for analysis purposes. Today's information systems already log enormous amounts of events. Classical workflow management systems (e.g., FileNet, TIBCO iProcess Suite, Global 360), ERP systems (e.g., SAP, Oracle), case handling systems (e.g., BPM one), PDM systems (e.g., Windchill), CRM systems (e.g., Microsoft Dynamics CRM, SalesForce), middleware (e.g., IBM's WebSphere, Cordys), hospital information systems (e.g., Chipsoft, Siemens Soarian), etc. provide very detailed information about the activities that have been executed. Not just information systems record data; many physical devices are connected to the Internet, and objects (products and resources) are tagged and monitored. Providers of high-tech systems (ASML, Philips Healthcare, etc.) are recording terabytes of data on a daily basis. In fact, according to MGI, nearly all sectors in the US economy have at least an average of 200 terabytes of stored data per company (for companies with more than 1,000 employees) and many sectors have more than 1 petabyte in mean stored data per company [21]. Until 2000 most data was still stored in analog form (books, photos, etc.).
Since 2000 data storage has grown spectacularly, shifting markedly from analog to digital [18].

Data will continue to grow at a spectacular rate. Moreover, the digital universe and the physical universe are becoming more and more aligned, e.g., money has become a predominantly digital entity. When booking a flight over the Internet, the customer is interacting with many organizations (airline, travel agency, bank, and various brokers), often without actually realizing it. If the booking is successful, the customer receives an e-ticket. Note that an e-ticket is basically a number, thus illustrating the tight coupling between the digital and physical universe. When the SAP system of a large manufacturer indicates that a particular product is out of stock, it is impossible to sell or ship the product even when it is available in physical form. Technologies such as RFID (Radio Frequency Identification), GPS (Global Positioning System), and sensor networks will stimulate a further alignment of data and reality, e.g., RFID tags make it possible to track and trace individual items. Hence, there will be more and more high-quality data that can be used to reveal desire lines in any industry.

Since we are interested in analyzing processes based on the data recorded, we focus on events that can be linked to relevant activities. The order of such events is important for deriving the actual process. Fortunately, most events have a timestamp or can be linked to a particular date. Hence, the event data needed for process mining are omnipresent.

Consider for example Philips Healthcare, a provider of medical systems that are often connected to the Internet to enable logging, maintenance, and remote diagnostics. For example, more than 1500 Cardio Vascular (CV) systems (i.e., X-ray machines) are monitored by Philips. On average each CV system produces 15,000 events per day, resulting in 22.5 million events per day for just their CV systems. The events are stored for about three years and have many attributes.
The error logs of ASML's lithography systems have similar characteristics and also contain about 15,000 events per machine per day. These numbers illustrate the fact that many organizations are storing terabytes of event data. Earlier applications of process mining in organizations such as Philips and ASML show that there are various challenges with respect to performance (response times), capacity (storage space), and interpretation (discovered process models may be composed of thousands of activities).

Many organizations are using so-called Business Intelligence (BI) software, e.g., Business Objects (SAP), Cognos (IBM), Hyperion (Oracle), etc. Common functions offered by these BI tools are reporting, online analytical processing, data mining, business performance management, benchmarks, and predictive analysis. However, these tools assume that the process is known and they typically look at data-related aspects (e.g., correlations) or view the process at an aggregate level (e.g., a dashboard showing the average response time). BI tools typically provide some form of data mining and there are dedicated data mining tools such as Weka, SPSS Clementine, RapidMiner, etc. Typical techniques supported are classification, clustering, association rules, etc. However, these systems do not allow for the discovery of processes based on event logs. In fact, an explicit process notion is missing. This led to the formation of a new research domain: process mining.

Key Points

The spectacular growth of event data is providing opportunities and challenges for process mining. Process discovery and conformance checking can be used to analyze and improve operational business processes in any sector. However, as event logs are growing in size it may be impossible to store, manage, and analyze event data using traditional algorithms and tools. Moreover, process mining is increasingly used in online settings where processes need to be analyzed on-the-fly. Process mining algorithms and tools need to be adapted to this new reality.
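
To make the on-the-fly setting more concrete, the following Python sketch shows one way events could be processed without being stored. It is an illustration only and is not taken from the techniques cited in this entry. It keeps the last activity seen for each open case plus a table counting how often one activity directly follows another. The class name `StreamingDirectlyFollows`, its methods, and the small example stream are hypothetical.

```python
from collections import Counter


class StreamingDirectlyFollows:
    """Counts directly-follows pairs from a stream of (case id, activity) events.

    Hypothetical helper for illustration: only the last activity per open
    case is remembered, so the events themselves are never stored.
    """

    def __init__(self):
        self.last_activity = {}  # case id -> most recent activity of that case
        self.counts = Counter()  # (a, b) -> how often b directly followed a

    def observe(self, case_id, activity):
        """Process a single event and then forget it."""
        previous = self.last_activity.get(case_id)
        if previous is not None:
            self.counts[(previous, activity)] += 1
        self.last_activity[case_id] = activity

    def close_case(self, case_id):
        """Release the state of a finished case (assumes completion is signalled)."""
        self.last_activity.pop(case_id, None)


# Invented example stream: events of two cases arrive interleaved.
stream = [(1, "A"), (2, "A"), (1, "B"), (2, "C"),
          (1, "C"), (2, "B"), (1, "D"), (2, "D")]

dfg = StreamingDirectlyFollows()
for case_id, activity in stream:
    dfg.observe(case_id, activity)
for case_id in (1, 2):
    dfg.close_case(case_id)

for (a, b), n in sorted(dfg.counts.items()):
    print(f"{a} -> {b}: {n}")
```

In this sketch the memory needed grows with the number of open cases and distinct activity pairs, not with the number of events. The sketch assumes that case completion is signalled explicitly; a real stream would need some policy for deciding when the state of a case can be dropped.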

Process Mining

In this section, we first introduce process mining using a small example. Then we elaborate on ways to deal with huge event sets.

Process mining techniques attempt to extract non-trivial and useful information from event logs [1, 19]. One aspect of process mining is control-flow discovery, i.e., automatically constructing a process model (e.g., a Petri net or BPMN model) describing the causal dependencies between activities [7, 9, 29]. The basic idea of control-flow discovery is very simple: given an event log containing a set of traces, automatically construct a suitable process model "describing the behavior" seen in the log. Such discovered processes have proven to be very useful for the understanding, redesign, and continuous improvement of business processes [1].

[Table 1: A fragment of some event log: each line corresponds to an event.]

To illustrate the notion of process discovery, consider Table 1. The table shows a small fragment of some larger event log. Only two traces are shown, both containing 4 events. Each event has a unique id and several properties. For example, event 35654423 is an instance of activity A that occurred on December 30th at 11.02, was executed by John, and costs 300 euros. The second trace starts with event 35655526 and also refers to an instance of activity A. Note that each trace corresponds to a case, i.e., a completed process instance.

1  ⟨A^02, B^06, C^12, D^18⟩
2  ⟨A^10, C^14, B^26, D^36⟩
3  ⟨A^12, E^22, D^56⟩
4  ⟨A^15, B^19, C^22, D^28⟩
5  ⟨A^18, B^22, C^26, D^32⟩
6  ⟨A^19, E^28, D^59⟩
7  ⟨A^20, C^25, B^36, D^44⟩

Table 2: A simplified event log. Each line corresponds to a trace represented as a sequence of activities with timestamps.

The information depicted in Table 1 is the typical event data that can be extracted from today's information systems. To make the example more manageable, we now focus on the activities and their timestamps only. Table 2 shows another view of the same event log. Now each line corresponds to a process instance, e.g., the first trace ⟨A^02, B^06, C^12, D^18⟩ refers to a process instance where activity A was executed at time 2, activity B was executed at time 6, activity C was executed at time 12, and activity D was executed at time 18. Note that the first two traces in Table 2 correspond to the fragment shown in Table 1 (using simplified timestamps).

Using existing process mining techniques it is possible to extract a process model from Table 2. For example, by applying the α algorithm [9] we obtain the process model shown in Fig. 2. This simple Petri net model [25] describes the process that starts with A and ends with D. In-between A and D either E or B and C are executed (in any order).

[Fig. 2: A process model discovered from Table 2 using the α algorithm. Labels in the figure: activities A, B, C, D, E and places start, p1, p2, p3, p4, complete.]

Clearly, process mining, in particular control-flow discovery, is related to the classical work on inductive inference. However, there are also notable differences because, unlike most of the classical work, process mining focuses on higher order representations which explicitly model concurrency (e.g., Petri nets, UML ADs, EPCs, BPMN, etc.) rather than lower level representations (e.g., Markov chains, finite state machines, or regular expressions). Moreover, we do not assume negative examples (i.e., there are no events stating that an activity cannot happen) and deal with issues such as incompleteness (i.e., if something did not happen, it may still be possible) and exceptional behavior. See [1] for an overview of existing process discovery approaches.

Process mining is not limited to control-flow discovery [1]. First of all, besides the control-flow perspective ("How?"), other perspectives such as the organizational perspective ("Who?") and the case/data perspective ("What?") may be considered. Second, process mining is not restricted to discovery. Typically three basic types of process mining are considered: (a) discovery, (b) conformance, and (c) enhancement [1]. In this article we will focus on process discovery, i.e., discovering a model from raw events. Discovery serves as the starting point for the two other types of process mining. The second type of process mining is conformance [27, 23]. Here, an existing process model is compared with an event log of the same process. Conformance checking can be used to check if reality, as recorded in the log, conforms to the model and vice versa. The third type of process mining is enhancement [8]. Here, the idea is to extend or improve an existing process model using information about the actual process recorded in some event log. Whereas conformance checking measures the alignment between model and reality, this third type of process mining aims at changing or extending the a-priori model. For instance, by using timestamps in the event log one can extend the model to show bottlenecks, service levels, throughput times, and frequencies.

For example, the event log in Table 2 shows timestamps. When replaying the event log on the process model shown in Fig. 2, we can measure the time spent in the places in-between the various activities.
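
As a rough illustration of such a replay, the Python sketch below computes, for each place, the time between the activity that puts a token into the place and the activity that removes it, using the traces of Table 2. The wiring in `PLACES` is an assumption: it supposes that A feeds p1 and p2, that p1 is emptied by B or E, p2 by C or E, and that p3 and p4 are filled by B or E and by C or E respectively and emptied by D. This is one reading of Fig. 2, not something stated in the text. The function name `sojourn_times` is hypothetical, and the replay is simplified in that it assumes every trace fits the model.

```python
from statistics import mean

# Table 2: each trace is a list of (activity, timestamp) pairs.
LOG = [
    [("A", 2), ("B", 6), ("C", 12), ("D", 18)],
    [("A", 10), ("C", 14), ("B", 26), ("D", 36)],
    [("A", 12), ("E", 22), ("D", 56)],
    [("A", 15), ("B", 19), ("C", 22), ("D", 28)],
    [("A", 18), ("B", 22), ("C", 26), ("D", 32)],
    [("A", 19), ("E", 28), ("D", 59)],
    [("A", 20), ("C", 25), ("B", 36), ("D", 44)],
]

# ASSUMPTION: place wiring read from Fig. 2, given as
# place -> (activities producing a token, activities consuming it).
PLACES = {
    "p1": ({"A"}, {"B", "E"}),
    "p2": ({"A"}, {"C", "E"}),
    "p3": ({"B", "E"}, {"D"}),
    "p4": ({"C", "E"}, {"D"}),
}


def sojourn_times(log, places):
    """Return, per place, the list of times a token spent in that place."""
    times = {name: [] for name in places}
    for trace in log:
        for name, (producers, consumers) in places.items():
            produced_at = None  # timestamp at which the place received its token
            for activity, timestamp in trace:
                if activity in producers:
                    produced_at = timestamp
                elif activity in consumers and produced_at is not None:
                    times[name].append(timestamp - produced_at)
                    produced_at = None
    return times


TIMES = sojourn_times(LOG, PLACES)
for place, values in TIMES.items():
    print(f"{place}: mean time {mean(values):.2f} over {len(values)} traces")
```

Under the assumed wiring, a place with a comparatively high mean time would be a candidate bottleneck, which is the kind of annotation that enhancement adds to a discovered model.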