Understanding Process Mining

Last Update: January 7th, 2021

In this section, we explain what process mining is all about. Note that this page only describes the basics of process mining, i.e., it is not a full-fledged reference covering every possible aspect of process mining. For a more detailed overview of process mining, we recommend taking a look at the Coursera MOOC on Process Mining and the seminal book by Wil van der Aalst. Furthermore, before you begin, please make sure to install PM4Py on your system, as described on the Installation page.

Processes in our Modern World

The vast majority of companies, active in virtually any domain, execute processes. Whether the core business of a company is to deliver a product, e.g., manufacturing a car, cooking a delicious pizza, etc., or to provide a service, e.g., providing you with a mortgage to buy your dream house, paying back your insurance claim, etc., processes are executed for the efficient delivery of that product/service. Hence, a natural question is: “What is a process?”. In general, several notions of the concept of a process exist. However, in process mining, we typically assume the following conceptual definition:

A process represents a collection of activities that we execute to achieve a certain goal.

As an example, consider the burger restaurant just around the corner, which also does deliveries. When you call the restaurant to order your beloved burger, the first action taken by the employee taking your call, let’s call her Lucy, is to take your order. Let’s assume you go for a tasty cheeseburger with a can of soda. After Lucy has entered your order in the cash register, she asks for your address, which she also adds to the order. Finally, she asks for your preferred means of payment, after which she provides you with a rough estimate of the time until delivery. The moment Lucy finishes the call, she prints your order and hands it over to the chef, let’s call him Luigi. Since you’ve called relatively early, Luigi can start preparing your burger right away. In the meantime, Lucy takes a can of soda out of the refrigerator and places it on the counter. At the same time, a new call comes in from a different customer, which she handles in roughly the same way as she did yours. When Luigi finishes your burger, he slides it into a carton box and hands the box over to Lucy. Lucy puts the burger in a bag and puts the can of soda next to it. She then hands the bag with your burger and soda to Mike, who uses a fancy electrical bicycle to bring your order to your home.

In this small example, let’s assume that we are interested in the process, i.e., the collection of activities performed for your order. Based on the scenario we just presented, the steps look as follows:

  1. Lucy takes your order
  2. Lucy notes down your address
  3. Lucy notes down your preferred payment method
  4. Luigi prepares your burger
  5. Lucy grabs your can of soda
  6. Luigi puts your burger in a box
  7. Lucy wraps your order
  8. Mike delivers your order

Each entry in the above list is an activity executed by an employee of the burger restaurant, in the context of your order. As such, the list represents what we typically refer to as a process instance.
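A process instance of this form can be captured in code as a simple ordered sequence. A minimal Python sketch (the activity and employee names are taken from the example above; the tuple representation is just one possible encoding):

```python
# A process instance ("case") is an ordered sequence of executed activities;
# here, each step also records the employee who performed it.
order_1337 = [
    ("Take Order", "Lucy"),
    ("Note Address", "Lucy"),
    ("Note Payment Method", "Lucy"),
    ("Prepare Burger", "Luigi"),
    ("Grab Soda", "Lucy"),
    ("Put Burger in Box", "Luigi"),
    ("Wrap Order", "Lucy"),
    ("Deliver Order", "Mike"),
]

# The ordered list of activity names corresponds to steps 1-8 above.
activities = [activity for activity, employee in order_1337]
print(activities)
```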

Assume that, one week later, you order another burger, e.g., the BBQ-bacon burger. This time, you just bought a pack of beer, so you decide not to order any drinks. It turns out that Lucy and Mike have a day off; hence, Randy replaces Lucy at the cash register, whereas John takes care of deliveries. Let's assume that the process instance of your second order, i.e., the BBQ-bacon burger, looks as follows:

  1. Randy takes your order
  2. Randy notes down your preferred payment method
  3. Randy notes down your address
  4. Luigi prepares your burger
  5. Luigi puts your burger in a box
  6. Randy wraps your order
  7. John delivers your order

Do you notice some differences? Clearly, in this example, different employees (Randy/John) execute certain activities that were previously executed by Lucy and Mike. However, there are more differences. For example, Randy first asked for your preferred payment method and only then asked for your address. Similarly, since you did not order any soda, Randy didn't take a can of soda from the fridge.

This type of variability is naturally present in virtually any process. In some cases, variability depends on the context of a process instance, e.g., if you do not order a can of soda, we do not need to grab it from the fridge. In other cases, variability is more natural. For example, noting down your address and/or preferred payment method can be done in any order, i.e., it does not really matter. Hence, there are two possible orders in which we are able to schedule these activities, i.e., either we note your address before your preferred payment method, or we do it after.
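This variability can be made tangible in code: if we record each order as a sequence of activity names, every distinct ordering constitutes a distinct process variant. A small Python sketch (activity names simplified from the two orders above):

```python
from collections import Counter

# The two recorded orders; note the swapped address/payment steps and the
# missing soda-related activity in the second order.
order_1 = ("Take Order", "Note Address", "Note Payment Method",
           "Prepare Burger", "Grab Soda", "Put Burger in Box",
           "Wrap Order", "Deliver Order")
order_2 = ("Take Order", "Note Payment Method", "Note Address",
           "Prepare Burger", "Put Burger in Box",
           "Wrap Order", "Deliver Order")

# Each distinct activity sequence is a "variant"; count how often each occurs.
variants = Counter([order_1, order_2])
print(len(variants))  # 2 distinct variants
```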

Another phenomenon inherently present in processes is concurrency, i.e., the ability to execute two tasks at the same time. In our first example, note that, whilst Luigi was preparing your burger, Lucy grabbed your can of soda. Hence, these two activities are able to happen at the exact same time. A perfect example of concurrency is a pit stop in Formula 1. Observe that, in a pit stop, almost every activity is literally executed at the exact same time!

Event Logs

In the previous section, we briefly introduced the concepts of processes, variability and concurrency, using a few simple examples. Processes are virtually everywhere, i.e., almost everything we do has some notion of activities, which we execute (repetitively) to achieve some goal. Clearly, from a business perspective, understanding how processes are executed within a company is a vital first step towards getting a grip on the process and, eventually, improving it. In general, the field of Business Process Management studies methods, tools and techniques to achieve such an understanding of the processes running at a company. So where does process mining come into play? Well, exactly at the point of understanding the process! So, what is process mining exactly?


Process mining represents a collection of tools, methods, techniques, algorithms, etc., that allows us to achieve a better understanding of the execution of a process, by means of analyzing the operational execution data that is generated during the execution of the process.


Beautiful definition, isn't it? Essentially, within process mining, we aim to get a better understanding of how processes are executed within a company. This is not so different from the aforementioned field of Business Process Management; however, within process mining, a key element of gaining knowledge is the operational execution data!

Let us reconsider the previous example. Assume that the employees of the burger restaurant maintain a large table, e.g., a Microsoft Excel file, in which they keep track of everything they do. Consider the following table, which represents the complete record of the first order described in the previous section.

Table 1: A simple example event log fragment, capturing the trace of process behavior for the first order described in the previous section.
Order Number | Activity                 | Employee | Date           | Time
1337         | Take Order               | Lucy     | April 1st 2020 | 1:37PM
1337         | Note Address of Customer | Lucy     | April 1st 2020 | 1:39PM
1337         | Register Payment Method  | Lucy     | April 1st 2020 | 1:40PM
1337         | Prepare Burger           | Luigi    | April 1st 2020 | 1:41PM
1337         | Grab Soda                | Lucy     | April 1st 2020 | 1:42PM
1337         | Put Burger in Box        | Luigi    | April 1st 2020 | 1:52PM
1337         | Wrap Order               | Lucy     | April 1st 2020 | 1:53PM
1337         | Deliver Order            | Mike     | April 1st 2020 | 1:55PM

Okay, so what is captured in our table? First of all, it appears that our order got number 1337. We ordered our burger on April 1st. The first activity, i.e., taking our order, was performed by Lucy at 1:37PM. The final activity of our order, i.e., the delivery, was performed around 1:55PM; quite a fast service! Whereas this example is a bit simplistic, most modern information systems actually track various parts of the execution of processes! In virtually any field, e.g., finance, health care, production technology, etc., data of the form presented here is present in the underlying information systems.

Each row in the table is referred to as an event, capturing the execution of a specific activity. We differentiate between events and activities because, in the context of some process, we are often able to repeat an activity. Hence, activities are executed, yet events capture their execution. Within process mining, the sequence of rows depicted in the previous table is often referred to as a trace. Hence, when we talk about traces, we talk about sequences of captured events. In the example table, all the recorded activities, i.e., the events, are listed in the context of our order. Hence, in this example, the order number allows us to identify in the context of which process instance the events have been recorded. In the more general sense, the order number is referred to as the process instance identifier or case identifier. Hence, when we refer to a case (identifier), we refer to the unique way to identify in what context the events have been recorded. Examples of other process instance identifiers are a customer identifier/name, a patient identifier/name, a product ID, etc.
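In tabular form, reconstructing traces from events boils down to grouping rows by the case identifier. A small pandas sketch on a miniature, hand-made event log (the column names are illustrative, not a pm4py standard):

```python
import pandas as pd

# A miniature event log: each row is an event; the order number is the case id.
events = pd.DataFrame({
    "order_number": [1337, 1337, 1338, 1337, 1338],
    "activity": ["Take Order", "Note Address of Customer", "Take Order",
                 "Register Payment Method", "Grab Soda"],
})

# Grouping the events by case identifier reconstructs one trace per case.
traces = events.groupby("order_number")["activity"].apply(list)
print(traces[1337])
```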

Imagine that we create a large table, containing all recorded traces of all orders over the past year. Such a table, e.g., containing thousands of orders, is referred to as an Event Log. Finally, there is one more difference between Table 1, and the average event log we find/extract from modern-day information systems. It is very likely that during the execution of order 1337, say at 1:42PM, another customer came in to submit an order. Hence, around 1:42PM Lucy probably started to take the order with number 1338, however, for simplicity, we have not included the activities performed for order 1338 in Table 1. In general, it is very normal that several instances of the process are executed at the same time. For example, in a bank, multiple mortgage applications run at the same point in time. Similarly, in a hospital, multiple patients are treated at the same time.

Process Models

Thus far, we have seen processes, and, we have seen a simple example of an event log. However, as we have indicated, process mining is all about understanding processes. Hence, to be able to understand, and reason on, processes, we need some means to communicate about processes, i.e., means to come to a common understanding of how a specific process is expected to be executed. This is where process models come in. From an abstract point of view, a process model is simply a description of a process. The simplest form of a process model is simply a document, written in natural language, describing how the process is supposed to be executed. In the context of our burger restaurant, we could start our process description by writing “The first activity that needs to be performed for this process is the take order activity...”

The problem with describing a process in natural language is the inherent inconsistency and unclarity of natural language, i.e., different people might interpret certain statements differently. Furthermore, since process mining is a computer science discipline, we want a computer to be able to reason about, and even compute with, a process model. Using a process description in natural language simply does not allow us to do this. Therefore, we typically use a more mathematical notion of a process model. For example, consider the following process model, loosely describing the process executed at the burger restaurant as described earlier.

Figure 1: A simplified process model (using "Business Process Model and Notation", i.e., BPMN, notation) of the Burger Restaurant example process

Within the figure, the leftmost circle represents the ‘start point’ of the process. Similarly, the rightmost circle represents the ‘end point’ of the process. The rectangles represent activities that can be executed in the process. The arrows in the model indicate ‘the flow’ of the process. Hence, after the starting point, the first activity to be executed is the Take Order activity. Observe that, the symbol connected to the take order activity, i.e., on the right-hand side, is a diamond shape with a +-symbol inside. This indicates the start of a parallel block of behavior, i.e., the Note Address and the Note Payment Method activities can be performed in any order, or, at the same time. Note that, the same holds for the Grab Drinks and Prepare Burger activities, however, these activities are only executed, after noting the address and noting the payment method have been completed! After this, the order is wrapped, i.e., visualized by the Wrap Order activity. After this activity, another symbol is present in the process model, i.e., a diamond shape with an x-symbol inside. This indicates that an exclusive choice needs to be made, i.e., either one of the two connecting activities needs to be executed. Hence, the order is either delivered or the customer will pick up his/her order.
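Because the model's structure is so regular, the complete set of activity sequences it allows can even be enumerated mechanically. A small Python sketch, assuming the structure described above (two parallel blocks whose activities may be interleaved in any order, followed by a wrap-up activity and an exclusive choice; activity names as in the model):

```python
from itertools import permutations

parallel_1 = ["Note Address", "Note Payment Method"]  # first parallel block
parallel_2 = ["Grab Drinks", "Prepare Burger"]        # second parallel block
choice = ["Deliver", "Wait for Pickup"]               # exclusive choice

# Every allowed sequence: the fixed start activity, any interleaving of each
# parallel block, the wrap-up activity, and exactly one choice branch.
allowed = {
    ("Take Order",) + p1 + p2 + ("Wrap Order", end)
    for p1 in permutations(parallel_1)
    for p2 in permutations(parallel_2)
    for end in choice
}
print(len(allowed))  # 2 * 2 * 2 = 8 allowed sequences
```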

Observe that, the process model does not capture the full behavior of the process as described in the first section, i.e., it is a simplified model, just to highlight the concept of process models. Note that, using the model, we can reason about what sequences of behavior are supposed to happen in the process, i.e., as described by the model. For example, the sequence <Take Order, Note Address, Note Payment Method, Grab Drinks, Prepare Burger, Wrap Order, Deliver> is a sequence that, according to the model, should be executable in the process. Similarly, the sequences <Take Order, Note Address, Note Payment Method, Grab Drinks, Prepare Burger, Wrap Order, Wait for Pickup> (Wait for Pickup instead of Deliver) and <Take Order, Note Payment Method, Note Address, Grab Drinks, Prepare Burger, Wrap Order, Deliver> (Note Address and Note Payment Method are swapped) are part of the behavior of the process, i.e., according to the provided model. Consider the following image, taken from Process Mining: Data Science in Action; Wil M.P. van der Aalst (2016), page 69, in which all the core elements of the Business Process Model and Notation are listed.

Figure 2: Fundamental elements of the "Business Process Model and Notation", i.e., BPMN, notation, taken from Process Mining: Data Science in Action; Wil M.P. van der Aalst (2016), page 69.

A detailed explanation of different (business) process modeling notations is out of the scope of this getting started page, i.e., business process modeling is an interesting field in its own right (both from an academic as well as an industry perspective). For convenience, we refer to the seminal work Fundamentals of Business Process Management, by Dumas et al. (2018), as well as The Application of Petri Nets to Workflow Management, by Wil M.P. van der Aalst (1998), highlighting the use of Petri nets (a more mathematically rigorous notation compared to BPMN) in the context of business process modeling. Again, Process Mining: Data Science in Action; Wil M.P. van der Aalst (2016), serves as a perfect reference as well.

Process Discovery

Let us briefly recap the three concepts described in the previous three sections, and their relation.

  • Processes; The execution of activities, with some ordering among these activities, in order to achieve a (business) goal, e.g., assembly of a product, providing a service, etc.
  • Event Log; A historical collection of data, capturing the execution of a process. In a way, it is a digital snapshot of what has happened in the past, i.e., how the process has been executed before.
  • Process Model; A model (ranging from natural language to a mathematical model) describing how a process is supposed to behave.
Obviously, process mining is all about the interplay of these three entities.

Let’s get straight into it. One of the most fundamental challenges in process mining is process discovery. The goal is simple: a process discovery algorithm takes an event log as an input, and, returns a process model. Ideally, such a process discovery algorithm does this completely automatically, or, semi-automatically. In the context of our burger restaurant example, assume we have been logging all the orders over the past month, and, we have constructed a large event log. If we give this large event log to a process discovery algorithm, ideally, the algorithm discovers a model like the process model depicted in the previous section, i.e., where we introduced process models. Several process discovery algorithms have been developed, yielding various different types of process models. A few of these have been implemented in pm4py.
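To get a feeling for what such algorithms do internally, consider the directly-follows relation, i.e., counting which activity directly follows which in the recorded traces; many discovery algorithms start from exactly this statistic. A toy Python sketch (an illustration only, not an actual pm4py algorithm):

```python
from collections import Counter

# Two simplified traces from the burger restaurant example.
traces = [
    ("Take Order", "Note Address", "Note Payment Method", "Wrap Order"),
    ("Take Order", "Note Payment Method", "Note Address", "Wrap Order"),
]

# Count how often activity a is directly followed by activity b.
dfg = Counter((a, b) for trace in traces for a, b in zip(trace, trace[1:]))
print(dfg[("Take Order", "Note Address")])
```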

If you are reasonably familiar with Business Process Management, you might already know the term 'process discovery'. Often, BPM consultancy firms apply process discovery in a manual fashion. Various stakeholders that are responsible for certain parts of the process are interviewed. Based on these interviews, a consultant typically creates a process model by hand. However, such a process model typically describes an idealized view of the process. Stakeholders will (knowingly or not) describe only a small fraction of the true behavior of the process. This is where process discovery (i.e., in the process mining sense) comes in. The process model discovered on the basis of the logged event data shows us what actually happened, rather than what is perceived to have happened.

Conformance Checking

In conformance checking, we assume that we already have a process model that accurately reflects the intended process behavior. Such a process model is either drawn by hand, i.e., by a process expert, or is the result of applying a process discovery algorithm to an event log. Note that there are various use cases in which checking whether the execution of a process is in line with a reference model is valuable. For example, the reference model may express certain rules or regulations dictated by policy makers. Alternatively, a specific sequence of activities might be required to handle a complaint correctly.

There exist various ways to compute whether or not a given process model and event log are compliant w.r.t. each other. However, an elaborate discussion of these types of techniques is far out of scope here. The main take-away w.r.t. conformance checking is the following. Assume that we observe the trace of process behavior, depicted in Table 2, for our burger restaurant:

Table 2: A simple example event log fragment, capturing a trace of process behavior for the burger restaurant. Observe that, w.r.t. the reference model depicted in Figure 1, Lucy has registered the payment method twice, and the order was not wrapped. Conformance checking techniques allow us to compute such observations, and to quantify the number of deviations observed.
Order Number | Activity                 | Employee | Date           | Time
1338         | Take Order               | Lucy     | April 1st 2020 | 1:42PM
1338         | Note Address of Customer | Lucy     | April 1st 2020 | 1:44PM
1338         | Register Payment Method  | Lucy     | April 1st 2020 | 1:45PM
1338         | Register Payment Method  | Lucy     | April 1st 2020 | 1:46PM
1338         | Grab Soda                | Lucy     | April 1st 2020 | 1:47PM
1338         | Prepare Burger           | John     | April 1st 2020 | 1:51PM
1338         | Put Burger in Box        | John     | April 1st 2020 | 1:57PM
1338         | Deliver Order            | Mike     | April 1st 2020 | 2:00PM

Conformance checking techniques allow us to compute that, in this example, sadly, Lucy has registered the payment method twice, and, the order was not wrapped! Additionally, conformance checking techniques allow us to quantify (numerically) the number of deviations observed in an event log.
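The simplest flavor of such a deviation count can be sketched by comparing the multiset of observed activities against the activities expected by the model (a deliberately naive illustration; real conformance checking techniques, such as alignments, also take the ordering of activities into account):

```python
from collections import Counter

# Activities expected for one order according to the reference model.
expected = Counter(["Take Order", "Note Address of Customer",
                    "Register Payment Method", "Prepare Burger", "Grab Soda",
                    "Put Burger in Box", "Wrap Order", "Deliver Order"])

# Activities observed for order 1338 in Table 2.
observed = Counter(["Take Order", "Note Address of Customer",
                    "Register Payment Method", "Register Payment Method",
                    "Prepare Burger", "Grab Soda", "Put Burger in Box",
                    "Deliver Order"])

extra = observed - expected    # executed more often than the model allows
missing = expected - observed  # required by the model, but never executed
print(dict(extra), dict(missing))
```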

TLDR; Key Take-Aways

Process Mining is a collection of data-driven algorithms/methods/techniques/methodologies, with the main aim to better understand processes. The core idea of process mining is to acquire said understanding by analyzing the operational event data that was generated and stored during the execution of the process under study. Such operational event data is often referred to as an Event Log. As such, an event log is one of the main input artefacts of any process mining analysis.

There is a variety of analyses one can perform, using an event log as an input. Two of the most common use cases in process mining (there are many more) are:

  • Process Discovery; (semi-)automated techniques that translate the event data into a Process Model that accurately describes the process, i.e., as captured in the event log.
  • Conformance Checking; automated techniques that allow us to quantify to what degree a process, i.e., as captured in an event log, conforms to a process model specifying the intended behavior of the process.

Finally, what does pm4py have to do with all of this? pm4py is simply a python library that implements a wide variety of process mining algorithms!

Importing Your First Event Log

In this section, we explain how to import (and export) event data in PM4Py. We assume that you are familiar with the conceptual basics of process mining, i.e., as described in the previous section.

File Types: CSV and XES

As explained in the previous section, process mining exploits Event Logs to generate knowledge of a process. A wide variety of information systems, e.g., SAP, ORACLE, SalesForce, etc., allow us to extract, in one way or the other, event logs similar to the example event log presented in Table 1 and Table 2. All the examples we show in this section and all algorithms implemented in pm4py assume that we have already extracted the event data into an appropriate event log format. Hence, the core of pm4py does not support any data extraction features. However, we provide solutions for data extraction purposes, i.e., please inspect the corresponding solutions page.

In order to support interoperability between different process mining tools and libraries, two standard data formats are used to capture event logs, i.e., Comma Separated Value (CSV) files and eXtensible Event Stream (XES) files. CSV files resemble the example tables shown in the previous section, i.e., Table 1 and Table 2. Each line in such a file describes an event that occurred. The columns represent the same type of data as shown in the examples, e.g., the case for which the event occurred, the activity, the timestamp, the resource executing the activity, etc. The XES file format is an XML-based format that allows us to describe process behavior. We will not go into details w.r.t. the format of XES files, i.e., we refer to http://xes-standard.org/ for an overview.

In the remainder of this tutorial, we will use an often-used dummy example event log to explain the basic process mining operations. The process that we are considering is a simplified process related to customer complaint handling, i.e., taken from the book of van der Aalst. The process, and the event data we are going to use, look as follows.

Figure 3: Running example BPMN-based process model describing the behavior of the simple process that we use in this tutorial.

Let’s get started! We have prepared a small sample event log, containing behavior similar to the process model in Figure 3. You can find the sample event log here. Please download the file and store it somewhere on your computer, e.g., your Downloads folder (on Windows: this is 'C:/Users/user_name/Downloads'). Consider Figure 4, in which we depict the first 25 rows of the example file.

Figure 4: Running example csv data set which we will use in this tutorial.

Note that, the data depicted in Figure 4 describes a table, however, in text format. Each line in the file corresponds to a row in the table. Whenever we encounter a ‘;’ symbol on a line, this implies that we are ‘entering’ the next column. The first line (i.e., row) specifies the name of each column. Observe that, in the data table described by the file, we have 5 columns: case_id, activity, timestamp, costs and resource. Observe that, similar to our previous example, the first column represents the case identifier, i.e., allowing us to identify what activity has been logged in the context of what instance of the process. The second column shows the activity that has been performed. The third column shows at what point in time the activity was recorded. In this example data, additional information is present as well. In this case, the fourth column tracks the costs of the activity, whereas the fifth column tracks what resource has performed the activity.

Before we go into loading the example file into PM4Py, let us briefly take a look at the data. Observe that lines 2-10 show the events that have been recorded for the process instance identified by case identifier 3. We observe that first a register request activity was performed, followed by the examine casually, check ticket, decide, reinitiate request, examine thoroughly, check ticket, decide, and finally, pay compensation activities. Note that, indeed, in this case the recorded process instance behaves as described by the model depicted in Figure 3.

Loading CSV Files

Given that we have familiarized ourselves with event logs and a way to represent event logs in a CSV file, it is time to start doing some process mining! We are going to load the event data, and we are going to count how many cases are present in the event log, as well as the number of events. Note that, for all this, we are effectively using a third-party library called pandas. We do so because pandas is the de-facto standard for loading/manipulating csv-based data. Hence, any process mining algorithm implemented in PM4Py, using an event log as an input, can work directly with a pandas DataFrame!

import pandas


def import_csv(file_path):
    event_log = pandas.read_csv(file_path, sep=';')
    num_events = len(event_log)
    num_cases = len(event_log.case_id.unique())
    print("Number of events: {}\nNumber of cases: {}".format(num_events, num_cases))


if __name__ == "__main__":
    import_csv("C:/Users/demo/Downloads/running-example.csv")
Example 1: Loading an event log stored in a CSV file and computing the number of cases and the number of events in the file. In this example, PM4Py is not used yet; everything is handled using pandas. If you run the code yourself, make sure to replace the path 'C:/Users/demo/Downloads/running-example.csv' with the appropriate path on your computer containing the running example file.

Observe that the output of the code is as follows.

>> Number of events: 42
>> Number of cases: 6

We will quickly go through the above example code. In the first line, we import the pandas library. The last lines (containing the if-statement) make sure that the code, when pasted, runs on its own (we will omit these lines from future examples). The core of the script is the function import_csv. As an input parameter, it requires the path to the csv file. The script uses the pandas read_csv-function, to load the event data. To calculate the number of events, we simply query the length of the data frame, i.e., by calling len(event_log). To calculate the number of cases, we use a built-in pandas function to return the number of unique values of the case_id column, i.e., event_log.case_id.unique(). Since that function returns a pandas built-in array object containing all the values of the column, we again query for its length. Note that, as is often the case when programming, there is a wide variety of ways to compute the aforementioned example statistics on the basis of a given CSV file.
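For instance, the number of cases can equally be obtained via pandas' built-in nunique() shorthand. A small sketch on a hand-made frame (the column names follow the running example; the values are made up for illustration):

```python
import pandas as pd

# A miniature event log with the same column layout as the running example.
event_log = pd.DataFrame({
    "case_id": [3, 3, 3, 1, 1, 2],
    "activity": ["register request", "check ticket", "decide",
                 "register request", "decide", "register request"],
})

num_events = len(event_log)                 # one row per event
num_cases = event_log["case_id"].nunique()  # shorthand for len(...unique())
print(num_events, num_cases)
```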

Now that we have loaded our first event log, it is time to put some PM4Py into the mix. Let us assume that we are not only interested in the number of events and cases, but also want to figure out what activities occur first and what activities occur last in the traces described by the event log. PM4Py has specific built-in functions for this, i.e., get_start_activities() and get_end_activities() respectively. Consider Example 2, in which we present the corresponding script.

import pandas
import pm4py


def import_csv(file_path):
    event_log = pandas.read_csv(file_path, sep=';')
    event_log = pm4py.format_dataframe(event_log, case_id='case_id', activity_key='activity', timestamp_key='timestamp')
    start_activities = pm4py.get_start_activities(event_log)
    end_activities = pm4py.get_end_activities(event_log)
    print("Start activities: {}\nEnd activities: {}".format(start_activities, end_activities))
Example 2: Loading an event log stored in a CSV file and computing the start and end activities of the traces in the event log. If you run the code yourself, make sure to point the file path to the appropriate path on your computer containing the running example file.

Observe that running the code gives us the following output:

>> Start activities: {'register request': 6}
>> End activities: {'pay compensation': 3, 'reject request': 3}

Note that, we now import both pandas and pm4py. The first line of our script again loads the event log stored in CSV format as a data frame. The second line transforms the event data table into a format that can be used by any process mining algorithm in pm4py. That is, the format_dataframe()-function creates a copy of the input event log and renames the assigned columns to standardized column names used in pm4py. In our example, the column case_id is renamed to case:concept:name, the activity column is renamed to concept:name and the timestamp column is renamed to time:timestamp. The underlying reason for using the aforementioned standard names is primarily related to XES legacy (XES being the other file format that we will look at shortly). Hence, it is advisable to always import a csv-based log as follows.

import pandas as pd
import pm4py

event_log = pm4py.format_dataframe(pd.read_csv(file_path, sep=';'), case_id='case_id',
                                   activity_key='activity', timestamp_key='timestamp')

Note that, in this example, the values of the arguments, i.e., sep, case_id, activity_key and timestamp_key, depend on the input data. To obtain the activities that occur first and, respectively, last in any trace in the event log, we call the pm4py.get_start_activities(event_log) and pm4py.get_end_activities(event_log) functions. The functions return a dictionary containing the activities as keys and, as values, the number of observations (i.e., the number of traces in which they occur first, respectively last) in the event log.
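What these two functions compute can be mimicked on a formatted data frame with plain pandas: take the first, respectively last, event of every case and count the activities (a sketch for illustration, assuming the events are already ordered by time within each case; pm4py's own implementation remains the authoritative one):

```python
import pandas as pd

# A miniature, already-formatted event log (standard pm4py column names).
event_log = pd.DataFrame({
    "case:concept:name": [1, 1, 2, 2],
    "concept:name": ["register request", "reject request",
                     "register request", "pay compensation"],
})

grouped = event_log.groupby("case:concept:name")["concept:name"]
start_activities = grouped.first().value_counts().to_dict()
end_activities = grouped.last().value_counts().to_dict()
print(start_activities, end_activities)
```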

PM4Py exploits a built-in pandas function to automatically detect the format of the timestamps in the input data. However, pandas looks at the timestamp values in each row in isolation. In some cases, this can lead to problems. For example, if most values look like 2020-01-18, i.e., first the year, then the month, and then the day, a value such as 2020-02-01 may be interpreted wrongly as January 2nd, rather than February 1st. To alleviate this problem, an additional parameter can be provided to the format_dataframe() method, i.e., the timest_format parameter. The default Python timestamp format codes can be used to specify the timestamp format. In this example, the timestamp format is %Y-%m-%d %H:%M:%S%z. In general, we advise to always specify the timestamp format!
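The ambiguity can be reproduced with pandas directly; passing an explicit format removes the guesswork (a small sketch; the format string uses standard Python format codes):

```python
import pandas as pd

# Without an explicit format, "02/01/2020" is read month-first by default,
# i.e., as February 1st...
guessed = pd.to_datetime("02/01/2020")

# ...whereas an explicit day-first format yields January 2nd.
explicit = pd.to_datetime("02/01/2020", format="%d/%m/%Y")

print(guessed.month, explicit.month)
```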

Loading XES Files

Next to CSV files, event data can also be stored in an XML-based format, i.e., in XES files. In an XES file, we can describe a containment relation, i.e., a log contains a number of traces, which in turn contain several events. Furthermore, an object, i.e., a log, trace, or event, is allowed to have attributes. The advantage is that certain data attributes that are constant for a log or a trace, can be stored at that level. For example, assume that we only know the total costs of a case, rather than the costs of the individual events. If we want to store this information in a CSV file, we either need to replicate this information (i.e., we can only store data in rows, which directly refer to events), or, we need to explicitly define that certain columns only get a value once, i.e., referring to case-level attributes. The XES standard more naturally supports the storage of this type of information.
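The containment relation can be pictured with a minimal, hand-written fragment (a sketch of the nesting only, not a complete, valid XES document; the total-costs attribute is a hypothetical example of a trace-level attribute):

```xml
<log>
  <string key="concept:name" value="running-example"/>
  <trace>
    <string key="concept:name" value="1"/>
    <string key="total-costs" value="150"/>
    <event>
      <string key="concept:name" value="register request"/>
      <string key="org:resource" value="Pete"/>
      <date key="time:timestamp" value="2010-12-30T14:32:00+01:00"/>
    </event>
    <!-- ... further events of this trace ... -->
  </trace>
  <!-- ... further traces ... -->
</log>
```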

Consider Figure 5, in which we depict a snapshot of the running example data stored in the .xes file format. The complete file can be downloaded here.

Figure 5: Running example xes data set.

Observe that the trace with number 1 (reflected by the <string key="concept:name">-tag on line 9) is the first trace recorded in this event log. The first event of the trace represents the “register request” activity executed by Pete. The second event is the “examine thoroughly” activity, executed by Sue, etc. We will not elaborate on the XES standard in detail here; we refer to the XES homepage, and to our video tutorial on importing XES, for more information.

Importing an XES file is fairly straightforward. PM4Py has a special read_xes()-function that can parse a given xes file and load it in PM4Py, i.e., as an Event Log object. Consider the following code snippet, in which we show how to import an XES event log. Like the previous example, the script outputs activities that can start and end a trace.

import pm4py

def import_xes(file_path):
    event_log = pm4py.read_xes(file_path)
    start_activities = pm4py.get_start_activities(event_log)
    end_activities = pm4py.get_end_activities(event_log)
    print("Start activities: {}\nEnd activities: {}".format(start_activities, end_activities))

import_xes('C:/Users/demo/Downloads/running-example.xes')

Exporting Event Data

Now that we are able to import event data into PM4Py, let’s take a look at the opposite, i.e., exporting event data. Exporting event logs can be very useful; e.g., we might want to convert a .csv file into a .xes file, or filter out certain (noisy) cases and save the filtered event log. Like importing, exporting event data is possible in two ways, i.e., exporting to csv (using pandas) and exporting to xes. In the upcoming sections, we show how to export an event log stored as a pandas data frame into a csv file, a pandas data frame as a xes file, a PM4Py event log object as a csv file, and finally, a PM4Py event log object as a xes file.

Storing a Pandas Data Frame as a csv file

Storing an event log that is represented as a pandas dataframe is straightforward, i.e., we can directly use the to_csv (full reference here) function of the pandas DataFrame object. Consider the following example snippet of code, in which we show this functionality.

import pandas as pd
import pm4py

event_log = pm4py.format_dataframe(pd.read_csv('C:/Users/demo/Downloads/running-example.csv', sep=';'),
                                   case_id='case_id', activity_key='activity', timestamp_key='timestamp')
event_log.to_csv('C:/Users/demo/Desktop/running-example-exported.csv')

The example code imports the running example csv file as a pandas data frame and exports it to a csv file at the location ‘C:/Users/demo/Desktop/running-example-exported.csv’. Note that, by default, pandas uses a ‘,’-symbol rather than a ‘;’-symbol as the column separator.
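If the ‘;’-separator of the input file should be preserved, it can be passed explicitly to to_csv. A small self-contained sketch (using a toy frame rather than the running example, and writing to an in-memory buffer for illustration):

```python
import io
import pandas as pd

# Toy frame standing in for a formatted event log.
df = pd.DataFrame({'case_id': [1, 1], 'activity': ['register request', 'decide']})

buf = io.StringIO()
# sep=';' matches the input file's separator; index=False drops the
# artificial row index that pandas would otherwise write as a column.
df.to_csv(buf, sep=';', index=False)
print(buf.getvalue().splitlines()[0])  # -> case_id;activity
```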

Storing a Pandas DataFrame as a .xes file

It is also possible to store a pandas data frame as an .xes file. This is simply done by calling the pm4py.write_xes() function. You can pass the data frame as an input parameter to the function, i.e., PM4Py handles the internal conversion of the data frame to an event log object prior to writing it to disk. Note that this construct only works if you have formatted the data frame, i.e., as highlighted earlier in the importing CSV section.

import pandas
import pm4py

event_log = pm4py.format_dataframe(pandas.read_csv('C:/Users/demo/Downloads/running-example.csv', sep=';'),
                                   case_id='case_id', activity_key='activity', timestamp_key='timestamp')
pm4py.write_xes(event_log, 'C:/Users/demo/Desktop/running-example-exported.xes')

Storing an Event Log object as a .csv file

In some cases, we might want to store an event log object, e.g., obtained by importing a .xes file, as a csv file. For example, certain (commercial) process mining tools only support csv importing. For this purpose, pm4py offers conversion functionality that allows you to convert your event log object into a data frame, which you can subsequently export using pandas.

import pm4py

event_log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')
df = pm4py.convert_to_dataframe(event_log)
df.to_csv('C:/Users/demo/Desktop/running-example-exported.csv')
                                

Storing an Event Log Object as a .xes File

Storing an event log object as a .xes file is rather straightforward. In pm4py, the write_xes() method allows us to do so. Consider the simple example script below in which we show an example of this functionality.

import pm4py

event_log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')
pm4py.write_xes(event_log, 'C:/Users/demo/Desktop/running-example-exported.xes')

Generic Event Log Filtering

Like any data-driven field, the successful application of process mining needs data munging and crunching. In pm4py, you can munge and crunch your data in two ways, i.e., you can write lambda functions and apply them on your event log, or, you can apply pre-built filtering and transformation functions. In this section, we briefly explain how to use generic lambda functionality in pm4py, in the next section, we cover specific filtering functions.

In a nutshell, a lambda function allows you to specify a function that needs to be applied on a given element. As a simple example, take the following python code:

                                    f = lambda x: 2 * x
print(f(2))
                                    
                                    

In the code, we assign a lambda function to the variable f. The function multiplies any input it receives by 2. The output of the script is as follows:

                                    >> 4
                                    
                                    

Invoking variable f with argument value 2 triggers Python to execute the lambda function on the given argument, i.e., 2. Clearly, 2*2=4; hence, the output of the script is four. Note that this only works if we provide input that supports the ‘* 2’ operation. For example, for strings, the ‘* 2’ operation concatenates the input argument with itself:

                                    f = lambda x: 2 * x
print(f('Pete'))

>> PetePete

f = lambda x: 3 * x
print(f('Pete'))

>> PetePetePete
                                    
                                    

Lambda functions allow us to write short, type-independent functions. Given a list of objects, Python provides two core functions that apply a given lambda function on top of each element of the given list (in fact, any iterable):

  • filter(f,l); Apply the given lambda function f as a filter on the list (iterable) l.
  • map(f,l); Apply the given lambda function f as a transformation on the list (iterable) l.
(For more information, study the concept of ‘higher order functions’ in Python, e.g., as introduced here.) Consider the following basic example script in which we apply both a filter and a map function on a list of numbers:

                                    l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

filtered = filter(lambda n: n >= 5, l)
transformed = map(lambda n: n * 3, l)

print(filtered)
print(transformed)

>> <filter object at 0x000002172EC34AC8>
>> <map object at 0x000002172EC34B38>

print(list(filtered))
print(list(transformed))

>> [5, 6, 7, 8, 9, 10]
>> [3, 6, 9, 12, 15, 18, 21, 24, 27, 30]
                                    
                                    

The functionality of the filter and map functions needs little explanation: the filter retains all numbers in the list greater than or equal to five, and the map multiplies each number by 3. What is interesting, however, is that the resulting objects are no longer lists (or iterables); rather, a filter object and a map object are returned. These objects can easily be transformed into a list by wrapping them in a list() call (the last part of the script shows this functionality).

In pm4py, event logs mimic lists of traces, which in turn mimic lists of events. Clearly, lambda functions can therefore be applied to event logs and traces. However, as we have shown in the previous example, after applying such a lambda-based map or filter, the resulting object is no longer an event log. Furthermore, casting a filter object or map object to an event log in pm4py is a bit more involved, i.e., it is not as trivial as the list(filtered) call in the previous example. To this end, pm4py offers wrapper functions that make sure that, after applying your higher-order function with a lambda function, the resulting object is again an Event Log object. Consider the following script, in which we provide a small example of this functionality.

                                    import pm4py

log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')
for t in log:
    print(len(t))

>> 9
>> 5
>> 5
>> 5
>> 13

long_traces = pm4py.filter_log(lambda t: len(t) > 5, log)
for t in long_traces:
    print(len(t))

>> 9
>> 13

short_traces = pm4py.filter_log(lambda t: len(t) <= 5, log)
for t in short_traces:
    print(len(t))

>> 5
>> 5
>> 5

print(type(long_traces))
print(type(long_traces[0]))

>> <class 'pm4py.objects.log.log.EventLog'>
>> <class 'pm4py.objects.log.log.Trace'>

                                

In the script, with one line of code, we retain the traces that have at least (or at most) a certain length. Furthermore, the type of the resulting object is again a PM4Py EventLog object, and the traces within the event log are PM4Py Trace objects as well.

Apart from log filtering, pm4py exposes the following functionality related to higher-order functions:

  • pm4py.filter_log(f, log); filters the log according to function f.
  • pm4py.filter_trace(f,trace); filters the trace according to function f.
  • pm4py.sort_log(log, key, reverse); sorts the event log according to a given key, reversed order if reverse is True.
  • pm4py.sort_trace(trace, key, reverse); sorts the trace according to a given key, reversed order if reverse is True.

Pre-Built Event Log Filters

There are various pre-built filters in PM4Py, which make commonly needed process mining filtering functionality a lot easier. In the upcoming list, we briefly give an overview of these functions. We describe how to call them, their main input parameters and their return objects.

  • filter_start_activities(log, activities, retain=True); This function filters the given event log object (either a data frame or a PM4Py event log object) based on a given set of input activity names that need to occur at the starting point of a trace. If we set retain to False, we remove all traces that contain any of the specified activities as their first event.
  • filter_end_activities(log, activities, retain=True); Similar functionality to the start activity filter. However, in this case, the filter is applied for the activities that occur at the end of a trace.
  • filter_event_attribute_values(log, attribute_key, values, level="case", retain=True); Filters an event log (either data frame or PM4Py EventLog object) on event attributes. The attribute_key is a string representing the attribute key to filter, the values parameter allows you to specify a set of allowed values. If the level parameter is set to 'case', then any trace that contains at least one event that matches the attribute-value combination is retained. If the level parameter value is set to 'event', only the events are retained that describe the specified value. Setting retain to False inverts the filter.
  • filter_trace_attribute_values(log, attribute_key, values, retain=True); Keeps (or removes if retain is set to False) only the traces that have an attribute value for the provided attribute_key and listed in the collection of corresponding values.
  • filter_variants(log, variants, retain=True) ; Keeps those traces that correspond to a specific activity execution sequence, i.e., known as a variant. For example, in a large log, we want to retain all traces that describe the execution sequence 'a', 'b', 'c'. The variants parameter is a collection of lists of activity names.
  • filter_directly_follows_relation(log, relations, retain=True); This function filters all traces that contain a specified 'directly follows relation'. Such a relation is simply a pair of activities, e.g., ('a','b'), such that 'a' is directly followed by 'b' in a trace. For example, the trace <'a','b','c','d'> contains the directly follows pairs ('a','b'), ('b','c') and ('c','d'). The relations parameter is a set of tuples, containing activity names. The retain parameter allows us to express whether we want to keep or remove the matching traces.
  • filter_eventually_follows_relation(log, relations, retain=True); This function allows us to match traces on a generalization of the directly follows relation, i.e., an arbitrary number of activities is allowed to occur in-between the input relations. For example, when we call the function with a relation ('a','b'), any trace in which we observe activity 'a' at some point, to be followed later by activity 'b', again at some point, adheres to this filter. For example, a trace <'a','b','c','d'> contains the eventually follows pairs ('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d') and ('c','d'). Again, the relations parameter is a set of tuples, containing activity names, and the retain parameter allows us to express whether we want to keep or remove the matching traces.
  • filter_time_range(log, dt1, dt2, mode='events'); Filters the event log based on a given time range, defined by timestamps dt1 and dt2. The timestamps should be of the form datetime.datetime. The filter has three modes (default: 'events'):
    • 'events'; Retains all events that fall in the provided time range. Removes any empty trace in the filtered event log.
    • 'traces_contained'; Retains any trace that is completely 'contained' within the given time frame. For example, this filter is useful if one is interested in retaining all full traces in a specific day/month/year.
    • 'traces_intersecting'; Retains any trace that has at least one event that falls into the given time range.

Consider the example code below, in which we provide various example applications of the mentioned filtering functions, using the running example event log. Try to copy-paste each line in your own environment and play around with the resulting filtered event log to get a good idea of the functionality of each filter. Note that, all functions shown below also work when providing a dataframe as an input!

                                    import pm4py
import datetime as dt

log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')

filtered = pm4py.filter_start_activities(log, {'register request'})

filtered = pm4py.filter_start_activities(log, {'register request TYPO!'})

filtered = pm4py.filter_end_activities(log, {'pay compensation'})

filtered = pm4py.filter_event_attribute_values(log, 'org:resource', {'Pete', 'Mike'})

filtered = pm4py.filter_event_attribute_values(log, 'org:resource', {'Pete', 'Mike'}, level='event')

filtered = pm4py.filter_trace_attribute_values(log, 'concept:name', {'3', '4'})

filtered = pm4py.filter_trace_attribute_values(log, 'concept:name', {'3', '4'}, retain=False)

filtered = pm4py.filter_variants(log, [
    ['register request', 'check ticket', 'examine casually', 'decide', 'pay compensation']])

filtered = pm4py.filter_variants(log, [
    ['register request', 'check ticket', 'examine casually', 'decide', 'reject request']])

filtered = pm4py.filter_directly_follows_relation(log, [('check ticket', 'examine casually')])

filtered = pm4py.filter_eventually_follows_relation(log, [('examine casually', 'reject request')])

filtered = pm4py.filter_time_range(log, dt.datetime(2010, 12, 30), dt.datetime(2010, 12, 31), mode='events')

filtered = pm4py.filter_time_range(log, dt.datetime(2010, 12, 30), dt.datetime(2010, 12, 31),
                                   mode='traces_contained')

filtered = pm4py.filter_time_range(log, dt.datetime(2010, 12, 30), dt.datetime(2010, 12, 31),
                                   mode='traces_intersecting')

                                

Discovering Your First Process Model

Now that we have studied the basic concepts of process mining and event data munging and crunching, we focus on process discovery. As indicated, the goal is to discover, i.e., primarily in a completely automated, algorithmic fashion, a process model that accurately describes the process as observed in the event data. For example, given the running example event data, we aim to discover the process model that we have used to explain the running example's process behavior, i.e., Figure 3. This section briefly explains what modeling formalisms exist in PM4Py, while applying different process discovery algorithms. Secondly, we give an overview of the implemented process discovery algorithms, their output type(s), and how we can invoke them. Finally, we discuss the challenges of applying process discovery in practice.

Obtaining a Process Model

There are three different process modeling notations that are currently supported in PM4Py. These notations are: BPMN, i.e., models such as the ones shown earlier in this tutorial, Process Trees and Petri nets. A Petri net is a more mathematical modeling representation compared to BPMN. Often the behavior of a Petri net is more difficult to comprehend compared to BPMN models. However, due to their mathematical nature, Petri nets are typically less ambiguous (i.e., confusion about their described behavior is not possible). Process Trees represent a strict subset of Petri nets and describe process behavior in a hierarchical manner. In this tutorial, we will focus primarily on BPMN models and process trees. For more information about Petri nets and their application to (business) process modeling (from a ‘workflow’ perspective), we refer to this article.

Interestingly, none of the algorithms implemented in PM4Py directly discovers a BPMN model. However, any process tree can easily be translated to a BPMN model. Since we have already discussed the basic operators of BPMN models, we will start with the discovery of a process tree, which we convert to a BPMN model. Later, we will study the ‘underlying’ process tree. The algorithm that we are going to use is the ‘Inductive Miner’; more details about the (inner workings of the) algorithm can be found in this presentation and in this article. Consider the following code snippet. We discover a BPMN model (using a conversion from process tree to BPMN) using the inductive miner, based on the running example event data set.

                                    import pm4py
log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')

process_tree = pm4py.discover_tree_inductive(log)
bpmn_model = pm4py.convert_to_bpmn(process_tree)
pm4py.view_bpmn(bpmn_model)

                                

Note that the resulting process model is the following image:

Figure 6: BPMN model discovered based on the running example event data set, using the Inductive Miner implementation of PM4Py.

Observe that the process model that we discovered, is indeed the same model as the model that we have used before, i.e., as shown in Figure 3.

As indicated, the algorithm used in this example actually discovers a Process Tree. Such a process tree is, mathematically speaking, a rooted tree annotated with ‘control-flow’ information. We’ll first use the following code snippet to discover a process tree based on the running example, and, afterwards shortly analyze the model.

                                    import pm4py
log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')

process_tree = pm4py.discover_tree_inductive(log)
pm4py.view_process_tree(process_tree)

                                
Figure 7: Process Tree model discovered based on the running example event data set, using the Inductive Miner implementation of PM4Py.

We read the process tree model from top to bottom. The first circle, i.e., the ‘root’ of the process tree, describes a ‘->’ symbol. This means that, when scrolling further down, the process described by the model executes the ‘children’ of the root from left to right. Hence, first “register request” is executed, followed by the circle node with the ‘*’ symbol, finally to be followed by the node with the ‘X’ symbol. The node with the ‘*’ represents ‘repeated behavior’, i.e., the possibility to repeat the behavior. When scrolling further down, the left-most ‘subtree’ of the ‘*’-operator is always executed; the right-most child (in this case, “reinitiate request”) triggers a repeated execution of the left-most child. Observe that this is in line with the process models we have seen before, i.e., the “reinitiate request” activity allows us to repeat the behavior regarding examinations and checking the ticket. When we go further down in the subtree of the ‘*’-operator, we again observe a ‘->’ node. Hence, its left-most child is executed first, followed by its right-most child (“decide”). The left-most child of the ‘->’ node has a ‘+’ symbol. This represents concurrent behavior; hence, its children can be executed simultaneously or in any order. Its left-most child is the “check ticket” activity. Its right-most child is a node with an ‘X’ symbol (just like the right-most child of the tree's root). This represents an exclusive choice, i.e., exactly one of the children is executed (either “examine casually” or “examine thoroughly”). Observe that the process tree describes the exact same behavior as the BPMN models shown before.

Obtaining a Process Map

Many commercial process mining solutions do not provide extended support for discovering process models. Often, as a main visualization of processes, process maps are used. A process map contains activities and connections (by means of arcs) between them. A connection between two activities usually means that there is some form of precedence relation. In its simplest form, it means that the ‘source’ activity directly precedes the ‘target’ activity. Let’s quickly take a look at a concrete example! Consider the following code snippet, in which we learn a ‘Directly Follows Graph’ (DFG)-based process map:

                                    import pm4py
log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')

dfg, start_activities, end_activities = pm4py.discover_dfg(log)
pm4py.view_dfg(dfg, start_activities, end_activities)
                                    
Figure 8: Process Map (DFG-based) discovered based on the running example event data set.

The pm4py.discover_dfg(log) function returns a triple. The first result, called dfg in this example, is a dictionary mapping pairs of activities that directly follow each other to the number of corresponding observations. The second and third results are the start and end activities observed in the event log (again as counters). In the visualization, the green circle represents the start of any observed process instance; the orange circle represents the end of an observed process instance. In 6 cases, register request is the first activity observed (represented by the arc labeled with value 6). In the event log, the check ticket activity is executed directly after the register request activity twice; the examine thoroughly activity follows registration once, and examine casually follows 3 times. Note that, indeed, in total, the register activity is followed by 6 different events, i.e., there are 6 traces in the running example event log. However, there are typically many more observable relations than there are cases in an event log. Even for this simple event data, the DFG-based process map of the process is much more complex than the process models learned earlier. Furthermore, it is much more difficult to infer the actual execution of the process based on the process map. Hence, when using process maps, one should be very careful when trying to comprehend the actual process.

In PM4Py, we also implemented the Heuristics Miner, a more advanced process map discovery algorithm compared to its DFG-based alternative. We won’t go into the algorithmic details here; however, a HM-based process map takes observed concurrency into account. For example, the algorithm is able to detect that the ticket check and examination are concurrent. Hence, these activities will not be connected in the process map. As such, a HM-based process map is typically simpler than a DFG-based process map.

                                    import pm4py
log = pm4py.read_xes('C:/Users/demo/Downloads/running-example.xes')

heuristics_net = pm4py.discover_heuristics_net(log)
pm4py.view_heuristics_net(heuristics_net)
                                    
Figure 9: Process Map (HM-based) discovered based on the running example event data set.

Conformance Checking

Under Construction