Supported/Described Version(s): PM4Py 1.3.X
This documentation assumes that the reader has a basic understanding of process mining and python concepts.

Handling Event Data

In this section, information about importing and exporting event logs, stored in various data formats, is presented. Before we dive into the details of importing and exporting the various types of files containing event data, we first briefly explain the two basic notions of event data used within PM4Py. We assume the reader to be familiar with the general concept of an event log. In general, we distinguish between two different event data object types:

  • Event Stream (objects.log.log.EventStream); Simply represents a sequence of events. Events themselves are an extension of the Mapping class of Python (collections.abc.Mapping), which allows us to use events as a dict. From a programming perspective, an Event Stream behaves exactly like a list object in Python. However, when applying lambda functions, the result needs to be explicitly cast to an EventStream object (see the sketch after this list).
  • Event Log (objects.log.log.EventLog); Represents a sequence of sequences of events. The concept of an event log is the more traditional view on event data, i.e., executions of a process are captured in traces of events. In PM4Py, the Event Log maintains an order of traces. In this way, sorting traces using some specific sorting criterion is supported naturally, and lambda functions and filters are easily applied on top of Event Logs as well.
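
The following minimal sketch illustrates the list-like behavior of an Event Stream and the explicit cast mentioned above. It assumes that the Event and EventStream constructors accept, respectively, a dictionary and a list of events (the attribute values are purely illustrative).

from pm4py.objects.log.log import Event, EventStream

# two dict-like events with an illustrative attribute
stream = EventStream([Event({"concept:name": "register request"}),
                      Event({"concept:name": "pay compensation"})])

print(stream[0]) # events are accessed like list elements

# filtering with a lambda yields a plain iterable, so the result is
# explicitly cast back to an EventStream object
stream = EventStream(filter(lambda e: e["concept:name"] != "pay compensation", stream))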

Importing IEEE XES files

IEEE XES is a standard format describing how event logs are stored. For more information about the format, please study the IEEE XES Website. A simple synthetic event log can be downloaded from here. Note that several real event logs have been made available over the past few years. You can find them here.


The example code on the right shows how to import an event log, stored in the IEEE XES format, given a file path to the log file. The code fragment uses the standard importer (iterparse, described in a later paragraph). Note that IEEE XES Event Logs are imported into an Event Log object, i.e., as described earlier.

from pm4py.objects.log.importer.xes import importer as xes_importer
log = xes_importer.apply('<path_to_xes_file.xes>')

Event logs are stored as an extension of the Python list data structure. To access a trace in the log, it is enough to provide its index in the event log. Consider the example on the right on how to access the different objects stored in the imported log.

print(log[0]) #prints the first trace of the log
print(log[0][0]) #prints the first event of the first trace

The apply() method of the xes_importer, i.e. located in pm4py.objects.log.importer.xes.importer.py, contains two optional parameters: variant and parameters. The variant parameter indicates which variant of the importer to use. The parameters parameter is a Python dictionary, specifying specific parameters of choice.

from pm4py.objects.log.importer.xes import importer as xes_importer
variant = xes_importer.Variants.ITERPARSE
parameters = {variant.value.Parameters.TIMESTAMP_SORT: True}
log = xes_importer.apply('<path_to_xes_file>',
                         variant=variant, parameters=parameters)

This method invocation style is used throughout PM4Py in the various algorithms implemented, i.e., by wrapping around the different implementations, new variants of algorithms are easily called, using previously written PM4Py code. W.r.t. XES importers, two variants are provided. One implementation is based on the iterparse() function of xml.etree. The other variant is a line-by-line, custom parser (for improved performance). It does not follow the standard and is able to import traces, simple trace attributes, events, and simple event attributes. To specify a variant, we add the following argument to the call to the importer: variant=xes_importer.Variants.ITERPARSE (note that, in the example code, this is encapsulated in the local variable variant). The xes_importer.Variants.ITERPARSE value actually maps onto the underlying Python module implementing the iterparse-based importer. We are able to access that reference by accessing the value property, e.g., xes_importer.Variants.ITERPARSE.value. That module contains a parameter definition, i.e., Parameters, containing all possible parameters for the iterparse variant. As an example, parameter TIMESTAMP_SORT is one of those, accessed by xes_importer.Variants.ITERPARSE.value.Parameters.TIMESTAMP_SORT. The overview below lists all variants and corresponding parameters defined for importing IEEE XES files.

Variant: Iterparse (ITERPARSE)
  • TIMESTAMP_SORT (boolean, default: False): If True, the log is sorted by timestamp.
  • TIMESTAMP_KEY (string, default: 'time:timestamp'): If TIMESTAMP_SORT is True, this event-attribute key is used to read timestamps.
  • REVERSE_SORT (boolean, default: False): If True, the sorting is inverted.
  • INSERT_TRACE_INDICES (boolean, default: False): If True, trace indices are added as an event attribute for each event.
  • MAX_TRACES (integer, default: 1000000000): Maximum number of traces to import from the log.

Variant: Line-By-Line (LINE_BY_LINE)
  • TIMESTAMP_SORT, TIMESTAMP_KEY, REVERSE_SORT, INSERT_TRACE_INDICES, MAX_TRACES: same as for Iterparse.
  • MAX_BYTES (integer, default: 100000000000): Maximum number of bytes to read.

Importing CSV files

Apart from the IEEE XES standard, a lot of event logs are actually stored in CSV files. In general, there are two ways to deal with CSV files in PM4Py:

  • Import the CSV into a pandas DataFrame; In general, most existing algorithms in PM4Py are coded to be flexible in terms of their input, i.e., if a certain event log object is provided that is not in the right form, we translate it to the appropriate form for you. Hence, after importing a dataframe, most algorithms are directly able to work with the data frame.
  • Convert the CSV into an event log object (similar to the result of the IEEE XES importer presented in the previous section); In this case, the first step is to import the CSV file using pandas (similar to the previous bullet) and subsequently converting it to the event log object. In the remainder of this section, we briefly highlight how to convert a pandas DataFrame to an event log. Note that most algorithms use the same type of conversion, in case a given event data object is not of the right type.

To convert objects in PM4Py, there is a dedicated package, i.e., objects.conversion. The conversion package allows one to convert an object of a certain type to a new object of a different type (if such a conversion is applicable). Within the conversion package, a standard naming convention is applied, i.e., the type of the input object defines the package in which the code resides. Thus, since we assume that the imported DataFrame represents an event log, we find the appropriate conversion in the objects.conversion.log package.


The example code on the right shows how to convert a CSV file into the PM4Py internal event data object types. By default, the converter converts the dataframe to an Event Log object (i.e., not an Event Stream).

We suggest sorting the dataframe by its timestamp column. In the example on the right, the timestamp column is indicated by <timestamp_column>. This ensures that events are sorted by their timestamp.

import pandas as pd
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter

log_csv = pd.read_csv('<path_to_csv_file.csv>', sep=',')
log_csv = dataframe_utils.convert_timestamp_columns_in_df(log_csv)
log_csv = log_csv.sort_values('<timestamp_column>')
event_log = log_converter.apply(log_csv)

Note that the example code above does not directly work in many cases. There are a few reasons for this. First of all, a CSV file, by definition, is closer to an Event Stream, i.e., it represents a sequence of events. Since an event log 'glues' together events that belong to the same case, i.e., into a trace of events, we need to specify to the converter what attribute to use for this. The parameter we need to set for this in the converter is the CASE_ID_KEY parameter. Its default value is 'case:concept:name'. Hence, when our input event data, stored in a csv-file, has a column with the name case:concept:name, that column is used to define traces.


Therefore, let us consider a very simple example event log, and, assume it is stored as a csv-file:

case activity timestamp clientID
1 register request 20200422T0455 1337
2 register request 20200422T0457 1479
1 submit payment 20200422T0503 1337

In this small example table, we observe four columns, i.e., case, activity, timestamp and clientID. Clearly, when importing the data and converting it to an Event Log object, we aim to combine all rows (events) with the same value for the case column together. Hence, the default value of the CASE_ID_KEY parameter is not set to the right value. Another interesting phenomenon in the example data is the fourth column, i.e., clientID. In fact, the client ID is an attribute that will not change over the course of the execution of a process instance, i.e., it is a case-level attribute. PM4Py allows us to specify that a column actually describes a case-level attribute (under the assumption that the attribute does not change during the execution of a process). However, for this, we need to specify an additional parameter, i.e., the CASE_ATTRIBUTE_PREFIX parameter, with default value 'case:'.


The example code on the right shows how to convert the previously exemplified csv data file. After loading the csv file of the example table, we rename the clientID column to case:clientID (this is a specific operation provided by pandas!). Then, we specify that the column identifying the case identifier attribute is the column with name 'case'. Note that the full parameter path is log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY.

import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter

log_csv = pd.read_csv('<path_to_csv_file.csv>', sep=',')
log_csv.rename(columns={'clientID': 'case:clientID'}, inplace=True)
parameters = {log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case'}
event_log = log_converter.apply(log_csv, parameters=parameters, variant=log_converter.Variants.TO_EVENT_LOG)

In case we would like to use a different prefix for the case-level attributes, e.g., 'caseAttr', we can do so by mapping the CASE_ATTRIBUTE_PREFIX (full path: log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ATTRIBUTE_PREFIX) to the value 'caseAttr'. Note that in the call to the converter, in this case, we explicitly set the variant to be used, e.g., log_converter.Variants.TO_EVENT_LOG. Finally, note that any type of data format that can be parsed to a Pandas DataFrame is supported by PM4Py.
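
A minimal sketch of this variation is shown below; it reuses the csv file from the previous example and assumes that the columns holding case-level attributes start with the prefix 'caseAttr':

import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter

log_csv = pd.read_csv('<path_to_csv_file.csv>', sep=',')
# columns holding case-level attributes are assumed to start with 'caseAttr'
parameters = {log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case',
              log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ATTRIBUTE_PREFIX: 'caseAttr'}
event_log = log_converter.apply(log_csv, parameters=parameters, variant=log_converter.Variants.TO_EVENT_LOG)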

Converting Event Data

In this section, we describe how to convert event log objects from one object type to another object type. As mentioned in the previous section, the conversion functionality of event logs is located in pm4py.objects.conversion.log.converter. There are three objects, which we are able to 'switch' between, i.e., Event Log, Event Stream and Data Frame objects. Please refer to the previous code snippet for an example of applying log conversion (applied when importing a CSV object). Finally, note that most algorithms internally use the converters, in order to be able to handle an input event data object of any form. In such a case, the default parameters are used.
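
For instance, the snippet below (a sketch, assuming log is an Event Log object imported earlier) converts the log to an Event Stream and to a DataFrame, and then converts the stream back to an Event Log using a deep copy to avoid side effects. The parameters available for each conversion variant are listed below.

from pm4py.objects.conversion.log import converter as log_converter

event_stream = log_converter.apply(log, variant=log_converter.Variants.TO_EVENT_STREAM)
dataframe = log_converter.apply(log, variant=log_converter.Variants.TO_DATA_FRAME)

# convert back to an Event Log; DEEP_COPY avoids side effects on the stream
parameters = {log_converter.Variants.TO_EVENT_LOG.value.Parameters.DEEP_COPY: True}
event_log = log_converter.apply(event_stream, variant=log_converter.Variants.TO_EVENT_LOG, parameters=parameters)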

Variant: TO_EVENT_LOG
  • STREAM_POST_PROCESSING (boolean, default: False): Removes events that have no type information.
  • CASE_ATTRIBUTE_PREFIX (string, default: 'case:'): Any attribute (column in case of a DataFrame) with the prefix 'case:' is stored as a trace attribute.
  • CASE_ID_KEY (string, default: 'case:concept:name'): Attribute (column in case of a DataFrame) that is used to define traces.
  • DEEP_COPY (boolean, default: False): If set to True, objects are created using a deep copy (if applicable). Avoids side effects (specifically when converting an Event Stream to an Event Log).

Variant: TO_EVENT_STREAM
  • STREAM_POST_PROCESSING (boolean, default: False): Same as TO_EVENT_LOG.
  • CASE_ATTRIBUTE_PREFIX (string, default: 'case:'): Any trace attribute (in case of converting an Event Log to an Event Stream object) gets this prefix. Not applicable if we translate a DataFrame to an Event Stream object.
  • DEEP_COPY (boolean, default: False): Same as TO_EVENT_LOG.

Variant: TO_DATA_FRAME
  • CASE_ATTRIBUTE_PREFIX (string, default: 'case:'): Same as TO_EVENT_STREAM; only applied if the input is an Event Log object, which is first translated to an Event Stream object.
  • DEEP_COPY (boolean, default: False): Same as TO_EVENT_STREAM.

Exporting IEEE XES files

Exporting an Event Log object to an IEEE XES file is fairly straightforward in PM4Py. Consider the example code fragment on the right, which depicts this functionality.

from pm4py.objects.log.exporter.xes import exporter as xes_exporter
xes_exporter.apply(log, '<path_to_exported_log.xes>')

In the example, the log object is assumed to be an Event Log object. The exporter also accepts an Event Stream or DataFrame object as an input. However, the exporter will first convert the given input object into an Event Log. Hence, in this case, standard parameters for the conversion are used. Thus, if the user wants more control, it is advisable to apply the conversion to Event Log, prior to exporting.

Variant: ETree (ETREE)
  • COMPRESS (boolean, default: False): If True, the log is stored as a 'xes.gz' file.
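
As a sketch, the COMPRESS parameter could be passed following the same variant/parameters convention used by the importer; the exact parameter path below is an assumption:

from pm4py.objects.log.exporter.xes import exporter as xes_exporter

variant = xes_exporter.Variants.ETREE
# COMPRESS is assumed to be reachable via the variant's Parameters definition
parameters = {variant.value.Parameters.COMPRESS: True}
xes_exporter.apply(log, '<path_to_exported_log.xes.gz>', variant=variant, parameters=parameters)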

Exporting logs to CSV

To export an event log to a csv-file, PM4Py uses Pandas. Hence, an event log is first converted to a Pandas Data Frame, after which it is written to disk.

import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter
dataframe = log_converter.apply(log, variant=log_converter.Variants.TO_DATA_FRAME)
dataframe.to_csv('<path_to_csv_file.csv>')
                                

In case an event log object is provided that is not a dataframe, i.e., an Event Log or Event Stream, the conversion is applied using the default parameter values, i.e., as presented in the Converting Event Data section. Note that exporting event data to a csv file has no parameters. In case more control over the conversion is needed, please apply a conversion to a dataframe first, prior to exporting to csv.

I/O with Other File Types

At this moment, I/O of any format supported by Pandas (dataframes) is implicitly supported. As long as data can be loaded into a Pandas dataframe, PM4Py is reasonably able to work with such files.
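
As a sketch, a hypothetical Excel file is first read with pandas and then converted to an Event Log; any other format that pandas can parse works the same way:

import pandas as pd
from pm4py.objects.log.util import dataframe_utils
from pm4py.objects.conversion.log import converter as log_converter

# hypothetical Excel file; any format that pandas can read works analogously
log_df = pd.read_excel('<path_to_xlsx_file.xlsx>')
log_df = dataframe_utils.convert_timestamp_columns_in_df(log_df)
event_log = log_converter.apply(log_df, variant=log_converter.Variants.TO_EVENT_LOG)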

Generic Event Data Manipulation

Since Event Logs and Event Streams are iterables (note: this does not apply to dataframes), they can be used in combination with lambda functions. However, as they contain more information (such as log-level attributes), directly applying, e.g., a filter, does not work. Therefore, a utility package is available that wraps around filtering/maps/sorting in order to combine this functionality with Event Logs. The code is located in pm4py.objects.log.util.func.

Consider the code fragment on the right, which first imports an event log and then filters out each trace with a length shorter than three. The func.filter_ function mimics the built-in Python function filter(). However, it returns the filtered list of traces, included in an Event Log (or Event Stream) object.

from pm4py.objects.log.importer.xes import importer as xes_importer
from pm4py.objects.log.util import func

log = xes_importer.apply('<path_to_imported_log.xes>')
log = func.filter_(lambda t: len(t) > 2, log)

Apart from the filter_ function, the pm4py.objects.log.util.func package provides a map_ and a sort_ function.
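
As a sketch, assuming map_ mirrors the behavior of Python's built-in map() and returns an Event Log again, a trace-level attribute could be added to every trace as follows:

from pm4py.objects.log.util import func

def tag_trace(trace):
    # store the number of events as a (hypothetical) trace-level attribute
    trace.attributes["numEvents"] = len(trace)
    return trace

log = func.map_(tag_trace, log)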

Filtering Event Data

PM4Py also has various specific methods to filter an event log.

Filtering on timeframe

In the following paragraphs, various methods regarding filtering with time frames are presented. For each of the methods, both the log and the Pandas Dataframe variants are shown.

One might be interested in keeping only the traces that are contained in a specific interval, e.g., between 09 March 2011 and 18 January 2012. The first code snippet works for a log object, the second one for a dataframe object.

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_contained(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_contained = timestamp_filter.filter_traces_contained(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})

However, it is also possible to keep the traces that are intersecting with a time interval. The first example is again for log objects, the second one for dataframe objects.

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_intersecting(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_intersecting = timestamp_filter.filter_traces_intersecting(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})

Until now, only trace-based techniques have been discussed. However, there is also a method to keep the events that are contained in a specific timeframe. As previously mentioned, the first code snippet shows how to apply this technique on log objects, whereas the second snippet shows how to apply it on dataframe objects.

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log_events = timestamp_filter.apply_events(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
df_timest_events = timestamp_filter.apply_events(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={timestamp_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      timestamp_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})

Filter on case performance

This filter permits keeping only the traces whose duration is inside a specified interval. In the examples, traces lasting between 1 and 10 days are kept. Note that the time parameters are given in seconds. The first code snippet applies this technique on a log object, the second one on a dataframe object.

from pm4py.algo.filtering.log.cases import case_filter
filtered_log = case_filter.filter_case_performance(log, 86400, 864000)
from pm4py.algo.filtering.pandas.cases import case_filter
df_cases = case_filter.filter_case_performance(dataframe, min_case_performance=86400, max_case_performance=864000,
                                          parameters={case_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      case_filter.Parameters.TIMESTAMP_KEY: "time:timestamp"})

Filter on start activities

In general, PM4Py offers two methods to filter a log or a dataframe on start activities. In the first method, a list of start activities has to be specified; the filter keeps the traces whose start activity is contained in the list. In the second method, a decreasing factor is used, as explained below.

Suppose the following start activities and their respective numbers of occurrences.
Activity Number of occurrences
A 1000
B 700
C 300
D 50
Assume DECREASING_FACTOR to be 0.6. The most frequent start activity is kept, A in this case. Then, the number of occurrences of the next frequent activity is divided by the number of occurrences of this activity. Therefore, the computation is 700/1000=0.7. Since 0.7>0.6, B is kept as admissible start activity. In the next step, the number of occurrences of activity C and B are compared. In this case 300/700≈0.43. Since 0.43<0.6, C is not accepted as admissible start activity and the method stops here.
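
The selection rule described above can be summarized in a few lines of plain Python. The sketch below is purely illustrative (it is not the internal PM4Py implementation) and uses the occurrence counts from the table:

# illustrative sketch of the decreasing-factor selection, not PM4Py internals
occurrences = {"A": 1000, "B": 700, "C": 300, "D": 50}
decreasing_factor = 0.6

admissible = []
previous = None
for activity, count in sorted(occurrences.items(), key=lambda x: x[1], reverse=True):
    if previous is None or count / previous > decreasing_factor:
        admissible.append(activity)
        previous = count
    else:
        break

print(admissible)  # ['A', 'B'] with the counts above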

First of all, it might be necessary to know the start activities. Therefore, code snippets are provided. Subsequently, an example of filtering is provided. The first snippet works with a log object, the second one works on a dataframe. log_start is a dictionary that contains the activities as keys and the number of occurrences as values.

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(log)
filtered_log = start_activities_filter.apply(log, ["S1"]) #suppose "S1" is the start activity you want to filter on
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(dataframe)
df_start_activities = start_activities_filter.apply(dataframe, ["S1"],
                                          parameters={start_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      start_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"}) #suppose "S1" is the start activity you want to filter on
						

As mentioned earlier, there is also a method that aims to keep the frequent start activities. Again, the first snippet is about a log object, the second is about a dataframe object. The default value for DECREASING_FACTOR is 0.6.

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_af_sa = start_activities_filter.apply_auto_filter(log, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
df_auto_sa = start_activities_filter.apply_auto_filter(dataframe, parameters={start_activities_filter.Parameters.DECREASING_FACTOR: 0.6})

Filter on end activities

In general, PM4Py offers two methods to filter a log or a dataframe on end activities. In the first method, a list of end activities has to be specified; the filter keeps the traces whose end activity is contained in the list. In the second method, a decreasing factor is used (see the explanation in the start activities section).

This filter permits to keep only traces with an end activity among a set of specified activities. First of all, it might be necessary to know the end activities. Therefore, code snippets are provided. Subsequently, an example of filtering is provided. Here, for the dataframe filtering, a further attribute specification is possible: case:concept:name is in this case the column of the dataframe that is the Case ID, concept:name is the column of the dataframe that is the activity.

from pm4py.algo.filtering.log.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(log)
filtered_log = end_activities_filter.apply(log, ["pay compensation"])
from pm4py.algo.filtering.pandas.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(df)
filtered_df = end_activities_filter.apply(df, ["pay compensation"],
                                          parameters={end_activities_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      end_activities_filter.Parameters.ACTIVITY_KEY: "concept:name"})

Filter on variants

A variant is a set of cases that share the same control-flow perspective, i.e., a set of cases that share the same classified events (activities) in the same order. In this section, for all methods, we first focus on log objects and then continue with the dataframe.

To get the list of variants contained in a given log, the following code can be used. The first snippet is for a log object, the second for a dataframe. The result is expressed as a dictionary having the variant as key and the list of cases sharing the variant as value.

from pm4py.algo.filtering.log.variants import variants_filter
variants = variants_filter.get_variants(log)
from pm4py.statistics.traces.pandas import case_statistics
variants = case_statistics.get_variants_df(df,
                                          parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      case_statistics.Parameters.ACTIVITY_KEY: "concept:name"})

If the number of occurrences of the variants is of interest, the following code retrieves a list of variants along with their count (i.e., a dictionary in which the key is the variant and the value is the number of occurrences).

from pm4py.statistics.traces.log import case_statistics
variants_count = case_statistics.get_variant_statistics(log)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
from pm4py.statistics.traces.pandas import case_statistics
variants_count = case_statistics.get_variant_statistics(df,
                                          parameters={case_statistics.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      case_statistics.Parameters.ACTIVITY_KEY: "concept:name",
                                                      case_statistics.Parameters.TIMESTAMP_KEY: "time:timestamp"})
variants_count = sorted(variants_count, key=lambda x: x['case:concept:name'], reverse=True)

To filter based on variants, assume that variants is a list, whereby each element is a variant (expressed in an equal way as in the variants retrieval method). The first method can be applied on log objects, the second can be applied on dataframe objects. Note that the variants given in variants are kept.

from pm4py.algo.filtering.log.variants import variants_filter
filtered_log1 = variants_filter.apply(log, variants)
from pm4py.algo.filtering.pandas.variants import variants_filter
filtered_df1 = variants_filter.apply(df, variants,
                                          parameters={variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

Contrary to the previous example, suppose you want to filter the given variants out. Again, let variants be a list, whereby each element is a variant.

filtered_log2 = variants_filter.apply(log, variants, parameters={variants_filter.Parameters.POSITIVE: False})
filtered_df2 = variants_filter.apply(df, variants,
                                          parameters={variants_filter.Parameters.POSITIVE: False, variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

A filter to keep automatically the most common variants could be applied through the apply_auto_filter method. This method accepts a parameter called DECREASING_FACTOR (default value is 0.6; further details are provided in the start activities filter).

auto_filtered_log = variants_filter.apply_auto_filter(log)
auto_filtered_df = variants_filter.apply_auto_filter(df,
                                          parameters={variants_filter.Parameters.POSITIVE: False, variants_filter.Parameters.CASE_ID_KEY: "case:concept:name",
                                                      variants_filter.Parameters.ACTIVITY_KEY: "concept:name"})

Filter on attributes values

Filtering on attribute values permits, alternatively, to:

  • Keep cases that contain at least one event with one of the given attribute values
  • Remove cases that contain an event with one of the given attribute values
  • Keep events (trimming traces) that have one of the given attribute values
  • Remove events (trimming traces) that have one of the given attribute values

Examples of attributes are the resource (generally contained in the org:resource attribute) and the activity (generally contained in the concept:name attribute). As noted before, the first method can be applied on log objects, the second on dataframe objects.

To get the list of resources and activities contained in the log, the following code could be used.

from pm4py.algo.filtering.log.attributes import attributes_filter
activities = attributes_filter.get_attribute_values(log, "concept:name")
resources = attributes_filter.get_attribute_values(log, "org:resource")
from pm4py.algo.filtering.pandas.attributes import attributes_filter
activities = attributes_filter.get_attribute_values(df, attribute_key="concept:name")
resources = attributes_filter.get_attribute_values(df, attribute_key="org:resource")

To filter traces containing/not containing a given list of resources, the following code could be used.


tracefilter_log_pos = attributes_filter.apply(log, ["Resource10"],
                                          parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: True})
tracefilter_log_neg = attributes_filter.apply(log, ["Resource10"],
                                          parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: False})

df_traces_pos = attributes_filter.apply(df, ["Resource10"],
                                          parameters={attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: True})
df_traces_neg = attributes_filter.apply(df, ["Resource10"],
                                          parameters={attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", attributes_filter.Parameters.ATTRIBUTE_KEY: "org:resource", attributes_filter.Parameters.POSITIVE: False})

To automatically apply a filter on event attributes (trimming traces and keeping only the events whose attribute has a frequent value), the apply_auto_filter method is provided. The method accepts as parameters the attribute name and the DECREASING_FACTOR (default 0.6; an explanation can be found in the start activities filter section).

from pm4py.algo.filtering.log.attributes import attributes_filter
filtered_log = attributes_filter.apply_auto_filter(log, parameters={
    attributes_filter.Parameters.ATTRIBUTE_KEY: "concept:name", attributes_filter.Parameters.DECREASING_FACTOR: 0.6})
from pm4py.algo.filtering.pandas.attributes import attributes_filter
filtered_df = attributes_filter.apply_auto_filter(df, parameters={
    attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", attributes_filter.Parameters.ATTRIBUTE_KEY: "concept:name", attributes_filter.Parameters.DECREASING_FACTOR: 0.6})

Filter on numeric attribute values

Filtering on numeric attribute values provides options that are similar to filtering on string attribute values (which we considered earlier).

First, we import the log. Subsequently, we want to keep only the events with an amount between 34 and 36. An additional filter aims to keep only the cases with at least one event satisfying the specified amount. The filter on cases provides the option to specify up to two attributes that are checked on the events that shall satisfy the numeric range. For example, if we are interested in cases having an event with activity Add penalty and an amount between 34 and 500, a code snippet is also provided.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "roadtraffic100traces.xes"))

from pm4py.algo.filtering.log.attributes import attributes_filter
filtered_log_events = attributes_filter.apply_numeric_events(log, 34, 36,
                                             parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "amount"})

filtered_log_cases = attributes_filter.apply_numeric(log, 34, 36,
                                             parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "amount"})

filtered_log_cases = attributes_filter.apply_numeric(log, 34, 500,
                                             parameters={attributes_filter.Parameters.ATTRIBUTE_KEY: "amount",
                                                         attributes_filter.Parameters.STREAM_FILTER_KEY1: "concept:name",
                                                         attributes_filter.Parameters.STREAM_FILTER_VALUE1: "Add penalty"})

The same methods can also be applied on dataframes.

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter
df = csv_import_adapter.import_dataframe_from_path(os.path.join("tests", "input_data", "roadtraffic100traces.csv"))

from pm4py.algo.filtering.pandas.attributes import attributes_filter
filtered_df_events = attributes_filter.apply_numeric_events(df, 34, 36,
                                             parameters={attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", attributes_filter.Parameters.ATTRIBUTE_KEY: "amount"})

filtered_df_cases = attributes_filter.apply_numeric(df, 34, 36,
                                             parameters={attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", attributes_filter.Parameters.ATTRIBUTE_KEY: "amount"})

filtered_df_cases = attributes_filter.apply_numeric(df, 34, 500,
                                             parameters={attributes_filter.Parameters.CASE_ID_KEY: "case:concept:name", attributes_filter.Parameters.ATTRIBUTE_KEY: "amount",
                                                         attributes_filter.Parameters.STREAM_FILTER_KEY1: "concept:name",
                                                         attributes_filter.Parameters.STREAM_FILTER_VALUE1: "Add penalty"})

Process Discovery

Process Discovery algorithms aim to find a suitable process model that describes the order of events/activities that are executed during a process execution.

In the following, we provide an overview that visualizes the advantages and disadvantages of the mining algorithms.

Alpha
  • Cannot handle loops of length one and length two
  • Invisible and duplicated tasks cannot be discovered
  • Discovered model might not be sound
  • Weak against noise

Alpha+
  • Can handle loops of length one and length two
  • Invisible and duplicated tasks cannot be discovered
  • Discovered model might not be sound
  • Weak against noise

Heuristic
  • Takes frequency into account
  • Detects short loops
  • Does not guarantee a sound model

Inductive
  • Can handle invisible tasks
  • Model is sound
  • Most used process mining algorithm

Alpha Miner

The alpha miner is one of the best-known Process Discovery algorithms and is able to find:

  • A Petri net model where all the transitions are visible and unique and correspond to classified events (for example, to activities).
  • An initial marking that describes the status of the Petri net model when an execution starts.
  • A final marking that describes the status of the Petri net model when an execution ends.

We provide an example where a log is read, the Alpha algorithm is applied and the Petri net along with the initial and the final marking are found. The log we take as input is the running-example.xes.

First, the log has to be imported.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer
log = xes_importer.apply(os.path.join("tests","input_data","running-example.xes"))

Subsequently, the Alpha Miner is applied.

from pm4py.algo.discovery.alpha import algorithm as alpha_miner
net, initial_marking, final_marking = alpha_miner.apply(log)

IMDFc

IMDFc is a specific implementation of the Inductive Miner Directly Follows algorithm (IMDF; for further details see this link) that aims to construct a sound workflow net with good values of fitness (in most cases, assuring perfect replay fitness). The basic idea of the Inductive Miner is to detect a 'cut' in the log (e.g. sequential cut, parallel cut, concurrent cut and loop cut) and then recur on the sublogs obtained by applying the cut, until a base case is found. The Directly-Follows variant avoids the recursion on the sublogs but uses the Directly-Follows graph.

IMDFc models usually make extensive use of hidden transitions, especially for skipping/looping on a portion of the model. Furthermore, each visible transition has a unique label (there are no transitions in the model that share the same label).

Two process models can be derived: Petri Net and Process Tree.

To mine a Petri Net, we provide an example. A log is read, IMDFc is applied and the Petri net along with the initial and the final marking are found. The log we take as input is the running-example.xes. First, the log is read, then the IMDFc algorithm is applied.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer
from pm4py.algo.discovery.inductive import algorithm as inductive_miner

log = xes_importer.apply(os.path.join("tests","input_data","running-example.xes"))
net, initial_marking, final_marking = inductive_miner.apply(log)

To obtain a process tree, the provided code snippet can be used. The last two lines of code are responsible for the visualization of the process tree.

from pm4py.algo.discovery.inductive import algorithm as inductive_miner
from pm4py.visualization.process_tree import visualizer as pt_visualizer

tree = inductive_miner.apply_tree(log)

gviz = pt_visualizer.apply(tree)
pt_visualizer.view(gviz)

It is also possible to convert a process tree into a Petri net.

from pm4py.objects.conversion.process_tree import converter as pt_converter
net, initial_marking, final_marking = pt_converter.apply(tree, variant=pt_converter.Variants.TO_PETRI_NET)

Heuristic Miner

Heuristics Miner is an algorithm that acts on the Directly-Follows Graph, providing ways to handle noise and to find common constructs (dependency between two activities, AND). The output of the Heuristics Miner is a Heuristics Net, i.e., an object that contains the activities and the relationships between them. The Heuristics Net can then be converted into a Petri net. The paper describing the approach is available at this link.

It is possible to obtain a Heuristic Net and a Petri Net.

To apply the Heuristics Miner to discover a Heuristics Net, it is necessary to import a log. Then, a Heuristics Net can be discovered. The possible parameters are listed below.

from pm4py.objects.log.importer.xes import importer as xes_importer
import os
log_path = os.path.join("tests", "compressed_input_data", "09_a32f0n00.xes.gz")
log = xes_importer.apply(log_path)

from pm4py.algo.discovery.heuristics import algorithm as heuristics_miner
heu_net = heuristics_miner.apply_heu(log, parameters={heuristics_miner.Variants.CLASSIC.value.Parameters.DEPENDENCY_THRESH: 0.99})

  • DEPENDENCY_THRESH: dependency threshold of the Heuristics Miner (default: 0.5)
  • AND_MEASURE_THRESH: AND measure threshold of the Heuristics Miner (default: 0.65)
  • MIN_ACT_COUNT: minimum number of occurrences of an activity to be considered (default: 1)
  • MIN_DFG_OCCURRENCES: minimum number of occurrences of an edge to be considered (default: 1)
  • DFG_PRE_CLEANING_NOISE_THRESH: cleaning threshold of the DFG (in order to remove weaker edges, default: 0.05)
  • LOOP_LENGTH_TWO_THRESH: threshold for loops of length two

To visualize the Heuristic Net, code is also provided on the right-hand side.

from pm4py.visualization.heuristics_net import visualizer as hn_visualizer
gviz = hn_visualizer.apply(heu_net)
hn_visualizer.view(gviz)

To obtain a Petri Net that is based on the Heuristics Miner, the code on the right-hand side can be used. This Petri Net can also be visualized.

from pm4py.algo.discovery.heuristics import algorithm as heuristics_miner
net, im, fm = heuristics_miner.apply(log, parameters={heuristics_miner.Variants.CLASSIC.value.Parameters.DEPENDENCY_THRESH: 0.99})

from pm4py.visualization.petrinet import visualizer as pn_visualizer
gviz = pn_visualizer.apply(net, im, fm)
pn_visualizer.view(gviz)

Directly-Follows Graph

Process models modeled using Petri nets have a well-defined semantic: a process execution starts from the places included in the initial marking and finishes at the places included in the final marking. In this section, another class of process models, Directly-Follows Graphs, is introduced. Directly-Follows graphs are graphs where the nodes represent the events/activities in the log and directed edges are present between nodes if there is at least one trace in the log where the source event/activity is followed by the target event/activity. On top of these directed edges, it is easy to represent metrics like frequency (counting the number of times the source event/activity is followed by the target event/activity) and performance (some aggregation, for example the mean, of the time elapsed between the two events/activities).

First, we have to import the log. Subsequently, we can extract the Directly-Follows Graph. In addition, code is provided to visualize the Directly-Follows Graph. This visualization is a colored visualization of the Directly-Follows graph that is decorated with the frequency of activities.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer
log = xes_importer.apply(os.path.join("tests","input_data","running-example.xes"))

from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
dfg = dfg_discovery.apply(log)

from pm4py.visualization.dfg import visualizer as dfg_visualization
gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.FREQUENCY)
dfg_visualization.view(gviz)

To get a Directly-Follows graph decorated with the performance between the edges, two parameters of the previous code have to be replaced.

from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.visualization.dfg import visualizer as dfg_visualization

dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE)
gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.PERFORMANCE)
dfg_visualization.view(gviz)

To save the obtained DFG, for instance in the SVG format, code is also provided on the right-hand side.

from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
from pm4py.visualization.dfg import visualizer as dfg_visualization

dfg = dfg_discovery.apply(log, variant=dfg_discovery.Variants.PERFORMANCE)
parameters = {dfg_visualization.Variants.PERFORMANCE.value.Parameters.FORMAT: "svg"}
gviz = dfg_visualization.apply(dfg, log=log, variant=dfg_visualization.Variants.PERFORMANCE, parameters=parameters)
dfg_visualization.save(gviz, "dfg.svg")

Convert Directly-Follows Graph to a Workflow Net

The Directly-Follows Graph is the representation of a process provided by many commercial tools. An idea of Sander Leemans is to convert the DFG into a workflow net that perfectly mimics the DFG, in order to be able to perform alignments between the behavior described in the model and the behavior described in the log. This is called DFG mining. The following steps are useful to load the log, calculate the DFG, convert it into a workflow net, and perform alignments.

First, we have to import the log. Subsequently, we have to mine the Directly-Follows graph. This DFG can then be converted to a workflow net.

from pm4py.objects.log.importer.xes import importer as xes_importer
import os
log = xes_importer.apply(os.path.join("tests", "input_data", "running-example.xes"))

from pm4py.algo.discovery.dfg import algorithm as dfg_discovery
dfg = dfg_discovery.apply(log)

from pm4py.objects.conversion.dfg import converter as dfg_mining
net, im, fm = dfg_mining.apply(dfg)

Adding information about Frequency/Performance

Similar to the Directly-Follows graph, it is also possible to decorate the Petri net with frequency or performance information. This is done by using a replay technique on the model and then assigning frequency/performance to the paths. The variant parameter of the visualizer specifies which annotation should be used. The values for the variant parameter are the following:

  • pn_visualizer.Variants.WO_DECORATION: This is the default value and indicates that the Petri net is not decorated.
  • pn_visualizer.Variants.FREQUENCY: This indicates that the model should be decorated according to frequency information obtained by applying replay.
  • pn_visualizer.Variants.PERFORMANCE: This indicates that the model should be decorated according to performance (aggregated by mean) information obtained by applying replay.

In case the frequency or performance decoration is chosen, it is required to pass the log as a parameter of the visualization (it needs to be replayed).

The code on the right-hand side can be used to obtain the Petri net mined by the Inductive Miner decorated with frequency information.

from pm4py.visualization.petrinet import visualizer as pn_visualizer
parameters = {pn_visualizer.Variants.FREQUENCY.value.Parameters.FORMAT: "png"}
gviz = pn_visualizer.apply(net, initial_marking, final_marking, parameters=parameters, variant=pn_visualizer.Variants.FREQUENCY, log=log)
pn_visualizer.save(gviz, "inductive_frequency.png")
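
Analogously, a performance decoration can be obtained by switching the variant. The sketch below reuses the net, markings, and log from the previous snippets:

from pm4py.visualization.petrinet import visualizer as pn_visualizer
parameters = {pn_visualizer.Variants.PERFORMANCE.value.Parameters.FORMAT: "png"}
gviz = pn_visualizer.apply(net, initial_marking, final_marking, parameters=parameters, variant=pn_visualizer.Variants.PERFORMANCE, log=log)
pn_visualizer.save(gviz, "inductive_performance.png")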

Classifier

Algorithms implemented in PM4Py classify events based on their activity name, which is usually reported inside the concept:name event attribute. In some contexts, it is useful to use another event attribute as the activity:

  • Importing an event log from a CSV does not guarantee the presence of a concept:name event attribute
  • Multiple events in a case may refer to different lifecycles of the same activity

The example on the right-hand side shows the specification of an activity key for the Alpha Miner algorithm.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer
from pm4py.algo.discovery.alpha import algorithm as alpha_miner
log = xes_importer.apply(os.path.join("tests","input_data","running-example.xes"))
parameters = {alpha_miner.Variants.ALPHA_CLASSIC.value.Parameters.ACTIVITY_KEY: "concept:name"}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

For logs imported from XES format, a list of fields that could be used in order to classify events and apply Process Mining algorithms is usually reported in the classifiers section. The Standard classifier usually includes the activity name (the concept:name attribute) and the lifecycle (the lifecycle:transition attribute); the Event name classifier includes only the activity name.

In PM4Py, it is assumed that algorithms work on a single activity key. In order to use multiple fields, a new attribute should be inserted for each event as the concatenation of the two.

In the following, retrieval and insertion of a corresponding attribute regarding classifiers are discussed.

The example on the right-hand side demonstrates the retrieval of the classifiers inside a log file, using the receipt.xes log. The print command returns a dictionary, whereby the corresponding classifier attribute is revealed.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer

log = xes_importer.apply(os.path.join("tests","input_data","receipt.xes"))
print(log.classifiers)

To use the classifier named Activity classifier and write a new attribute for each event in the log, the following code can be used.

from pm4py.objects.log.util import insert_classifier
log, activity_key = insert_classifier.insert_activity_classifier_attribute(log, "Activity classifier")

Then, as before, the Alpha Miner can be applied on the log specifying the newly inserted activity key.

from pm4py.algo.discovery.alpha import algorithm as alpha_miner
parameters = {alpha_miner.Variants.ALPHA_CLASSIC.value.Parameters.ACTIVITY_KEY: activity_key}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

In the following, a technique is shown to insert a new attribute manually.

In case the XES specifies no classifiers and a different field should be used as the activity key, there is the option to specify it manually. For example, in this piece of code we read the receipt.xes log and create a new attribute called customClassifier that is the concatenation of the activity name and the transition. Subsequently, the Alpha Miner can be applied using this new classifier.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer

log = xes_importer.apply(os.path.join("tests","input_data","receipt.xes"))
for trace in log:
    for event in trace:
        event["customClassifier"] = event["concept:name"] + event["lifecycle:transition"]

from pm4py.algo.discovery.alpha import algorithm as alpha_miner
parameters = {alpha_miner.Variants.ALPHA_CLASSIC.value.Parameters.ACTIVITY_KEY: "customClassifier"}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

Petri Net management

Petri nets are one of the most common formalisms to express a process model. A Petri net is a directed bipartite graph, in which the nodes represent transitions and places. Arcs connect places to transitions and transitions to places, and have an associated weight. A transition can fire if each of its input places contains a number of tokens that is at least equal to the weight of the arc connecting the place to the transition. When a transition is fired, tokens are removed from the input places according to the weight of the input arcs, and are added to the output places according to the weight of the output arcs.

A marking is a state of the Petri net that associates each place with a number of tokens; it uniquely determines the set of enabled transitions that could be fired according to the marking.

Process Discovery algorithms implemented in PM4Py return a Petri net along with an initial marking and a final marking. An initial marking is the initial state of execution of a process; a final marking is a state that should be reached at the end of the execution of the process.

Importing and exporting

Petri nets, along with their initial and final marking, can be imported/exported from the PNML file format. The code on the right-hand side can be used to import a Petri net along with the initial and final marking.

First, we import the Petri net from a PNML file. Subsequently, the Petri net is visualized by using the Petri Net visualizer. In addition, the Petri net is exported with its initial marking, or with its initial and final marking.

import os
from pm4py.objects.petri.importer import importer as pnml_importer
net, initial_marking, final_marking = pnml_importer.apply(os.path.join("tests","input_data","running-example.pnml"))

from pm4py.visualization.petrinet import visualizer as pn_visualizer
gviz = pn_visualizer.apply(net, initial_marking, final_marking)
pn_visualizer.view(gviz)

from pm4py.objects.petri.exporter import exporter as pnml_exporter
pnml_exporter.apply(net, initial_marking, "petri.pnml")

pnml_exporter.apply(net, initial_marking, "petri_final.pnml", final_marking=final_marking)

Petri Net properties

This section explains how to get the properties of a Petri net. A property of the net is, for example, the set of transitions enabled in a particular marking. However, also the lists of places, transitions, and arcs can be inspected.

The list of transitions enabled in a particular marking can be obtained using the right-hand code.

from pm4py.objects.petri import semantics
transitions = semantics.enabled_transitions(net, initial_marking)

The function print(transitions) reports that only the transition register request is enabled in the initial marking of the given Petri net. To obtain all places, transitions, and arcs of the Petri net, the code on the right-hand side can be used.

places = net.places
transitions = net.transitions
arcs = net.arcs

Each place has a name and a set of input/output arcs (connected at source/target to a transition). Each transition has a name, a label, and a set of input/output arcs (connected at source/target to a place). The code on the right-hand side prints, for each place, the name, and for each input arc of the place, the name and the label of the corresponding transition. Analogously, trans.name, trans.label, and arc.target.name are available.

for place in places:
    print("\nPLACE: " + place.name)
    for arc in place.in_arcs:
        print(arc.source.name, arc.source.label)
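
The marking can also be advanced by firing one of the enabled transitions. The sketch below reuses the semantics module from above and assumes that semantics.execute returns the marking reached after firing the given transition:

from pm4py.objects.petri import semantics

marking = initial_marking
enabled = semantics.enabled_transitions(net, marking)
if enabled:
    # fire an arbitrary enabled transition and print the marking that is reached
    transition = list(enabled)[0]
    marking = semantics.execute(transition, net, marking)
    print(marking)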

Creating a new Petri Net

In this section, an overview of the code necessary to create a new Petri net with places, transitions, and arcs is provided. A Petri net object in pm4py should be created with a name.

The code on the right-hand side creates a Petri Net with the name new_petri_net.

# creating an empty Petri net
from pm4py.objects.petri.petrinet import PetriNet, Marking
net = PetriNet("new_petri_net")

In addition, three places are created, namely source, sink, and p_1. These places are added to the previously created Petri Net.

# creating source, p_1 and sink place
source = PetriNet.Place("source")
sink = PetriNet.Place("sink")
p_1 = PetriNet.Place("p_1")
# add the places to the Petri Net
net.places.add(source)
net.places.add(sink)
net.places.add(p_1)

Similar to the places, transitions can be created. However, they need to be assigned a name and a label.

# Create transitions
t_1 = PetriNet.Transition("name_1", "label_1")
t_2 = PetriNet.Transition("name_2", "label_2")
# Add the transitions to the Petri Net
net.transitions.add(t_1)
net.transitions.add(t_2)

Arcs that connect places with transitions or transitions with places might be necessary. To add arcs, code is provided. The first parameter specifies the starting point of the arc, the second parameter its target and the last parameter states the Petri net it belongs to.

# Add arcs
from pm4py.objects.petri import utils
utils.add_arc_from_to(source, t_1, net)
utils.add_arc_from_to(t_1, p_1, net)
utils.add_arc_from_to(p_1, t_2, net)
utils.add_arc_from_to(t_2, sink, net)

To complete the Petri net, an initial and possibly a final marking need to be defined. To accomplish this, we define the initial marking to contain 1 token in the source place and the final marking to contain 1 token in the sink place.

# Adding tokens
initial_marking = Marking()
initial_marking[source] = 1
final_marking = Marking()
final_marking[sink] = 1

The resulting Petri net along with the initial and final marking can be exported, or visualized.

from pm4py.objects.petri.exporter import exporter as pnml_exporter
pnml_exporter.apply(net, initial_marking, "createdPetriNet1.pnml", final_marking=final_marking)

from pm4py.visualization.petrinet import visualizer as pn_visualizer
gviz = pn_visualizer.apply(net, initial_marking, final_marking)
pn_visualizer.view(gviz)

To obtain a specific output format (e.g. svg or png) a format parameter should be provided to the algorithm. The code snippet explains how to obtain an SVG representation of the Petri net. The last lines provide an option to save the visualization of the model.

from pm4py.visualization.petrinet import visualizer as pn_visualizer
parameters = {pn_visualizer.Variants.WO_DECORATION.value.Parameters.FORMAT:"svg"}
gviz = pn_visualizer.apply(net, initial_marking, final_marking, parameters=parameters)
pn_visualizer.view(gviz)

from pm4py.visualization.petrinet import visualizer as pn_visualizer
parameters = {pn_visualizer.Variants.WO_DECORATION.value.Parameters.FORMAT: "svg"}
gviz = pn_visualizer.apply(net, initial_marking, final_marking, parameters=parameters)
pn_visualizer.save(gviz, "alpha.svg")

Conformance Checking

Conformance checking is a technique to compare a process model with an event log of the same process. The goal is to check if the event log conforms to the model, and vice versa.

In PM4Py, two fundamental techniques are implemented: token-based replay and alignments.

Token-based replay

Token-based replay matches a trace against a Petri net model, starting from the initial marking, in order to discover which transitions are executed and in which places there are remaining or missing tokens for the given process instance. Token-based replay is useful for Conformance Checking: indeed, a trace is fitting according to the model if, during its execution, the transitions can be fired without the need to insert any missing token. If reaching the final marking is imposed, then a trace is fitting if it reaches the final marking without any missing or remaining tokens.

For each trace, four values have to be determined: produced tokens (p), remaining tokens (r), missing tokens (m), and consumed tokens (c). Based on these, a fitness formula can be derived, whereby a Petri net (n) and a trace (t) are given as input:

fitness(n, t) = 1/2 * (1 - r/p) + 1/2 * (1 - m/c)

To apply the formula to the whole event log, p, r, m, and c are calculated for each trace, summed up over all traces, and the aggregated values are inserted into the formula above.
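As a minimal worked example of this aggregation (with purely hypothetical token counts), the log-level fitness can be computed as follows:

# hypothetical per-trace token counts: (produced, remaining, missing, consumed)
trace_counts = [
    (10, 0, 0, 10),  # a perfectly fitting trace
    (12, 2, 1, 11),  # a trace with deviations
]
# sum the values over all traces
p = sum(t[0] for t in trace_counts)
r = sum(t[1] for t in trace_counts)
m = sum(t[2] for t in trace_counts)
c = sum(t[3] for t in trace_counts)
# apply the fitness formula to the aggregated values
log_fitness = 0.5 * (1 - r / p) + 0.5 * (1 - m / c)
print(log_fitness)  # roughly 0.93 for these values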

In PM4Py there is an implementation of a token replayer that is able to go across hidden transitions (calculating shortest paths between places) and can be used with any Petri net model with unique visible transitions and hidden transitions. When a visible transition needs to be fired and not all places in the preset are provided with the correct number of tokens, starting from the current marking it is checked if for some place there is a sequence of hidden transitions that could be fired in order to enable the visible transition. The hidden transitions are then fired and a marking that permits to enable the visible transition is reached.

The example on the right shows how to apply token-based replay on a log and a Petri net. First, the log is loaded. Then, the Alpha Miner is applied in order to discover a Petri net. Finally, the token-based replay is applied. The output of the token-based replay, stored in the variable replayed_traces, contains, for each trace of the log:
  • trace_is_fit: boolean value (True/False) that is true when the trace conforms to the model.
  • activated_transitions: list of transitions activated in the model by the token-based replay.
  • reached_marking: marking reached at the end of the replay.
  • missing_tokens: number of missing tokens.
  • consumed_tokens: number of consumed tokens.
  • remaining_tokens: number of remaining tokens.
  • produced_tokens: number of produced tokens.
import os
from pm4py.objects.log.importer.xes import importer as xes_importer
from pm4py.algo.discovery.alpha import algorithm as alpha_miner

log = xes_importer.apply(os.path.join("tests", "input_data", "running-example.xes"))

net, initial_marking, final_marking = alpha_miner.apply(log)

from pm4py.algo.conformance.tokenreplay import algorithm as token_replay
replayed_traces = token_replay.apply(log, net, initial_marking, final_marking)
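
To get a quick overview of the result, the following sketch (assuming that the values listed above are exposed as dictionary keys of each element of replayed_traces) counts the fitting traces and inspects the diagnostics of the first trace:

# count how many traces are fitting according to the token-based replay
fitting = [t for t in replayed_traces if t["trace_is_fit"]]
print(len(fitting), "out of", len(replayed_traces), "traces are fitting")
# inspect the token diagnostics of the first trace
first = replayed_traces[0]
print(first["missing_tokens"], first["consumed_tokens"],
      first["remaining_tokens"], first["produced_tokens"])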

Alignments

PM4Py comes with the following set of linear solvers: PuLP (available for any platform), CVXOPT (available for the most widely used platforms including Windows/Linux for Python 3.6/3.7). Alternatively, ORTools can also be used and installed from PIP.

Alignment-based replay aims to find one of the best alignments between the trace and the model. For each trace, the output of an alignment is a list of couples where the first element is an event (of the trace) or » and the second element is a transition (of the model) or ». For each couple, the following classification can be provided:

  • Sync move: the classification of the event corresponds to the transition label; in this case, both the trace and the model advance in the same way during the replay.
  • Move on log: for couples where the second element is », it corresponds to a replay move in the trace that is not mimicked in the model. This kind of move is unfit and signals a deviation between the trace and the model.
  • Move on model: for couples where the first element is », it corresponds to a replay move in the model that is not mimicked in the trace. For moves on model, we can have the following distinction:
    • Moves on model involving hidden transitions: in this case, even if it is not a sync move, the move is fit.
    • Moves on model not involving hidden transitions: in this case, the move is unfit and signals a deviation between the trace and the model.

First, we import the log. Subsequently, we apply the Inductive Miner on the imported log. Finally, we compute the alignments.

import os
from pm4py.objects.log.importer.xes import importer as xes_importer
from pm4py.algo.discovery.inductive import algorithm as inductive_miner

log = xes_importer.apply(os.path.join("tests", "input_data", "running-example.xes"))

net, initial_marking, final_marking = inductive_miner.apply(log)

from pm4py.algo.conformance.alignments import algorithm as alignments
aligned_traces = alignments.apply_log(log, net, initial_marking, final_marking)

To inspect the alignments, a code snippet is provided. The output (a list) reports, for each trace, the corresponding alignment along with its statistics. With each trace, a dictionary containing, among others, the following information is associated:

  • alignment: contains the alignment (sync moves, moves on log, moves on model)
  • cost: contains the cost of the alignment according to the provided cost function
  • fitness: is equal to 1 if the trace is perfectly fitting
print(aligned_traces)
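
To look at the individual moves, the following sketch (assuming that the skip symbol used in the couples is the string ">>") classifies each move of the first trace according to the categories above:

# classify the moves of the alignment of the first trace
for log_move, model_move in aligned_traces[0]["alignment"]:
    if model_move == ">>":
        kind = "move on log"
    elif log_move == ">>":
        kind = "move on model"
    else:
        kind = "sync move"
    print(log_move, model_move, kind)
# cost and fitness of the first trace
print(aligned_traces[0]["cost"], aligned_traces[0]["fitness"])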

To use a different classifier, we refer to the Classifier section. Here, the following code defines a custom classifier for each event of each trace in the log, by concatenating the activity name with itself.

for trace in log:
    for event in trace:
        event["customClassifier"] = event["concept:name"] + event["concept:name"]

Then, a parameters dictionary, in which the custom classifier is set as activity key for both the discovery algorithm and the alignments, can be formed.


# define the activity key in the parameters
parameters = {
    inductive_miner.Variants.DFG_BASED.value.Parameters.ACTIVITY_KEY: "customClassifier",
    alignments.Variants.VERSION_STATE_EQUATION_A_STAR.value.Parameters.ACTIVITY_KEY: "customClassifier"
}

Then, a process model is computed and the alignments are calculated. In addition, the fitness value is calculated and printed.

# calculate process model using the given classifier
net, initial_marking, final_marking = inductive_miner.apply(log, parameters=parameters)
aligned_traces = alignments.apply_log(log, net, initial_marking, final_marking, parameters=parameters)

from pm4py.evaluation.replay_fitness import evaluator as replay_fitness
log_fitness = replay_fitness.evaluate(aligned_traces, variant=replay_fitness.Variants.ALIGNMENT_BASED)

print(log_fitness) 

It is also possible to select other parameters for the alignments.

  • Model cost function: associating to each transition in the Petri net the corresponding cost of a move-on-model.
  • Sync cost function: associating to each visible transition in the Petri net the cost of a sync move.

On the right-hand side, an implementation of a custom model cost function and sync cost function is shown. The model cost function and the sync cost function then have to be inserted into the parameters. Subsequently, the replay is performed.

model_cost_function = dict()
sync_cost_function = dict()
for t in net.transitions:
    # if the label is not None, we have a visible transition
    if t.label is not None:
        # associate cost 1000 to each move-on-model involving a visible transition
        model_cost_function[t] = 1000
        # associate cost 0 to each sync move
        sync_cost_function[t] = 0
    else:
        # associate cost 1 to each move-on-model involving a hidden transition
        model_cost_function[t] = 1

parameters = {}
parameters[alignments.Variants.VERSION_STATE_EQUATION_A_STAR.value.Parameters.PARAM_MODEL_COST_FUNCTION] = model_cost_function
parameters[alignments.Variants.VERSION_STATE_EQUATION_A_STAR.value.Parameters.PARAM_SYNC_COST_FUNCTION] = sync_cost_function

aligned_traces = alignments.apply_log(log, net, initial_marking, final_marking, parameters=parameters)

Process Tree Generation

In PM4Py, we offer support for process trees (visualization, conversion to a Petri net, and generation of a log), as well as functionality to generate them. In this section, these functionalities are examined.

Generation of process trees

The approach 'PTAndLogGenerator', described by the scientific paper 'PTandLogGenerator: A Generator for Artificial Event Data', has been implemented in the PM4Py library.

The code snippet can be used to generate a process tree.

from pm4py.simulation.tree_generator import simulator as tree_gen
parameters = {}
tree = tree_gen.apply(parameters=parameters)
The following parameters can be specified for the generation of the process tree:
Parameter Meaning
MODE most frequent number of visible activities (default 20)
MIN minimum number of visible activities (default 10)
MAX maximum number of visible activities (default 30)
SEQUENCE probability to add a sequence operator to tree (default 0.25)
CHOICE probability to add a choice operator to tree (default 0.25)
PARALLEL probability to add a parallel operator to tree (default 0.25)
LOOP probability to add a loop operator to tree (default 0.25)
OR probability to add an or operator to tree (default 0)
SILENT probability to add silent activity to a choice or loop operator (default 0.25)
DUPLICATE probability to duplicate an activity label (default 0)
LT_DEPENDENCY probability to add a random dependency to the tree (default 0)
INFREQUENT probability to make a choice have infrequent paths (default 0.25)
NO_MODELS number of trees to generate from model population (default 10)
UNFOLD whether or not to unfold loops in order to include choices underneath in dependencies (0=False, 1=True); if lt_dependency <= 0, this should always be 0 (False); if lt_dependency > 0, this can be 0 or 1 (default 10)
MAX_REPEAT maximum number of repetitions of a loop (only used when UNFOLD is 1) (default 10)
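
A minimal sketch of how some of these parameters might be passed is shown below; it assumes that the PTANDLOGGENERATOR variant exposes them in a Parameters enum (the exact names should be checked against the installed version).

from pm4py.simulation.tree_generator import simulator as tree_gen

variant = tree_gen.Variants.PTANDLOGGENERATOR
# the parameter keys below mirror the table above and are assumed to be
# available on the variant module
parameters = {variant.value.Parameters.MODE: 10,  # most frequent number of visible activities
              variant.value.Parameters.MIN: 5,    # minimum number of visible activities
              variant.value.Parameters.MAX: 15}   # maximum number of visible activities
tree = tree_gen.apply(variant=variant, parameters=parameters)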

Generation of a log out of a process tree

The code snippet can be used to generate a log, with 100 cases, out of the process tree.
from pm4py.objects.process_tree import semantics
log = semantics.generate_log(tree, no_traces=100)

Conversion into Petri net

The code snippet can be used to convert the process tree into a Petri net.
from pm4py.objects.conversion.process_tree import converter as pt_converter
net, im, fm = pt_converter.apply(tree)

Visualize a Process Tree

A process tree can be printed, as shown on the right side.
print(tree)
A process tree can also be visualized, as shown on the right side.
from pm4py.visualization.process_tree import visualizer as pt_visualizer
gviz = pt_visualizer.apply(tree, parameters={pt_visualizer.Variants.WO_DECORATION.value.Parameters.FORMAT: "png"})
pt_visualizer.view(gviz)

Decision Trees

Decision trees are objects that help to understand the conditions leading to a particular outcome. In this section, several examples related to the construction of decision trees are provided.

Ideas behind the construction of decision trees are provided in the scientific paper: de Leoni, Massimiliano, Wil M.P. van der Aalst, and Marcus Dees. 'A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs.'

The general scheme is the following:

  • A representation of the log, on a given set of features, is obtained (for example, using one-hot encoding on string attributes and keeping numeric attributes as they are)
  • A representation of the target classes is constructed
  • The decision tree is calculated
  • The decision tree is represented in some ways

Decision tree about the ending activity of a process

A process instance may potentially finish with different activities, signaling different outcomes of the process instance. A decision tree may help to understand the reasons behind each outcome.

First, a log is loaded. Then, a representation of the log, on a given set of features, is obtained. Here, the parameters are:

Parameter Meaning
str_trace_attributes contains the attributes of type string, at trace level, that are one-hot encoded in the final matrix.
str_event_attributes contains the attributes of type string, at event level, that are one-hot-encoded in the final matrix.
num_trace_attributes contains the numeric attributes, at trace level, that are inserted in the final matrix.
num_event_attributes contains the numeric attributes, at event level, that are inserted in the final matrix.
import os
from pm4py.objects.log.importer.xes import importer as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "roadtraffic50traces.xes"))

from pm4py.objects.log.util import get_log_representation
str_trace_attributes = []
str_event_attributes = ["concept:name"]
num_trace_attributes = []
num_event_attributes = ["amount"]
data, feature_names = get_log_representation.get_representation(
                           log, str_trace_attributes, str_event_attributes,
                           num_trace_attributes, num_event_attributes)
Or an automatic representation (automatic selection of the attributes) could be obtained:
data, feature_names = get_log_representation.get_default_representation(log)
Then, the target classes are formed. Each endpoint of the process belongs to a different class.
from pm4py.objects.log.util import get_class_representation
target, classes = get_class_representation.get_class_representation_by_str_ev_attr_value_value(log, "concept:name")
The decision tree can then be calculated and visualized.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(data, target)

from pm4py.visualization.decisiontree import visualizer as dectree_visualizer
gviz = dectree_visualizer.apply(clf, feature_names, classes)
dectree_visualizer.view(gviz)

Decision tree about the duration of a case (Root Cause Analysis)

A decision tree about the duration of a case helps to understand the reasons behind a high case duration (or, at least, a case duration that is above a given threshold).

First, a log has to be loaded. Then, a representation of the log, on a given set of features, is obtained. Here, the parameters are:

Parameter Meaning
str_trace_attributes contains the attributes of type string, at trace level, that are one-hot encoded in the final matrix.
str_event_attributes contains the attributes of type string, at event level, that are one-hot-encoded in the final matrix.
num_trace_attributes contains the numeric attributes, at trace level, that are inserted in the final matrix.
num_event_attributes contains the numeric attributes, at event level, that are inserted in the final matrix.
import os
from pm4py.objects.log.importer.xes import importer as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "roadtraffic50traces.xes"))

from pm4py.objects.log.util import get_log_representation
str_trace_attributes = []
str_event_attributes = ["concept:name"]
num_trace_attributes = []
num_event_attributes = ["amount"]

data, feature_names = get_log_representation.get_representation(log, str_trace_attributes, str_event_attributes,
                                                             num_trace_attributes, num_event_attributes)
Or an automatic representation (automatic selection of the attributes) could be obtained:
data, feature_names = get_log_representation.get_default_representation(log)
Then, the target classes are formed. There are two classes: traces whose duration is below the specified threshold (here, 200 days; note that the duration is expressed in seconds) and traces whose duration is above it.
from pm4py.objects.log.util import get_class_representation
target, classes = get_class_representation.get_class_representation_by_trace_duration(log, 2 * 8640000)
The decision tree can then be calculated and visualized.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(data, target)

from pm4py.visualization.decisiontree import visualizer as dectree_visualizer
gviz = dectree_visualizer.apply(clf, feature_names, classes)
dectree_visualizer.view(gviz)