Importing and Exporting Data

This section presents how to import and export event logs stored in various data formats.

Importing IEEE XES files

IEEE XES is a standard format describing how event logs are stored. For more information about the format, please study the IEEE XES Website.
The example code on the right shows how to import an event log, stored in the IEEE XES format, given a file path to the log file.

from pm4py.objects.log.importer.xes import factory as xes_import_factory
log = xes_import_factory.apply('<path_to_xes_file>')

Event logs are stored as an extension of the Python list data structure. To access a trace in the log, it is enough to provide its index in the event log. Consider the example on the right, showing how to access the different objects stored in the imported log.

print(log[0]) #prints the first trace of the log
print(log[0][0]) #prints the first event of the first trace of the given log
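Since the log extends a Python list, standard list operations work as well; trace-level attributes can be read through the attributes dictionary of a trace, and event attributes through dictionary access on the event. A short sketch, using the usual XES attribute keys:

print(len(log)) #prints the number of traces in the log
print(log[0].attributes) #prints the trace attributes of the first trace
print(log[0][0]["concept:name"]) #prints the activity of the first event of the first trace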

The apply() method of the xes_import_factory (located in pm4py.objects.log.importer.xes.factory.py) accepts two additional parameters: variant and parameters. The variant parameter is typically a string-valued argument indicating which variant/version of the importer to use. The parameters parameter is a variant-specific Python dictionary with the parameters of choice. This method invocation style is used throughout PM4Py in the various algorithms implemented.

import os
from pm4py.objects.log.importer.xes import factory as xes_import_factory
parameters = {"timestamp_sort": True}
log = xes_import_factory.apply('<path_to_xes_file>',
                                                variant="nonstandard",
                                                parameters=parameters)
Parameter Type Impact
timestamp_sort boolean Specify if we should sort log by timestamp
timestamp_key string If timestamp_sort is true, then sort the log by using this event-attribute key
reverse_sort boolean Specify in which direction the log should be sorted
index_trace_indexes boolean Specify if trace indexes should be added as event attribute for each event
max_no_traces_to_import integer Specify the maximum number of traces to import from the log (as occurring in order of the XML file)
Currently, only two variants are supported: iterparse (the default variant) and nonstandard. The former uses the iterparse library internally for XML parsing and complies with the IEEE XES standard. The latter is a custom implementation that reads the XES file line by line (for improved performance). It does not follow the standard and is able to import only traces, simple trace attributes, events, and simple event attributes.

Exporting IEEE XES files

To export a log object into an XES file, the code snippet on the right-hand side can be used.

from pm4py.objects.log.exporter.xes import factory as xes_exporter
xes_exporter.export_log(log, "exportedLog.xes")

Importing CSV files

Process Mining algorithms implemented in PM4Py usually take a log as input. However, events in a CSV file are not grouped into traces. To overcome this, PM4Py uses pandas to convert the CSV file into a dataframe, then into an event stream, and finally into a log. Different parameters can also be passed to these steps.

The code on the right-hand side covers both the importing of the CSV through pandas and its conversion into the event stream structure. Note that this is not the log structure that you obtain after importing an XES file.

import os
from pm4py.objects.log.importer.csv import factory as csv_importer
event_stream = csv_importer.import_event_stream(
         os.path.join("tests", "input_data", "running-example.csv") )

If one wants to convert the event stream structure into a log, the following code snippet can be used. In this code snippet, it is assumed that, for the events of the stream, the case:concept:name attribute contains the Case ID, the concept:name attribute contains the activity, and the time:timestamp attribute contains the timestamp.

from pm4py.objects.conversion.log import factory as conversion_factory
from pm4py.util import constants
log = conversion_factory.apply(event_stream, parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                   constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name",
                                                   constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

Sometimes it is useful to ingest the CSV into a dataframe using pandas, perform some pre-filtering on the dataframe, and only afterwards convert it into an event stream (and then log) structure. The last code snippet on the right-hand side covers the ingestion, the conversion into the event stream structure and, eventually, the conversion into a log. In this code snippet, it is assumed that the case:concept:name column of the dataframe contains the Case ID, the concept:name column contains the activity, and the time:timestamp column contains the timestamp.

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter
from pm4py.objects.conversion.log import factory as conversion_factory
from pm4py.util import constants
dataframe = csv_import_adapter.import_dataframe_from_path(
        os.path.join("tests", "input_data",
        "running-example.csv"), sep=",")
log = conversion_factory.apply(dataframe, parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name",
                                                      constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

Exporting logs to CSV

If one wants to export an event stream into a CSV file, the following command is provided.

from pm4py.objects.log.exporter.csv import factory as csv_exporter
csv_exporter.export(event_stream, "outputFile1.csv")

If one wants to export a log into a CSV file, a command taking a log as input is provided as well.

from pm4py.objects.log.exporter.csv import factory as csv_exporter
csv_exporter.export(log, "outputFile2.csv")

Working with Parquet files

Working with Parquet files requires the installation of a separate library for the management of Parquet files. We advise the installation of Pyarrow (0.15.1). Alternatively, Fastparquet is also available.

The importing of a Parquet file leads to a Pandas dataframe as output. Any log object in PM4Py (through automatic conversion to Pandas dataframe) could be then stored in a Parquet file. To import a Parquet log with all its columns, the instructions on the right-hand side could be used.

import os
from pm4py.objects.log.importer.parquet import factory as parquet_importer
log_path = os.path.join("tests", "input_data", "running-example.parquet")
dataframe = parquet_importer.apply(log_path)

To import a Parquet log with only a set of columns, a code snippet is provided.

dataframe = parquet_importer.apply(log_path,
  parameters={"columns": ["case:concept:name", "concept:name"]})

An example of exporting into the Parquet format is shown on the right.

from pm4py.objects.log.exporter.parquet import factory as parquet_exporter
parquet_exporter.apply(log, "running-example.parquet")

Sorting

It is possible to sort an event log or an event stream. There are two implementations: sorting by timestamp and sorting by an arbitrary lambda expression (the former can be expressed through the latter). In this example, the sorting is based only on the timestamp of the events.

from pm4py.objects.log.util import sorting
log = sorting.sort_timestamp(log)

In this example, a lambda expression is used. If reverse=False, the output is sorted in ascending order; if reverse=True, the output is sorted in descending order.

from pm4py.objects.log.util import sorting
sorted_log = sorting.sort_lambda(log,
         lambda x: x.attributes["concept:name"], reverse=False)

Sampling

This operation works on log objects. The purpose is to obtain a subset of traces that reflects the behavior of the whole event log. The parameter n specifies how many traces have to be kept.

from pm4py.objects.log.util import sampling
sampled_log = sampling.sample(log, n=50)

Filtering

PM4Py also offers various methods to filter an event log. These methods are presented in the following subsections.

Filtering on timeframe

In the following paragraphs, various methods for filtering on time frames are presented. For each method, both the log and the Pandas dataframe variants are shown.

One might be interested in keeping only the traces that are fully contained in a specific interval, e.g., between 09 March 2011 and 18 January 2012. The first code snippet works on a log object, the second one on a dataframe object.

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_contained(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
from pm4py.util import constants
df_timest_contained = timestamp_filter.filter_traces_contained(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

It is also possible to keep the traces that intersect a given time interval. The first example is again for log objects, the second one for dataframe objects.

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log = timestamp_filter.filter_traces_intersecting(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
from pm4py.util import constants
df_timest_intersecting = timestamp_filter.filter_traces_intersecting(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

Until now, only trace-based techniques have been discussed. However, there is also a method to keep the events (trimming the traces) that are contained in a specific timeframe. As before, the first code snippet shows how to apply this technique on log objects, whereas the second one shows how to apply it on dataframe objects.

from pm4py.algo.filtering.log.timestamp import timestamp_filter
filtered_log_events = timestamp_filter.apply_events(log, "2011-03-09 00:00:00", "2012-01-18 23:59:59")
from pm4py.algo.filtering.pandas.timestamp import timestamp_filter
from pm4py.util import constants
df_timest_events = timestamp_filter.apply_events(dataframe, "2011-03-09 00:00:00", "2012-01-18 23:59:59",
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

Filter on case performance

This filter permits keeping only the traces whose duration lies inside a specified interval. In the examples, traces lasting between 1 and 10 days are kept. Note that the time parameters are given in seconds. The first code snippet applies this technique on a log object, the second one on a dataframe object.

from pm4py.algo.filtering.log.cases import case_filter
filtered_log = case_filter.filter_on_case_performance(log, 86400, 864000)
from pm4py.algo.filtering.pandas.cases import case_filter
from pm4py.util import constants
df_cases = case_filter.filter_on_case_performance(dataframe, min_case_performance=86400, max_case_performance=864000,
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})

Filter on start activities

In general, PM4Py offers two methods to filter a log or a dataframe on start activities. In the first method, a list of start activities has to be specified; the filter keeps the traces whose start activity is contained in the list. In the second method, a decreasing factor is used; its meaning is explained below.

Suppose the following start activities and their respective numbers of occurrences.
Activity Number of occurrences
A 1000
B 700
C 300
D 50
Assume decreasingFactor=0.6. The most frequent start activity, A in this case, is kept. Then, the number of occurrences of the next most frequent activity is divided by the number of occurrences of the current one: 700/1000=0.7. Since 0.7>0.6, B is also kept as an admissible start activity. In the next step, the numbers of occurrences of activities C and B are compared: 300/700≈0.43. Since 0.43<0.6, C is not accepted as an admissible start activity and the method stops here.
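The selection logic described above can be transcribed as follows (a minimal sketch, not the PM4Py implementation itself; the activity counts and the decreasing factor mirror the example).

def admissible_start_activities(counts, decreasing_factor=0.6):
 # rank the activities by decreasing number of occurrences
 ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
 admissible = [ranked[0][0]]  # the most frequent activity is always kept
 for prev, curr in zip(ranked, ranked[1:]):
  # keep the next activity only if its frequency ratio exceeds the factor
  if curr[1] / prev[1] > decreasing_factor:
   admissible.append(curr[0])
  else:
   break  # the method stops at the first activity below the threshold
 return admissible

print(admissible_start_activities({"A": 1000, "B": 700, "C": 300, "D": 50})) # ['A', 'B']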

First of all, it might be necessary to know which start activities occur in the log; code snippets for this are provided, followed by an example of filtering. The first snippet works on a log object, the second one on a dataframe. log_start is a dictionary that contains, for each start activity, its number of occurrences.

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_start = start_activities_filter.get_start_activities(log)
filtered_log = start_activities_filter.apply(log, ["S1"]) #suppose "S1" is the start activity you want to filter on
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
from pm4py.util import constants
log_start = start_activities_filter.get_start_activities(dataframe)
df_start_activities = start_activities_filter.apply(dataframe, ["S1"],
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"}) #suppose "S1" is the start activity you want to filter on

As mentioned earlier, there is also a method that aims to keep only the frequent start activities. Again, the first snippet works on a log object, the second one on a dataframe object. The default value of decreasingFactor is 0.6.

from pm4py.algo.filtering.log.start_activities import start_activities_filter
log_af_sa = start_activities_filter.apply_auto_filter(log, parameters={"decreasingFactor": 0.6})
from pm4py.algo.filtering.pandas.start_activities import start_activities_filter
df_auto_sa = start_activities_filter.apply_auto_filter(dataframe, parameters={"decreasingFactor": 0.6})

Filter on end activities

In general, PM4Py offers two methods to filter a log or a dataframe on end activities. In the first method, a list of end activities has to be specified; the filter keeps the traces whose end activity is contained in the list. In the second method, a decreasing factor is used; an explanation is provided in the start activities section.

This filter permits keeping only the traces with an end activity among a set of specified activities. First of all, it might be necessary to know the end activities; code snippets for this are provided, followed by an example of filtering. For the dataframe filtering, a further attribute specification is possible: case:concept:name is the column of the dataframe containing the Case ID, and concept:name is the column containing the activity.

from pm4py.algo.filtering.log.end_activities import end_activities_filter
end_activities = end_activities_filter.get_end_activities(log)
filtered_log = end_activities_filter.apply(log, ["pay compensation"])
from pm4py.algo.filtering.pandas.end_activities import end_activities_filter
from pm4py.util import constants
end_activities = end_activities_filter.get_end_activities(df)
filtered_df = end_activities_filter.apply(df, ["pay compensation"],
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"})

Filter on variants

A variant is a set of cases that share the same control-flow perspective, i.e., a set of cases that contain the same classified events (activities) in the same order. In this section, for each method, we first focus on log objects and then on dataframes.

To get the list of variants contained in a given log, the following code can be used. The first snippet works on a log object, the second one on a dataframe. For logs, the result is a dictionary having as key the variant (a comma-separated sequence of activities) and as value the list of cases that share the variant.

from pm4py.algo.filtering.log.variants import variants_filter
variants = variants_filter.get_variants(log)
from pm4py.statistics.traces.pandas import case_statistics
from pm4py.util import constants
variants = case_statistics.get_variants_df(df,
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"})

If the number of occurrences of the variants is of interest, the following code retrieves the variants along with their counts (a list of dictionaries, each reporting a variant and its number of occurrences).

from pm4py.statistics.traces.log import case_statistics
variants_count = case_statistics.get_variant_statistics(log)
variants_count = sorted(variants_count, key=lambda x: x['count'], reverse=True)
from pm4py.statistics.traces.pandas import case_statistics
from pm4py.util import constants
variants_count = case_statistics.get_variant_statistics(df,
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name",
                                                      constants.PARAMETER_CONSTANT_TIMESTAMP_KEY: "time:timestamp"})
variants_count = sorted(variants_count, key=lambda x: x['case:concept:name'], reverse=True)

To filter based on variants, assume that variants is a list, whereby each element is a variant (expressed in the same way as returned by the variants retrieval method). The first method can be applied on log objects, the second one on dataframe objects. Note that only the variants given in variants are kept.

filtered_log1 = variants_filter.apply(log, variants)
filtered_df1 = variants_filter.apply(df, variants,
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"})

Contrary to the previous example, suppose you want to filter the given variants out. Again, let variants be a list, whereby each element is a variant.

filtered_log2 = variants_filter.apply(log, variants, parameters={"positive": False})
filtered_df2 = variants_filter.apply(df, variants,
                                          parameters={"positive": False, constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name",
                                                      constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"})

A filter to keep automatically the most common variants could be applied through the apply_auto_filter method. This method accepts a parameter called decreasingFactor (default value is 0.6; further details are provided in the start activities filter).

from pm4py.algo.filtering.log.variants import variants_filter
auto_filtered_log = variants_filter.apply_auto_filter(log)
from pm4py.algo.filtering.pandas.variants import variants_filter
auto_filtered_df = variants_filter.apply_auto_filter(df)

Filter on attribute values

Filtering on attribute values permits alternatively to:

  • Keep cases that contain at least one event with one of the given attribute values
  • Remove cases that contain an event with one of the given attribute values
  • Keep events (trimming traces) that have one of the given attribute values
  • Remove events (trimming traces) that have one of the given attribute values

Examples of such attributes are the resource (generally contained in the org:resource attribute) and the activity (generally contained in the concept:name attribute). As noted before, the first method can be applied on log objects, the second one on dataframe objects.

To get the list of resources and activities contained in the log, the following code could be used.

from pm4py.algo.filtering.log.attributes import attributes_filter
activities = attributes_filter.get_attribute_values(log, "concept:name")
resources = attributes_filter.get_attribute_values(log, "org:resource")
from pm4py.algo.filtering.pandas.attributes import attributes_filter
activities = attributes_filter.get_attribute_values(df, attribute_key="concept:name")
resources = attributes_filter.get_attribute_values(df, attribute_key="org:resource")

To filter traces containing/not containing a given list of resources, the following code could be used.

from pm4py.util import constants
tracefilter_log_pos = attributes_filter.apply(log, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": True})
tracefilter_log_neg = attributes_filter.apply(log, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": False})
from pm4py.util import constants
df_traces_pos = attributes_filter.apply(df, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name", constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": True})
df_traces_neg = attributes_filter.apply(df, ["Resource10"],
                                          parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name", constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "org:resource", "positive": False})

To automatically apply a filter on event attributes (trimming traces and keeping only the events whose attribute has a frequent value), the apply_auto_filter method is provided. The method accepts as parameters the attribute name and the decreasingFactor (default 0.6; an explanation can be found in the start activities filter section).

from pm4py.algo.filtering.log.attributes import attributes_filter
from pm4py.util import constants
filtered_log = attributes_filter.apply_auto_filter(log, parameters={
    constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "concept:name", "decreasingFactor": 0.6})
from pm4py.algo.filtering.pandas.attributes import attributes_filter
from pm4py.util import constants
filtered_df = attributes_filter.apply_auto_filter(df, parameters={
    constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "concept:name", "decreasingFactor": 0.6})

Filter on numeric attribute values

Filtering on numeric attribute values provides options similar to filtering on string attribute values (considered above).

First, we import the log. Subsequently, we want to keep only the events with an amount between 34 and 36. An additional filter aims to keep only the cases with at least one event satisfying the specified amount. The filter on cases provides the option to specify up to two attributes that are checked on the events that shall satisfy the numeric range. For example, if we are interested in cases having an event with activity Add penalty and an amount between 34 and 500, a code snippet is also provided.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "roadtraffic100traces.xes"))

from pm4py.algo.filtering.log.attributes import attributes_filter
from pm4py.util import constants
filtered_log_events = attributes_filter.apply_numeric_events(log, 34, 36,
                                             parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount"})

filtered_log_cases = attributes_filter.apply_numeric(log, 34, 36,
                                             parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount"})

filtered_log_cases = attributes_filter.apply_numeric(log, 34, 500,
                                             parameters={constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount",
                                                         "stream_filter_key1": "concept:name",
                                                         "stream_filter_value1": "Add penalty"})

The same methods can also be applied on dataframes.

import os
from pm4py.objects.log.adapters.pandas import csv_import_adapter
df = csv_import_adapter.import_dataframe_from_path(os.path.join("tests", "input_data", "roadtraffic100traces.csv"))

from pm4py.algo.filtering.pandas.attributes import attributes_filter
from pm4py.util import constants
filtered_df_events = attributes_filter.apply_numeric_events(df, 34, 36,
                                             parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name", constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount"})

filtered_df_cases = attributes_filter.apply_numeric(df, 34, 36,
                                             parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name", constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount"})

filtered_df_cases = attributes_filter.apply_numeric(df, 34, 500,
                                             parameters={constants.PARAMETER_CONSTANT_CASEID_KEY: "case:concept:name", constants.PARAMETER_CONSTANT_ATTRIBUTE_KEY: "amount",
                                                         "stream_filter_key1": "concept:name",
                                                         "stream_filter_value1": "Add penalty"})

Process Discovery

Process Discovery algorithms aim to find a suitable process model that describes the order of events/activities that are executed during a process execution.

In the following, we provide an overview of the advantages and disadvantages of the mining algorithms.

Algorithm Characteristics
Alpha Cannot handle loops of length one and length two; invisible and duplicated tasks cannot be discovered; discovered model might not be sound; weak against noise
Alpha+ Can handle loops of length one and length two; invisible and duplicated tasks cannot be discovered; discovered model might not be sound; weak against noise
Heuristic Takes frequency into account; detects short loops; detects skipping activities; does not guarantee a sound model
Inductive Can handle invisible tasks; model is sound; most used process mining algorithm

Alpha Miner

The alpha miner is one of the best-known Process Discovery algorithms and is able to find:

  • A Petri net model where all the transitions are visible, unique and correspond to classified events (for example, to activities).
  • An initial marking that describes the status of the Petri net model when an execution starts.
  • A final marking that describes the status of the Petri net model when an execution ends.

We provide an example where a log is read, the Alpha algorithm is applied and the Petri net along with the initial and the final marking are found. The log we take as input is the running-example.xes.

First, the log has to be imported.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))

Subsequently, the Alpha Miner is applied.

from pm4py.algo.discovery.alpha import factory as alpha_miner
net, initial_marking, final_marking = alpha_miner.apply(log)

IMDFb

IMDFb is a specific implementation of the Inductive Miner Directly Follows algorithm (IMDF) that aims to construct a sound workflow net with good fitness values (in most cases, assuring perfect replay fitness). The basic idea of the Inductive Miner is to detect a 'cut' in the log (e.g., sequence cut, exclusive-choice cut, parallel cut, loop cut) and then recur on the sublogs obtained by applying the cut, until a base case is found. The Directly-Follows variant avoids the recursion on the sublogs and uses the Directly-Follows graph instead.

IMDFb models usually make extensive use of hidden transitions, especially for skipping/looping over a portion of the model. Furthermore, each visible transition has a unique label (no two transitions in the model share the same label).

Two process models can be derived: A Petri Net and a Process Tree.

To mine a Petri net, we provide an example: a log is read, IMDFb is applied, and the Petri net along with the initial and the final marking is found. The log we take as input is the running-example.xes. First, the log is read, then the IMDFb algorithm is applied.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.inductive import factory as inductive_miner

log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))
net, initial_marking, final_marking = inductive_miner.apply(log)

To obtain a process tree, the provided code snippet can be used. The last two lines of code are responsible for the visualization of the process tree.

from pm4py.algo.discovery.inductive import factory as inductive_miner
from pm4py.visualization.process_tree import factory as pt_vis_factory

tree = inductive_miner.apply_tree(log)

gviz = pt_vis_factory.apply(tree)
pt_vis_factory.view(gviz)

It is also possible to convert a process tree into a Petri net.

from pm4py.objects.conversion.process_tree import factory as pt_conv_factory
net, initial_marking, final_marking = pt_conv_factory.apply(tree, variant=pt_conv_factory.TO_PETRI_NET)

Heuristics Miner

Heuristics Miner is an algorithm that acts on the Directly-Follows Graph, providing a way to handle noise and to find common constructs (such as the dependency between two activities, or AND-splits). The output of the Heuristics Miner is a Heuristics Net, i.e., an object that contains the activities and the relationships between them. The Heuristics Net can then be converted into a Petri net. For further details, consult the original Heuristics Miner paper.

It is possible to obtain both a Heuristics Net and a Petri net.

To apply the Heuristics Miner and discover a Heuristics Net, it is necessary to import a log first. Then, the Heuristics Net can be computed. The available parameters are listed below.

from pm4py.objects.log.importer.xes import factory as xes_importer
import os
log_path = os.path.join("tests", "compressed_input_data", "09_a32f0n00.xes.gz")
log = xes_importer.apply(log_path)

from pm4py.algo.discovery.heuristics import factory as heuristics_miner
heu_net = heuristics_miner.apply_heu(log, parameters={"dependency_thresh": 0.99})
Parameter name Meaning
dependency_thresh dependency threshold of the Heuristics Miner (default: 0.5)
and_measure_thresh AND measure threshold of the Heuristics Miner (default: 0.65)
min_act_count minimum number of occurrences of an activity to be considered (default: 1)
min_dfg_occurrences minimum number of occurrences of an edge to be considered (default: 1)
dfg_pre_cleaning_noise_thresh cleaning threshold of the DFG (in order to remove weaker edges, default 0.05)

To visualize the Heuristics Net, code is also provided on the right-hand side.

from pm4py.visualization.heuristics_net import factory as hn_vis_factory
gviz = hn_vis_factory.apply(heu_net)
hn_vis_factory.view(gviz)

To obtain a Petri net based on the Heuristics Miner, the code on the right-hand side can be used. This Petri net can be visualized as well.

from pm4py.algo.discovery.heuristics import factory as heuristics_miner
net, im, fm = heuristics_miner.apply(log, parameters={"dependency_thresh": 0.99})

from pm4py.visualization.petrinet import factory as pn_vis_factory
gviz = pn_vis_factory.apply(net, im, fm)
pn_vis_factory.view(gviz)

Directly-Follows Graph

Process models expressed as Petri nets have well-defined semantics: a process execution starts from the places included in the initial marking and finishes at the places included in the final marking. In this section, another class of process models, Directly-Follows Graphs, is introduced. Directly-Follows Graphs are graphs where the nodes represent the events/activities in the log, and a directed edge between two nodes exists if there is at least one trace in the log where the source event/activity is followed by the target event/activity. On top of these directed edges, it is easy to represent metrics like frequency (counting the number of times the source event/activity is followed by the target event/activity) and performance (some aggregation, for example the mean, of the time elapsed between the two events/activities).

First, we have to import the log. Subsequently, we can extract the Directly-Follows Graph. In addition, code is provided to visualize it: a colored visualization of the Directly-Follows Graph, decorated with the frequency of activities.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))

from pm4py.algo.discovery.dfg import factory as dfg_factory
dfg = dfg_factory.apply(log)

from pm4py.visualization.dfg import factory as dfg_vis_factory
gviz = dfg_vis_factory.apply(dfg, log=log, variant="frequency")
dfg_vis_factory.view(gviz)

To get a Directly-Follows graph decorated with performance information on the edges, two parts of the previous code have to be replaced.

from pm4py.algo.discovery.dfg import factory as dfg_factory
from pm4py.visualization.dfg import factory as dfg_vis_factory

dfg = dfg_factory.apply(log, variant="performance")
gviz = dfg_vis_factory.apply(dfg, log=log, variant="performance")
dfg_vis_factory.view(gviz)

To save the obtained DFG, for instance in the SVG format, code is also provided on the right-hand side.

from pm4py.algo.discovery.dfg import factory as dfg_factory
from pm4py.visualization.dfg import factory as dfg_vis_factory

dfg = dfg_factory.apply(log, variant="performance")
parameters = {"format":"svg"}
gviz = dfg_vis_factory.apply(dfg, log=log, variant="performance", parameters=parameters)
dfg_vis_factory.save(gviz, "dfg.svg")

Convert Directly-Follows Graph to a Workflow Net

The Directly-Follows Graph is the process representation provided by many commercial tools. An idea proposed by Sander Leemans is to convert the DFG into a workflow net that perfectly mimics the DFG, in order to be able to perform alignments between the behavior described in the model and the behavior described in the log. This is called DFG mining. The following steps are useful to load the log, calculate the DFG, convert it into a workflow net and perform alignments.

First, we have to import the log. Subsequently, we have to mine the Directly-Follow graph. This DFG can then be converted to a workflow net.

from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.apply("C:\\running-example.xes")

from pm4py.algo.discovery.dfg import factory as dfg_factory
dfg = dfg_factory.apply(log)

from pm4py.objects.conversion.dfg import factory as dfg_mining_factory
net, im, fm = dfg_mining_factory.apply(dfg)

Adding information about Frequency/Performance

Similar to the Directly-Follows graph, it is also possible to decorate the Petri net with frequency or performance information. This is done by using a replay technique on the model and then assigning frequency/performance to the paths. The variant parameter of the factory specifies which annotation should be used. The values for the variant parameter are the following:

  • wo_decoration: This is the default value and indicates that the Petri net is not decorated.
  • frequency: This indicates that the model should be decorated according to frequency information obtained by applying replay.
  • performance: This indicates that the model should be decorated according to performance (aggregated by mean) information obtained by applying replay.

If the frequency or the performance decoration is chosen, it is required to pass the log as a parameter of the visualization (the log needs to be replayed).

The code on the right-hand side can be used to obtain the Petri net mined by the Inductive Miner decorated with frequency information.

from pm4py.visualization.petrinet import factory as pn_vis_factory
parameters = {"format":"png"}
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters, variant="frequency", log=log)
pn_vis_factory.save(gviz, "inductive_frequency.png")
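With the same objects (net, initial_marking, final_marking and log) as above, the performance decoration only requires switching the variant; a minimal sketch (the output file name is arbitrary):

from pm4py.visualization.petrinet import factory as pn_vis_factory
parameters = {"format":"png"}
# replay the log on the model to decorate the paths with performance (mean durations)
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters, variant="performance", log=log)
pn_vis_factory.save(gviz, "inductive_performance.png")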

Classifier

Algorithms implemented in pm4py classify events based on their activity name, which is usually reported inside the concept:name event attribute. In some contexts, it is useful to use another event attribute as activity:

  • Importing an event log from a CSV does not guarantee the presence of a concept:name event attribute
  • Multiple events in a case may refer to different lifecycles of the same activity

The example on the right-hand side shows the specification of an activity key for the Alpha Miner algorithm.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.alpha import factory as alpha_miner
from pm4py.util import constants
log = xes_importer.import_log(os.path.join("tests","input_data","running-example.xes"))
parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "concept:name"}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

For logs imported from XES format, a list of fields that could be used in order to classify events and apply Process Mining algorithms is usually reported in the classifiers section. The Standard classifier usually includes the activity name (the concept:name attribute) and the lifecycle (the lifecycle:transition attribute); the Event name classifier includes only the activity name.

In pm4py, it is assumed that algorithms work on a single activity key. In order to use multiple fields, a new attribute should be inserted for each event as the concatenation of the two.

In the following, retrieval and insertion of a corresponding attribute regarding classifiers are discussed.

The example on the right-hand side demonstrates the retrieval of the classifiers defined inside a log file, using the receipt.xes log. The print command displays a dictionary that reports, for each classifier, the attributes it is based on.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer

log = xes_importer.import_log(os.path.join("tests","input_data","receipt.xes"))
print(log.classifiers)

To use the classifier Activity classifier and write a new attribute for each event in the log, the following code can be used.

from pm4py.objects.log.util import insert_classifier
log, activity_key = insert_classifier.insert_activity_classifier_attribute(log, "Activity classifier")

Then, as before, the Alpha Miner can be applied on the log specifying the newly inserted activity key.

from pm4py.algo.discovery.alpha import factory as alpha_miner
from pm4py.util import constants
parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: activity_key}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

In the following, a technique is shown to insert a new attribute manually.

In case the XES log specifies no classifiers and a different field should be used as activity key, there is the option to specify it manually. For example, in this piece of code we read the receipt.xes log and create a new attribute, called customClassifier, that is the concatenation of the activity name and the transition. Subsequently, the Alpha Miner can be applied using this new classifier.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.util import constants

log = xes_importer.import_log(os.path.join("tests","input_data","receipt.xes"))
for trace in log:
 for event in trace:
  event["customClassifier"] = event["concept:name"] + event["lifecycle:transition"]

from pm4py.algo.discovery.alpha import factory as alpha_miner
parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "customClassifier"}
net, initial_marking, final_marking = alpha_miner.apply(log, parameters=parameters)

Petri Net management

Petri nets are one of the most common formalisms to express a process model. A Petri net is a directed bipartite graph, in which the nodes represent transitions and places. Arcs connect places to transitions and transitions to places, and have an associated weight. A transition can fire if each of its input places contains a number of tokens that is at least equal to the weight of the arc connecting the place to the transition. When a transition fires, tokens are removed from the input places according to the weights of the input arcs, and are added to the output places according to the weights of the output arcs.

A marking is a state in the Petri net that associates each place to a number of tokens and is uniquely associated to a set of enabled transitions that could be fired according to the marking.

Process Discovery algorithms implemented in pm4py return a Petri net along with an initial marking and a final marking. The initial marking is the initial state of execution of the process; the final marking is a state that should be reached at the end of the execution.

Importing and exporting

Petri nets, along with their initial and final marking, can be imported/exported from the PNML file format. The code on the right-hand side can be used to import a Petri net along with the initial and final marking.

First, we import the Petri net from a PNML file. Subsequently, the Petri net is visualized by using the Petri net visualizer. In addition, the Petri net is exported with its initial marking, or with both the initial and the final marking.

import os
from pm4py.objects.petri.importer import pnml as pnml_importer
net, initial_marking, final_marking = pnml_importer.import_net(os.path.join("tests","input_data","running-example.pnml"))

from pm4py.visualization.petrinet import factory as pn_vis_factory
gviz = pn_vis_factory.apply(net, initial_marking, final_marking)
pn_vis_factory.view(gviz)

from pm4py.objects.petri.exporter import pnml as pnml_exporter
pnml_exporter.export_net(net, initial_marking, "petri.pnml")

pnml_exporter.export_net(net, initial_marking, "petri_final.pnml", final_marking=final_marking)

Petri Net properties

This section shows how to get the properties of a Petri net. A property of the net is, for example, the set of transitions enabled in a particular marking. Also the lists of places, transitions and arcs of the net can be inspected.

The list of transitions enabled in a particular marking can be obtained using the right-hand code.

from pm4py.objects.petri import semantics
transitions = semantics.enabled_transitions(net, initial_marking)

The function print(transitions) reports that only the transition register request is enabled in the initial marking of the given Petri net. To obtain all places, transitions and arcs of the Petri net, the following code can be used.

places = net.places
transitions = net.transitions
arcs = net.arcs

Each place has a name and a set of input/output arcs (connected at source/target to a transition). Each transition has a name, a label and a set of input/output arcs (connected at source/target to a place). The code on the right-hand side prints, for each place, its name and, for each input arc of the place, the name and the label of the corresponding source transition. Analogously, trans.name, trans.label and arc.target.name can be accessed.

for place in places:
 print("\nPLACE: "+place.name)
 for arc in place.in_arcs:
  print(arc.source.name, arc.source.label)
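Complementarily, an enabled transition can be fired to obtain a new marking. The sketch below assumes the execute function of the semantics module, which returns the marking reached by firing the given transition:

from pm4py.objects.petri import semantics
# fire one of the transitions enabled in the initial marking and inspect the new marking
enabled = semantics.enabled_transitions(net, initial_marking)
if enabled:
 transition = list(enabled)[0]
 new_marking = semantics.execute(transition, net, initial_marking)
 print(new_marking)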

Creating a new Petri Net

In this section, an overview of the code necessary to create a new Petri net with places, transitions, and arcs is provided. A Petri net object in pm4py should be created with a name.

The code on the right-hand side creates a Petri Net with the name new_petri_net.

# creating an empty Petri net
from pm4py.objects.petri.petrinet import PetriNet, Marking
net = PetriNet("new_petri_net")

In addition, three places are created, namely source, sink, and p_1. These places are added to the previously created Petri Net.

# creating source, p_1 and sink place
source = PetriNet.Place("source")
sink = PetriNet.Place("sink")
p_1 = PetriNet.Place("p_1")
# add the places to the Petri Net
net.places.add(source)
net.places.add(sink)
net.places.add(p_1)

Similar to the places, transitions can be created. However, they need to be assigned a name and a label.

# Create transitions
t_1 = PetriNet.Transition("name_1", "label_1")
t_2 = PetriNet.Transition("name_2", "label_2")
# Add the transitions to the Petri Net
net.transitions.add(t_1)
net.transitions.add(t_2)

To connect them, arcs between places and transitions (and transitions and places) are necessary. To add arcs, code is provided. The first parameter specifies the source of the arc, the second parameter its target, and the last parameter the Petri net it belongs to.

# Add arcs
from pm4py.objects.petri import utils
utils.add_arc_from_to(source, t_1, net)
utils.add_arc_from_to(t_1, p_1, net)
utils.add_arc_from_to(p_1, t_2, net)
utils.add_arc_from_to(t_2, sink, net)

To complete the Petri net, an initial and possibly a final marking need to be defined. To accomplish this, we define the initial marking to contain 1 token in the source place and the final marking to contain 1 token in the sink place.

# Adding tokens
initial_marking = Marking()
initial_marking[source] = 1
final_marking = Marking()
final_marking[sink] = 1

The resulting Petri net along with the initial and final marking can be exported, or visualized.

from pm4py.objects.petri.exporter import pnml as pnml_exporter
pnml_exporter.export_net(net, initial_marking, "createdPetriNet1.pnml", final_marking=final_marking)

from pm4py.visualization.petrinet import factory as pn_vis_factory
gviz = pn_vis_factory.apply(net, initial_marking, final_marking)
pn_vis_factory.view(gviz)

To obtain a specific output format (e.g. svg or png) a format parameter should be provided to the algorithm. The code snippet explains how to obtain an SVG representation of the Petri net. The last lines provide an option to save the visualization of the model.

from pm4py.visualization.petrinet import factory as pn_vis_factory
parameters = {"format":"svg"}
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters)
pn_vis_factory.view(gviz)

from pm4py.visualization.petrinet import factory as pn_vis_factory
parameters = {"format":"svg"}
gviz = pn_vis_factory.apply(net, initial_marking, final_marking, parameters=parameters)
pn_vis_factory.save(gviz, "alpha.svg")

Find cycles inside a Petri net

A cycle in a Petri net is a set of places and transitions that could be repeated several times (during a process execution).

Cycles can be detected by converting the Petri net into a directed graph using the networkx library.

First, a log is imported. Second, the Inductive Miner is applied. Third, the cycles are obtained. The list cycles then contains the cycles, each expressed as a collection of places.

from pm4py.objects.log.importer.xes import factory as xes_importer
import os
log = xes_importer.apply(os.path.join("tests","input_data","running-example.xes"))

from pm4py.algo.discovery.inductive import factory as inductive_miner
net, initial_marking, final_marking = inductive_miner.apply(log)

from pm4py.objects.petri import utils
cycles = utils.get_cycles_petri_net_places(net)

Find the strongly connected components in a Petri net

Strongly connected components are subnets in which a path exists between each pair of elements. Strongly connected components can be detected by converting the Petri net into a directed graph using the networkx library.

First, a log is imported. Second, the Inductive Miner is applied. Third, the strongly connected components in the Petri net are retrieved. Subsequently, the subnet is visualized.

from pm4py.objects.log.importer.xes import factory as xes_importer
import os
log = xes_importer.apply(os.path.join("tests","input_data","running-example.xes"))

from pm4py.algo.discovery.inductive import factory as inductive_miner
net, initial_marking, final_marking = inductive_miner.apply(log)

from pm4py.objects.petri import utils
scc = utils.get_strongly_connected_subnets(net)

from pm4py.visualization.petrinet import factory as pn_vis_factory
gviz = pn_vis_factory.apply(scc[0][0], scc[0][1], scc[0][2])
pn_vis_factory.view(gviz)

Conformance Checking

Conformance checking is a technique to compare a process model with an event log of the same process. The goal is to check if the event log conforms to the model, and vice versa.

In PM4Py, two fundamental techniques are implemented: token-based replay and alignments.

Token-based replay

Token-based replay matches a trace and a Petri net model, starting from the initial place, in order to discover which transitions are executed and in which places we have remaining or missing tokens for the given process instance. Token-based replay is useful for Conformance Checking: indeed, a trace is fitting according to the model if, during its execution, the transitions can be fired without the need to insert any missing token. If the reaching of the final marking is imposed, then a trace is fitting if it reaches the final marking without any missing or remaining tokens.

For each trace, four values have to be determined: produced tokens (p), remaining tokens (r), missing tokens (m) and consumed tokens (c). Based on these, the following fitness formula can be derived, where a Petri net n and a trace t are given as input:

fitness(n, t) = 1/2 * (1 - r/p) + 1/2 * (1 - m/c)

To apply the formula to the whole event log, p, r, m and c are calculated for each trace, summed up, and finally inserted into the formula above.
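The formula can be transcribed directly into a small helper; a minimal sketch (the token counts used in the example call are made up):

# fitness(n, t) = 1/2 * (1 - r/p) + 1/2 * (1 - m/c)
def token_replay_fitness(p, r, m, c):
 # p: produced, r: remaining, m: missing, c: consumed tokens
 return 0.5 * (1 - r / p) + 0.5 * (1 - m / c)

print(token_replay_fitness(p=10, r=1, m=1, c=10)) # 0.9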

In PM4Py there is an implementation of a token replayer that is able to go across hidden transitions (calculating shortest paths between places) and can be used with any Petri net model with unique visible transitions and hidden transitions. When a visible transition needs to be fired and not all places in the preset are provided with the correct number of tokens, starting from the current marking it is checked if for some place there is a sequence of hidden transitions that could be fired in order to enable the visible transition. The hidden transitions are then fired and a marking that permits to enable the visible transition is reached.

Aside from the fitness value, the replay algorithm can be configured in order to consider a trace completely fitting even if there are remaining tokens, as long as all visible transitions corresponding to events in the trace can be fired. Moreover, it can be configured to reach the final marking through hidden transitions. This is useful when after the last activity, the final marking is not reached but could be reached with the execution of hidden transitions.
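The snippet below sketches how the replayer can be invoked on an imported log and a discovered model (assuming the token replay factory located in pm4py.algo.conformance.tokenreplay; the result is a list with one diagnostics entry per trace):

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.inductive import factory as inductive_miner
from pm4py.algo.conformance.tokenreplay import factory as token_replay

log = xes_importer.import_log(os.path.join("tests", "input_data", "running-example.xes"))
net, initial_marking, final_marking = inductive_miner.apply(log)
# replay every trace on the model; each entry reports, among others, whether the trace fits
replayed_traces = token_replay.apply(log, net, initial_marking, final_marking)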

Alignments

PM4Py comes with the following set of linear solvers: PuLP (available for any platform), CVXOPT (available for the most widely used platforms including Windows/Linux for Python 3.6/3.7). Alternatively, ORTools can also be used and installed from PIP.

Alignment-based replay aims to find one of the best alignments between a trace and the model. For each trace, the output of an alignment is a list of couples where the first element is an event (of the trace) or » and the second element is a transition (of the model) or ». For each couple, the following classification can be provided:

  • Sync move: the classification of the event corresponds to the transition label; in this case, both the trace and the model advance in the same way during the replay.
  • Move on log: for couples where the second element is », it corresponds to a replay move in the trace that is not mimicked in the model. This kind of move is unfit and signals a deviation between the trace and the model.
  • Move on model: for couples where the first element is », it corresponds to a replay move in the model that is not mimicked in the trace. For moves on model, we can have the following distinction:
    • Moves on model involving hidden transitions: in this case, even if it is not a sync move, the move is fit.
    • Moves on model not involving hidden transitions: in this case, the move is unfit and signals a deviation between the trace and the model.

First, we have to import the log. Subsequently, we apply the Inductive Miner on the imported log. In addition, we compute the alignments.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
from pm4py.algo.discovery.inductive import factory as inductive_miner

log = xes_importer.import_log(os.path.join("tests", "input_data", "running-example.xes"))

net, initial_marking, final_marking = inductive_miner.apply(log)

from pm4py.algo.conformance.alignments import factory as align_factory
alignments = align_factory.apply_log(log, net, initial_marking, final_marking)

To inspect the alignments, a code snippet is provided. The output (a list) reports for each trace the corresponding alignment along with its statistics. With each trace, a dictionary containing, among the others, the following information is associated:

  • alignment: contains the alignment (sync moves, moves on log, moves on model)
  • cost: contains the cost of the alignment according to the provided cost function
  • fitness: is equal to 1 if the trace is perfectly fitting
print(alignments)
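For a more selective inspection, the list can be iterated, accessing the keys listed above:

# print, for each trace, the fitness of the alignment and the alignment itself
for index, res in enumerate(alignments):
 print(index, res["fitness"], res["alignment"])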

To use a different classifier, we refer to the Classifier section. In the following, the code defines a custom classifier for each event of each trace in the log.

for trace in log:
 for event in trace:
  event["customClassifier"] = event["concept:name"] + event["concept:name"]

A parameters dictionary containing the activity key can be formed.

from pm4py.util import constants
# define the activity key in the parameters
parameters = {constants.PARAMETER_CONSTANT_ACTIVITY_KEY: "customClassifier"}

Then, a process model is computed, and alignments are also calculated. Besides, the fitness value is calculated and the resulting values are printed.

# calculate process model using the given classifier
net, initial_marking, final_marking = inductive_miner.apply(log, parameters=parameters)
alignments = align_factory.apply_log(log, net, initial_marking, final_marking, parameters=parameters)

from pm4py.evaluation.replay_fitness import factory as replay_fitness_factory
log_fitness = replay_fitness_factory.evaluate(alignments, variant="alignments")

print(log_fitness) 

It is also possible to select other parameters for the alignments.

  • Model cost function: associating to each transition in the Petri net the corresponding cost of a move-on-model.
  • Sync cost function: associating to each visible transition in the Petri net the cost of a sync move.

On the right-hand side, an implementation of a custom model cost function and sync cost function is provided. The model cost function and the sync cost function then have to be inserted in the parameters. Subsequently, the replay is performed.

model_cost_function = dict()
sync_cost_function = dict()
for t in net.transitions:
 # if the label is not None, we have a visible transition
 if t.label is not None:
  # associate cost 1000 to each move-on-model associated to visible transitions
  model_cost_function[t] = 1000
  # associate cost 0 to each sync move
  sync_cost_function[t] = 0
 else:
  # associate cost 1 to each move-on-model associated to hidden transitions
  model_cost_function[t] = 1

from pm4py.algo.conformance.alignments.versions import state_equation_a_star
parameters[state_equation_a_star.PARAM_MODEL_COST_FUNCTION] = model_cost_function
parameters[state_equation_a_star.PARAM_SYNC_COST_FUNCTION] = sync_cost_function

alignments = align_factory.apply_log(log, net, initial_marking, final_marking, parameters=parameters)

Process Tree Generation

In PM4Py we offer support for process trees (visualization, conversion to Petri net and generation of a log) and a functionality to generate them. In this section, the functionalities are examined.

Generation of process trees

The approach 'PTAndLogGenerator', described by the scientific paper 'PTandLogGenerator: A Generator for Artificial Event Data', has been implemented in the PM4Py library.

The code snippet can be used to generate a process tree.

from pm4py.algo.simulation.tree_generator import factory as tree_gen_factory
parameters = {}
tree = tree_gen_factory.apply(parameters=parameters)
The following parameters can be specified (each with its default value):
Parameter Meaning
mode most frequent number of visible activities (default 20)
min minimum number of visible activities (default 10)
max maximum number of visible activities (default 30)
sequence probability to add a sequence operator to the tree (default 0.25)
choice probability to add a choice operator to the tree (default 0.25)
parallel probability to add a parallel operator to the tree (default 0.25)
loop probability to add a loop operator to the tree (default 0.25)
or probability to add an or operator to the tree (default 0)
silent probability to add a silent activity to a choice or loop operator (default 0.25)
duplicate probability to duplicate an activity label (default 0)
lt_dependency probability to add a random dependency to the tree (default 0)
infrequent probability to make a choice have infrequent paths (default 0.25)
no_models number of trees to generate from the model population (default 10)
unfold whether or not to unfold loops in order to include choices underneath in dependencies (0=False, 1=True); if lt_dependency <= 0, this should always be 0 (False); if lt_dependency > 0, this can be 1 or 0 (True or False) (default 10)
max_repeat maximum number of repetitions of a loop (only used when unfolding is True) (default 10)
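
As a sketch, individual parameters can be overridden through the parameters dictionary; here it is assumed that the names in the table above are used directly as dictionary keys:

from pm4py.algo.simulation.tree_generator import factory as tree_gen_factory
# generate a tree with between 5 and 20 visible activities, most frequently 10
parameters = {"min": 5, "mode": 10, "max": 20}
tree = tree_gen_factory.apply(parameters=parameters)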

Generation of a log out of a process tree

The code snippet can be used to generate a log, with 100 cases, out of the process tree.
from pm4py.objects.process_tree import semantics
log = semantics.generate_log(tree, no_traces=100)
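
Since the generated log extends the Python list data structure like any other PM4Py event log, it can be inspected directly:

print(len(log)) #prints the number of generated traces (here, 100)
print(log[0]) #prints the first generated trace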

Conversion into Petri net

The code snippet can be used to convert the process tree into a Petri net.
from pm4py.objects.conversion.process_tree import factory as pt_conv_factory
net, im, fm = pt_conv_factory.apply(tree)
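
To check the result of the conversion, the obtained net can be visualized. The snippet below assumes the standard PM4Py Petri net visualization factory located in pm4py.visualization.petrinet:

from pm4py.visualization.petrinet import factory as pn_vis_factory
gviz = pn_vis_factory.apply(net, im, fm)
pn_vis_factory.view(gviz)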

Visualize a Process Tree

A process tree can be printed, as shown on the right-hand side.
print(tree)
A process tree can also be visualized, as shown on the right-hand side.
from pm4py.visualization.process_tree import factory as pt_vis_factory
gviz = pt_vis_factory.apply(tree, parameters={"format": "png"})
pt_vis_factory.view(gviz)
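
Assuming the process tree visualization factory also offers the usual save function of the PM4Py visualization factories, the rendering can be stored on disk instead of being viewed:

pt_vis_factory.save(gviz, "tree.png")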

Decision Trees

Decision trees are objects that help to understand the conditions leading to a particular outcome. In this section, several examples related to the construction of decision trees are provided.

Ideas behind the building of decision trees are provided in the scientific paper: de Leoni, Massimiliano, Wil M.P. van der Aalst, and Marcus Dees. 'A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs.'

The general scheme is the following:

  • A representation of the log, on a given set of features, is obtained (for example, using one-hot encoding on string attributes and keeping numeric attributes as-they-are)
  • A representation of the target classes is constructed
  • The decision tree is calculated
  • The decision tree is visualized

Decision tree about the ending activity of a process

A process instance may potentially finish with different activities, signaling different outcomes of the process instance. A decision tree may help to understand the reasons behind each outcome.

First, a log is loaded. Then, a representation of the log on a given set of features is obtained, using the following parameters:

Parameter Meaning
str_trace_attributes contains the attributes of type string, at trace level, that are one-hot encoded in the final matrix.
str_event_attributes contains the attributes of type string, at event level, that are one-hot-encoded in the final matrix.
num_trace_attributes contains the numeric attributes, at trace level, that are inserted in the final matrix.
num_event_attributes contains the numeric attributes, at event level, that are inserted in the final matrix.
import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "roadtraffic50traces.xes"))

from pm4py.objects.log.util import get_log_representation
str_trace_attributes = []
str_event_attributes = ["concept:name"]
num_trace_attributes = []
num_event_attributes = ["amount"]
data, feature_names = get_log_representation.get_representation(
                           log, str_trace_attributes, str_event_attributes,
                           num_trace_attributes, num_event_attributes)
Or an automatic representation (automatic selection of the attributes) could be obtained:
data, feature_names = get_log_representation.get_default_representation(log)
Then, the target classes are formed. Each endpoint of the process belongs to a different class.
from pm4py.objects.log.util import get_class_representation
target, classes = get_class_representation.get_class_representation_by_str_ev_attr_value_value(log, "concept:name")
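As a sketch, the obtained class representation can be inspected; here it is assumed that target contains one class index per trace and classes the corresponding labels:

print(classes) #prints the labels of the discovered classes (the ending activities)
print(target) #prints, for each trace, the index of the class it belongs to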
The decision tree can then be calculated and visualized.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(data, target)

from pm4py.visualization.decisiontree import factory as dt_vis_factory
gviz = dt_vis_factory.apply(clf, feature_names, classes)
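
Assuming the decision tree visualization factory exposes the same view function as the other PM4Py visualization factories, the obtained visualization can then be displayed:

dt_vis_factory.view(gviz)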

Decision tree about the duration of a case (Root Cause Analysis)

A decision tree about the duration of a case helps to understand the reasons behind a high case duration (or, at least, a case duration that is above a given threshold).

First, a log has to be loaded, and a representation of the log on a given set of features is obtained. The parameters str_trace_attributes, str_event_attributes, num_trace_attributes and num_event_attributes have the same meaning as in the previous section.

import os
from pm4py.objects.log.importer.xes import factory as xes_importer
log = xes_importer.apply(os.path.join("tests", "input_data", "roadtraffic50traces.xes"))

from pm4py.objects.log.util import get_log_representation
str_trace_attributes = []
str_event_attributes = ["concept:name"]
num_trace_attributes = []
num_event_attributes = ["amount"]

data, feature_names = get_log_representation.get_representation(log, str_trace_attributes, str_event_attributes,
                                                             num_trace_attributes, num_event_attributes)
Or an automatic representation (automatic selection of the attributes) could be obtained:
data, feature_names = get_log_representation.get_default_representation(log)
Then, the target classes are formed. There are two classes: traces with a duration below the specified threshold (here, 200 days; note that the threshold is given in seconds), and traces with a duration above the threshold.
from pm4py.objects.log.util import get_class_representation
target, classes = get_class_representation.get_class_representation_by_trace_duration(log, 2 * 8640000)
The decision tree can then be calculated and visualized.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(data, target)

from pm4py.visualization.decisiontree import factory as dt_vis_factory
gviz = dt_vis_factory.apply(clf, feature_names, classes)
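
As before, assuming the decision tree visualization factory exposes a view function, the result can be displayed:

dt_vis_factory.view(gviz)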