pm4py.discovery.discover_batches(log: Union[EventLog, DataFrame], merge_distance: int = 900, min_batch_size: int = 2, activity_key: str = 'concept:name', timestamp_key: str = 'time:timestamp', case_id_key: str = 'case:concept:name', resource_key: str = 'org:resource') → List[Tuple[Tuple[str, str], int, Dict[str, Any]]]

Discover batches from the provided log object

An activity is executed in batches by a given resource when that resource executes the same activity several times within a short period of time.

Identifying such activities may reveal points of the process that can be automated, since the work of the person involved may be repetitive.

The following categories of batches are detected:
  • Simultaneous (all the events in the batch have identical start and end timestamps)

  • Batching at start (all the events in the batch have an identical start timestamp)

  • Batching at end (all the events in the batch have an identical end timestamp)

  • Sequential batching (for all the consecutive events, the end of the first is equal to the start of the second)

  • Concurrent batching (for all the consecutive events that are not sequentially matched)
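
The five categories above can be illustrated with a small stand-alone sketch. It is not pm4py's implementation: the function name classify_batch, the (start, end) tuple representation, and the order in which the rules are checked are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: each event is a (start, end) pair in seconds.
Interval = Tuple[float, float]


def classify_batch(intervals: List[Interval]) -> str:
    """Return the batch category for the events of a single batch."""
    ivs = sorted(intervals)
    starts = {start for start, _ in ivs}
    ends = {end for _, end in ivs}

    # Identical start and end timestamps for every event.
    if len(starts) == 1 and len(ends) == 1:
        return "Simultaneous"
    # Identical start timestamp only.
    if len(starts) == 1:
        return "Batching at start"
    # Identical end timestamp only.
    if len(ends) == 1:
        return "Batching at end"
    # Each event starts exactly when the previous one ends.
    if all(prev[1] == nxt[0] for prev, nxt in zip(ivs, ivs[1:])):
        return "Sequential batching"
    # Anything left over is treated as concurrent.
    return "Concurrent batching"
```

For example, two events covering (0, 5) and (5, 10) fall under sequential batching, while (0, 5) and (3, 10) overlap without sharing a start or end timestamp and are therefore treated as concurrent.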

The approach has been described in the following paper: Martin, N., Swennen, M., Depaire, B., Jans, M., Caris, A., & Vanhoof, K. (2015, December). Batch Processing: Definition and Event Log Identification. In SIMPDA (pp. 137-140).

The output is a (sorted) list of tuples. Each tuple contains:
  • Index 0: the activity-resource for which at least one batch has been detected

  • Index 1: the number of batches for the given activity-resource

  • Index 2: a list containing all the batches. Each batch is described by:

    1. The start timestamp of the batch
    2. The complete timestamp of the batch
    3. The list of events that are executed in the batch

Parameters:

  • log – event log / Pandas dataframe

  • merge_distance (int) – the maximum time distance, in seconds, between non-overlapping intervals for them to be considered as belonging to the same batch (default: 15*60 = 900 seconds, i.e. 15 minutes)

  • min_batch_size (int) – the minimum number of events for a batch to be considered (default: 2)

  • activity_key (str) – attribute to be used for the activity

  • timestamp_key (str) – attribute to be used for the timestamp

  • case_id_key (str) – attribute to be used as case identifier

  • resource_key (str) – attribute to be used as resource
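
How merge_distance and min_batch_size interact can be sketched as follows. This is a simplified illustration and not the library's code: the helper group_into_batches and its (start, end) tuples in seconds are hypothetical, and the intervals are assumed to belong to a single activity-resource pair.

```python
from typing import List, Tuple

# Hypothetical representation: each event is a (start, end) pair in seconds.
Interval = Tuple[float, float]


def group_into_batches(
    intervals: List[Interval],
    merge_distance: float = 900,
    min_batch_size: int = 2,
) -> List[List[Interval]]:
    """Group intervals whose gap is at most merge_distance, then drop small groups."""
    batches: List[List[Interval]] = []
    current: List[Interval] = []
    current_end = 0.0

    for start, end in sorted(intervals):
        # Overlapping intervals give a negative gap and are always merged.
        if current and start - current_end <= merge_distance:
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            # Close the previous group, keeping it only if it is large enough.
            if len(current) >= min_batch_size:
                batches.append(current)
            current = [(start, end)]
            current_end = end

    if len(current) >= min_batch_size:
        batches.append(current)
    return batches
```

With the default values, events at (0, 60), (300, 360) and (1000, 1060) form one batch because every gap is below 900 seconds, while a further event at (5000, 5060) is left alone and discarded since a single event is below min_batch_size.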

Return type:

List[Tuple[Tuple[str, str], int, Dict[str, Any]]]

import pm4py

batches = pm4py.discovery.discover_batches(dataframe, activity_key='concept:name', case_id_key='case:concept:name', timestamp_key='time:timestamp', resource_key='org:resource')