
The impact of biased sampling of event logs on the performance of process discovery. (English) Zbl 1473.68141

Summary: Process discovery algorithms derive process models from event data captured during the execution of business processes. These algorithms tend to use the whole event data. For large event data, this is no longer feasible on standard hardware within a limited time. A straightforward way to overcome this problem is to down-size the data using a random sampling method. However, little research has been conducted on selecting the right sample, given the available time and the characteristics of the event data. This paper systematically evaluates various biased sampling methods, assessing their performance on different datasets using four different discovery techniques. Our experiments show that biased sampling can considerably speed up discovery techniques without sacrificing the quality of the resulting process model. Furthermore, because applying the sampling technique implicitly filters the data (removes outliers), the model quality may even improve.
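
The contrast between random and biased sampling can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say which biased strategies the paper evaluates, so the frequency-based variant selection below is a hypothetical stand-in, and the function names and the toy log are invented for the example.

```python
import random
from collections import Counter


def random_sample(log, k, seed=0):
    """Uniform random sample of k traces (the unbiased baseline)."""
    rng = random.Random(seed)
    return rng.sample(log, k)


def frequency_biased_sample(log, k):
    """Hypothetical biased strategy: keep one representative trace for
    each of the k most frequent trace variants (distinct activity sequences)."""
    variants = Counter(tuple(trace) for trace in log)
    return [list(variant) for variant, _ in variants.most_common(k)]


# Toy event log: each trace is the sequence of activities of one case.
log = (
    [["a", "b", "c"]] * 5
    + [["a", "c", "b"]] * 3
    + [["a", "b", "b", "c"]]  # rare variant, arguably an outlier
)

sample = frequency_biased_sample(log, 2)
print(sample)
```

In this sketch the rare variant never reaches the sample, which mirrors the implicit outlier filtering mentioned in the summary; a discovery algorithm would then run on the sample instead of the full log.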

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62D99 Statistical sampling theory and related topics
68T09 Computational aspects of data analysis and big data
