DISTRIBUTED DEVELOPMENT OF THE ALGORITHM FOR THE DETECTION OF LOCAL OUTLIER FACTORS IN APACHE SPARK

Distributed development of the algorithm for detecting local anomalous factors in Apache Spark

Authors

  • Raynel Roberto Rodríguez Oliva, Empresa Correos de Cuba
  • Lester Guerra Denis, Universidad Tecnológica de La Habana "José Antonio Echeverría"
  • Humberto Díaz Pando, Universidad Tecnológica de La Habana "José Antonio Echeverría"

Keywords:

Apache Spark, Local outlier factor, MapReduce, parallel processing

Abstract

Advances in information and communications technologies have driven growth in the volume of data stored and/or exchanged electronically. Data mining techniques can extract knowledge from these stored data, and one data mining task is the detection of outliers. When the volume of stored data cannot be processed by traditional infrastructures, more efficient ways of processing information are needed. Parallel processing allows several processes to execute concurrently, which provides much greater computing power. The objective of this research is to develop the local outlier factor (LOF) detection algorithm so that it runs on Apache Spark, which implements the MapReduce programming model. Two variants are proposed: the first is deterministic, and the second is more efficient than the first but produces approximate results. The experiments carried out, together with the results of non-parametric hypothesis tests, show that the proposed variants reduce execution time relative to the sequential version of the algorithm.
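
For orientation, the sequential computation that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of the standard LOF definition in plain Python, not the authors' implementation and not either of the proposed Spark variants. The function names `k_nearest` and `local_outlier_factors` are hypothetical, and the sketch assumes Euclidean distance, exactly k neighbours per point (ties in distance are not expanded), and no duplicate points.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbours of points[i]."""
    others = [(dist(points[i], q), j) for j, q in enumerate(points) if j != i]
    return sorted(others)[:k]


def local_outlier_factors(points, k):
    """Compute one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    knn = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nbrs[-1][0] for nbrs in knn]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability
        # distance from point i to each of its neighbours j, where
        # reach-dist(i, j) = max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], d) for d, j in knn[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: mean ratio between the neighbours' densities and the point's own.
    return [
        sum(lrds[j] for _, j in knn[i]) / (len(knn[i]) * lrds[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for point, score in zip(data, local_outlier_factors(data, k=2)):
        print(point, round(score, 2))
```

The all-pairs neighbour search in `k_nearest` costs quadratic time in the number of points, which is the kind of cost that motivates distributing the computation.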

Published

2020-04-24

How to Cite

Rodríguez Oliva, R. R., Guerra Denis, L., & Díaz Pando, H. (2020). DISTRIBUTED DEVELOPMENT OF THE ALGORITHM FOR THE DETECTION OF LOCAL OUTLIER FACTORS IN APACHE SPARK: Distributed development of the algorithm for detecting local anomalous factors in Apache Spark. Revista Cubana De Transformación Digital, 1(1), 119–131. Retrieved from https://rctd.uic.cu/rctd/article/view/49

Section

Original Articles - Technologies Artificial Intelligence