Data Ingestion Frameworks for Data Lakes: An Overview


Hamza Elkina
Taher Zaki

Abstract

Information is now regarded as a new form of organizational capital because it underpins the decisions made by the steering committee; incorrect or incomplete information can cause significant losses. In higher education, the emergence of new technologies and the growing number of devices and connected users have created new data sources such as software packages, platforms, and social networks. Faced with this volume of information, known as big data, traditional databases proved unable to manage and process the flow, which led to data warehousing, a storage technology that has handled structured data sources efficiently for many years. The need to manage semi-structured and unstructured data has in turn called for new solutions such as data lakes. As a new technology, the data lake remains ambiguous because it is often presented incompletely or in an overly complicated way. We therefore compare the data warehouse with the data lake and discuss big data and data lakes as a new data management technology in higher education, presenting the data lake's essential components, chiefly the data pipeline, which is rarely covered in the literature. To that end, we conduct a comparative study to evaluate existing data pipeline solutions and propose more valuable ones. In addition, we introduce the university's data lake ecosystem to disambiguate the data lake concept and its essential components.

Article Details

How to Cite
Hamza Elkina, & Taher Zaki. (2023). Data Ingestion Frameworks for Data Lakes: An Overview. Journal for ReAttach Therapy and Developmental Diversities, 6(10s(2)), 1700–1708. https://doi.org/10.53555/jrtdd.v6i10s(2).2220
Author Biographies

Hamza Elkina

Innovation in Mathematics and Intelligent Systems Research Laboratory, Faculty of Applied Sciences, Ibnou Zohr University, Ait Melloul, Morocco

Taher Zaki

Innovation in Mathematics and Intelligent Systems Research Laboratory, Faculty of Applied Sciences, Ibnou Zohr University, Ait Melloul, Morocco
