Data Ingestion Frameworks for Data Lakes: An Overview

Hamza Elkina
Taher Zaki


Information is now regarded as a new form of organizational capital because it underpins the decisions made by the steering committee; incorrect or incomplete information can cause significant losses. In higher education, the emergence of new technologies and the growing number of devices and connected users have created new data sources such as software packages, platforms, and social networks. Faced with this volume of information, known as big data, traditional databases proved unable to manage and process the flow, which led to data warehousing, a storage technology that has for many years allowed efficient handling of structured data sources. The need to manage semi-structured and unstructured data has in turn called for new solutions such as data lakes. As a new technology, the data lake remains an ambiguous concept because it is often presented incompletely or in an overly complicated way. We therefore compare the data warehouse with the data lake and discuss big data and data lakes as a new data management technology in higher education, presenting the essential components of a data lake, principally the data pipeline, which is rarely covered in the literature. To that end, we conducted a comparative study to evaluate existing data pipeline solutions and propose more valuable ones. In addition, we introduce the university's data lake ecosystem to disambiguate the data lake concept and its essential components.

Article Details

How to Cite

Hamza Elkina, & Taher Zaki. (2023). Data Ingestion Frameworks for Data Lakes: An Overview. Journal for ReAttach Therapy and Developmental Diversities, 6(10s(2)), 1700–1708.

Author Biographies

Hamza Elkina

Innovation in Mathematics and Intelligent Systems Research Laboratory, Faculty of Applied Sciences, Ibnou Zohr University, Ait Melloul, Morocco

Taher Zaki

Innovation in Mathematics and Intelligent Systems Research Laboratory, Faculty of Applied Sciences, Ibnou Zohr University, Ait Melloul, Morocco

