
Proceedings of the Southwest State University


ETL Process Efficiency for Predictive Analytics

https://doi.org/10.21869/2223-1560-2024-28-4-67-85

Abstract

   Purpose of research. This paper investigates the effectiveness of different methods for handling missing values in dataframes during data preprocessing for predictive analytics. Three open datasets containing building characteristics, meteorological conditions, and energy consumption serve as test data.

   The goal of the study is to identify the most effective data preprocessing method within the ETL process for predictive analytics tasks.

   Methods. The dataframes from each dataset are merged, and standard methods of Pandas, a high-level Python library, are analyzed: direct assignment, the use of indexers, and the fillna method with a dictionary. In addition, a module is developed in Cython, a superset of Python that compiles to C, to optimize the filling of missing values, and the performance of each method is evaluated.
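The three Pandas approaches named above can be illustrated with a minimal sketch. The abstract does not give the authors' code, so the toy dataframe, the column names, the constant fill values, and the exact reading of "direct assignment" and "indexers" below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are assumptions,
# not taken from the datasets used in the paper.
df = pd.DataFrame({
    "air_temperature": [20.5, np.nan, 18.0, np.nan],
    "wind_speed": [3.1, 2.4, np.nan, 1.0],
})
# Assumed fill values; the paper's actual choices are not stated in the abstract.
fill_values = {"air_temperature": 0.0, "wind_speed": 0.0}

# 1. Direct assignment: overwrite each column with its filled version.
direct = df.copy()
for col, value in fill_values.items():
    direct[col] = direct[col].fillna(value)

# 2. Indexer: write through .loc using a boolean mask of the missing rows.
indexed = df.copy()
for col, value in fill_values.items():
    indexed.loc[indexed[col].isna(), col] = value

# 3. fillna with a dictionary that maps column names to fill values.
by_dict = df.fillna(fill_values)
```

All three variants produce the same filled dataframe here; they differ only in how the write is expressed, which is what a performance comparison of this kind would measure.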

   Results. Direct assignment proved to be the fastest of the Pandas methods. Cython, although theoretically capable of speeding up calculations, in this case caused a significant decrease in performance because of the overhead of data conversion and of the interaction between Python and Cython. Code profiling confirmed that the bottleneck lies in the Pandas operations rather than in the execution of the Cython code.
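A comparison of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the authors' benchmark: the synthetic data, the helper names fill_direct, fill_loc, and fill_dict, and the repetition counts are all hypothetical, and no timing figures are claimed here because they depend on the machine and the Pandas version.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np
import pandas as pd

# Synthetic data with roughly 10% missing values (an assumption for illustration).
rng = np.random.default_rng(0)
values = rng.normal(size=(10_000, 2))
values[rng.random(values.shape) < 0.1] = np.nan
df = pd.DataFrame(values, columns=["a", "b"])
fill_values = {"a": 0.0, "b": 0.0}


def fill_direct(frame):
    """Direct assignment of filled columns (hypothetical helper)."""
    out = frame.copy()
    for col, value in fill_values.items():
        out[col] = out[col].fillna(value)
    return out


def fill_loc(frame):
    """Masked write through the .loc indexer (hypothetical helper)."""
    out = frame.copy()
    for col, value in fill_values.items():
        out.loc[out[col].isna(), col] = value
    return out


def fill_dict(frame):
    """fillna with a column-to-value dictionary (hypothetical helper)."""
    return frame.fillna(fill_values)


# Best-of-three wall-clock time for 10 calls of each variant.
timings = {
    func.__name__: min(timeit.repeat(lambda: func(df), number=10, repeat=3))
    for func in (fill_direct, fill_loc, fill_dict)
}

# Profile one variant to see where the time is spent.
profiler = cProfile.Profile()
profiler.enable()
fill_loc(df)
profiler.disable()
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

Printing `timings` and `report` shows the relative cost of each variant and the calls that dominate the run time, which is the kind of evidence the profiling statement above relies on.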

   Conclusion. For most ETL tasks, optimized Pandas methods are recommended. Cython should be used only when a performance improvement is critically needed, and then with careful optimization of the code to minimize overhead. Writing code that replicates Pandas functionality requires significant resources, including for its optimization, and in most cases this effort is unjustified.

About the Authors

A. V. Oleynikova
Astrakhan State Technical University
Russian Federation

Alla V. Oleynikova, Post-Graduate Student

Applied Informatics Department

414056; 16 Tatishcheva str.; Astrakhan


Competing Interests:

The authors declare the absence of obvious and potential conflicts of interest related to the publication of this article



I. O. Bondareva
Astrakhan State Technical University
Russian Federation

Irina O. Bondareva, Cand. of Sci. (Engineering),
Associate Professor, Head of Department

414056; 16 Tatishcheva str.; Astrakhan





A. A. Oleynikov
Astrakhan State Technical University
Russian Federation

Aleksandr A. Oleynikov, Cand. of Sci. (Engineering), Associate Professor

Applied Informatics Department

414056; 16 Tatishcheva str.; Astrakhan







For citations:


Oleynikova A.V., Bondareva I.O., Oleynikov A.A. ETL Process Efficiency for Predictive Analytics. Proceedings of the Southwest State University. 2024;28(4):67-85. (In Russ.) https://doi.org/10.21869/2223-1560-2024-28-4-67-85



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2223-1560 (Print)
ISSN 2686-6757 (Online)