International Journal For Multidisciplinary Research

E-ISSN: 2582-2160     Impact Factor: 9.24

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

The Making of an Data Pipeline

Author(s) Harsh Kaushik, Avnish Rai, Gaurav Kapasiya, Jai Prakash Bhati
Country India
Abstract This paper details the development and implementation of a data engineering pipeline designed for the extraction, transformation, and loading (ETL) of data from a web-based directory. The project involves using asynchronous web scraping techniques to gather user details from a local business directory, transforming the data into a structured format, and loading it into a storage solution. The pipeline utilises Python, the HTTPX library for asynchronous HTTP requests, BeautifulSoup for HTML parsing, and Amazon S3 for data storage. By leveraging these technologies, the pipeline demonstrates an efficient approach to handling large-scale web data extraction and processing, significantly reducing the time required to gather and organise data from multiple web pages. This paper provides insights into the architecture, implementation, and performance of the ETL pipeline, highlighting the benefits and challenges of using asynchronous programming in data engineering.
Keywords ETL, Data Engineering, Python, Async, Web Scraping
Field Engineering
Published In Volume 6, Issue 3, May-June 2024
Published On 2024-05-21
Cite This The Making of an Data Pipeline - Harsh Kaushik, Avnish Rai, Gaurav Kapasiya, Jai Prakash Bhati - IJFMR Volume 6, Issue 3, May-June 2024. DOI 10.36948/ijfmr.2024.v06i03.20849
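
The extract, transform, and load flow summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: to stay self-contained it uses only the Python standard library, with `asyncio.sleep` standing in for an HTTPX request, `html.parser` standing in for BeautifulSoup, and a JSON string standing in for the Amazon S3 upload. The page paths, HTML layout, field names, and the helpers `fetch`, `transform`, and `run_pipeline` are all hypothetical.

```python
import asyncio
import json
from html.parser import HTMLParser

# Hypothetical in-memory pages standing in for the web directory (assumption).
FAKE_PAGES = {
    "/page/1": '<div class="entry"><span class="name">Asha Traders</span>'
               '<span class="phone">111</span></div>',
    "/page/2": '<div class="entry"><span class="name">Bharat Tools</span>'
               '<span class="phone">222</span></div>',
}


class EntryParser(HTMLParser):
    """Collects one dict per <div class="entry"> (assumed page layout)."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "entry":
            self.records.append({})
        elif tag == "span" and cls in ("name", "phone") and self.records:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


async def fetch(path):
    """Extract: simulated network call; a real pipeline would issue an HTTP GET."""
    await asyncio.sleep(0.01)  # stands in for request latency
    return FAKE_PAGES[path]


def transform(html):
    """Transform: turn raw HTML into a list of structured records."""
    parser = EntryParser()
    parser.feed(html)
    return parser.records


async def run_pipeline(paths):
    # All fetches run concurrently; gather returns results in input order.
    pages = await asyncio.gather(*(fetch(p) for p in paths))
    records = [rec for page in pages for rec in transform(page)]
    # Load: serialise to JSON; a real pipeline would upload this payload.
    return json.dumps(records)


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(sorted(FAKE_PAGES))))
```

Because the simulated fetches are awaited together through `asyncio.gather`, the total wait is close to the slowest single request rather than the sum of all requests, which is the kind of time saving the abstract attributes to asynchronous scraping.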