A Generalized Pipeline for Blockchain Data ETL

Level: MAP
Responsible person: Tao Yan
Keywords: Blockchain data, ETL

Blockchain data is fundamental for monitoring, analyzing, and comparing blockchain ecosystems. However, although this data is publicly accessible, collecting and organizing it accurately remains challenging due to a lack of standardized tools and domain-specific expertise.

This project aims to design and implement a generalized framework for blockchain data extraction, transformation, and loading (ETL). Its goal is to provide a unified and scalable infrastructure that enables the collection, processing, and analysis of data from different blockchains in a consistent and efficient way. The framework will support multiple blockchain architectures, particularly EVM-compatible networks, and will standardize the retrieval of key data elements such as blocks, transactions, and logs. It will integrate essential components for data streaming, storage, caching, and synchronization. Above all, it must be highly scalable so that new data can be integrated easily.
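To make the extract, transform, and load stages concrete, the following is a minimal Python sketch for a single EVM block. It is an illustration under stated assumptions, not the project's design: all function and table names are hypothetical, the node is represented by an injected `call(method, params)` function, and SQLite stands in for whatever storage layer the framework eventually uses. The only external convention relied upon is the standard Ethereum JSON-RPC method `eth_getBlockByNumber`, which returns quantities as hex strings.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, List

# All names below are hypothetical and chosen for illustration only.
RpcCall = Callable[[str, List[Any]], Any]


def rpc_payload(method: str, params: List[Any], request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 request body that could be POSTed to an EVM node."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


def extract_block(call: RpcCall, number: int) -> Dict[str, Any]:
    """Extract: fetch one block, including full transaction objects."""
    return call("eth_getBlockByNumber", [hex(number), True])


def transform_block(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Transform: decode hex quantities and flatten the transactions."""
    block_number = int(raw["number"], 16)
    transactions = [
        {
            "hash": tx["hash"],
            "block_number": block_number,
            "from_address": tx["from"],
            "to_address": tx.get("to"),  # None for contract creation
            # Stored as text because wei amounts can exceed 64-bit integers.
            "value_wei": str(int(tx["value"], 16)),
        }
        for tx in raw.get("transactions", [])
    ]
    block = {
        "number": block_number,
        "hash": raw["hash"],
        "timestamp": int(raw["timestamp"], 16),
    }
    return {"block": block, "transactions": transactions}


def load(conn: sqlite3.Connection, record: Dict[str, Any]) -> None:
    """Load: write one block idempotently, so re-running a range is safe."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks "
        "(number INTEGER PRIMARY KEY, hash TEXT, timestamp INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(hash TEXT PRIMARY KEY, block_number INTEGER, "
        "from_address TEXT, to_address TEXT, value_wei TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO blocks VALUES (:number, :hash, :timestamp)",
        record["block"],
    )
    conn.executemany(
        "INSERT OR REPLACE INTO transactions VALUES "
        "(:hash, :block_number, :from_address, :to_address, :value_wei)",
        record["transactions"],
    )
    conn.commit()


def run_pipeline(call: RpcCall, conn: sqlite3.Connection, start: int, end: int) -> None:
    """Process the inclusive block range [start, end] one block at a time."""
    for number in range(start, end + 1):
        load(conn, transform_block(extract_block(call, number)))
```

Passing the node in as a function keeps the sketch independent of any particular client library and lets it be exercised against canned responses. Supporting another chain or another data element (for example logs) would then mean adding a new extract/transform pair rather than changing the loader.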

Students with strong programming skills in Python, experience with database technologies, and an interest in exploring how to build reliable and accessible blockchain data infrastructure are welcome to participate.

References: 

[1] Yan, T., Li, S., Kraner, B., Zhang, L., & Tessone, C. J. (2025). A Data Engineering Framework for Ethereum Beacon Chain Rewards: From Data Collection to Decentralization Metrics. Scientific Data, 12(1), 519. 

[2] https://github.com/uzh-eth-mp/app