A Reproducible Polygon Dataset for Address Clustering
Level: MA
Responsible person: Syed Muhammad Yasir
Keywords: Polygon, EVM, data collection, execution traces, ERC-20/721, reproducibility
This thesis builds a small, fully reproducible dataset for the Polygon PoS chain. The
student will select a short, fixed block range and collect blocks, transactions, receipts,
logs, execution traces (internal calls, obtained via Bor/Geth-compatible debug/trace
methods where available), and ERC-20/721 transfer and approval events. All records will
be normalized into consistent, analysis-ready CSV/Parquet tables and accompanied by a
machine-readable manifest (block range, chain ID, client/RPC details, JSON-RPC methods
used, schema version, file hashes), so that the slice can be regenerated exactly at any
later date and checked against the recorded hashes. If time allows, optional scripts may
export simple graph views, but the core deliverable is the reproducible dataset itself.
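As an illustration of the manifest idea, the following is a minimal sketch, not a prescribed design. It assumes the exported tables are .csv or .parquet files in a single directory; the function names (sha256_of_file, build_manifest, write_manifest) and the manifest field names are hypothetical and chosen only for this example. It records the fields listed above and a SHA-256 hash per table file.

```python
import hashlib
import json
from pathlib import Path

# File types treated as dataset tables (an assumption of this sketch).
TABLE_SUFFIXES = {".csv", ".parquet"}


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, start_block, end_block, chain_id,
                   client_version, rpc_methods, schema_version):
    """Collect the reproducibility metadata for one exported block range.

    All field names are hypothetical; the topic description only lists
    which kinds of information the manifest should contain.
    """
    data_dir = Path(data_dir)
    # Hash every table file, keyed by its path relative to the data directory.
    files = {
        path.relative_to(data_dir).as_posix(): sha256_of_file(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file() and path.suffix in TABLE_SUFFIXES
    }
    return {
        "schema_version": schema_version,
        "chain_id": chain_id,
        "block_range": {"start": start_block, "end": end_block},
        "client_version": client_version,
        "rpc_methods": sorted(rpc_methods),
        "files": files,
    }


def write_manifest(manifest, path):
    """Write the manifest as JSON with sorted keys, so output is deterministic."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(path).write_text(text, encoding="utf-8")
```

A call might pass chain_id=137 (Polygon PoS mainnet), the string returned by the node's web3_clientVersion method as client_version, and the list of JSON-RPC methods actually used, for example eth_getBlockByNumber, eth_getTransactionReceipt, eth_getLogs and debug_traceTransaction. Re-running the export later and comparing the new file hashes with the stored ones then shows whether the slice was regenerated exactly; this only works if the table writers themselves produce byte-identical output for identical input.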
References:
• Polygon Labs. Polygon PoS Architecture (Heimdall & Bor). Docs.
https://docs.polygon.technology/pos/architecture/overview/
• go-ethereum (Geth). debug namespace (trace APIs). Docs.
https://geth.ethereum.org/docs/interacting-with-geth/rpc/ns-debug
• Erigon. trace module (transaction/internal-call tracing). Docs.
https://docs.erigon.tech/advanced/JSONRPC-trace-module
• EIP-20 (ERC-20): Token Standard. https://eips.ethereum.org/EIPS/eip-20
• EIP-721 (ERC-721): Non-Fungible Token Standard.
https://eips.ethereum.org/EIPS/eip-721
• Blockchain-ETL. ethereum-etl (export Ethereum to CSV/DB). GitHub.
https://github.com/blockchain-etl/ethereum-etl
• Blockchain-ETL. polygon-etl (export Polygon to CSV/DB). GitHub.
https://github.com/blockchain-etl/polygon-etl
• Thürkauf, D. (2023). Address Clustering Heuristics for Account-Based Blockchain
Networks: An Analysis based on a Decentraland User Set.
• Victor, F., & Lüders, B. K. (2019). Measuring Ethereum-Based ERC-20 Token
Networks. In FC 2019, LNCS 11598, 113–129.