Polkadot Data platform with deep indexed data and staking reports
- Team Name: P2P.ORG Validator
- Payment Address: 0xE22211Ba98213c866CC5DC8d7D9493b1e7EFD25A (USDC)
- Level: 2
Project Overview 📄
This proposal is P2P.org's solution to the problem described in the Analytics Website/Data Platform RFP.
Overview and Ecosystem Fit
The adoption of a data-driven culture was a huge boost for web2: it allowed the fintech industry to speed up decision-making and the development of new products, reduce financial fraud, and provide flexible risk management.
The success of Dune Analytics and the emergence of 15+ Dune-like platforms are evidence that data analysis tools are especially relevant for web3 today, and the demand for community-driven data insights continues to grow.
P2P.org is one of the biggest validators, participating in more than 40 PoS chains. That is why data, monitoring, and performance analysis are core to our decision making. We started to build our data function on external platforms such as Dune, but we soon found that no existing platform could help with even the most common data challenges:
- data coverage: no Polkadot data and no off-chain data about DeFi protocols such as Chainlink;
- lack of streaming: data cannot be received close to real time, which makes it unsuitable for proactive tasks such as performance monitoring;
- entry threshold for analytics: we have highly skilled data analysts who can work with any data, but business people cannot get an answer about a new metric quickly.
Solving these problems internally, we realized that our experience and data insights could be helpful for the crypto community, especially for new projects such as new Cosmos chains, where there are not many users yet and the economy has only just started to evolve. We can help new projects with data-driven promotion and go deeper with data analysis in the chains already covered by existing data platforms.
According to the RFP, the Polkadot ecosystem needs to:
- extract on-chain data and transform it so that it is available for SQL querying by data analysts
- provide public access through a query engine to build data sets on top of the indexed data
- provide a UI for building dashboards on top of data sets to visualize data insights for decision makers
We have seen other teams already building solutions for the RFP and want to describe the key features which, in our vision, could provide the best user experience:
- modular architecture, which allows you:
  - to use any indexer, such as Subsquid, Subquery, or a custom indexer (Mbelt in P2P.org's case)
  - to use any database as a DWH for storing public data (in the current RFP we will use BigQuery as the DWH)
  - to customize the ETL process for any indexer/DWH combination via Airflow DAGs
  - to customize the solution for a real-time data enrichment process via direct streaming to the DWH
- a dashboard with SLI/SLO for the data quality of the ETL pipeline - we provide transparent and clear data quality metrics to estimate the quality of the ETL
- P2P.org infrastructure (P2P RPC for indexing)
Potential users
We define the users of the platform as data people who generate insights from data to provide a data-driven basis for decision makers. These people have:
- the technical skills to write SQL queries
- the basic statistics knowledge to define, calculate, and estimate metrics for the business
- expertise in a particular domain (such as NFTs or a particular DeFi protocol)
The Dune platform (the market leader) currently has about 20k users. Dune focuses on Ethereum data and DeFi in the Ethereum ecosystem, which currently has about 400k active wallets. So the current state of the crypto industry demonstrates a high proportion of data people to all token holders, around 5%.
We believe that with mass adoption this proportion will drop, but we still need to involve more people. Our long-term vision for this problem is to provide the ability to work with data without SQL knowledge. That is why we are developing Kolmo, an AI assistant.
Our expectation is that we can first attract users from the Polkadot ecosystem, which has not been covered by the current data platforms. After that we are going to provide an innovative opportunity to make data-driven decisions without SQL skills.
Free usage
The only limitation for all users is the number of TB scanned by free SQL query execution. We would be able to charge users for scanning more data.
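As an illustration of how such a per-query limit could be enforced, here is a minimal sketch that uses BigQuery's dry-run mode to estimate how many bytes a query would scan before executing it. The project id, table name, and the 1 TB threshold are placeholders for the example, not the final public values.

```python
from google.cloud import bigquery

# Hypothetical free-tier limit: 1 TB scanned per query (illustrative value only).
FREE_TIER_BYTES = 1 * 1024**4

client = bigquery.Client(project="p2p-data-platform")  # placeholder project id

def check_query_cost(sql: str) -> int:
    """Dry-run the query and return the number of bytes it would scan."""
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed

# Placeholder query against a hypothetical staking table.
sql = "SELECT validator, SUM(reward) FROM staking.rewards GROUP BY validator"
scanned = check_query_cost(sql)
if scanned > FREE_TIER_BYTES:
    print(f"Query would scan {scanned / 1024**4:.2f} TB - above the free tier")
else:
    print(f"Query scans {scanned / 1024**3:.2f} GB - within the free tier")
```

Because the dry run never executes the query, the check itself is free and can run before every submission from the BI tool.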
Project Details
We have already mentioned the modular architecture as a key feature of the solution. The next chapter describes it in more detail.
Modular architecture
Indexing & ETL module consist of:
- indexer which is Software to extract the data
- orchestration service: you can use any source to receive the data from RPC and write it to any DWH
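To make the orchestration part concrete, here is a minimal sketch of an Airflow DAG that could move a batch of indexed blocks into the DWH. The task names, the `fetch_indexed_blocks`/`load_to_dwh` helpers, and the schedule are illustrative assumptions, not the actual Mbelt pipeline.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_indexed_blocks(**context):
    # Placeholder: read a batch of decoded blocks from the indexer's output
    # (e.g. a staging table or message queue) and pass it downstream.
    ...

def load_to_dwh(**context):
    # Placeholder: write the batch into the raw layer of the DWH
    # (BigQuery in this proposal, but any warehouse could be plugged in here).
    ...

with DAG(
    dag_id="mbelt_to_dwh",            # illustrative DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",      # illustrative schedule
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="fetch_indexed_blocks",
                             python_callable=fetch_indexed_blocks)
    load = PythonOperator(task_id="load_to_dwh",
                          python_callable=load_to_dwh)
    extract >> load
```

Since the two tasks only communicate through the DAG, the same skeleton works with any indexer or warehouse, which is the point of the modular design.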
The DWH module is an analytical database which stores all the data and provides public access. It consists of:
- a raw layer, which is simply the response from the indexer - structured/decoded data from RPC nodes
- a domain layer, which contains the reports and data marts written by users. By default we will provide the "staking domain data" with data marts about staking
The UI module is a fork of an existing BI tool with the ability to:
- access DWH data (both layers)
- write queries through a dedicated interface
- execute SQL statements
- save and store SQL statements (provided by users manually) for use in visualization
- build charts and dashboards on top of DWH data through the BI interface
Tech stack
module | component | solution | tech stack | comment |
---|---|---|---|---|
Indexing & ETL | indexer | Mbelt | JS | We have already developed the indexer for a web3 RFP. It can extract all the data that Subquery does, and moreover it can extract data about the staking process. We also share SLI/SLO for the indexer, and our experience shows that it is currently the best solution in terms of data quality, flexible enough to support custom needs such as staking. |
Indexing & ETL | orchestration | Airflow | Python | flexible, popular ETL orchestration |
DWH | raw layer + domain layer | BigQuery | cloud SQL | the current solution is DWH-agnostic, but we are using BigQuery |
UI | forked BI tool | forked Superset | backend: Python; frontend: React | auth out of the box; already has all the functions needed to write, save, and execute queries and build dashboards; open-source; flexible role model; audit logs for security; ability to make dashboards public |
Data schema
The raw layer (the indexer's data) follows the Mbelt data schema described here: https://github.com/p2p-org/polkadot-profit-transformer/blob/main/db/000001_init.sql
The solution comes with a second layer of data, the data marts layer.
There are two ways to create a report based on the raw layer:
- via the BI tool, where the query is stored as a request to the DWH
- via a PR with a data mart that is recalculated on a regular basis
The second option is intended for popular or especially important reports and must be accepted by admin users (the p2p.org DWH team).
Initially the solution will contain reports about staking in Polkadot, Kusama, and EVM parachains:
- the active set of validators per epoch
- the stake amount between each nominator and validator for every epoch
- the total validator stake per epoch
- the validator commission for all active validators
- the public identity of validators
- the rewards (tokens, reward points) earned by a nominator with a particular validator per epoch
- the total rewards (tokens, reward points) earned by a nominator per epoch
- the total rewards (tokens, reward points) earned by a validator per epoch
- the ability to calculate APR/APY on top of these reports (a worked sketch is shown below)
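As an illustration of the last point, here is a minimal sketch of how APR and APY could be derived from the per-epoch reward and stake reports. The input values and the number of reward periods per year are assumptions for the example, not fixed parameters of the platform.

```python
def apr_apy(reward_per_period: float, stake: float, periods_per_year: int) -> tuple[float, float]:
    """Derive APR (simple) and APY (compounded) from a single reward period.

    reward_per_period and stake are in the same token units; periods_per_year
    depends on the chain (e.g. 365 daily eras on Polkadot).
    """
    rate = reward_per_period / stake          # reward rate for one period
    apr = rate * periods_per_year             # simple annualized rate
    apy = (1 + rate) ** periods_per_year - 1  # compounded, assuming rewards are restaked
    return apr, apy

# Example with made-up numbers: 1.2 DOT earned on a 3,000 DOT stake in one period.
apr, apy = apr_apy(reward_per_period=1.2, stake=3000, periods_per_year=365)
print(f"APR: {apr:.2%}, APY: {apy:.2%}")  # roughly 14.6% APR, ~15.7% APY
```

The same calculation can equally be expressed as a SQL data mart on top of the per-epoch rewards and stake tables; the Python form is shown only to make the formula explicit.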
Data quality
Polkadot SLI (based on history for internal consumption):
- Block processing mean time: 1073 msec (min 238 msec, max 35 sec)
- Mean latency between a block appearing on the RPC node and in the database: ~18 sec (min 13 sec, max 99 sec)
- Blocks processed correctly on the first attempt (others are fixed automatically): ~99%
- Staking SLI: 100% (no problems detected)
Parachain SLI:
- Block processing mean time: 914 msec (min 81 msec, max 98 sec)
- Mean latency between a block appearing on the RPC node and in the database: ~95 sec (min 12 sec, max 240 sec)
- Blocks processed correctly on the first attempt (others are fixed automatically): ~96%
- Staking SLI: 95%
We also have an SLI dashboard here, but the thresholds used in this dashboard were set by the internal team and could differ for the final public solution.
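For illustration, here is a minimal sketch of how the RPC-to-database latency SLI could be computed over the raw layer. The table name and the `block_timestamp`/`inserted_at` columns are assumptions for the example; the actual Mbelt schema is linked above.

```python
from google.cloud import bigquery

client = bigquery.Client(project="p2p-data-platform")  # placeholder project id

# Average ingestion latency per day over the last week, assuming hypothetical
# block_timestamp (time the block was produced) and inserted_at (time it
# landed in the raw layer) columns.
sql = """
SELECT
  DATE(inserted_at) AS day,
  AVG(TIMESTAMP_DIFF(inserted_at, block_timestamp, SECOND)) AS mean_latency_sec
FROM `p2p-data-platform.raw.blocks`
WHERE inserted_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY day
ORDER BY day
"""

for row in client.query(sql).result():
    print(row.day, round(row.mean_latency_sec, 1))
```

The public SLI dashboard would visualize exactly this kind of aggregate, with thresholds tuned for the public solution rather than the internal ones quoted above.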
Team 👥
Team members
- Developers:
- Ivan Safronov (developer of Mbelt)
- Artem Kiselev (team lead of indexation team in P2P.org)
- DWH engineers:
- Kovalev German (data engineer)
- Kamalov Sergey (team lead of DWH team in P2P.org)
- Product owner: Kotenko Nick
Team Code Repos
Indexer: Mbelt
Public DWH: the “p2p-data-platform” project in BigQuery, which you can find in public access through the BigQuery console
Team's experience
We believe that the proposed product can bring not only a new user experience for validators and NO managers but also support true decentralization together with performance improvements. P2P.org has long-term experience in more than 25 blockchains since 2018. We can bring high-level expertise to help new participants perform better by providing recommendations on top of the core solution.
Our data chapter has been successfully implementing a data-driven culture at P2P.org for 2 years.
- Our developers have vast experience in data indexation (Mbelt, a Cosmos chains indexer for Agoric);
- Our data engineers support external and internal data warehouse solutions with more than 200 TB of on-chain data;
- Our data analysts, together with the developers, are finishing a new staking flow solution with delegators' custom dashboards built on top of the DWH.
Contact
- Contact Name: Nick Kotenko
- Contact Email: nikolay.kotenko@p2p.org
- Website: p2p.org
Development Status 📖
The current solution from P2P.org already includes:
- the Mbelt indexer as an open-source solution
- private Polkadot datasets
- a private Superset instance without public user access and public auth
- a dashboard with SLI/SLO for the platform's ETL
We propose a single milestone to:
- copy the Mbelt data (raw indexed data) to a public dataset
- provide a public dataset about staking (an example of users' domain data/reports)
- fork Superset for public usage
- provide staking dashboards (an example of users' domain data/reports)
- wrap everything into an end-to-end open-source solution: indexer deployment + orchestration + DWH connector + BI tool
Overview
- Total Estimated Duration: 1.5 months
- Full-Time Equivalent (FTE): 200 FTE hours [100 USD/hr]
- Total Costs: $20,000
Milestone 1 — Forked Superset with indexed data
- Estimated duration: 1.5 months
- FTE: 3.5
- Costs: 20,000 USD
Number | Deliverable | Specification |
---|---|---|
0a. | License | Apache 2.0 |
0b. | Documentation | We will provide a documentation page about how to launch the Docker files for ETL (indexer + orchestration), connect to the DWH, and launch the UI (BI tool). |
0c. | Testing and Testing Guide | Core functions will be fully covered by comprehensive unit tests to ensure functionality and robustness. In the guide, we will describe how to run these tests. |
0d. | Docker | We will provide Dockerfile(s) that can be used to run the indexer locally, set up the database (DWH), and run the BI tool. |
1. | Develop the ETL component | The result of developing the ETL component is a Docker setup with Mbelt (the indexer) and Airflow with DAGs that write the indexer's output into the DWH (database). |
2. | Publish DWH data | Our goal for publishing data is to provide the raw (indexer's) data in a public BigQuery project so that public users can query and build on it. The chains are Polkadot, Kusama, and EVM parachains. We will also provide reports about staking on top of the raw data in the chains above. An additional result is an SLI/SLO dashboard of the ETL process to provide more transparency for users. |
3. | Forking Superset | Forking Superset will allow us to modify the sign-in page so that new users can join. We will make some changes on the UI side to improve the user experience, such as: main page modifications to make it dashboard-oriented, Dashboard tab & Chart tab modifications to provide user-friendly filters, SQL Lab tab modifications to make it easier to search for sources, and user settings modifications to remove unnecessary buttons (a configuration sketch follows the table). |
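To give a flavour of what opening a forked Superset instance to the public might involve on the configuration side, here is a minimal `superset_config.py` sketch. It only uses standard Superset/Flask-AppBuilder options (self-registration and a public role); the actual fork will also contain the UI changes described in the table, and the exact settings here are an assumption at this stage.

```python
# superset_config.py - illustrative settings for a publicly accessible fork.
from flask_appbuilder.security.manager import AUTH_DB

# Standard database-backed authentication with self-registration enabled,
# so new users can sign up from the (modified) sign-in page.
AUTH_TYPE = AUTH_DB
AUTH_USER_REGISTRATION = True
AUTH_USER_REGISTRATION_ROLE = "Gamma"  # query/read-level role for new users

# Allow anonymous visitors to view dashboards that have been made public.
PUBLIC_ROLE_LIKE = "Gamma"

# Row limit as a simple guard against accidentally huge result sets.
ROW_LIMIT = 10000
```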
Future plans
Functionality that we want to test and implement outside the scope of the RFP development:
- materialized views for users
- uploading users' off-chain data
- anchor modeling as a data layer for analysts
- AI assistant implementation (closed-source)
- community-driven fine-tuning of the AI assistant
Referral Program (optional) 💰
Not applicable
Additional Information ➕
How did you hear about the Grants Program?
Web3 Foundation Website and previous grants: Multiblockchain ETL