davidgasquez / gitcoin-grants-data-portal
Open source, serverless, and local-first data hub for Gitcoin Grants data!
Home Page: https://grantsdataportal.xyz/
License: MIT License
The CI setup allows execution to proceed even though some Dagster jobs have failed. I would like to use the GitHub Run Summary to make it more obvious that an asset failed to materialize, without needing to open the log. The solution is to modify the equivalent of the `make run` step to produce a logfile, and to parse that logfile in a new CI step responsible for surfacing failures, like so. Will open a PR, but first I want to check if I can capture `STEP_SKIPPED` this way.
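For reference, a minimal sketch of what that parsing step could look like, assuming the Dagster log is written to `dagster_run.log` and that event names appear verbatim in the log lines (both are assumptions):

```python
# Sketch: scan the Dagster logfile for failure events and write them to the
# GitHub run summary. Event names and the log path are assumptions.
import os
import sys

FAILURE_EVENTS = ("RUN_FAILURE", "STEP_FAILURE", "STEP_SKIPPED")

def summarize(log_path: str) -> int:
    with open(log_path) as log:
        failures = [line.strip() for line in log if any(e in line for e in FAILURE_EVENTS)]

    # GITHUB_STEP_SUMMARY is set by the Actions runner; markdown appended
    # to that file shows up on the run's summary page.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path and failures:
        with open(summary_path, "a") as summary:
            summary.write("### Dagster failures\n")
            summary.writelines(f"- `{line}`\n" for line in failures)

    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(summarize("dagster_run.log"))
```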
The portal website is using relative paths for `/data`, which makes `/data` accessible via any gateway... but makes it error on GitHub Pages.
Working on this one as we speak.
Some resources we want to grab:

- … (`data` folder)

We can pin those files on IPFS to simplify fetching logic and ensure they don't break in the future:

- `IPFS_Resource` configured with gateway.
- `assets/static.py` with assets representing archival datasets (a sketch follows below).

Examples of historic datasets:

- RPGF3 data (OSO-snapshot)
- cGrants (old Gitcoin data from Umar) (see also: OSO-funding)
- Octant Epoch 0 (only epoch to use `snapshot` for allocation)

The current indexer endpoint is deprecated. We should get it from https://public.scorer.gitcoin.co/passport_scores/registry_score.jsonl.
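A minimal sketch of one such archival asset in `assets/static.py`, assuming a gateway URL, a placeholder CID, and a CSV payload (all names here are hypothetical):

```python
# Hypothetical sketch of a pinned archival asset for assets/static.py.
import io

import pandas as pd
import requests
from dagster import asset

IPFS_GATEWAY = "https://ipfs.filebase.io/ipfs"  # assumed gateway
RPGF3_CID = "bafy..."  # placeholder CID of the pinned snapshot

@asset
def raw_rpgf3() -> pd.DataFrame:
    """Archival RPGF3 dataset fetched from a file pinned on IPFS."""
    response = requests.get(f"{IPFS_GATEWAY}/{RPGF3_CID}", timeout=60)
    response.raise_for_status()
    return pd.read_csv(io.BytesIO(response.content))
```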
dbt allows for schema testing, where we can declare what we want to see and run `dbt test` to check whether reality conforms to expectations:
```yaml
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```
I think it is worthwhile to create those for some important models. Since we are already describing the schema, it would be nice to also add:

- `addresses`, so we know at a glance whether, for example, `round_id` is an actual address or just a random string.
- An `is_lowercased` test for consistency.

Ultimately, we could have `dbt test` added as a CI step to make sure that runs with malformed or missing assets do not proceed to the IPFS upload stage.
The instructions given in the README are sufficient to play with the data portal in an interactive manner: `make dev` to spin up Dagster. BUT, what steps are needed to run the `publish & deploy` GitHub Actions workflow from a fork?
Here is a summary of the steps I had to take; it may be slightly outdated:

- Set the `${{ secrets.FILEBASE_KEY }}` + `${{ secrets.FILEBASE_SECRET }}` secrets in the fork.
- Create a new Filebase bucket (`gitcoin-grants-data-portal` won't work, as bucket names have to be unique site-wide).
- Change `filebaseBucket: gitcoin-grants-data-portal` inside `/.github/workflows/run.yml` to the new bucket name.

This is for the bucket; now we may also need to: …

Once I have a moment I will put those steps in a small HOW-TO document, on the off-chance that there is a single person somehow interested in running a 100% independent replica of the portal (I found it handy for development to test CI changes and preview how the portal site would look).
Saw the recent and very cool @junta submission for Arbitrum and was thinking we could reuse some of those assets for Gitcoin stuff.
Offer a "pet name" way for people to access the latest tables.
Basically, copy some of the GH Actions I was using before into the new one.
After talking with some Dagster employees and looking at Dagster's own project, it feels like moving out of the IO Manager will make it easier to create both sensors and partitioned assets for #28.
Basically, follow the same pattern that is running on Datadex.
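As a rough illustration of the pattern (the asset and output path here are made up), the asset persists its own output instead of handing it to an IO manager:

```python
# Sketch: the asset writes its own output, so no IO manager is involved.
import pandas as pd
from dagster import asset

@asset
def projects_parquet() -> None:
    # Placeholder frame; the real asset would build the actual table.
    frame = pd.DataFrame({"project_id": ["0xabc"], "title": ["Example"]})
    frame.to_parquet("data/projects.parquet")
```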
We should expose both schemas and samples for all the curated datasets.
This will improve UX and make choosing datasets easier!
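One cheap way to produce both, sketched here against the released `dbt.duckdb` file (the output paths are arbitrary):

```python
# Sketch: dump a schema description and a 5-row sample for every table.
import duckdb

con = duckdb.connect("dbt.duckdb", read_only=True)
for (table,) in con.execute("SHOW TABLES").fetchall():
    con.execute(f"DESCRIBE {table}").df().to_csv(f"docs/{table}.schema.csv", index=False)
    con.execute(f"SELECT * FROM {table} LIMIT 5").df().to_csv(f"docs/{table}.sample.csv", index=False)
```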
Surface simple ways to start using the latest datasets.
We should have the portal at `gitcoin-data.eth`.

Besides all the tabs (Notebook, which should be renamed Reports or something like that, Catalog, dbt Docs, ...), we should add a `Data` tab that redirects the user to the IPNS hash of the latest datasets. Makes it smoother for users that don't want to read!
I wanted to make a high-level diagram showing how the project is set up at the moment. Open to suggestions if something could be tweaked.

Excalidraw: https://excalidraw.com/#json=2w27DpwY7oqSSwgdpyUT5,8zyALKj1K5pEKNAOvNPPuQ

I want to make another one focused on the three kinds of data we have (Governance + Chain Data + Donations). Two options where to put it, dealer's choice:

- `architecture.md`
- …
Compress and sort them!
Not sure if it can be done for the DuckDB database.
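For the Parquet side, DuckDB can already write sorted, compressed exports; a sketch (table and sort column are assumed):

```python
# Sketch: export a table sorted and zstd-compressed.
import duckdb

con = duckdb.connect("dbt.duckdb", read_only=True)
con.execute(
    "COPY (SELECT * FROM projects ORDER BY project_id) "
    "TO 'data/projects.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"
)
```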
I made the logo with Stable Diffusion back in the day. Would be great to have it done properly as an SVG!
Currently, we rely on the Allo Indexer API data. We should add an option to pull data straight from chains using something like `cryo` or Subsquid. This way, we don't need to trust the Allo API data if that's what we want.
Making changes to either the infrastructure or the website can result in broken links on the portal website. As a final step of the `Publish and Deploy` workflow we can add a "link-checker" GitHub Action that visits the portal and reports if any broken link was encountered. Having outdated examples is tolerable, but I want to make sure that the `Get the Data` link on the main portal website is always working as intended.
Site links:

- `ScholliYT/Broken-Links-Crawler-Action@v3`, configured to ignore twitter.
- `run_summary` with broken links.

File links: …
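If the action turns out to be a poor fit, a bare-bones fallback check for the most important links could look like this (URLs assumed):

```python
# Sketch: fail loudly if any critical portal link is broken.
import requests

CRITICAL_LINKS = [
    "https://grantsdataportal.xyz/",  # portal home
    # the "Get the Data" target would go here as well
]

broken = [url for url in CRITICAL_LINKS if not requests.get(url, timeout=30).ok]
assert not broken, f"Broken links: {broken}"
```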
We can push datasets to Flipside and Dune via Dagster assets.
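For Dune, one possible shape of such an asset, assuming Dune's CSV table-upload endpoint and a `DUNE_API_KEY` secret (table and file names are placeholders):

```python
# Hypothetical sketch: push an exported CSV to Dune from a Dagster asset.
import os

import requests
from dagster import asset

@asset(deps=["projects"])
def dune_projects() -> None:
    with open("data/projects.csv") as f:
        csv_data = f.read()
    response = requests.post(
        "https://api.dune.com/api/v1/table/upload/csv",
        headers={"X-DUNE-API-KEY": os.environ["DUNE_API_KEY"]},
        json={"table_name": "ggdp_projects", "data": csv_data},
        timeout=120,
    )
    response.raise_for_status()
```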
For bulk data (e.g. all votes) it is probably best to use the RPC approach outlined in #1, but for events/transactions that are less frequent, the free tier of the Covalent API could work as a lightweight alternative.

As a proof of concept I would start with the `Project Registry` contract: it has few transactions, so the free Covalent API can fetch them in reasonable time, across all important chains, without consuming too many free monthly credits.

The implementation outline would be:

- `ProjectRegistry` contract on each interesting chain.
- `CovalentAPI` resource configured with a `covalent_api_key` secret (sketched below).
- `Contract Registry` contracts.

The end goal is knowing how much gas was spent on creating and updating project profiles on different chains, plus reconciling whatever the Indexer is telling us with another source.
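A sketch of what that `CovalentAPI` resource could look like (the class, endpoint shape, and field names are assumptions on my side):

```python
# Hypothetical CovalentAPI Dagster resource.
import requests
from dagster import ConfigurableResource

class CovalentAPI(ConfigurableResource):
    covalent_api_key: str

    def transactions(self, chain_id: int, address: str) -> list[dict]:
        """All transactions for a contract address on one chain."""
        url = (
            f"https://api.covalenthq.com/v1/{chain_id}"
            f"/address/{address}/transactions_v2/"
        )
        response = requests.get(url, params={"key": self.covalent_api_key}, timeout=60)
        response.raise_for_status()
        return response.json()["data"]["items"]
```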
The Indexer has several objects (`projects`, `rounds`, `applications`) with `metadata` served as well-formed, nested JSON:
"metadata": {
"signature": "0xacb3be5c327477a5[...]",
"application": {
"round": "0xddc627acc685c2a3fa67bc311a5318d1ae2ce899",
...
When inspecting the database tables `raw_projects`, `raw_rounds`, and `raw_round_applications` from the latest release (`dbt.duckdb` file), somewhere along the pipeline the `metadata (varchar)` column in each of those tables was transformed into a pythonic format that can no longer satisfy a JSON parser, but instead needs to be passed to Python's `ast.literal_eval()`:

```
{'title': 'FTM Test Project', 'description': "Just a description here, don't mind me. "...
```
I wanted to be able to run a simple DuckDB query to extract interesting fields, like the one below. This won't work as-is, because the single quotes break the JSON parser:

```sql
select json_extract_string(metadata,'$.title') from public.raw_projects;
```
To achieve that quickly, I modified the asset generation step as follows (`ggdp/assets.py`). The goal is to ensure that the Pandas DataFrame generated by Dagster contains a JSON string instead of a collection of Python objects. This seems to be working on my fork. For future reference, I should probably just replace the `metadata` column instead of duplicating it.
```python
import json
import pandas as pd
from dagster import asset

@asset
def raw_projects() -> pd.DataFrame:
    projects = chain_file_aggregator("projects.json")
    # Re-serialize the parsed Python objects into proper JSON strings.
    projects["json_metadata"] = projects["metadata"].apply(json.dumps)
    return projects
```
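With the duplicated column in place, the earlier extraction works against `json_metadata`; a quick check (column name taken from the snippet above):

```python
# Verify the JSON column is now parseable by DuckDB.
import duckdb

con = duckdb.connect("dbt.duckdb", read_only=True)
print(con.execute(
    "select json_extract_string(json_metadata, '$.title') from raw_projects limit 5"
).fetchall())
```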
I am writing this to suggest that upstream could also benefit from having `metadata` for the `raw_` tables in JSON format.
Moving the #77 tasks here @DistributedDoge!

- `anchor_address`
- `is_ipfs_cid` test

I think it is handy to have a `pet name` for the data portal tables so we can use it like so. This can be achieved with DNS configuration alone, and no server.
```python
import pandas
pandas.read_parquet('http://grant-data.xyz/projects.parquet')
```
Which domain registrars actually provide the `path redirect` feature needed for this setup to work? It would be interesting to try an `official-sounding-domain` with Cloudflare to see how that would work.
Should reduce file size!
I think the root of the problem when merging #27 was this, and now that the repository is using secrets, it may also affect all future pull requests:

> With the exception of GITHUB_TOKEN, secrets are not passed to the runner when a workflow is triggered from a forked repository.

Looking online, I see suggestions that replacing `pull_request` with `pull_request_target` might give the GitHub CI runner access to the secrets of the host (i.e. this) repository when a pull request is made.

EDIT: Might also need some changes to the checkout step to make sure we execute the code that is supplied inside the pull request.
```yaml
name: CI
on:
  pull_request_target:
    branches:
      - main
```
The downside of allowing secrets in pull requests, malicious PRs, is partially mitigated by GitHub Actions: "Maintainers must approve first time contributor workflow runs".

EDIT: Two possible alternatives:

- … the `CI` workflow.
- … `CI`, then it won't need secrets.

After the merge of #22, multiple tables silently failed to materialize, breaking the portal, as shown in the run logs, which contain a `RUN_FAILURE` event during the `make run` step:
```
dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "raw_passport_scores"::
ValueError: Unexpected character found when decoding array value (2)
```
A subsequent styling commit re-triggered the CI, building all the tables. Since the error had occurred in the `passport_scores` table, which wasn't touched by either of the commits, it is possible it was caused by something going on with the datasource.

To prevent such silent regressions, we could use the exit status of `make run` so that the GitHub CI run is interrupted before the IPFS upload occurs. The problem is that the Dagster CLI always returns a 0 exit code, even if the run concludes with a `RUN_FAILURE` event in the logs:
```shell
dagster asset materialize --select * -m ggdp
echo $?  # prints 0 even when the run ended with RUN_FAILURE
```
The best solution I can think of at the moment is searching the logs of `make run` with `grep` and throwing `exit 1` if `RUN_FAILURE` is encountered.

Opening this one to brainstorm, in case there is some less-janky way of coercing the `dagster asset materialize` command to signal that something went wrong during the process.
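One possibly less-janky alternative (a sketch, assuming the assets live in `ggdp.assets`): drive the run through Dagster's Python API, which reports success directly, and derive the exit code from that:

```python
# Sketch: materialize via the Python API and exit non-zero on failure.
import sys

from dagster import load_assets_from_modules, materialize

import ggdp.assets  # assumed module layout

assets = load_assets_from_modules([ggdp.assets])
result = materialize(assets, raise_on_error=False)
sys.exit(0 if result.success else 1)
```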
Perhaps also from Torrent and Iroh?
The main idea is to rely on the latest portal data and run smaller incremental updates on CI. We should provide a `--full-refresh` flag, a la `dbt`, to build the data from scratch.
This is a big one!
Extracting `metadata` into standardized columns is very handy; thanks for resolving #10, which does that for the `projects` table. It feels natural to do the same for the remaining tables.

I took the liberty of doing the same for `rounds` here in #11, mainly to get the round titles. There is also `round_applications`, which could use a similar treatment. If you want, I can open a PR for that model as well.
We're putting Parquet files on IPFS. Every release will be under a different IPFS hash, and that is both confusing and hard to navigate from the UX side. Will aim to rely on Filebase and offer a static IPNS endpoint for the exported Parquet files.
Useful for grants and "marketing"
We can do something like this to run the "push" to IPFS.
```yaml
# Conditional step that runs only if it's a PR against the main branch
- name: Conditional step for PR against main
  if: ${{ github.event_name == 'pull_request' && github.base_ref == 'main' }}
  run: |
    echo "This step runs only for PRs against the main branch"
```
Karma GAP is using on-chain attestations to track self-reported updates and milestones of various Gitcoin projects. I can grab data from there with the attest.sh GraphQL API. A bonus is that the logic can be re-used to grab different kinds of `attestation` from other platforms.

https://docs.attest.sh/docs/developer-tools/api

TODO:

- `profiles` / `grants` / `milestones`

EDIT: Dependencies go Profile => Grant => Milestones (see the sketch below).
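A sketch of pulling attestations from the EAS GraphQL API referenced above (the endpoint and the schema id are assumptions; `decodedDataJson` would hold the profile/grant/milestone payload):

```python
# Hypothetical sketch: fetch Karma GAP attestations via the EAS GraphQL API.
import requests

QUERY = """
query Attestations($schemaId: String!) {
  attestations(where: { schemaId: { equals: $schemaId } }) {
    id
    attester
    recipient
    decodedDataJson
  }
}
"""

response = requests.post(
    "https://easscan.org/graphql",  # assumed endpoint
    json={"query": QUERY, "variables": {"schemaId": "0x..."}},  # placeholder schema id
    timeout=60,
)
response.raise_for_status()
attestations = response.json()["data"]["attestations"]
```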
For folks that want to tinker with the data quickly, we should have a Colab Notebook that has one Pandas DataFrame for each dataset we publish and a couple of usage examples.