
gitcoin-grants-data-portal's Introduction

Hi there 👋

Data Engineer passionate about Open Source, Open Data, and Open Protocols.

  • 🔭 Data at Protocol Labs. Previously at Buffer.
  • ✨ Interested in Open Source (tools), Open Data (knowledge), and Open Protocols (systems/processes)!
  • 🌱 Learning about decentralized systems, governance, mechanism design, and database internals.
  • 💬 Ask me about data, knowledge management, and remote work!
  • 📝 Digital Handbook: I maintain a personal handbook where I store learnings and other interesting material.

More about me on my website!

gitcoin-grants-data-portal's People

Contributors

davidgasquez, distributeddoge


gitcoin-grants-data-portal's Issues

Draw diagrams to improve docs

I wanted to make a high-level diagram showing how the project is set up at the moment.

Open to suggestions if something could be tweaked.

Excalidraw: https://excalidraw.com/#json=2w27DpwY7oqSSwgdpyUT5,8zyALKj1K5pEKNAOvNPPuQ


I want to make another one focused on the three kinds of data we have (Governance + Chain Data + Donations).

Two options for where to put it, dealer's choice:

  • put it in README.md (maybe replacing the asset graph, which isn't very legible)
  • add some text and put it in architecture.md

Document how to run `publish & deploy` from fork

The instructions given in the README are sufficient to play with the data portal in an interactive manner:

  1. Clone the repo on a local machine OR enter a Codespace, then install dependencies.
  2. Run make dev to spin up Dagster.
  3. Enter the Dagster instance and run whichever models you want.

But what steps are needed to run the publish & deploy GitHub Actions workflow from a fork?

Here is a summary of the steps I had to take (it may be slightly outdated):

  1. Register a Filebase account.
  2. Create a Filebase key and add it to the repository-wide secrets ${{ secrets.FILEBASE_KEY }} + ${{ secrets.FILEBASE_SECRET }}.
  3. Create a Filebase bucket with a new name (gitcoin-grants-data-portal won't work; bucket names have to be unique site-wide).
  4. Change filebaseBucket: gitcoin-grants-data-portal inside /.github/workflows/run.yml to the new bucket name.

That covers the bucket; we may also need to:

  1. Set up GitHub Pages.
  2. Register a Covalent account + add the key to the repository secrets.

Once I have a moment I will put those steps in a small HOW-TO document, on the off-chance that there is a single person somehow interested in running a 100% independent replica of the portal (I found it handy for development to test CI changes and preview how the portal site would look).
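
For a quick sanity check of the fork's Filebase credentials and bucket name before wiring them into the workflow, something like the sketch below can help. It assumes Filebase's S3-compatible endpoint at https://s3.filebase.com; the bucket name and environment variable names are placeholders mirroring the secrets above.

import os

import boto3  # Filebase exposes an S3-compatible API, so boto3 works against it

client = boto3.client(
    "s3",
    endpoint_url="https://s3.filebase.com",
    aws_access_key_id=os.environ["FILEBASE_KEY"],
    aws_secret_access_key=os.environ["FILEBASE_SECRET"],
)

BUCKET = "my-grants-data-portal-fork"  # placeholder: bucket names must be unique site-wide

# Upload a tiny test object and list the bucket to confirm credentials and bucket are valid.
client.put_object(Bucket=BUCKET, Key="healthcheck.txt", Body=b"ok")
for obj in client.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])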

Extract metadata from `rounds` and `round_applications`

Extracting the metadata into standardized columns is very handy; thanks for resolving #10, which does that for the projects table. It feels natural to do the same for the remaining tables.

I took the liberty of doing the same for rounds in #11, mainly to get round titles.

There is also round_applications, which could use a similar treatment. If you want, I can open a PR for that model as well.
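
For reference, a minimal sketch of the kind of extraction meant here, assuming the metadata column already holds parsed Python dicts in the raw DataFrame; the field names (name, title) are assumptions about the round metadata shape, not the actual dbt model.

import pandas as pd


def extract_round_metadata(raw_rounds: pd.DataFrame) -> pd.DataFrame:
    """Pull commonly used fields out of the nested round metadata blob."""
    rounds = raw_rounds.copy()
    meta = rounds["metadata"].apply(lambda m: m or {})
    # Round metadata field names vary; check the real payloads before relying on these keys.
    rounds["round_title"] = meta.apply(lambda m: m.get("name") or m.get("title"))
    return rounds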

Support ENS

We should have the portal at gitcoin-data.eth.

Publish Parquet files under a static endpoint

We're putting Parquet files on IPFS. Every release ends up under a different IPFS hash, which is both confusing and hard to navigate from the UX side.

We will aim to rely on Filebase and offer a static IPNS endpoint for the exported Parquet files.
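
To illustrate the UX difference, a short sketch of reading the exports either through a per-release CID or through a stable IPNS name via a public gateway; both the CID and the IPNS name below are placeholders.

import pandas as pd

# Per-release path: the CID changes on every pipeline run (placeholder hash).
release_cid = "bafybeiplaceholderplaceholderplaceholder"
projects = pd.read_parquet(f"https://ipfs.io/ipfs/{release_cid}/projects.parquet")

# Static path: an IPNS name stays constant across releases and always resolves to the
# latest export (placeholder name; the real one would be published via Filebase).
ipns_name = "k51qzi5uqu5dplaceholderplaceholder"
latest_projects = pd.read_parquet(f"https://ipfs.io/ipns/{ipns_name}/projects.parquet")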

Add IPNS support

Offer a "pet name" way for people to access the latest tables.

  • IPNS
  • ENS

Attestations from Karma GAP

Karma GAP is using on-chain attestations to track self-reported updates and milestones of various Gitcoin projects.

I can grab data from there with the attest.sh GraphQL API.

The bonus is that the logic can be reused to grab different kinds of attestations from other platforms.

https://docs.attest.sh/docs/developer-tools/api

TODO:

  • see what to grab from the Karma GAP schema
  • grab Karma GAP updates from EAS
  • split Karma GAP profiles / grants / milestones

EDIT: Dependencies go Profile => Grant => Milestones.

Profile schema on Optimism (?)
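
A rough sketch of what pulling these attestations could look like, assuming Karma GAP attests on Optimism and using the easscan GraphQL endpoint; the schema UID is a placeholder and the selected fields are assumptions to verify against the actual EAS/Karma GAP schemas.

import requests

EAS_GRAPHQL = "https://optimism.easscan.org/graphql"

# Placeholder: the real Karma GAP schema UID has to be looked up on easscan.
KARMA_GAP_SCHEMA_UID = "0x0000000000000000000000000000000000000000000000000000000000000000"

QUERY = """
query Attestations($schemaId: String!) {
  attestations(where: { schemaId: { equals: $schemaId } }) {
    id
    attester
    recipient
    timeCreated
    decodedDataJson
  }
}
"""

response = requests.post(
    EAS_GRAPHQL,
    json={"query": QUERY, "variables": {"schemaId": KARMA_GAP_SCHEMA_UID}},
    timeout=30,
)
response.raise_for_status()
attestations = response.json()["data"]["attestations"]
print(len(attestations), "attestations fetched")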

Fix `/data` in website

The portal website is using relative paths for /data, which makes /data accessible via any gateway... but makes it error on GitHub Pages.

Handling for static/historic datasets

Working on this one as we speak.

Some of the resources we want to grab:

  • have contents that stay the same
  • are hosted in strange places (e.g. Google Sheets, Notion, GitHub data folders)

We can pin those files on IPFS to simplify the fetching logic and ensure they don't break in the future (see the sketch after this list):

  • create an IPFS_Resource configured with a gateway
  • create a new module assets/static.py with assets representing archival datasets
  • create example assets using this strategy
  • actually submit the PR
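
A minimal sketch of what that could look like, assuming a recent Dagster version with ConfigurableResource; the class name, gateway URL, asset name, and CID are placeholders, not the project's actual implementation, and the resource would still need to be registered in Definitions.

from io import StringIO

import pandas as pd
import requests
from dagster import ConfigurableResource, asset


class IPFSResource(ConfigurableResource):
    """Fetch pinned files through an IPFS HTTP gateway."""

    gateway_url: str = "https://ipfs.io/ipfs"

    def fetch_csv(self, cid: str) -> pd.DataFrame:
        response = requests.get(f"{self.gateway_url}/{cid}", timeout=60)
        response.raise_for_status()
        return pd.read_csv(StringIO(response.text))


@asset
def raw_octant_epoch_zero(ipfs: IPFSResource) -> pd.DataFrame:
    # Placeholder CID: the real one would point at the pinned archival snapshot.
    return ipfs.fetch_csv("bafybeiexampleexampleexampleexampleexample")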

Examples of historic datasets:

  • Carl's RPGF3 data OSO-snapshot
  • cGrants (old Gitcoin data from Umar) (see also: OSO-funding)
  • Octant Epoch 0 (Only epoch to use snapshot for allocation)

Pull infrequent Transactions from Covalent API

For bulk data (e.g. all votes) it is probably best to use the RPC approach outlined in #1, but for events/transactions that are less frequent, the free tier of the Covalent API could work as a lightweight alternative.

As a proof of concept I would start with the Project Registry contract: it has few transactions, so the free Covalent API can fetch them in a reasonable time, across all important chains, without consuming too many free monthly credits.

The implementation outline would be:

  • Find the address of the ProjectRegistry contract on each interesting chain.
  • Create a CovalentAPI resource configured with the covalent_api_key secret.
  • Fetch all transactions targeting the ProjectRegistry contracts.

The end goal is knowing how much gas was spent on creating and updating project profiles on different chains, plus reconciling whatever the Indexer is telling us against another source.
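
A rough sketch of the fetching step, assuming the Covalent v1 transactions_v2 endpoint; the chain names and registry addresses are placeholders, and pagination is ignored here.

import os

import requests

COVALENT_API = "https://api.covalenthq.com/v1"

# Placeholders: the real ProjectRegistry addresses differ per chain and must be looked up.
PROJECT_REGISTRY = {
    "eth-mainnet": "0x0000000000000000000000000000000000000000",
    "matic-mainnet": "0x0000000000000000000000000000000000000000",
}


def fetch_registry_transactions(chain: str, address: str) -> list[dict]:
    """Fetch transactions targeting the registry contract on a single chain."""
    url = f"{COVALENT_API}/{chain}/address/{address}/transactions_v2/"
    response = requests.get(url, params={"key": os.environ["COVALENT_API_KEY"]}, timeout=60)
    response.raise_for_status()
    return response.json()["data"]["items"]


for chain, address in PROJECT_REGISTRY.items():
    transactions = fetch_registry_transactions(chain, address)
    print(chain, len(transactions), "transactions")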

Pull data directly from chain

Currently, we rely on the Allo Indexer API data. We should add an option to pull data straight from the chains using something like cryo or Subsquid. This way, we don't need to trust the Allo API data if that's what we want.

CI step to find broken links

Making changes to either the infrastructure or the website can result in broken links on the portal website.

As a final step of the Publish and Deploy workflow, we can add a "link-checker" GitHub Action that visits the portal and reports if any broken links were encountered.

Having outdated examples is tolerable, but I want to make sure that the Get the Data link on the main portal website is always working as intended.

Site links:

  • configure ScholliYT/Broken-Links-Crawler-Action@v3 to ignore Twitter
  • modify the action to produce a run summary with the broken links

File links:

  • visit the IPNS bucket using the portal navigation link and try to download something (see the sketch below)
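
For the file-links check, a small script along these lines could run after deployment; the portal URL is a placeholder for wherever the real site lives, and the link extraction is deliberately naive.

import re

import requests

# Placeholder: the deployed portal URL.
PORTAL_URL = "https://example.github.io/gitcoin-grants-data-portal/"

page = requests.get(PORTAL_URL, timeout=30)
page.raise_for_status()

# Pull every absolute href and try a HEAD request against it, skipping Twitter links.
links = re.findall(r'href="(https?://[^"]+)"', page.text)
broken = []
for link in links:
    if "twitter.com" in link:
        continue
    try:
        response = requests.head(link, allow_redirects=True, timeout=30)
        if response.status_code >= 400:
            broken.append((link, response.status_code))
    except requests.RequestException as error:
        broken.append((link, str(error)))

print("Broken links:", broken or "none")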

Add `Data` tab in the portal

Besides all the tabs, Notebook (should be Reports or something like that), Catalog, dbt Docs, ... we should add a Data tab that redirects the user to the IPNS hash of the latest datasets. Makes it smoother for users that don't want to read! 🙈

Create dbt tests to ensure data quality

dbt allows for schema testing, where we can declare what we want to see, and run dbt test to see if reality conforms to expectations.

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

I think it is worthwhile to create those for some important models. Since we are already describing the schema, it would be nice to also:

  • tag all addresses so we know at a glance whether, for example, round_id is an actual address or just a random string.
  • make sure all addresses conform to an is_lowercased test for consistency.

Ultimately, we could have dbt test added as a CI step to make sure that runs with malformed or missing assets do not proceed to the IPFS upload stage.
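
Until those tests exist in dbt, a quick ad-hoc check against the published dbt.duckdb file can catch the address-format issues; the table and column names below are assumptions based on the examples above and may need adjusting.

import duckdb

con = duckdb.connect("dbt.duckdb", read_only=True)

# Count round_id values that are not lowercased 0x-addresses (the regex encodes the
# expected format; swap in the real table/column names from the schema).
bad_ids = con.execute(
    """
    select count(*)
    from public.raw_rounds
    where round_id is not null
      and not regexp_matches(round_id, '^0x[0-9a-f]{40}$')
    """
).fetchone()[0]

print(f"{bad_ids} round_id values are not lowercased addresses")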

Consider storing `metadata` columns as well-formed JSON?

The Indexer has several objects (projects, rounds, applications) with metadata served as well-formed, nested JSON.

      "metadata": {
            "signature": "0xacb3be5c327477a5[...]",
            "application": {
                "round": "0xddc627acc685c2a3fa67bc311a5318d1ae2ce899",
...

When inspecting the database tables raw_projects, raw_rounds, and raw_round_applications from the latest release (the dbt.duckdb file), somewhere along the pipeline the metadata (varchar) column in each of those tables was transformed into a Pythonic format that no longer satisfies a JSON parser and instead has to be passed to Python's ast.literal_eval():

{'title': 'FTM Test Project', 'description': "Just a description here 🫣 don't mind me.  "...

I wanted to be able to run a simple DuckDB query to extract interesting fields, like the one below. This won't work as-is, because the single quotes break the JSON parser:

select json_extract_string(metadata,'$.title') from public.raw_projects;

To achieve that quickly, I modified the asset generation step as follows (ggdp/assets.py). The goal is to ensure that the pandas DataFrame generated by Dagster contains a JSON string instead of a collection of Python objects. This seems to be working on my fork.
For future reference, I should probably just replace the metadata column instead of duplicating it.

import json

import pandas as pd
from dagster import asset

@asset
def raw_projects() -> pd.DataFrame:
    # chain_file_aggregator is a helper defined elsewhere in ggdp/assets.py.
    projects = chain_file_aggregator("projects.json")
    # Serialize the nested metadata into a JSON string so DuckDB can parse it.
    projects['json_metadata'] = projects['metadata'].apply(json.dumps)
    return projects

I am writing this to suggest that upstream could also benefit from having the metadata for the raw_ tables in JSON format.

Iterate dbt schema

Moving the #77 tasks here @DistributedDoge!

  • write a description for every dataset
  • write descriptions for all columns where the meaning is not obvious (e.g. anchor_address)
  • create a generic is_ipfs_cid test

Interrupt CI run if `make run` fails

After the merge of #22, multiple tables silently failed to materialize, breaking the portal, as shown in the run logs, which contain a RUN_FAILURE event during the make run step.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "raw_passport_scores"::
ValueError: Unexpected character found when decoding array value (2)

A subsequent styling commit re-triggered the CI, building all the tables. Since the error had occurred in the passport_scores table, which wasn't touched by either of the commits, it is possible it was caused by something going on with the data source.

To prevent such silent regressions, we could use the exit status of make run so that the GitHub CI run is interrupted before the IPFS upload occurs. The problem is that the dagster CLI always returns a 0 exit code, even if the run concludes with a RUN_FAILURE event in the logs.

dagster asset materialize --select * -m ggdp
echo $?

The best solution I can think of at the moment is searching the logs of make run with grep and throwing exit 1 if RUN_FAILURE is encountered.

Opening this one to brainstorm in case there is some less janky way of coercing the dagster asset materialize command to signal that something went wrong during the process.
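
As a Python variant of the grep idea (not necessarily less janky), a thin wrapper could run the materialization, print the output, and exit non-zero when a RUN_FAILURE shows up; the command matches the dagster asset materialize invocation above.

import subprocess
import sys

# Run the same command `make run` uses and capture its combined output.
result = subprocess.run(
    ["dagster", "asset", "materialize", "--select", "*", "-m", "ggdp"],
    capture_output=True,
    text=True,
)
output = result.stdout + result.stderr
print(output)

# The dagster CLI can exit 0 even when assets fail, so also scan the logs for failures.
if result.returncode != 0 or "RUN_FAILURE" in output:
    sys.exit(1)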

Make pipeline incremental

The main idea is to rely on the latest portal data and run smaller incremental updates on CI. We should provide a --full-refresh flag, à la dbt, to build the data from scratch.

This is a big one!

Unify CI files

We can do something like this to run the "push" to IPFS:

    # Conditional step that runs only if it's a PR against the main branch
    - name: Conditional step for PR against main
      if: ${{ github.event_name == 'pull_request' && github.base_ref == 'main' }}
      run: |
        echo "This step runs only for PRs against the main branch"

DNS-setup for portal short-name

I think it is handy to have a pet name for the data portal tables so we can use it like below. This can be achieved with DNS configuration alone, no server needed.

import pandas
pandas.read_parquet('http://grant-data.xyz/projects.parquet')

  • Using an IPNS/IPFS hash feels safer, so we should still encourage that path
  • Using a domain is a good alternative for interactive sessions, one-off scripts, etc.

Which domain registrars actually provide the path redirect feature needed for this setup to work?

  • Namecheap does provide this feature; configuration is very simple.
  • Cloudflare should, according to a blog post, but I did not try it.
  • GoDaddy does not.

It would be interesting to try an official-sounding domain with Cloudflare to see how that would work.

Make sure CI runs on correct branch

I think the root of the problem when merging #27 was this, and now that the repository is using secrets, it may also affect all future pull requests:

With the exception of GITHUB_TOKEN, secrets are not passed to the runner when a workflow is triggered from a forked repository.

Looking online, I see suggestions that replacing pull_request with pull_request_target might give the GitHub CI runner access to the secrets of the host (i.e. this) repository when a pull request is made.

EDIT: Might also need some changes to the checkout step to make sure we execute the code that is supplied inside the pull request.

name: CI
on: 
  pull_request_target:
    branches:
      - main

The downside of allowing secrets in pull requests (malicious PRs) is partially mitigated by "GitHub Actions: Maintainers must approve first time contributor workflow runs".

EDIT: Two possible alternatives:

  • find another way to trigger the CI workflow
  • don't run the assets that use API keys as part of CI, so it won't need secrets

Add Run Summaries to CI

The CI setup allows execution to proceed even though some Dagster jobs have failed.

I would like to use the GitHub run summary to make it more obvious that an asset failed to materialize, without needing to open the log.


The solution is to modify the equivalent of the make run step to produce a logfile and parse it in a new CI step responsible for generating the run summary.

Will open a PR, but first I want to check if I can capture STEP_SKIPPED this way.
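
A minimal sketch of that parsing step, assuming the Dagster output was written to dagster_run.log (hypothetical filename) and that the step appends markdown to the summary file GitHub Actions exposes via GITHUB_STEP_SUMMARY.

import os
from pathlib import Path

# Hypothetical logfile produced by the `make run` step.
log = Path("dagster_run.log").read_text()

# Count the Dagster events worth surfacing in the summary.
failures = log.count("RUN_FAILURE")
skipped = log.count("STEP_SKIPPED")

summary = (
    "## Dagster run\n"
    f"- RUN_FAILURE events: {failures}\n"
    f"- STEP_SKIPPED events: {skipped}\n"
)

# GitHub Actions provides the summary file path via this environment variable.
with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as handle:
    handle.write(summary)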
