
gitcoin-grants-data-portal's Introduction

Hi there 👋

Data Engineer passionate about Open Source, Open Data, and Open Protocols.

  • 🔭 Data at Protocol Labs. Previously at Buffer.
  • ✨ Interested in Open Source (tools), Open Data (knowledge), and Open Protocols (systems/processes)!
  • 🌱 Learning about decentralized systems, governance, mechanism design, and database internals.
  • 💬 Ask me about data, knowledge management, and remote work!
  • 📝 Digital Handbook: I maintain a personal handbook where I store learnings and other interesting material.

More about me on my website!

gitcoin-grants-data-portal's People

Contributors

davidgasquez, distributeddoge


gitcoin-grants-data-portal's Issues

Draw diagrams to improve docs

I wanted to make a high-level diagram showing how the project is set up at the moment.

Open to suggestions if something could be tweaked.

Excalidraw: https://excalidraw.com/#json=2w27DpwY7oqSSwgdpyUT5,8zyALKj1K5pEKNAOvNPPuQ


I want to make another one focused on the three kinds of data we have (Governance + Chain Data + Donations).

Two options for where to put it, dealer's choice:

  • put it in README.md (maybe replacing the asset graph, which isn't very legible)
  • add some text and put it in architecture.md

Document how to run `publish & deploy` from fork

The instructions given in the README are sufficient to play with the data portal in an interactive manner:

  1. Clone the repo on a local machine OR enter a Codespace, then install dependencies.
  2. Run make dev to spin up Dagster.
  3. Enter the Dagster instance and run whichever models you want.

But what steps are needed to run the publish & deploy GitHub Actions workflow from a fork?

Here is a summary of the steps I had to take (it may be slightly outdated):

  1. Register a Filebase account.
  2. Create a Filebase key and add it to the repository-wide secrets ${{ secrets.FILEBASE_KEY }} + ${{ secrets.FILEBASE_SECRET }}.
  3. Create a Filebase bucket with a new name (gitcoin-grants-data-portal won't work; bucket names have to be unique site-wide).
  4. Change filebaseBucket: gitcoin-grants-data-portal inside /.github/workflows/run.yml to the new bucket name.

That covers the bucket; we may also need to:

  1. Set up GitHub Pages.
  2. Register a Covalent account + add the key to the repository secrets.

Once I have a moment I will put those steps in a small HOW-TO document, on the off-chance that there is a single person somehow interested in running a 100% independent replica of the portal (I found it handy for development to test CI changes and preview how the portal site would look).
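
For a quick sanity check of the fork's Filebase credentials and bucket name before wiring them into the workflow, something like the sketch below can help. It assumes Filebase's S3-compatible endpoint at https://s3.filebase.com; the bucket name and environment variable names are placeholders mirroring the secrets above.

import os

import boto3  # Filebase exposes an S3-compatible API, so boto3 works against it

client = boto3.client(
    "s3",
    endpoint_url="https://s3.filebase.com",
    aws_access_key_id=os.environ["FILEBASE_KEY"],
    aws_secret_access_key=os.environ["FILEBASE_SECRET"],
)

BUCKET = "my-grants-data-portal-fork"  # placeholder: bucket names must be unique site-wide

# Upload a tiny test object and list the bucket to confirm credentials and bucket are valid.
client.put_object(Bucket=BUCKET, Key="healthcheck.txt", Body=b"ok")
for obj in client.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])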

Extract metadata from `rounds` and `round_applications`

Extracting the metadata into standardized columns is very handy; thanks for resolving #10, which does that for the projects table. It feels natural to do the same for the remaining tables.

I took the liberty of doing the same for rounds in #11, mainly to get round titles.

There is also round_applications, which could use a similar treatment. If you want, I can open a PR for that model as well.
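
For reference, a minimal sketch of the kind of extraction meant here, assuming the metadata column already holds parsed Python dicts in the raw DataFrame; the field names (name, title) are assumptions about the round metadata shape, not the actual dbt model.

import pandas as pd


def extract_round_metadata(raw_rounds: pd.DataFrame) -> pd.DataFrame:
    """Pull commonly used fields out of the nested round metadata blob."""
    rounds = raw_rounds.copy()
    meta = rounds["metadata"].apply(lambda m: m or {})
    # Round metadata field names vary; check the real payloads before relying on these keys.
    rounds["round_title"] = meta.apply(lambda m: m.get("name") or m.get("title"))
    return rounds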

Support ENS

We should have the portal at gitcoin-data.eth.

Publish Parquet files under a static endpoint

We're putting Parquet files on IPFS. Every release ends up under a different IPFS hash, which is both confusing and hard to navigate from the UX side.

We will aim to rely on Filebase and offer a static IPNS endpoint for the exported Parquet files.
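
To illustrate the UX difference, a short sketch of reading the exports either through a per-release CID or through a stable IPNS name via a public gateway; both the CID and the IPNS name below are placeholders.

import pandas as pd

# Per-release path: the CID changes on every pipeline run (placeholder hash).
release_cid = "bafybeiplaceholderplaceholderplaceholder"
projects = pd.read_parquet(f"https://ipfs.io/ipfs/{release_cid}/projects.parquet")

# Static path: an IPNS name stays constant across releases and always resolves to the
# latest export (placeholder name; the real one would be published via Filebase).
ipns_name = "k51qzi5uqu5dplaceholderplaceholder"
latest_projects = pd.read_parquet(f"https://ipfs.io/ipns/{ipns_name}/projects.parquet")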

Add IPNS support

Offer a "pet name" way for people to access the latest tables.

  • IPNS
  • ENS

Attestations from Karma GAP

Karma GAP is using on-chain attestations to track self-reported updates and milestones of various Gitcoin projects.

I can grab data from there with the attest.sh GraphQL API.

The bonus is that the logic can be reused to grab different kinds of attestations from other platforms.

https://docs.attest.sh/docs/developer-tools/api

TODO:

  • see what to grab from the Karma GAP schema
  • grab Karma GAP updates from EAS
  • split Karma GAP profiles / grants / milestones

EDIT: Dependencies go Profile => Grant => Milestones.

Profile schema on Optimism (?)
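
A rough sketch of what pulling these attestations could look like, assuming Karma GAP attests on Optimism and using the easscan GraphQL endpoint; the schema UID is a placeholder and the selected fields are assumptions to verify against the actual EAS/Karma GAP schemas.

import requests

EAS_GRAPHQL = "https://optimism.easscan.org/graphql"

# Placeholder: the real Karma GAP schema UID has to be looked up on easscan.
KARMA_GAP_SCHEMA_UID = "0x0000000000000000000000000000000000000000000000000000000000000000"

QUERY = """
query Attestations($schemaId: String!) {
  attestations(where: { schemaId: { equals: $schemaId } }) {
    id
    attester
    recipient
    timeCreated
    decodedDataJson
  }
}
"""

response = requests.post(
    EAS_GRAPHQL,
    json={"query": QUERY, "variables": {"schemaId": KARMA_GAP_SCHEMA_UID}},
    timeout=30,
)
response.raise_for_status()
attestations = response.json()["data"]["attestations"]
print(len(attestations), "attestations fetched")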

Fix `/data` in website

The portal website is using relative paths for /data, which makes /data accessible via any gateway... but makes it error on GitHub Pages.

Handling for static/historic datasets

Working on this one as we speak.

Some of the resources we want to grab:

  • have contents that stay the same
  • are hosted in strange places (e.g. Google Sheets, Notion, GitHub data folders)

We can pin those files on IPFS to simplify the fetching logic and ensure they don't break in the future (see the sketch after this list):

  • create an IPFS_Resource configured with a gateway
  • create a new module assets/static.py with assets representing archival datasets
  • create example assets using this strategy
  • actually submit the PR
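
A minimal sketch of what that could look like, assuming a recent Dagster version with ConfigurableResource; the class name, gateway URL, asset name, and CID are placeholders, not the project's actual implementation, and the resource would still need to be registered in Definitions.

from io import StringIO

import pandas as pd
import requests
from dagster import ConfigurableResource, asset


class IPFSResource(ConfigurableResource):
    """Fetch pinned files through an IPFS HTTP gateway."""

    gateway_url: str = "https://ipfs.io/ipfs"

    def fetch_csv(self, cid: str) -> pd.DataFrame:
        response = requests.get(f"{self.gateway_url}/{cid}", timeout=60)
        response.raise_for_status()
        return pd.read_csv(StringIO(response.text))


@asset
def raw_octant_epoch_zero(ipfs: IPFSResource) -> pd.DataFrame:
    # Placeholder CID: the real one would point at the pinned archival snapshot.
    return ipfs.fetch_csv("bafybeiexampleexampleexampleexampleexample")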

Examples of historic datasets:

  • Carl's RPGF3 data OSO-snapshot
  • cGrants (old Gitcoin data from Umar) (see also: OSO-funding)
  • Octant Epoch 0 (Only epoch to use snapshot for allocation)

Pull infrequent Transactions from Covalent API

For bulk data (e.g. all votes) it is probably best to use the RPC approach outlined in #1, but for events/transactions that are less frequent, the free tier of the Covalent API could work as a lightweight alternative.

As a proof of concept I would start with the Project Registry contract: it has few transactions, so the free Covalent API can fetch them in a reasonable time, across all important chains, without consuming too many free monthly credits.

The implementation outline would be:

  • Find the address of the ProjectRegistry contract on each interesting chain.
  • Create a CovalentAPI resource configured with the covalent_api_key secret.
  • Fetch all transactions targeting the ProjectRegistry contracts.

The end goal is knowing how much gas was spent on creating and updating project profiles on different chains, plus reconciling whatever the Indexer is telling us against another source.
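
A rough sketch of the fetching step, assuming the Covalent v1 transactions_v2 endpoint; the chain names and registry addresses are placeholders, and pagination is ignored here.

import os

import requests

COVALENT_API = "https://api.covalenthq.com/v1"

# Placeholders: the real ProjectRegistry addresses differ per chain and must be looked up.
PROJECT_REGISTRY = {
    "eth-mainnet": "0x0000000000000000000000000000000000000000",
    "matic-mainnet": "0x0000000000000000000000000000000000000000",
}


def fetch_registry_transactions(chain: str, address: str) -> list[dict]:
    """Fetch transactions targeting the registry contract on a single chain."""
    url = f"{COVALENT_API}/{chain}/address/{address}/transactions_v2/"
    response = requests.get(url, params={"key": os.environ["COVALENT_API_KEY"]}, timeout=60)
    response.raise_for_status()
    return response.json()["data"]["items"]


for chain, address in PROJECT_REGISTRY.items():
    transactions = fetch_registry_transactions(chain, address)
    print(chain, len(transactions), "transactions")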

Pull data directly from chain

Currently, we rely on the Allo Indexer API data. We should add an option to pull data straight from the chains using something like cryo or Subsquid. This way, we don't need to trust the Allo API data if that's what we want.

CI step to find broken links

Making changes to either the infrastructure or the website can result in broken links on the portal website.

As a final step of the Publish and Deploy workflow, we can add a "link-checker" GitHub Action that visits the portal and reports if any broken links were encountered.

Having outdated examples is tolerable, but I want to make sure that the Get the Data link on the main portal website is always working as intended.

Site links:

  • configure ScholliYT/Broken-Links-Crawler-Action@v3 to ignore Twitter
  • modify the action to produce a run summary with the broken links

File links:

  • visit the IPNS bucket using the portal navigation link and try to download something (see the sketch below)
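
For the file-links check, a small script along these lines could run after deployment; the portal URL is a placeholder for wherever the real site lives, and the link extraction is deliberately naive.

import re

import requests

# Placeholder: the deployed portal URL.
PORTAL_URL = "https://example.github.io/gitcoin-grants-data-portal/"

page = requests.get(PORTAL_URL, timeout=30)
page.raise_for_status()

# Pull every absolute href and try a HEAD request against it, skipping Twitter links.
links = re.findall(r'href="(https?://[^"]+)"', page.text)
broken = []
for link in links:
    if "twitter.com" in link:
        continue
    try:
        response = requests.head(link, allow_redirects=True, timeout=30)
        if response.status_code >= 400:
            broken.append((link, response.status_code))
    except requests.RequestException as error:
        broken.append((link, str(error)))

print("Broken links:", broken or "none")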

Add `Data` tab in the portal

Besides all the tabs, Notebook (should be Reports or something like that), Catalog, dbt Docs, ... we should add a Data tab that redirects the user to the IPNS hash of the latest datasets. Makes it smoother for users that don't want to read! 🙈

Create dbt tests to ensure data quality

dbt allows for schema testing, where we can declare what we want to see, and run dbt test to see if reality conforms to expectations.

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

I think it is worthwhile to create those for some important models. Since we are already describing the schema, it would be nice to also:

  • tag all addresses so we know at a glance whether, for example, round_id is an actual address or just a random string.
  • make sure all addresses conform to an is_lowercased test for consistency.

Ultimately, we could have dbt test added as a CI step to make sure that runs with malformed or missing assets do not proceed to the IPFS upload stage.
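
Until those tests exist in dbt, a quick ad-hoc check against the published dbt.duckdb file can catch the address-format issues; the table and column names below are assumptions based on the examples above and may need adjusting.

import duckdb

con = duckdb.connect("dbt.duckdb", read_only=True)

# Count round_id values that are not lowercased 0x-addresses (the regex encodes the
# expected format; swap in the real table/column names from the schema).
bad_ids = con.execute(
    """
    select count(*)
    from public.raw_rounds
    where round_id is not null
      and not regexp_matches(round_id, '^0x[0-9a-f]{40}$')
    """
).fetchone()[0]

print(f"{bad_ids} round_id values are not lowercased addresses")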

Consider storing `metadata` columns as well-formed JSON?

The Indexer has several objects (projects, rounds, applications) with metadata served as well-formed, nested JSON.

      "metadata": {
            "signature": "0xacb3be5c327477a5[...]",
            "application": {
                "round": "0xddc627acc685c2a3fa67bc311a5318d1ae2ce899",
...

When inspecting the database tables raw_projects, raw_rounds, and raw_round_applications from the latest release (the dbt.duckdb file), somewhere along the pipeline the metadata (varchar) column in each of those tables was transformed into a Pythonic format that no longer satisfies a JSON parser and instead has to be passed to Python's ast.literal_eval():

{'title': 'FTM Test Project', 'description': "Just a description here 🫣 don't mind me.  "...

I wanted to be able to run a simple DuckDB query to extract interesting fields, like the one below. This won't work as-is, because the single quotes break the JSON parser:

select json_extract_string(metadata,'$.title') from public.raw_projects;

To achieve that quickly, I modified the asset generation step as follows (ggdp/assets.py). The goal is to ensure that the pandas DataFrame generated by Dagster contains a JSON string instead of a collection of Python objects. This seems to be working on my fork.
For future reference, I should probably just replace the metadata column instead of duplicating it.

import json

import pandas as pd
from dagster import asset

@asset
def raw_projects() -> pd.DataFrame:
    # chain_file_aggregator is a helper defined elsewhere in ggdp/assets.py.
    projects = chain_file_aggregator("projects.json")
    # Serialize the nested metadata into a JSON string so DuckDB can parse it.
    projects['json_metadata'] = projects['metadata'].apply(json.dumps)
    return projects

I am writing this to suggest that upstream could also benefit from having the metadata for the raw_ tables in JSON format.

Iterate dbt schema

Moving the #77 tasks here @DistributedDoge!

  • write a description for every dataset
  • write descriptions for all columns where the meaning is not obvious (e.g. anchor_address)
  • create a generic is_ipfs_cid test

Interrupt CI run if `make run` fails

After the merge of #22, multiple tables silently failed to materialize, breaking the portal, as shown in the run logs, which contain a RUN_FAILURE event during the make run step.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "raw_passport_scores"::
ValueError: Unexpected character found when decoding array value (2)

A subsequent styling commit re-triggered the CI, building all the tables. Since the error had occurred in the passport_scores table, which wasn't touched by either of the commits, it is possible it was caused by something going on with the data source.

To prevent such silent regressions, we could use the exit status of make run so that the GitHub CI run is interrupted before the IPFS upload occurs. The problem is that the dagster CLI always returns a 0 exit code, even if the run concludes with a RUN_FAILURE event in the logs.

dagster asset materialize --select * -m ggdp
echo $?

The best solution I can think of at the moment is searching the logs of make run with grep and throwing exit 1 if RUN_FAILURE is encountered.

Opening this one to brainstorm in case there is some less janky way of coercing the dagster asset materialize command to signal that something went wrong during the process.
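
As a Python variant of the grep idea (not necessarily less janky), a thin wrapper could run the materialization, print the output, and exit non-zero when a RUN_FAILURE shows up; the command matches the dagster asset materialize invocation above.

import subprocess
import sys

# Run the same command `make run` uses and capture its combined output.
result = subprocess.run(
    ["dagster", "asset", "materialize", "--select", "*", "-m", "ggdp"],
    capture_output=True,
    text=True,
)
output = result.stdout + result.stderr
print(output)

# The dagster CLI can exit 0 even when assets fail, so also scan the logs for failures.
if result.returncode != 0 or "RUN_FAILURE" in output:
    sys.exit(1)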

Make pipeline incremental

The main idea is to rely on the latest portal data and run smaller incremental updates on CI. We should provide a --full-refresh flag, à la dbt, to build the data from scratch.

This is a big one!

Unify CI files

We can do something like this to run the "push" to IPFS:

    # Conditional step that runs only if it's a PR against the main branch
    - name: Conditional step for PR against main
      if: ${{ github.event_name == 'pull_request' && github.base_ref == 'main' }}
      run: |
        echo "This step runs only for PRs against the main branch"

DNS-setup for portal short-name

I think it is handy to have a pet name for the data portal tables so we can use it like below. This can be achieved with DNS configuration alone, no server needed.

import pandas
pandas.read_parquet('http://grant-data.xyz/projects.parquet')

  • Using an IPNS/IPFS hash feels safer, so we should still encourage that path
  • Using a domain is a good alternative for interactive sessions, one-off scripts, etc.

Which domain registrars actually provide the path redirect feature needed for this setup to work?

  • Namecheap does provide this feature; configuration is very simple.
  • Cloudflare should, according to a blog post, but I did not try it.
  • GoDaddy does not.

It would be interesting to try an official-sounding domain with Cloudflare to see how that would work.

Make sure CI runs on correct branch

I think the root of the problem when merging #27 was this, and now that the repository is using secrets, it may also affect all future pull requests:

With the exception of GITHUB_TOKEN, secrets are not passed to the runner when a workflow is triggered from a forked repository.

Looking online, I see suggestions that replacing pull_request with pull_request_target might give the GitHub CI runner access to the secrets of the host (i.e. this) repository when a pull request is made.

EDIT: Might also need some changes to the checkout step to make sure we execute the code that is supplied inside the pull request.

name: CI
on: 
  pull_request_target:
    branches:
      - main

The downside of allowing secrets in pull requests (malicious PRs) is partially mitigated by "GitHub Actions: Maintainers must approve first time contributor workflow runs".

EDIT: Two possible alternatives:

  • find another way to trigger the CI workflow
  • don't run the assets that use API keys as part of CI, so it won't need secrets

Add Run Summaries to CI

The CI setup allows execution to proceed even though some Dagster jobs have failed.

I would like to use the GitHub run summary to make it more obvious that an asset failed to materialize, without needing to open the log.


The solution is to modify the equivalent of the make run step to produce a logfile and parse it in a new CI step responsible for generating the run summary.

Will open a PR, but first I want to check if I can capture STEP_SKIPPED this way.
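
A minimal sketch of that parsing step, assuming the Dagster output was written to dagster_run.log (hypothetical filename) and that the step appends markdown to the summary file GitHub Actions exposes via GITHUB_STEP_SUMMARY.

import os
from pathlib import Path

# Hypothetical logfile produced by the `make run` step.
log = Path("dagster_run.log").read_text()

# Count the Dagster events worth surfacing in the summary.
failures = log.count("RUN_FAILURE")
skipped = log.count("STEP_SKIPPED")

summary = (
    "## Dagster run\n"
    f"- RUN_FAILURE events: {failures}\n"
    f"- STEP_SKIPPED events: {skipped}\n"
)

# GitHub Actions provides the summary file path via this environment variable.
with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as handle:
    handle.write(summary)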
