Indexer has several objects (projects, rounds, applications
) with metadata
served as well-formed, nested json.
"metadata": {
"signature": "0xacb3be5c327477a5[...]",
"application": {
"round": "0xddc627acc685c2a3fa67bc311a5318d1ae2ce899",
...
When inspecting database tables raw_projects, raw_rounds, raw_round_applications
from the latest release (dbt.duckdb
file) somewhere along the pipeline metadata (varchar)
column in each of those tables was transformed into pythonic format that can no longer satisfy JSON parser, but needs to be passed to python ast.literal_eval()
:
{'title': 'FTM Test Project', 'description': "Just a description here ๐ซฃ don't mind me. "...
I wanted to be able to run a simple Duckdb query to extract interesting fields like one below. This won't work as-is, because single quotes break JSON parser:
select json_extract_string(metadata,'$.title') from public.raw_projects;
To achieve that quickly I modified asset generation step as follows (ggdp/assets.py
). Goal is to ensure that Pandas dataframe generated by dagster contains JSON string instead of a collection of Python objects. This seems to be working on my fork.
For future reference I should probably just replace metadata
column instead of duplicating it.
@asset
def raw_projects() -> pd.DataFrame:
projects = chain_file_aggregator("projects.json")
projects['json_metadata'] = projects['metadata'].apply(json.dumps)
return projects
I am writing this to suggest that upstream could also benefit from having metadata
for raw_
tables in JSON format.