pangeo-forge / user-stories
User stories to guide PF development
As a recipe maintainer
I want to re-run recipes in my feedstock (either manually or on a schedule) to append newly released data to my dataset
So that I can keep the dataset built by my feedstock up-to-date with the latest releases from the data provider without needing to re-run the entire recipe
The ability to trigger append-only production runs (manually or on a schedule) from a feedstock. This might be inferred from the recipe itself, or perhaps specified by a new property in the meta.yaml
As a project owner
I want all images and repos which are affected by releases of pangeo-forge-recipes to be automatically updated with each release of pangeo-forge-recipes
So that I do not have to devote manual toil to syncing all parts of the platform following every release of pangeo-forge-recipes
To start, I thought it would be useful to brain dump a list of everything that we'd want to happen automatically following a pangeo-forge-recipes release, in order of dependency:
0.8.2 was the latest pangeo-forge-recipes release. Since then, a Conda Forge bot noticed that 0.8.3 was available on PyPI, and opened that PR. We have not yet merged that PR. I am unclear if a manual merge is required on the pangeo-forge-recipes-feedstock for every release.
0.8.2 rather than having to install from pip like this PR for 0.8.3
BAKERY_IMAGES dict, xref https://github.com/pangeo-forge/registrar/issues/37
See above
As a project owner
I want to know how to prioritize user stories
So that I can drive growth of key metrics for Pangeo Forge
A process for linking user stories to key metrics we'd like to achieve for the platform, as described in this tweet:
- Is this the group of users whose needs we want to address right now? Why? Are we trying to improve a particular metric for that specific type of user? Are they particularly underserved by the product or important to our business or other goals?
No response
As a project owner
I want to reach a consensus with other project owners regarding best security practices for importing contributed recipes
So that I know what security guardrails to observe while developing new features on Pangeo Forge Cloud
An internal document and/or mutual understanding regarding best practices for importing nominally "untrusted" recipe modules. More details regarding motivating cases in Linked Issues section below.
By way of background, there are currently two places in the Registrar where we automatically create recipe runs in response to a push event:
In the second case, we can assume some Pangeo Forge maintainer (either a project owner or the maintainer of a feedstock) has looked at the code already. There may be risks here due to inattentiveness, etc. but we can leave those for another day.
What I'd like to discuss here is the first case, wherein the submitted code is truly untrusted in the sense that literally anyone in the whole world can make a PR to /staged-recipes, and if it has a properly formatted and complete meta.yaml, then recipe runs will be created for all recipes listed in the meta.yaml. For this reason, I've assumed thus far that we should never actually import the recipe module when automatically creating recipe runs, and that is how the Registrar currently operates.
Certain open User Stories challenge this model, however. Namely:
In both of these cases, without importing the recipe module, we don't have enough information to create recipe runs. Specifically, as #3 is currently conceived, to determine whether or not to re-run a given recipe we would need to call self.sha256() on each of the recipes, in order to compare the resulting hashes to those of the prior run (if any) for the recipe. If the hashes match, we wouldn't create recipe runs at all. And for #10, we wouldn't know the names of the individual recipes within a dict_object without importing the recipe module and introspecting the specified dictionary.
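To make the limitation concrete, here is a minimal sketch of what static inspection can recover without executing a recipe module. It uses only the standard-library ast module. The module source, the variable names, and the literal_dict_keys helper are all hypothetical and are not part of the Registrar. The keys of a literal dict can be read without running any code, but a dict built with a comprehension cannot be resolved this way.

```python
import ast

# Hypothetical recipe module source, held as a string so nothing is executed.
RECIPE_SOURCE = '''
variables = ["sst", "ssh"]
recipes = {f"{v}-recipe": object() for v in variables}
static_recipes = {"a": 1, "b": 2}
'''


def literal_dict_keys(source, name):
    """Return the keys of a top-level dict literal called `name`.

    Returns None when the keys cannot be determined statically, for
    example when the dict is built by a comprehension or a function call.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == name:
                if not isinstance(node.value, ast.Dict):
                    return None  # not a literal: would need execution
                try:
                    return [ast.literal_eval(key) for key in node.value.keys]
                except (ValueError, TypeError):
                    return None  # non-literal key or ** unpacking
    return None


static_keys = literal_dict_keys(RECIPE_SOURCE, "static_recipes")
dynamic_keys = literal_dict_keys(RECIPE_SOURCE, "recipes")
print(static_keys)   # keys recovered without running the module
print(dynamic_keys)  # None: the comprehension is opaque to static parsing
```

Under these assumptions, purely static parsing would cover only the simplest dict objects, which is why the dictionary-comprehension use case appears to require executing contributed code in some form.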
Both of these User Stories have real, already-existing contributors that would like to use them, and from a design perspective would be big improvements to the platform. They would also be specifically useful for the low trust case of creating recipe runs for PRs, so simply saying "we don't support these features on PRs" seems far from ideal.
A few further questions/possibilities to kick off discussion:
importlib equivalent to yaml.safe_load which might be useful in this case?
As a recipe maintainer
I want to be able to see where the data produced by my deployed recipe has been deposited
so that I can perform data-proximate analysis on the data.
For a particular feedstock repo (e.g. https://github.com/pangeo-forge/WOA_1degree_monthly-feedstock), after the recipe has been run in production mode, the following should be possible
No response
As the project owner
I want to get an accurate estimate of how many production datasets and recipe runs have been executed cumulatively and on a weekly basis
so that I can track progress of the project.
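As a rough illustration of the two numbers requested, the sketch below counts recipe runs per ISO week and accumulates them into a running total. The run dates are made-up placeholders and the weekly_counts helper is hypothetical. Where the real timestamps would come from (for example, a database of recipe runs) is not specified here.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Placeholder completion dates for recipe runs (not real project data).
run_dates = [
    date(2022, 9, 5),
    date(2022, 9, 7),
    date(2022, 9, 14),
    date(2022, 9, 28),
]


def weekly_counts(dates):
    """Count runs per (ISO year, ISO week), sorted chronologically."""
    counts = Counter()
    for d in dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return sorted(counts.items())


weekly = weekly_counts(run_dates)
cumulative = list(accumulate(count for _, count in weekly))

for ((year, week), count), total in zip(weekly, cumulative):
    print(f"{year}-W{week:02d}: {count} run(s) this week, {total} cumulative")
```

Weeks with no runs are simply absent from the output. A dashboard built on this idea would probably want to fill them in with zeros.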
No response
As a recipe contributor
I want to be able to use dict objects as defined in ADR-2
So that I can dynamically generate recipe instances in my recipe module using dictionary comprehensions
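A minimal sketch of the pattern this story describes is shown below: a module-level dictionary, built with a comprehension, that maps recipe ids to recipe instances. FakeRecipe, the variable list, the id scheme, and the example.com URLs are all hypothetical stand-ins. The real recipe classes and the exact requirements of ADR-2 are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRecipe:
    """Hypothetical stand-in for a real recipe class."""

    variable: str
    source_url: str


# Assumed list of variables that each need their own recipe.
VARIABLES = ["temperature", "salinity", "oxygen"]

# One recipe instance per variable, keyed by a recipe id.
recipes = {
    f"woa-{var}": FakeRecipe(
        variable=var,
        source_url=f"https://example.com/woa/{var}.nc",
    )
    for var in VARIABLES
}

for recipe_id, recipe in recipes.items():
    print(recipe_id, recipe.source_url)
```

Because the keys only exist once the comprehension has run, anything that needs the recipe ids has to execute the module first.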
As a Pangeo Forge developer
I want to know how to reproduce scale-related execution failures with a cluster deployed from a local (or cloud-based) personal notebook and/or Python session
So that I can debug scale-related problems outside Pangeo Forge Cloud infrastructure
As a recipe maintainer
I want to push commits to the default branch of my feedstock repository, and have the resulting production deployment only rerun new recipes or those that have changed, and not rerun unchanged recipes
So that I can add or update certain recipes in my feedstock without rerunning all of them.
A mechanism to check the hash of all recipes at deployment time, and skip re-running if the hash matches the hash for the same recipe in the last production deployment
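The check described above could look roughly like the following sketch. It assumes each recipe can be reduced to a stable hash. Here a SHA-256 of a JSON-serialized spec stands in for whatever hash the recipe object itself provides. The recipe ids, specs, and both helper functions are hypothetical.

```python
import hashlib
import json


def recipe_hash(recipe_spec):
    """Stand-in for a recipe's own hash: SHA-256 of its canonical JSON form."""
    canonical = json.dumps(recipe_spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def recipes_to_run(current, last_deployed_hashes):
    """Return ids of recipes that are new or whose hash has changed."""
    return [
        recipe_id
        for recipe_id, spec in current.items()
        if last_deployed_hashes.get(recipe_id) != recipe_hash(spec)
    ]


# Hypothetical state recorded at the last production deployment.
previous = {
    "sst-daily": {"url": "https://example.com/sst", "chunks": 10},
    "sst-monthly": {"url": "https://example.com/sst-monthly", "chunks": 12},
}
last_hashes = {rid: recipe_hash(spec) for rid, spec in previous.items()}

# New commit: one recipe changed, one unchanged, one added.
current = {
    "sst-daily": {"url": "https://example.com/sst", "chunks": 20},
    "sst-monthly": {"url": "https://example.com/sst-monthly", "chunks": 12},
    "sst-weekly": {"url": "https://example.com/sst-weekly", "chunks": 4},
}

to_run = recipes_to_run(current, last_hashes)
print(to_run)  # only the changed and the new recipe
```

Storing the hash alongside each production run is what makes the comparison possible. A recipe with no prior run has no stored hash, so it is always scheduled.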
In the order in which they should be merged:
As a recipe contributor
I want pangeo-forge-recipes to support more than one ConcatDim
So that I can write recipes for datasets which require concatenation along more than one dimension
A feature in pangeo-forge-recipes to support more than one ConcatDim
pangeo-forge/pangeo-forge-recipes#140
pangeo-forge/pangeo-forge-recipes#348
As a feedstock contributor
I want one click between my feedstock's deployment page and the dataset_public_url for my dataset
So that, starting from feedstock's GitHub repo, I can easily find a dataset built by my feedstock
From the deployments page of a feedstock repo, an average user should be able to find a dataset_public_url for successful production deployments in one click, without needing to read any documentation or any prior specialized knowledge.
No response
As a recipe contributor
I want to test my recipe on the command line before submitting a pull request
so that I can avoid a slow debugging cycle talking to the Pangeo Forge bot on GitHub. (Example: pangeo-forge/staged-recipes#150)
I run a command like
$ pangeo-forge recipe validate recipe_folder/
and see output like
It looks like your meta.yaml does not conform to the specification.
1 validation error for MetaYaml
pangeo_notebook_version
field required (type=value_error.missing)
or
When I tried to import your recipe module, I encountered this error
line 17, in <module>
fs.ls(url_base + str(year), detail=False)
NameError: name 'fs' is not defined
Please correct your recipe module so that it's importable.
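The import check in the second example could be sketched with the standard library alone, as below. This is not the implementation of any existing pangeo-forge command. check_importable, the module name recipe_under_test, and the deliberately broken recipe file are all hypothetical. Note that the check executes the module, so it is only suitable for code the contributor runs locally on their own machine.

```python
import importlib.util
import tempfile
import traceback
from pathlib import Path


def check_importable(module_path):
    """Execute a recipe module; return None on success or a traceback string."""
    spec = importlib.util.spec_from_file_location("recipe_under_test", module_path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception:
        return traceback.format_exc()
    return None


# Write a deliberately broken recipe module to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    broken = Path(tmp) / "recipe.py"
    broken.write_text(
        "url_base = 'https://example.com/'\n"
        "fs.ls(url_base + '2020', detail=False)\n"
    )
    error = check_importable(str(broken))

if error is not None:
    print("When I tried to import your recipe module, I encountered this error")
    print(error)
    print("Please correct your recipe module so that it's importable.")
```

Returning the traceback as a string, rather than letting the exception propagate, lets a command-line tool wrap it in a friendlier message like the one shown in the story.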
No response
As a data library user
I want to be able to see detailed information (e.g. variables, dimensions, attributes) about the datasets in the catalog
So that I can decide whether I want to use a particular dataset.