.github's Issues

GitHub Action to auto-add from many repos

GitHub Project Workflow tools allow items from repositories to be automatically added to a Project. We would like to automatically add items from many repos, both inside and outside our Org. The DUE team manages 5+ repos in our organization and several in a NASA org.

On our current GH Plan, you are allowed one (1) auto-add workflow per project.

I think we can replicate this with GH Actions. For repos in our org, add an Action to each repo that pushes Issues to the project. For external repos, add an Action to a meta-repo that runs on cron to find new Issues and pull them in (a sketch follows the list below).

  • Pull in Issues from a list of repositories in our organization
  • Pull in Issues from a list of repositories outside our organization
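
For the cron-based meta-repo approach, the script that Action runs could look something like the rough sketch below: it polls the REST API for recently-updated issues and adds them to the Project with the addProjectV2ItemById GraphQL mutation. The repo list, project ID, and "since" timestamp are placeholders, not decisions.

```python
# Sketch: poll external repos for new issues and add them to a GitHub Project.
# The repo list and PROJECT_ID are hypothetical placeholders; GITHUB_TOKEN needs
# project and repo scopes.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
PROJECT_ID = "PVT_XXXXXXXX"  # GraphQL node ID of the target ProjectV2 (placeholder)
EXTERNAL_REPOS = ["nasa-org/some-repo", "other-org/another-repo"]  # placeholders

ADD_ITEM_MUTATION = """
mutation($projectId: ID!, $contentId: ID!) {
  addProjectV2ItemById(input: {projectId: $projectId, contentId: $contentId}) {
    item { id }
  }
}
"""

def recent_issues(repo: str, since: str) -> list[dict]:
    """List open issues updated since the given ISO-8601 timestamp."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/issues",
        headers=HEADERS,
        params={"since": since, "state": "open"},
    )
    resp.raise_for_status()
    # Pull requests also appear in this endpoint; filter them out.
    return [i for i in resp.json() if "pull_request" not in i]

def add_to_project(issue: dict) -> None:
    """Add one issue to the Project using its GraphQL node ID."""
    resp = requests.post(
        "https://api.github.com/graphql",
        headers=HEADERS,
        json={
            "query": ADD_ITEM_MUTATION,
            "variables": {"projectId": PROJECT_ID, "contentId": issue["node_id"]},
        },
    )
    resp.raise_for_status()

if __name__ == "__main__":
    for repo in EXTERNAL_REPOS:
        for issue in recent_issues(repo, since="2023-01-01T00:00:00Z"):
            add_to_project(issue)
```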

Deploy structurizr

https://www.structurizr.com/products

Without hack weeks, I don't know when we'd ever have an opportunity to do this.

Structurizr is a tool for representing C4 architecture diagrams as code, i.e. generating multiple views of architecture from a single code representation.

It can also do diagram reviews, architecture decision records, and more. On-premises deployments have LDAP support.

NSIDC JupyterHub

We used a 2i2c-managed JupyterHub for our QGreenland workshop a couple weeks ago, and had a great experience. A JupyterHub is a cloud-based system for provisioning dedicated JupyterLab instances for users.

That left me wondering how NSIDC could benefit from a similar setup. With the addition of real-time collaboration (like Google Docs, but in Jupyter Notebooks), so many possibilities open up: small groups working on the same notebook together in a tutorial setting, offering live user support in a collaborative computing environment, and pair programming between developers and scientists to explore a problem space.

JupyterLab 4 was announced the other day. This release also includes the real-time collaboration extension at 1.0.0.

Determine useful labels for GitHub projects and develop automation to set them up

GitHub's labels are useful, but the default set is poor, and GitHub lacks any ability to set org-level defaults or share labels between projects (update: this looks like it is no longer the case; see https://docs.github.com/en/organizations/managing-organization-settings/managing-default-labels-for-repositories-in-your-organization). On the plus side, it is possible to programmatically add labels to a repository using the API, so we could develop a repo that defines the shared label set and applies it to our projects.

Some suggestions for labeling best practices or schemes

I personally like the <type>: <info> label style, e.g. Priority: high, Priority: low, Status: Help needed, Status: Not ready, Bug: Critical. The CPython project uses something like this: https://github.com/python/cpython/issues

Tools for setting labels
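
As a sketch of what the shared-label-set repo could do, the snippet below creates labels across a list of repos via the REST API. The label definitions and repo list are made up for illustration, and updating labels that already exist (via PATCH) is left out.

```python
# Sketch: apply a shared label set to several repos via the GitHub REST API.
# Label definitions and repo names are illustrative placeholders.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}"}

LABELS = [
    {"name": "Priority: high", "color": "d73a4a", "description": "Needs attention soon"},
    {"name": "Priority: low", "color": "c2e0c6", "description": "Nice to have"},
    {"name": "Status: Help needed", "color": "0075ca", "description": "Looking for a volunteer"},
]
REPOS = ["nsidc/example-repo-1", "nsidc/example-repo-2"]  # placeholders

for repo in REPOS:
    for label in LABELS:
        resp = requests.post(
            f"https://api.github.com/repos/{repo}/labels",
            headers=HEADERS,
            json=label,
        )
        if resp.status_code == 422:
            # Label already exists; a fuller tool would PATCH it to match the shared set.
            continue
        resp.raise_for_status()
```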

Explore data warehousing tools for NSIDC data

For a typical business, a data warehouse might help business analysts answer questions like "how many sales did we make last week?" via a query to an API or database. Our business is science data, so we may want to ask "How many MB of data did we ingest last week?" or "what files are available for dataset X between dates Y and Z?" or "Where was dataset X migrated to/from and when?" A data warehouse would be the source of truth for information about our dataset inventory.

We have a "data warehouse" system for ECS data in the form of CMR, but the rest of our data is managed solely as items on disk, and it's often not predictable where on disk that data will be or how to find the data you're interested in. Was it migrated to a new datapool recently? How do we determine the date a particular file corresponds with (we currently have to know or discover ourselves "where in the filename is the time?", "what format, e.g. YYYYMMDD, or YYYYDOY, or something else?", "Where are the gaps in coverage?")

It would be useful to provide a service that enables users to:

  • Create new records when adding new data files to a dataset
  • Update records when changing files in a dataset, e.g. migrating from one datapool to another.
  • Query for records in a dataset by date/time/other dimensions
  • Perform analysis and generate reports on stored datasets, e.g. dataset rate of growth, gaps in coverage, etc.

If we ran a tool like MinIO in front of all of our datapools, a data warehouse could return S3 URLs for datasets, and access could be done over the S3 protocol instead of requiring disk mounts. This would make the transition to the cloud more transparent for our apps.
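
To make that last point concrete, here is a minimal sketch of what app-side access could look like with datapools behind a MinIO (or other S3-compatible) endpoint. The endpoint, bucket, and prefix names are hypothetical.

```python
# Sketch: list and fetch dataset files over the S3 protocol instead of a disk mount.
# Endpoint, bucket, and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.nsidc.example.org",  # placeholder MinIO endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

# "What files are available for dataset X between dates Y and Z?" would come from
# the warehouse as a list of keys; here we just list everything under a prefix.
response = s3.list_objects_v2(Bucket="datapool-1", Prefix="NSIDC-0001/2023/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single granule without any disk mount.
s3.download_file("datapool-1", "NSIDC-0001/2023/example_granule.nc", "example_granule.nc")
```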

Explore workflow management solutions for data processing

We currently use Luigi frequently for managing data processing pipelines. Pipelines are represented as DAGs in Python code and are executed on a single host with a configurable number of workers. Luigi is fairly stagnant and feature-light (e.g. it supports retries, but that support has been found lacking on various projects). We often deploy dedicated VMs to run Luigi workloads, and those VMs sit idle a large portion of the time. It would be better if we had a production cluster of machines dedicated to running data workflows, which would enable easier management of processing resources.

I think we should avoid a workflow system that requires a cluster, and instead choose one that fully supports local execution for ease of development and testing. When we deploy to production, using a single workflow management system gives us:

  • Broad observability of multiple workflows from a "single pane of glass"
  • Ability to more effectively utilize our compute resources
  • Ability to distribute workloads across multiple machines

Some open-source tools to consider (a minimal local-execution sketch with one of them follows this list):

  • Prefect
  • Dagster
  • Airflow
  • ...?
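
To illustrate the local-execution requirement, here is a minimal sketch of a pipeline expressed in Prefect (one of the candidates above). The task names and retry settings are illustrative only; the same flow could later be deployed onto shared infrastructure.

```python
# Sketch: a small data pipeline expressed as Prefect tasks and a flow.
# Runs locally as a plain Python script; task names are illustrative only.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_granule(granule_id: str) -> str:
    # Placeholder: download or locate an input file.
    return f"/tmp/{granule_id}.nc"

@task
def process_granule(path: str) -> str:
    # Placeholder: transform the input into an output product.
    return path.replace(".nc", ".processed.nc")

@flow
def daily_pipeline(granule_ids: list[str]):
    for granule_id in granule_ids:
        raw = fetch_granule(granule_id)
        product = process_granule(raw)
        print(f"produced {product}")

if __name__ == "__main__":
    # Local execution for development/testing; production could use a deployment
    # on shared compute instead.
    daily_pipeline(["G0001", "G0002"])
```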

Revive the NSIDC technical blog

Did you know we have an NSIDC Technology Blog [src]? Nobody has contributed to it since before I started working here >7 years ago, but I happened upon it when I found the repo on GitHub. It has only one real post.

I'm thinking about restarting this blog next hack week with a series of posts on new scientific Python / open-source ecosystem tools from the last few years (mamba, ruff, pre-commit, conda-lock, the growth of type annotations/checking). I'd also like to write a post on being a conda-forge feedstock maintainer (with @betolink?).

I'd like to replace Jekyll with Quarto for a simpler authoring workflow with more technical publishing features (e.g. running code and producing figures at build time).

Inventory source code

We want an inventory of all our source code to make sure that we are meeting our own expectations (a rough sketch for automating this check follows the list). Those expectations include:

  • License
  • README
  • Code of Conduct
  • SLO statement
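
A rough sketch of how this inventory could be automated against the REST API is below. The org name and the filenames checked are assumptions, and the SLO-statement check is left out because we haven't settled on a convention for where it lives.

```python
# Sketch: inventory org repos for expected community/compliance files.
# The org name and the exact filenames checked are assumptions for illustration.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
ORG = "nsidc"  # placeholder
EXPECTED_FILES = ["LICENSE", "README.md", "CODE_OF_CONDUCT.md"]  # SLO check TBD

def org_repos(org: str) -> list[dict]:
    """Page through all repositories in the organization."""
    repos, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            headers=HEADERS,
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return repos
        repos.extend(batch)
        page += 1

def has_file(repo_full_name: str, path: str) -> bool:
    """True if the file exists at the repository root on the default branch."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo_full_name}/contents/{path}",
        headers=HEADERS,
    )
    return resp.status_code == 200

for repo in org_repos(ORG):
    missing = [f for f in EXPECTED_FILES if not has_file(repo["full_name"], f)]
    if missing:
        print(f"{repo['full_name']}: missing {', '.join(missing)}")
```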

Create README.md as GH organization "landing page"

NSIDC GitHub organization should have a README landing page.

This is done with a special .github repository that allows NSIDC to customize its GitHub landing page.

Create a public repository called .github. Once created, add a /profile/README.md. The README will appear on your organization's profile, visible to anyone.

Documentation: https://docs.github.com/en/organizations/collaborating-with-groups-in-organizations/customizing-your-organizations-profile

Example: https://github.com/GEUS-Glaciology-and-Climate/ (for an Org; nothing fancy) or https://github.com/rougier/ (for a person; fancy).

Create default community health files (CODE_OF_CONDUCT.md, CONTRIBUTING.md, SUPPORT.md) for the whole organization

Explore logging as a service

Discussion: https://nsidc.slack.com/archives/C4UCJ1NAF/p1666721015728609

Some things to look at:

  • ELK Stack (Elasticsearch, Logstash, Kibana): Maybe too heavy and brittle for our use case?
  • Graylog: Uses Elasticsearch under the hood for searching.
  • Grafana Loki: "Like Prometheus, but for logs" (we already run Prometheus and Grafana on NSIDC hardware for monitoring VMs). Requires special labels on log messages to do its indexing effectively.

Questions

  • How can we make our logging system resilient to changes in log storage backends? E.g. using Vector (https://github.com/vectordotdev/vector) would allow us to log to Vector and broadcast those logs to multiple backends, or switch backends in-flight.
  • Can our tool of choice ingest from our existing text log files to populate a history?
  • Can it ingest from live text files being written to by existing apps not configured for network logging? (e.g. ELK Stack's "filebeat" forwarder)
  • What protocol(s) is/are used to pass logs to the service? (a minimal stdlib syslog sketch follows this list)
  • How good/easy is the LDAP auth integration setup?
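
On the protocol question: for apps we control, network logging could be as simple as attaching a standard-library syslog handler pointed at whatever collector we choose. A minimal sketch, with the collector host/port as placeholders:

```python
# Sketch: forward application logs to a central collector over syslog.
# Host/port are placeholders; the collector could be Vector, Logstash, Graylog,
# or promtail (for Loki), depending on the stack we choose.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=("logs.nsidc.example.org", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("ingest complete")
```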

References/Links

Create a "Quarto project -> GH Pages" publishing action for the GitHub Actions Marketplace

This would enable people who want to publish a Quarto project to Pages to set up their workflow quickly and easily from the GitHub Marketplace instead of having to write some YAML from scratch. We could also support features that may be out of scope for the official Quarto actions, e.g. setting up Jupyter/Knitr, or setting up a fully custom dependency environment with Conda, Pip, or other package managers.

We should open a ticket with the Quarto team to see if configuring dependency environments is in scope for their project, in which case we can contribute our efforts there. Also, the current Quarto actions do not meet the requirements for publication to the Marketplace, and I'm not sure if that's something they consider a priority.
