.github's Issues

GitHub Action to auto-add from many repos

GitHub Project Workflow tools allow items from repositories to be automatically added to a Project. We would like to automatically add items from many repos, both inside and outside our Org. The DUE team manages 5+ repos in our organization and several in a NASA org.

On our current GH Plan, you are allowed one (1) auto-add workflow per project.

I think we can replicate this with GH Actions. For repos in our org, add an Action to each repo that pushes Issues to the project. For external repos, add an Action to a meta-repo that runs on cron to find new Issues and pull them in (a sketch follows the list below).

  • Pull in Issues from a list of repositories in our organization
  • Pull in Issues from a list of repositories outside our organization
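
For the cron-based meta-repo approach, the script that Action runs could look something like the rough sketch below: it polls the REST API for recently-updated issues and adds them to the Project with the addProjectV2ItemById GraphQL mutation. The repo list, project ID, and "since" timestamp are placeholders, not decisions.

```python
# Sketch: poll external repos for new issues and add them to a GitHub Project.
# The repo list and PROJECT_ID are hypothetical placeholders; GITHUB_TOKEN needs
# project and repo scopes.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
PROJECT_ID = "PVT_XXXXXXXX"  # GraphQL node ID of the target ProjectV2 (placeholder)
EXTERNAL_REPOS = ["nasa-org/some-repo", "other-org/another-repo"]  # placeholders

ADD_ITEM_MUTATION = """
mutation($projectId: ID!, $contentId: ID!) {
  addProjectV2ItemById(input: {projectId: $projectId, contentId: $contentId}) {
    item { id }
  }
}
"""

def recent_issues(repo: str, since: str) -> list[dict]:
    """List open issues updated since the given ISO-8601 timestamp."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/issues",
        headers=HEADERS,
        params={"since": since, "state": "open"},
    )
    resp.raise_for_status()
    # Pull requests also appear in this endpoint; filter them out.
    return [i for i in resp.json() if "pull_request" not in i]

def add_to_project(issue: dict) -> None:
    """Add one issue to the Project using its GraphQL node ID."""
    resp = requests.post(
        "https://api.github.com/graphql",
        headers=HEADERS,
        json={
            "query": ADD_ITEM_MUTATION,
            "variables": {"projectId": PROJECT_ID, "contentId": issue["node_id"]},
        },
    )
    resp.raise_for_status()

if __name__ == "__main__":
    for repo in EXTERNAL_REPOS:
        for issue in recent_issues(repo, since="2023-01-01T00:00:00Z"):
            add_to_project(issue)
```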

Deploy structurizr

https://www.structurizr.com/products

Without hack weeks, I don't know when we'd ever have an opportunity to do this.

Structurizr is a tool for representing C4 architecture diagrams as code, i.e. generating multiple views of architecture from a single code representation.

It can also do diagram reviews, architecture decision records, and more. On-premises deployments have LDAP support.

NSIDC JupyterHub

We used a 2i2c-managed JupyterHub for our QGreenland workshop a couple weeks ago, and had a great experience. A JupyterHub is a cloud-based system for provisioning dedicated JupyterLab instances for users.

That left me wondering how NSIDC could benefit from a similar setup. With the addition of real-time collaboration (like Google Docs, but in Jupyter Notebooks), so many possibilities open up: small groups working on the same notebook together in a tutorial setting, offering live user support in a collaborative computing environment, and pair programming between developers and scientists to explore a problem space.

JupyterLab 4 was announced the other day. This release also includes the real-time collaboration extension at 1.0.0.

Determine useful labels for GitHub projects and develop automation to set them up

GitHub's labels are useful, but the default set is poor, and GitHub lacks any ability to set org-level defaults or share labels between projects (update: this looks like it is no longer the case; see https://docs.github.com/en/organizations/managing-organization-settings/managing-default-labels-for-repositories-in-your-organization). On the plus side, it is possible to programmatically add labels to a repository using the API, so we could develop a repo that defines the shared label set and applies it to our projects.

Some suggestions for labeling best practices or schemes

I personally like the <type>: <info> label style, e.g. Priority: high, Priority: low, Status: Help needed, Status: Not ready, Bug: Critical. The CPython project uses something like this: https://github.com/python/cpython/issues

Tools for setting labels
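
As a sketch of what the shared-label-set repo could do, the snippet below creates labels across a list of repos via the REST API. The label definitions and repo list are made up for illustration, and updating labels that already exist (via PATCH) is left out.

```python
# Sketch: apply a shared label set to several repos via the GitHub REST API.
# Label definitions and repo names are illustrative placeholders.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}"}

LABELS = [
    {"name": "Priority: high", "color": "d73a4a", "description": "Needs attention soon"},
    {"name": "Priority: low", "color": "c2e0c6", "description": "Nice to have"},
    {"name": "Status: Help needed", "color": "0075ca", "description": "Looking for a volunteer"},
]
REPOS = ["nsidc/example-repo-1", "nsidc/example-repo-2"]  # placeholders

for repo in REPOS:
    for label in LABELS:
        resp = requests.post(
            f"https://api.github.com/repos/{repo}/labels",
            headers=HEADERS,
            json=label,
        )
        if resp.status_code == 422:
            # Label already exists; a fuller tool would PATCH it to match the shared set.
            continue
        resp.raise_for_status()
```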

Explore data warehousing tools for NSIDC data

For a typical business, a data warehouse might help business analysts answer questions like "how many sales did we make last week?" via a query to an API or database. Our business is science data, so we may want to ask "How many MB of data did we ingest last week?" or "what files are available for dataset X between dates Y and Z?" or "Where was dataset X migrated to/from and when?" A data warehouse would be the source of truth for information about our dataset inventory.

We have a "data warehouse" system for ECS data in the form of CMR, but the rest of our data is managed solely as items on disk, and it's often not predictable where on disk that data will be or how to find the data you're interested in. Was it migrated to a new datapool recently? How do we determine the date a particular file corresponds with (we currently have to know or discover ourselves "where in the filename is the time?", "what format, e.g. YYYYMMDD, or YYYYDOY, or something else?", "Where are the gaps in coverage?")

It would be useful to provide a service that enables users to:

  • Create new records when adding new data files to a dataset
  • Update records when changing files in a dataset, e.g. migrating from one datapool to another.
  • Query for records in a dataset by date/time/other dimensions
  • Perform analysis and generate reports on stored datasets, e.g. dataset rate of growth, gaps in coverage, etc.

If we ran a tool like MinIO in front of all of our datapools, a data warehouse could return S3 URLs for datasets, and access could be done over the S3 protocol instead of requiring disk mounts. This would make the transition to the cloud more transparent for our apps.
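
To make that last point concrete, here is a minimal sketch of what app-side access could look like with datapools behind a MinIO (or other S3-compatible) endpoint. The endpoint, bucket, and prefix names are hypothetical.

```python
# Sketch: list and fetch dataset files over the S3 protocol instead of a disk mount.
# Endpoint, bucket, and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.nsidc.example.org",  # placeholder MinIO endpoint
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

# "What files are available for dataset X between dates Y and Z?" would come from
# the warehouse as a list of keys; here we just list everything under a prefix.
response = s3.list_objects_v2(Bucket="datapool-1", Prefix="NSIDC-0001/2023/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single granule without any disk mount.
s3.download_file("datapool-1", "NSIDC-0001/2023/example_granule.nc", "example_granule.nc")
```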

Explore workflow management solutions for data processing

We currently use Luigi frequently for managing data processing pipelines. Pipelines are represented as DAGs in Python code and are executed on a single host with a configurable number of workers. Luigi is fairly stagnant and feature-light (e.g. it supports retries, but that support has been found lacking on various projects). We often deploy dedicated VMs to run Luigi workloads, and those VMs sit idle a large portion of the time. It would be better if we had a production cluster of machines dedicated to running data workflows, which would enable easier management of processing resources.

I think we should avoid a workflow system that requires a cluster, and instead choose one that fully supports local execution for ease of development and testing. When we deploy to production, using a single workflow management system gives us:

  • Broad observability of multiple workflows from a "single pane of glass"
  • Ability to more effectively utilize our compute resources
  • Ability to distribute workloads across multiple machines

Some open-source tools to consider (a minimal local-execution sketch with one of them follows this list):

  • Prefect
  • Dagster
  • Airflow
  • ...?
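
To illustrate the local-execution requirement, here is a minimal sketch of a pipeline expressed in Prefect (one of the candidates above). The task names and retry settings are illustrative only; the same flow could later be deployed onto shared infrastructure.

```python
# Sketch: a small data pipeline expressed as Prefect tasks and a flow.
# Runs locally as a plain Python script; task names are illustrative only.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def fetch_granule(granule_id: str) -> str:
    # Placeholder: download or locate an input file.
    return f"/tmp/{granule_id}.nc"

@task
def process_granule(path: str) -> str:
    # Placeholder: transform the input into an output product.
    return path.replace(".nc", ".processed.nc")

@flow
def daily_pipeline(granule_ids: list[str]):
    for granule_id in granule_ids:
        raw = fetch_granule(granule_id)
        product = process_granule(raw)
        print(f"produced {product}")

if __name__ == "__main__":
    # Local execution for development/testing; production could use a deployment
    # on shared compute instead.
    daily_pipeline(["G0001", "G0002"])
```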

Revive the NSIDC technical blog

Did you know we have an NSIDC Technology Blog [src]? Nobody has contributed to it since before I started working here >7 years ago, but I happened upon it when I found the repo on GitHub. It has only one real post.

I'm thinking about restarting this blog next hack week with a series of posts on new scientific Python / open-source ecosystem tools from the last few years (mamba, ruff, pre-commit, conda-lock, the growth of type annotations/checking). I'd also like to write a post on being a conda-forge feedstock maintainer (with @betolink?).

I'd like to replace Jekyll with Quarto for a simpler authoring workflow with more technical publishing features (e.g. running code and producing figures at build time).

Inventory source code

We want an inventory of all our source code to make sure that we are meeting our own expectations (a rough sketch for automating this check follows the list). Those expectations include:

  • License
  • README
  • Code of Conduct
  • SLO statement
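
A rough sketch of how this inventory could be automated against the REST API is below. The org name and the filenames checked are assumptions, and the SLO-statement check is left out because we haven't settled on a convention for where it lives.

```python
# Sketch: inventory org repos for expected community/compliance files.
# The org name and the exact filenames checked are assumptions for illustration.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
ORG = "nsidc"  # placeholder
EXPECTED_FILES = ["LICENSE", "README.md", "CODE_OF_CONDUCT.md"]  # SLO check TBD

def org_repos(org: str) -> list[dict]:
    """Page through all repositories in the organization."""
    repos, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            headers=HEADERS,
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return repos
        repos.extend(batch)
        page += 1

def has_file(repo_full_name: str, path: str) -> bool:
    """True if the file exists at the repository root on the default branch."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo_full_name}/contents/{path}",
        headers=HEADERS,
    )
    return resp.status_code == 200

for repo in org_repos(ORG):
    missing = [f for f in EXPECTED_FILES if not has_file(repo["full_name"], f)]
    if missing:
        print(f"{repo['full_name']}: missing {', '.join(missing)}")
```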

Create README.md as GH organization "landing page"

NSIDC GitHub organization should have a README landing page.

This is done with a special .github repository that allows NSIDC to customize its GitHub landing page.

Create a public repository called .github. Once created, add a /profile/README.md. The README will appear on your organization's profile, visible to anyone.

Documentation: https://docs.github.com/en/organizations/collaborating-with-groups-in-organizations/customizing-your-organizations-profile

Example: https://github.com/GEUS-Glaciology-and-Climate/ (for an Org; nothing fancy) or https://github.com/rougier/ (for a person; fancy).

Create default community health files (CODE_OF_CONDUCT.md, CONTRIBUTING.md, SUPPORT.md) for the whole organization

Explore logging as a service

Discussion: https://nsidc.slack.com/archives/C4UCJ1NAF/p1666721015728609

Some things to look at:

  • ELK Stack (Elasticsearch, Logstash, Kibana): Maybe too heavy and brittle for our use case?
  • Graylog: Uses Elasticsearch under the hood for searching.
  • Grafana Loki: "Like Prometheus, but for logs" (we already run Prometheus and Grafana on NSIDC hardware for monitoring VMs). Requires special labels on log messages to do its indexing effectively.

Questions

  • How can we make our logging system resilient to changes in log storage backends? E.g. using Vector (https://github.com/vectordotdev/vector) would allow us to log to Vector and broadcast those logs to multiple backends, or switch backends in-flight.
  • Can our tool of choice ingest from our existing text log files to populate a history?
  • Can it ingest from live text files being written to by existing apps not configured for network logging? (e.g. ELK Stack's "filebeat" forwarder)
  • What protocol(s) is/are used to pass logs to the service? (a minimal stdlib syslog sketch follows this list)
  • How good/easy is the LDAP auth integration setup?
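
On the protocol question: for apps we control, network logging could be as simple as attaching a standard-library syslog handler pointed at whatever collector we choose. A minimal sketch, with the collector host/port as placeholders:

```python
# Sketch: forward application logs to a central collector over syslog.
# Host/port are placeholders; the collector could be Vector, Logstash, Graylog,
# or promtail (for Loki), depending on the stack we choose.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=("logs.nsidc.example.org", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("ingest complete")
```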

References/Links

Create a "Quarto project -> GH Pages" publishing action for the GitHub Actions Marketplace

This would enable people who want to publish a Quarto project to Pages to set up their workflow quickly and easily from the GitHub Marketplace instead of having to write some YAML from scratch. We could also support features that may be out of scope for the official Quarto actions, e.g. setting up Jupyter/Knitr, or setting up a fully custom dependency environment with Conda, Pip, or other package managers.

We should open a ticket with the Quarto team to see if configuring dependency environments is in scope for their project, in which case we can contribute our efforts there. Also, the current Quarto actions do not meet the requirements for publication to the Marketplace, and I'm not sure if that's something they consider a priority.
