
pangeo-cloud-federation's People

Contributors

amanda-tan, apawloski, arbennett, bartnijssen, brian-rose, consideratio, cspencerjones, dependabot[bot], dsludwig, jcrist, jrbourbeau, lsetiawan, nicwayand, ocefpaf, rabernat, raphaeldussin, rsignell-usgs, salvis2, scottyhq, shanicetbailey, tjcrone, tomaugspurger, xjonjos, yuvipanda


pangeo-cloud-federation's Issues

Example of why we need to pin?

I tried my new non-pinned environment on dev.pangeo.io, and when I try to start a cluster, I get:

distributed.core - ERROR - add_worker() got an unexpected keyword argument 'cpu'
Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/distributed/core.py", line 340, in handle_comm
    result = yield result
  File "/srv/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/srv/conda/lib/python3.6/site-packages/tornado/gen.py", line 307, in wrapper
    result = func(*args, **kwargs)

So I guess the latest conda-forge packages distributed=1.23.3 and tornado=5.1.1 don't play nice together, right?

So I should retreat to distributed=1.22.1 and tornado=5.0.2 and try again, right?

Just checking this is the right approach. I won't ask this every time, I promise.
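For reference, the pinning would look something like this in the environment file (a minimal sketch using the versions suggested above):

name: pangeo
channels:
  - conda-forge
dependencies:
  # pin both packages to the last known-good combination
  - distributed=1.22.1
  - tornado=5.0.2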

aws authentication

@apawloski, @scottyhq, and I have been working on getting this repo set up with a new NASA deployment running on AWS. We've been making some changes to hubploy in berkeley-dsep-infra/hubploy#14 and have added most of the necessary bits to this repo.

We're currently stuck getting the EKS authentication to work. This could just be how we've set up the IAM account, but I'm posting here to coordinate the final pieces. The current error is:

could not get token: NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Error: Get https://958CB6CD107C87EEDAA83BFFEE9EEAFA.sk1.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: getting credentials: exec: exit status 1

The command aws eks update-kubeconfig --name pangeo succeeds but is not giving us sufficient privileges. I'm hoping @apawloski knows what to do here.
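If it is an IAM mapping problem, the usual EKS fix is adding the deploying identity to the aws-auth ConfigMap, since EKS only trusts identities mapped there. A sketch (the ARN is a placeholder, and system:masters is the bluntest possible grant):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapUsers: |
    # placeholder ARN -- substitute the IAM user the CI deploy runs as
    - userarn: arn:aws:iam::123456789012:user/hubploy-deployer
      username: hubploy-deployer
      groups:
        - system:masters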

rename this repo?

We're going to start consolidating a number of separate JupyterHubs into this repo. The repo name dev.pangeo.io-deploy will soon fail to describe what is here. What should we rename this repo to?

Make sure contents of image are present in homedir

When building images with repo2docker, we currently put everything into $HOME by default. This works great for Binder.

However, when running with a persistent JupyterHub, $HOME gets mounted over by the persistent home directory. This means nothing in the repo is visible to users, and things like the 'start' script don't work.
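One possible workaround (a sketch, not an adopted fix: it assumes the build is changed to stash a copy of the repo at /srv/repo) is to copy anything missing into $HOME after the persistent volume is mounted, e.g. from a repo2docker start script:

#!/bin/bash
# copy repo contents into the persistent home without clobbering user files
cp -rn /srv/repo/. "$HOME"/
exec "$@"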

how much cpu / memory can notebook pods use?

The biggest possible pod we allow in ocean.pangeo.io is defined by the profile_list entry:

'display_name': 'x-large (n1-highmem-16 | 16 cores, 96GB RAM)',
'kubespawner_override': {
    'cpu_limit': 16,
    'cpu_guarantee': 14,
    'mem_limit': '100G',
    'mem_guarantee': '96G',
},

We have a nodepool with n1-highmem-16 (16 vCPUs, 104 GB memory) nodes. However, when I try to launch the x-large profile, the event log shows

Server requested
2019-03-06 17:49:34+00:00 [Warning] 0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient memory.
2019-03-06 17:49:48+00:00 [Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)

How much headroom do we need between the pod resource requests and the node capacity? I would think that 14 cpus and 96GB of memory would fit on a 16 vCPU / 104GB memory node. How can we debug this?
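One thing to check: Kubernetes schedules against a node's allocatable resources, not its raw capacity, and GKE reserves a slice of each node for the kubelet and system daemons, so a 104GB node advertises noticeably less than 104GB. Comparing the pod's requests against what a node actually advertises should show the gap:

# show what the scheduler actually has to work with on a node
kubectl describe node <node-name> | grep -A 6 Allocatable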

admin users don't work on globus

This is how the ocean staging.yml secrets look:

pangeo:
  jupyterhub:
    proxy:
      secretToken: XXX
    auth:
      type: globus
      globus:
        clientId: "XXX"
        clientSecret: "XXX"
        callbackUrl: "https://staging.ocean.pangeo.io/hub/oauth_callback"
        identityProvider: "orcid.org"
        admin:
          access: true
          users:
            - 0000-0001-7479-8439 # Joe Hamman
            - 0000-0001-5999-4917 # Ryan Abernathey
            - 0000-0003-4004-4553 # Raphael Dussin

The hub startup log says

Loading /etc/jupyterhub/config/values.yaml
Loading /etc/jupyterhub/secret/values.yaml
Loading extra config: customPodHook
Loading extra config: profile_list
[I 2019-03-07 00:08:39.981 JupyterHub app:1673] Using Authenticator: oauthenticator.globus.GlobusOAuthenticator-0.8.1
[I 2019-03-07 00:08:39.981 JupyterHub app:1673] Using Spawner: kubespawner.spawner.KubeSpawner
[I 2019-03-07 00:08:39.981 JupyterHub app:1016] Loading cookie_secret from /srv/jupyterhub/jupyterhub_cookie_secret
[W 2019-03-07 00:08:40.079 JupyterHub app:1131] JupyterHub.hub_connect_port is deprecated as of 0.9. Use JupyterHub.hub_connect_url to fully specify the URL for connecting to the Hub.
[W 2019-03-07 00:08:40.081 JupyterHub app:1173] No admin users, admin interface will be unavailable.
[W 2019-03-07 00:08:40.082 JupyterHub app:1174] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2019-03-07 00:08:40.082 JupyterHub app:1201] Not using whitelist. Any authenticated user will be allowed.

Note the "No admin users" warning. What's wrong?

Hubploy and multiple user images

Cross-posting from pangeo-data/pangeo#348

We should try this out here. I'm curious if @dsludwig has any idea of how hubploy / repo2docker could handle this. I'm wondering if we'll need to reconfigure things a bit in hubploy to support this. Is anyone interested in giving this a go?

KubeCluster can't start workers on staging.ocean.pangeo.io

Finally fixed #148, and now we can use the dask labextension to start KubeCluster schedulers on staging.ocean.pangeo.io.

The next problem is that launching dask workers from KubeClusters apparently doesn't work at all, whether I start them from the lab extension or from notebook code. kubectl -n ocean-staging get pods shows no recent pending dask-jovyan- pods. It does, however, have some older dask-jovyan- pods (e.g. dask-jovyan-bc51065a-9nhsmh) for which the GCP console tells me:

  • PodUnschedulable
    Cannot schedule pods: Insufficient cpu.
  • PodUnschedulable
    Cannot schedule pods: Insufficient memory.
  • PodUnschedulable
    Cannot schedule pods: node(s) didn't match node selector.

Could this be related to the node selector?

nodeSelector:
  alpha.eksctl.io/nodegroup-name: dask-worker
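A quick check worth running (a sketch): if no node carries that label, the worker pods can never schedule, and the autoscaler won't add a matching node either.

# list node labels and look for the dask-worker nodegroup label
kubectl get nodes --show-labels | grep dask-worker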

This is a pretty big problem, since these clusters are our killer feature.

move nfs configuration to deployments instead of under pangeo-deploy

We're running into an issue with helm upgrade when deploying to AWS that is related to the NFS configuration settings under the shared pangeo-deploy directory. This is resolved by deleting those settings, but doing so will presumably affect the GCP deployments:
scottyhq@6dd4f18

Should we ensure whatever is under pangeo-deploy is as bare-bones as possible and not linked to specific cloud providers or deployments?

helm upgrade --wait --install --namespace nasa-staging nasa-staging pangeo-deploy -f deployments/nasa/config/common.yaml -f deployments/nasa/config/staging.yaml -f deployments/nasa/secrets/staging.yaml --set jupyterhub.singleuser.image.tag=a7ff12a --set jupyterhub.singleuser.image.name=pangeo/nasa-pangeo-io-not
Error: release nasa-staging failed: PersistentVolume "nasa-staging-home-nfs" is invalid: spec.nfs.server: Required value
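One way to keep the shared chart bare-bones would be to guard the NFS objects behind a values flag, so deployments that don't configure a server skip the template entirely. A sketch (the values key is assumed for illustration):

# templates/home-storage.yaml -- render the PV only when an NFS server is set
{{- if .Values.nfs.server }}
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {{ .Release.Name }}-home-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: {{ .Values.nfs.server }}
    path: "/"
{{- end }}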

Any issues moving pangeo.esipfed.org to this framework?

I'm currently running pangeo.esipfed.org on the pangeo-access AWS kops cluster the old way: manually executing the docker build, pushing to Docker Hub, and re-upping the helm chart.

I would like to move to the new approach. Anything I should be aware of?

ocean docker images are not getting pushed

Hubploy is not pushing my notebook docker images. Consequently, I am getting errors in my hub like:

Failed to pull image "us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be": rpc error: code = Unknown desc = Error response from daemon: manifest for us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be not found

It seems like the commit f1112be is getting built on the PR branch (#97), but on the deploy job hubploy determines that nothing needs to be done.

#!/bin/bash -eo pipefail
hubploy build ocean --commit-range ${COMMIT_RANGE} --push
Activated service account credentials for: [[email protected]]
WARNING: `docker` not in system PATH.
`docker` and `docker-credential-gcloud` need to be in the same PATH in order to work correctly together.
gcloud's Docker credential helper can be configured but it will not work until this is corrected.
gcloud credential helpers already registered correctly.
Image us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be: already up to date

As a result, the image is never pushed.

Relevant circleci config is here:
https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/.circleci/config.yml#L125-L129
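To confirm the tag really is absent from the registry (rather than hubploy comparing against a stale local manifest), we can query GCR directly:

# lists matching tags; empty output means the image was never pushed
gcloud container images list-tags us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook --filter="tags:f1112be"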

Setting up "IMAGE_NAME"

I am in the process of setting this up on CircleCI, and the one thing I am still a bit lost on is the exact steps I would need to take to create the image that goes into IMAGE_NAME.

Is there a Github repo somewhere with the Dockerfile (or multiple) for some generic pangeo images that I can customize?

recap from today's work

@raphaeldussin, @rabernat and I worked on this repo a bit today. Here are my notes on what we worked on and how we did a few pieces:

  1. shutdown example.pangeo.io
    • done
    • how did I do this:
      • First I deleted the kubernetes cluster from gcp
      • Then I removed all the pvcs from the compute-engine/disks menu on gcp
      • Then I archived the repo
  2. create new hubs/namespaces in dev.pangeo.io-deploy for ocean/atmos/hydro/astro/polar Pangeos
    • Q1: How to add new namespaces/hubs?
      • cp an existing deployment (under deployments)
      • change the deployment configs
      • add lines to the CircleCI config
      • common.yaml is the same except for the nfs subPath
    • Q2: How to use the Pangeo chart
      • First, we need to get the Pangeo chart current with z2jh (v0.8)
      • Then we can replace the jupyterhub requirement with the pangeo chart
      • Then all our configs need to get indented under pangeo:
      • We can remove some pieces from the pangeo-deploy directory
  3. set up all of these hubs to use existing nfs service
    • this is in progress: #50
  4. How to migrate the existing ocean.pangeo.io home spaces to the NFS server. Raphael, can you figure out how many users we have currently?
    • Raphael archived existing users home spaces
  5. Customize the look and feel of the hubs. (@rabernat) This includes
    • Custom logo / welcome message on the landing page
    • Custom links in the jupyterlab menu (e.g. to Pangeo documentation and github)

Thanks @yuvipanda for popping in and giving us some super valuable feedback


Yuvi also shared this repo: https://github.com/yuvipanda/datahub/tree/external

NFS issue - Mount failed for NFS V3 even after running rpcBind mount.nfs

I've tried out using Google Filestore and the setup suggested by @yuvipanda with success! I've also enjoyed the benefit of being able to smoothly recover when my k8s cluster crashed beyond repair while upgrading from 1.11 to 1.12 due to a GKE TPU related issue. Since there wasn't one GCP PD/PV/PVC per user, this was doable, so thank you all for guiding the path!!

Anyhow, I have run into an issue with the setup that will probably affect you too, since I copied your solution. The issue arises for me when autoscaling up in the morning: two user pods start at the same time on a node that is about to become ready. It works fine if they arrive one at a time! I'm not confident about what exactly fails when two pods act at once, but it seems to strike when two user pods arrive within a minute of each other, both waiting on image pulls because the node is freshly created.

I think I can mitigate most of this issue by having a quick startup of pods, but when it happens I'm forced to drain the node to recover!

This is the error as found in the events of the pods.

Events:
Type     Reason            Age                     From                       Message
----     ------            ----                    ----                       -------
Normal   TriggeredScaleUp  9m13s                   cluster-autoscaler         pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west4-a/instanceGroups/gke-ds-platform-users-352836a1-grp 0->1 (max: 3)}]
Warning  FailedScheduling  8m32s (x25 over 9m37s)  jupyterhub-user-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
Warning  FailedMount       7m9s                    kubelet, gke-ds-platform-users-352836a1-7lb1  MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r8fdfd62f64e44eb995557473092b3ab5.scope
Mount failed: Mount failed for NFS V3 even after running rpcBind
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified, exit status 32
Warning  FailedMount       7m9s                    kubelet, gke-ds-platform-users-352836a1-7lb1  MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r811263fce8b34ac7a5389196e9458cdc.scope
Mount failed: Mount issued for NFS V3 but unable to run rpcbind:
Output: rpcbind: another rpcbind is already running. Aborting

Hmm, so note that what fails does not relate to what's within the init container or container, but to the pod's volumes section.

  # From the jupyter-my-user pod's spec (not nested under a specific (init-)container)
  # As generated by the helm chart options `storage.type: static`
  volumes:
  - name: home
    persistentVolumeClaim:
      claimName: home-nfs

Note that this section was created due to:

storage:
  type: static
  static:
    pvcName: home-nfs
    subPath: "home/hub.pangeo.io/{username}"
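One candidate mitigation (a sketch, not a verified fix): the first failure is NFSv3 asking the node for rpc.statd locking support, and the error text itself suggests sidestepping that with nolock, at the cost of keeping locks local. That can be set declaratively on the PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-nfs
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    # keep locks local so the mount no longer depends on rpc.statd/rpcbind
    - nolock
  nfs:
    server: 10.64.16.18
    path: /home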

Related

https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/pangeo-deploy/templates/home-storage.yaml
#25
#28

potentially merging atmos and ocean hubs

As we work on consolidating the many hubs we once had running, do we feel there is a need for a distinct hub dedicated to atmospheric science - i.e. a new Atmos deployment to match the one created for Ocean?

When I initially got involved with this project, my intention was to upload data generated as part of TRACMIP to Pangeo's cloud bucket and play around with it in a hub deployed specifically for the atmospheric sciences. However, it seems like we may be able to get by using the hub for oceanography.

Is there interest within the community to continue maintaining a hub specifically for atmospheric sciences?

How would this differ for an AWS deployment?

While most of this should be cloud agnostic (because it's running on top of an existing kubernetes deployment), there seem to be GCP-specific components described in the README and circleci tasks.

A few questions:

  • Does anybody currently use this for an AWS Pangeo deployment?
  • Do we have an idea of what would need to change to support AWS instead of GCP?
  • How should we support other cloud providers for these deployment repos? (Different repos? Branches? Support both with a configuration switch?)

Particularly interested in @dsludwig's, @jacobtomlinson's, and @yuvipanda's thoughts.

logging for production clusters

We need to decide what to log and how. Ideally we could keep track of

  • when each user's notebook pod is running
  • when users start and stop dask clusters
  • how much file storage they are using

what else?

The default option on Google Cloud is JupyterHub logs -> Stackdriver -> BigQuery.
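For the Stackdriver -> BigQuery leg, a logging sink would do it; a sketch (project, dataset, and filter values are assumptions):

# route hub container logs into a BigQuery dataset for later analysis
gcloud logging sinks create jupyterhub-logs \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/jupyterhub_logs \
  --log-filter='resource.type="k8s_container" AND resource.labels.container_name="hub"'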

NFS mounting issue

I am getting this in my event log as it tries to start my server:

2019-02-21 21:23:40+00:00 [Warning] MountVolume.SetUp failed for volume "ocean-staging-home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs <nil>:/ /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs
Output: Running scope as unit: run-r9a5ddc860e034ff9ac1e10447920c217.scope
Mount failed: mount failed: exit status 32
Mounting command: chroot
Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs <nil>:/ /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs]
Output: mount.nfs: Failed to resolve server <nil>: Name or service not known

nfs home directory mounting on notebook and dask pods

We currently have a 'hack' solution for mounting user home directories on EFS, which differs a bit from the zero2jupyterhub docs: https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/efs_storage.html

What is the current best practice for mounting EFS home directories?

This is also relevant because we'd like users to be able to "bring their own conda environment.yml" to our deployed image and have it work with dask KubeCluster. It seems there are quite a few GitHub issues out there, and I'm not clear on whether that is possible.

Quoting @yuvipanda: "...one way to do it is to share $HOME between workers and your notebook pod. That way, this turns into 'have local conda environments'. IMO, I like this more than having conda run on each worker forever to update the environment"

So the other thing to sort out is how to get the shared $HOME into our dask_config.yaml: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/image/binder/dask_config.yaml
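A sketch of what that might look like in dask_config.yaml, assuming the worker pods can mount the same PVC the notebook uses (claim name, image tag, and mount path are assumptions):

kubernetes:
  worker-template:
    spec:
      restartPolicy: Never
      containers:
        - name: dask-worker
          image: pangeo/nasa-pangeo-io-notebook:latest  # tag assumed
          args: [dask-worker, --nthreads, "2", --memory-limit, 6GB]
          volumeMounts:
            - name: home
              mountPath: /home/jovyan
      volumes:
        - name: home
          persistentVolumeClaim:
            claimName: home-nfs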

getting globus auth to work

I am playing around with using globus for auth. I followed the instructions here:

https://zero-to-jupyterhub.readthedocs.io/en/latest/authentication.html#globus

My hub pod is giving this error

    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1955, in launch_instance_async
        await self.initialize(argv)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1639, in initialize
        self.load_config_file(self.config_file)
      File "<decorator-gen-5>", line 2, in load_config_file
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
        return method(app, *args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 598, in load_config_file
        raise_config_file_errors=self.raise_config_file_errors,
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 562, in _load_config_files
        config = loader.load_config()
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 457, in load_config
        self._read_file_as_dict()
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
        py3compat.execfile(conf_filename, namespace)
      File "/usr/local/lib/python3.6/dist-packages/ipython_genutils/py3compat.py", line 198, in execfile
        exec(compiler(f.read(), fname, 'exec'), glob, loc)
      File "/srv/jupyterhub_config.py", line 288, in <module>
        set_config_if_not_none(c.GlobusOAuthenticator, trait, 'auth.globus.' + cfg_key)
    TypeError: must be str, not NoneType

I think this is related to these issues and PRs:

I wonder if @consideRatio can confirm that this is related to his recent PR.

If so, how do we point at the very latest chart?
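If the fix is only in an unreleased chart, the usual route is pinning a dev release from the chart repo in requirements.yaml; a sketch (pick the actual version string from the listing above):

dependencies:
  - name: jupyterhub
    version: "0.9-e120fda"  # dev release tag, for illustration
    repository: "https://jupyterhub.github.io/helm-chart/"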

[GKE] Add node selectors to https/proxy pods

We should add node selectors to the https/proxy pods for all of our GKE clusters (dev/ocean/hydro). This will make scaling of the notebook and dask pools far more efficient. For example, the proxy pod for ocean-prod is sitting in the highmem pool right now.

$ kubectl get pods --namespace ocean-prod --output wide
NAME                         READY   STATUS    RESTARTS   AGE   IP           NODE                                                  NOMINATED NODE
autohttps-6555b4fd9c-bgqp5   2/2     Running   0          16h   10.32.1.22   gke-dev-pangeo-io-cluste-default-pool-c2a8b6ac-52hg   <none>
hub-7bbd7c5d-rzcmw           1/1     Running   0          9h    10.32.2.8    gke-dev-pangeo-io-cluste-default-pool-c2a8b6ac-xz3g   <none>
proxy-65c5d54b94-fg68j       1/1     Running   0          7h    10.32.8.19   gke-dev-pangeo-io-clust-n1-highmem-16-a99509de-lhtp   <none>
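A sketch of one way to pin the core pods, using z2jh 0.8's scheduling options (this assumes the chart version exposes them and that we label the default-pool nodes accordingly):

# first label the default-pool nodes, e.g.:
#   kubectl label nodes <node-name> hub.jupyter.org/node-purpose=core
# then require core pods (hub, proxy) to schedule onto them:
jupyterhub:
  scheduling:
    corePods:
      nodeAffinity:
        matchNodePurpose: require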

PersistentVolumeClaim "home-nfs" not found

The biggest possible pod we allow in ocean.pangeo.io is defined by the profile_list entry:

'display_name': 'x-large (n1-highmem-16 | 16 cores, 96GB RAM)',
'kubespawner_override': {
    'cpu_limit': 16,
    'cpu_guarantee': 14,
    'mem_limit': '100G',
    'mem_guarantee': '96G',
},

We have a nodepool with n1-highmem-16 (16 vCPUs, 104 GB memory) nodes. However, when I try to launch the x-large profile, I get the error

[Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)

How much headroom do we need between the pod resource requests and the node capacity? How can we debug this?

Move to current master of hubploy

Currently, this is using hubploy from @dsludwig's fork. We've incorporated all the changes from the fork into hubploy master / repo2docker. We should try to move this back to using hubploy master.

This should ideally happen at the same time as consolidating all hubs into one repo.

use jupyterhub latest master (>0.9.4)

In order to make my custom logo work on ocean.pangeo.io, I need a jupyterhub with this PR in it. The latest release of jupyterhub was in September, 0.9.4, and does not include that PR.

We currently point to jupyterhub helm chart version 0.9-e120fda. I assumed that would be pulling in a very recent master, since it is a devel release tagged on March 1 (https://jupyterhub.github.io/helm-chart/). But apparently this is not the case. I believe our hubs are using 0.9.4.

user conda environments

We've now set up staging.nasa.pangeo.io to allow users to create their own conda environments
(see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/config/common.yaml#L34).

I'm currently running into "The environment is inconsistent" and hanging "solving environment" issues with conda in our image, though. I noticed that /srv/conda/.condarc has the following config:

channels:
  - conda-forge
  - defaults
auto_update_conda: false
show_channel_urls: true
update_dependencies: false

I'm wondering about the update_dependencies: false causing trouble. It comes from repo2docker (https://github.com/jupyter/repo2docker/blob/9099def40a331df04ba3ed862ee27a8e4a77fe43/repo2docker/buildpacks/conda/install-miniconda.bash#L39).

I also noticed we end up with a mix of packages from conda-forge, defaults, and pypi currently, which I guess is originating from pangeo-stacks:
https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml

So... @yuvipanda, @jhamman:

  1. Why is update_dependencies: false?
  2. Should we change pangeo-stacks to just use conda-forge?
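For context, the user-side workflow this is meant to enable looks roughly like the following (a sketch; package choices are illustrative):

# create a personal environment under $HOME so it survives image upgrades
conda create --yes --prefix $HOME/my-env python=3.6 ipykernel xarray dask
conda activate $HOME/my-env
# expose it as a notebook kernel
python -m ipykernel install --user --name my-env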

auth for production clusters

Before launching these new clusters, we should decide what to do for auth. My wishlist of features includes the ability to

  • associate cluster users to people in the real world
  • associate cluster users to academic institutions where possible
  • send mass email to all cluster users or sub-groups of cluster users (e.g. ocean, atmos, etc.)
  • revoke and suspend accounts on a user-by-user basis
  • track user login statistics over time

Some possible options are:

  • keep using github
  • use google auth with google groups
  • use a third party service like okta

get-commit-range.py is failing on PR builds

This seems to be a replay of #52! I thought we squashed this with #54?

#!/bin/bash -eo pipefail
# CircleCI doesn't have equivalent to Travis' COMMIT_RANGE
COMMIT_RANGE=$(./.circleci/get-commit-range.py)
echo ${COMMIT_RANGE}
echo "export COMMIT_RANGE='${COMMIT_RANGE}'" >> ${BASH_ENV}
Traceback (most recent call last):
  File "./.circleci/get-commit-range.py", line 90, in <module>
    main()
  File "./.circleci/get-commit-range.py", line 84, in main
    print(from_branch(args.project, args.repo, branch_name))
  File "./.circleci/get-commit-range.py", line 29, in from_branch
    raise ValueError(f'No PR from branch {branch_name} in upstream repo found')
ValueError: No PR from branch tweak-docker in upstream repo found
Exited with code 1

cc @yuvipanda

Authentication errors on CircleCI/GCP

I set things up according to the README and am still getting an authentication error on CircleCI in the "Build primary image if needed" step, when it tries to run hubploy-image-builder:

#!/bin/bash -eo pipefail
hubploy-image-builder \
  --push \
  --registry-url https://us.gcr.io \
  --registry-username _json_key \
  --registry-password "${GCR_READWRITE_KEY}" \
  --repo2docker \
  deployments/${DEPLOYMENT}/image/ ${IMAGE_NAME}
Traceback (most recent call last):
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/root/repo/venv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://35.237.221.205:2376/v1.35/distribution/us.gcr.io/learning-2-learn-221016/example-pangeo-io-notebook:2b306e1/json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/repo/venv/bin/hubploy-image-builder", line 11, in <module>
    load_entry_point('hubploy==0.1.0', 'console_scripts', 'hubploy-image-builder')()
  File "/root/repo/venv/lib/python3.6/site-packages/hubploy/imagebuilder.py", line 151, in main
    if needs_building(client, args.path, args.image_name):
  File "/root/repo/venv/lib/python3.6/site-packages/hubploy/imagebuilder.py", line 22, in needs_building
    image_manifest = client.images.get_registry_data(image_spec)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/models/images.py", line 333, in get_registry_data
    attrs=self.client.api.inspect_distribution(name),
  File "/root/repo/venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 34, in wrapper
    return f(self, *args, **kwargs)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/image.py", line 266, in inspect_distribution
    self._get(self._url("/distribution/{0}/json", image)), True
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 262, in _result
    self._raise_for_status(response)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication")
Exited with code 1

Do I need to edit the secrets/staging.yaml file in some way? I am using my own GCP account, not the one y'all have, so maybe that's the issue?
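If it is a permissions problem, the service account behind GCR_READWRITE_KEY needs storage access on the bucket backing the registry; a sketch of granting it (project and account names are placeholders):

# us.gcr.io images are stored in the us.artifacts.<PROJECT>.appspot.com bucket
gsutil iam ch \
  serviceAccount:ci-builder@<PROJECT>.iam.gserviceaccount.com:objectAdmin \
  gs://us.artifacts.<PROJECT>.appspot.com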

User-level permissions for pod access to S3 buckets

As a user, I'd like to use an S3 bucket (or a prefix within a shared bucket) as a storage option for my work. Ideally, that would be something that had access control such that only users with correct permissions can interact with it.

This is definitely possible from an AWS IAM policy perspective. For example: https://aws.amazon.com/premiumsupport/knowledge-center/iam-s3-user-specific-folder/

The challenge is that while we can give this permission at an instance level (via IAM Instance Profiles), multiple users' pods may end up on the same underlying instance. Thus a pod could access any co-resident pod's S3 bucket/prefix.

Another option would be to use static credentials for users. It would be important for these to be scoped to S3 actions/conditions only, and only to requests from our cluster's CIDR block. Then we could inject the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars into a user's pod. But I'm unsure how the actual implementation would work -- specifically, what would inject those env vars into a pod, and how might it do that?
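On the injection question specifically, one low-tech mechanism is z2jh's singleuser.extraEnv (a sketch; the values below are placeholders, every user pod gets the same environment, and real credentials would need to come from a Secret rather than plain config):

singleuser:
  extraEnv:
    # ends up in every user pod's environment -- no per-user scoping yet
    AWS_ACCESS_KEY_ID: "PLACEHOLDER_KEY_ID"
    AWS_SECRET_ACCESS_KEY: "PLACEHOLDER_SECRET"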

There may be other options as well. I'm especially curious about @yuvipanda's and @jacobtomlinson's thoughts on this.

notebook pod complains jupyter is not installed

I have replaced the ocean image with a pass-through Dockerfile:

# Note that there must be a tag
FROM pangeo/pangeo-ocean:2019.03.12

That image lives over in https://github.com/pangeo-data/pangeo-stacks, where it is built by repo2docker. It is already being used by binder via https://github.com/pangeo-data/pangeo_ocean_examples/ in a similar way, and it seems to work.

However, here the notebook pod won't start, and I get these errors:

Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 5, in <module>
    from jupyterhub.singleuser import SingleUserNotebookApp
ModuleNotFoundError: No module named 'jupyterhub'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/bin/jupyter-labhub", line 7, in <module>
    from jupyterlab.labhubapp import main
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 8, in <module>
    raise ImportError('You must have jupyterhub installed for this to work.')
ImportError: You must have jupyterhub installed for this to work.

What is going on here?
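One workaround to test (a sketch, not a confirmed diagnosis): if jupyterhub genuinely isn't in the image's conda environment, a derived image can add it, pinned to match the hub:

# Note that there must be a tag
FROM pangeo/pangeo-ocean:2019.03.12
# install the jupyterhub package the single-user server needs
# (pinned to the hub release mentioned elsewhere in this repo)
RUN conda install --yes jupyterhub=0.9.4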

Server startup failure on staging.pangeo.io

Spawn failed: HTTPSConnectionPool(host='10.4.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/staging/persistentvolumeclaims (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff25d601438>: Failed to establish a new connection: [Errno 111] Connection refused',))


Use Google Cloud Filestore for shared storage

Google Cloud now has a managed NFS provider (Filestore) that would be great for home directories. Currently, each user gets their own disk, which is expensive and rigid (you can't easily change sizes up or down after creation). It also makes some sharing scenarios harder.

Steps to use filestore:

  1. Create a filestore
  2. Configure z2jh to use NFS as the backing store for home directories. We use one filestore for all users, and use subPath to give each user rw access to their own directory. This scopes users to directories without relying on traditional Unix user permissions
  3. Use an initContainer to make sure the home directory created for the user has right permissions and ownership. See https://serverfault.com/questions/906083/how-to-mount-volume-with-specific-uid-in-kubernetes-pod for example.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/doc/source/amazon/efs_storage.rst has some info on doing something like this with EFS, which is used on AWS. Step (3) would be different.

Once we have a good idea on how to set this up, this can be contributed back to the z2jh docs.
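A sketch of step 3 (the volume name and the availability of singleuser.initContainers are assumptions; kubespawner's init_containers via hub.extraConfig is an alternative route):

singleuser:
  initContainers:
    # runs as root before the notebook container starts, to fix ownership
    # of the user's freshly created NFS subPath
    - name: volume-permissions
      image: busybox
      command: ["sh", "-c", "chown -R 1000:1000 /home/jovyan"]
      securityContext:
        runAsUser: 0
      volumeMounts:
        - name: home
          mountPath: /home/jovyan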

refactor branding templates for login page

Clusters on GCE are currently using a gitRepo volume to mount Pangeo styling templates for custom JupyterHub login pages. We're having trouble getting this to work on AWS due to lack of write permissions at /usr/local/share/jupyterhub/, and it seems that gitRepo is deprecated according to the kubernetes docs: https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo

It seems like the recommended approach would be to use initContainers under our hub: configuration; here is a nice example of that approach:
https://gist.github.com/tallclair/849601a16cebeee581ef2be50c351841

But... as far as I can tell, this would require adding an initContainers configuration option under hub: in the chart schema:
https://zero-to-jupyterhub.readthedocs.io/en/latest/reference.html#hub

So we may want to suggest this change in a new issue here:
https://github.com/jupyterhub/zero-to-jupyterhub-k8s

Wanted to post here first to make sure there is not an easier approach that I'm overlooking... @jhamman, @yuvipanda

Hubploy COMMIT_RANGE failing in this repo

We have a CI/CD failure on CircleCI now:

#!/bin/bash -eo pipefail
# CircleCI doesn't have equivalent to Travis' COMMIT_RANGE
COMMIT_RANGE=$(./.circleci/get-commit-range.py)
echo ${COMMIT_RANGE}
echo "export COMMIT_RANGE='${COMMIT_RANGE}'" >> ${BASH_ENV}
Traceback (most recent call last):
  File "./.circleci/get-commit-range.py", line 90, in <module>
    main()
  File "./.circleci/get-commit-range.py", line 84, in main
    print(from_branch(args.project, args.repo, branch_name))
  File "./.circleci/get-commit-range.py", line 29, in from_branch
    raise ValueError(f'No PR from branch {branch_name} in upstream repo found')
ValueError: No PR from branch staging in upstream repo found
Exited with code 1

I know @yuvipanda was mentioning this is a bit of a tricky part of the current setup. I think we just need someone to look into this and figure out what isn't working.

cc @rabernat and @raphaeldussin

ocean staging notebooks won't launch

I had a notebook pod die spontaneously. Now I can't start any more.

2019-03-24 02:47:40+00:00 [Warning] 0/11 nodes are available: 1 node(s) had disk pressure, 10 Insufficient memory, 2 Insufficient cpu.
2019-03-24 02:47:48+00:00 [Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)

One possibly related point is that I was downloading O(5 GB) of data to the /tmp directory. I thought that this was sitting on a 100 GB SSD, but it might be that I filled up some disk somewhere.

It's weird that the node pools won't just scale up to accommodate a new pod.
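A couple of things worth checking (a sketch): whether a node is actually tainted for disk pressure, and how much each node has left to allocate.

# DiskPressure shows up both as a node condition and as a taint
kubectl describe nodes | grep -iE -B 3 "disk-?pressure"
# per-node view of requests vs. allocatable
kubectl describe nodes | grep -A 6 "Allocated resources"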

multiple notebook docker images or kernels

How do we ensure that notebooks created on our clusters are always runnable, even as the notebook images evolve? The only choice I see is some sort of versioning system that allows users to select past versions of their environments. There are two ways this could work:

  • At the notebook docker image level (i.e. use ProfileList to provide a choice of images)
  • At the kernel level: we somehow make available many different kernels within a single notebook image, and notebooks created with a certain kernel will always open with that kernel

Has anyone thought about how to solve this problem?
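A sketch of the first option, reusing the profile_list mechanism shown in earlier issues (image names and tags are illustrative, and the exact override key -- 'image' vs 'image_spec' -- depends on the kubespawner version in use):

# one profile_list entry per archived image version
{
    'display_name': 'pangeo-notebook 2019.03 (current)',
    'kubespawner_override': {'image_spec': 'pangeo/pangeo-notebook:2019.03'},
},
{
    'display_name': 'pangeo-notebook 2018.12 (archive)',
    'kubespawner_override': {'image_spec': 'pangeo/pangeo-notebook:2018.12'},
},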
