
pangeo-cloud-federation's People

Contributors

amanda-tan, apawloski, arbennett, bartnijssen, brian-rose, consideratio, cspencerjones, dependabot[bot], dsludwig, jcrist, jrbourbeau, lsetiawan, nicwayand, ocefpaf, rabernat, raphaeldussin, rsignell-usgs, salvis2, scottyhq, shanicetbailey, tjcrone, tomaugspurger, xjonjos, yuvipanda


pangeo-cloud-federation's Issues

Example of why we need to pin?

I tried my new non-pinned environment on dev.pangeo.io, and when I try to start a cluster, I get:

distributed.core - ERROR - add_worker() got an unexpected keyword argument 'cpu'
Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/distributed/core.py", line 340, in handle_comm
    result = yield result
  File "/srv/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/srv/conda/lib/python3.6/site-packages/tornado/gen.py", line 307, in wrapper
    result = func(*args, **kwargs)

So I guess the latest conda-forge packages distributed=1.23.3 and tornado=5.1.1 don't play nice together, right?

So I should retreat to distributed=1.22.1 and tornado=5.0.2 and try again, right?

Just checking this is the right approach. I won't ask this every time, I promise.
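For reference, the pinning would look something like this in the environment file (a minimal sketch using the versions suggested above):

name: pangeo
channels:
  - conda-forge
dependencies:
  # pin both packages to the last known-good combination
  - distributed=1.22.1
  - tornado=5.0.2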

aws authentication

@apawloski, @scottyhq, and I have been working on getting this repo set up with a new NASA deployment running on AWS. We've been making some changes to hubploy in berkeley-dsep-infra/hubploy#14 and have added most of the necessary bits to this repo.

We're currently stuck getting the EKS authentication to work. This could just be how we've set up the IAM account, but I'm posting here to coordinate the final pieces. The current error is:

could not get token: NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Error: Get https://958CB6CD107C87EEDAA83BFFEE9EEAFA.sk1.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: getting credentials: exec: exit status 1

The command aws eks update-kubeconfig --name pangeo succeeds but is not giving us sufficient privileges. I'm hoping @apawloski knows what to do here.
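If it is an IAM mapping problem, the usual EKS fix is adding the deploying identity to the aws-auth ConfigMap, since EKS only trusts identities mapped there. A sketch (the ARN is a placeholder, and system:masters is the bluntest possible grant):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapUsers: |
    # placeholder ARN -- substitute the IAM user the CI deploy runs as
    - userarn: arn:aws:iam::123456789012:user/hubploy-deployer
      username: hubploy-deployer
      groups:
        - system:masters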

rename this repo?

We're going to start consolidating a number of separate JupyterHubs into this repo. The repo name dev.pangeo.io-deploy will soon fail to describe what is here. What should we rename this repo to?

Make sure contents of image are present in homedir

When building images with repo2docker, we currently put everything into $HOME by default. This works great for Binder.

However, when running with a persistent JupyterHub, $HOME gets mounted over by the persistent home directory. This means nothing in the repo is visible to users, and things like the 'start' script don't work.
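One possible workaround (a sketch, not an adopted fix: it assumes the build is changed to stash a copy of the repo at /srv/repo) is to copy anything missing into $HOME after the persistent volume is mounted, e.g. from a repo2docker start script:

#!/bin/bash
# copy repo contents into the persistent home without clobbering user files
cp -rn /srv/repo/. "$HOME"/
exec "$@"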

how much cpu / memory can notebook pods use?

The biggest possible pod we allow in ocean.pangeo.io is defined by the profile_list entry:

'display_name': 'x-large (n1-highmem-16 | 16 cores, 96GB RAM)',
'kubespawner_override': {
    'cpu_limit': 16,
    'cpu_guarantee': 14,
    'mem_limit': '100G',
    'mem_guarantee': '96G',
},

We have a nodepool with n1-highmem-16 (16 vCPUs, 104 GB memory) nodes. However, when I try to launch the x-large profile, the event log shows

Server requested
2019-03-06 17:49:34+00:00 [Warning] 0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient memory.
2019-03-06 17:49:48+00:00 [Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)

How much headroom do we need between the pod resource requests and the node capacity? I would think that 14 cpus and 96GB of memory would fit on a 16 vCPU / 104GB memory node. How can we debug this?
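One thing to check: Kubernetes schedules against a node's allocatable resources, not its raw capacity, and GKE reserves a slice of each node for the kubelet and system daemons, so a 104GB node advertises noticeably less than 104GB. Comparing the pod's requests against what a node actually advertises should show the gap:

# show what the scheduler actually has to work with on a node
kubectl describe node <node-name> | grep -A 6 Allocatable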

admin users don't work on globus

This is how the ocean staging.yml secrets look:

pangeo:
  jupyterhub:
    proxy:
      secretToken: XXX
    auth:
      type: globus
      globus:
        clientId: "XXX"
        clientSecret: "XXX"
        callbackUrl: "https://staging.ocean.pangeo.io/hub/oauth_callback"
        identityProvider: "orcid.org"
        admin:
          access: true
          users:
            - 0000-0001-7479-8439 # Joe Hamman
            - 0000-0001-5999-4917 # Ryan Abernathey
            - 0000-0003-4004-4553 # Raphael Dussin

The hub startup log says

Loading /etc/jupyterhub/config/values.yaml
Loading /etc/jupyterhub/secret/values.yaml
Loading extra config: customPodHook
Loading extra config: profile_list
[I 2019-03-07 00:08:39.981 JupyterHub app:1673] Using Authenticator: oauthenticator.globus.GlobusOAuthenticator-0.8.1
[I 2019-03-07 00:08:39.981 JupyterHub app:1673] Using Spawner: kubespawner.spawner.KubeSpawner
[I 2019-03-07 00:08:39.981 JupyterHub app:1016] Loading cookie_secret from /srv/jupyterhub/jupyterhub_cookie_secret
[W 2019-03-07 00:08:40.079 JupyterHub app:1131] JupyterHub.hub_connect_port is deprecated as of 0.9. Use JupyterHub.hub_connect_url to fully specify the URL for connecting to the Hub.
[W 2019-03-07 00:08:40.081 JupyterHub app:1173] No admin users, admin interface will be unavailable.
[W 2019-03-07 00:08:40.082 JupyterHub app:1174] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2019-03-07 00:08:40.082 JupyterHub app:1201] Not using whitelist. Any authenticated user will be allowed.

Note the "No admin users" warning. What's wrong?

Hubploy and multiple user images

Cross-posting from pangeo-data/pangeo#348

We should try this out here. I'm curious if @dsludwig has any idea of how hubploy / repo2docker could handle this. I'm wondering if we'll need to reconfigure things a bit in hubploy to support this. Is anyone interested in giving this a go?

KubeCluster can't start workers on staging.ocean.pangeo.io

Finally fixed #148, and now we can use the dask labextension to start KubeCluster schedulers on staging.ocean.pangeo.io.

The next problem is that launching dask workers from KubeClusters apparently doesn't work at all, whether I start them from the lab extension or from notebook code. kubectl -n ocean-staging get pods shows no recent pending dask-jovyan- pods. It does, however, have some older dask-jovyan- pods (e.g. dask-jovyan-bc51065a-9nhsmh) for which the GCP console tells me:

  • PodUnschedulable
    Cannot schedule pods: Insufficient cpu.
  • PodUnschedulable
    Cannot schedule pods: Insufficient memory.
  • PodUnschedulable
    Cannot schedule pods: node(s) didn't match node selector.

Could this be related to the node selector?

nodeSelector:
  alpha.eksctl.io/nodegroup-name: dask-worker
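A quick check worth running (a sketch): if no node carries that label, the worker pods can never schedule, and the autoscaler won't add a matching node either.

# list node labels and look for the dask-worker nodegroup label
kubectl get nodes --show-labels | grep dask-worker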

This is a pretty big problem, since these clusters are our killer feature.

move nfs configuration to deployments instead of under pangeo-deploy

We're running into an issue with helm upgrade when deploying to AWS that is related to the NFS configuration settings under the shared pangeo-deploy directory. This is resolved by deleting those settings, but doing so will presumably affect the GCP deployments:
scottyhq@6dd4f18

Should we ensure whatever is under pangeo-deploy is as bare-bones as possible and not linked to specific cloud providers or deployments?

helm upgrade --wait --install --namespace nasa-staging nasa-staging pangeo-deploy -f deployments/nasa/config/common.yaml -f deployments/nasa/config/staging.yaml -f deployments/nasa/secrets/staging.yaml --set jupyterhub.singleuser.image.tag=a7ff12a --set jupyterhub.singleuser.image.name=pangeo/nasa-pangeo-io-not
Error: release nasa-staging failed: PersistentVolume "nasa-staging-home-nfs" is invalid: spec.nfs.server: Required value
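One way to keep the shared chart bare-bones would be to guard the NFS objects behind a values flag, so deployments that don't configure a server skip the template entirely. A sketch (the values key is assumed for illustration):

# templates/home-storage.yaml -- render the PV only when an NFS server is set
{{- if .Values.nfs.server }}
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {{ .Release.Name }}-home-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: {{ .Values.nfs.server }}
    path: "/"
{{- end }}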

Any issues moving pangeo.esipfed.org to this framework?

I'm currently running pangeo.esipfed.org on the pangeo-access AWS kops cluster the old way: manually executing the docker build, pushing to Docker Hub, and re-upping the helm chart.

I would like to move to the new approach. Anything I should be aware of?

ocean docker images are not getting pushed

Hubploy is not pushing my notebook docker images. Consequently, I am getting errors in my hub like:

Failed to pull image "us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be": rpc error: code = Unknown desc = Error response from daemon: manifest for us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be not found

It seems like the commit f1112be is getting built on the PR branch (#97), but on the deploy job hubploy determines that nothing needs to be done.

#!/bin/bash -eo pipefail
hubploy build ocean --commit-range ${COMMIT_RANGE} --push
Activated service account credentials for: [[email protected]]
WARNING: `docker` not in system PATH.
`docker` and `docker-credential-gcloud` need to be in the same PATH in order to work correctly together.
gcloud's Docker credential helper can be configured but it will not work until this is corrected.
gcloud credential helpers already registered correctly.
Image us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be: already up to date

As a result, the image is never pushed.

Relevant circleci config is here:
https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/.circleci/config.yml#L125-L129
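To confirm the tag really is absent from the registry (rather than hubploy comparing against a stale local manifest), we can query GCR directly:

# lists matching tags; empty output means the image was never pushed
gcloud container images list-tags us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook --filter="tags:f1112be"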

Setting up "IMAGE_NAME"

I am in the process of setting this up on CircleCI, and the one thing I am still a bit lost on is the exact steps I would need to take to create the image that goes into IMAGE_NAME.

Is there a Github repo somewhere with the Dockerfile (or multiple) for some generic pangeo images that I can customize?

recap from today's work

@raphaeldussin, @rabernat and I worked on this repo a bit today. Here are my notes on what we worked on and how we did a few pieces:

  1. shutdown example.pangeo.io
    • done
    • how did I do this:
      • First I deleted the kubernetes cluster from gcp
      • Then I removed all the pvcs from the compute-engine/disks menu on gcp
      • Then I archived the repo
  2. create new hubs/namespaces in dev.pangeo.io-deploy for ocean/atmos/hydro/astro/polar Pangeos
    • Q1: How to add new namespaces/hubs?
      • cp an existing deployment (under deployments)
      • change the deployment configs
      • add lines to the CircleCI config
      • common.yaml is the same except for the nfs subPath
    • Q2: How to use the Pangeo chart
      • First, we need to get the Pangeo chart current with z2jh (v0.8)
      • Then we can replace the jupyterhub requirement with the pangeo chart
      • Then all our configs need to get indented under pangeo:
      • We can remove some pieces from the pangeo-deploy directory
  3. set up all of these hubs to use existing nfs service
    • this is in progress: #50
  4. How to migrate the existing ocean.pangeo.io home spaces to the NFS server. Raphael, can you figure out how many users we have currently?
    • Raphael archived existing users home spaces
  5. Customize the look and feel of the hubs. (@rabernat) This includes
    • Custom logo / welcome message on the landing page
    • Custom links in the jupyterlab menu (e.g. to Pangeo documentation and github)

Thanks @yuvipanda for popping in and giving us some super valuable feedback


Yuvi also shared this repo: https://github.com/yuvipanda/datahub/tree/external

NFS issue - Mount failed for NFS V3 even after running rpcBind mount.nfs

I've tried out using Google Filestore and the setup suggested by @yuvipanda with success! I've also enjoyed the benefit of being able to smoothly recover when my k8s cluster crashed beyond repair while upgrading from 1.11 to 1.12 due to a GKE TPU related issue. Since there wasn't one GCP PD/PV/PVC per user, this was doable, so thank you all for guiding the path!!

Anyhow, I have run into an issue with the setup that will probably affect you too, since I copied your solution. The issue arises for me when autoscaling up in the morning: two user pods start at the same time on a node that is about to become ready. It works fine if they arrive one at a time! I'm not confident about what exactly fails when two pods act at once, but it seems to strike when two user pods arrive within a minute of each other, both waiting on image pulls because the node is freshly created.

I think I can mitigate most of this issue by having a quick startup of pods, but when it happens I'm forced to drain the node to recover!

This is the error as found in the events of the pods.

Events:
Type     Reason            Age                     From                       Message
----     ------            ----                    ----                       -------
Normal   TriggeredScaleUp  9m13s                   cluster-autoscaler         pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west4-a/instanceGroups/gke-ds-platform-users-352836a1-grp 0->1 (max: 3)}]
Warning  FailedScheduling  8m32s (x25 over 9m37s)  jupyterhub-user-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
Warning  FailedMount       7m9s                    kubelet, gke-ds-platform-users-352836a1-7lb1  MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r8fdfd62f64e44eb995557473092b3ab5.scope
Mount failed: Mount failed for NFS V3 even after running rpcBind
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified, exit status 32
Warning  FailedMount       7m9s                    kubelet, gke-ds-platform-users-352836a1-7lb1  MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r811263fce8b34ac7a5389196e9458cdc.scope
Mount failed: Mount issued for NFS V3 but unable to run rpcbind:
Output: rpcbind: another rpcbind is already running. Aborting

Hmm, so note that what fails does not relate to what's within the init container or container, but to the pod's volumes section.

  # From the jupyter-my-user pod's spec (not nested under a specific (init-)container)
  # As generated by the helm chart options `storage.type: static`
  volumes:
  - name: home
    persistentVolumeClaim:
      claimName: home-nfs

Note that this section was created due to:

storage:
  type: static
  static:
    pvcName: home-nfs
    subPath: "home/hub.pangeo.io/{username}"
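One candidate mitigation (a sketch, not a verified fix): the first failure is NFSv3 asking the node for rpc.statd locking support, and the error text itself suggests sidestepping that with nolock, at the cost of keeping locks local. That can be set declaratively on the PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-nfs
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    # keep locks local so the mount no longer depends on rpc.statd/rpcbind
    - nolock
  nfs:
    server: 10.64.16.18
    path: /home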

Related

https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/pangeo-deploy/templates/home-storage.yaml
#25
#28

potentially merging atmos and ocean hubs

As we work on consolidating the many hubs we once had running, do we feel there is a need for a distinct hub dedicated to atmospheric science - i.e. a new Atmos deployment to match the one created for Ocean?

When I initially got involved with this project, my intention was to upload data generated as part of TRACMIP to Pangeo's cloud bucket and play around with it in a hub deployed specifically for the atmospheric sciences. However, it seems like we may be able to get by using the hub for oceanography.

Is there interest within the community to continue maintaining a hub specifically for atmospheric sciences?

How would this differ for an AWS deployment?

While most of this should be cloud agnostic (because it's running on top of an existing kubernetes deployment), there seem to be GCP-specific components described in the README and circleci tasks.

A few questions:

  • Does anybody currently use this for an AWS Pangeo deployment?
  • Do we have an idea of what would need to change to support AWS instead of GCP?
  • How should we support other cloud providers for these deployment repos? (Different repos? Branches? Support both with a configuration switch?)

Particularly interested in @dsludwig's, @jacobtomlinson's, and @yuvipanda's thoughts.

logging for production clusters

We need to decide what to log and how. Ideally we could keep track of

  • when each user's notebook pod is running
  • when users start and stop dask clusters
  • how much file storage they are using

what else?

The default option on Google Cloud is JupyterHub logs -> Stackdriver -> BigQuery.
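For the Stackdriver -> BigQuery leg, a logging sink would do it; a sketch (project, dataset, and filter values are assumptions):

# route hub container logs into a BigQuery dataset for later analysis
gcloud logging sinks create jupyterhub-logs \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/jupyterhub_logs \
  --log-filter='resource.type="k8s_container" AND resource.labels.container_name="hub"'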

NFS mounting issue

I am getting this in my event log as it tries to start my server:

2019-02-21 21:23:40+00:00 [Warning] MountVolume.SetUp failed for volume "ocean-staging-home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs <nil>:/ /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs
Output: Running scope as unit: run-r9a5ddc860e034ff9ac1e10447920c217.scope
Mount failed: mount failed: exit status 32
Mounting command: chroot
Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs <nil>:/ /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs]
Output: mount.nfs: Failed to resolve server <nil>: Name or service not known

nfs home directory mounting on notebook and dask pods

We currently have a 'hack' solution for mounting user home directories on EFS, which differs a bit from the zero2jupyterhub docs: https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/efs_storage.html

What is the current best practice for mounting EFS home directories?

This is also relevant because we'd like users to be able to "bring their own conda environment.yml" to our deployed image and have it work with dask KubeCluster. It seems there are quite a few GitHub issues out there, and I'm not clear on whether that is possible.

Quoting @yuvipanda: "...one way to do it is to share $HOME between workers and your notebook pod. That way, this turns into 'have local conda environments'. IMO, I like this more than having conda run on each worker forever to update the environment"

So the other thing to sort out is how to get the shared $HOME into our dask_config.yaml: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/image/binder/dask_config.yaml
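A sketch of what that might look like in dask_config.yaml, assuming the worker pods can mount the same PVC the notebook uses (claim name, image tag, and mount path are assumptions):

kubernetes:
  worker-template:
    spec:
      restartPolicy: Never
      containers:
        - name: dask-worker
          image: pangeo/nasa-pangeo-io-notebook:latest  # tag assumed
          args: [dask-worker, --nthreads, "2", --memory-limit, 6GB]
          volumeMounts:
            - name: home
              mountPath: /home/jovyan
      volumes:
        - name: home
          persistentVolumeClaim:
            claimName: home-nfs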

getting globus auth to work

I am playing around with using globus for auth. I followed the instructions here:

https://zero-to-jupyterhub.readthedocs.io/en/latest/authentication.html#globus

My hub pod is giving this error

    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1955, in launch_instance_async
        await self.initialize(argv)
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1639, in initialize
        self.load_config_file(self.config_file)
      File "<decorator-gen-5>", line 2, in load_config_file
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
        return method(app, *args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 598, in load_config_file
        raise_config_file_errors=self.raise_config_file_errors,
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 562, in _load_config_files
        config = loader.load_config()
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 457, in load_config
        self._read_file_as_dict()
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
        py3compat.execfile(conf_filename, namespace)
      File "/usr/local/lib/python3.6/dist-packages/ipython_genutils/py3compat.py", line 198, in execfile
        exec(compiler(f.read(), fname, 'exec'), glob, loc)
      File "/srv/jupyterhub_config.py", line 288, in <module>
        set_config_if_not_none(c.GlobusOAuthenticator, trait, 'auth.globus.' + cfg_key)
    TypeError: must be str, not NoneType

I think this is related to these issues and PRs:

I wonder if @consideRatio can confirm that this is related to his recent PR.

If so, how do we point at the very latest chart?
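If the fix is only in an unreleased chart, the usual route is pinning a dev release from the chart repo in requirements.yaml; a sketch (pick the actual version string from the listing above):

dependencies:
  - name: jupyterhub
    version: "0.9-e120fda"  # dev release tag, for illustration
    repository: "https://jupyterhub.github.io/helm-chart/"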

[GKE] Add node selectors to https/proxy pods

We should add node selectors to the https/proxy pods for all of our GKE clusters (dev/ocean/hydro). This will make scaling of the notebook and dask pools far more efficient. For example, the proxy pod for ocean-prod is sitting in the highmem pool right now.

$ kubectl get pods --namespace ocean-prod --output wide
NAME                         READY   STATUS    RESTARTS   AGE   IP           NODE                                                  NOMINATED NODE
autohttps-6555b4fd9c-bgqp5   2/2     Running   0          16h   10.32.1.22   gke-dev-pangeo-io-cluste-default-pool-c2a8b6ac-52hg   <none>
hub-7bbd7c5d-rzcmw           1/1     Running   0          9h    10.32.2.8    gke-dev-pangeo-io-cluste-default-pool-c2a8b6ac-xz3g   <none>
proxy-65c5d54b94-fg68j       1/1     Running   0          7h    10.32.8.19   gke-dev-pangeo-io-clust-n1-highmem-16-a99509de-lhtp   <none>
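A sketch of one way to pin the core pods, using z2jh 0.8's scheduling options (this assumes the chart version exposes them and that we label the default-pool nodes accordingly):

# first label the default-pool nodes, e.g.:
#   kubectl label nodes <node-name> hub.jupyter.org/node-purpose=core
# then require core pods (hub, proxy) to schedule onto them:
jupyterhub:
  scheduling:
    corePods:
      nodeAffinity:
        matchNodePurpose: require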

PersistentVolumeClaim "home-nfs" not found

The biggest possible pod we allow in ocean.pangeo.io is defined by the profile_list entry:

'display_name': 'x-large (n1-highmem-16 | 16 cores, 96GB RAM)',
'kubespawner_override': {
    'cpu_limit': 16,
    'cpu_guarantee': 14,
    'mem_limit': '100G',
    'mem_guarantee': '96G',
},

We have a nodepool with n1-highmem-16 (16 vCPUs, 104 GB memory) nodes. However, when I try to launch the x-large profile, I get the error

[Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)

How much headroom do we need between the pod resource requests and the node capacity? How can we debug this?

Move to current master of hubploy

Currently, this is using hubploy from @dsludwig's fork. We've incorporated all the changes from the fork into hubploy master / repo2docker. We should try to move this back to using hubploy master.

This should ideally happen at the same time as consolidating all hubs into one repo.

use jupyterhub latest master (>0.9.4)

In order to make my custom logo work on ocean.pangeo.io, I need a jupyterhub with this PR in it. The latest release of jupyterhub was in September, 0.9.4, and does not include that PR.

We currently point to jupyterhub helm chart version 0.9-e120fda. I assumed that would be pulling in a very recent master, since it is a devel release tagged on March 1 (https://jupyterhub.github.io/helm-chart/). But apparently this is not the case. I believe our hubs are using 0.9.4.

user conda environments

We've now set up staging.nasa.pangeo.io to allow users to create their own conda environments
(see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/config/common.yaml#L34).

I'm currently running into "The environment is inconsistent" and hanging "solving environment" issues with conda in our image, though. I noticed that /srv/conda/.condarc has the following config:

channels:
  - conda-forge
  - defaults
auto_update_conda: false
show_channel_urls: true
update_dependencies: false

I'm wondering about the update_dependencies: false causing trouble. It comes from repo2docker (https://github.com/jupyter/repo2docker/blob/9099def40a331df04ba3ed862ee27a8e4a77fe43/repo2docker/buildpacks/conda/install-miniconda.bash#L39).

I also noticed we end up with a mix of packages from conda-forge, defaults, and pypi currently, which I guess is originating from pangeo-stacks:
https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml

So... @yuvipanda, @jhamman:

  1. Why is update_dependencies: false?
  2. Should we change pangeo-stacks to just use conda-forge?
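For context, the user-side workflow this is meant to enable looks roughly like the following (a sketch; package choices are illustrative):

# create a personal environment under $HOME so it survives image upgrades
conda create --yes --prefix $HOME/my-env python=3.6 ipykernel xarray dask
conda activate $HOME/my-env
# expose it as a notebook kernel
python -m ipykernel install --user --name my-env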

auth for production clusters

Before launching these new clusters, we should decide what to do for auth. My wishlist of features includes the ability to

  • associate cluster users to people in the real world
  • associate cluster users to academic institutions where possible
  • send mass email to all cluster users or sub-groups of cluster users (e.g. ocean, atmos, etc.)
  • revoke and suspend accounts on a user-by-user basis
  • track user login statistics over time

Some possible options are:

  • keep using github
  • use google auth with google groups
  • use a third party service like okta

get-commit-range.py is failing on PR builds

This seems to be a replay of #52! I thought we squashed this with #54?

#!/bin/bash -eo pipefail
# CircleCI doesn't have equivalent to Travis' COMMIT_RANGE
COMMIT_RANGE=$(./.circleci/get-commit-range.py)
echo ${COMMIT_RANGE}
echo "export COMMIT_RANGE='${COMMIT_RANGE}'" >> ${BASH_ENV}
Traceback (most recent call last):
  File "./.circleci/get-commit-range.py", line 90, in <module>
    main()
  File "./.circleci/get-commit-range.py", line 84, in main
    print(from_branch(args.project, args.repo, branch_name))
  File "./.circleci/get-commit-range.py", line 29, in from_branch
    raise ValueError(f'No PR from branch {branch_name} in upstream repo found')
ValueError: No PR from branch tweak-docker in upstream repo found
Exited with code 1

cc @yuvipanda

Authentication errors on CircleCI/GCP

I set things up according to the README and am still getting an authentication error on CircleCI in the "Build primary image if needed" step, when it tries to run hubploy-image-builder:

#!/bin/bash -eo pipefail
hubploy-image-builder \
  --push \
  --registry-url https://us.gcr.io \
  --registry-username _json_key \
  --registry-password "${GCR_READWRITE_KEY}" \
  --repo2docker \
  deployments/${DEPLOYMENT}/image/ ${IMAGE_NAME}
Traceback (most recent call last):
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/root/repo/venv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://35.237.221.205:2376/v1.35/distribution/us.gcr.io/learning-2-learn-221016/example-pangeo-io-notebook:2b306e1/json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/repo/venv/bin/hubploy-image-builder", line 11, in <module>
    load_entry_point('hubploy==0.1.0', 'console_scripts', 'hubploy-image-builder')()
  File "/root/repo/venv/lib/python3.6/site-packages/hubploy/imagebuilder.py", line 151, in main
    if needs_building(client, args.path, args.image_name):
  File "/root/repo/venv/lib/python3.6/site-packages/hubploy/imagebuilder.py", line 22, in needs_building
    image_manifest = client.images.get_registry_data(image_spec)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/models/images.py", line 333, in get_registry_data
    attrs=self.client.api.inspect_distribution(name),
  File "/root/repo/venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 34, in wrapper
    return f(self, *args, **kwargs)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/image.py", line 266, in inspect_distribution
    self._get(self._url("/distribution/{0}/json", image)), True
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 262, in _result
    self._raise_for_status(response)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/root/repo/venv/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication")
Exited with code 1

Do I need to edit the secrets/staging.yaml file in some way? I am using my own GCP account, not the one y'all have, so maybe that's the issue?
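If it is a permissions problem, the service account behind GCR_READWRITE_KEY needs storage access on the bucket backing the registry; a sketch of granting it (project and account names are placeholders):

# us.gcr.io images are stored in the us.artifacts.<PROJECT>.appspot.com bucket
gsutil iam ch \
  serviceAccount:ci-builder@<PROJECT>.iam.gserviceaccount.com:objectAdmin \
  gs://us.artifacts.<PROJECT>.appspot.com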

User-level permissions for pod access to S3 buckets

As a user, I'd like to use an S3 bucket (or a prefix within a shared bucket) as a storage option for my work. Ideally, that would be something that had access control such that only users with correct permissions can interact with it.

This is definitely possible from an AWS IAM policy perspective. For example: https://aws.amazon.com/premiumsupport/knowledge-center/iam-s3-user-specific-folder/

The challenge is that while we can give this permission at an instance level (via IAM Instance Profiles), multiple users' pods may end up on the same underlying instance. Thus a pod could access any co-resident pod's S3 bucket/prefix.

Another option would be to use static credentials for users. It would be important for these to be scoped to S3 actions/conditions only, and only to requests from our cluster's CIDR block. Then we could inject the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars into a user's pod. But I'm unsure how the actual implementation would work -- specifically, what would inject those env vars into a pod, and how might it do that?
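On the injection question specifically, one low-tech mechanism is z2jh's singleuser.extraEnv (a sketch; the values below are placeholders, every user pod gets the same environment, and real credentials would need to come from a Secret rather than plain config):

singleuser:
  extraEnv:
    # ends up in every user pod's environment -- no per-user scoping yet
    AWS_ACCESS_KEY_ID: "PLACEHOLDER_KEY_ID"
    AWS_SECRET_ACCESS_KEY: "PLACEHOLDER_SECRET"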

There may be other options as well. I'm especially curious about @yuvipanda's and @jacobtomlinson's thoughts on this.

notebook pod complains jupyter is not installed

I have replaced the ocean image with a pass-through Dockerfile:

# Note that there must be a tag
FROM pangeo/pangeo-ocean:2019.03.12

That image lives over in https://github.com/pangeo-data/pangeo-stacks, where it is built by repo2docker. It is already being used by binder via https://github.com/pangeo-data/pangeo_ocean_examples/ in a similar way, and it seems to work.

However, here the notebook pod won't start, and I get these errors:

Traceback (most recent call last):
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 5, in <module>
    from jupyterhub.singleuser import SingleUserNotebookApp
ModuleNotFoundError: No module named 'jupyterhub'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/bin/jupyter-labhub", line 7, in <module>
    from jupyterlab.labhubapp import main
  File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 8, in <module>
    raise ImportError('You must have jupyterhub installed for this to work.')
ImportError: You must have jupyterhub installed for this to work.

What is going on here?
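One workaround to test (a sketch, not a confirmed diagnosis): if jupyterhub genuinely isn't in the image's conda environment, a derived image can add it, pinned to match the hub:

# Note that there must be a tag
FROM pangeo/pangeo-ocean:2019.03.12
# install the jupyterhub package the single-user server needs
# (pinned to the hub release mentioned elsewhere in this repo)
RUN conda install --yes jupyterhub=0.9.4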

Server startup failure on staging.pangeo.io

Spawn failed: HTTPSConnectionPool(host='10.4.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/staging/persistentvolumeclaims (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff25d601438>: Failed to establish a new connection: [Errno 111] Connection refused',))


Use Google Cloud Filestore for shared storage

Google Cloud now has a managed NFS provider (Filestore) that would be great for home directories. Currently, each user gets their own disk, which is expensive and rigid (you can't easily change sizes up or down after creation). It also makes some sharing scenarios harder.

Steps to use filestore:

  1. Create a filestore
  2. Configure z2jh to use NFS as the backing store for home directories. We use one filestore for all users, and use subPath to give each user rw access to their own directory. This scopes users to directories without relying on traditional Unix user permissions
  3. Use an initContainer to make sure the home directory created for the user has right permissions and ownership. See https://serverfault.com/questions/906083/how-to-mount-volume-with-specific-uid-in-kubernetes-pod for example.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/doc/source/amazon/efs_storage.rst has some info on doing something like this with EFS, which is used on AWS. Step (3) would be different.

Once we have a good idea on how to set this up, this can be contributed back to the z2jh docs.
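A sketch of step 3 (the volume name and the availability of singleuser.initContainers are assumptions; kubespawner's init_containers via hub.extraConfig is an alternative route):

singleuser:
  initContainers:
    # runs as root before the notebook container starts, to fix ownership
    # of the user's freshly created NFS subPath
    - name: volume-permissions
      image: busybox
      command: ["sh", "-c", "chown -R 1000:1000 /home/jovyan"]
      securityContext:
        runAsUser: 0
      volumeMounts:
        - name: home
          mountPath: /home/jovyan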

refactor branding templates for login page

Clusters on GCE are currently using a gitRepo volume to mount Pangeo styling templates for custom JupyterHub login pages. We're having trouble getting this to work on AWS due to lack of write permissions at /usr/local/share/jupyterhub/, and it seems that gitRepo is deprecated according to the kubernetes docs: https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo

It seems like the recommended approach would be to use initContainers under our hub: configuration; here is a nice example of that approach:
https://gist.github.com/tallclair/849601a16cebeee581ef2be50c351841

But... as far as I can tell, this would require adding an initContainers configuration option under hub: in the chart schema:
https://zero-to-jupyterhub.readthedocs.io/en/latest/reference.html#hub

So we may want to suggest this change in a new issue here:
https://github.com/jupyterhub/zero-to-jupyterhub-k8s

Wanted to post here first to make sure there is not an easier approach that I'm overlooking... @jhamman, @yuvipanda

Hubploy COMMIT_RANGE failing in this repo

We have a CI/CD failure on CircleCI now:

#!/bin/bash -eo pipefail
# CircleCI doesn't have equivalent to Travis' COMMIT_RANGE
COMMIT_RANGE=$(./.circleci/get-commit-range.py)
echo ${COMMIT_RANGE}
echo "export COMMIT_RANGE='${COMMIT_RANGE}'" >> ${BASH_ENV}
Traceback (most recent call last):
  File "./.circleci/get-commit-range.py", line 90, in <module>
    main()
  File "./.circleci/get-commit-range.py", line 84, in main
    print(from_branch(args.project, args.repo, branch_name))
  File "./.circleci/get-commit-range.py", line 29, in from_branch
    raise ValueError(f'No PR from branch {branch_name} in upstream repo found')
ValueError: No PR from branch staging in upstream repo found
Exited with code 1

I know @yuvipanda was mentioning this is a bit of a tricky part of the current setup. I think we just need someone to look into this and figure out what isn't working.

cc @rabernat and @raphaeldussin

ocean staging notebooks won't launch

I had a notebook pod die spontaneously. Now I can't start any more.

2019-03-24 02:47:40+00:00 [Warning] 0/11 nodes are available: 1 node(s) had disk pressure, 10 Insufficient memory, 2 Insufficient cpu.
2019-03-24 02:47:48+00:00 [Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)

One possibly related point is that I was downloading O(5 GB) of data to the /tmp directory. I thought that this was sitting on a 100 GB SSD, but it might be that I filled up some disk somewhere.

It's weird that the node pools won't just scale up to accommodate a new pod.
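A couple of things worth checking (a sketch): whether a node is actually tainted for disk pressure, and how much each node has left to allocate.

# DiskPressure shows up both as a node condition and as a taint
kubectl describe nodes | grep -iE -B 3 "disk-?pressure"
# per-node view of requests vs. allocatable
kubectl describe nodes | grep -A 6 "Allocated resources"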

multiple notebook docker images or kernels

How do we ensure that notebooks created on our clusters are always runnable, even as the notebook images evolve? The only choice I see is some sort of versioning system that allows users to select past versions of their environments. There are two ways this could work:

  • At the notebook docker image level (i.e. use ProfileList to provide a choice of images)
  • At the kernel level: we somehow make available many different kernels within a single notebook image, and notebooks created with a certain kernel will always open with that kernel

Has anyone thought about how to solve this problem?
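A sketch of the first option, reusing the profile_list mechanism shown in earlier issues (image names and tags are illustrative, and the exact override key -- 'image' vs 'image_spec' -- depends on the kubespawner version in use):

# one profile_list entry per archived image version
{
    'display_name': 'pangeo-notebook 2019.03 (current)',
    'kubespawner_override': {'image_spec': 'pangeo/pangeo-notebook:2019.03'},
},
{
    'display_name': 'pangeo-notebook 2018.12 (archive)',
    'kubespawner_override': {'image_spec': 'pangeo/pangeo-notebook:2018.12'},
},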
