pangeo-data / pangeo-cloud-federation
Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
Home Page: https://pangeo.io/cloud.html
I tried my new non-pinned environment on dev.pangeo.io, and when I try to start a cluster, I get:
distributed.core - ERROR - add_worker() got an unexpected keyword argument 'cpu'
Traceback (most recent call last):
File "/srv/conda/lib/python3.6/site-packages/distributed/core.py", line 340, in handle_comm
result = yield result
File "/srv/conda/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/srv/conda/lib/python3.6/site-packages/tornado/gen.py", line 307, in wrapper
result = func(*args, **kwargs)
So I guess the latest conda-forge packages distributed=1.23.3 and tornado=5.1.1 don't play nice together, right?
So I should roll back to distributed=1.22.1 and tornado=5.0.2 and try again, right?
Just checking this is the right approach. I won't ask this every time, I promise.
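If pinning turns out to be the right approach, a minimal sketch of what the pins could look like in the image's conda environment.yml (versions taken from the thread; the exact file location in this repo is an assumption):
# Hypothetical excerpt of an environment.yml pinning the versions discussed above
name: pangeo
channels:
  - conda-forge
dependencies:
  - distributed=1.22.1  # roll back from 1.23.3, which fails with add_worker() ... 'cpu'
  - tornado=5.0.2       # roll back from 5.1.1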
@apawloski, @scottyhq, and I have been working on getting this repo set up with a new nasa deployment running on AWS. We've been making some changes to hubploy in berkeley-dsep-infra/hubploy#14 and have added most of the necessary bits to this repo.
We're currently stuck getting the eks authentication to work. This could just be how we've set up the IAM account, but I'm posting here to coordinate the final pieces. The current error is:
could not get token: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Error: Get https://958CB6CD107C87EEDAA83BFFEE9EEAFA.sk1.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: getting credentials: exec: exit status 1
The command aws eks update-kubeconfig --name pangeo succeeds but is not giving us sufficient privileges. I'm hoping @apawloski knows what to do here.
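One thing worth checking (a sketch, not a confirmed diagnosis): by default only the IAM identity that created an EKS cluster can authenticate, so any other identity has to be mapped into the cluster's aws-auth ConfigMap. The ARN and username below are hypothetical:
# First verify which identity the credentials resolve to:
#   aws sts get-caller-identity
# Then make sure that identity is mapped into the cluster, e.g.:
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapUsers: |
    - userarn: arn:aws:iam::123456789012:user/pangeo-deployer  # hypothetical IAM user
      username: pangeo-deployer
      groups:
        - system:masters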
We're going to start consolidating a number of separate Jupyterhubs into this repo. The repo name dev.pangeo.io-deploy will soon fail to describe what is here. What should we rename this repo to?
It looks like the PR worked its way through CI successfully: https://circleci.com/gh/pangeo-data/dev.pangeo.io-deploy/14
but is there something I need to do additionally or differently?
When I start my server on dev.pangeo.io I just see the same old pre-PR environment -- it doesn't seem to be updated.
There are so many cool ones: https://github.com/topics/jupyterlab-extension
Some ideas:
Ping @lheagy and @ian-r-rose for some suggestions on how to trick out our jupyterlabs!
When building images with repo2docker, we currently put everything into $HOME by default. This works great for Binder.
However, when running with a persistent JupyterHub, $HOME gets mounted over by the persistent home directory. This means nothing from the repo is visible to users, and things like the 'start' script don't work.
The biggest possible pod we allow in ocean.pangeo.io is defined by the profile_list entry:
(pangeo-cloud-federation/deployments/ocean/config/common.yaml, lines 53 to 58 at cfe8275)
We have a nodepool with n1-highmem-16 (16 vCPUs, 104 GB memory) nodes. However, when I try to launch the x-large profile, the event log shows
Server requested
2019-03-06 17:49:34+00:00 [Warning] 0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient memory.
2019-03-06 17:49:48+00:00 [Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)
How much headroom do we need between the pod resource requests and the node capacity? I would think that 14 cpus and 96GB of memory would fit on a 16 vCPU / 104GB memory node. How can we debug this?
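One way to debug this (a sketch; the node name is a placeholder): compare the profile's requests against the node's Allocatable rather than its Capacity, since the kubelet and system daemons reserve a slice of every node:
# Compare the profile's requests against Allocatable (not Capacity):
kubectl get nodes
kubectl describe node <an-n1-highmem-16-node> | grep -A 6 Allocatable
# GKE reserves CPU/memory for the kubelet and system daemons, and daemonset
# pods (kube-proxy, logging agents) consume more on top of that, so a
# 14 CPU / 96GB request can fail to fit on a 16 vCPU / 104GB node.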
There are a lot of notebooks in https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/dev/image, which come from https://github.com/pangeo-data/pangeo-example-notebooks.
However, in the README, the binder button still points to the pangeo-example-notebooks repo.
Could we launch a binder from the pangeo-cloud-federation repo? Or should the notebooks be deleted and maintained only in the original repo?
This is how the ocean staging.yml secrets look:
pangeo:
  jupyterhub:
    proxy:
      secretToken: XXX
    auth:
      type: globus
      globus:
        clientId: "XXX"
        clientSecret: "XXX"
        callbackUrl: "https://staging.ocean.pangeo.io/hub/oauth_callback"
        identityProvider: "orcid.org"
      admin:
        access: true
        users:
          - 0000-0001-7479-8439 # Joe Hamman
          - 0000-0001-5999-4917 # Ryan Abernathey
          - 0000-0003-4004-4553 # Raphael Dussin
The hub startup log says
Loading /etc/jupyterhub/config/values.yaml
Loading /etc/jupyterhub/secret/values.yaml
Loading extra config: customPodHook
Loading extra config: profile_list
[I 2019-03-07 00:08:39.981 JupyterHub app:1673] Using Authenticator: oauthenticator.globus.GlobusOAuthenticator-0.8.1
[I 2019-03-07 00:08:39.981 JupyterHub app:1673] Using Spawner: kubespawner.spawner.KubeSpawner
[I 2019-03-07 00:08:39.981 JupyterHub app:1016] Loading cookie_secret from /srv/jupyterhub/jupyterhub_cookie_secret
[W 2019-03-07 00:08:40.079 JupyterHub app:1131] JupyterHub.hub_connect_port is deprecated as of 0.9. Use JupyterHub.hub_connect_url to fully specify the URL for connecting to the Hub.
[W 2019-03-07 00:08:40.081 JupyterHub app:1173] No admin users, admin interface will be unavailable.
[W 2019-03-07 00:08:40.082 JupyterHub app:1174] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2019-03-07 00:08:40.082 JupyterHub app:1201] Not using whitelist. Any authenticated user will be allowed.
Note the "No admin users" warning. What's wrong?
Cross posting from pangeo-data/pangeo#348
We should try this out here. I'm curious if @dsludwig has any idea of how hubploy / repo2docker could handle this. I'm wondering if we'll need to reconfigure things a bit in hubploy to support this. Is anyone interested in giving this a go?
In the weekly Pangeo developers call today, this topic came up. Should we move the domain specific deployments (atmos, ocean, polar, hydroshare, etc.) to this repo? We seem to be a bit fragmented in our development and maintenance of our various jupyterhubs and this could be a way to centralize some of our knowledge/resources.
cc @dsludwig @rabernat @raphaeldussin @NicWayand @bartnijssen @rsignell-usgs
Finally fixed #148 and now we can use dask labextension to start KubeCluster schedulers on staging.ocean.pangeo.io.
The next problem is that apparently launching dask workers from KubeCluster doesn't work at all, whether I try to start them from the lab extension or from notebook code. kubectl -n ocean-staging get pods shows no recent pending dask-jovyan- pods. It does however have some older dask-jovyan- pods (e.g. dask-jovyan-bc51065a-9nhsmh) for which the GCP console tells me:
Could this be related to the node selector?
This is a pretty big problem, since these clusters are our killer feature.
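A quick way to check the node-selector hypothesis (a sketch; the pod name is taken from above):
# Show the stuck pod's nodeSelector, tolerations, and scheduler events
kubectl -n ocean-staging describe pod dask-jovyan-bc51065a-9nhsmh
# Compare against the labels actually present on the nodes
kubectl get nodes --show-labels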
We're running into an issue with helm upgrade for deploying to aws that is related to nfs configuration settings under the shared pangeo-deploy directory. This is resolved by deleting these settings, but doing so will presumably affect GCS deployments: scottyhq@6dd4f18
Should we ensure whatever is under pangeo-deploy is as bare-bones as possible and not linked to specific cloud providers or deployments?
helm upgrade --wait --install --namespace nasa-staging nasa-staging pangeo-deploy -f deployments/nasa/config/common.yaml -f deployments/nasa/config/staging.yaml -f deployments/nasa/secrets/staging.yaml --set jupyterhub.singleuser.image.tag=a7ff12a --set jupyterhub.singleuser.image.name=pangeo/nasa-pangeo-io-not
Error: release nasa-staging failed: PersistentVolume "nasa-staging-home-nfs" is invalid: spec.nfs.server: Required value
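One possible middle ground (a sketch; the value names below are hypothetical, not the repo's actual schema): gate the NFS PersistentVolume template on a per-deployment flag, so deployments that don't set it simply don't render the resource:
# Hypothetical guard in pangeo-deploy/templates/home-storage.yaml
{{ if .Values.homeDirectories.nfs.enabled }}
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {{ .Release.Name }}-home-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: {{ .Values.homeDirectories.nfs.serverIP }}  # required; leaving it unset causes the error above
    path: {{ .Values.homeDirectories.nfs.serverPath }}
{{ end }}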
I'm currently running pangeo.esipfed.org on the pangeo-access AWS kops cluster the old way: manually executing the docker build, pushing to dockerhub, and re-upping the helm chart.
I would like to move to the new approach. Anything I should be aware of?
Hubploy is not pushing my notebook docker images. Consequently, I am getting errors in my hub like:
Failed to pull image "us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be": rpc error: code = Unknown desc = Error response from daemon: manifest for us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be not found
It seems like the commit f1112be is getting built on the PR branch (#97), but on the deploy job it determines nothing needs to be done.
#!/bin/bash -eo pipefail
hubploy build ocean --commit-range ${COMMIT_RANGE} --push
Activated service account credentials for: [[email protected]]
WARNING: `docker` not in system PATH.
`docker` and `docker-credential-gcloud` need to be in the same PATH in order to work correctly together.
gcloud's Docker credential helper can be configured but it will not work until this is corrected.
gcloud credential helpers already registered correctly.
Image us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be: already up to date
As a result, the image is never pushed.
Relevant circleci config is here:
https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/.circleci/config.yml#L125-L129
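To confirm whether the tag really exists in the registry (a sketch using standard gcloud commands; project and image names taken from the log above):
# List the tags GCR actually has for the notebook image
gcloud container images list-tags us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook
# Or inspect the one manifest directly
gcloud container images describe us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:f1112be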
I could not figure out how to get system packages like "vim" and "nano" into the notebook environment.
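repo2docker picks up an apt.txt file (at the repo root or under binder/) and installs the listed Debian packages with apt-get, so something like deployments/<deployment>/image/binder/apt.txt containing one package name per line should work:
vim
nano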
I am in the process of setting this up on Circle CI, and the one thing I am still a bit lost about is the exact steps I would need to take to create the image that goes into IMAGE_NAME.
Is there a Github repo somewhere with the Dockerfile (or multiple) for some generic pangeo images that I can customize?
@raphaeldussin, @rabernat and I worked on this repo a bit today. Here are my notes on what we worked on and how we did a few pieces:
pangeo:
Thanks @yuvipanda for popping in and giving us some super valuable feedback.
Yuvi also shared this repo: https://github.com/yuvipanda/datahub/tree/external
I've tried out using Google Filestore and the setup suggested by @yuvipanda with success! I've also enjoyed the benefits: I was able to recover smoothly when my k8s cluster crashed beyond repair while upgrading from 1.11 to 1.12 due to a GKE TPU-related issue. Because there wasn't one copy of GCP-PD/PV/PVC for each user, this was doable, so thank you all for guiding the path!!
Anyhow, I have run into an issue with the setup that will probably affect you too, since I copied your setup solution. The issue arises for me when autoscaling up in the morning and two user pods start at the same time on a node that is about to become ready. It works fine if they arrive one at a time! I'm not confident about what the pods do, and when, that makes things fail when two attempt it at the same time. It seems that when two user pods arrive within a minute of each other, both waiting for images to be pulled since the node is freshly created, the issue strikes!
I think I can mitigate most of this issue by having a quick startup of pods, but when it happens I'm forced to drain the node to recover!
This is the error as found in the events of the pods.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 9m13s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west4-a/instanceGroups/gke-ds-platform-users-352836a1-grp 0->1 (max: 3)}]
Warning FailedScheduling 8m32s (x25 over 9m37s) jupyterhub-user-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
Warning FailedMount 7m9s kubelet, gke-ds-platform-users-352836a1-7lb1 MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r8fdfd62f64e44eb995557473092b3ab5.scope
Mount failed: Mount failed for NFS V3 even after running rpcBind
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified, exit status 32
Warning FailedMount 7m9s kubelet, gke-ds-platform-users-352836a1-7lb1 MountVolume.SetUp failed for volume "home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs 10.64.16.18:/home /var/lib/kubelet/pods/180b5b91-34e6-11e9-bc8b-42010a401004/volumes/kubernetes.io~nfs/home-nfs
Output: Running scope as unit: run-r811263fce8b34ac7a5389196e9458cdc.scope
Mount failed: Mount issued for NFS V3 but unable to run rpcbind:
Output: rpcbind: another rpcbind is already running. Aborting
Hmm, so note that what fails does not relate to what's within the init-container or container, but to the pod's volumes section.
# From the jupyter-my-user pod's spec (not nested under a specific (init-)container)
# As generated by the helm chart options `storage.type: static`
volumes:
  - name: home
    persistentVolumeClaim:
      claimName: home-nfs
Note that this section was created due to:
(pangeo-cloud-federation/deployments/dev/config/common.yaml, lines 14 to 18 at 2b18049)
https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/pangeo-deploy/templates/home-storage.yaml
Related: #25, #28
As we work on consolidating the many hubs we once had running, do we feel there is a need for a distinct hub dedicated to atmospheric science - i.e. a new Atmos deployment to match the one created for Ocean?
When I initially got involved with this project, my intention was to upload data generated as part of TRACMIP to Pangeo's cloud bucket and play around with it in a hub deployed specifically for the atmospheric sciences. However, it seems we may be able to get by using the hub for oceanography.
Is there interest within the community to continue maintaining a hub specifically for atmospheric sciences?
While most of this should be cloud agnostic (because it's running on top of an existing kubernetes deployment), there seem to be GCP-specific components described in the README and circleci tasks.
A few questions:
Does anybody currently use this for an AWS Pangeo deployment?
Do we have an idea of what would need to change to support AWS instead of GCP?
How should we support other cloud providers for these deployment repos? (Different repos? Branches? Support both with a configuration switch?)
Particularly interested in @dsludwig's, @jacobtomlinson's, and @yuvipanda's thoughts.
We need to decide what to log and how. Ideally we could keep track of
what else?
The default option on Google Cloud is JupyterHub logs -> Stackdriver -> BigQuery.
Should we replace the generic readme with something that explains this setup?
Branch | Deployed at |
---|---|
develop | https://dev.pangeo.io |
staging | https://staging.pangeo.io |
prod | https://hub.pangeo.io |
and perhaps something about the process and how long it takes for changes in the environment.yml or other config to take effect?
I am getting this in my event log as it tries to start my server
2019-02-21 21:23:40+00:00 [Warning] MountVolume.SetUp failed for volume "ocean-staging-home-nfs" : mount failed: exit status 1
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs --scope -- /home/kubernetes/containerized_mounter/mounter mount -t nfs <nil>:/ /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs
Output: Running scope as unit: run-r9a5ddc860e034ff9ac1e10447920c217.scope
Mount failed: mount failed: exit status 32
Mounting command: chroot
Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs <nil>:/ /var/lib/kubelet/pods/a6af3b91-361e-11e9-94e8-42010a800086/volumes/kubernetes.io~nfs/ocean-staging-home-nfs]
Output: mount.nfs: Failed to resolve server <nil>: Name or service not known
We currently have a 'hack' solution for mounting user home directories on efs (see
) which differs a bit from the zero2jupyterhub docs: https://zero-to-jupyterhub.readthedocs.io/en/latest/amazon/efs_storage.html
What is the current best practice for mounting efs home directories?
This is also relevant b/c we'd like users to be able to "bring their own conda environment.yml" to our deployed image in a way that works with dask KubeCluster. It seems there are quite a few github issues out there and I'm not clear on whether that is possible.
Quoting @yuvipanda: "...one way to do it is to share $HOME between workers and your notebook pod. That way, this turns into 'have local conda environments'. IMO, I like this more than having conda run on each worker forever to update environment"
So the other thing to sort out is how to get the shared $HOME into our dask_config.yaml: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/image/binder/dask_config.yaml
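A sketch of what that could look like in dask_config.yaml, assuming dask-kubernetes' worker-template and an NFS-backed claim named home-nfs (both the claim name and mount path are assumptions):
# Hypothetical dask_config.yaml excerpt mounting the shared home into workers
kubernetes:
  worker-template:
    spec:
      containers:
        - name: dask-worker
          volumeMounts:
            - name: home
              mountPath: /home/jovyan  # assumed home path in the image
      volumes:
        - name: home
          persistentVolumeClaim:
            claimName: home-nfs  # assumed shared claim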
Error: failed to start container "notebook": Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"binder/./start\": stat binder/./start: no such file or directory"
I am playing around with using globus for auth. I followed the instructions here:
https://zero-to-jupyterhub.readthedocs.io/en/latest/authentication.html#globus
My hub pod is giving this error
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1955, in launch_instance_async
await self.initialize(argv)
File "/usr/local/lib/python3.6/dist-packages/jupyterhub/app.py", line 1639, in initialize
self.load_config_file(self.config_file)
File "<decorator-gen-5>", line 2, in load_config_file
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 598, in load_config_file
raise_config_file_errors=self.raise_config_file_errors,
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 562, in _load_config_files
config = loader.load_config()
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 457, in load_config
self._read_file_as_dict()
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
py3compat.execfile(conf_filename, namespace)
File "/usr/local/lib/python3.6/dist-packages/ipython_genutils/py3compat.py", line 198, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "/srv/jupyterhub_config.py", line 288, in <module>
set_config_if_not_none(c.GlobusOAuthenticator, trait, 'auth.globus.' + cfg_key)
TypeError: must be str, not NoneType
I think this is related to these issues and PRs:
I wonder if @consideRatio can confirm that this is related to his recent PR.
If so, how do we point at the very latest chart?
@dsludwig - @yuvipanda is looking for the git-crypt keys for this repo. Can you communicate those to him over a secure channel?
We should add node selectors to the https/proxy pods for all of our GKE clusters (dev/ocean/hydro). This will make scaling of the notebook and dask pools far more efficient. For example, the proxy pod for ocean-prod is sitting in the highmem pool right now.
$ kubectl get pods --namespace ocean-prod --output wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
autohttps-6555b4fd9c-bgqp5 2/2 Running 0 16h 10.32.1.22 gke-dev-pangeo-io-cluste-default-pool-c2a8b6ac-52hg <none>
hub-7bbd7c5d-rzcmw 1/1 Running 0 9h 10.32.2.8 gke-dev-pangeo-io-cluste-default-pool-c2a8b6ac-xz3g <none>
proxy-65c5d54b94-fg68j 1/1 Running 0 7h 10.32.8.19 gke-dev-pangeo-io-clust-n1-highmem-16-a99509de-lhtp <none>
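In the z2jh chart values this could look something like the following (a sketch; the pool name is an assumption, and the exact supported keys depend on the chart version):
# Hypothetical addition to a deployment's config/common.yaml
pangeo:
  jupyterhub:
    hub:
      nodeSelector:
        cloud.google.com/gke-nodepool: default-pool  # assumed pool name
    proxy:
      nodeSelector:
        cloud.google.com/gke-nodepool: default-pool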
Currently, this is using hubploy from @dsludwig's fork. We've incorporated all the changes from the fork into hubploy master / repo2docker. We should try to move this back to using hubploy master.
This should ideally happen at the same time as consolidating all hubs into one repo
In order to make my custom logo work on ocean.pangeo.io., I need a jupyterhub with this PR in it. The latest release of jupyterhub was in September, 0.9.4, and does not include that PR.
We currently point to jupyterhub helm chart version 0.9-e120fda. I assumed that would be pulling in a very recent master, since it is a devel release tagged on March 1 (https://jupyterhub.github.io/helm-chart/). But apparently this is not the case. I believe our hubs are using 0.9.4.
We've now set up staging.nasa.pangeo.io to allow users to create their own conda environments
(see https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/nasa/config/common.yaml#L34).
However, I'm currently running into "The environment is inconsistent" and hanging "solving environment" issues with conda in our image. I noticed that /srv/conda/.condarc has the following config:
channels:
- conda-forge
- defaults
auto_update_conda: false
show_channel_urls: true
update_dependencies: false
I'm wondering about update_dependencies: false causing trouble. It comes from repo2docker (https://github.com/jupyter/repo2docker/blob/9099def40a331df04ba3ed862ee27a8e4a77fe43/repo2docker/buildpacks/conda/install-miniconda.bash#L39).
I also noticed we end up with a mix of packages from conda-forge, defaults, and pypi currently, which I guess is originating from pangeo-stacks:
https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml
So... @yuvipanda, @jhamman: should we drop update_dependencies: false?
Before launching these new clusters, we should decide what to do for auth. My wishlist of features includes the ability to
Some possible options are:
This seems to be a replay of #52! I thought we squashed this with #54?
#!/bin/bash -eo pipefail
# CircleCI doesn't have equivalent to Travis' COMMIT_RANGE
COMMIT_RANGE=$(./.circleci/get-commit-range.py)
echo ${COMMIT_RANGE}
echo "export COMMIT_RANGE='${COMMIT_RANGE}'" >> ${BASH_ENV}
Traceback (most recent call last):
File "./.circleci/get-commit-range.py", line 90, in <module>
main()
File "./.circleci/get-commit-range.py", line 84, in main
print(from_branch(args.project, args.repo, branch_name))
File "./.circleci/get-commit-range.py", line 29, in from_branch
raise ValueError(f'No PR from branch {branch_name} in upstream repo found')
ValueError: No PR from branch tweak-docker in upstream repo found
Exited with code 1
cc @yuvipanda
I set things up according to the README, and am still getting an authentication error on CircleCI in the "Build primary image if needed" step, when it is trying to run hubploy-image-builder:
#!/bin/bash -eo pipefail
hubploy-image-builder \
--push \
--registry-url https://us.gcr.io \
--registry-username _json_key \
--registry-password "${GCR_READWRITE_KEY}" \
--repo2docker \
deployments/${DEPLOYMENT}/image/ ${IMAGE_NAME}
Traceback (most recent call last):
File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
response.raise_for_status()
File "/root/repo/venv/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://35.237.221.205:2376/v1.35/distribution/us.gcr.io/learning-2-learn-221016/example-pangeo-io-notebook:2b306e1/json
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/repo/venv/bin/hubploy-image-builder", line 11, in <module>
load_entry_point('hubploy==0.1.0', 'console_scripts', 'hubploy-image-builder')()
File "/root/repo/venv/lib/python3.6/site-packages/hubploy/imagebuilder.py", line 151, in main
if needs_building(client, args.path, args.image_name):
File "/root/repo/venv/lib/python3.6/site-packages/hubploy/imagebuilder.py", line 22, in needs_building
image_manifest = client.images.get_registry_data(image_spec)
File "/root/repo/venv/lib/python3.6/site-packages/docker/models/images.py", line 333, in get_registry_data
attrs=self.client.api.inspect_distribution(name),
File "/root/repo/venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 34, in wrapper
return f(self, *args, **kwargs)
File "/root/repo/venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/root/repo/venv/lib/python3.6/site-packages/docker/api/image.py", line 266, in inspect_distribution
self._get(self._url("/distribution/{0}/json", image)), True
File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 262, in _result
self._raise_for_status(response)
File "/root/repo/venv/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/root/repo/venv/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication")
Exited with code 1
Do I need to edit the secrets/staging.yaml file in some way? I am using my own GCP account, not the one y'all have, so maybe that's the issue?
As a user, I'd like to use an S3 bucket (or a prefix within a shared bucket) as a storage option for my work. Ideally, that would be something that had access control such that only users with correct permissions can interact with it.
This is definitely possible from an AWS IAM policy perspective. For example: https://aws.amazon.com/premiumsupport/knowledge-center/iam-s3-user-specific-folder/
The challenge is that while we can give this permission at an instance level (via IAM Instance Profiles), multiple users' pods may end up on the same underlying instance. Thus a pod could access any co-resident pod's S3 bucket/prefix.
Another option would be to use static credentials for users. It would be important for these to be scoped to S3 actions/conditions only, and only to requests from our cluster's CIDR block. Then we could inject the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env vars into a user's pod. But I'm unsure of how the actual implementation would work -- specifically, what would inject those env vars into a pod, and how might it do that?
There may be other options as well. I'm especially curious about @yuvipanda and @jacobtomlinson thoughts on this.
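One mechanism worth considering (a sketch, not a vetted design): JupyterHub's configurable Spawner.pre_spawn_hook runs before each user server starts and can set per-user environment variables on the spawner. The credential-lookup helper below is hypothetical:
# Hypothetical snippet for hub.extraConfig (Python executed by JupyterHub)
def get_scoped_s3_creds(username):
    # Hypothetical helper: fetch per-user, S3-scoped credentials from a secret store.
    raise NotImplementedError("wire this up to your credential store")

def inject_s3_creds(spawner):
    # Inject scoped AWS credentials into this user's pod environment
    creds = get_scoped_s3_creds(spawner.user.name)
    spawner.environment['AWS_ACCESS_KEY_ID'] = creds['key_id']
    spawner.environment['AWS_SECRET_ACCESS_KEY'] = creds['secret']

c.Spawner.pre_spawn_hook = inject_s3_creds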
https://hub.pangeo.io is the current go-to place for public pangeo-pydata members (http://pangeo.io/deployments.html).
Unfortunately, logging into the site gives a 500: Internal Server Error.
https://dev.pangeo.io appears to be working fine.
I have replaced the ocean image with a passthrough Dockerfile.
That image lives over in https://github.com/pangeo-data/pangeo-stacks, where it is built by repo2docker. It is already being used by binder via https://github.com/pangeo-data/pangeo_ocean_examples/ in a similar way, and it seems to work.
However, here the notebook pod won't start, and I get these errors:
Traceback (most recent call last):
File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 5, in <module>
from jupyterhub.singleuser import SingleUserNotebookApp
ModuleNotFoundError: No module named 'jupyterhub'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/conda/bin/jupyter-labhub", line 7, in <module>
from jupyterlab.labhubapp import main
File "/srv/conda/lib/python3.6/site-packages/jupyterlab/labhubapp.py", line 8, in <module>
raise ImportError('You must have jupyterhub installed for this to work.')
ImportError: You must have jupyterhub installed for this to work.
What is going on here?
Spawn failed: HTTPSConnectionPool(host='10.4.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/staging/persistentvolumeclaims (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff25d601438>: Failed to establish a new connection: [Errno 111] Connection refused',))
Google Cloud now has a managed NFS provider (Filestore) that would be great for home directories. Currently, each user gets their own disk, which is expensive and rigid (you can't easily change sizes up or down after creation). It also makes some sharing scenarios harder.
Steps to use filestore:
subPath
to give them rw access to each directory. This is used to scope users to directories, rather than traditional Unix user permissions.
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/doc/source/amazon/efs_storage.rst has some info on doing something like this with EFS, which is used on AWS. Step (3) would be different.
Once we have a good idea on how to set this up, this can be contributed back to the z2jh docs.
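For reference, the z2jh-style config this ends up as (a sketch; the claim name is an assumption) is a single shared claim plus a per-user subPath:
# Hypothetical singleuser storage config against a shared Filestore-backed PVC
singleuser:
  storage:
    type: static
    static:
      pvcName: home-nfs           # assumed claim bound to the Filestore export
      subPath: "home/{username}"  # scopes each user to their own directory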
Clusters on GCE are currently using a gitRepo volume to mount pangeo styling templates for custom jupyterhub login pages. We're having trouble getting this to work on AWS due to lack of write permissions at /usr/local/share/jupyterhub/, and it seems that gitRepo is deprecated according to the kubernetes docs: https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo
It seems like the recommended approach would be to use initContainers under our hub: configuration. Here is a nice example of that approach:
https://gist.github.com/tallclair/849601a16cebeee581ef2be50c351841
But... as far as I can tell, this would require adding an initContainers configuration option under hub::
https://zero-to-jupyterhub.readthedocs.io/en/latest/reference.html#hub
https://zero-to-jupyterhub.readthedocs.io/en/latest/reference.html#hub
So we may want to suggest this change in a new issue here:
https://github.com/jupyterhub/zero-to-jupyterhub-k8s
Wanted to post here first to make sure there is not an easier approach that I'm overlooking... @jhamman, @yuvipanda
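If the chart grew such an option, the values might look something like this (entirely hypothetical, modeled on the gist above; hub.initContainers was not a supported chart option at the time, and the templates repo URL is assumed):
# Hypothetical hub values: clone templates into an emptyDir via an init container
hub:
  extraVolumes:
    - name: custom-templates
      emptyDir: {}
  extraVolumeMounts:
    - name: custom-templates
      mountPath: /usr/local/share/jupyterhub/custom_templates
  initContainers:  # not yet supported by the chart; would need an upstream PR
    - name: git-clone-templates
      image: alpine/git
      args:
        - clone
        - --single-branch
        - --depth=1
        - https://github.com/pangeo-data/pangeo-custom-jupyterhub-templates.git  # assumed repo
        - /tmp/templates
      volumeMounts:
        - name: custom-templates
          mountPath: /tmp/templates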
We have a CICD failure on CircleCI now:
#!/bin/bash -eo pipefail
# CircleCI doesn't have equivalent to Travis' COMMIT_RANGE
COMMIT_RANGE=$(./.circleci/get-commit-range.py)
echo ${COMMIT_RANGE}
echo "export COMMIT_RANGE='${COMMIT_RANGE}'" >> ${BASH_ENV}
Traceback (most recent call last):
File "./.circleci/get-commit-range.py", line 90, in <module>
main()
File "./.circleci/get-commit-range.py", line 84, in main
print(from_branch(args.project, args.repo, branch_name))
File "./.circleci/get-commit-range.py", line 29, in from_branch
raise ValueError(f'No PR from branch {branch_name} in upstream repo found')
ValueError: No PR from branch staging in upstream repo found
Exited with code 1
I know @yuvipanda was mentioning this is a bit of a tricky part of the current setup. I think we just need someone to look into this and figure out what isn't working.
cc @rabernat and @raphaeldussin
We'd like to start moving staging.[deployment].pangeo.io to production deployments. What needs to happen to do this?
what else?
I had a notebook pod die spontaneously. Now I can't start any more.
2019-03-24 02:47:40+00:00 [Warning] 0/11 nodes are available: 1 node(s) had disk pressure, 10 Insufficient memory, 2 Insufficient cpu.
2019-03-24 02:47:48+00:00 [Normal] pod didn't trigger scale-up (it wouldn't fit if a new node is added)
One possible related point is that I was downloading O(5 GB) of data to the /tmp directory. I thought that this was sitting on a 100 GB SSD. But it might be that I filled up some disk somewhere.
It's weird that the node pools won't just scale up to accommodate a new pod.
I am seeing this in my notebook pod logs:
404 GET /user/0000-0001-5999-4917/dask/clusters?1551920957641 ([email protected]) 2.41ms
after I click on the button to create a new cluster.
How do we ensure that notebooks created on our clusters are always runnable, even as the notebook images evolve? The only choice I see is to have some sort of versioning system that allows users to select past versions of their environments. There are two ways this could work:
Has anyone thought about how to solve this problem?
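One concrete shape this could take (a sketch; the image tags are placeholders): expose pinned image tags through the z2jh profileList so users can pick an older environment at spawn time:
# Hypothetical profileList letting users select a pinned past environment
singleuser:
  profileList:
    - display_name: "Current environment"
      default: true
    - display_name: "Environment as of 2019-01 (pinned)"
      kubespawner_override:
        image: us.gcr.io/pangeo-181919/ocean-pangeo-io-notebook:abc1234  # placeholder tag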