
kiosk-console's People

Contributors

ajt0ng, dylanbannon, elaubsch, enricozb, mekwarrior, msschwartz21, osterman, willgraf

kiosk-console's Issues

Bug Fixes

  • Update README to indicate developers should set up AWS credentials
  • aws configure writes to /root
  • .aws only working from /conf
  • running kiosk twice should error
  • kubernetes learning materials

"stop" or "pause" cluster

It'd be good if we could just pause all the instances, suspending instance costs, without tearing down the entire cluster.

Cluster destruction partially fails

When I use the Destroy command from the main menu to destroy the cluster, it never finishes: it fails to delete the VPC and its associated DHCP options set.

When I look around afterwards, I see that 1) the keypair located in /localhost/.geodesic is still present, and 2) the S3 bucket that stored the cluster configuration wasn't deleted.

Perhaps this is an issue with the newer version of Geodesic?

AWS too many VPCs error

This is what you see when you have too many VPCs on your AWS account:
[screenshot: AWS VPC limit error]

We need to

  1. document this in the troubleshooting document, and, potentially,
  2. start deleting VPCs upon AWS cluster shutdown, since each AWS cluster gets its own VPC.

This is the second time I've seen this error come up.

learn how to use Redis

What is Redis and how does it work?

(Or: how can we best use a combination of queues and hashmaps to 1) ensure that no job gets accidentally processed twice, and 2) let the frontend monitor the status of jobs?)
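
As a rough sketch of the queue-plus-hash idea (the key and field names below are made up for illustration, not the kiosk's actual schema): a producer pushes job IDs onto a list, job metadata lives in a hash, a consumer moves each job to a processing list so nothing is picked up twice, and the frontend polls the hash for status.

    # Key names (predict-queue, job-42) are hypothetical.
    # Producer: enqueue a job and record its status in a hash.
    redis-cli LPUSH predict-queue job-42
    redis-cli HSET job-42 status new
    redis-cli HSET job-42 file image.tif

    # Consumer: atomically move the job to a processing list so no other
    # consumer can pick it up, then update its status as work progresses.
    redis-cli RPOPLPUSH predict-queue processing-queue
    redis-cli HSET job-42 status processing
    redis-cli HSET job-42 status done

    # Frontend: poll the hash to report progress to the user.
    redis-cli HGETALL job-42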

GPU scaling fails under certain circumstances

Adapted from a comment in #58.

I tested GPU scaling a lot and here's what I found:
On master right now (as of the merging of appropriate-cluster-sizing):
GKE-K80-predict -> works
AWS-K80-predict -> works
GKE-K80-train -> works
AWS-K80-train -> doesn't work
GKE-P100-predict -> doesn't work
GKE-P100-train -> works

This shows that everything that works in master works in this branch, and that there are two different cases (of those tested here) that fail. I'm going to open one issue about those cases, but I suspect they have different causes, since they involve different clouds, different GPUs, and different cluster functionality.

print cluster IP to kiosk

When a user creates a cluster, the cluster's public IP should be printed to the screen once startup is complete.

URL in GKE login process

Do we need to generate new URLs for users' GKE logins? It seems like there's one hard-coded URL being used repeatedly, and it's unclear whether that's actually appropriate for distribution of the kiosk.

Deleting instances in AWS

If AWS cluster teardown fails, users might find themselves trying to delete EC2 instances manually. The instances can keep respawning, because they belong to an Auto Scaling group. We should include documentation on how to deal with this in the troubleshooting document. More generally, it might be a good idea to document the entire process of deleting a cluster manually.
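
For the troubleshooting doc, a sketch of the manual cleanup (resource names below are placeholders, not the kiosk's actual naming). The Auto Scaling group has to be scaled to zero or deleted first, otherwise terminated instances are simply replaced:

    # Find the Auto Scaling groups left behind by the cluster (names are placeholders).
    aws autoscaling describe-auto-scaling-groups \
      --query 'AutoScalingGroups[].AutoScalingGroupName'

    # Either scale the group to zero...
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name nodes.mycluster.k8s.local \
      --min-size 0 --max-size 0 --desired-capacity 0

    # ...or delete it outright, which also terminates its instances.
    aws autoscaling delete-auto-scaling-group \
      --auto-scaling-group-name nodes.mycluster.k8s.local --force-delete

    # Any stray instances can then be terminated without respawning.
    aws ec2 terminate-instances --instance-ids i-0123456789abcdef0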

reading in configuration variables on kiosk startup

When the kiosk starts up, it doesn't automatically read in any configuration variables (from env.aws or env.gke), even though it will read env and indicate that one of the two clouds is "active". The appropriate variables should be read in on kiosk startup.
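
A minimal sketch of what startup could do, assuming env records the active cloud in a variable (CLOUD_PROVIDER is an assumed name, not necessarily the kiosk's):

    # Read the shared settings, then the cloud-specific ones.
    set -a                        # export everything we source
    source ./env
    case "${CLOUD_PROVIDER}" in   # assumed variable name
      aws) source ./env.aws ;;
      gke) source ./env.gke ;;
    esac
    set +a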

Troubleshooting doc

There are so many bugs a user can hit that I think we need to have a separate troubleshooting document with proposed fixes for specific errors.

Deprecate Terraform

what

  • Use the aws s3 CLI to provision buckets (see the sketch after this list)

why

  • Reduce number of buckets required (just one for kops)
  • Reduce moving pieces/complexity
  • Terraform is a heavy-handed tool if only a single bucket is needed
  • Make it easier to add support for GCE
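
A sketch of what bucket provisioning could look like without Terraform (the bucket names and regions below are placeholders; the GCE half assumes gsutil is available):

    # AWS: one bucket for kops state, with versioning enabled.
    # Bucket name and region are placeholders.
    aws s3 mb s3://mycluster-kops-state --region us-west-2
    aws s3api put-bucket-versioning --bucket mycluster-kops-state \
      --versioning-configuration Status=Enabled

    # GCE: the equivalent with gsutil.
    gsutil mb -l us-west1 gs://mycluster-kops-state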

speeding up tf-serving pod creation

It would be nice to decrease the initial wait time for predictions.

One possible strategy could be to create a volume as part of the cluster and load the tf-serving docker image onto it during cluster creation. Then, we could just mount the volume onto the GPU instance and save ourselves a minute or two in downloading the tf-serving docker image. Maybe?

Other ideas?

(This issue is not urgent. For now, it's more of a brainstorm. It would be nice to implement solutions for cutting down the initial wait time eventually, though.)
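
One way to prototype the idea: pre-pull the image on each GPU node as it boots, so the pod only has to start the container. A sketch, with a stand-in image name (the kiosk's actual tf-serving image may differ):

    # Hypothetical node-bootstrap snippet: cache the tf-serving image at boot
    # so the first prediction doesn't pay the image-pull cost.
    docker pull tensorflow/serving:latest-gpu   # placeholder image name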

credentials don't always save between kiosk sessions

Configuration options should be written to a file on the user's hard drive and then loaded as defaults by the kiosk on the next startup. This doesn't always happen. Could this be related to invoking the kiosk with sudo kiosk?

Unable to Schedule Pods on GPU Nodes on AWS

remediation

so here was my triage process

  1. spin up the cluster from scratch
  2. deploy the addons/nvidia-test.yaml <--- known working example
  3. observed that the pod was not getting scheduled
  4. looked at the autoscaler logs <--- observed that the node was scheduled and came up
  5. kubectl get daemonsets --all-namespaces <-- saw the daemonset deployed that installs the nvidia drivers (nvidia-device-plugin)
  6. looked at the logs for the nvidia-device-plugin
    observed the following
2018/09/28 16:48:44 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/09/28 16:48:44 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/09/28 16:48:44 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/09/28 16:48:44 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/09/28 16:48:44 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/09/28 16:48:44 Could not register device plugin: rpc error: code = Unimplemented d

so at this point, I assumed (incorrectly) that the node came up and the installation of the nvidia drivers failed

  7. ssh-add /localhost/.geodesic/id_rsa <--- add the ssh key
  8. ssh <user>@<node-ip> <--- ssh into the GPU node
  9. sudo bash
  10. looked at the logs journalctl -u nvidia-docker-install.service <-- observed no problems
  11. checked that docker has GPU access docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi <---- it did
  12. now I know the problem is on the kubernetes side
  13. realized we upgraded to kubernetes 1.10 from 1.9 by upgrading geodesic
  14. looked at plugins/nvidia-device-plugin.yaml <--- saw - image: nvidia/k8s-device-plugin:1.9 <--- assumed that was associated with the minor version of k8s
  15. upgraded to k8s-device-plugin:1.10, deleted the daemonset, kubectl apply -f plugins/nvidia-device-plugin.yaml
  16. success
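
For the record, the fix in steps 14-16 boils down to the following (the daemonset name below is the upstream default and may differ in this cluster):

    # After bumping the image tag in plugins/nvidia-device-plugin.yaml to
    # nvidia/k8s-device-plugin:1.10 (matching the cluster's kubernetes minor version):
    kubectl delete daemonset nvidia-device-plugin-daemonset --namespace kube-system
    kubectl apply -f plugins/nvidia-device-plugin.yaml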

Scaledown for training-job

The training-job may not be able to scale back up after scaling down. If someone has free time, they should investigate whether there's any issue with this.

Tensorflow-serving won't automatically detect newly-added models

We need to find a way to get TensorFlow Serving to serve models that have been uploaded to the storage bucket since tf-serving's creation.

Possibilities:

  1. Have the autoscaler watch the S3 bucket and delete all tf-serving pods whenever it detects a new model. (The pods should be restarted immediately.) This would also mean adjusting the redis-consumer's fault tolerance so that it checks whether tf-serving pods exist and waits for them to come back up. (The idea is to keep the redis-consumer from timing out while the tf-serving pods are restarting.) A rough sketch follows.
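
A rough sketch of option 1, assuming the models live under a models/ prefix in an S3 bucket and the tf-serving pods carry an app=tf-serving label (both are assumptions about naming, not confirmed by the charts):

    # Poll the bucket; when the listing changes, bounce the tf-serving pods so
    # the deployment recreates them and they load the new model.
    # Bucket path and pod label are assumptions.
    PREV=""
    while true; do
      CURRENT="$(aws s3 ls s3://mybucket/models/ --recursive)"
      if [ -n "$PREV" ] && [ "$CURRENT" != "$PREV" ]; then
        kubectl delete pods -l app=tf-serving
      fi
      PREV="$CURRENT"
      sleep 60
    done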

GKE cluster shutdown -- delete all resources

From poking around on the Google Cloud website, it looks like we're not deleting the service accounts and disks associated with the cluster. These, and all other provisioned resources and accounts, should be deleted during cluster shutdown.
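
Shutdown could sweep these up with something like the following (the filters and resource names are assumptions; the kiosk would need to tag or name its own resources consistently to do this safely):

    # Delete the service account the cluster used (address is a placeholder).
    gcloud iam service-accounts delete \
      my-kiosk-cluster@my-project.iam.gserviceaccount.com --quiet

    # List and delete any persistent disks left behind by the cluster.
    gcloud compute disks list --filter="name~my-kiosk-cluster"
    gcloud compute disks delete my-kiosk-cluster-pd-0 --zone us-west1-a --quiet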

tensorboard won't deploy

The tensorboard code that we merged into master this past week isn't fully functional. When deploying the current master branch on GKE, I observed that the helmfile deployment would ultimately fail because the tensorboard deployment held things up long enough for helmfile to time out. On closer inspection, it looks like the tensorboard container gets stuck pulling its image for up to 20 minutes. If there's no problem pulling this image (tensorflow/tensorflow:latest) in other settings, then I suspect this is some sort of cluster resource issue, perhaps insufficient disk space on some node.

TensorBoard requests are sent to the Frontend pod AND the TensorBoard pod

TensorBoard has an ingress_path set to "/tensorboard". If a user goes to hostname/tensorboard they will see several errors, including JSON errors. Instead, the user must go to hostname/tensorboard/ with the trailing "/" character.

Both requests to /tensorboard and /tensorboard/ are logged in the frontend pod, though the charts resolve with the trailing "/".

The ingress needs to be improved so that tensorboard requests are not handled by the frontend pod, but only by the tensorboard pod.

Load balancer IP/FQDN is not visible from kiosk output

Right now, instead of copying an address from the kiosk output, the user has to go into the AWS/GKE console and find the load balancer's URL in order to access the web portal. The load balancer URL should instead be visible in the "deployment complete" message.
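
The completion message could fetch the address itself from the ingress controller service (the same service the troubleshooting notes elsewhere in these issues point at); a sketch:

    # Print the load balancer's IP (GKE) or hostname (AWS) from the ingress controller service.
    kubectl get service ingress-nginx-ingress-controller --namespace kube-system \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'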

KOPS_CLUSTER_NAME unset

After creating an AWS cluster, the environment variable KOPS_CLUSTER_NAME is set to default.k8s.local instead of [cluster_name].k8s.local.

Resolution of this is necessary for implementing #22 on AWS.
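
A minimal sketch of the intended behaviour, assuming the kiosk keeps the user's chosen name in a CLUSTER_NAME variable (that variable name is an assumption):

    # Derive the kops cluster name from the user's chosen cluster name
    # instead of leaving the default in place. CLUSTER_NAME is an assumed variable.
    export KOPS_CLUSTER_NAME="${CLUSTER_NAME}.k8s.local"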

some users are using `sudo` to run the `kiosk`

Some users have Docker installed in such a way that they don't have unprivileged access to it, leading to a whole bunch of kiosk startup commands requiring sudo and perhaps(?) leading to issues with AWS deployment.

My thought is that, ideally, we would provide a brief description in the README about how users can modify their Docker installation to no longer need to use sudo and state that, should they prefer to keep using sudo, some services may not work.
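
That README note could simply point at Docker's standard post-install steps for running without sudo (these are the upstream Docker instructions, not anything kiosk-specific):

    # Add yourself to the docker group, then log out and back in
    # (or run newgrp) so the new group membership takes effect.
    sudo groupadd docker             # usually exists already
    sudo usermod -aG docker "$USER"
    newgrp docker
    docker run hello-world           # should now work without sudo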

Readme updates

  1. Make sure "All users" instructions work
  2. Include a note addressing "New Users", so that they know what instructions to use.
  3. Remove highlighting around "build-harness" in developers' instructions.
  4. Address potential issues with Docker and needing to use sudo for make commands that invoke docker.

Add instructions:

  • need to configure for either AWS or GKE

  • GKE: need to make a project ahead of time if using

  • GKE: confirm via web interface using a google account that has access to the project you chose

  • GKE: the account you choose also needs to have access to the bucket you choose

  • GKE: once confirmed, create cluster in kiosk menu

  • AWS: you need to have an AWS account and generate an access key pair by following the instructions at https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html

  • AWS: the account you're using should be a member of the admin group, unless you're confident you don't need that (!!!! this may not be sufficient?)

  • AWS: make sure you have a bucket with public access

  • cluster creation may take up to 10 minutes

  • cluster creation is done when you see "---COMPLETE---"

  • when using the Predict functionality, the first image will take a while (up to 10 minutes) because the cluster needs to requisition more computing resources. (In its resting state, the cluster is designed to use as few resources as possible.)

  • Currently, the most efficient way to find the public IP address of your cluster is to return to the kiosk's main menu, select Shell, and paste the following command into the terminal: kubectl describe service --namespace=kube-system ingress-nginx-ingress-controller and then find the IP address listed in the LoadBalancer Ingress field of the output.

  • cluster destruction is done when you see "---COMPLETE---"

Implement automated testing

As issue #59 underscores, we need some way of definitively saying that a branch is ready to merge into master. Since this project looks like it'll be with us for a while to come, this is probably a sound time investment.

@willgraf what are your thoughts on how we should go about implementing automated testing?

.geodesic folder is not always created automatically

I'm not sure when it's supposed to be created, but it doesn't appear to always (ever?) happen. We should add the creation of a ~/.geodesic folder into the kiosk installation process, if it isn't already there somewhere.
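
The installer could simply guarantee the folder exists; a one-line sketch:

    mkdir -p "$HOME/.geodesic"   # no-op if the folder is already there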

Helmfile deployment during cluster startup fails

During cluster startup, the helmfile deployment always makes it to file 0220.tf-serving-redis-interface.yaml and then fails on that file with Error: unable to move current charts to tmp dir: rename /conf/charts/tf-serving-redis-interface/charts /conf/charts/tf-serving-redis-interface/tmpcharts: invalid cross-device link.

Improve documentation

Let's assume that our audience is wet lab biologists, who might have very little knowledge regarding programming or cloud computing. We then need:

  • Detailed documentation, beyond what's in the README
  • Detailed videos of different portions of the setup process (e.g., a video of setting up Google Cloud)
  • Perhaps a Readthedocs

killed kiosk process. run new kiosk but helm errors out.

The kiosk process was killed but the cluster is still up. I ran make run again to start a new kiosk process, and dropped to the shell. helm list and helmfile cause the following error:

Error: Get http://localhost:8080/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: dial tcp 127.0.0.1:8080: connect: connection refused

kubectl causes:

The connection to the server localhost:8080 was refused - did you specify the right host or port?
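
The localhost:8080 refusals usually mean the new shell has no kubeconfig for the still-running cluster; a sketch of re-exporting credentials, depending on the cloud (the cluster name and zone below are placeholders):

    # AWS / kops: rebuild the kubeconfig for the existing cluster.
    kops export kubecfg --name "${KOPS_CLUSTER_NAME}" --state "${KOPS_STATE_STORE}"

    # GKE: fetch credentials for the existing cluster (placeholder name/zone).
    gcloud container clusters get-credentials my-kiosk-cluster --zone us-west1-a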

cluster startup commands should fail gracefully

You know that feeling when you try to start up a cluster, it fails for whatever reason, you fix the problem and try again, and this time it fails only because some of the startup commands had already run on the first attempt and refuse to remake a bucket that already exists? All of the resource-acquisition steps should succeed silently (or at most print a warning) when the resource already exists.
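
For the bucket case, a sketch of the check-before-create pattern (the bucket name is a placeholder):

    # Only create the bucket if it doesn't already exist.
    if aws s3api head-bucket --bucket mycluster-kops-state 2>/dev/null; then
      echo "bucket already exists, skipping creation"
    else
      aws s3 mb s3://mycluster-kops-state --region us-west-2
    fi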

Adjust cluster limits for GKE (and AWS?)

On GKE, now that we've implemented training, we're often running out of resources in our node pools. The sizes of these node pools should be adjusted so that this isn't an issue. It would be best if we could warn users about approaching node pool limits or handle an exhaustion of a node pool gracefully.

There's probably a similar situation on AWS.
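
If we keep fixed-size pools, they can be resized; otherwise GKE's node-pool autoscaler would avoid hitting a hard limit. A sketch (cluster, pool, and zone names are placeholders):

    # Grow a fixed-size node pool...
    gcloud container clusters resize my-kiosk-cluster \
      --node-pool training-gpu-pool --num-nodes 4 --zone us-west1-a

    # ...or let the pool autoscale between bounds instead.
    gcloud container clusters update my-kiosk-cluster \
      --enable-autoscaling --min-nodes 0 --max-nodes 4 \
      --node-pool training-gpu-pool --zone us-west1-a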
