vanvalenlab / kiosk-console

DeepCell Kiosk Distribution for Kubernetes on GKE and AWS
Home Page: https://deepcell-kiosk.readthedocs.io
License: Other
`aws configure` writes to `/root/.aws`, only working from `/conf`.
It'd be good if we could just pause all the instances, thereby putting instance costs into abeyance, without tearing down the entire cluster.
When I use the Destroy command from the main menu to destroy the cluster, it never finishes execution. It never succeeds in deleting the VPC and its associated DHCP resource.
When I look around afterwards, I see that 1) the keypair located in `/localhost/.geodesic` is still present, and 2) the S3 bucket that stored the cluster configuration wasn't deleted.
Perhaps this is an issue with the newer version of Geodesic?
What is Redis and how does it work?
(Or: how can we best leverage a combination of queues and hashmaps to 1) ensure that no job gets accidentally processed twice, and 2) let the frontend monitor the status of jobs?)
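A minimal sketch of one common pattern, using `redis-cli`; the key names here (`predict-queue`, `processing-queue`, `job:123`) are hypothetical, not the kiosk's actual keys:

```bash
# Enqueue a job and record its status in a hash:
redis-cli LPUSH predict-queue job:123
redis-cli HSET job:123 status queued

# A worker atomically moves the job onto a processing list, so no two
# workers can pop the same job:
redis-cli RPOPLPUSH predict-queue processing-queue
redis-cli HSET job:123 status processing

# The frontend polls the hash to monitor progress:
redis-cli HGET job:123 status

# On success, mark the job done and drop the processing entry:
redis-cli HSET job:123 status done
redis-cli LREM processing-queue 1 job:123
```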
Adapted from a comment in #58.
I tested GPU scaling a lot and here's what I found:
On `master` right now (as of the merging of `appropriate-cluster-sizing`):
GKE-K80-predict -> works
AWS-K80-predict -> works
GKE-K80-train -> works
AWS-K80-train -> doesn't work
GKE-P100-predict -> doesn't work
GKE-P100-train -> works
This shows that everything that works in master works in this branch, and that there are two different cases (of those tested here) that fail. I'm going to open one issue about those cases, but I suspect they have different causes, since they involve different clouds, different GPUs, and different cluster functionality.
When a user creates a cluster, the cluster's public IP should be printed to the screen once startup is complete.
Do we need to generate new URLs for users for their GKE logins? It seems like there's one hard-coded URL being used repeatedly, and it's unclear that this solution is actually appropriate for distribution of the kiosk.
The easiest implementation here is just having the user exit the config process entirely whenever they choose Cancel from within the configuration sub-menu.
If AWS cluster teardown fails, users might find themselves trying to delete EC2 instances manually. Sometimes the instances just keep respawning, since they're part of an autoscaling group. We should include documentation on how to deal with this in the troubleshooting document (a sketch follows). More generally, it might be a good idea to document the entire process of deleting a cluster manually.
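For the troubleshooting doc, a hedged sketch of breaking the respawn loop: instances come back because the autoscaling group replaces them, so the group itself has to go. The group name below is hypothetical:

```bash
# List the autoscaling groups to find the one backing the cluster:
aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[].AutoScalingGroupName'

# Delete the group; --force-delete also terminates its instances:
aws autoscaling delete-auto-scaling-group \
    --auto-scaling-group-name nodes.mycluster.k8s.local --force-delete
```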
This is deployed on GKE. The autoscaler is correctly incrementing the job parallelism for each key; however, the job is not being deployed as a pod.
When the kiosk starts up, it doesn't automatically read in any configuration variables (from `env.aws` or `env.gke`), even though it will read `env` and indicate that one of the two clouds is "active". The appropriate variables should be read in on kiosk startup.
There are so many bugs a user can hit that I think we need to have a separate troubleshooting document with proposed fixes for specific errors.
Use the `aws s3` CLI to provision buckets.
I've observed this on GKE when starting up a cluster, destroying it, and restarting the cluster, all without restarting the kiosk.
It would be nice to decrease the initial wait time for predictions.
One possible strategy could be to create a volume as part of the cluster and load the tf-serving docker image onto it during cluster creation. Then, we could just mount the volume onto the GPU instance and save ourselves a minute or two in downloading the tf-serving docker image. Maybe?
Other ideas?
(This issue is not urgent. For now, it's more of a brainstorm. It would be nice to implement solutions for cutting down the initial wait time eventually, though.)
Configuration options should be written to a file stored on the user's hard drive and then loaded as defaults by the kiosk upon the next startup. This doesn't always happen. Could this be related to invoking the kiosk with `sudo kiosk`? (A sketch of the intended behavior follows.)
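A minimal sketch of the intended behavior, assuming a shell-based menu; the file path and variable names are hypothetical. Note that under `sudo kiosk`, `$HOME` is `/root`, which would explain defaults landing somewhere the unprivileged user never sees:

```bash
# Hypothetical location for persisted menu defaults:
CONF="${HOME}/.kiosk_defaults"

# On startup, load the previous answers as defaults, if present:
[ -f "$CONF" ] && . "$CONF"

# After the user confirms the menu, persist the answers:
printf 'CLOUD_PROVIDER=%s\nBUCKET_NAME=%s\n' \
    "$CLOUD_PROVIDER" "$BUCKET_NAME" > "$CONF"
```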
This should be implemented in the kiosk menu, probably just by tacking on extra fields to the existing GKE and AWS configuration sequences.
So here was my triage process:
- `addons/nvidia-test.yaml` <--- known working example
- `kubectl get daemonsets --all-namespaces` <--- saw a daemonset deployed that installs the nvidia drivers (`nvidia-device-plugin`)
- the `nvidia-device-plugin` logs showed:
2018/09/28 16:48:44 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/09/28 16:48:44 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/09/28 16:48:44 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/09/28 16:48:44 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/09/28 16:48:44 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/09/28 16:48:44 Could not register device plugin: rpc error: code = Unimplemented d
so at this point, I assumed (incorrectly) that the node came up and the installation of the nvidia drivers failed
- `ssh-add /localhost/.geodesic/id_rsa` <--- add the ssh key
- `ssh [email protected]`
- `sudo bash`
- `journalctl -u nvidia-docker-install.service` <--- observed no problems
- `docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi` <--- it did work
- `plugins/nvidia-device-plugin.yaml` <--- saw `- image: nvidia/k8s-device-plugin:1.9` and assumed that version was associated with the minor version of k8s
- changed it to `k8s-device-plugin:1.10`, deleted the daemonset, and ran `kubectl apply -f plugins/nvidia-device-plugin.yaml`
It's annoying to do `kubens deepcell` every time I drop to the shell.
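One way the kiosk's shell entrypoint could set this automatically, using plain `kubectl` (equivalent to `kubens deepcell`; the `--current` flag requires a reasonably recent kubectl):

```bash
# Make deepcell the default namespace for the current kube context:
kubectl config set-context --current --namespace=deepcell
```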
The training-job may not be able to scale back up after scaling down. If someone has free time, they should investigate whether there's any issue with this.
We need to find a way to get TensorFlow Serving to serve models that have been uploaded to the storage bucket since tf-serving's creation.
Possibilities:
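One possibility, sketched under assumptions (the paths are hypothetical, and the polling flag may require a newer tf-serving release than the one currently deployed): `tensorflow_model_server` can re-poll its model config file periodically, picking up models added after startup.

```bash
# Re-read the model config file every 60 seconds, so newly uploaded
# models become servable without restarting the container:
tensorflow_model_server \
    --model_config_file=/models/models.config \
    --model_config_file_poll_wait_seconds=60
```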
It looks like, from poking around on the Google Cloud website, we're not deleting service accounts and disks associated with the cluster. These, and all other provisioned resources and accounts, should be deleted during cluster shutdown.
Is this desirable behavior, @osterman?
The tensorboard code that we merged into `master` this past week isn't fully functional. I observed, when deploying the current `master` branch on GKE, that helmfile deployment would ultimately fail because the `tensorboard` deployment held things up long enough for helmfile to time out. Upon closer inspection, it looks like the tensorboard container gets stuck pulling the image for up to 20 minutes... If there's no problem pulling this image (`tensorflow/tensorflow:latest`) in other settings, then I suspect this is some sort of cluster resource issue, perhaps insufficient disk space on some node.
TensorBoard has an `ingress_path` set to `"/tensorboard"`. If a user goes to `hostname/tensorboard`, they will see several errors, including JSON errors. Instead, the user must go to `hostname/tensorboard/`, with the trailing "/" character.
Both requests to `/tensorboard` and `/tensorboard/` are logged in the frontend pod, though the charts resolve with the trailing "/".
The ingress needs to be improved so that tensorboard requests are not handled by the frontend pod, but only by the tensorboard pod.
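One hedged stopgap, assuming the nginx ingress controller and a hypothetical ingress named `tensorboard`: redirect the bare path to the trailing-slash form at the server level, so users never hit the frontend with `/tensorboard`.

```bash
# Add an nginx server-level redirect from /tensorboard to /tensorboard/:
kubectl annotate ingress tensorboard \
    'nginx.ingress.kubernetes.io/server-snippet=location = /tensorboard { return 301 /tensorboard/; }'
```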
Currently, instead of copy/pasting from the kiosk output, the user must go into AWS/GKE and find the URL of the load balancer to access the web portal. The load balancer URL should instead be visible in the "deployment complete" message.
After creating an AWS cluster, the environment variable `KOPS_CLUSTER_NAME` is set to `default.k8s.local` instead of `[cluster_name].k8s.local`.
Resolution of this is necessary for implementing #22 on AWS.
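The expected behavior, sketched as a shell one-liner; `CLUSTER_NAME` here stands for whatever name the user entered in the kiosk menu (an assumption, not necessarily the kiosk's actual variable name):

```bash
# Derive the kops cluster name from the user's configured cluster name:
export KOPS_CLUSTER_NAME="${CLUSTER_NAME}.k8s.local"
```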
We need to have the ability to, for instance, segment an image and then apply tracking, all in one pipeline.
Some users have docker installed in such a way that they don't have unprivileged access to it, leading to a whole bunch of kiosk startup commands requiring `sudo`, and perhaps(?) leading to issues with AWS deployment.
My thought is that, ideally, we would provide a brief description in the README about how users can modify their Docker installation to no longer need `sudo` (the standard fix is sketched below), and state that, should they prefer to keep using `sudo`, some services may not work.
Right now, such users need `sudo` for `make` commands that invoke `docker`.
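The standard fix, from Docker's own post-install documentation (adding the user to the `docker` group):

```bash
# Add the current user to the docker group; takes effect after logging
# out and back in, or immediately in the current shell via newgrp:
sudo usermod -aG docker "$USER"
newgrp docker
```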
Add instructions:
- You need to configure for either AWS or GKE.
- GKE: you need to make a project ahead of time.
- GKE: confirm via the web interface, using a Google account that has access to the project you chose.
- GKE: the account you choose also needs to have access to the bucket you choose.
- GKE: once confirmed, create the cluster in the kiosk menu.
- AWS: you need to have an AWS account and generate an access key pair by following the instructions at https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html
- AWS: the account you're using should be a member of the `admin` group, unless you're confident you don't need that (!!!! this may not be sufficient?)
- AWS: make sure you have a bucket with public access.
- Cluster creation may take up to 10 minutes.
- Cluster creation is done when you see "---COMPLETE---".
- When using the Predict functionality, the first image will take a while (up to 10 minutes) because the cluster needs to requisition more computing resources. (In its resting state, the cluster is designed to use as few resources as possible.)
- Currently, the most efficient way to find the public IP address of your cluster is to return to the kiosk's main menu, select Shell, paste `kubectl describe service --namespace=kube-system ingress-nginx-ingress-controller` into the terminal, and then find the IP address listed in the `LoadBalancer Ingress` field of the output (a shorter one-liner is sketched after this list).
- Cluster destruction is done when you see "---COMPLETE---".
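A more direct one-liner for the "find the public IP" step above, assuming the same `ingress-nginx-ingress-controller` service name (on AWS the load balancer is exposed as a hostname, so the field may be `.hostname` rather than `.ip`):

```bash
# Print just the load balancer address of the ingress controller:
kubectl get service ingress-nginx-ingress-controller \
    --namespace=kube-system \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```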
We need to implement reasonable defaults for pod destruction.
As issue #59 underscores, we need some way of definitively saying that a branch is ready to merge into `master`. Since this project looks like it'll be with us for a while to come, this is probably a sound time investment.
@willgraf, what are your thoughts on how we should go about implementing automated testing?
I'm not sure when it's supposed to be created, but it doesn't appear to always (ever?) happen. We should add the creation of a `~/.geodesic` folder into the kiosk installation process, if it isn't already there somewhere.
Change the ASCII characters comprising the deepcell logo.
Also, change the intro text to:
"Welcome to the Deepcell Kiosk!
This Kiosk was developed by the Van Valen Lab at the California Institute of Technology."
When a user that did NOT install the kiosk runs the kiosk, they may not have a `.geodesic` folder in `/localhost`, causing the sshkey generation to fail unceremoniously (a one-line guard is sketched below).
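A minimal guard, assuming the kiosk generates the keypair with a standard `ssh-keygen` call (the exact invocation is an assumption):

```bash
# mkdir -p is a no-op when the folder already exists, so this is safe
# to run unconditionally before generating the key:
mkdir -p /localhost/.geodesic
ssh-keygen -t rsa -f /localhost/.geodesic/id_rsa -N ''
```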
During cluster startup, the helmfile deployment always makes it to file `0220.tf-serving-redis-interface.yaml` and then fails on that file with `Error: unable to move current charts to tmp dir: rename /conf/charts/tf-serving-redis-interface/charts /conf/charts/tf-serving-redis-interface/tmpcharts: invalid cross-device link`.
Let's assume that our audience is wet lab biologists, who might have very little knowledge regarding programming or cloud computing. We then need:
The kiosk process was killed but the cluster is still up. I ran `make run` again to start a new kiosk process, and dropped to the shell. `helm list` and `helmfile` cause the following error:
Error: Get http://localhost:8080/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: dial tcp 127.0.0.1:8080: connect: connection refused
`kubectl` causes:
The connection to the server localhost:8080 was refused - did you specify the right host or port?
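The usual cause of these localhost:8080 errors is that the new kiosk process has no kubeconfig, so every Kubernetes tool falls back to the default local address. A sketch of restoring it, assuming kops on AWS (the GKE analogue would be `gcloud container clusters get-credentials <cluster-name>`):

```bash
# Rebuild the kubeconfig for the existing cluster from the kops state store:
kops export kubecfg --name "$KOPS_CLUSTER_NAME"
```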
Redis goes into an `Unknown` state and is unable to recover. The kubelet can no longer get the correct status of the process, and the new instances of Redis never start.
EDIT: This image should be used with the mousebrain model and CUTS set to 4.
You know that feeling when you try to start up a cluster, but it fails for whatever reason, and then you fix that problem and try to start it again, and this time it fails because it had already executed some of the startup commands on the first try, before failing, and doesn't want to remake a bucket or whatever? We should have all of the resource acquisition portions fail silently if the resource already exists. Or maybe just print a warning? (A sketch follows.)
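A hedged sketch of "create only if missing" for the S3 bucket case; `$BUCKET` is hypothetical, and the GKE analogue would use `gsutil`:

```bash
# head-bucket exits nonzero when the bucket doesn't exist (or isn't ours):
if ! aws s3api head-bucket --bucket "$BUCKET" 2>/dev/null; then
    aws s3 mb "s3://$BUCKET"
else
    echo "WARNING: bucket $BUCKET already exists; skipping creation" >&2
fi
```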
On GKE, now that we've implemented training, we're often running out of resources in our node pools. The sizes of these node pools should be adjusted so that this isn't an issue. It would be best if we could warn users about approaching node pool limits or handle an exhaustion of a node pool gracefully.
There's probably a similar situation on AWS.
`kiosk` should not allow users to enter bad service names without at least warning the user of accepted character sets.
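For reference, Kubernetes service names must be valid DNS-1035 labels: lowercase alphanumerics and `-`, starting with a letter, ending with an alphanumeric, at most 63 characters. A menu-side check could look like this sketch:

```bash
# Warn on names Kubernetes will reject for a Service:
if ! printf '%s' "$SERVICE_NAME" | grep -Eq '^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$'; then
    echo "Invalid service name: lowercase letters, digits, and '-' only," \
         "starting with a letter." >&2
fi
```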