
kiosk-console's People

Contributors

ajt0ng, dylanbannon, elaubsch, enricozb, mekwarrior, msschwartz21, osterman, willgraf

kiosk-console's Issues

Bug Fixes

  • Update README to indicate developers should set up AWS credentials
  • aws configure writes to /root
  • .aws only working from /conf
  • running kiosk twice should error
  • kubernetes learning materials

"stop" or "pause" cluster

It'd be good if we could just pause all the instances, suspending instance costs, without tearing down the entire cluster.

Cluster destruction partially fails

When I use the Destroy command from the main menu to destroy the cluster, it never finishes: it fails to delete the VPC and its associated DHCP options set.

When I look around afterwards, I see that 1) the keypair located in /localhost/.geodesic is still present, and 2) the S3 bucket that stored the cluster configuration wasn't deleted.

Perhaps this is an issue with the newer version of Geodesic?

AWS too many VPCs error

This is what you see when you have too many VPCs on your AWS account:
[screenshot: AWS VPC limit error]

We need to

  1. document this in the troubleshooting document, and, potentially,
  2. start deleting VPCs upon AWS cluster shutdown, since each AWS cluster gets its own VPC.

This is the second time I've seen this error come up.

learn how to use Redis

What is Redis and how does it work?

(Or: how can we best use a combination of queues and hashmaps to 1) ensure that no job gets accidentally processed twice, and 2) let the frontend monitor the status of jobs?)
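
As a rough sketch of the queue-plus-hash idea (the key and field names below are made up for illustration, not the kiosk's actual schema): a producer pushes job IDs onto a list, job metadata lives in a hash, a consumer moves each job to a processing list so nothing is picked up twice, and the frontend polls the hash for status.

    # Key names (predict-queue, job-42) are hypothetical.
    # Producer: enqueue a job and record its status in a hash.
    redis-cli LPUSH predict-queue job-42
    redis-cli HSET job-42 status new
    redis-cli HSET job-42 file image.tif

    # Consumer: atomically move the job to a processing list so no other
    # consumer can pick it up, then update its status as work progresses.
    redis-cli RPOPLPUSH predict-queue processing-queue
    redis-cli HSET job-42 status processing
    redis-cli HSET job-42 status done

    # Frontend: poll the hash to report progress to the user.
    redis-cli HGETALL job-42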

GPU scaling fails under certain circumstances

Adapted from a comment in #58.

I tested GPU scaling a lot and here's what I found:
On master right now (as of the merging of appropriate-cluster-sizing):
GKE-K80-predict -> works
AWS-K80-predict -> works
GKE-K80-train -> works
AWS-K80-train -> doesn't work
GKE-P100-predict -> doesn't work
GKE-P100-train -> works

This shows that everything that works in master works in this branch, and that there are two different cases (of those tested here) that fail. I'm going to open one issue about those cases, but I suspect they have different causes, since they involve different clouds, different GPUs, and different cluster functionality.

print cluster IP to kiosk

When a user creates a cluster, the cluster's public IP should be printed to the screen once startup is complete.

URL in GKE login process

Do we need to generate new URLs for users' GKE logins? It seems like there's one hard-coded URL being used repeatedly, and it's unclear whether that's actually appropriate for distribution of the kiosk.

Deleting instances in AWS

If AWS cluster teardown fails, users might find themselves trying to delete EC2 instances manually. The instances can keep respawning, because they belong to an Auto Scaling group. We should include documentation on how to deal with this in the troubleshooting document. More generally, it might be a good idea to document the entire process of deleting a cluster manually.
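
For the troubleshooting doc, a sketch of the manual cleanup (resource names below are placeholders, not the kiosk's actual naming). The Auto Scaling group has to be scaled to zero or deleted first, otherwise terminated instances are simply replaced:

    # Find the Auto Scaling groups left behind by the cluster (names are placeholders).
    aws autoscaling describe-auto-scaling-groups \
      --query 'AutoScalingGroups[].AutoScalingGroupName'

    # Either scale the group to zero...
    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name nodes.mycluster.k8s.local \
      --min-size 0 --max-size 0 --desired-capacity 0

    # ...or delete it outright, which also terminates its instances.
    aws autoscaling delete-auto-scaling-group \
      --auto-scaling-group-name nodes.mycluster.k8s.local --force-delete

    # Any stray instances can then be terminated without respawning.
    aws ec2 terminate-instances --instance-ids i-0123456789abcdef0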

reading in configuration variables on kiosk startup

When the kiosk starts up, it doesn't automatically read in any configuration variables (from env.aws or env.gke), even though it will read env and indicate that one of the two clouds is "active". The appropriate variables should be read in on kiosk startup.
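
A minimal sketch of what startup could do, assuming env records the active cloud in a variable (CLOUD_PROVIDER is an assumed name, not necessarily the kiosk's):

    # Read the shared settings, then the cloud-specific ones.
    set -a                        # export everything we source
    source ./env
    case "${CLOUD_PROVIDER}" in   # assumed variable name
      aws) source ./env.aws ;;
      gke) source ./env.gke ;;
    esac
    set +a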

Troubleshooting doc

There are so many bugs a user can hit that I think we need to have a separate troubleshooting document with proposed fixes for specific errors.

Deprecate Terraform

what

  • Use the aws s3 CLI to provision buckets (see the sketch after this list)

why

  • Reduce number of buckets required (just one for kops)
  • Reduce moving pieces/complexity
  • Terraform is a heavy-handed tool if only a single bucket is needed
  • Make it easier to add support for GCE
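
A sketch of what bucket provisioning could look like without Terraform (the bucket names and regions below are placeholders; the GCE half assumes gsutil is available):

    # AWS: one bucket for kops state, with versioning enabled.
    # Bucket name and region are placeholders.
    aws s3 mb s3://mycluster-kops-state --region us-west-2
    aws s3api put-bucket-versioning --bucket mycluster-kops-state \
      --versioning-configuration Status=Enabled

    # GCE: the equivalent with gsutil.
    gsutil mb -l us-west1 gs://mycluster-kops-state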

speeding up tf-serving pod creation

It would be nice to decrease the initial wait time for predictions.

One possible strategy could be to create a volume as part of the cluster and load the tf-serving docker image onto it during cluster creation. Then, we could just mount the volume onto the GPU instance and save ourselves a minute or two in downloading the tf-serving docker image. Maybe?

Other ideas?

(This issue is not urgent. For now, it's more of a brainstorm. It would be nice to implement solutions for cutting down the initial wait time eventually, though.)
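
One way to prototype the idea: pre-pull the image on each GPU node as it boots, so the pod only has to start the container. A sketch, with a stand-in image name (the kiosk's actual tf-serving image may differ):

    # Hypothetical node-bootstrap snippet: cache the tf-serving image at boot
    # so the first prediction doesn't pay the image-pull cost.
    docker pull tensorflow/serving:latest-gpu   # placeholder image name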

credentials don't always save between kiosk sessions

Configuration options should be written to a file on the user's hard drive and then loaded as defaults by the kiosk on the next startup. This doesn't always happen. Could this be related to invoking the kiosk with sudo kiosk?

Unable to Schedule Pods on GPU Nodes on AWS

remediation

so here was my triage process

  1. spin up the cluster from scratch
  2. deploy the addons/nvidia-test.yaml <--- known working example
  3. observed that the pod was not getting scheduled
  4. looked at the autoscaler logs <--- observed that the node was scheduled and came up
  5. kubectl get daemonsets --all-namespaces <-- saw the daemonset deployed that installs the nvidia drivers (nvidia-device-plugin)
  6. looked at the logs for the nvidia-device-plugin
    observed the following
2018/09/28 16:48:44 Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
2018/09/28 16:48:44 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2018/09/28 16:48:44 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/09/28 16:48:44 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2018/09/28 16:48:44 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/09/28 16:48:44 Could not register device plugin: rpc error: code = Unimplemented d

so at this point, I assumed (incorrectly) that the node came up and the installation of the nvidia drivers failed

  7. ssh-add /localhost/.geodesic/id_rsa <--- add the ssh key
  8. ssh <user>@<node-ip> <--- ssh into the GPU node
  9. sudo bash
  10. looked at the logs journalctl -u nvidia-docker-install.service <-- observed no problems
  11. checked that docker has GPU access docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi <---- it did
  12. now I know the problem is on the kubernetes side
  13. realized we upgraded to kubernetes 1.10 from 1.9 by upgrading geodesic
  14. looked at plugins/nvidia-device-plugin.yaml <--- saw - image: nvidia/k8s-device-plugin:1.9 <--- assumed that was associated with the minor version of k8s
  15. upgraded to k8s-device-plugin:1.10, deleted the daemonset, kubectl apply -f plugins/nvidia-device-plugin.yaml
  16. success
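
For the record, the fix in steps 14-16 boils down to the following (the daemonset name below is the upstream default and may differ in this cluster):

    # After bumping the image tag in plugins/nvidia-device-plugin.yaml to
    # nvidia/k8s-device-plugin:1.10 (matching the cluster's kubernetes minor version):
    kubectl delete daemonset nvidia-device-plugin-daemonset --namespace kube-system
    kubectl apply -f plugins/nvidia-device-plugin.yaml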

Scaledown for training-job

The training-job may not be able to scale back up after scaling down. If someone has free time, they should investigate whether there's any issue with this.

Tensorflow-serving won't automatically detect newly-added models

We need to find a way to get TensorFlow Serving to serve models that have been uploaded to the storage bucket since tf-serving's creation.

Possibilities:

  1. Have the autoscaler watch the S3 bucket and delete all tf-serving pods whenever it detects a new model. (The pods should be restarted immediately.) This would also mean adjusting the redis-consumer's fault tolerance so that it checks whether tf-serving pods exist and waits for them to come back up. (The idea is to keep the redis-consumer from timing out while the tf-serving pods are restarting.) A rough sketch follows.
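
A rough sketch of option 1, assuming the models live under a models/ prefix in an S3 bucket and the tf-serving pods carry an app=tf-serving label (both are assumptions about naming, not confirmed by the charts):

    # Poll the bucket; when the listing changes, bounce the tf-serving pods so
    # the deployment recreates them and they load the new model.
    # Bucket path and pod label are assumptions.
    PREV=""
    while true; do
      CURRENT="$(aws s3 ls s3://mybucket/models/ --recursive)"
      if [ -n "$PREV" ] && [ "$CURRENT" != "$PREV" ]; then
        kubectl delete pods -l app=tf-serving
      fi
      PREV="$CURRENT"
      sleep 60
    done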

GKE cluster shutdown -- delete all resources

From poking around on the Google Cloud website, it looks like we're not deleting the service accounts and disks associated with the cluster. These, and all other provisioned resources and accounts, should be deleted during cluster shutdown.
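
Shutdown could sweep these up with something like the following (the filters and resource names are assumptions; the kiosk would need to tag or name its own resources consistently to do this safely):

    # Delete the service account the cluster used (address is a placeholder).
    gcloud iam service-accounts delete \
      my-kiosk-cluster@my-project.iam.gserviceaccount.com --quiet

    # List and delete any persistent disks left behind by the cluster.
    gcloud compute disks list --filter="name~my-kiosk-cluster"
    gcloud compute disks delete my-kiosk-cluster-pd-0 --zone us-west1-a --quiet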

tensorboard won't deploy

The tensorboard code that we merged into master this past week isn't fully functional. When deploying the current master branch on GKE, I observed that the helmfile deployment would ultimately fail because the tensorboard deployment held things up long enough for helmfile to time out. On closer inspection, it looks like the tensorboard container gets stuck pulling its image for up to 20 minutes. If there's no problem pulling this image (tensorflow/tensorflow:latest) in other settings, then I suspect this is some sort of cluster resource issue, perhaps insufficient disk space on some node.

TensorBoard requests are sent to the Frontend pod AND the TensorBoard pod

TensorBoard has an ingress_path set to "/tensorboard". If a user goes to hostname/tensorboard they will see several errors, including JSON errors. Instead, the user must go to hostname/tensorboard/ with the trailing "/" character.

Both requests to /tensorboard and /tensorboard/ are logged in the frontend pod, though the charts resolve with the trailing "/".

The ingress needs to be improved so that tensorboard requests are not handled by the frontend pod, but only by the tensorboard pod.

Load balancer IP/FQDN is not visible from kiosk output

Right now, instead of copying an address from the kiosk output, the user has to go into the AWS/GKE console and find the load balancer's URL in order to access the web portal. The load balancer URL should instead be visible in the "deployment complete" message.
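
The completion message could fetch the address itself from the ingress controller service (the same service the troubleshooting notes elsewhere in these issues point at); a sketch:

    # Print the load balancer's IP (GKE) or hostname (AWS) from the ingress controller service.
    kubectl get service ingress-nginx-ingress-controller --namespace kube-system \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'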

KOPS_CLUSTER_NAME unset

After creating an AWS cluster, the environment variable KOPS_CLUSTER_NAME is set to default.k8s.local instead of [cluster_name].k8s.local.

Resolution of this is necessary for implementing #22 on AWS.
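
A minimal sketch of the intended behaviour, assuming the kiosk keeps the user's chosen name in a CLUSTER_NAME variable (that variable name is an assumption):

    # Derive the kops cluster name from the user's chosen cluster name
    # instead of leaving the default in place. CLUSTER_NAME is an assumed variable.
    export KOPS_CLUSTER_NAME="${CLUSTER_NAME}.k8s.local"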

some users are using `sudo` to run the `kiosk`

Some users have Docker installed in such a way that they don't have unprivileged access to it, leading to a whole bunch of kiosk startup commands requiring sudo and perhaps(?) leading to issues with AWS deployment.

My thought is that, ideally, we would provide a brief description in the README about how users can modify their Docker installation to no longer need to use sudo and state that, should they prefer to keep using sudo, some services may not work.
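
That README note could simply point at Docker's standard post-install steps for running without sudo (these are the upstream Docker instructions, not anything kiosk-specific):

    # Add yourself to the docker group, then log out and back in
    # (or run newgrp) so the new group membership takes effect.
    sudo groupadd docker             # usually exists already
    sudo usermod -aG docker "$USER"
    newgrp docker
    docker run hello-world           # should now work without sudo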

Readme updates

  1. Make sure "All users" instructions work
  2. Include a note addressing "New Users", so that they know what instructions to use.
  3. Remove highlighting around "build-harness" in developers' instructions.
  4. Address potential issues with Docker and needing to use sudo for make commands that invoke docker.

Add instructions:

  • need to configure for either AWS or GKE

  • GKE: need to make a project ahead of time if using

  • GKE: confirm via web interface using a google account that has access to the project you chose

  • GKE: the account you choose also needs to have access to the bucket you choose

  • GKE: once confirmed, create cluster in kiosk menu

  • AWS: you need to have an AWS account and generate an access key pair by following the instructions at https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html

  • AWS: the account you're using should be a member of the admin group, unless you're confident you don't need that (!!!! this may not be sufficient?)

  • AWS: make sure you have a bucket with public access

  • cluster creation may take up to 10 minutes

  • cluster creation is done when you see "---COMPLETE---"

  • when using the Predict functionality, the first image will take a while (up to 10 minutes) because the cluster needs to requisition more computing resources. (In its resting state, the cluster is designed to use as few resources as possible.)

  • Currently, the most efficient way to find the public IP address of your cluster is to return to the kiosk's main menu, select Shell, and paste the following command into the terminal: kubectl describe service --namespace=kube-system ingress-nginx-ingress-controller and then find the IP address listed in the LoadBalancer Ingress field of the output.

  • cluster destruction is done when you see "---COMPLETE---"

Implement automated testing

As issue #59 underscores, we need some way of definitively saying that a branch is ready to merge into master. Since this project looks like it'll be with us for a while to come, this is probably a sound time investment.

@willgraf what are your thoughts on how we should go about implementing automated testing?

.geodesic folder is not always created automatically

I'm not sure when it's supposed to be created, but it doesn't appear to always (ever?) happen. We should add the creation of a ~/.geodesic folder into the kiosk installation process, if it isn't already there somewhere.
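
The installer could simply guarantee the folder exists; a one-line sketch:

    mkdir -p "$HOME/.geodesic"   # no-op if the folder is already there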

Helmfile deployment during cluster startup fails

During cluster startup, the helmfile deployment always makes it to file 0220.tf-serving-redis-interface.yaml and then fails on that file with Error: unable to move current charts to tmp dir: rename /conf/charts/tf-serving-redis-interface/charts /conf/charts/tf-serving-redis-interface/tmpcharts: invalid cross-device link.

Improve documentation

Let's assume that our audience is wet lab biologists, who might have very little knowledge regarding programming or cloud computing. We then need:

  • Detailed documentation, beyond what's in the README
  • Detailed videos of different portions of the setup process (e.g., a video of setting up Google Cloud)
  • Perhaps a Readthedocs

killed kiosk process. run new kiosk but helm errors out.

The kiosk process was killed but the cluster is still up. I ran make run again to start a new kiosk process, and dropped to the shell. helm list and helmfile cause the following error:

Error: Get http://localhost:8080/api/v1/namespaces/kube-system/pods?labelSelector=app%3Dhelm%2Cname%3Dtiller: dial tcp 127.0.0.1:8080: connect: connection refused

kubectl causes:

The connection to the server localhost:8080 was refused - did you specify the right host or port?
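
The localhost:8080 refusals usually mean the new shell has no kubeconfig for the still-running cluster; a sketch of re-exporting credentials, depending on the cloud (the cluster name and zone below are placeholders):

    # AWS / kops: rebuild the kubeconfig for the existing cluster.
    kops export kubecfg --name "${KOPS_CLUSTER_NAME}" --state "${KOPS_STATE_STORE}"

    # GKE: fetch credentials for the existing cluster (placeholder name/zone).
    gcloud container clusters get-credentials my-kiosk-cluster --zone us-west1-a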

cluster startup commands should fail gracefully

You know that feeling when you try to start up a cluster, it fails for whatever reason, you fix the problem and try again, and this time it fails only because some of the startup commands had already run on the first attempt and refuse to remake a bucket that already exists? All of the resource-acquisition steps should succeed silently (or at most print a warning) when the resource already exists.
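
For the bucket case, a sketch of the check-before-create pattern (the bucket name is a placeholder):

    # Only create the bucket if it doesn't already exist.
    if aws s3api head-bucket --bucket mycluster-kops-state 2>/dev/null; then
      echo "bucket already exists, skipping creation"
    else
      aws s3 mb s3://mycluster-kops-state --region us-west-2
    fi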

Adjust cluster limits for GKE (and AWS?)

On GKE, now that we've implemented training, we're often running out of resources in our node pools. The sizes of these node pools should be adjusted so that this isn't an issue. It would be best if we could warn users about approaching node pool limits or handle an exhaustion of a node pool gracefully.

There's probably a similar situation on AWS.
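
If we keep fixed-size pools, they can be resized; otherwise GKE's node-pool autoscaler would avoid hitting a hard limit. A sketch (cluster, pool, and zone names are placeholders):

    # Grow a fixed-size node pool...
    gcloud container clusters resize my-kiosk-cluster \
      --node-pool training-gpu-pool --num-nodes 4 --zone us-west1-a

    # ...or let the pool autoscale between bounds instead.
    gcloud container clusters update my-kiosk-cluster \
      --enable-autoscaling --min-nodes 0 --max-nodes 4 \
      --node-pool training-gpu-pool --zone us-west1-a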
