
data8xhub's People

Contributors

ryanlovett, vipasu, yuvipanda


data8xhub's Issues

Provide better notification that the server is dead

Currently, when the server dies, users see their code getting 'stuck', and not much else.

This is very bad UX!

We should instead show a proper notification that the server can not be reached. This should also differentiate between being unable to connect and an explicit 404 / 403, which indicates the server has been culled. It should give the user an indication of what they should do!

Set up CI/CD for hub deployments

Currently we do not have any CI/CD set up.

I think we should, but only for hub deployments. I think gdm changes should happen manually for now.

Write a simple service / library that provides sharding info for user homedirs

We will have N NFS servers, each of which can serve a capacity of Cn users. We need a high-performance, non-racy, and simple way of doing the following:

  1. For a given user, fetch which host their homedir is on
  2. If the user is new, assign them to a specific host based on some metric (least loaded, probably)
  3. If we perform a migration, update the user info so it directs clients to the new location

It also needs to perform at least some sort of authentication, with reads and writes authorized separately.

All this also needs to be fairly race free, and we should have a plan for backing up the data + reconstructing it from scratch if we lose it.
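
As a starting point, here's a minimal sketch of what the lookup/assignment logic could look like, assuming assignments live in a single JSON state file guarded by a file lock. The file path, capacity bookkeeping, and function names are all hypothetical, and a real service would put an authenticated HTTP API in front of this.

  import fcntl
  import json

  # Hypothetical state file; assumed to already exist, seeded with
  # {"capacity": {"nfs-1": 1000, ...}, "users": {}}.
  STATE_FILE = "/var/lib/homedir-shards/assignments.json"


  def get_or_assign_host(username):
      """Return the NFS host for a user, assigning the least-loaded host if new."""
      with open(STATE_FILE, "r+") as f:
          # An exclusive lock keeps concurrent assignments from racing.
          fcntl.flock(f, fcntl.LOCK_EX)
          state = json.load(f)
          if username in state["users"]:
              return state["users"][username]
          # Least loaded = smallest fraction of capacity currently in use.
          counts = {h: 0 for h in state["capacity"]}
          for host in state["users"].values():
              if host in counts:
                  counts[host] += 1
          host = min(counts, key=lambda h: counts[h] / state["capacity"][h])
          state["users"][username] = host
          f.seek(0)
          f.truncate()
          json.dump(state, f)
          return host


  def migrate_user(username, new_host):
      """Repoint a user after their homedir has been copied to a new host."""
      with open(STATE_FILE, "r+") as f:
          fcntl.flock(f, fcntl.LOCK_EX)
          state = json.load(f)
          state["users"][username] = new_host
          f.seek(0)
          f.truncate()
          json.dump(state, f)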

Use a notebook serverextension for culling

We want a bit more intelligent culling, so let's make the single-user notebook pods do the culling themselves with a notebook server extension. This lets us kill servers when their kernels have been idle for a while, rather than killing based on network requests alone.
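
Something along these lines could work. This is only a sketch: it assumes the kernel models returned by the kernel manager carry `last_activity` and `execution_state` (as the /api/kernels REST API does), and the timeout, check interval, and hard exit are placeholder choices.

  import os
  from datetime import datetime, timezone

  from tornado.ioloop import PeriodicCallback

  IDLE_TIMEOUT = 3600  # seconds; placeholder value


  def _parse_ts(ts):
      # last_activity is an ISO8601 string like "2018-05-01T12:00:00.000000Z"
      ts = ts.rstrip("Z")
      fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in ts else "%Y-%m-%dT%H:%M:%S"
      return datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)


  def load_jupyter_server_extension(nbapp):
      km = nbapp.kernel_manager

      def cull_if_idle():
          kernels = km.list_kernels()
          if not kernels:
              return
          now = datetime.now(timezone.utc)
          for kernel in kernels:
              idle_for = (now - _parse_ts(kernel["last_activity"])).total_seconds()
              if kernel["execution_state"] == "busy" or idle_for < IDLE_TIMEOUT:
                  return  # something is still active, don't cull
          nbapp.log.info("All kernels idle for %ss, shutting down", IDLE_TIMEOUT)
          os._exit(0)

      # Check once a minute on the notebook server's own event loop.
      PeriodicCallback(cull_if_idle, 60 * 1000).start()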

Increase autosaving interval

edX students seem to be forgetting to save their work frequently. I suggest increasing the autosave interval to 5 minutes.

Set up prometheus in-cluster

Collect metrics about things inside the cluster, with Prometheus!

It should have a reasonably large SSD and a couple of months of retention.

Data8x.2 Checklist

  • Get HTTPS certificates for hub.data8x.berkeley.edu
  • Set up HTTPS certificates on inner edge
  • Copy user directories from old NFS servers to new
  • Switch external IP address over
  • Set up Grafana

Test how a GKE cluster behaves at 50,000 pods

Spin up a cluster with 50,000 pods, spread across, say, 10 namespaces.

Then figure out how the cluster reacts to normal operations when under this much load:

  1. Actively mount and unmount new NFS volumes all the time
  2. Lose a node and see what happens
  3. List and watch latencies
  4. Create a new pod - scheduling latency

If this proves to be too much, we can always split up our cluster into multiple smaller ones.
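
For item 4, something like the following could give a rough scheduling-latency number, assuming the official `kubernetes` Python client; the pod name, namespace, and image are placeholders.

  import time

  from kubernetes import client, config, watch

  config.load_kube_config()
  v1 = client.CoreV1Api()

  pod = {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "sched-latency-test"},
      "spec": {
          "restartPolicy": "Never",
          "containers": [{"name": "pause", "image": "k8s.gcr.io/pause:3.1"}],
      },
  }

  start = time.time()
  v1.create_namespaced_pod(namespace="default", body=pod)

  # Watch the pod until the scheduler has assigned it to a node.
  w = watch.Watch()
  for event in w.stream(v1.list_namespaced_pod, namespace="default",
                        field_selector="metadata.name=sched-latency-test"):
      if event["object"].spec.node_name:
          print("scheduled after %.2fs" % (time.time() - start))
          w.stop()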

Decide if we're going to use a ZFS RAID or not

If we don't use ZFS RAID with multiple disks, we can simply rely on Google's snapshots for backups. If we do use ZFS RAID, backups are on us.

If we can get away with non-RAID performance, we should be fine going without RAID.

Write grading script

We want a script that can take a notebook or a .py file and produce a score. This score will then be sent to edX.
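
A rough sketch of what that could look like, assuming grading works by executing the submitted notebook and counting test cells that print a marker. The `# TEST` / `TEST PASSED` markers and the scoring scheme are made-up conventions, and posting the score to edX would be a separate step.

  import sys

  import nbformat
  from nbconvert.preprocessors import ExecutePreprocessor


  def grade_notebook(path):
      """Execute a notebook and return the fraction of test cells that passed."""
      nb = nbformat.read(path, as_version=4)
      ExecutePreprocessor(timeout=120, allow_errors=True).preprocess(nb, {})
      passed = total = 0
      for cell in nb.cells:
          if cell.cell_type != "code" or "# TEST" not in cell.source:
              continue
          total += 1
          text = "".join(out.get("text", "") for out in cell.get("outputs", []))
          if "TEST PASSED" in text:
              passed += 1
      return passed / total if total else 0.0


  if __name__ == "__main__":
      print(grade_notebook(sys.argv[1]))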

Resource Slack for AutoScaler

We want our autoscaler to have some amount of 'resource slack': it should expand the cluster before it is 100% full, at more like 80% full.

Currently this isn't a supported feature, so we have to work around it in interesting ways.

Here's a proposal:

  1. Clusters have a 'main' nodepool and a 'headroom' pool that is smaller
  2. The hub uses node affinity to prefer nodes in the main pool, but can fail over to the headroom pool if necessary
  3. We run a cron job that tries to schedule a pod whose affinity requires it to be in the main pool (a rough pod-spec sketch follows below). This runs every minute or so, and completes successfully as soon as there is more than one notebook pod on the same node as it.

So the sequence of actions here would be:

  1. The main pool is full, user pods start populating the headroom pool, no outages.
  2. The cron job fires, creating a pod that must be on the main pool. This goes Pending, triggering the autoscaler
  3. The autoscaler creates a new node in the main pool. User pods start landing there instead!
  4. Over time, the headroom pool is drained automatically, since no new pods land there while there is still capacity in the main pool
  5. If the headroom pool is also full, the cluster autoscaler notices this and spawns a new node, which might take several minutes. But this is a much rarer case!

So the amount of headroom we have is equal to the 'minimum nodes' setting of the 'headroom' nodepool.
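
For step 3, the cron job's pod could look roughly like this, expressed as the object you would hand to the Kubernetes API. The nodepool name and image are placeholders; the selector assumes GKE's `cloud.google.com/gke-nodepool` node label.

  headroom_probe_pod = {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "main-pool-probe"},
      "spec": {
          "restartPolicy": "Never",
          "containers": [{"name": "probe", "image": "k8s.gcr.io/pause:3.1"}],
          "affinity": {
              "nodeAffinity": {
                  # 'required' affinity: this pod may only land on the main
                  # pool, so it goes Pending (and wakes the autoscaler) the
                  # moment the main pool is full.
                  "requiredDuringSchedulingIgnoredDuringExecution": {
                      "nodeSelectorTerms": [{
                          "matchExpressions": [{
                              "key": "cloud.google.com/gke-nodepool",
                              "operator": "In",
                              "values": ["main"],
                          }]
                      }]
                  }
              }
          },
      },
  }

User pods would carry the same selector as a *preferred* affinity instead (step 2), so they only spill into the headroom pool when the main pool is full.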

Replace nfs-flex-volumes with something hackier

The nfs-flex-volume provider is decent for what it does, but I don't trust the author, so a simpler alternative would be nicer.

Kubernetes 1.10 brings rshared mount propagation to beta, allowing us (possibly) to use it on k8s. However, that won't be ready in time, so we should do hacks that we trust to tide us over in the meantime. Ideally it'll be hostpath based...
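
As a sketch of the hostpath direction: assume each node mounts every NFS export at a well-known path (via a DaemonSet or a node startup script), and the user pod then just bind-mounts its own subdirectory from the host. The paths and names below are hypothetical.

  # Volume + mount the spawner would add to each user pod, assuming the node
  # already has the NFS export mounted at /mnt/nfs/<host>.
  home_volume = {
      "name": "home",
      "hostPath": {"path": "/mnt/nfs/nfs-1/homes/{username}"},
  }

  home_volume_mount = {
      "name": "home",
      "mountPath": "/home/jovyan",
  }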

Use GCR for docker images

When performance testing, I'm seeing the following errors when using DockerHub:

  Warning  Failed                 49s              kubelet, gke-hello3-cluster-alpha-default-d5650773-sxqq  Failed to pull image "jupyterhub/k8s-network-tools:81c2613": pull QPS exceeded.

which is fair since it's a free service.

Set up prometheus for NFS hosts

We want to capture NFS metrics from all our NFS hosts. We should do this via Prometheus, which has built-in service discovery for Google Compute Engine instances.

Create a support cluster

We shouldn't have tools for debugging on the main cluster. Instead we should have a 'support' cluster that contains things like:

  1. Grafana
  2. Prometheus (for NFS instances)
  3. Any health check stuff we have

This could be in a different zone too.

Decide on LoadBalancing strategy for multiple hubs

It looks like we'll end up with 30-40 hubs at most for this deployment, so figuring out a load balancing strategy is important.

Requirements:

  1. Load balance users across hubs, rather than sharding them. The user homedir is really the only persistent storage we care about, and that is explicitly handled outside the hubs. So hubs should be load balanced across sessions.
  2. Sticky sessions - once a particular hub is chosen for a session, it should be used until the user session is over. Hubs themselves are stateful, so we can't load balance at the level of each user request - it has to be at the level of each user session. This means that the authentication needs to be at the level of the proxy, and the proxy needs to be aware of user authentication information to some extent.
  3. If a user closes their laptop and opens it an hour later, their notebook should sort of automagically come back to life in most cases. This is tricky because by then the user's pod has been culled, yet somehow, when a new request comes in, it has to get the hub to trigger a spawn and route correctly. This already works for the single-hub case, and we should try to make it work for the multiple-hub case too! This, I think, is what constrains us most.
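
One way requirement 2 could be satisfied at the proxy layer is sketched below: pick the least-loaded hub when a session starts, then pin the session to that hub with a cookie so later requests keep hitting it. Hub URLs, the cookie name, and the load metric are placeholders, and this is one option, not a decision.

  HUBS = ["http://hub-00", "http://hub-01", "http://hub-02"]
  SESSION_COOKIE = "data8x-hub"  # hypothetical cookie name


  def pick_hub(cookies, hub_loads):
      """Return (hub_url, cookie_to_set) for an incoming proxied request."""
      hub = cookies.get(SESSION_COOKIE)
      if hub in HUBS:
          return hub, None  # existing session stays sticky to its hub
      # New session: pick the least-loaded hub. Balancing per session rather
      # than per user keeps us from permanently sharding users (requirement 1).
      hub = min(HUBS, key=lambda h: hub_loads.get(h, 0))
      return hub, (SESSION_COOKIE, hub)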
