
data8xhub's People

Contributors

ryanlovett, vipasu, yuvipanda


data8xhub's Issues

Provide better notification that the server is dead

Currently, when the server dies, users see their code getting 'stuck', and not much else.

This is very bad UX!

We should instead show a proper notification that the server can not be reached. This should also differentiate between being unable to connect and an explicit 404 / 403, which indicates the server has been culled. It should give the user an indication of what they should do!

Set up CI/CD for hub deployments

Currently we do not have any CI/CD set up.

I think we should, but only for hub deployments. I think gdm changes should happen manually for now.

Write a simple service / library that provides sharding info for user homedirs

We will have N NFS servers, each of which can serve a capacity of Cn users. We need a high-performance, non-racy, and simple way of doing the following:

  1. For a given user, fetch which host their homedir is on
  2. If the user is new, assign them to a specific host based on some metric (least loaded, probably)
  3. If we perform a migration, update the user info so it directs clients to the new location

It also needs to perform at least some sort of authentication, with reads and writes authorized separately.

All this also needs to be fairly race free, and we should have a plan for backing up the data + reconstructing it from scratch if we lose it.
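
As a starting point, here's a minimal sketch of what the lookup/assignment logic could look like, assuming assignments live in a single JSON state file guarded by a file lock. The file path, capacity bookkeeping, and function names are all hypothetical, and a real service would put an authenticated HTTP API in front of this.

  import fcntl
  import json

  # Hypothetical state file; assumed to already exist, seeded with
  # {"capacity": {"nfs-1": 1000, ...}, "users": {}}.
  STATE_FILE = "/var/lib/homedir-shards/assignments.json"


  def get_or_assign_host(username):
      """Return the NFS host for a user, assigning the least-loaded host if new."""
      with open(STATE_FILE, "r+") as f:
          # An exclusive lock keeps concurrent assignments from racing.
          fcntl.flock(f, fcntl.LOCK_EX)
          state = json.load(f)
          if username in state["users"]:
              return state["users"][username]
          # Least loaded = smallest fraction of capacity currently in use.
          counts = {h: 0 for h in state["capacity"]}
          for host in state["users"].values():
              if host in counts:
                  counts[host] += 1
          host = min(counts, key=lambda h: counts[h] / state["capacity"][h])
          state["users"][username] = host
          f.seek(0)
          f.truncate()
          json.dump(state, f)
          return host


  def migrate_user(username, new_host):
      """Repoint a user after their homedir has been copied to a new host."""
      with open(STATE_FILE, "r+") as f:
          fcntl.flock(f, fcntl.LOCK_EX)
          state = json.load(f)
          state["users"][username] = new_host
          f.seek(0)
          f.truncate()
          json.dump(state, f)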

Use a notebook serverextension for culling

We want a bit more intelligent culling, so let's make the single-user notebook pods do the culling themselves with a notebook server extension. This lets us kill servers when their kernels have been idle for a while, rather than killing based on network requests alone.
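
Something along these lines could work. This is only a sketch: it assumes the kernel models returned by the kernel manager carry `last_activity` and `execution_state` (as the /api/kernels REST API does), and the timeout, check interval, and hard exit are placeholder choices.

  import os
  from datetime import datetime, timezone

  from tornado.ioloop import PeriodicCallback

  IDLE_TIMEOUT = 3600  # seconds; placeholder value


  def _parse_ts(ts):
      # last_activity is an ISO8601 string like "2018-05-01T12:00:00.000000Z"
      ts = ts.rstrip("Z")
      fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in ts else "%Y-%m-%dT%H:%M:%S"
      return datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)


  def load_jupyter_server_extension(nbapp):
      km = nbapp.kernel_manager

      def cull_if_idle():
          kernels = km.list_kernels()
          if not kernels:
              return
          now = datetime.now(timezone.utc)
          for kernel in kernels:
              idle_for = (now - _parse_ts(kernel["last_activity"])).total_seconds()
              if kernel["execution_state"] == "busy" or idle_for < IDLE_TIMEOUT:
                  return  # something is still active, don't cull
          nbapp.log.info("All kernels idle for %ss, shutting down", IDLE_TIMEOUT)
          os._exit(0)

      # Check once a minute on the notebook server's own event loop.
      PeriodicCallback(cull_if_idle, 60 * 1000).start()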

Increase autosaving interval

edX students seem to be forgetting to save their work frequently. I suggest increasing the autosave interval to 5 minutes.

Set up prometheus in-cluster

Collect metrics about things inside the cluster, with Prometheus!

It should have a reasonably large SSD and a couple of months of retention.

Data8x.2 Checklist

  • Get HTTPS certificates for hub.data8x.berkeley.edu
  • Set up HTTPS certificates on inner edge
  • Copy user directories from old NFS servers to new
  • Switch external IP address over
  • Set up Grafana

Test how a GKE cluster behaves at 50,000 pods

Spin up a cluster with 50,000 pods, spread across, say, 10 namespaces.

Then figure out how the cluster reacts to normal operations when under this much load:

  1. Actively mount and unmount new NFS volumes all the time
  2. Lose a node and see what happens
  3. List and watch latencies
  4. Create a new pod - scheduling latency

If this proves to be too much, we can always split up our cluster into multiple smaller ones.
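
For item 4, something like the following could give a rough scheduling-latency number, assuming the official `kubernetes` Python client; the pod name, namespace, and image are placeholders.

  import time

  from kubernetes import client, config, watch

  config.load_kube_config()
  v1 = client.CoreV1Api()

  pod = {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "sched-latency-test"},
      "spec": {
          "restartPolicy": "Never",
          "containers": [{"name": "pause", "image": "k8s.gcr.io/pause:3.1"}],
      },
  }

  start = time.time()
  v1.create_namespaced_pod(namespace="default", body=pod)

  # Watch the pod until the scheduler has assigned it to a node.
  w = watch.Watch()
  for event in w.stream(v1.list_namespaced_pod, namespace="default",
                        field_selector="metadata.name=sched-latency-test"):
      if event["object"].spec.node_name:
          print("scheduled after %.2fs" % (time.time() - start))
          w.stop()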

Decide if we're going to use a ZFS RAID or not

If we don't use ZFS RAID with multiple disks, we can simply rely on Google's snapshots for backups. If we do use ZFS RAID, backups are on us.

If we can get away with non-RAID performance, we should be fine going without RAID.

Write grading script

We want a script that can take a notebook or a .py file and produce a score. This score will then be sent to edX.
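
A rough sketch of what that could look like, assuming grading works by executing the submitted notebook and counting test cells that print a marker. The `# TEST` / `TEST PASSED` markers and the scoring scheme are made-up conventions, and posting the score to edX would be a separate step.

  import sys

  import nbformat
  from nbconvert.preprocessors import ExecutePreprocessor


  def grade_notebook(path):
      """Execute a notebook and return the fraction of test cells that passed."""
      nb = nbformat.read(path, as_version=4)
      ExecutePreprocessor(timeout=120, allow_errors=True).preprocess(nb, {})
      passed = total = 0
      for cell in nb.cells:
          if cell.cell_type != "code" or "# TEST" not in cell.source:
              continue
          total += 1
          text = "".join(out.get("text", "") for out in cell.get("outputs", []))
          if "TEST PASSED" in text:
              passed += 1
      return passed / total if total else 0.0


  if __name__ == "__main__":
      print(grade_notebook(sys.argv[1]))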

Resource Slack for AutoScaler

We want our autoscaler to have some amount of 'resource slack': it should expand the cluster before it is 100% full, at more like 80% full.

Currently this isn't a supported feature, so we have to work around it in interesting ways.

Here's a proposal:

  1. Clusters have a 'main' nodepool and a 'headroom' pool that is smaller
  2. The hub uses node affinity to prefer nodes in the main pool, but can fail over to the headroom pool if necessary
  3. We run a cron job that tries to schedule a pod whose affinity requires it to be in the main pool (a rough pod-spec sketch follows below). This runs every minute or so, and completes successfully as soon as there is more than one notebook pod on the same node as it.

So the sequence of actions here would be:

  1. The main pool is full, user pods start populating the headroom pool, no outages.
  2. The cron job fires, creating a pod that must be on the main pool. This goes Pending, triggering the autoscaler
  3. The autoscaler creates a new node in the main pool. User pods start landing there instead!
  4. Over time, the headroom pool is drained automatically, since no new pods land there while there is still capacity in the main pool
  5. If the headroom pool is also full, the cluster autoscaler notices this and spawns a new node, which might take several minutes. But this is a much rarer case!

So the amount of headroom we have is equal to the 'minimum nodes' setting of the 'headroom' nodepool.
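
For step 3, the cron job's pod could look roughly like this, expressed as the object you would hand to the Kubernetes API. The nodepool name and image are placeholders; the selector assumes GKE's `cloud.google.com/gke-nodepool` node label.

  headroom_probe_pod = {
      "apiVersion": "v1",
      "kind": "Pod",
      "metadata": {"name": "main-pool-probe"},
      "spec": {
          "restartPolicy": "Never",
          "containers": [{"name": "probe", "image": "k8s.gcr.io/pause:3.1"}],
          "affinity": {
              "nodeAffinity": {
                  # 'required' affinity: this pod may only land on the main
                  # pool, so it goes Pending (and wakes the autoscaler) the
                  # moment the main pool is full.
                  "requiredDuringSchedulingIgnoredDuringExecution": {
                      "nodeSelectorTerms": [{
                          "matchExpressions": [{
                              "key": "cloud.google.com/gke-nodepool",
                              "operator": "In",
                              "values": ["main"],
                          }]
                      }]
                  }
              }
          },
      },
  }

User pods would carry the same selector as a *preferred* affinity instead (step 2), so they only spill into the headroom pool when the main pool is full.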

Replace nfs-flex-volumes with something hackier

The nfs-flex-volume provider is decent for what it does, but I don't trust the author, so a simpler alternative would be nicer.

Kubernetes 1.10 brings rshared mount propagation to beta, allowing us (possibly) to use it on k8s. However, that won't be ready in time, so we should do hacks that we trust to tide us over in the meantime. Ideally it'll be hostpath based...
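
As a sketch of the hostpath direction: assume each node mounts every NFS export at a well-known path (via a DaemonSet or a node startup script), and the user pod then just bind-mounts its own subdirectory from the host. The paths and names below are hypothetical.

  # Volume + mount the spawner would add to each user pod, assuming the node
  # already has the NFS export mounted at /mnt/nfs/<host>.
  home_volume = {
      "name": "home",
      "hostPath": {"path": "/mnt/nfs/nfs-1/homes/{username}"},
  }

  home_volume_mount = {
      "name": "home",
      "mountPath": "/home/jovyan",
  }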

Use GCR for docker images

When performance testing, I'm seeing the following errors when using DockerHub:

  Warning  Failed                 49s              kubelet, gke-hello3-cluster-alpha-default-d5650773-sxqq  Failed to pull image "jupyterhub/k8s-network-tools:81c2613": pull QPS exceeded.

which is fair since it's a free service.

Set up prometheus for NFS hosts

We want to capture NFS metrics from all our NFS hosts. We should do this via Prometheus, which has built-in service discovery for Google Compute Engine instances.

Create a support cluster

We shouldn't have tools for debugging on the main cluster. Instead we should have a 'support' cluster that contains things like:

  1. Grafana
  2. Prometheus (for NFS instances)
  3. Any health check stuff we have

This could be in a different zone too.

Decide on LoadBalancing strategy for multiple hubs

It looks like we'll end up with 30-40 hubs at most for this deployment, so figuring out a load balancing strategy is important.

Requirements:

  1. Load balance users across hubs, rather than sharding them. The user homedir is really the only persistent storage we care about, and that is explicitly handled outside the hubs. So hubs should be load balanced across sessions.
  2. Sticky sessions - once a particular hub is chosen for a session, it should be used until the user session is over. Hubs themselves are stateful, so we can't load balance at the level of each user request - it has to be at the level of each user session. This means that the authentication needs to be at the level of the proxy, and the proxy needs to be aware of user authentication information to some extent.
  3. If a user closes their laptop and opens it an hour later, their notebook should sort of automagically come back to life in most cases. This is tricky because by then the user's pod has been culled, yet somehow, when a new request comes in, it has to get the hub to trigger a spawn and route correctly. This already works for the single-hub case, and we should try to make it work for the multiple-hub case too! This, I think, is what constrains us most.
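
One way requirement 2 could be satisfied at the proxy layer is sketched below: pick the least-loaded hub when a session starts, then pin the session to that hub with a cookie so later requests keep hitting it. Hub URLs, the cookie name, and the load metric are placeholders, and this is one option, not a decision.

  HUBS = ["http://hub-00", "http://hub-01", "http://hub-02"]
  SESSION_COOKIE = "data8x-hub"  # hypothetical cookie name


  def pick_hub(cookies, hub_loads):
      """Return (hub_url, cookie_to_set) for an incoming proxied request."""
      hub = cookies.get(SESSION_COOKIE)
      if hub in HUBS:
          return hub, None  # existing session stays sticky to its hub
      # New session: pick the least-loaded hub. Balancing per session rather
      # than per user keeps us from permanently sharding users (requirement 1).
      hub = min(HUBS, key=lambda h: hub_loads.get(h, 0))
      return hub, (SESSION_COOKIE, hub)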
