
jupyterhub's Issues

DOC: Per-user resources configuration

Just want to document where per-user resources are defined. They are a combination of cloud-provider and hub configuration settings.

For CPU and RAM, we have:

  1. cloud provider

    name = "user-spot"
    override_instance_types = ["m5.2xlarge", "m4.2xlarge", "m5a.2xlarge"]
    root_volume_type = "gp3"

  2. jupyterhub configuration

    cpu:
      limit: 2
      guarantee: 1
    memory:
      limit: 8G
      guarantee: 7G

So, for our initial configuration, the instance type has vCPU=8 and RAM=32GB, and each user is guaranteed 1 vCPU and 7GB RAM. In this case the RAM guarantee (the Kubernetes resource request) limits the number of users per node (32 // 7 = 4).
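For reference, a back-of-the-envelope check of that packing math (a sketch; it ignores the small slice of each node reserved for system pods):

    # m5.2xlarge has 8 vCPU and 32 GB RAM; each user is guaranteed 1 vCPU and 7 GB.
    node_vcpu, node_ram_gb = 8, 32
    cpu_guarantee, ram_guarantee_gb = 1, 7

    users_by_cpu = node_vcpu // cpu_guarantee        # 8
    users_by_ram = node_ram_gb // ram_guarantee_gb   # 4

    # Kubernetes packs users by whichever guarantee runs out first:
    print(min(users_by_cpu, users_by_ram))           # -> 4 users per node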

Storage allocations are less obvious because we're using defaults:

  1. The user home directory has a 10GB dedicated disk:
    https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-storage.html?highlight=volume%2010GB#size-of-storage-provisioned

  2. The rest of the file system is "ephemeral" and tied to the K8s node configuration. At the time we set this up, the default EBS volume attached to each node was 100GB (https://github.com/terraform-aws-modules/terraform-aws-eks/blob/9022013844a61193a2f8764311fb679747807f5c/local.tf#L67), so when you're on the hub you see:

(notebook) jovyan@jupyter-scottyhq:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         100G  6.5G   94G   7% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  100G  6.5G   94G   7% /etc/hosts
/dev/nvme3n1    9.8G  8.7G  1.1G  89% /home/jovyan

NOTE: with 4 users on the hub, if everyone writes a bunch of data to a scratch location outside of /home/jovyan (like /tmp), that root volume is shared and also has some space taken up by k8s components. So for the case above we might run into trouble if everyone tries to write 30GB of data to /tmp (30*4 = 120GB > 94GB available).

Summary of Day 1 Hiccup (Limits on Number of IP Addresses on EKS Cluster)

On day 1 we were caught off guard when only ~40 participants could log onto the hub, and we saw these messages in the autoscaler logs:

Long story short: Make sure your VPC network settings offer enough internal IP addresses for your cluster from the start!

Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. InsufficientFreeAddressesInSubnet - There are not enough free addresses in subnet 'subnet-0da0f458c2cb44757' to satisfy the requested number of instances. Launching EC2 instance failed.

Turns out this error was due to two things:

  1. Every time a user pod starts it requires several internal IP addresses (for the node, for the pod itself, for an EBS volume, and seemingly for other things I don't fully understand). Our CIDR blocks only allowed up to 256 unique internal IPs per subnet:

    public_subnets = ["172.16.1.0/24", "172.16.2.0/24", "172.16.3.0/24"]
    private_subnets = ["172.16.4.0/24", "172.16.5.0/24", "172.16.6.0/24"]

  2. We were forcing everything into a single availability zone instead of spreading users across multiple data centers:

    subnets = [module.vpc.private_subnets[0]]

These settings were taken from examples in the terraform module repository we were using: https://github.com/terraform-aws-modules/terraform-aws-eks/search?p=1&q=%2F24&type=code

Over the last couple of months we've only had up to 30 simultaneous hub users, which is fine with these settings, but going to 50+ surfaced these issues.

Unfortunately, it turned out not to be simply a matter of changing these network settings to get more IP addresses. We first tried changing the terraform configuration above to the following (matching the Pangeo hub configuration), with CIDR settings that allow for 8000+ unique IPs per subnet:

public_subnets       = ["172.16.0.0/19", "172.16.32.0/19", "172.16.64.0/19"]
private_subnets      = ["172.16.96.0/19", "172.16.128.0/19", "172.16.160.0/19"]
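A quick sanity check of the address math for these CIDR blocks, using Python's standard ipaddress module (note that AWS also reserves five addresses in every subnet):

    import ipaddress

    # A /24 block holds 256 addresses; a /19 holds 8192.
    for cidr in ["172.16.1.0/24", "172.16.0.0/19"]:
        net = ipaddress.ip_network(cidr)
        print(cidr, "->", net.num_addresses, "addresses")
    # 172.16.1.0/24 -> 256 addresses
    # 172.16.0.0/19 -> 8192 addresses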

Changing the existing subnets led to terraform errors, and possibly just isn't doable without destroying the existing VPC and the EKS cluster running in it (see https://aws.amazon.com/premiumsupport/knowledge-center/vpc-ip-address-range/). Eventually we ran terraform apply -target=module.eks and deleted the cluster (but fortunately not other things being used by the hackweek, such as the EC2 instance with the database and the S3 bucket with tutorial data!).

With the cluster deleted, the helm history was also gone (and with it the mappings of everyone's EBS home directories), as was the jupyterhub configuration we had previously applied, so we had to redeploy jupyterhub and everyone started with new home directories.

Remote access to postgres database

A feature request that would be useful is allowing remote access to the database from a machine outside of the jupyterhub cluster VPC. For example, how can we explore this database from QGIS running on a laptop? There are two requirements for this:

  1. The database must be configured to allow access for given user credentials

  2. The EC2 instance with the database must have certain ports open (can be open to any IP, or restricted to certain ranges of IPs)

    resource "aws_security_group" "postgres" {

cc @lsetiawan @aaarendt @micahjohnson150

To SPOT or Not?

We've used spot instances for both the core nodegroup (jupyterhub and kube-system pods) and the user nodegroup (jupyter-user pods). This works fine for tutorial development and intermittent use, where a node only occasionally gets shut down (on a frequency of days to weeks), but it does not work so well during a week-long workshop: we've had at least one disruption per day, which is noticeable during constant use. Fortunately, if the core node goes down (and consequently the hub pod), everyone's server is merely unavailable for ~4 minutes while a new node comes online and the pod restarts. But it would be better to switch to on-demand nodes during a workshop.

Server fails to start if home directory > PVC quota (10GB)

We're using the default home directory volume size, capped at 10GB. If a user fills up this space and shuts down their server, it fails to start the next time they log in, with logs showing:

  Normal   Pulled                  6s                kubelet                  Successfully pulled image "uwhackweek/snowex:latest" in 1.007546596s
  Warning  BackOff                 4s (x5 over 48s)  kubelet                  Back-off restarting failed container
➜  new kubectl logs -n jhub          jupyter-jacktarricone
/srv/conda/envs/notebook/lib/python3.8/subprocess.py:848: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
[I 2021-06-03 23:58:18.598 LabApp] JupyterLab extension loaded from /srv/conda/envs/notebook/lib/python3.8/site-packages/jupyterlab
[I 2021-06-03 23:58:18.598 LabApp] JupyterLab application directory is /srv/conda/envs/notebook/share/jupyter/lab
Traceback (most recent call last):
  File "/usr/local/bin/repo2docker-entrypoint", line 97, in <module>
    main()
  File "/usr/local/bin/repo2docker-entrypoint", line 78, in main
    tee(chunk)
  File "/usr/local/bin/repo2docker-entrypoint", line 64, in tee
    f.flush()
OSError: [Errno 28] No space left on device

Docs on storage optimizations are here
https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-storage.html?highlight=volume%2010GB#customizing-user-storage

Some solutions:

  1. Increase the home directory volume size for all users https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-storage.html?highlight=volume%2010GB#size-of-storage-provisioned.

  2. Manually resize the user PVC in the AWS console (not sure if k8s will complain about state mismatches for this)

  3. kubectl edit pvc $your_pvc and modify spec.resources.requests.storage
    https://stackoverflow.com/questions/40335179/can-a-persistent-volume-be-resized
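As a stopgap, users can check how close they are to the quota before shutting their server down; a minimal sketch (equivalent to the df -h output above):

    import shutil

    # Usage of the home-directory volume (the 10GB PVC mounted at /home/jovyan).
    usage = shutil.disk_usage("/home/jovyan")
    gb = 1024 ** 3
    print(f"used {usage.used / gb:.1f} GB of {usage.total / gb:.1f} GB "
          f"({usage.free / gb:.1f} GB free)")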

S3 bucket access outside of the jupyterhub

#8 tried to create an IAM user account we can use for accessing the snowex s3 bucket from anywhere (not just the jupyterhub). It failed with AccessDenied: User: arn:aws:sts::***:assumed-role/github-actions-role/GitHubActions is not authorized to perform: iam:CreateUser on resource: https://github.com/snowex-hackweek/jupyterhub/runs/2807361024?check_suite_focus=true. Should be an easy fix; we just need to add another policy document with those permissions here: https://github.com/snowex-hackweek/jupyterhub/tree/main/terraform/setup/iam

Database configuration and max connections

It seems like we can currently only have a limited number of simultaneous connections (not sure exactly how many or where this configuration lives). cc @micahjohnson150 @jomey @lsetiawan if you want to dig in.

The EC2 config is here: https://github.com/snowex-hackweek/jupyterhub/blob/main/terraform/eks/ec2_postgres.tf, and the actual database setup is probably documented over in https://snowexsql.readthedocs.io/en/latest/

from snowexsql.db import get_db
db_name = 'snow:[email protected]/snowex'
engine, session = get_db(db_name)
engine.table_names()
OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections

(Background on this error at: http://sqlalche.me/e/13/e3q8)
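For what it's worth, the server-side cap is PostgreSQL's max_connections setting (set in postgresql.conf on the EC2 instance; the stock default is 100). A sketch for inspecting it, assuming a working connection string:

    from sqlalchemy import create_engine

    # Placeholder DSN; substitute the real host and credentials.
    engine = create_engine("postgresql+psycopg2://snow:<password>@<host>/snowex")

    with engine.connect() as conn:
        # Server-wide cap on simultaneous connections:
        print(conn.execute("SHOW max_connections;").scalar())
        # Connections currently in use:
        print(conn.execute("SELECT count(*) FROM pg_stat_activity;").scalar())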
Full Traceback
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
   2335         try:
-> 2336             return fn()
   2337         except dialect.dbapi.Error as e:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in connect(self)
    363         if not self._use_threadlocal:
--> 364             return _ConnectionFairy._checkout(self)
    365 

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
    777         if not fairy:
--> 778             fairy = _ConnectionRecord.checkout(pool)
    779 

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
    494     def checkout(cls, pool):
--> 495         rec = pool._do_get()
    496         try:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
    139                 with util.safe_reraise():
--> 140                     self._dec_overflow()
    141         else:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     67             if not self.warn_only:
---> 68                 compat.raise_(
     69                     exc_value,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    181         try:
--> 182             raise exception
    183         finally:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
    136             try:
--> 137                 return self._create_connection()
    138             except:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
    308 
--> 309         return _ConnectionRecord(self)
    310 

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
    439         if connect:
--> 440             self.__connect(first_connect_check=True)
    441         self.finalize_callback = deque()

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
    660             with util.safe_reraise():
--> 661                 pool.logger.debug("Error on connect(): %s", e)
    662         else:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     67             if not self.warn_only:
---> 68                 compat.raise_(
     69                     exc_value,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    181         try:
--> 182             raise exception
    183         finally:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
    655             self.starttime = time.time()
--> 656             connection = pool._invoke_creator(self)
    657             pool.logger.debug("Created new connection %r", connection)

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
    113                             return connection
--> 114                 return dialect.connect(*cargs, **cparams)
    115 

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
    507         # inherits the docstring from interfaces.Dialect.connect
--> 508         return self.dbapi.connect(*cargs, **cparams)
    509 

/srv/conda/envs/notebook/lib/python3.8/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
    121     dsn = _ext.make_dsn(dsn, **kwargs)
--> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
    123     if cursor_factory is not None:

OperationalError: FATAL:  remaining connection slots are reserved for non-replication superuser connections


The above exception was the direct cause of the following exception:

OperationalError                          Traceback (most recent call last)
<ipython-input-2-cf50d0573950> in <module>
      1 # Output the list of tables in the database
----> 2 engine.table_names()

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in table_names(self, schema, connection)
   2314         """
   2315 
-> 2316         with self._optional_conn_ctx_manager(connection) as conn:
   2317             return self.dialect.get_table_names(conn, schema)
   2318 

/srv/conda/envs/notebook/lib/python3.8/contextlib.py in __enter__(self)
    111         del self.args, self.kwds, self.func
    112         try:
--> 113             return next(self.gen)
    114         except StopIteration:
    115             raise RuntimeError("generator didn't yield") from None

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _optional_conn_ctx_manager(self, connection)
   2084     def _optional_conn_ctx_manager(self, connection=None):
   2085         if connection is None:
-> 2086             with self._contextual_connect() as conn:
   2087                 yield conn
   2088         else:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _contextual_connect(self, close_with_result, **kwargs)
   2300         return self._connection_cls(
   2301             self,
-> 2302             self._wrap_pool_connect(self.pool.connect, None),
   2303             close_with_result=close_with_result,
   2304             **kwargs

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
   2337         except dialect.dbapi.Error as e:
   2338             if connection is None:
-> 2339                 Connection._handle_dbapi_exception_noconnection(
   2340                     e, dialect, self
   2341                 )

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception_noconnection(cls, e, dialect, engine)
   1581             util.raise_(newraise, with_traceback=exc_info[2], from_=e)
   1582         elif should_wrap:
-> 1583             util.raise_(
   1584                 sqlalchemy_exception, with_traceback=exc_info[2], from_=e
   1585             )

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    180 
    181         try:
--> 182             raise exception
    183         finally:
    184             # credit to

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/base.py in _wrap_pool_connect(self, fn, connection)
   2334         dialect = self.dialect
   2335         try:
-> 2336             return fn()
   2337         except dialect.dbapi.Error as e:
   2338             if connection is None:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in connect(self)
    362         """
    363         if not self._use_threadlocal:
--> 364             return _ConnectionFairy._checkout(self)
    365 
    366         try:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in _checkout(cls, pool, threadconns, fairy)
    776     def _checkout(cls, pool, threadconns=None, fairy=None):
    777         if not fairy:
--> 778             fairy = _ConnectionRecord.checkout(pool)
    779 
    780             fairy._pool = pool

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in checkout(cls, pool)
    493     @classmethod
    494     def checkout(cls, pool):
--> 495         rec = pool._do_get()
    496         try:
    497             dbapi_connection = rec.get_connection()

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
    138             except:
    139                 with util.safe_reraise():
--> 140                     self._dec_overflow()
    141         else:
    142             return self._do_get()

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     66             self._exc_info = None  # remove potential circular references
     67             if not self.warn_only:
---> 68                 compat.raise_(
     69                     exc_value,
     70                     with_traceback=exc_tb,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    180 
    181         try:
--> 182             raise exception
    183         finally:
    184             # credit to

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/impl.py in _do_get(self)
    135         if self._inc_overflow():
    136             try:
--> 137                 return self._create_connection()
    138             except:
    139                 with util.safe_reraise():

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in _create_connection(self)
    307         """Called by subclasses to create a new ConnectionRecord."""
    308 
--> 309         return _ConnectionRecord(self)
    310 
    311     def _invalidate(self, connection, exception=None, _checkin=True):

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __init__(self, pool, connect)
    438         self.__pool = pool
    439         if connect:
--> 440             self.__connect(first_connect_check=True)
    441         self.finalize_callback = deque()
    442 

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
    659         except Exception as e:
    660             with util.safe_reraise():
--> 661                 pool.logger.debug("Error on connect(): %s", e)
    662         else:
    663             if first_connect_check:

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py in __exit__(self, type_, value, traceback)
     66             self._exc_info = None  # remove potential circular references
     67             if not self.warn_only:
---> 68                 compat.raise_(
     69                     exc_value,
     70                     with_traceback=exc_tb,

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    180 
    181         try:
--> 182             raise exception
    183         finally:
    184             # credit to

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
    654         try:
    655             self.starttime = time.time()
--> 656             connection = pool._invoke_creator(self)
    657             pool.logger.debug("Created new connection %r", connection)
    658             self.connection = connection

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py in connect(connection_record)
    112                         if connection is not None:
    113                             return connection
--> 114                 return dialect.connect(*cargs, **cparams)
    115 
    116             creator = pop_kwarg("creator", connect)

/srv/conda/envs/notebook/lib/python3.8/site-packages/sqlalchemy/engine/default.py in connect(self, *cargs, **cparams)
    506     def connect(self, *cargs, **cparams):
    507         # inherits the docstring from interfaces.Dialect.connect
--> 508         return self.dbapi.connect(*cargs, **cparams)
    509 
    510     def create_connect_args(self, url):

/srv/conda/envs/notebook/lib/python3.8/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
    120 
    121     dsn = _ext.make_dsn(dsn, **kwargs)
--> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
    123     if cursor_factory is not None:
    124         conn.cursor_factory = cursor_factory

OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections

(Background on this error at: http://sqlalche.me/e/13/e3q8)
