snowex-hackweek / jupyterhub
JupyterHub configuration for the SnowEx Hackweek 2021
Home Page: https://snowex.hackweek.io
License: MIT License
Just want to document where per-user resources are defined; they are a combination of cloud-provider and hub configuration settings.
For CPU and RAM, we have:
- cloud provider: jupyterhub/terraform/eks/main.tf, lines 97 to 99 in c988de7
- jupyterhub configuration: lines 4 to 9 in c988de7
So for our initial configuration, the instance type has vCPU=8 and RAM=32GB, and each user is guaranteed 1 vCPU and 7GB RAM. In this case the RAM request limits the number of users per node (32//7 = 4).
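That packing arithmetic can be sketched in a few lines (the numbers are the ones quoted above; nothing here comes from the actual scheduler):

```python
# Sketch of the users-per-node arithmetic: an 8 vCPU / 32GB node,
# with each user guaranteed 1 vCPU and 7GB RAM.
NODE_VCPU = 8
NODE_RAM_GB = 32
USER_CPU_GUARANTEE = 1
USER_RAM_GB_GUARANTEE = 7

users_by_cpu = NODE_VCPU // USER_CPU_GUARANTEE       # 8
users_by_ram = NODE_RAM_GB // USER_RAM_GB_GUARANTEE  # 4

# The scheduler can only pack users until the first resource runs out,
# so RAM is the binding constraint here:
users_per_node = min(users_by_cpu, users_by_ram)
print(users_per_node)  # 4
```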
Storage allocations are less obvious because we're using defaults:
The user home directory has a 10GB dedicated disk:
https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-storage.html?highlight=volume%2010GB#size-of-storage-provisioned
The rest of the file system is "ephemeral" and tied to the K8s node configuration. At the time of setting this up, the default EBS volume attached to each node was 100GB (https://github.com/terraform-aws-modules/terraform-aws-eks/blob/9022013844a61193a2f8764311fb679747807f5c/local.tf#L67), so on the hub you see:
```
(notebook) jovyan@jupyter-scottyhq:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         100G  6.5G   94G   7% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  100G  6.5G   94G   7% /etc/hosts
/dev/nvme3n1    9.8G  8.7G  1.1G  89% /home/jovyan
```
NOTE: with 4 users on the hub, if everyone writes a bunch of data to a scratch location outside of /home/jovyan (like /tmp), remember that the root volume is shared and also has some space taken up by k8s components. So in the case above we could run into trouble if everyone tries to write 30GB of data to /tmp (30*4 = 120GB > 94GB).
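A quick sanity check of that scratch-space arithmetic, using the numbers from the df output above:

```python
# Sketch: 4 users each writing 30GB to /tmp share one ~94GB root volume.
ROOT_AVAIL_GB = 94           # "Avail" on the overlay filesystem in df -h above
users_per_node = 4
scratch_per_user_gb = 30

total_scratch_gb = users_per_node * scratch_per_user_gb
print(total_scratch_gb)                  # 120
print(total_scratch_gb > ROOT_AVAIL_GB)  # True -> the node would run out of disk
```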
On day 1 we were caught off guard when only ~40 participants could log onto the hub, and we saw these messages in the autoscaler logs:
```
Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. InsufficientFreeAddressesInSubnet - There are not enough free addresses in subnet 'subnet-0da0f458c2cb44757' to satisfy the requested number of instances. Launching EC2 instance failed.
```
Long story short: make sure your VPC network settings offer enough internal IP addresses for your cluster from the start!
It turned out this error was due to two things. First, our subnets were defined with small /24 CIDR blocks:
jupyterhub/terraform/eks/main.tf, lines 53 to 54 in 3586e3a
Second, we were forcing everything into a single availability zone instead of spreading users across multiple data centers:
jupyterhub/terraform/eks/main.tf, line 105 in 3586e3a
These settings were taken from examples in the terraform module repository we were using https://github.com/terraform-aws-modules/terraform-aws-eks/search?p=1&q=%2F24&type=code
Over the last couple of months we'd only had up to 30 simultaneous hub users, which is fine with these settings, but going to 50+ surfaced these issues.
Unfortunately, it turned out that it is not simply a matter of changing these network settings to make more IP addresses available. We first tried changing the terraform configuration above to the following (matching the pangeo hub configuration), with CIDR settings that allow 8000+ unique IPs per subnet:
```
public_subnets  = ["172.16.0.0/19", "172.16.32.0/19", "172.16.64.0/19"]
private_subnets = ["172.16.96.0/19", "172.16.128.0/19", "172.16.160.0/19"]
```
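To see why the subnet size matters so much, here is a sketch comparing usable addresses in a /24 vs a /19 block (AWS reserves 5 addresses in every subnet: network, VPC router, DNS, one for future use, and broadcast):

```python
# Sketch: usable IPv4 addresses per subnet for the old /24 blocks vs the
# /19 blocks above.
import ipaddress

AWS_RESERVED = 5  # addresses AWS reserves in every subnet

def usable_addresses(cidr: str) -> int:
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED

print(usable_addresses("172.16.0.0/24"))  # 251
print(usable_addresses("172.16.0.0/19"))  # 8187 -> the "8000+ unique IPs"
```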
That led to terraform errors, and possibly just isn't doable without destroying the existing VPC and the EKS cluster running in it (see https://aws.amazon.com/premiumsupport/knowledge-center/vpc-ip-address-range/). Eventually we ran `terraform apply -target=module.eks` and deleted the cluster (but fortunately not the other things being used by the hackweek, such as the EC2 instance with the database and the S3 bucket with tutorial data!).
With the cluster deleted, the helm history was also gone (and with it the mappings of everyone's EBS home directories), as was the jupyterhub configuration we had previously applied, so we had to redeploy jupyterhub and everyone started with new home directories.
A useful feature request would be to allow remote access to the database from a machine outside of the jupyterhub cluster VPC. For example, how can we explore this database from QGIS running on a laptop? There are two requirements for this:
The database must be configured to allow access for given user credentials
The EC2 instance with the database must have certain ports open (can be open to any IP, or restricted to certain ranges of IPs)
jupyterhub/terraform/eks/ec2_postgres.tf, line 22 in 7adb37f
We've used spot instances for both the core nodegroup (jupyterhub and kube-system pods) and the user nodegroup (jupyter-user pods). This works fine for tutorial development and intermittent use, where a node only occasionally gets shut down (every few days to weeks), but it does not work so well during a week-long workshop: we've had at least one disruption per day, which is noticeable during constant use. Fortunately, if the core node goes down (and consequently the hub pod), everyone's server is merely unavailable for ~4 minutes while a new node comes online and the pod restarts. But it would be better to switch to on-demand nodes during a workshop.
We're using a default capped home directory volume size of 10GB. If a user fills up this space and shuts down their server, it fails to start when they next try to log in, with logs showing:
```
Normal   Pulled   6s                kubelet  Successfully pulled image "uwhackweek/snowex:latest" in 1.007546596s
Warning  BackOff  4s (x5 over 48s)  kubelet  Back-off restarting failed container
```
```
$ kubectl logs -n jhub jupyter-jacktarricone
/srv/conda/envs/notebook/lib/python3.8/subprocess.py:848: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
[I 2021-06-03 23:58:18.598 LabApp] JupyterLab extension loaded from /srv/conda/envs/notebook/lib/python3.8/site-packages/jupyterlab
[I 2021-06-03 23:58:18.598 LabApp] JupyterLab application directory is /srv/conda/envs/notebook/share/jupyter/lab
Traceback (most recent call last):
  File "/usr/local/bin/repo2docker-entrypoint", line 97, in <module>
    main()
  File "/usr/local/bin/repo2docker-entrypoint", line 78, in main
    tee(chunk)
  File "/usr/local/bin/repo2docker-entrypoint", line 64, in tee
    f.flush()
OSError: [Errno 28] No space left on device
```
Docs on storage optimizations are here: https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-storage.html?highlight=volume%2010GB#customizing-user-storage
Some solutions:
- Increase the home directory volume size for all users: https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-storage.html?highlight=volume%2010GB#size-of-storage-provisioned
- Manually resize the user volume in the AWS console (not sure if k8s will complain about state mismatches for this)
- Edit the PVC directly with `kubectl edit pvc $your_pvc` and modify `spec.resources.requests.storage` (see https://stackoverflow.com/questions/40335179/can-a-persistent-volume-be-resized)
#8 tried to create an IAM user account we can use for accessing a snowex S3 bucket from anywhere (not just the jupyterhub). It failed with `AccessDenied: User: arn:aws:sts::***:assumed-role/github-actions-role/GitHubActions is not authorized to perform: iam:CreateUser on resource:` (https://github.com/snowex-hackweek/jupyterhub/runs/2807361024?check_suite_focus=true). Should be an easy fix; we just need to add another policy document with those permissions here: https://github.com/snowex-hackweek/jupyterhub/tree/main/terraform/setup/iam
It seems like we can currently only have a limited number of simultaneous connections (not sure exactly how many or where this configuration lives). cc @micahjohnson150 @jomey @lsetiawan if you want to dig in.
The EC2 config is here: https://github.com/snowex-hackweek/jupyterhub/blob/main/terraform/eks/ec2_postgres.tf and the actual database setup is probably documented over in https://snowexsql.readthedocs.io/en/latest/
```python
from snowexsql.db import get_db

db_name = 'snow:[email protected]/snowex'
engine, session = get_db(db_name)
engine.table_names()
```
```
OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```
Full traceback (the intermediate sqlalchemy engine/pool frames are condensed; they all lead to the same psycopg2 connect call):
```
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
<ipython-input-2-cf50d0573950> in <module>
      1 # Output the list of tables in the database
----> 2 engine.table_names()

    [... sqlalchemy engine/pool frames elided ...]

/srv/conda/envs/notebook/lib/python3.8/site-packages/psycopg2/__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
    121     dsn = _ext.make_dsn(dsn, **kwargs)
--> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
    123     if cursor_factory is not None:

OperationalError: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```
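The arithmetic behind the exhausted connection slots can be sketched as follows. The server-side numbers are stock Postgres defaults and an assumption about this deployment; the per-engine numbers are SQLAlchemy's QueuePool defaults:

```python
# Sketch: why a few dozen simultaneous users can exhaust Postgres
# connection slots. Assumptions: stock Postgres (max_connections = 100,
# superuser_reserved_connections = 3) and SQLAlchemy QueuePool defaults
# (pool_size=5, max_overflow=10 -> up to 15 connections per engine).
MAX_CONNECTIONS = 100
SUPERUSER_RESERVED = 3
MAX_PER_ENGINE = 5 + 10

available_slots = MAX_CONNECTIONS - SUPERUSER_RESERVED
print(available_slots)                    # 97
print(available_slots // MAX_PER_ENGINE)  # 6 fully-busy engines fit; a 7th hits the FATAL error
```

If these assumptions hold, raising max_connections on the server or shrinking the per-user pool would both push the limit out.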