cocalc-kubernetes's Introduction

This is not currently being updated.


cocalc-kubernetes

This was a free, open-source, AGPL-licensed, slightly modified version of cocalc-docker for running CoCalc on a Kubernetes cluster. There is one pod for the server and one pod for each project.

STATUS

  • We sell a much better, scalable Kubernetes-based version of CoCalc, but it's not open source. It contains some proprietary extensions we developed and battle-tested for https://cocalc.com. It's based on flexible Helm charts (a way to configure services in a Kubernetes cluster), and it currently works with GKE (Google Kubernetes Engine, Google's managed Kubernetes service) and with bare-metal Kubernetes clusters. It shouldn't be too hard to make it work on other Kubernetes clusters. Pricing is higher than for cocalc-docker, depends on the installation, and is proportional to the number of users. Contact us at [email protected] for more information.

Installation

See server/README.md to get going!

LICENSE AND SUPPORT

  • Much of this code is licensed under the AGPL. If you would instead like a business-friendly MIT license, please contact [email protected], and we will sell you a 1-year license for $1499. This also includes some support, though with no guarantees (that costs more).
  • Join the CoCalc Docker mailing list for news, updates and more.
  • Use the CoCalc mailing list for general community support.

SECURITY STATUS

  • If you set up everything as explained in server/README.md, including appropriate network restrictions on the project pods, then there are no known security vulnerabilities. In particular, this is much safer to run than cocalc-docker if you are going to expose it to untrusted (or careless) users.

Discussion

cocalc-kubernetes is an open version of CoCalc that can be run on an existing generic Kubernetes cluster.

It is as similar as possible to cocalc-docker, with the following changes:

  1. It runs in a Kubernetes cluster rather than a plain Docker install, and
  2. Projects run as separate pods on the cluster rather than all in the same Docker container.

The benefits of this architecture include:

  • Projects can have individual resource limitations imposed by Kubernetes.
  • The security issues with cocalc-docker (involving projects connecting to other project services on localhost) can be addressed by blocking outgoing network connections from projects.
  • Projects run across a cluster, so the number of projects that can run at once is a function of the number (and size) of nodes in the cluster, rather than of the size of a single cocalc-docker host.
  • This can be done with only small modifications and additions to the entirely open-source (AGPL) codebase that cocalc-docker uses.
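To illustrate the second point, here is a minimal sketch (not the policy from server/README.md) of a function that prints a Kubernetes NetworkPolicy denying all outgoing connections from project pods. It assumes the project pods can be selected by a label; the default label cocalc-role=project is hypothetical, so check what your project pods actually carry with kubectl get pods --show-labels. A NetworkPolicy is only enforced if the cluster's network plugin supports it, and denying all egress also cuts projects off from the internet.

```shell
#!/usr/bin/env bash
# Print a NetworkPolicy that blocks all outgoing connections from CoCalc project pods.
# Usage: project_egress_policy NAMESPACE [LABEL_KEY] [LABEL_VALUE]
# The default label (cocalc-role=project) is HYPOTHETICAL; use your project pods' real labels.
project_egress_policy() {
  local ns="$1" key="${2:-cocalc-role}" value="${3:-project}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cocalc-projects-deny-egress
  namespace: ${ns}
spec:
  podSelector:
    matchLabels:
      ${key}: ${value}
  policyTypes:
  - Egress
  # No egress rules: every outgoing connection from the selected pods is denied.
  # Replies to incoming connections (e.g., from the hub) are still allowed.
  egress: []
EOF
}

# Example: project_egress_policy default | kubectl apply -f -
```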

The drawbacks of this architecture over a more complicated architecture like the closed-source KuCalc (what https://cocalc.com uses) include:

  • It is unclear to what extent it can handle a large number of simultaneous users, since the entire server component (the database, hub, NFS file server, etc.) is served from a single pod.
  • Project storage is a single NFS server, so disk IOPS may be lower for client projects, which may or may not matter depending on the network and how the projects are used.
  • Filesystem level snapshots and other backups have to be handled outside CoCalc. This is the responsibility of the admin and is not part of cocalc-kubernetes at present. However, the TimeTravel functionality, which records every version of a file or notebook while you work on it, and lets you browse all past versions, does fully work in cocalc-kubernetes.
  • The project image is much more minimal than the one provided by https://cocalc.com: it has to be small enough to run in a normal way without the image pull taking too long. The image in KuCalc is hundreds of gigabytes but mounts quickly using some sophisticated tricks.

cocalc-kubernetes's People

Contributors

haraldschilly, slel, th1j5, williamstein, yuanzhaoyz


cocalc-kubernetes's Issues

CoCalc project restarts. How to investigate?

We are experimenting with CoCalc (a slightly slimmed image with fewer kernels and increased memory defaults) for remote teaching/pair programming. (It works pretty well!) I am currently noticing three different types of crashes and would like a hint as to how to find out why each crash occurred, how I can fix it, and where to see the logs.

  1. Python kernel crashes. Seems to occur when I allocate too much memory in a numpy array for example. The relevant cell gets a red tag with the kernel killed message. All understandable, I can live with that. (Although I wouldn’t mind seeing this somewhere in some project admin/server admin logs.)

  2. Project Pod sometimes gets killed. All I see is a Killed event in kubectl get events. Doesn’t happen super often, so it is not too bad, but I’d still like to get an idea why.

  3. Project restarts without notice. Sometimes this happens every 10 minutes while people are working on a project, so it doesn’t seem to be some idle timeout. (I figured it’s not the worst thing that can happen for teaching, as it clears all hidden variables and gives the student a clean state. ;) ) This is the nastiest problem as the reason is very unclear to me and I wouldn’t know where to look (and which limit to increase).

Any hints?
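For the pod-level crashes, Kubernetes records why a container last stopped: kubectl describe pod <pod> shows a Last State section with a Reason (such as OOMKilled) and an Exit Code, kubectl logs <pod> --previous shows the output of the previous container instance, and kubectl get events --sort-by=.lastTimestamp lists recent events in order. The helper below is a small sketch that turns such an exit code into a hint, using the usual convention that codes above 128 mean "terminated by signal (code minus 128)". A kernel killed inside a still-running container (case 1) will not show up this way.

```shell
#!/usr/bin/env bash
# Turn a container exit code into a short hint.
# By convention, exit codes above 128 mean "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  case "$code" in
    "")  echo "no previous termination recorded" ;;
    0)   echo "exited normally" ;;
    137) echo "SIGKILL (9): often the kernel OOM killer; look for Reason: OOMKilled in kubectl describe pod" ;;
    143) echo "SIGTERM (15): asked to stop, e.g. by Kubernetes while deleting the pod" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "exited with error code $code"
      fi
      ;;
  esac
}

# Example (needs kubectl; replace <pod> with the project pod name):
#   explain_exit_code "$(kubectl get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```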

creating projects doesn't work on Google Kubernetes Engine (GKE)

I set up everything and also got an external IP address, as follows:


Depending on your Kubernetes cluster, you might be able to do kubectl edit service cocalc-kubernetes-server and change the type to LoadBalancer, in order to expose the service on its own external IP address. For example, if you do this with Google Kubernetes Engine, then you might see:

wstein@cloudshell:~/cocalc-kubernetes/server (sage-math-inc)$ kubectl get services
NAME                           TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)                                                                   AGE
cocalc-kubernetes-server       LoadBalancer   10.4.2.15    34.xxx.xxx.xxx   80:30767/TCP,443:30019/TCP,2049:32468/TCP,20048:30682/TCP,111:31494/TCP   10m

However, logging into the cocalc-kubernetes-server and checking /var/log/compute.err shows:

 "(0.112360954285 seconds): Error from server (Forbidden): pods "project-2c92ccd4-05d9-45bc-90af-35e49baab538" is forbidden: User "system:serviceaccount:default:cocalc-kubernetes-server" cannot get resource "pods" in API group ""', exit_code=0"

This means that using kubectl from within cocalc-kubernetes-server isn't working: Kubernetes is denying the service account access to pods (the log shows even a get being refused). Since I made a service account to do that, I'm not sure what the problem is.
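One way to narrow this down is to ask the API server directly what the service account may do, by impersonating it with kubectl auth can-i ... --as system:serviceaccount:<namespace>:<name> (run this with an admin kubeconfig, since impersonation is itself a privilege). The helper below only prints those commands; the list of verbs is an assumption about what the server needs, based on the failed get in the log above.

```shell
#!/usr/bin/env bash
# Print kubectl commands that check whether a service account may manage pods.
# Usage: can_i_checks NAMESPACE SERVICE_ACCOUNT
# The verbs checked are an ASSUMPTION about what cocalc-kubernetes-server needs.
can_i_checks() {
  local ns="$1" sa="$2"
  local user="system:serviceaccount:${ns}:${sa}"
  local verb
  for verb in get list create delete; do
    echo "kubectl auth can-i ${verb} pods --namespace ${ns} --as ${user}"
  done
}

# Example: run each check (each prints "yes" or "no"):
#   can_i_checks default cocalc-kubernetes-server | bash
```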

NFS is not mounted on OpenShift (OKD)

I was following the README to deploy CoCalc on OpenShift:
  1. Applying the deployment and the NFS service:

[admin@master0 server]$ kubectl apply -f deployment.yaml
[admin@master0 server]$ kubectl apply -f nfs-service.yaml
[admin@master0 server]$ oc get svc 
NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                     AGE
cocalc-kubernetes-server       ClusterIP   172.30.125.22   <none>        80/TCP,443/TCP,2049/TCP,20048/TCP,111/TCP   1d
cocalc-kubernetes-server-nfs   ClusterIP   172.30.168.48   <none>        2049/TCP,20048/TCP,111/TCP                  1d

  2. The server pod starts without problems:
[admin@master0 server]$ oc get pods 
NAME                                        READY     STATUS    RESTARTS   AGE
cocalc-kubernetes-server-696876cc5d-z5p4d   1/1       Running   0          23h

  3. I can log in to it, but the projects are not starting.
  4. The NFS service is not starting:
[admin@master0 server]$ kubectl exec -it cocalc-kubernetes-server-696876cc5d-z5p4d bash 
groups: cannot find name for group ID 1000170000
root@cocalc-kubernetes-server-696876cc5d-z5p4d:/# df
Filesystem              1K-blocks     Used Available Use% Mounted on
overlay                  86064068 19663584  66400484  23% /
tmpfs                       65536        0     65536   0% /dev
tmpfs                     4082216        0   4082216   0% /sys/fs/cgroup
/dev/mapper/centos-root  86064068 19663584  66400484  23% /projects
shm                         65536        8     65528   1% /dev/shm
tmpfs                     4082216       16   4082200   1% /run/secrets/kubernetes.io/serviceaccount
root@cocalc-kubernetes-server-696876cc5d-z5p4d:/# 

Does NFS need to be started first?
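According to the drawbacks listed in the README, the database, hub, and NFS file server all run in the single server pod, so /projects inside the server pod is presumably the local storage that this pod exports, not an NFS mount itself; the NFS mount would be expected inside the project pods. A quick way to see what a path is mounted as is to look it up in /proc/mounts. The sketch below does that; checking /home/user (the project HOME shown in another log on this page) is an assumption about where the project's NFS share is mounted.

```shell
#!/usr/bin/env bash
# Print the filesystem type of the mount that contains PATH, read from
# /proc/mounts (or another file in the same format).
# Usage: fs_type PATH [MOUNTS_FILE]
fs_type() {
  local path="$1" mounts="${2:-/proc/mounts}"
  # /proc/mounts fields: device mountpoint fstype options dump pass.
  # Pick the longest mount point that equals PATH or is a parent directory of it.
  awk -v p="$path" '
    {
      mp = $2
      if (p == mp || mp == "/" || index(p, mp "/") == 1) {
        if (length(mp) > best_len) { best_len = length(mp); best = $3 }
      }
    }
    END { if (best != "") print best }
  ' "$mounts"
}

# Example (needs kubectl; expect something like nfs or nfs4 if the share is mounted):
#   kubectl exec <project-pod> -- cat /proc/mounts | fs_type /home/user /dev/stdin
```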

unable to start cluster

Hi William,
unfortunately I am not able to run the cluster; I get an error:

[root@master0 server]# kubectl port-forward  service/cocalc-kubernetes-server 4043:443
error: timed out waiting for the condition

Any hints to fix this?

btw I have imported the deployment.yml successfully:

kubectl get deployments
NAME                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
cocalc-kubernetes-server   1         0         0            0           19m
kubectl describe service
Name:              cocalc-kubernetes-server
Namespace:         cocalc
Labels:            run=cocalc-kubernetes-server
Annotations:       <none>
Selector:          run=cocalc-kubernetes-server
Type:              ClusterIP
IP:                172.30.179.0
Port:              port-1  80/TCP
TargetPort:        80/TCP
Endpoints:         <none>
Port:              port-2  443/TCP
TargetPort:        443/TCP
Endpoints:         <none>
Port:              port-3  2049/TCP
TargetPort:        2049/TCP
Endpoints:         <none>
Port:              port-4  20048/TCP
TargetPort:        20048/TCP
Endpoints:         <none>
Port:              port-5  111/TCP
TargetPort:        111/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>


Name:              cocalc-kubernetes-server-nfs
Namespace:         cocalc
Labels:            run=cocalc-kubernetes-server-nfs
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"run":"cocalc-kubernetes-server-nfs"},"name":"cocalc-kubernetes-server-nfs",...
Selector:          run=cocalc-kubernetes-server
Type:              ClusterIP
IP:                172.30.251.11
Port:              nfs  2049/TCP
TargetPort:        2049/TCP
Endpoints:         <none>
Port:              mountd  20048/TCP
TargetPort:        20048/TCP
Endpoints:         <none>
Port:              rpcbind  111/TCP
TargetPort:        111/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
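The output above most likely fits together as follows: the deployment wants 1 replica but has 0 current, so no pod carries the label run=cocalc-kubernetes-server; both services therefore show Endpoints: <none>, and kubectl port-forward to the service has no pod to attach to. The reason the pod was never created usually appears in kubectl describe deployment or kubectl get events. The helper below is a sketch that flags under-replicated deployments from a jsonpath listing of name, desired, and available replicas.

```shell
#!/usr/bin/env bash
# Read lines of "NAME DESIRED AVAILABLE" on stdin and report every deployment
# whose available replica count is below the desired count.
# An empty AVAILABLE field (status.availableReplicas unset) counts as 0.
report_unavailable() {
  local name desired available
  while read -r name desired available; do
    [ -n "$name" ] || continue
    available="${available:-0}"
    if [ "$available" -lt "${desired:-0}" ]; then
      echo "${name}: ${available}/${desired} available; inspect with: kubectl describe deployment ${name}"
    fi
  done
}

# Example (needs kubectl):
#   kubectl get deployments -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.replicas}{" "}{.status.availableReplicas}{"\n"}{end}' \
#     | report_unavailable
```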

Project gets created but never started

I'm running a Rancher cluster (v2.4.5) and followed the instructions here to deploy cocalc-kubernetes with its latest image versions (the default sagemathinc images). I can deploy it successfully, create my account, and log in, but when I try to create a new project, the screen keeps loading indefinitely.

By opening a terminal on the project container, I can see that some files are created on the user home dir:

~$ ls -lisha
total 44K
62196781 4.0K drwx------ 5 user user 4.0K Jul 29 18:14 .
 6431322 4.0K drwxr-xr-x 1 root root 4.0K Jun 22 03:43 ..
62196784    0 lrwxrwxrwx 1 user user   18 Jul 29 18:04 .bash_profile -> /home/user/.bashrc
62196782 4.0K -rw-r--r-- 1 user user 2.3K Jul 29 18:04 .bashrc
62196795 4.0K -rw-r--r-- 1 user user   47 Jul 29 18:04 .gitconfig
62196792 4.0K -rw-r--r-- 1 user user  918 Jul 29 18:04 .gitexcludes
62196788 8.0K -rw-r--r-- 1 user user 8.0K Jul 29 18:04 .jupyter-blobs-v0.db
62196790 4.0K drwxr-xr-x 2 user user 4.0K Jul 29 18:04 .sage
62196269 4.0K drwxr-xr-x 3 user user 4.0K Jul 29 18:14 .smc
62196786 4.0K drwxr-xr-x 2 user user 4.0K Jul 29 18:14 .ssh
62196804 4.0K -rw------- 1 user user 1.6K Jul 29 18:09 .viminfo

But nothing else happens and the screen keeps loading, apparently looking for the list of files.

In init.sh, the entrypoint of the container image, I found this:

if [[ -s "$HOME/project_init.sh" ]]; then
  bash "$HOME/project_init.sh" < /dev/null > /dev/stdout 2> /dev/stderr &
  disown
fi

but this project_init.sh script does not exist, and I can't find it anywhere. Could the problem be related to this script?

Edit 1

Logs for the project pod:

2020-07-29T18:25:10,146988094+00:00
2020-07-29T18:25:10,154244801+00:00 Configured whitelisted environment
ANACONDA3=/ext/anaconda3
ANACONDA5=/ext/anaconda5
COCALC_HTTP_PORT=6001
COCALC_JUPYTER_LAB_PORT=6002
COCALC_LOCAL_HUB_PORT=6000
COCALC_PROJECT_ID=eb4639ac-76fb-448a-9435-c1b82a906772
COCALC_SSH_PORT=2222
COCALC_USERNAME=user
DISPLAY=:0
EXT=/ext
HOME=/home/user
HOSTNAME=project-eb4639ac-76fb-448a-9435-c1b82a906772
ISOCHRONES=/ext/data/isochrones
JULIA_DEPOT_PATH=/home/user/.julia:/ext/julia/depot/
JUPYTER_PATH=/ext/jupyter
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_ALL=C.UTF-8
MKL_THREADING_LAYER=GNU
NLTK_DATA=/ext/data/nltk_data
NODE_ENV=production
NODE_PATH=/cocalc/src:/cocalc/src/node_modules/smc-util:/cocalc/src/node_modules:/cocalc/src/smc-project/node_modules:/cocalc/src/smc-project/
PATH=/cocalc/bin:/cocalc/src/smc-project/bin:/home/user/bin:/home/user/.local/bin:/ext/bin:/usr/lib/xpra:/opt/ghc/bin:/usr/local/bin:/usr/bin:/bin:/ext/data/homer/bin:/ext/data/weblogo:/ext/intellij/idea/bin:/ext/pycharm/pycharm/bin
PWD=/home/user
QT_QPA_PLATFORM=xcb
SCREENDIR=/tmp/screen
SHLVL=1
SMC=/home/user/.smc
SMC_ROOT=/cocalc/src
TERM=xterm-256color
USER=user
XDG_RUNTIME_DIR=/tmp/xdg-runtime-user
_=/usr/bin/env
_JAVA_OPTIONS=-Djava.io.tmpdir=/home/user/tmp -Xms64m
2020-07-29T18:25:10,176966296+00:00
rm -rf $HOME/.ssh/authorized_keys
mkdir -p $HOME/.ssh
cat <<EOF > $HOME/.ssh/authorized_keys
#                     *** THIS FILE IS AUTOGENERATED ***
# Do not modify this authorized_keys file. Adding your own keys here does not work!
# Instead, add your public keys via the user interface in your project or account settings.
EOF
cat /secrets/gateway-public/id_ed25519.pub >> $HOME/.ssh/authorized_keys
cat: /secrets/gateway-public/id_ed25519.pub: No such file or directory
2020-07-29T18:25:10.787Z - debug: jupyter BlobStore: constructor
2020-07-29T18:25:10.810Z - debug: jupyter BlobStore: /home/user/.jupyter-blobs-v0.db opened fine
2020-07-29T18:25:11.149Z - debug: NOT running in kucalc
2020-07-29T18:25:11.150Z - debug: set_extra_env: nothing provided
2020-07-29T18:25:11.150Z - debug: initializing INFO
2020-07-29T18:25:11.152Z - debug: execute_code: "git config --global --get core.excludesfile"
2020-07-29T18:25:11.153Z - debug: starting raw server...
2020-07-29T18:25:11.155Z - info: starting express http server...
2020-07-29T18:25:11.175Z - debug: Spawning the command git with given args config,--global,--get,core.excludesfile and timeout of 10s...
2020-07-29T18:25:11.181Z - debug: Listen for stdout, stderr and exit events.
2020-07-29T18:25:11.183Z - debug: finished exec of git (took 0.032000064849853516s)
2020-07-29T18:25:11.184Z - debug: stdout='/home/user/.gitexcludes
', stderr='', exit_code=0
2020-07-29T18:25:11.258Z - debug: Wrote 'info.json'
2020-07-29T18:25:11.260Z - info: raw server: port=6001, host='10.42.87.87', base='/eb4639ac-76fb-448a-9435-c1b82a906772/raw/'
2020-07-29T18:25:11.305Z - debug: primus listening on /eb4639ac-76fb-448a-9435-c1b82a906772/raw/.smc/ws
2020-07-29T18:25:11.311Z - debug: primus waiting for clients to request primus.js (length=119143)...
2020-07-29T18:25:11.311Z - debug: upload_endpoint conf
2020-07-29T18:25:11.424Z - debug: initializing secret token...
2020-07-29T18:25:11.425Z - debug: initializing secret token '/home/user/.smc/secret_token'
2020-07-29T18:25:11.426Z - debug: create '/home/user/.smc/secret_token'
2020-07-29T18:25:11.534Z - debug: start API server...
2020-07-29T18:25:12.152Z - debug: starting tcp server...
2020-07-29T18:25:12.152Z - info: starting tcp server: project <--> hub...
2020-07-29T18:25:12.153Z - info: tcp_server listening 0.0.0.0:6000
2020-07-29T18:25:12.280Z - debug: Successfully started servers.
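The log shows the project side coming up ("tcp_server listening 0.0.0.0:6000" for the project <--> hub connection, and the raw server on port 6001 at pod IP 10.42.87.87), so one plausible next check is whether the server pod can actually reach those ports. The sketch below uses bash's /dev/tcp redirection together with the coreutils timeout command; run it from a shell inside the server pod, with the IP and port taken from a log like the one above.

```shell
#!/usr/bin/env bash
# Succeed if a TCP connection to HOST:PORT can be opened within TIMEOUT seconds.
# Relies on bash's /dev/tcp redirection (so it needs bash, not sh) and coreutils' timeout.
# Usage: port_open HOST PORT [TIMEOUT]
port_open() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example, from a shell in the server pod (values from the project log above):
#   port_open 10.42.87.87 6000 && echo "project tcp server reachable" || echo "cannot connect"
```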

CUDA support

Hi! I'm wondering whether there is an image of CoCalc with both CUDA and Kubernetes cluster support. I'm aware of the cocalc-kubernetes and CuCalc projects, but unfortunately I couldn't find an image that supports both. Is that a thing?

apiVersion deprecated

Is it possible that apiVersion: extensions/v1beta1 is deprecated in favor of apiVersion: apps/v1? If that's the case, this should be updated in deployment.yaml.
Sources: other GitHub issue and blog post.
Warning: I have no Kubernetes knowledge whatsoever; I was just curious whether this would install (and the deprecated option didn't work).
I'm using a Kubernetes distribution called k3s: https://k3s.io/
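This matches the Kubernetes deprecation history: the Deployment API under extensions/v1beta1 was removed in Kubernetes 1.16 in favor of apps/v1. Besides the apiVersion line, apps/v1 requires spec.selector, and the selector must match the pod template's labels. The function below prints only a skeleton showing that structure; the label run=<name> mirrors the service selector shown in another issue on this page, and everything else in the real deployment.yaml (ports, volumes, service account, and so on) is omitted.

```shell
#!/usr/bin/env bash
# Print a minimal apps/v1 Deployment skeleton. Unlike extensions/v1beta1,
# apps/v1 requires spec.selector, and it must match the pod template labels.
# Usage: deployment_skeleton NAME IMAGE
deployment_skeleton() {
  local name="$1" image="$2"
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
spec:
  replicas: 1
  selector:
    matchLabels:
      run: ${name}
  template:
    metadata:
      labels:
        run: ${name}
    spec:
      containers:
      - name: ${name}
        image: ${image}
EOF
}

# Example: deployment_skeleton cocalc-kubernetes-server sagemathinc/cocalc-kubernetes-server
```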

pip3 not found

The Docker build of the cocalc-kubernetes server image failed after upgrading the CoCalc code base to sagemathinc/cocalc@a0717be.

How to reproduce:

git clone https://github.com/sagemathinc/cocalc.git
cd cocalc/cocalc-kubernetes
docker build --no-cache -t sagemathinc/cocalc-kubernetes-server:1.0.0 ./server/image/

implement better service account

Right now we suggest this in the README.md:

kubectl create rolebinding cocalc-kubernetes-server-binding --clusterrole=admin --serviceaccount=default:cocalc-kubernetes-server

However:

  • The account doesn't need admin rights for the entire namespace; we should grant something more precise.
  • In some cases (e.g., Docker for Windows + Kubernetes) --role=admin instead of --clusterrole=admin will work, and in others (e.g., GKE) it won't work at all and Kubernetes just says there is no admin role.

Somebody who is a Kubernetes RBAC security expert could do a better job and better lock down the cocalc-kubernetes server (so if it were compromised, then it can't do as much further damage to the whole Kubernetes cluster).
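A narrower alternative, sketched below, is a namespace-scoped Role that grants only pod operations, bound to the service account with a RoleBinding. Because the Role is defined explicitly rather than relying on a predefined admin role, it would also sidestep the differences between clusters described above. The resources and verbs are assumptions; if the server needs more, that should show up as Forbidden errors in /var/log/compute.err like the one quoted in the GKE issue on this page.

```shell
#!/usr/bin/env bash
# Print a Role limited to pod operations plus a RoleBinding to a service account.
# Usage: pod_manager_rbac NAMESPACE SERVICE_ACCOUNT
# The resources and verbs are an ASSUMPTION about what cocalc-kubernetes-server needs.
pod_manager_rbac() {
  local ns="$1" sa="$2"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${sa}-pods
  namespace: ${ns}
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${sa}-pods
  namespace: ${ns}
subjects:
- kind: ServiceAccount
  name: ${sa}
  namespace: ${ns}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${sa}-pods
EOF
}

# Example: pod_manager_rbac default cocalc-kubernetes-server | kubectl apply -f -
```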

Kubernetes pods are not being created when starting new projects

I'm testing the latest CoCalc codebase (sagemathinc/cocalc@ae68908) in the cocalc-dev namespace. The server is working well, but no Kubernetes pods are created when creating new projects.

It doesn't seem to be related to my Kubernetes environment, because I was able to get CoCalc up and running using this build (https://github.com/intellinum/cocalc/tree/1.0.1).

My team has been looking into this issue for 2 days now but couldn't figure out what's going on. It would be much appreciated if someone could shed some light on this and give us some pointers.

Trouble and solution: Unable to pull to the local registry due to the project image size

This is specific to OpenShift/Docker setups where the storage backend is devicemapper.
When you log in to CoCalc and start a project for the first time, the project image is pulled, and the project pod fails to start, repeatedly, with:
Error pulling image (latest) from .... , Untar exit status 1 open .... no space left on device

Sometimes the pull times out instead.

This is due to the default container root filesystem size: with devicemapper, each container's root is a 10 GB thin LVM volume, but the cocalc project image is larger than 10 GB:
docker.io/sagemathinc/cocalc-kubernetes-project:latest | 10.04 GB

To fix this, one should patch 2 things on the compute nodes:

  • Change the default Docker storage root size:
    # WARNING: this wipes all local images and Docker settings; you should redeploy the compute node.
    # Set DISK to the block device reserved for Docker storage before running this.
    # Write /etc/sysconfig/docker-storage-setup. The EXTRA_STORAGE_OPTIONS value contains
    # a space, so it must stay quoted inside the file.
    echo "DEVS=$DISK" > /etc/sysconfig/docker-storage-setup
    echo 'VG=DOCKER' >> /etc/sysconfig/docker-storage-setup
    echo 'SETUP_LVM_THIN_POOL=yes' >> /etc/sysconfig/docker-storage-setup
    echo 'EXTRA_STORAGE_OPTIONS="--storage-opt dm.basesize=20G"' >> /etc/sysconfig/docker-storage-setup
    echo 'DATA_SIZE=100%FREE' >> /etc/sysconfig/docker-storage-setup

    systemctl stop docker

    rm -rf /var/lib/docker
    wipefs --all "$DISK"
    docker-storage-setup
    systemctl start docker

To test whether it works on the compute node, start a container and check the size of its root filesystem:

docker run -it ubuntu bash
df -h /    # inside the container; / should now show about 20G
