run-house / runhouse

Fast, Pythonic AI services and workflows on your own infra. Unobtrusive, debuggable, PyTorch-like APIs.

Home Page: https://run.house

License: Apache License 2.0

Python 97.94% Shell 0.57% Dockerfile 0.56% HCL 0.93%
api artificial-intelligence aws azure collaboration data-science deployment distributed fastapi gcp infrastructure machine-learning middleware observability python pytorch ray sagemaker serverless

runhouse's People

Contributors

ankit-dhankhar, aria1th, belsasha, carolineechen, denisyay, dongreenberg, fluder-paradyne, jlewitt1, mkandler, rohansreerama5, rohinb2, steve-marmalade


runhouse's Issues

PX (P90) for inference Cold start

Describe the bug
Please provide a clear and concise description of what cold start looks like.
I see the docs mention a couple of methods to speed up the load time for models; it would be great if objective numbers could be added. Ray also provides methods to combat cold start, and I see that library is being used, but do you use such methods?

For example, in the image below from this article, most providers' cold starts are below 100s (see img), and most providers list P90/P70/P50 values to help frame the cold start problem and its solutions in those terms.
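To make the ask concrete, a P90 number would just be the 90th percentile over a batch of measured cold starts. A minimal sketch (the timing samples below are invented for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the distribution."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical cold-start samples (seconds from request to first response).
cold_starts = [12.1, 9.8, 45.0, 11.2, 10.5, 13.7, 9.9, 30.2, 12.8, 11.0]
print(f"P50={percentile(cold_starts, 50)}s  P90={percentile(cold_starts, 90)}s")
# P50=11.2s  P90=30.2s
```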

Other relevant stuff:
https://news.ycombinator.com/item?id=35738072
https://www.banana.dev/blog/turboboot

Hit "failed to rsync up" when running test_self_hosted_huggingface_instructor_embedding_documents()

Describe the bug
When running the langchain test self_hosted_huggingface_instructor_embedding_documents(), which transfers small files from the client to the server, the client hits the following error:

INFO | 2023-08-01 21:57:49,547 | Setting up Function on cluster.
INFO | 2023-08-01 21:57:49,547 | Copying folder from file:///root/t to: rh-cls
sky.exceptions.CommandError: Command rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_root/3651d5b8ee/%C -o ControlPersist=300s" '/root/t/' [email protected]:'~/t/' failed with return code 2.
Failed to rsync up: /root/t/ -> ~/t/. Ensure that the network is stable, then retry.

Then, running the same rsync command on its own:

#rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_root/3651d5b8ee/%C -o ControlPersist=300s" '/root/t/' [email protected]:'~/t/'
protocol version mismatch -- is your shell clean?
(see the rsync manpage for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(622) [sender=3.2.7]
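The "is your shell clean?" hint points at a common cause worth ruling out. A sketch of the manpage's diagnostic (the ssh key path and host come from the failing command above; the local demo files are invented):

```shell
# rsync's "protocol version mismatch -- is your shell clean?" almost always
# means the remote login shell writes to stdout before rsync starts (an echo
# or motd in .bashrc/.profile), corrupting the protocol stream. The rsync
# manpage's suggested check, using the key/host from the failing command:
#
#   ssh -i /root/.ssh/id_rsa [email protected] /bin/true > /tmp/out.dat
#   wc -c < /tmp/out.dat   # must print 0; anything else is stray output
#
# The failure mode, demonstrated locally:
bash -c 'true' > /tmp/clean.dat
bash -c 'echo "motd from .bashrc"; true' > /tmp/noisy.dat
wc -c < /tmp/clean.dat    # 0  (clean shell: rsync would work)
wc -c < /tmp/noisy.dat    # 18 (stray bytes: rsync protocol breaks)
```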

If relevant, include the steps or code snippet to reproduce the error.

Versions
Please run the following and paste the output below.

wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py


Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]

Relevant packages:
boto3==1.28.17
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4

Additional context

  1. ray start --head
  2. runhouse login
  3. python -m runhouse.servers.http.http_server

Secrets Management Overview + Tracker

Overview and progress tracker for the secrets management revamp, including new APIs and support for new secret types and providers.

Motivation

Keeping track of secrets and keys for your various cloud, cluster, and dev accounts, and sharing them across dev environments and teammates, is manual and messy. Providing secrets management for Runhouse-adjacent work (e.g. cloud providers for Runhouse clusters, API keys used alongside Runhouse functions, etc.) makes it easier to onboard to Runhouse Den. Even as a standalone, Runhouse Secrets can be an easy way to get started with storing, tracking, and sharing keys.

Runhouse already has basic secrets management support, including saving/syncing provider secrets to default locations and a login/logout flow. The secrets flow is currently quite separate from the rest of the RH resource abstractions, but it would benefit from inheriting the properties expected of RH resources, including naming, saving, and sharing.

Design

Converting secrets to an RH resource makes it easier to further develop secrets to support sharing across users/devices, add flexibility to the types of secrets, and extend to new provider-specific secrets.

Types of Secrets

  • Base Secret -- holds a values dict and a paths dict
  • Env Vars Secret -- holds a mapping of env var keys to values
  • Provider Secret -- all-encompassing secret with default (but customizable) paths/env vars assumed from the provider; values can be automatically extracted or specified explicitly
  • Cluster Secret -- holds values necessary for SSHing into or constructing an RH cluster (e.g. key pair values, cluster config settings, password)
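A rough sketch of how these four types could hang off one resource-style base (illustrative only; the class and method names here are my own, not the final Runhouse API):

```python
from dataclasses import dataclass, field

# Illustrative sketch only, not the actual Runhouse classes: it shows how the
# four secret types above could share one Resource-style base, inheriting
# naming/saving/sharing behavior.

@dataclass
class Secret:                       # "Base Secret": a values dict + a paths dict
    name: str
    values: dict = field(default_factory=dict)
    paths: dict = field(default_factory=dict)

    def save(self):                 # Resource-style behavior: persist by name
        ...

@dataclass
class EnvVarSecret(Secret):         # env var key -> value mapping
    def to_env(self, environ: dict):
        environ.update(self.values)

@dataclass
class ProviderSecret(Secret):       # default paths/env vars implied by provider
    provider: str = ""

@dataclass
class ClusterSecret(Secret):        # ssh key pair, cluster config, password
    pass
```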

Additional Secrets functions and built-in use cases

  • Top level functions for bulk loading/saving/sending/sharing secrets
  • Env Var Secret support in RH Env resource
  • Cluster Secret support in RH Cluster resource

Den updates

  • Secrets treated as a resource

BC-Breaking

  • Removal of some top level functions, such as rh.Secrets.put/get

Sample Usage

custom_secret = rh.secret(name="my_secret", values={"my_key": "my_value"})
custom_secret = custom_secret.write(path="~/.rh/secrets/custom_secret.json")

aws_secret = rh.provider_secret("aws")  # extracts from default path or env vars
aws_secret.values
>>> {'access_key': 'XXX_KEY', 'secret_key': 'YYY_KEY'}

lambdalabs_secret = rh.provider_secret("lambda", values={"api_key": "*****"}).write()
cluster.sync_secrets(["aws", "lambda"])

env_secret = rh.env_secret(name="my_env_vars", env_vars=["OPENAI_API_KEY"])  # extracts from os.environ

Progress Tracker

  • #135 (Base PR; at parity with current Secrets)
  • Add env var secrets + support in rh.env
  • Add cluster secrets + support in cluster class / sharing
  • Add paths dict to Base (all) secrets
  • Additional provider secrets (Planned: OpenAI, Anthropic, WandB, Databricks. Feel free to comment below any others you'd like to see!)
  • Improved (more controlled/intuitive) login and logout flow
  • [Den] Secrets Sharing
  • [Den] Update Secrets presentation in Den
  • [Marketing] Improved/more comprehensive tutorial
  • [Marketing] Blog Post (target mid-Dec)

cc @dongreenberg @jlewitt1angell

From SyncLinear.com | KIT-88

RuntimeError when setting up self hosted model + langchain integration

I'm hitting this bug when trying to set up a model on a Lambda Cloud instance, running SelfHostedHuggingFaceLLM() after the rh.cluster() call.

from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh

gpu = rh.cluster(name="rh-a10", instance_type="A10:1").save()

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["pip:./", "transformers", "torch"])

[screenshot]

I made sure with sky check that the Lambda credentials are set, but the error I get in the log is the one below, which I haven't been able to solve.

[screenshot]

I would appreciate any help solving this.

Getting `ValueError: Error calling check on server: Internal Server Error` when checking server on an `aws` cluster

Hi! First off, I just wanna say runhouse is an awesome project! Really gonna revolutionize how people run machine learning workflows!

Describe the bug
I'm running into an issue where I can't run any remote functions on the cluster, but I can do a cluster.run_python(...)

Here's the code I'm running:

    import runhouse as rh
    cluster = rh.OnDemandCluster(
                name="cpu-cluster",
                instance_type="CPU:8",
                provider="aws",      # options: "aws", "gcp", "azure", "lambda", or "cheapest"
            )

    cluster.up_if_not()
    cluster.run_python(['import numpy', 'print(numpy.__version__)'])
    print(cluster.check_server()) # ERRORS HERE

This runs fine until the cluster.check_server(), as you can see here:

INFO | 2023-07-10 21:51:56,953 | Loaded Runhouse config from /home/shyam/.rh/config.yaml
Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--INFO | 2023-07-10 21:51:58,623 | Found credentials in shared credentials file: ~/.aws/credentials
INFO | 2023-07-10 21:52:05,743 | Running command on cpu-cluster: python3 -c "import numpy; print(numpy.__version__)"
1.25.1
INFO | 2023-07-10 21:52:07,304 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-07-10 21:52:07,855 | Authentication (publickey) successful!
INFO | 2023-07-10 21:52:08,095 | Checking server cpu-cluster
Traceback (most recent call last):
  File "/home/shyam/Code/trainyard/examples/test.py", line 54, in <module>
    print(cluster.check_server())
  File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 363, in check_server
    self.client.check_server(cluster_config=cluster_config)
  File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
    self.request(
  File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 41, in request
    raise ValueError(
ValueError: Error calling check on server: Internal Server Error

Not sure if I'm doing something wrong here, but I think my credentials work because I can see that the cluster is being created and I can ssh into it. My package versions can be seen below, let me know if you need more information! Thanks!

Versions

Python Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
Python Version: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]

Relevant packages: 
awscli==1.25.60
azure-cli==2.31.0
azure-cli-core==2.31.0
azure-cli-telemetry==1.0.6
azure-core==1.28.0
boto3==1.24.59
docker==6.1.3
fsspec==2023.1.0
gcsfs==2023.1.0
google-api-python-client==2.92.0
google-cloud-storage==2.10.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.7
s3fs==2023.1.0
skypilot==0.3.1
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.9.0
wheel==0.38.4

Checking credentials to enable clouds for SkyPilot.
  AWS: enabled          
  Azure: disabled          
    Reason: Azure credential is not set. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: disabled          
    Reason: GCP tools are not installed or credentials are not set. Run the following commands:
      $ pip install google-api-python-client
      $ conda install -c conda-forge google-cloud-sdk -y
      $ gcloud init
      $ gcloud auth application-default login
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
  Lambda: enabled          
  IBM: disabled          
    Reason: Missing credential file at /home/shyam/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  Cloudflare (for R2 object store): disabled          
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME         LAUNCHED     RESOURCES            STATUS  AUTOSTOP  COMMAND  
cpu-cluster  13 mins ago  1x AWS(m6i.2xlarge)  INIT    (down)    test.py  

Managed spot jobs
No in progress jobs. (See: sky spot -h)

In addition, here's the end of the setup of the cluster:

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.46.12:6380'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
Shared connection to 54.166.159.228 closed.
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8266' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8266
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
/usr/bin/prlimit
2023-07-10 21:35:36,790	INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Setup commands succeeded [LogTimer=92341ms]
2023-07-10 21:35:36,791	INFO updater.py:489 -- [7/7] Starting the Ray runtime
2023-07-10 21:35:36,792	VINFO command_runner.py:371 -- Running `export RAY_USAGE_STATS_ENABLED=0;export RAY_OVERRIDE_RESOURCES='{"CPU":8}';((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml  --temp-dir /tmp/ray_skypilot || exit 1; which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; python -c 'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w"))';`
2023-07-10 21:35:36,792	VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_5a4cd850fc/7112f145b3/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=0;export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":8}'"'"';((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml  --temp-dir /tmp/ray_skypilot || exit 1; which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; python -c '"'"'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w"))'"'"';)'`
2023-07-10 21:35:41,238	INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Ray start commands succeeded [LogTimer=4447ms]
2023-07-10 21:35:41,238	INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Applied config f62a597a450a8281871e7ace3caa155afb5dfe65  [LogTimer=183192ms]
2023-07-10 21:35:42,755	INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-036c634eb67821936']  [LogTimer=515ms]
2023-07-10 21:35:42,925	INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-runtime-config=f62a597a450a8281871e7ace3caa155afb5dfe65 on ['i-036c634eb67821936']  [LogTimer=170ms]
2023-07-10 21:35:43,090	INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-file-mounts-contents=24403a03b3acb79e10305dbf19904b00a057a0a1 on ['i-036c634eb67821936']  [LogTimer=165ms]
2023-07-10 21:35:43,091	INFO updater.py:188 -- New status: up-to-date
2023-07-10 21:35:43,273	INFO commands.py:836 -- Useful commands
2023-07-10 21:35:43,273	INFO commands.py:838 -- Monitor autoscaling with
2023-07-10 21:35:43,274	INFO commands.py:839 --   ray exec /home/shyam/.sky/generated/cpu-cluster.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
2023-07-10 21:35:43,274	INFO commands.py:846 -- Connect to a terminal on the cluster head:
2023-07-10 21:35:43,274	INFO commands.py:847 --   ray attach /home/shyam/.sky/generated/cpu-cluster.yml
2023-07-10 21:35:43,274	INFO commands.py:850 -- Get a remote shell to the cluster manually:
2023-07-10 21:35:43,274	INFO commands.py:851 --   ssh -o IdentitiesOnly=yes -i ~/.ssh/sky-key [email protected]

System cannot find the path specified

Describe the bug
I'm having an issue when trying to start up a LangChain LLM. After setting up the cluster

gpu = rh.cluster('test', instance_type='T4:1', use_spot=False)

I attempt to create the LLM that will run my inference:

from langchain.llms import SelfHostedHuggingFaceLLM

llm = SelfHostedHuggingFaceLLM(model_id='dolly-v2-2-8b', hardware=gpu, model_reqs=['pip:./', 'transformers', 'torch'])

My code appears to run into an error creating or finding a file. Hoping you all can help.

INFO | 2023-04-20 11:38:47,871 | Setting up Function on cluster.
INFO | 2023-04-20 11:38:47,884 | Upping the cluster test
I 04-20 11:38:53 optimizer.py:617] == Optimizer ==
I 04-20 11:38:53 optimizer.py:628] Target: minimizing cost
I 04-20 11:38:53 optimizer.py:640] Estimated cost: $0.5 / hour
I 04-20 11:38:53 optimizer.py:640] 
I 04-20 11:38:53 optimizer.py:712] Considered resources (1 node):
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760]  CLOUD   INSTANCE               vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760]  Azure   Standard_NC4as_T4_v3   4       28        T4:1           eastus        0.53          ✔     
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760] 
I 04-20 11:38:53 optimizer.py:775] Multiple Azure instances satisfy T4:1. The cheapest Azure(Standard_NC4as_T4_v3, {'T4': 1}) is considered among:
I 04-20 11:38:53 optimizer.py:775] ['Standard_NC4as_T4_v3', 'Standard_NC8as_T4_v3', 'Standard_NC16as_T4_v3'].
I 04-20 11:38:53 optimizer.py:775] 
I 04-20 11:38:53 optimizer.py:781] To list more details, run 'sky show-gpus T4'.
I 04-20 11:38:53 cloud_vm_ray_backend.py:3327] Creating a new cluster: "test" [1x Azure(Standard_NC4as_T4_v3, {'T4': 1})].
I 04-20 11:38:53 cloud_vm_ray_backend.py:3327] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-20 11:38:58 cloud_vm_ray_backend.py:1156] To view detailed progress: tail -n100 -f [C:\Users\stollbak/sky_logs\sky-2023-04-20-11-38-53-125409\provision.log](file:///C:/Users/stollbak/sky_logs/sky-2023-04-20-11-38-53-125409/provision.log)
---------------------------------------------------------------------------
ScannerError                              Traceback (most recent call last)
File [c:\Python310\lib\site-packages\sky\execution.py:266](file:///C:/Python310/lib/site-packages/sky/execution.py:266), in _execute(entrypoint, dryrun, down, stream_logs, handle, backend, retry_until_up, optimize_target, stages, cluster_name, detach_setup, detach_run, idle_minutes_to_autostop, no_setup, _is_launched_by_spot_controller)
    265     if handle is None:
--> 266         handle = backend.provision(task,
    267                                    task.best_resources,
    268                                    dryrun=dryrun,
    269                                    stream_logs=stream_logs,
    270                                    cluster_name=cluster_name,
    271                                    retry_until_up=retry_until_up)
    273 if dryrun:

File [c:\Python310\lib\site-packages\sky\utils\common_utils.py:241](file:///C:/Python310/lib/site-packages/sky/utils/common_utils.py:241), in make_decorator.._record(*args, **kwargs)
    240 with cls(full_name, **ctx_kwargs):
--> 241     return f(*args, **kwargs)

File [c:\Python310\lib\site-packages\sky\utils\common_utils.py:220](file:///C:/Python310/lib/site-packages/sky/utils/common_utils.py:220), in make_decorator.._wrapper.._record(*args, **kwargs)
    219 with cls(name_or_fn, **ctx_kwargs):
--> 220     return f(*args, **kwargs)

File [c:\Python310\lib\site-packages\sky\backends\backend.py:56](file:///C:/Python310/lib/site-packages/sky/backends/backend.py:56), in Backend.provision(self, task, to_provision, dryrun, stream_logs, cluster_name, retry_until_up)
     55 usage_lib.messages.usage.update_actual_task(task)
---> 56 return self._provision(task, to_provision, dryrun, stream_logs,
     57                        cluster_name, retry_until_up)

File [c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:2220](file:///C:/Python310/lib/site-packages/sky/backends/cloud_vm_ray_backend.py:2220), in CloudVmRayBackend._provision(self, task, to_provision, dryrun, stream_logs, cluster_name, retry_until_up)
   2217 provisioner = RetryingVmProvisioner(
   2218     self.log_dir, self._dag, self._optimize_target,
   2219     self._requested_features, local_wheel_path, wheel_hash)
-> 2220 config_dict = provisioner.provision_with_retries(
   2221     task, to_provision_config, dryrun, stream_logs)
   2222 break

File [c:\Python310\lib\site-packages\sky\utils\common_utils.py:241](file:///C:/Python310/lib/site-packages/sky/utils/common_utils.py:241), in make_decorator.._record(*args, **kwargs)
    240 with cls(full_name, **ctx_kwargs):
--> 241     return f(*args, **kwargs)

File [c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:1718](file:///C:/Python310/lib/site-packages/sky/backends/cloud_vm_ray_backend.py:1718), in RetryingVmProvisioner.provision_with_retries(self, task, to_provision_config, dryrun, stream_logs)
   1715 to_provision.cloud.check_features_are_supported(
   1716     self._requested_features)
-> 1718 config_dict = self._retry_zones(
   1719     to_provision,
   1720     num_nodes,
   1721     requested_resources=task.resources,
   1722     dryrun=dryrun,
   1723     stream_logs=stream_logs,
   1724     cluster_name=cluster_name,
   1725     cloud_user_identity=cloud_user,
   1726     prev_cluster_status=prev_cluster_status)
   1727 if dryrun:

File [c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:1203](file:///C:/Python310/lib/site-packages/sky/backends/cloud_vm_ray_backend.py:1203), in RetryingVmProvisioner._retry_zones(self, to_provision, num_nodes, requested_resources, dryrun, stream_logs, cluster_name, cloud_user_identity, prev_cluster_status)
   1202 try:
-> 1203     config_dict = backend_utils.write_cluster_config(
   1204         to_provision,
...
   1450     self._close_pipe_fds(p2cread, p2cwrite,
   1451                          c2pread, c2pwrite,
   1452                          errread, errwrite)

FileNotFoundError: [WinError 3] The system cannot find the path specified.

Versions

Python Platform: Windows-10-10.0.19044-SP0
Python Version: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]

Relevant packages:
awscli==1.27.115
azure-cli==2.31.0
azure-cli-core==2.31.0
azure-cli-telemetry==1.0.6
azure-core==1.26.4
boto3==1.26.115
fsspec==2023.4.0
pyarrow==11.0.0
pycryptodome==3.12.0
rich==13.3.4
runhouse==0.0.5
skypilot==0.2.5
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.7.0
wheel==0.40.0

Checking credentials to enable clouds for SkyPilot.
  AWS: disabled
    Reason: AWS CLI is not installed properly. Run the following commands:
      $ pip install skypilot[aws]    Credentials may also need to be set. Run the following commands:
      $ pip install boto3
      $ aws configure
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
  Azure: enabled
  GCP: disabled
    Reason: GCP tools are not installed or credentials are not set. Run the following commands:
      $ pip install google-api-python-client
      $ conda install -c conda-forge google-cloud-sdk -y
      $ gcloud init
      $ gcloud auth application-default login
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
  Lambda: disabled
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.

Managed spot jobs
No in progress jobs. (See: sky spot -h)

Additional context
Add any other context about the problem here.

[KIT-80] MLFlow RNS Store

Right now we support two kinds of RNS stores for saving and loading - the Runhouse RNS and the git repo. MLFlow has a high degree of flexibility in the storage backends users can persist their logs and experiments to, and many DS teams already have these stores set up. MLFlow only provides first-class support for models as saved and loaded primitives from the store (which is funny, because "models" are a primitive we specifically don't support, on purpose). Today, people are saving and loading other infrastructure metadata as free-form strings (e.g. s3 paths), but this has significant limitations:

  1. Sharing infra via strings is like sharing files via filepaths: it's not an "active," ready-to-use object the way a shared Google Doc is.
  2. MLFlow doesn't make it easy to pull these strings programmatically (as far as I can see, and I've asked several active users in the MLFlow Slack about it); they intend for them to be surfaced through the UI or via their search API.
  3. Strings obviously only go so far; no one is sharing the equivalent of a Runhouse function's metadata through these strings to allow for shared services.
    From the MLFlow Slack:

[screenshot]

I think there are a few possible APIs we can provide here:

  1. Allow users to use MLFlow as the RNS store via the same Resource.save() or Resource.from_name() etc. APIs. This would mean many users who already use MLFlow could start sharing Runhouse resources immediately, without any approvals for external metadata storage or setup, and it would integrate into their existing MLFlow-centric experiment workflows. Disadvantage is the experiment or project centric structure of how MLFlow presents metadata. We wouldn't be capturing these to Runhouse RNS to show in a team-centric way.
  2. Integrate with the MLFlow tracer to allow save() to write to both MLFlow (with names only) and Runhouse RNS. That way, users can view Runhouse's interfaces for a single pane of glass into the infra resources, and MLFlow's as an experiment, project, or model centric homepage. Essentially, this would mean using MLFlow's experiment and project structure to add foldering convenience to resources, but avoid having two sources of truth for the metadata. If the user has a particular MLFlow experiment set, and then they request rh.Table.from_name("bert_dropout_v5"), we'll be going to MLFlow first to get the full RNS path for that resource, and then to Runhouse to fetch the resource itself. We could also support an api to pull a dict of all available resources for an experiment at once.
  3. Introduce an mlflow.runhouse integration (or model type) which facilitates saving and loading of Runhouse resources. This would allow saving and loading of resources in a familiar way to MLFlow users, but would also add a lot of new non-model things into the users' model registry.
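To make option 2 concrete, here is a minimal sketch (the helper, tag format, and resource attributes are assumptions, not a real Runhouse or MLFlow integration): full metadata goes to Runhouse RNS on save, and only the RNS name/address is recorded in the active MLFlow run.

```python
# Sketch only: `set_tag` would be mlflow.set_tag in practice; it is injected
# here so the example runs without an MLFlow tracking server.
def save_and_index(resource, set_tag):
    resource.save()                       # full metadata -> Runhouse RNS
    set_tag(f"runhouse.{resource.name}",  # MLFlow keeps only the name/address,
            resource.rns_address)         # avoiding two sources of truth
    return resource.rns_address

# Toy stand-in for a Runhouse resource, to show the flow end to end.
class FakeResource:
    name = "bert_dropout_v5"
    rns_address = "/team/bert_dropout_v5"
    def save(self):
        self.saved = True

tags = {}
save_and_index(FakeResource(), tags.__setitem__)
# tags == {"runhouse.bert_dropout_v5": "/team/bert_dropout_v5"}
```

On load, the reverse lookup would go MLFlow tag first (to resolve the full RNS path for the experiment), then Runhouse to fetch the resource itself.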

The easiest way to think about the user journey is like this (showing a notebook-centric workflow in a system like Databricks just to stress the assumptions, but this would all work even more simply in a git+IDE setting):

  1. User is in a notebook working on an experiment. They're using MLFlow to document hyperparameters, metrics, and logs as they set up the experiment. They create a number of functions (pull, preprocess, train, eval, etc.) and artifacts (preprocessed tables, folder of model outputs or intermediate checkpoints, etc.) which they want to dispatch to remote infra and preserve within the experiment. They use Runhouse to facilitate the infra interactions, and then call .save() on their resources to persist the metadata and the record of the resources used in the experiment (including versions probably) to MLFlow.
  2. Another user needs to reproduce or play with the experiment. They either 1) open the MLFlow UI and see the resources used for the experiment, so they can begin playing with them in a new notebook or script (or do the same by pulling through the API), or 2) open the notebook, and begin working with the resources without having to regenerate them because they can be loaded from RNS.
  3. Eventually, this experiment is chosen to move to(ward) production. An initial inference function can be shared by RNS name with a customer team for QA without needing to undergo an export step, and the notebook can even be scheduled to run repeatedly with the inference function auto-updating (and an export step to an inference engine can be added, obviously). If the user wants to schedule the logic to run as a dependency of another job, or have fault-tolerance, monitoring, etc. they can simply adapt the notebook into a single script to drop into their orchestrator (with many heavy or light options there). The user can load their resources as-is by name inside their reproduction script (e.g. rh.Function.from_name("yolo_v5_training_dropout")), copy out the full logic (including functions) from the notebook, or copy out the logic and flow but move the reusable functions into a shared git repo and import them into the script.

Cc @rmehyde, @ankmathur96

From SyncLinear.com | KIT-80

[Doc] Issues with Inline Markup Rendering

This is a follow-up to a separate offline discussion about API feedback.

Describe the Issues

What I've Tried

  • The rendering is correct on the local build.

  • I also ruled out any conflicting or overriding configurations in the files below:

    • docs/conf.py for any conflicting sphinx extension or html theme
    • .readthedocs.yaml under project root for any configuration overrides
  • I was not able to cross-reference the docs built from the main branch, which returned "404 Not Found" at the time of submitting this issue.

Next Step

See if these issues persist the next time the remote docs are built.

Consider adding `-y` option to `runhouse login` CLI command

Feature

A simple use case is logging in with a system command instead of the Python API:

!runhouse login [TOKEN]

Currently, the CLI is hardcoded with interactive=True:

valid_token: str = login_module.login(token=token, interactive=True, ret_token=True)

Motivation

It's a minor quality of life improvement.

Ideal Solution

See above
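To make the request concrete, here is a minimal sketch of what a `-y`/`--yes` flag could look like. It uses argparse as a stand-in (the real CLI is built on typer, and `login_module.login` is the actual implementation; all names below are illustrative, not the Runhouse source):

```python
# Illustrative only: argparse stand-in for a `runhouse login -y` flag.
# The real CLI uses typer and calls login_module.login(...).
import argparse


def login(token, interactive, ret_token=True):
    # Stand-in for login_module.login: prompt only when interactive.
    if interactive:
        token = input(f"Token [{token}]: ") or token
    return token if ret_token else None


def main(argv=None):
    parser = argparse.ArgumentParser(prog="runhouse login")
    parser.add_argument("token", nargs="?", default=None)
    parser.add_argument(
        "-y", "--yes", action="store_true",
        help="Skip interactive prompts (useful in notebooks and CI).",
    )
    args = parser.parse_args(argv)
    # Preserve today's default (interactive=True) unless -y is passed.
    return login(token=args.token, interactive=not args.yes, ret_token=True)
```

With this shape, `!runhouse login -y [TOKEN]` would work from a notebook without blocking on a prompt.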

Additional context

Excited to get Runhouse integration up on NatML 😄

Running jobs in a Slurm cluster

The feature
It would be interesting if Runhouse could also interface with an existing Slurm cluster.

Motivation
I am part of a team managing a Slurm (GPU) cluster. On the other hand, I have users who are interested in running large language models via Runhouse (https://langchain.readthedocs.io/en/latest/modules/llms/integrations/self_hosted_examples.html). It would be excellent if I could bridge this gap between supply and demand with Runhouse. From what I have read in the documentation, Runhouse does not yet come with an interface to Slurm.

What the ideal solution looks like
I am completely new to Runhouse, so this may not be the ideal solution model, but I imagine this could be supported as a bring-your-own cluster, with a little extra interaction between Runhouse and Slurm to request the necessary resources (maybe from the Cluster factory method) as one or more jobs in Slurm (probably through the Slurm REST API). Once the jobs are running, the nodes involved can be contacted by Runhouse as a BYO cluster.
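To sketch what that interaction might look like, here is a hedged example of the kind of body such an integration could POST to slurmrestd before treating the allocated nodes as a BYO cluster. Field names loosely follow the v0.0.36 REST spec and should be checked against your Slurm version; the partition name, GPU spec, and startup script are all illustrative assumptions:

```python
# Hypothetical sketch: build a job-submit body for POST /slurm/v0.0.36/job/submit
# so Runhouse could later SSH into the allocated nodes as a BYO cluster.
# Field names are approximate; verify against your slurmrestd version.


def slurm_submit_payload(name, partition, nodes, gpus_per_node, wall_minutes):
    script = "\n".join([
        "#!/bin/bash",
        # Keep the allocation alive so Runhouse can connect and start its server.
        "runhouse start --screen",
        "sleep infinity",
    ])
    return {
        "job": {
            "name": name,
            "partition": partition,
            "nodes": nodes,
            "tres_per_node": f"gres:gpu:{gpus_per_node}",
            "time_limit": wall_minutes,  # minutes
        },
        "script": script,
    }


payload = slurm_submit_payload("rh-slurm", "gpu", nodes=2, gpus_per_node=1, wall_minutes=120)
```

Once Slurm reports the job as running, the node IPs it allocated could be passed to rh.cluster(ips=[...], ssh_creds=...) as today.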

Uncaught error when bringing up on-demand GCP cluster with invalid `image_id`

Hi team, the Runhouse docs for on-demand clusters were not super clear about the format of the image_id, but helpfully my initial attempts to bring up a GCP cluster with e.g. image_id="pytorch-cpu-latest" (taken from the GCP docs) raised a clear error e.g. ValueError: Image 'pytorch-latest-cpu' not found in GCP.

I ended up going into the skypilot repo for clarification and found a GCP example in their yaml-spec: projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004

I modified the above for the image I wanted projects/deeplearning-platform-release/global/images/family/pytorch-1-13-cpu-v20230807-debian-11-py310 and while runhouse allowed me to submit, it hung until it timed out (and I saw no indication in the GCP Console that the instance was coming up).

I tried to run a similar command via sky launch, and saw the error, which I reported to them in this Github Issue. I am raising it here as well in case you want to update your wrapping code to catch this error.
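Since a bare family name fails fast with a ValueError while a malformed full path hangs, one option on the Runhouse side would be to validate the shape of the image_id before launching. A rough sketch (the regex is my assumption about the accepted forms, not SkyPilot's actual validation logic):

```python
# Hypothetical pre-flight check for GCP image_id strings. SkyPilot examples
# use the full "projects/<project>/global/images[/family]/<name>" form; this
# regex is an illustrative guess at that shape, not SkyPilot's own check.
import re

_GCP_IMAGE_ID = re.compile(r"^projects/[\w-]+/global/images(?:/family)?/[\w.-]+$")


def validate_gcp_image_id(image_id):
    if not _GCP_IMAGE_ID.match(image_id):
        raise ValueError(
            f"Invalid GCP image_id {image_id!r}; expected "
            "'projects/<project>/global/images[/family]/<name>'"
        )
    return image_id
```

Failing fast here would surface the problem at submit time rather than after a launch timeout.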

Versions
Please run the following and paste the output below.

Python Platform: Linux-6.4.12-arch1-1-x86_64-with-glibc2.38
Python Version: 3.10.13 (main, Sep  4 2023, 15:52:34) [GCC 13.2.1 20230801]

Relevant packages: 
boto3==1.28.40
fastapi==0.103.1
fsspec==2023.5.0
gcsfs==2023.5.0
google-api-python-client==2.97.0
google-cloud-storage==2.10.0
pyarrow==13.0.0
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.11
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.41.2

Checking credentials to enable clouds for SkyPilot.
  AWS: disabled          
    Reason: AWS credentials are not set. Run the following commands:
      $ pip install boto3
      $ aws configure
      $ aws configure list  # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
    Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
  Azure: disabled          
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: enabled          
  Lambda: disabled          
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  IBM: disabled          
    Reason: Missing credential file at /home/user/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  SCP: disabled          
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  OCI: disabled          
    Reason: `oci` is not installed. Install it with: pip install oci
    For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  Cloudflare (for R2 object store): disabled          
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME         LAUNCHED     RESOURCES                                                                  STATUS  AUTOSTOP  COMMAND                       

Managed spot jobs
No in progress jobs. (See: sky spot -h)

Additional context
Add any other context about the problem here.

[KIT-73] fn.pdb(args, kwargs)

At one point we had breakpoints and pdb working using RPyC (it's still in function.py at 379). It might be worth trying to get that working again.

Another option barring that:

When user calls fn.pdb, start a new rpc server on the cluster on a different port and a new screen name (e.g. fn_name_timestamp), dedicated to this function. Then, start an ssh terminal into the cluster with `screen -r screen_name`.

Also could be worth exploring the pty approach Modal took.

Cc @Caroline

From SyncLinear.com | KIT-73

How to use runhouse on my local server

Please provide an example of how to launch Runhouse on my local server.

I don't have an account with AWS, GCP, or any other cloud, but I have a V100 GPU on my local server.
I would like to know how to set things up on the server and client sides and how they interact. I have been trying all the examples from
https://github.com/run-house/tutorials, but none of them worked.

Please give more setup instructions and guidance.

Here is what I hit when I was trying to setup on-prem cluster:

$ cat rh.py
import runhouse as rh
from diffusers import StableDiffusionPipeline

def sd_generate(prompt):
    model = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base").to("cpu")
    return model(prompt).images[0]

gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'htang', 'ssh_private_key': '/home/htang/.ssh/id_rsa'},
                 name='rh-cluster')

sd_generate = rh.function(sd_generate).to(gpu, reqs=["./", "torch", "diffusers"])
img = sd_generate("An oil painting of Keanu Reeves eating a sandwich.")
print(type(img))
img.save("sd.png")
img.show()

$ python rh.py
INFO | 2023-05-28 04:11:35,284 | Loaded Runhouse config from /home/ytang/.rh/config.yaml
INFO | 2023-05-28 04:11:36,858 | Running command on rh-cluster: ray start --head
INFO | 2023-05-28 04:11:37,663 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
INFO | 2023-05-28 04:11:37,773 | Setting up Function on cluster.
INFO | 2023-05-28 04:11:38,044 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-28 04:11:38,105 | Authentication (publickey) successful!
INFO | 2023-05-28 04:11:38,361 | Running command on rh-cluster: ray start --head
INFO | 2023-05-28 04:11:39,023 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
INFO | 2023-05-28 04:11:39,137 | Copying local package scripts to cluster
INFO | 2023-05-28 04:11:39,327 | Installing packages on cluster rh-cluster: ['./', 'torch', 'diffusers']
Traceback (most recent call last):
  File "/home/ytang/scripts/./rh.py", line 15, in <module>
    sd_generate = rh.function(sd_generate).to(gpu, reqs=["./", "torch", "diffusers"])
  File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/rns/function.py", line 119, in to
    new_function.system.install_packages(new_function.reqs)
  File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 205, in install_packages
    self.client.install_packages(to_install)
  File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/servers/grpc/unary_client.py", line 59, in install_packages
    server_res = self.stub.InstallPackages(message)
  File "/home/ytang/.local/lib/python3.10/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ytang/.local/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1685247106.845643418","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1685247106.845642647","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
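The gRPC "failed to connect to all addresses" at the end usually means nothing is actually listening on the server port. A quick, generic check (plain sockets, independent of Runhouse) can confirm whether the port is reachable before calling `.to()`:

```python
# Generic reachability check: returns True only if something is accepting
# connections on host:port. Useful to confirm the Runhouse server port is
# actually up on the cluster before dispatching functions to it.
import socket


def port_open(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False for the server port, start the server process on the cluster first (e.g. with `runhouse start`), then retry the dispatch.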

Error when starting with the '--screen' option

When running runhouse start --screen, it shows an error like

python3 command was not found. Make sure you have python3 installed.

but when running without --screen, it works fine.
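A likely cause (an assumption, not confirmed from the Runhouse source): `screen -dm bash -c ...` spawns a non-login shell, so a conda-provided python3 that your interactive shell resolves may not be on PATH inside the screen session. This snippet compares what each kind of shell resolves:

```python
# Diagnose the --screen failure by comparing python3 resolution in a login
# shell vs. a non-login shell (the latter is what `screen -dm bash -c` uses).
import subprocess


def which_python3(login_shell):
    flag = "-lc" if login_shell else "-c"
    res = subprocess.run(
        ["bash", flag, "command -v python3"],
        capture_output=True, text=True,
    )
    return res.stdout.strip() or None


print("login shell    :", which_python3(True))
print("non-login shell:", which_python3(False))
```

If the two differ, a workaround is to put the interpreter's directory on PATH before launching, e.g. export PATH="$HOME/miniconda3/bin:$PATH" (the miniconda3 path is illustrative; use the interpreter path shown in your own logs) and then run runhouse start --port 2222 --screen again.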

Versions
Please run the following and paste the output below.

wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
python collect_env.py 
Python Platform: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.17
Python Version: 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0]

Relevant packages: 
boto3==1.33.11
fastapi==0.103.1
fsspec==2023.5.0
pyarrow==13.0.0
rich==13.5.2
runhouse==0.0.13
skypilot==0.4.0
sshfs==2023.10.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4

Checking credentials to enable clouds for SkyPilot.
  AWS: disabled                              
    Reason: AWS credentials are not set. Run the following commands:
      $ pip install boto3
      $ aws configure
      $ aws configure list  # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
    Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
  Azure: disabled                              
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: disabled                              
    Reason: GCP tools are not installed. Run the following commands:
      $ pip install google-api-python-client
      $ conda install -c conda-forge google-cloud-sdk -y
    Credentials may also need to be set. Run the following commands:
      $ gcloud init
      $ gcloud auth application-default login
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
    Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
  IBM: disabled                              
    Reason: Missing credential file at /home/admins/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  Kubernetes: disabled                              
    Reason: Credentials not found - check if ~/.kube/config exists.
  Lambda: disabled                              
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  OCI: disabled                              
    Reason: `oci` is not installed. Install it with: pip install oci
    For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  SCP: disabled                              
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  Cloudflare (for R2 object store): disabled                              
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.

Managed spot jobs
No in progress jobs. (See: sky spot -h)

Additional context
fulll logs:

 runhouse start --port 2222
INFO | 2023-12-11 02:29:30.713426 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:32.342877 | Using port: 2222.
INFO | 2023-12-11 02:29:32.343102 | Starting API server using the following command: /home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server.
Executing `/home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server --port 2222`
INFO | 2023-12-11 02:29:34.061997 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:36.233910 | Launching HTTP server on port: 2222.
INFO | 2023-12-11 02:29:36.234118 | Launching Runhouse API server with den_auth=False and use_local_telemetry=False on host: 0.0.0.0 and port: 32300
INFO:     Started server process [15764]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:32300 (Press CTRL+C to quit)
^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [15764]
runhouse start --port 2222 --screen
INFO | 2023-12-11 02:29:45.997178 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:46.455935 | Using port: 2222.
INFO | 2023-12-11 02:29:46.456143 | Starting API server using the following command: /home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server.
Executing `screen -dm bash -c "/home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server --port 2222 2>&1 | tee -a '/home/admins/.rh/server.log' 2>&1"`
python3 command was not found. Make sure you have python3 installed.

Need to support on HPU servers

The feature

Support for HPU Habana hardware accelerator in runhouse

Motivation
With the increasing demand for high-performance computing and the need for faster processing of large-scale machine learning and deep learning workloads, HPUs have emerged as powerful hardware accelerators. These accelerators offer significant performance advantages over traditional CPUs and GPUs for tasks involving LLMs, neural networks, large-scale data processing, and scientific simulations.

What the ideal solution looks like
By integrating support for HPUs in runhouse, you would provide developers with a platform that enables them to leverage these advanced hardware accelerators seamlessly. This would open up new possibilities for building and running computationally intensive applications and workflows directly on runhouse infrastructure.

For example, the client would be able to remotely launch applications on an HPU AWS server via:

rh.cluster(name='rh-gaudi', instance_type='dl1.24xlarge', provider='aws').save()

https://aws.amazon.com/ec2/instance-types/dl1/
https://developer.habana.ai/

Additional context
Self-hosted HPU servers should be supported as well.

Install fails with conda and python10

Describe the bug
The following fails on an M1 MacBook Pro:

conda create -n runhouse python==3.10
conda activate runhouse
pip install --no-cache "runhouse[aws]"

The error is:

Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [48 lines of output]
      running egg_info
      writing lib3/PyYAML.egg-info/PKG-INFO
      writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
      writing top-level names to lib3/PyYAML.egg-info/top_level.txt
      Traceback (most recent call last):
        File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 271, in <module>
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
          return distutils.core.setup(**attrs)
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
          super().run_command(command)
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 321, in run
          self.find_sources()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 329, in find_sources
          mm.run()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 551, in run
          self.add_defaults()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 589, in add_defaults
          sdist.add_defaults(self)
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 112, in add_defaults
          super().add_defaults()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
          self._add_defaults_ext()
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
          self.filelist.extend(build_ext.get_source_files())
        File "<string>", line 201, in get_source_files
        File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
          raise AttributeError(attr)
      AttributeError: cython_sources
      [end of output]

Versions
Please run the following and paste the output below.

wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Output:

python collect_env.py
Python Platform: macOS-13.4-arm64-arm-64bit
Python Version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:27:15) [Clang 11.1.0 ]

Relevant packages:
wheel==0.42.0

sh: sky: command not found
sh: sky: command not found

Consistently hit "BaseSSHTunnelForwarderError"

Describe the bug
Hi, with runhouse version 0.0.9 I consistently hit an error when running the following script (it worked with a previous version):

import runhouse as rh

gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'rhclient', 'ssh_private_key': '/home/rhclient/.ssh/id_rsa'},
                 name='rh-cls')
print("#################Restart server")
print("Exit now")
....

INFO | 2023-07-31 18:30:20,983 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-07-31 18:30:21,832 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:21,944 | Authentication (publickey) failed.
INFO | 2023-07-31 18:30:21,951 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:22,010 | Authentication (publickey) failed.
2023-07-31 18:30:22,010| ERROR | Could not open connection to gateway
ERROR | 2023-07-31 18:30:22,010 | Could not open connection to gateway
2023-07-31 18:30:22,011| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-31 18:30:22,011 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-07-31 18:30:22,011 | Server rh-cls is up, but the HTTP server may not be up.
INFO | 2023-07-31 18:30:22,011 | Restarting HTTP server on rh-cls.
INFO | 2023-07-31 18:30:22,011 | Running command on rh-cls: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:22,123 | Running command on rh-cls: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cls.log 2>&1'
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:27,237 | Checking server rh-cls again.
Traceback (most recent call last):
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 357, in check_server
    self.connect_server_client()
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 324, in connect_server_client
    self._rpc_tunnel, connected_port = self.ssh_tunnel(
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 411, in ssh_tunnel
    ssh_tunnel.start()
  File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1331, in start
    self._raise(BaseSSHTunnelForwarderError,
  File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1174, in _raise
    raise exception(reason)
sshtunnel.BaseSSHTunnelForwarderError: Could not establish session to SSH gateway

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/devspace/test_self_hosted_llm.py", line 14, in <module>
    gpu = rh.cluster(ips=['127.0.0.1'],
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
    return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 58, in __init__
    self.check_server()
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 379, in check_server
    self.client.check_server(cluster_config=cluster_config)
AttributeError: 'NoneType' object has no attribute 'check_server'

Versions
Please run the following and paste the output below


# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
boto3==1.28.15
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.1
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4

Additional context
I started:

  1. ray start --head
  2. runhouse login ...
  3. python -m runhouse.servers.http.http_server

Add local network equipment management

The feature
I wonder if you have any plans to add features and interfaces that allow Runhouse to manage GPU devices on a local network (not native cloud)?

Motivation
Because I need to deploy to local, on-premise devices instead of relying entirely on cloud machines.

Need help with local gpu system

Describe the bug
Hi,
I'm trying to use a GPU system on our local network, but I'm running into issues.
Basic question: does the Runhouse package need to be installed on the remote GPU system? I couldn't figure this out from the documentation.

Here is the snippet of code I'm trying to run:

import runhouse as rh

import pdb;pdb.set_trace()
cluster = rh.cluster(
              name="mlw-cluster",
              ips=['xx.xx.xx.xx'],
              ssh_creds={'ssh_user': 'lab', 'ssh_private_key':'/export/lab/.ssh/mlw01.key'},
          )


def num_cpus():
    import multiprocessing
    return f"Num cpus: {multiprocessing.cpu_count()}"

num_cpus()
num_cpus_cluster = rh.function(name="num_cpus_cluster", fn=num_cpus).to(system=cluster, reqs=["./"])


I get following error in creating the cluster:


(Pdb) c
2023-07-20 10:17:54,985| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:54,985 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:54,987| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:54,987 | 1 keys loaded from agent
2023-07-20 10:17:54,988| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:54,988 | 1 key(s) loaded
2023-07-20 10:17:54,988| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:54,988 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:54,988| INF | MainThrea/0978@sshtunnel | Connecting to gateway: xx.x.xxx.x:22 as user 'lab'
INFO | 2023-07-20 10:17:54,988 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:54,988| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:54,989| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'asdWEQWEQWe'
2023-07-20 10:17:55,012| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,043| INF |  Thread-1/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,043 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,278| INF |  Thread-1/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,278 | Authentication (publickey) successful!
2023-07-20 10:17:55,279| ERR | MainThrea/1230@sshtunnel | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-20 10:17:55,279 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
2023-07-20 10:17:55,280| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:55,280 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:55,282| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:55,282 | 1 keys loaded from agent
2023-07-20 10:17:55,282| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:55,282 | 1 key(s) loaded
2023-07-20 10:17:55,283| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:55,283 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:55,283| INF | MainThrea/0978@sshtunnel | Connecting to gateway: 172.17.10.110:22 as user 'lab'
INFO | 2023-07-20 10:17:55,283 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:55,283| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:55,283| WAR | MainThrea/1618@sshtunnel | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
WARNING | 2023-07-20 10:17:55,283 | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
2023-07-20 10:17:55,284| INF | MainThrea/1374@sshtunnel | Closing all open connections...
INFO | 2023-07-20 10:17:55,284 | Closing all open connections...
2023-07-20 10:17:55,284| DEB | MainThrea/1378@sshtunnel | Listening tunnels: None
2023-07-20 10:17:55,284| WAR | MainThrea/1450@sshtunnel | Tunnels are not started. Please .start() first!
WARNING | 2023-07-20 10:17:55,284 | Tunnels are not started. Please .start() first!
2023-07-20 10:17:55,284| INF | MainThrea/1453@sshtunnel | Closing ssh transport
INFO | 2023-07-20 10:17:55,284 | Closing ssh transport
2023-07-20 10:17:55,284| DEB | MainThrea/1477@sshtunnel | Transport is closed
2023-07-20 10:17:55,285| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'463095aa1803da78647cd548f37173ef'
2023-07-20 10:17:55,305| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,334| INF |  Thread-3/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,334 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,578| INF |  Thread-3/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,578 | Authentication (publickey) successful!
2023-07-20 10:17:55,579| INF | Srv-50053/1433@sshtunnel | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,579 | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,580 | Checking server mlw-cluster
2023-07-20 10:17:55,814| TRA | Thread-5 /0360@sshtunnel | #1 <-- ('127.0.0.1', 44364) connected
2023-07-20 10:17:55,815| TRA | Thread-5 /0316@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) send to ('127.0.0.1', 50052): b'504f5354202f636865636b2f20485454502f312e310d0a486f73743a203132372e302e302e313a35303035330d0a557365722d4167656e743a20707974686f6e2d72657175657374732f322e33312e300d0a4163636570742d456e636f64696e673a20677a69702c206465666c6174650d0a4163636570743a202a2f2a0d0a436f6e6e656374696f6e3a206b6565702d616c6976650d0a436f6e74656e742d4c656e6774683a203330300d0a436f6e74656e742d547970653a206170706c69636174696f6e2f6a736f6e0d0a0d0a7b2264617461223a20227b5c6e202020205c226e616d655c223a205c227e2f6d6c772d636c75737465725c222c5c6e202020205c227265736f757263655f747970655c223a205c22636c75737465725c222c5c6e202020205c227265736f757263655f737562747970655c223a205c22436c75737465725c222c5c6e202020205c226970735c223a205b5c6e20202020202020205c223137322e31372e31302e3131305c225c6e202020205d2c5c6e202020205c227373685f63726564735c223a207b5c6e20202020202020205c227373685f757365725c223a205c226c61625c222c5c6e20202020202020205c227373685f707269766174655f6b65795c223a205c222f6578706f72742f6c61622f2e7373682f6d6c7730312e6b65795c225c6e202020207d5c6e7d227d' >>>
2023-07-20 10:17:55,816| TRA | Thread-5 /0333@sshtunnel | <<< IN #1 <-- ('127.0.0.1', 44364) recv: b'5353482d322e302d4f70656e5353485f372e367031205562756e74752d347562756e7475302e350d0a' <<<
INFO | 2023-07-20 10:17:55,816 | Server mlw-cluster is up, but the HTTP server may not be up.
INFO | 2023-07-20 10:17:55,817 | Restarting HTTP server on mlw-cluster.
INFO | 2023-07-20 10:17:55,817 | Running command on mlw-cluster: pkill -f "python -m runhouse.servers.http.http_server"
2023-07-20 10:17:55,817| TRA | Thread-5 /0311@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) recv empty data >>>
2023-07-20 10:17:55,820| TRA | Thread-5 /0375@sshtunnel | #1 <-- ('127.0.0.1', 44364) connection closed.
INFO | 2023-07-20 10:17:56,571 | Running command on mlw-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_mlw-cluster.log 2>&1'
INFO | 2023-07-20 10:18:02,291 | Checking server mlw-cluster again.
2023-07-20 10:18:02,318| ERR |  Thread-3/1893@transport | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-07-20 10:18:02,318 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-07-20 10:18:02,318| TRA | Thread-14/0357@sshtunnel | #2 <-- ('127.0.0.1', 47456) open new channel ssh error: ChannelException(2, 'Connect failed')
2023-07-20 10:18:02,318| ERR | Thread-14/0394@sshtunnel | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-07-20 10:18:02,318 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
Traceback (most recent call last):
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/lab/work/learn_runhouse/testmlw01.py", line 4, in <module>
    cluster = rh.cluster(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
    return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 60, in __init__
    self.check_server()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 381, in check_server
    self.client.check_server(cluster_config=cluster_config)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
    self.request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 35, in request
    response = req_fn(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))




Versions
Please run the following and paste the output below.

wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Python Platform: Linux-5.19.0-46-generic-x86_64-with-glibc2.35
Python Version: 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0]

Relevant packages: 
boto3==1.28.6
fastapi==0.99.0
fsspec==2023.6.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.1
wheel==0.38.4

SkyPilot collects usage data to improve its services. `setup` and `run` commands are not collected to ensure privacy.
Usage logging can be disabled by setting the environment variable SKYPILOT_DISABLE_USAGE_COLLECTION=1.
Checking credentials to enable clouds for SkyPilot.
  AWS: disabled          
    Reason: AWS credentials are not set. Run the following commands:
      $ pip install boto3
      $ aws configure
      $ aws configure list  # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
    Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
  Azure: disabled          
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: disabled          
    Reason: GCP tools are not installed. Run the following commands:
      $ pip install google-api-python-client
      $ conda install -c conda-forge google-cloud-sdk -y
    Credentials may also need to be set. Run the following commands:
      $ gcloud init
      $ gcloud auth application-default login
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
    Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
  Lambda: disabled          
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  IBM: disabled          
    Reason: Missing credential file at /export/lab/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  SCP: disabled          
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  OCI: disabled          
    Reason: `oci` is not installed. Install it with: pip install oci
    For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  Cloudflare (for R2 object store): disabled          
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.

Managed spot jobs
No in progress jobs. (See: sky spot -h)


Additional context
Add any other context about the problem here.

[KIT-71] fn.batch(item, batch_size=10)

Batching is critical for good compute utilization in ML. Assuming fn is written to accept a list of inputs, calling fn.batch(single_item, batch_size=10) should accumulate the inputs on the server and only call fn(list_of_items) when it has a full batch. Open questions:

  1. Should this be sync (waiting for a full batch), async (return remote), or give a choice?
  2. Should there be a way to trigger an immediate result with an incomplete batch (e.g. if I've reached the end of a list of items and only have a partial batch)?
  3. Should there be a way to specify a time SLA, e.g. wait 30 seconds for other inputs, otherwise just submit the incomplete batch?
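One possible server-side shape for the accumulate-then-flush behavior (a sketch only, not the actual Runhouse API — the Batcher class, the sync flush-on-full-batch semantics, and the flush() method here are all assumptions):

```python
import threading

class Batcher:
    """Sketch of fn.batch semantics: accumulate single items and only
    invoke the wrapped list-accepting fn once a full batch is collected."""

    def __init__(self, fn, batch_size=10):
        self.fn = fn
        self.batch_size = batch_size
        self.items = []
        self.lock = threading.Lock()  # calls may arrive from many requests

    def batch(self, item):
        """Return fn's result when this item completes a batch, else None."""
        with self.lock:
            self.items.append(item)
            if len(self.items) < self.batch_size:
                return None  # still accumulating (open question 1: sync vs async)
            full_batch, self.items = self.items, []
        return self.fn(full_batch)

    def flush(self):
        """Open question 2: force a result from an incomplete batch."""
        with self.lock:
            partial, self.items = self.items, []
        return self.fn(partial) if partial else None

double = Batcher(lambda xs: [x * 2 for x in xs], batch_size=3)
assert double.batch(1) is None
assert double.batch(2) is None
assert double.batch(3) == [2, 4, 6]
```

A time SLA (open question 3) would just be a background timer that calls flush() when the oldest pending item exceeds the deadline.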

From SyncLinear.com | KIT-71

[KIT-67] Runs

Basic API ideas (WIP):

Create Run object (captures logs, inputs, outputs, other artifacts read or written within call, who ran, where):

res = fn(**kwargs, name="my_run")

A run is a folder (created inside local rh directory by default), and can be sent elsewhere to persist logs, results, artifact info, etc.:

rh.run(name="my_run").to("s3", path="runhouse/nlp_team/bert_ft/results")

Ideally, we can have a "default log store" setting in the user config so the logs from their runs can be sent to the same place by default when they save, rather than having to send each run one by one.

This could be the way for users to configure for artifacts/logs to flow to an existing MLFlow store, or to flow to W&B, Grafana, Datadog, etc.

Save the run to local or RNS (not all runs need to be saved)

rh.run(name="my_run").save()

Creates a run object by tracing the activity within the block - no inputs and outputs, but captures logs (perhaps several logfiles for different calls) and artifacts used:

with rh.run(name="my_run") as r:
    ...  # traced code goes here

Big feature, essentially the same as auto-caching in orchestrators - check if this run was already completed, and load results if so, otherwise run:

res = fn.get_or_run(name="yelp_review_preproc_test")
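The caching behavior could look something like this sketch (the run-folder layout, result.json filename, and JSON persistence are assumptions for illustration, not the proposed implementation):

```python
import json
import tempfile
from pathlib import Path

def get_or_run(fn, name, run_dir, *args, **kwargs):
    """If a run with this name already completed, load its persisted
    result; otherwise execute fn and write the result into the run folder."""
    result_file = Path(run_dir) / name / "result.json"
    if result_file.exists():
        return json.loads(result_file.read_text())
    result = fn(*args, **kwargs)
    result_file.parent.mkdir(parents=True, exist_ok=True)
    result_file.write_text(json.dumps(result))
    return result

calls = []
def preproc(n):
    calls.append(n)  # track actual executions
    return n + 1

with tempfile.TemporaryDirectory() as run_dir:
    assert get_or_run(preproc, "yelp_review_preproc_test", run_dir, 41) == 42
    assert get_or_run(preproc, "yelp_review_preproc_test", run_dir, 41) == 42
assert calls == [41]  # second call was served from the cache
```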

Create/name a CLI run:

r = my_cluster.run(["python test_bert.py --gpus 4 --model distilbert"], name="test_distilbert_ddp")

Inspiration: this MLFlow example

We can also support event (failure or completion) notifications through knocknock or pagerduty!

Cc @Caroline

From SyncLinear.com | KIT-67

SSH ProxyCommand support

I am trying Runhouse with a local pre-configured server. But that server needs the "ProxyCommand" option to SSH into. Is there a way the ProxyCommand can be specified in the Cluster API (like in the ssh_creds dict)?

Typical way to SSH into the server is something like this:

ssh -i <keyfile> -o ProxyCommand="ssh -W %h:%p <user>@<frontendproxyhost>" <user>@<targethost>

I do have a workaround to add the ProxyCommand in ~/.ssh/config but would be nice to specify as params in the rh.cluster API for cases where the SSH command are a bit dynamic (like in my case).
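For reference, the ~/.ssh/config workaround mentioned above looks roughly like this (host names, user, and key file are placeholders):

```
Host <targethost>
    User <user>
    IdentityFile ~/.ssh/<keyfile>
    ProxyCommand ssh -W %h:%p <user>@<frontendproxyhost>
```

An ssh_creds-level option in rh.cluster would avoid editing this file for cases where the SSH command is dynamic.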

[KIT-85] Testing

(some basic notes below, feel free to edit/comment)

GHA Setup

  • use GH secrets (I think this avoids printing secrets into GHA logs), and move them to the corresponding files where SkyPilot expects them
    • using runhouse login -y is an easier setup flow but potentially more prone to leaks
  • always up rh-cpu corresponding to those GHA secrets (AWS)
  • AWS test token has access necessary to create s3 buckets, etc
  • button for maintainers to run the tests (not automatically run)

Types of testing, split using pytest.mark

  • local
  • requires cluster
  • requires s3
  • requires gcp (skip on gha)
  • slow (skip on gha)
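If we go with pytest.mark, the markers need to be registered (e.g. in pytest.ini or setup.cfg) so pytest doesn't warn on unknown marks; a hypothetical fragment with illustrative marker names:

```ini
[pytest]
markers =
    localtest: runs with no cloud resources or credentials
    clustertest: requires a live cluster
    awstest: requires S3 access
    gcptest: requires GCP access (skip on GHA)
    slow: long-running (skip on GHA)
```

GHA can then select the cheap subset with `pytest -m localtest` while maintainers run the full suite on demand.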

Test profiling

  • determine which ones are slow, prioritize adding faster tests that can be run on GHA

 

Refactoring

  • pre-populate the local folder creations in a common test folder, so we don't need to redo it all the time and have it save down to local (and fail to clean up)
  • more unified setup/teardown for classes
  • parameterize + less duplicate code

From SyncLinear.com | KIT-85

[KIT-83] In-memory data resources

Eliminate the need to .write() a data object to the filesystem before returning it from a function, which can be quite expensive - e.g. after I've preprocessed a dataset that's already backed by files in the filesystem, calling .write() just copies them for no reason, including a costly partition.

Basically, we have the cluster object store, we should use it to avoid fs reads and writes we don't need (and save the user the trouble of knowing they need to .write() before using data remotely). This also saves us the trouble of finding places to write down data when the user doesn't feel like providing a path, or is just working with an anonymous data object (e.g. returning an rh.Table from a preprocessing fn). This will also clean up a sort of API wrinkle where a pinned object is markedly different from a blob (there doesn't need to be a real difference in terms of user intent), and the relationship between data passed to a resource constructor and the written-down data is a little unclear (e.g. if I do rh.table(my_ray_table, path="real/path/to/existing.parquet") which data should fetch return?).

Basic API concepts:

rh.table(my_table)

  • If we're on a cluster, saves the data into the cluster's obj store, setting system=this_cluster and name=f"table_{random_hex}" (just like we do to generate random run_keys). The rns_address (whether random or user-provided) is the key in the object store.
  • Calling .save just persists in the RNS that the object lives on that cluster in the object store. If the cluster goes down, the table is obviously gone.
  • If we're not on a cluster, this should just store the data in a _data field because there's no need for a local object store (nothing can .get the object from the local interpreter anyway).

rh.table(my_table).write() would actually save the table down (same as present behavior), but return a new table object with path set to the fs path. That eliminates the current ._cached_data ambiguity (multiple sources of truth), because the original object still holds the original data, and the new returned object just points to the fs data. rh.table(my_table).write(path="local/path.parquet") is clearer than the present constructor accepting both (we should probably throw an error if both are passed in, because it's ambiguous). One gotcha: if the user sets the name for the in-memory table and then writes it, should the new table have the same name? If they .save it, should we delete the existing object out of the object store so it's clear that there's only one table with that rns_address (and it's not really accessible anymore)? In general, if a user loads an object from_name, the one stored in RNS should be the source of truth, even if there's a local one in the object store.

my_table.fetch() and my_table.stream() from elsewhere should still work, but now via RPCs - the cluster's .get should already work for fetch, but we'd likely need a new one for stream. For fetch, the object needs to be pickleable (not cloudpickle-able) for us to be able to send it over the wire without dealing with python version mismatches (I don't think this is unreasonable).
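The pickle-vs-cloudpickle distinction matters because the stdlib pickler refuses anything it can't re-import by name; a quick illustration (plain data vs a lambda, which is in the class of objects cloudpickle exists to serialize):

```python
import pickle

# Plain data round-trips fine with the stdlib pickler, across Python versions
rows = [("a", 1), ("b", 2)]
assert pickle.loads(pickle.dumps(rows)) == rows

# ...but a lambda (or any object defined on the fly) is not picklable
try:
    pickle.dumps(lambda x: x + 1)
    lambda_picklable = True
except Exception:
    lambda_picklable = False
assert lambda_picklable is False
```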

We need a way to tell for a given blob or table if we need to use the RPCs instead of the existing fs-based operations, and I'm leaning toward actually breaking out the folder-backed table or blob to be separate classes from the in-memory ones. It would probably make the most sense if the in-memory Blob/Table/KVstore etc. classes were actually the base classes, and the folder-backed ones were subclasses. There are a number of advantages to doing this:

  • The rpc-based fetch/stream etc. logic is in its own class, not a bunch of branching within the blob/table methods (where we need to store some private state that this object is in the obj-store anyway).
  • The in-memory blob/table doesn't need a path field.
  • There's a clear separation between the folder-backed and infra-backed table types. It makes more sense to me to have an obj-store backed Table base class from which the folder-backed, Postgres-backed, BQ-backed, etc. tables each derive (rather than have BQ be a subclass of the folder-backed table, which is a bit strange).
  • Right now we decide on the table type the moment the user passes data into the constructor, which I'm not sure is necessary. If a user passes in a pandas table and simply intends to fetch or stream it from somewhere else, the logic to do that would be identical to cudf, hf, etc. We mostly branch into separate subclasses for the current tables to handle filesystem logic (write, save, stream, etc.)
  • We could instantly add useful KVStore, Pubsub (naming pending), and maybe vectorstore base classes without having to do any integrations - just dead simple python object implementations. We could even support .write() with pickle initially before we pick a proper system to back them.

rh.blob(my_model) saves into the object store with key blob.name or f"blob_{random_hex}". rh.blob(my_model, name="my_model") and rh.Blob.from_name("my_model") should behave identically to rh.pin_to_memory("my_model", my_model) and rh.get_pinned_object("my_model") (except with rns_address as the obj_store key instead of name, but that's an implementation detail), and ideally replace it. The current pinning system isn't very elegant and eats too much user brainspace.

An immediate implication of the above (because we use pinning for storing results when a user calls fn.remote), is that fn.remote can just wrap the result in a blob before returning instead of returning the run_key. Wrapping a result in rh.blob is common enough that it makes sense for .remote to mean "please return a remote object." The current .remote behavior of returning the run_key is actually "run this async and return a key to retrieve the result", which I think would make more sense to be called fn.async or fn.submit, considering the fact that most users don't seem to know we support async because the naming is unclear (submit could make it clearer that the function will continue to run in the background even if they kill the interpreter locally). Also, right now we need to INFO log a bunch of instructions for killing or retrieving for every .remote call, but this isn't necessary and looks ugly when the user just wants a remote object back.
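The proposed fn.submit naming mirrors Python's own concurrent.futures, where submit returns a handle immediately and the result is fetched later - analogous to the current run_key flow:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(x):
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns immediately with a handle, like a run_key;
    # the work keeps running in the background
    fut = pool.submit(preprocess, 21)
    # .result() blocks and retrieves, like fetching by run_key later
    assert fut.result() == 42
```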

Lastly, supporting remote in-memory objects opens the door to remote calls on those objects. We could pretty easily support this just by intercepting any call on the object, and if the rh.blob doesn't have that function/attr, we try RPCing the call over to the cluster. Like this:

class Blob(Resource):
    ...
    def __getattribute__(self, name):
        # not_a_blob_attr, get_attr_over_rpc, call_on_obj_via_rpc, and
        # is_primitive are sketch helpers, not existing methods
        if not_a_blob_attr(name):
            remote_attr = self.get_attr_over_rpc(name)
            if name == "__call__" or hasattr(remote_attr, "__call__"):
                def newfunc(*args, **kwargs):
                    result = self.call_on_obj_via_rpc(name, *args, **kwargs)
                    return result if self.is_primitive(result) else rh.blob(result)
                return newfunc
            else:
                return remote_attr
        else:
            # fall back to normal attribute lookup for Blob's own attrs
            return super().__getattribute__(name)

This would make our remote objects real remote objects, and save a lot of trouble creating one-off functions to send to the cluster to call methods on objects. You can do something crazy like:

model = rh.blob(my_model).to(gpu).cuda()        # But can't use .to("cuda") because it'd call blob's .to
local_pil_image = model("my_input_string").fetch()
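The same interception idea can be demonstrated locally with a plain wrapper (no RPC - forwarding to an in-process object stands in for the cluster call; Proxy and its primitive check are illustrative names, not Runhouse code):

```python
class Proxy:
    """Forward unknown attribute access to a wrapped object; callables
    return primitives directly and re-wrap anything else, mimicking
    'result if is_primitive(result) else rh.blob(result)'."""

    def __init__(self, obj):
        object.__setattr__(self, "_obj", obj)

    def fetch(self):
        # local analogue of pulling the remote object back
        return object.__getattribute__(self, "_obj")

    def __getattr__(self, name):  # only fires for attrs Proxy itself lacks
        attr = getattr(object.__getattribute__(self, "_obj"), name)
        if callable(attr):
            def forward(*args, **kwargs):
                result = attr(*args, **kwargs)
                primitive = isinstance(result, (int, float, str, bool, type(None)))
                return result if primitive else Proxy(result)
            return forward
        return attr

p = Proxy([3, 1, 2])
assert p.count(3) == 1             # primitive result comes back directly
copied = p.copy()                  # non-primitive result comes back wrapped
assert isinstance(copied, Proxy)
assert copied.fetch() == [3, 1, 2]
```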

So overall the benefits of this change are:

  1. Saving mental and performance overhead of unneeded .write calls
  2. More coherent relationship between python table/blob object and folder-backed, less error prone with multiple sources of truth
  3. Saving mental overhead of separate pinning APIs
  4. Clearer and lower mental overhead .remote APIs
  5. Powerful native remote object support

From SyncLinear.com | KIT-83

I consistently see the user script hanging when copying a local package to the cluster.

Hi, I consistently see my script hanging when it copies a local package to the server; is there a way, from the server side, to display which packages are actually being copied?

/work/rh/scripts/self-hosted.py
INFO | 2023-05-31 20:38:49,626 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-05-31 20:39:24,493 | Running command on rh-cluster: ray start --head
INFO | 2023-05-31 20:39:46,019 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
Warning: Identity file /home/ytang/.ssh/id_rsa not accessible: No such file or directory.
INFO | 2023-05-31 20:39:50,904 | Setting up Function on cluster.
INFO | 2023-05-31 20:39:51,059 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:39:51,127 | Authentication (publickey) successful!
2023-05-31 20:39:51,128| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-05-31 20:39:51,128 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-05-31 20:39:51,288 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:39:51,418 | Authentication (publickey) successful!
INFO | 2023-05-31 20:39:51,674 | Copying local package work to cluster
root@35c45fe5c801:/work/rh# cd /work/rh ; /usr/bin/env /usr/bin/python3 /root/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 54577 -- /work/rh/scripts/self-hosted.py
INFO | 2023-05-31 20:46:22,533 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-05-31 20:46:27,993 | Running command on rh-cluster: ray start --head
INFO | 2023-05-31 20:46:28,686 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
Warning: Identity file /home/ytang/.ssh/id_rsa not accessible: No such file or directory.
INFO | 2023-05-31 20:46:29,852 | Setting up Function on cluster.
INFO | 2023-05-31 20:46:29,917 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:46:30,028 | Authentication (publickey) successful!
2023-05-31 20:46:30,028| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-05-31 20:46:30,028 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-05-31 20:46:30,081 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:46:30,157 | Authentication (publickey) successful!
INFO | 2023-05-31 20:46:30,413 | Copying local package work to cluster

Consistently hit "http.client.BadStatusLine" issue in self-hosted tests.

Describe the bug
Hi, recently I consistently hit a BadStatusLine issue as follows; maybe it is related to a urllib library issue?

client@4c31ddeb9ade:/zip$ python test_self_hosted_llm.py
INFO | 2023-06-14 18:24:16,048 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-06-14 18:24:16,921 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-06-14 18:24:16,981 | Authentication (publickey) successful!
2023-06-14 18:24:16,982| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-06-14 18:24:16,982 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-06-14 18:24:17,115 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-06-14 18:24:17,174 | Authentication (publickey) successful!
INFO | 2023-06-14 18:24:17,224 | Running command on rh-cluster: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Identity file /home/server/.ssh/id_rsa not accessible: Permission denied.
pkill: killing pid 255251 failed: Operation not permitted
pkill: killing pid 255253 failed: Operation not permitted
INFO | 2023-06-14 18:24:17,274 | Running command on rh-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cluster.log 2>&1'
Warning: Identity file /home/server/.ssh/id_rsa not accessible: Permission denied.
INFO | 2023-06-14 18:24:20,324 | Running command on rh-cluster: ray start --head
WARNING | 2023-06-14 18:24:21,357 | /home/client/.local/lib/python3.10/site-packages/runhouse/rns/function.py:110: UserWarning: reqs and setup_cmds arguments has been deprecated. Please use env instead.
warnings.warn(

INFO | 2023-06-14 18:24:21,358 | Setting up Function on cluster.
INFO | 2023-06-14 18:24:21,495 | Installing packages on cluster rh-cluster: ['transformers', 'torch', 'Package: zip']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 300, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ?ÿÿ?ÿÿ ?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.10/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 300, in _read_status
raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\x00\x00\x18\x04\x00\x00\x00\x00\x00\x00\x04\x00?ÿÿ\x00\x05\x00?ÿÿ\x00\x06\x00\x00 \x00þ\x03\x00\x00\x00\x01\x00\x00\x04\x08\x00\x00\x00\x00\x00\x00?\x00\x00'))
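
For what it's worth, the garbled bytes in the BadStatusLine payload look like an HTTP/2 SETTINGS frame, which would mean the HTTP/1.1 client is talking to a server that speaks HTTP/2 (for example, a gRPC server still listening on port 50052). This is only a guess; here is a minimal sketch decoding the first 9 bytes of the payload from the traceback above:

```python
# Decode the 9-byte HTTP/2 frame header at the start of the BadStatusLine
# payload (\x00\x00\x18\x04... in the ProtocolError above).
raw = b"\x00\x00\x18\x04\x00\x00\x00\x00\x00"

length = int.from_bytes(raw[0:3], "big")                  # payload length: 24
frame_type = raw[3]                                        # 0x04 == SETTINGS frame
flags = raw[4]                                             # no flags set
stream_id = int.from_bytes(raw[5:9], "big") & 0x7FFFFFFF   # stream 0 = connection-level

print(length, frame_type, flags, stream_id)  # → 24 4 0 0
```

A SETTINGS frame on stream 0 is exactly what an HTTP/2 (or gRPC) server sends first, so the client's HTTP/1.1 parser fails to read it as a status line.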
Versions
Please run the following and paste the output below.

wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]

Relevant packages:
awscli==1.27.153
boto3==1.26.153
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse @ file:///tmp/runhouse-0.0.6-py3-none-any.whl
skypilot==0.3.1
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.9.0
wheel==0.38.4

sh: 1: sky: not found
sh: 1: sky: not found
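
As an aside, the log above also warns "Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use". A quick way to check whether a stale forwarder is still holding the local port (a hypothetical diagnostic, not part of Runhouse) is:

```python
# Check whether a local TCP port is already bound before retrying the tunnel.
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    # connect_ex returns 0 when something is listening on (host, port)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

print(port_in_use(50052))
```

If this prints True before the cluster is restarted, killing the leftover process holding 50052 may resolve the tunnel warning.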


Runhouse 2023 Roadmap 🚗

Please help prioritize our roadmap! We have a long list of projects we'd like to complete to make Runhouse robust 🦾, comprehensive 🎨, and flexible 🙆‍♀️ across research and production usage. Please comment which items resonate for your use cases, or let us know if there are features we've missed!

Compute

  • On-prem Clusters
  • Sending functions to K8s
  • Sending functions to Slurm
  • HTTP endpoints
    • Instant HTTP endpoints for Functions (via Ray Serve)
    • OpenAPI - Auto-generating docs or clients in other programming languages
    • Custom ASGI apps
  • Asyncio support
  • Sending functions to serverless compute - Modal, AWS, GCP, Azure
  • Auto-handling dependency matrices for packages (e.g. PyTorch<>CUDA versions)
  • Docker
    • Custom images
    • Sending functions inside a docker container
    • Runhouse AMIs
  • Autoscaling and multi-region
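
The "Instant HTTP endpoints for Functions" item above can be illustrated generically. This is not Runhouse's or Ray Serve's actual API, just a self-contained stdlib sketch of wrapping a plain Python function behind an HTTP endpoint:

```python
# Expose a plain Python function over HTTP using only the standard library.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def add(a, b):  # the function we want to serve
    return a + b

class FunctionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body like {"args": [2, 3]} and call the function.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"result": add(*body["args"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), FunctionHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
```

A POST to the server with body `{"args": [2, 3]}` returns `{"result": 5}`; a real implementation would add serialization, auth, and docs generation (the OpenAPI item above).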

Data

  • Blobs - AWS, GCP, Azure
  • New Tables - BigQuery, Databricks, Postgres, Snowflake, Redshift, Delta lake
  • KVStores - Redis, BigTable, ElastiCache
  • VectorStores

Accessibility

  • Monitoring - Prometheus, Grafana, Datadog, Pagerduty
  • Artifact lineage and forking
  • Resource search and discovery (e.g. Backstage)
  • Custom triggers, alerts (including human-in-the-loop), retries

Management

  • Secrets - Custom Vault, AWS, GCP, Azure, (Akeyless)
  • Groups - SAML, Okta, Azure AD, AWS, GCP
  • Management of underlying resources in dashboard UI
  • Resource versioning and history
  • Networking - VPCs, IP allowlists, Sidecars/Service meshes (Envoy/LinkerD/Istio)

Tutorials

  • Using Runhouse with other ML platforms - e.g. Airflow, Sagemaker, Vertex, KfP, MetaFlow, MLFlow, etc.
  • Continual learning (e.g. DLRM daily retraining)
  • Large-scale Training and HPO
  • Distributed inference (e.g. OPT or BLOOM)
  • E2E model deployment, including ingress and firewall management.

