run-house / runhouse
Fast, Pythonic AI services and workflows on your own infra. Unobtrusive, debuggable, PyTorch-like APIs.
Home Page: https://run.house
License: Apache License 2.0
From SyncLinear.com | KIT-75
Describe the bug
Please provide a clear and concise description of what cold start looks like.
I see the docs mention a couple of methods to speed up the load time for models; it would be great if objective numbers could be added. Ray also provides methods to combat cold start, and I see the library is being used, but do you use such methods?
For example, if you look at the img below from this article, most providers' cold starts are below 100s (see img), and most providers list P90/P70/P50 values to help frame the cold start problem and its solutions in those terms.
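The P50/P90 framing mentioned above can be computed directly from measured samples. A minimal sketch (the sample durations below are made up, and `cold_start_report` is an illustrative helper, not a runhouse API):

```python
# Summarize cold-start latency samples as P50/P70/P90, the way the
# provider comparison in the linked article reports them.
from statistics import quantiles

def cold_start_report(samples: list[float]) -> dict:
    """Summarize cold-start samples (seconds) as P50/P70/P90."""
    cuts = quantiles(sorted(samples), n=100, method="inclusive")
    # quantiles() returns 99 cut points; index k-1 is the k-th percentile.
    return {f"p{k}": round(cuts[k - 1], 2) for k in (50, 70, 90)}

# Hypothetical cold-start measurements, in seconds:
report = cold_start_report([12.0, 35.5, 48.1, 60.2, 95.0, 110.3])
```

The benchmark numbers the issue asks for would come from repeatedly timing an up-and-call cycle against a fresh cluster and feeding the durations into a summary like this.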
Other relevant stuff:
https://news.ycombinator.com/item?id=35738072
https://www.banana.dev/blog/turboboot
Super cool. I have an existing PyTorch project that has over 100 .to(device) calls. Is there an easy way to transform our codebase to incorporate Runhouse, or should I manually change all my .to calls to accommodate Runhouse?
cc: @carolineechen @dongreenberg
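One hedged sketch of an answer to the question above (assuming the rh.cluster/rh.function usage shown elsewhere in these issues; `embed` and `resolve_device` are illustrative names, not runhouse APIs): the existing .to(device) calls can usually stay put, because Runhouse ships the whole function to the cluster, where device resolution happens, rather than requiring call-site changes.

```python
def resolve_device() -> str:
    """Pick 'cuda' when a GPU is visible, else 'cpu' (torch import is optional)."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

def embed(batch):
    # Runs on the cluster once the function is sent there; any existing
    # model.to(device) calls inside would behave the same way, e.g.:
    #   model = MyModel().to(device)   # unchanged .to(device) call
    device = resolve_device()
    return [x * 2 for x in batch], device

if __name__ == "__main__":
    import runhouse as rh
    gpu = rh.cluster(name="rh-a10", instance_type="A10:1")
    # Ship the function as-is; no .to(device) rewrites needed locally.
    remote_embed = rh.function(embed).to(gpu)
    print(remote_embed([1, 2, 3]))
```

Whether this covers every pattern (e.g. device logic split across modules) would depend on the codebase, so treat it as a starting point rather than a migration recipe.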
Describe the bug
Test with langchain function self_hosted_huggingface_instructor_embedding_documents(), it transfers small files from client to server, the client hits the following error during the process:
INFO | 2023-08-01 21:57:49,547 | Setting up Function on cluster.
INFO | 2023-08-01 21:57:49,547 | Copying folder from file:///root/t to: rh-cls
sky.exceptions.CommandError: Command rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_root/3651d5b8ee/%C -o ControlPersist=300s" '/root/t/' [email protected]:'~/t/' failed with return code 2.
Failed to rsync up: /root/t/ -> ~/t/. Ensure that the network is stable, then retry.
Then, singling the command out and launching it directly:
#rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i /root/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_root/3651d5b8ee/%C -o ControlPersist=300s" '/root/t/' [email protected]:'~/t/'
protocol version mismatch -- is your shell clean?
(see the rsync manpage for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(622) [sender=3.2.7]
If relevant, include the steps or code snippet to reproduce the error.
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
boto3==1.28.17
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4
Additional context
Clicking on the Discord link in the README (both the Discord badge and in the Getting Help section) goes to an "Invite Invalid" page:
Is Discord still the recommended way to ask questions, or should I post them as Issues? I'm curious about this project :)
Overview and progress tracker for the secrets management revamp, including new APIs and support for new secret types and providers.
Keeping track of secrets and keys for your various cloud, cluster, and dev accounts, and sharing them across dev environments and teammates is manual and messy. Providing secrets management for Runhouse-adjacent work (e.g. cloud providers for Runhouse clusters, API keys used alongside Runhouse functions, etc) makes it easier to onboard Runhouse Den. Even as a standalone, Runhouse Secrets can be an easy way to get started with storing, keeping track of, and sharing keys.
Runhouse already has basic secrets management support, including saving/syncing provider secrets to default locations, and a login/logout flow. The secrets flow is currently quite separate from the rest of RH resource abstractions, but can benefit from inheriting the properties expected from RH resources, including naming, saving, and sharing.
Converting secrets to a RH resource makes it easier to further develop secrets to support sharing across users/devices, add flexibility to the types of secrets, and extend to new provider-specific secrets.
rh.Secrets.put/get
custom_secret = rh.secret(name="my_secret", values={"my_key": "my_value"})
custom_secret = custom_secret.write(path="~/.rh/secrets/custom_secret.json")
aws_secret = rh.provider_secret("aws") # extracts from default path or env vars
aws_secret.values
>>> {'access_key': 'XXX_KEY', 'secret_key': 'YYY_KEY'}
lambdalabs_secret = rh.provider_secret("lambda", values={"api_key": "*****"}).write()
cluster.sync_secrets(["aws", "lambda"])
env_secret = rh.env_secret(name="my_env_vars", env_vars=["OPENAI_API_KEY"]) # extracts from os.environ
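The Secret-as-resource idea sketched in the API examples above could look roughly like this (all names here are illustrative, not the actual runhouse implementation): a named resource holding values, with write/load against a JSON path.

```python
import json
from pathlib import Path

class Secret:
    """Hypothetical sketch of a Secret resource: named, writable, reloadable."""

    def __init__(self, name: str, values: dict):
        self.name = name
        self.values = values

    def write(self, path: str) -> "Secret":
        # Persist the values to a JSON file, creating parent dirs as needed.
        p = Path(path).expanduser()
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_text(json.dumps(self.values))
        return self

    @classmethod
    def from_path(cls, name: str, path: str) -> "Secret":
        # Reload a previously written secret from disk.
        return cls(name, json.loads(Path(path).expanduser().read_text()))
```

Inheriting from the shared resource base would then layer naming, saving to Den, and sharing on top of this, per the issue's goal of unifying secrets with the rest of the RH resource abstractions.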
cc @dongreenberg @jlewitt1angell
From SyncLinear.com | KIT-88
Tracking issue and design stub for k8s cluster.
From SyncLinear.com | KIT-77
I'm having this bug when trying to set up a model on a Lambda cloud instance, running SelfHostedHuggingFaceLLM() after the rh.cluster() call.
from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh
gpu = rh.cluster(name="rh-a10", instance_type="A10:1").save()
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["pip:./", "transformers", "torch"])
I made sure with sky check that the Lambda credentials are set, but the error I get in the log is the following, which I haven't been able to solve.
I would appreciate any help solving this.
Curious
Hi! First off, I just wanna say runhouse is an awesome project! Really gonna revolutionize how people run machine learning workflows!
Describe the bug
I'm running into an issue where I can't run any remote functions on the cluster, but I can do a cluster.run_python(...)
Here's the code I'm running:
import runhouse as rh
cluster = rh.OnDemandCluster(
name="cpu-cluster",
instance_type="CPU:8",
provider="aws",  # options: "aws", "gcp", "azure", "lambda", or "cheapest"
)
cluster.up_if_not()
cluster.run_python(['import numpy', 'print(numpy.__version__)'])
print(cluster.check_server()) # ERRORS HERE
This runs fine until the cluster.check_server(), as you can see here:
INFO | 2023-07-10 21:51:56,953 | Loaded Runhouse config from /home/shyam/.rh/config.yaml
Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--INFO | 2023-07-10 21:51:58,623 | Found credentials in shared credentials file: ~/.aws/credentials
INFO | 2023-07-10 21:52:05,743 | Running command on cpu-cluster: python3 -c "import numpy; print(numpy.__version__)"
1.25.1
INFO | 2023-07-10 21:52:07,304 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-07-10 21:52:07,855 | Authentication (publickey) successful!
INFO | 2023-07-10 21:52:08,095 | Checking server cpu-cluster
Traceback (most recent call last):
File "/home/shyam/Code/trainyard/examples/test.py", line 54, in <module>
print(cluster.check_server())
File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 363, in check_server
self.client.check_server(cluster_config=cluster_config)
File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
self.request(
File "/home/shyam/miniconda3/envs/py310/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 41, in request
raise ValueError(
ValueError: Error calling check on server: Internal Server Error
Not sure if I'm doing something wrong here, but I think my credentials work because I can see that the cluster is being created and I can ssh into it. My package versions can be seen below, let me know if you need more information! Thanks!
Versions
Python Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
Python Version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
Relevant packages:
awscli==1.25.60
azure-cli==2.31.0
azure-cli-core==2.31.0
azure-cli-telemetry==1.0.6
azure-core==1.28.0
boto3==1.24.59
docker==6.1.3
fsspec==2023.1.0
gcsfs==2023.1.0
google-api-python-client==2.92.0
google-cloud-storage==2.10.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.7
s3fs==2023.1.0
skypilot==0.3.1
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.9.0
wheel==0.38.4
Checking credentials to enable clouds for SkyPilot.
AWS: enabled
Azure: disabled
Reason: Azure credential is not set. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: disabled
Reason: GCP tools are not installed or credentials are not set. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
Lambda: enabled
IBM: disabled
Reason: Missing credential file at /home/shyam/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
cpu-cluster 13 mins ago 1x AWS(m6i.2xlarge) INIT (down) test.py
Managed spot jobs
No in progress jobs. (See: sky spot -h)
In addition, here's the end of the setup of the cluster:
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='172.31.46.12:6380'
To connect to this Ray cluster:
import ray
ray.init()
Shared connection to 54.166.159.228 closed.
To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://127.0.0.1:8266' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
To monitor and debug Ray, view the dashboard at
127.0.0.1:8266
If connection to the dashboard fails, check your firewall settings and network configuration.
/usr/bin/prlimit
2023-07-10 21:35:36,790 INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Setup commands succeeded [LogTimer=92341ms]
2023-07-10 21:35:36,791 INFO updater.py:489 -- [7/7] Starting the Ray runtime
2023-07-10 21:35:36,792 VINFO command_runner.py:371 -- Running `export RAY_USAGE_STATS_ENABLED=0;export RAY_OVERRIDE_RESOURCES='{"CPU":8}';((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot || exit 1; which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; python -c 'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w"))';`
2023-07-10 21:35:36,792 VVINFO command_runner.py:373 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_5a4cd850fc/7112f145b3/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_USAGE_STATS_ENABLED=0;export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":8}'"'"';((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); ray stop; RAY_SCHEDULER_EVENTS=0 RAY_DEDUP_LOGS=0 ray start --disable-usage-stats --head --port=6380 --dashboard-port=8266 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir /tmp/ray_skypilot || exit 1; which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done; python -c '"'"'import json, os; json.dump({"ray_port":6380, "ray_dashboard_port":8266}, open(os.path.expanduser("~/.sky/ray_port.json"), "w"))'"'"';)'`
2023-07-10 21:35:41,238 INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Ray start commands succeeded [LogTimer=4447ms]
2023-07-10 21:35:41,238 INFO log_timer.py:25 -- NodeUpdater: i-036c634eb67821936: Applied config f62a597a450a8281871e7ace3caa155afb5dfe65 [LogTimer=183192ms]
2023-07-10 21:35:42,755 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-036c634eb67821936'] [LogTimer=515ms]
2023-07-10 21:35:42,925 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-runtime-config=f62a597a450a8281871e7ace3caa155afb5dfe65 on ['i-036c634eb67821936'] [LogTimer=170ms]
2023-07-10 21:35:43,090 INFO log_timer.py:25 -- AWSNodeProvider: Set tag ray-file-mounts-contents=24403a03b3acb79e10305dbf19904b00a057a0a1 on ['i-036c634eb67821936'] [LogTimer=165ms]
2023-07-10 21:35:43,091 INFO updater.py:188 -- New status: up-to-date
2023-07-10 21:35:43,273 INFO commands.py:836 -- Useful commands
2023-07-10 21:35:43,273 INFO commands.py:838 -- Monitor autoscaling with
2023-07-10 21:35:43,274 INFO commands.py:839 -- ray exec /home/shyam/.sky/generated/cpu-cluster.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
2023-07-10 21:35:43,274 INFO commands.py:846 -- Connect to a terminal on the cluster head:
2023-07-10 21:35:43,274 INFO commands.py:847 -- ray attach /home/shyam/.sky/generated/cpu-cluster.yml
2023-07-10 21:35:43,274 INFO commands.py:850 -- Get a remote shell to the cluster manually:
2023-07-10 21:35:43,274 INFO commands.py:851 -- ssh -o IdentitiesOnly=yes -i ~/.ssh/sky-key [email protected]
Describe the bug
I'm having an issue when trying to start up a LangChain LLM. After setting up the cluster
gpu = rh.cluster('test', instance_type='T4:1', use_spot=False)
I attempt to create the llm that will run my inferences
from langchain.llms import SelfHostedHuggingFaceLLM
llm = SelfHostedHuggingFaceLLM(model_id='dolly-v2-2-8b', hardware=gpu, model_reqs=['pip:./', 'transformers', 'torch'])
My code appears to run into some error with creating / finding a file. Hoping you all would be able to help.
INFO | 2023-04-20 11:38:47,871 | Setting up Function on cluster.
INFO | 2023-04-20 11:38:47,884 | Upping the cluster test
I 04-20 11:38:53 optimizer.py:617] == Optimizer ==
I 04-20 11:38:53 optimizer.py:628] Target: minimizing cost
I 04-20 11:38:53 optimizer.py:640] Estimated cost: $0.5 / hour
I 04-20 11:38:53 optimizer.py:640]
I 04-20 11:38:53 optimizer.py:712] Considered resources (1 node):
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760] Azure Standard_NC4as_T4_v3 4 28 T4:1 eastus 0.53 ✔
I 04-20 11:38:53 optimizer.py:760] ---------------------------------------------------------------------------------------------------
I 04-20 11:38:53 optimizer.py:760]
I 04-20 11:38:53 optimizer.py:775] Multiple Azure instances satisfy T4:1. The cheapest Azure(Standard_NC4as_T4_v3, {'T4': 1}) is considered among:
I 04-20 11:38:53 optimizer.py:775] ['Standard_NC4as_T4_v3', 'Standard_NC8as_T4_v3', 'Standard_NC16as_T4_v3'].
I 04-20 11:38:53 optimizer.py:775]
I 04-20 11:38:53 optimizer.py:781] To list more details, run 'sky show-gpus T4'.
I 04-20 11:38:53 cloud_vm_ray_backend.py:3327] Creating a new cluster: "test" [1x Azure(Standard_NC4as_T4_v3, {'T4': 1})].
I 04-20 11:38:53 cloud_vm_ray_backend.py:3327] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-20 11:38:58 cloud_vm_ray_backend.py:1156] To view detailed progress: tail -n100 -f C:\Users\stollbak/sky_logs\sky-2023-04-20-11-38-53-125409\provision.log
---------------------------------------------------------------------------
ScannerError Traceback (most recent call last)
File c:\Python310\lib\site-packages\sky\execution.py:266, in _execute(entrypoint, dryrun, down, stream_logs, handle, backend, retry_until_up, optimize_target, stages, cluster_name, detach_setup, detach_run, idle_minutes_to_autostop, no_setup, _is_launched_by_spot_controller)
265 if handle is None:
--> 266 handle = backend.provision(task,
267 task.best_resources,
268 dryrun=dryrun,
269 stream_logs=stream_logs,
270 cluster_name=cluster_name,
271 retry_until_up=retry_until_up)
273 if dryrun:
File c:\Python310\lib\site-packages\sky\utils\common_utils.py:241, in make_decorator.._record(*args, **kwargs)
240 with cls(full_name, **ctx_kwargs):
--> 241 return f(*args, **kwargs)
File c:\Python310\lib\site-packages\sky\utils\common_utils.py:220, in make_decorator.._wrapper.._record(*args, **kwargs)
219 with cls(name_or_fn, **ctx_kwargs):
--> 220 return f(*args, **kwargs)
File c:\Python310\lib\site-packages\sky\backends\backend.py:56, in Backend.provision(self, task, to_provision, dryrun, stream_logs, cluster_name, retry_until_up)
55 usage_lib.messages.usage.update_actual_task(task)
---> 56 return self._provision(task, to_provision, dryrun, stream_logs,
57 cluster_name, retry_until_up)
File c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:2220, in CloudVmRayBackend._provision(self, task, to_provision, dryrun, stream_logs, cluster_name, retry_until_up)
2217 provisioner = RetryingVmProvisioner(
2218 self.log_dir, self._dag, self._optimize_target,
2219 self._requested_features, local_wheel_path, wheel_hash)
-> 2220 config_dict = provisioner.provision_with_retries(
2221 task, to_provision_config, dryrun, stream_logs)
2222 break
File c:\Python310\lib\site-packages\sky\utils\common_utils.py:241, in make_decorator.._record(*args, **kwargs)
240 with cls(full_name, **ctx_kwargs):
--> 241 return f(*args, **kwargs)
File c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:1718, in RetryingVmProvisioner.provision_with_retries(self, task, to_provision_config, dryrun, stream_logs)
1715 to_provision.cloud.check_features_are_supported(
1716 self._requested_features)
-> 1718 config_dict = self._retry_zones(
1719 to_provision,
1720 num_nodes,
1721 requested_resources=task.resources,
1722 dryrun=dryrun,
1723 stream_logs=stream_logs,
1724 cluster_name=cluster_name,
1725 cloud_user_identity=cloud_user,
1726 prev_cluster_status=prev_cluster_status)
1727 if dryrun:
File c:\Python310\lib\site-packages\sky\backends\cloud_vm_ray_backend.py:1203, in RetryingVmProvisioner._retry_zones(self, to_provision, num_nodes, requested_resources, dryrun, stream_logs, cluster_name, cloud_user_identity, prev_cluster_status)
1202 try:
-> 1203 config_dict = backend_utils.write_cluster_config(
1204 to_provision,
...
1450 self._close_pipe_fds(p2cread, p2cwrite,
1451 c2pread, c2pwrite,
1452 errread, errwrite)
FileNotFoundError: [WinError 3] The system cannot find the path specified.
Versions
Python Platform: Windows-10-10.0.19044-SP0
Python Version: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]
Relevant packages:
awscli==1.27.115
azure-cli==2.31.0
azure-cli-core==2.31.0
azure-cli-telemetry==1.0.6
azure-core==1.26.4
boto3==1.26.115
fsspec==2023.4.0
pyarrow==11.0.0
pycryptodome==3.12.0
rich==13.3.4
runhouse==0.0.5
skypilot==0.2.5
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.7.0
wheel==0.40.0
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS CLI is not installed properly. Run the following commands:
$ pip install skypilot[aws] Credentials may also need to be set. Run the following commands:
$ pip install boto3
$ aws configure
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Azure: enabled
GCP: disabled
Reason: GCP tools are not installed or credentials are not set. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Additional context
Add any other context about the problem here.
For now, we can do this server-side; it'll only be surfaced if stream_logs=True.
Important context: ray-project/ray#5554
From SyncLinear.com | KIT-72
Accessing a function via http (in addition to grpc)
From SyncLinear.com | KIT-65
Right now we support two kinds of RNS stores for saving and loading - the Runhouse RNS and the git repo. MLFlow has a high degree of flexibility in the storage backends users can persist their logs and experiments to, and many DS teams already have these stores set up. MLFlow only provides first-class support for models as saved and loaded primitives from the store (which is funny, because "models" are a primitive we specifically don't support, on purpose). Today, people are saving and loading other infrastructure metadata as free-form strings (e.g. s3 paths), but this has significant limitations:
I think there are a few possible APIs we can provide here:
1. A lookup API, e.g. rh.Table.from_name("bert_dropout_v5"), where we go to MLFlow first to get the full RNS path for that resource, and then to Runhouse to fetch the resource itself. We could also support an API to pull a dict of all available resources for an experiment at once.
2. An mlflow.runhouse integration (or model type) which facilitates saving and loading of Runhouse resources. This would allow saving and loading of resources in a way familiar to MLFlow users, but would also add a lot of new non-model things into users' model registries.
The easiest way to think about the user journey is like this (showing a notebook-centric workflow in a system like Databricks just to stress the assumptions, but this would all work even more simply in a git+IDE setting): reuse the saved resource directly (e.g. rh.Function.from_name("yolo_v5_training_dropout")), copy out the full logic (including functions) from the notebook, or copy out the logic and flow but move the reusable functions into a shared git repo and import them into the script.
Cc @rmehyde, @ankmathur96
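The lookup flow described above (resolve a friendly name via the MLFlow-side store to a full RNS path, then fetch the resource from Runhouse) can be sketched as a two-step resolver. All names here are hypothetical stand-ins, not real runhouse or mlflow APIs:

```python
def resolve(name, mlflow_store, rns_store):
    """Resolve a friendly name to a resource via two stores.

    mlflow_store maps experiment-level names -> full RNS paths;
    rns_store maps RNS paths -> the resources themselves.
    """
    rns_path = mlflow_store.get(name)
    if rns_path is None:
        raise KeyError(f"{name} is not registered in the experiment store")
    return rns_store[rns_path]
```

A dict-of-all-resources-per-experiment API would just be the same lookup mapped over every name the experiment store holds.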
From SyncLinear.com | KIT-80
This is a follow-up to a separate offline discussion about API feedback.
Since the typos were fixed in this commit by @carolineechen, there is no need for me to submit a fix-up.
The open issues are with the rendering of certain inline markup on the Runhouse website (v. latest), for example:
Docstrings under Package Factory Method, Blob Factory Method:
Page Secrets in Vault
The rendering is correct on the local build.
I also ruled out any conflicting or overriding configurations in the files below:
- docs/conf.py, for any conflicting sphinx extension or html theme
- .readthedocs.yaml under the project root, for any configuration overrides
I was not able to cross-reference the doc built from the main branch, which returned "404 not found" at the time of submitting this issue.
We'll see if those issues persist the next time the remote doc is built.
A simple use case is logging in with the system command instead of the Python API:
!runhouse login [TOKEN]
Currently, the CLI is hardcoded with interactive=True:
Line 27 in 560a528
It's a minor quality of life improvement.
See above
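The fix for the hardcoded interactive=True above could look roughly like this. This is a hypothetical sketch using stdlib argparse as a stand-in for the typer CLI runhouse actually uses, and `do_login` is an illustrative name, not the real internals: go interactive only when no token was supplied on the command line.

```python
import argparse

def do_login(token, interactive):
    """Stand-in for the real login logic."""
    if token:
        return f"logged in with token ending {token[-4:]}"
    if not interactive:
        raise SystemExit("a token is required in non-interactive mode")
    return "interactive login"

def main(argv=None):
    parser = argparse.ArgumentParser(prog="runhouse")
    sub = parser.add_subparsers(dest="cmd", required=True)
    login = sub.add_parser("login")
    # Token is optional; its presence decides interactive vs. not.
    login.add_argument("token", nargs="?", default=None)
    args = parser.parse_args(argv)
    return do_login(args.token, interactive=args.token is None)
```

With this shape, `runhouse login MYTOKEN` works from a notebook's `!` shell without prompting, while bare `runhouse login` keeps the current interactive flow.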
Excited to get Runhouse integration up on NatML 😄
The feature
It would be interesting if Runhouse could also interface with a cluster in the form of an existing Slurm cluster.
Motivation
I am part of a team managing a Slurm (GPU) cluster. On the other hand, I have users who are interested in being able to run large language models via Runhouse (https://langchain.readthedocs.io/en/latest/modules/llms/integrations/self_hosted_examples.html). It would be excellent if I could bridge this gap between supply and demand with Runhouse. From what I have read in the documentation so far, Runhouse does not seem to come with a Slurm interface.
What the ideal solution looks like
I am completely new to Runhouse, so this may not be the ideal solution model, but I imagine this could be supported as a bring-your-own cluster with a little bit of extra interaction between Runhouse and Slurm to request the necessary resources (maybe from the Cluster factory method) as a job / jobs in Slurm (probably through the Slurm REST API). Once the jobs are running, the nodes involved can be contacted by Runhouse as a BYO cluster.
Hi team, the Runhouse docs for on-demand clusters were not super clear about the format of the image_id, but helpfully my initial attempts to bring up a GCP cluster with e.g. image_id="pytorch-cpu-latest" (taken from the GCP docs) raised a clear error, e.g. ValueError: Image 'pytorch-latest-cpu' not found in GCP.
I ended up going into the skypilot repo for clarification and found a GCP example in their yaml spec: projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004
I modified the above for the image I wanted, projects/deeplearning-platform-release/global/images/family/pytorch-1-13-cpu-v20230807-debian-11-py310, and while runhouse allowed me to submit, it hung until it timed out (and I saw no indication in the GCP Console that the instance was coming up).
I tried to run a similar command via sky launch and saw the error, which I reported to them in this GitHub issue. I am raising it here as well in case you want to update your wrapping code to catch this error.
Versions
Please run the following and paste the output below.
Python Platform: Linux-6.4.12-arch1-1-x86_64-with-glibc2.38
Python Version: 3.10.13 (main, Sep 4 2023, 15:52:34) [GCC 13.2.1 20230801]
Relevant packages:
boto3==1.28.40
fastapi==0.103.1
fsspec==2023.5.0
gcsfs==2023.5.0
google-api-python-client==2.97.0
google-cloud-storage==2.10.0
pyarrow==13.0.0
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.11
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.41.2
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS credentials are not set. Run the following commands:
$ pip install boto3
$ aws configure
$ aws configure list # Ensure that this shows identity is set.
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
Azure: disabled
Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: enabled
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
IBM: disabled
Reason: Missing credential file at /home/user/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
Managed spot jobs
No in progress jobs. (See: sky spot -h)
From SyncLinear.com | KIT-63
Tried reloading "sd_generate" from inside a notebook, and it hung trying to copy over the "./" of the notebook's environment (which was huge).
From SyncLinear.com | KIT-54
At one point we had breakpoints and pdb working using RPyC (it's still in function.py at 379). It might be worth trying to get that working again.
Another option barring that:
When user calls fn.pdb, start a new rpc server on the cluster on a different port and a new screen name (e.g. fn_name_timestamp), dedicated to this function. Then, start an ssh terminal into the cluster with `screen -r screen_name`.
Also could be worth exploring the pty approach Modal took.
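A rough sketch of the dedicated-server idea above (illustrative only; the function name, port, screen session naming, and server module invocation are assumptions, not runhouse's actual API):

```python
import time

# When the user requests a debugger, start a dedicated server for that
# function in a named screen session on the cluster, then attach an SSH
# terminal to it. fn_name and port are placeholders.
fn_name = "sd_generate"
session = f"{fn_name}_{int(time.time())}"
port = 50100

# Command to run on the cluster (e.g. via an ssh/run helper):
start_cmd = (
    f"screen -dmS {session} bash -c "
    f"'python -m runhouse.servers.http.http_server --port {port}'"
)
# Command for the client to open an interactive terminal into that session:
attach_cmd = f"ssh -t user@cluster-ip screen -r {session}"
print(start_cmd)
print(attach_cmd)
```

The timestamped session name keeps concurrent debug sessions for the same function from colliding.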
Cc @Caroline
From SyncLinear.com | KIT-73
Maybe use grpclib for non-dev install.
From SyncLinear.com | KIT-14
A new PostgresTable Table subclass. Maybe we should have a SQLTable subclass which defaults to DuckDB, and then support Postgres, MySQL, SQLite, others?
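A minimal sketch of what that hierarchy could look like. All names here (SQLTable, SQLiteTable, PostgresTable, fetch_all) are hypothetical, not runhouse's API; SQLite stands in for DuckDB so the example runs with the stdlib only:

```python
import os
import sqlite3
import tempfile

class SQLTable:
    dialect = "duckdb"  # proposed default backend

    def __init__(self, uri: str, table: str):
        self.uri = uri
        self.table = table

    def fetch_all(self):
        raise NotImplementedError

class SQLiteTable(SQLTable):
    dialect = "sqlite"

    def fetch_all(self):
        # Open a connection per call; fine for a sketch, pooled in practice.
        with sqlite3.connect(self.uri) as conn:
            return conn.execute(f"SELECT * FROM {self.table}").fetchall()

class PostgresTable(SQLTable):
    dialect = "postgres"
    # Would wrap a driver like psycopg in a real implementation.

# Usage with a throwaway SQLite database file:
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'ada')")

t = SQLiteTable(path, "users")
print(t.fetch_all())  # [(1, 'ada')]
```

Keeping the dialect on the class lets each backend subclass swap in its own driver while the base class owns the shared table/URI plumbing.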
From SyncLinear.com | KIT-76
Please provide an example of how to launch runhouse on my local server.
Instead of using AWS, GCP, etc. (I don't have any cloud accounts), I have a V100 GPU on my local server.
I would like to know how to set up the server and client sides and how they interact. I have been trying all the examples from
https://github.com/run-house/tutorials, but none of them worked.
Please give more setup instructions and guidance.
Here is what I hit when I was trying to set up an on-prem cluster:
$ cat rh.py
import runhouse as rh
from diffusers import StableDiffusionPipeline

def sd_generate(prompt):
    model = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-base").to("cpu")
    return model(prompt).images[0]

gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'htang', 'ssh_private_key': '/home/htang/.ssh/id_rsa'},
                 name='rh-cluster')

sd_generate = rh.function(sd_generate).to(gpu, reqs=["./", "torch", "diffusers"])
img = sd_generate("An oil painting of Keanu Reeves eating a sandwich.")
print(type(img))
img.save("sd.png")
img.show()
$ python rh.py
INFO | 2023-05-28 04:11:35,284 | Loaded Runhouse config from /home/ytang/.rh/config.yaml
INFO | 2023-05-28 04:11:36,858 | Running command on rh-cluster: ray start --head
INFO | 2023-05-28 04:11:37,663 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
INFO | 2023-05-28 04:11:37,773 | Setting up Function on cluster.
INFO | 2023-05-28 04:11:38,044 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-28 04:11:38,105 | Authentication (publickey) successful!
INFO | 2023-05-28 04:11:38,361 | Running command on rh-cluster: ray start --head
INFO | 2023-05-28 04:11:39,023 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
INFO | 2023-05-28 04:11:39,137 | Copying local package scripts to cluster
INFO | 2023-05-28 04:11:39,327 | Installing packages on cluster rh-cluster: ['./', 'torch', 'diffusers']
Traceback (most recent call last):
File "/home/ytang/scripts/./rh.py", line 15, in
sd_generate = rh.function(sd_generate).to(gpu, reqs=["./", "torch", "diffusers"])
File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/rns/function.py", line 119, in to
new_function.system.install_packages(new_function.reqs)
File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 205, in install_packages
self.client.install_packages(to_install)
File "/home/ytang/.local/lib/python3.10/site-packages/runhouse/servers/grpc/unary_client.py", line 59, in install_packages
server_res = self.stub.InstallPackages(message)
File "/home/ytang/.local/lib/python3.10/site-packages/grpc/_channel.py", line 946, in call
return _end_unary_response_blocking(state, call, False, None)
File "/home/ytang/.local/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1685247106.845643418","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1685247106.845642647","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
When running `runhouse start --screen`, it shows an error like:
python3 command was not found. Make sure you have python3 installed.
But when running without --screen, it works fine.
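One plausible explanation (an assumption, not the project's documented diagnosis): `screen -dm bash -c ...` spawns a fresh non-login shell, which may not inherit the PATH of the interactive shell (e.g. a conda env's bin directory), so `python3` appears missing. A workaround is to build the command around the absolute interpreter path instead of a PATH lookup:

```python
import shutil
import sys

# What the interactive shell sees vs. the absolute path that survives a
# fresh non-login shell:
print("python3 on PATH here:", shutil.which("python3"))
interpreter = sys.executable  # absolute path of the running Python

# A screen command that does not depend on PATH lookup:
cmd = (
    f'screen -dm bash -c "{interpreter} '
    f'-m runhouse.servers.http.http_server --port 2222"'
)
print(cmd)
```
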
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
python collect_env.py
Python Platform: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.17
Python Version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0]
Relevant packages:
boto3==1.33.11
fastapi==0.103.1
fsspec==2023.5.0
pyarrow==13.0.0
rich==13.5.2
runhouse==0.0.13
skypilot==0.4.0
sshfs==2023.10.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS credentials are not set. Run the following commands:
$ pip install boto3
$ aws configure
$ aws configure list # Ensure that this shows identity is set.
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
Azure: disabled
Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: disabled
Reason: GCP tools are not installed. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
Credentials may also need to be set. Run the following commands:
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
IBM: disabled
Reason: Missing credential file at /home/admins/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
Kubernetes: disabled
Reason: Credentials not found - check if ~/.kube/config exists.
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Additional context
full logs:
runhouse start --port 2222
INFO | 2023-12-11 02:29:30.713426 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:32.342877 | Using port: 2222.
INFO | 2023-12-11 02:29:32.343102 | Starting API server using the following command: /home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server.
Executing `/home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server --port 2222`
INFO | 2023-12-11 02:29:34.061997 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:36.233910 | Launching HTTP server on port: 2222.
INFO | 2023-12-11 02:29:36.234118 | Launching Runhouse API server with den_auth=False and use_local_telemetry=False on host: 0.0.0.0 and port: 32300
INFO: Started server process [15764]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:32300 (Press CTRL+C to quit)
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [15764]
runhouse start --port 2222 --screen
INFO | 2023-12-11 02:29:45.997178 | NumExpr defaulting to 8 threads.
INFO | 2023-12-11 02:29:46.455935 | Using port: 2222.
INFO | 2023-12-11 02:29:46.456143 | Starting API server using the following command: /home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server.
Executing `screen -dm bash -c "/home/admins/miniconda3/bin/python3 -m runhouse.servers.http.http_server --port 2222 2>&1 | tee -a '/home/admins/.rh/server.log' 2>&1"`
python3 command was not found. Make sure you have python3 installed.
The feature
Support for HPU Habana hardware accelerator in runhouse
Motivation
With the increasing demand for high-performance computing and the need for faster processing of large-scale machine learning and deep learning workloads, HPUs have emerged as powerful hardware accelerators. They offer significant performance advantages over traditional CPUs and GPUs for tasks involving LLMs, neural networks, large-scale data processing, and scientific simulations.
What the ideal solution looks like
By integrating support for HPUs in runhouse, you would provide developers with a platform that enables them to leverage these advanced hardware accelerators seamlessly. This would open up new possibilities for building and running computationally intensive applications and workflows directly on runhouse infrastructure.
For example, the client would be able to remotely launch applications on an AWS HPU server with:
rh.cluster(name='rh-gaudi', instance_type='dl1.24xlarge', provider='aws').save()
https://aws.amazon.com/ec2/instance-types/dl1/
https://developer.habana.ai/
Additional context
Support for self-hosted HPU servers as well.
Describe the bug
The following fails on an M1 MacBook Pro:
conda create -n runhouse python==3.10
conda activate runhouse
pip install --no-cache "runhouse[aws]"
The error is:
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [48 lines of output]
running egg_info
writing lib3/PyYAML.egg-info/PKG-INFO
writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
writing top-level names to lib3/PyYAML.egg-info/top_level.txt
Traceback (most recent call last):
File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/Users/abeatson/mambaforge/envs/runhouse3/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
self.run_setup()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
exec(code, locals())
File "<string>", line 271, in <module>
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
return distutils.core.setup(**attrs)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
super().run_command(command)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 321, in run
self.find_sources()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 329, in find_sources
mm.run()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 551, in run
self.add_defaults()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/egg_info.py", line 589, in add_defaults
sdist.add_defaults(self)
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/command/sdist.py", line 112, in add_defaults
super().add_defaults()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
self._add_defaults_ext()
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
self.filelist.extend(build_ext.get_source_files())
File "<string>", line 201, in get_source_files
File "/private/var/folders/1n/t9p25xtd4sl4zxdt57hjlc7m0000gn/T/pip-build-env-obpqea6w/overlay/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
raise AttributeError(attr)
AttributeError: cython_sources
[end of output]
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Output:
python collect_env.py
Python Platform: macOS-13.4-arm64-arm-64bit
Python Version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:27:15) [Clang 11.1.0 ]
Relevant packages:
wheel==0.42.0
sh: sky: command not found
sh: sky: command not found
Currently, the traceback from a remote error prints to the logs above the rest of the traceback, not to stderr (nor is it formatted separately from other logs).
From SyncLinear.com | KIT-66
Describe the bug
Hi, with runhouse version 0.0.9, I consistently hit an error when running the following script (it worked with previous versions):
import runhouse as rh

gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'rhclient', 'ssh_private_key': '/home/rhclient/.ssh/id_rsa'},
                 name='rh-cls')
print("#################Restart server")
print("Exit now")
....
INFO | 2023-07-31 18:30:20,983 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-07-31 18:30:21,832 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:21,944 | Authentication (publickey) failed.
INFO | 2023-07-31 18:30:21,951 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:22,010 | Authentication (publickey) failed.
2023-07-31 18:30:22,010| ERROR | Could not open connection to gateway
ERROR | 2023-07-31 18:30:22,010 | Could not open connection to gateway
2023-07-31 18:30:22,011| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-31 18:30:22,011 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-07-31 18:30:22,011 | Server rh-cls is up, but the HTTP server may not be up.
INFO | 2023-07-31 18:30:22,011 | Restarting HTTP server on rh-cls.
INFO | 2023-07-31 18:30:22,011 | Running command on rh-cls: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:22,123 | Running command on rh-cls: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cls.log 2>&1'
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:27,237 | Checking server rh-cls again.
Traceback (most recent call last):
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 357, in check_server
self.connect_server_client()
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 324, in connect_server_client
self._rpc_tunnel, connected_port = self.ssh_tunnel(
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 411, in ssh_tunnel
ssh_tunnel.start()
File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1331, in start
self._raise(BaseSSHTunnelForwarderError,
File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1174, in _raise
raise exception(reason)
sshtunnel.BaseSSHTunnelForwarderError: Could not establish session to SSH gateway
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/devspace/test_self_hosted_llm.py", line 14, in
gpu = rh.cluster(ips=['127.0.0.1'],
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 58, in init
self.check_server()
File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 379, in check_server
self.client.check_server(cluster_config=cluster_config)
AttributeError: 'NoneType' object has no attribute 'check_server'
Versions
Please run the following and paste the output below
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
boto3==1.28.15
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.1
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4
Additional context
I started:
The feature
I wonder if you have any plans to add features and interfaces that allow runhouse to manage GPU devices on a local network (rather than only native cloud devices)?
Motivation
I need to deploy on local devices instead of relying entirely on cloud devices.
From SyncLinear.com | KIT-81
Rather than always appending "./" to reqs, etc.
From SyncLinear.com | KIT-64
Describe the bug
Hi,
I'm trying to use a GPU system on our local network, but I'm running into issues.
Basic question: does the runhouse package need to be installed on the remote GPU system? I couldn't figure this out from the documentation.
Here is the snippet of code I'm trying to run:
import runhouse as rh
import pdb; pdb.set_trace()

cluster = rh.cluster(
    name="mlw-cluster",
    ips=['xx.xx.xx.xx'],
    ssh_creds={'ssh_user': 'lab', 'ssh_private_key': '/export/lab/.ssh/mlw01.key'},
)

def num_cpus():
    import multiprocessing
    return f"Num cpus: {multiprocessing.cpu_count()}"

num_cpus()

num_cpus_cluster = rh.function(name="num_cpus_cluster", fn=num_cpus).to(system=cluster, reqs=["./"])
I get the following error when creating the cluster:
(Pdb) c
2023-07-20 10:17:54,985| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:54,985 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:54,987| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:54,987 | 1 keys loaded from agent
2023-07-20 10:17:54,988| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:54,988 | 1 key(s) loaded
2023-07-20 10:17:54,988| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:54,988 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:54,988| INF | MainThrea/0978@sshtunnel | Connecting to gateway: xx.x.xxx.x:22 as user 'lab'
INFO | 2023-07-20 10:17:54,988 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:54,988| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:54,989| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'asdWEQWEQWe'
2023-07-20 10:17:55,012| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,043| INF | Thread-1/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,043 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,278| INF | Thread-1/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,278 | Authentication (publickey) successful!
2023-07-20 10:17:55,279| ERR | MainThrea/1230@sshtunnel | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-20 10:17:55,279 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
2023-07-20 10:17:55,280| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:55,280 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:55,282| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:55,282 | 1 keys loaded from agent
2023-07-20 10:17:55,282| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:55,282 | 1 key(s) loaded
2023-07-20 10:17:55,283| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:55,283 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:55,283| INF | MainThrea/0978@sshtunnel | Connecting to gateway: 172.17.10.110:22 as user 'lab'
INFO | 2023-07-20 10:17:55,283 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:55,283| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:55,283| WAR | MainThrea/1618@sshtunnel | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
WARNING | 2023-07-20 10:17:55,283 | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
2023-07-20 10:17:55,284| INF | MainThrea/1374@sshtunnel | Closing all open connections...
INFO | 2023-07-20 10:17:55,284 | Closing all open connections...
2023-07-20 10:17:55,284| DEB | MainThrea/1378@sshtunnel | Listening tunnels: None
2023-07-20 10:17:55,284| WAR | MainThrea/1450@sshtunnel | Tunnels are not started. Please .start() first!
WARNING | 2023-07-20 10:17:55,284 | Tunnels are not started. Please .start() first!
2023-07-20 10:17:55,284| INF | MainThrea/1453@sshtunnel | Closing ssh transport
INFO | 2023-07-20 10:17:55,284 | Closing ssh transport
2023-07-20 10:17:55,284| DEB | MainThrea/1477@sshtunnel | Transport is closed
2023-07-20 10:17:55,285| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'463095aa1803da78647cd548f37173ef'
2023-07-20 10:17:55,305| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,334| INF | Thread-3/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,334 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,578| INF | Thread-3/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,578 | Authentication (publickey) successful!
2023-07-20 10:17:55,579| INF | Srv-50053/1433@sshtunnel | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,579 | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,580 | Checking server mlw-cluster
2023-07-20 10:17:55,814| TRA | Thread-5 /0360@sshtunnel | #1 <-- ('127.0.0.1', 44364) connected
2023-07-20 10:17:55,815| TRA | Thread-5 /0316@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) send to ('127.0.0.1', 50052): b'504f5354202f636865636b2f20485454502f312e310d0a486f73743a203132372e302e302e313a35303035330d0a557365722d4167656e743a20707974686f6e2d72657175657374732f322e33312e300d0a4163636570742d456e636f64696e673a20677a69702c206465666c6174650d0a4163636570743a202a2f2a0d0a436f6e6e656374696f6e3a206b6565702d616c6976650d0a436f6e74656e742d4c656e6774683a203330300d0a436f6e74656e742d547970653a206170706c69636174696f6e2f6a736f6e0d0a0d0a7b2264617461223a20227b5c6e202020205c226e616d655c223a205c227e2f6d6c772d636c75737465725c222c5c6e202020205c227265736f757263655f747970655c223a205c22636c75737465725c222c5c6e202020205c227265736f757263655f737562747970655c223a205c22436c75737465725c222c5c6e202020205c226970735c223a205b5c6e20202020202020205c223137322e31372e31302e3131305c225c6e202020205d2c5c6e202020205c227373685f63726564735c223a207b5c6e20202020202020205c227373685f757365725c223a205c226c61625c222c5c6e20202020202020205c227373685f707269766174655f6b65795c223a205c222f6578706f72742f6c61622f2e7373682f6d6c7730312e6b65795c225c6e202020207d5c6e7d227d' >>>
2023-07-20 10:17:55,816| TRA | Thread-5 /0333@sshtunnel | <<< IN #1 <-- ('127.0.0.1', 44364) recv: b'5353482d322e302d4f70656e5353485f372e367031205562756e74752d347562756e7475302e350d0a' <<<
INFO | 2023-07-20 10:17:55,816 | Server mlw-cluster is up, but the HTTP server may not be up.
INFO | 2023-07-20 10:17:55,817 | Restarting HTTP server on mlw-cluster.
INFO | 2023-07-20 10:17:55,817 | Running command on mlw-cluster: pkill -f "python -m runhouse.servers.http.http_server"
2023-07-20 10:17:55,817| TRA | Thread-5 /0311@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) recv empty data >>>
2023-07-20 10:17:55,820| TRA | Thread-5 /0375@sshtunnel | #1 <-- ('127.0.0.1', 44364) connection closed.
INFO | 2023-07-20 10:17:56,571 | Running command on mlw-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_mlw-cluster.log 2>&1'
INFO | 2023-07-20 10:18:02,291 | Checking server mlw-cluster again.
2023-07-20 10:18:02,318| ERR | Thread-3/1893@transport | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-07-20 10:18:02,318 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-07-20 10:18:02,318| TRA | Thread-14/0357@sshtunnel | #2 <-- ('127.0.0.1', 47456) open new channel ssh error: ChannelException(2, 'Connect failed')
2023-07-20 10:18:02,318| ERR | Thread-14/0394@sshtunnel | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-07-20 10:18:02,318 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
Traceback (most recent call last):
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/export/lab/work/learn_runhouse/testmlw01.py", line 4, in <module>
cluster = rh.cluster(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 60, in __init__
self.check_server()
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 381, in check_server
self.client.check_server(cluster_config=cluster_config)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
self.request(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 35, in request
response = req_fn(
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.19.0-46-generic-x86_64-with-glibc2.35
Python Version: 3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0]
Relevant packages:
boto3==1.28.6
fastapi==0.99.0
fsspec==2023.6.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.1
wheel==0.38.4
SkyPilot collects usage data to improve its services. `setup` and `run` commands are not collected to ensure privacy.
Usage logging can be disabled by setting the environment variable SKYPILOT_DISABLE_USAGE_COLLECTION=1.
Checking credentials to enable clouds for SkyPilot.
AWS: disabled
Reason: AWS credentials are not set. Run the following commands:
$ pip install boto3
$ aws configure
$ aws configure list # Ensure that this shows identity is set.
For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
Azure: disabled
Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
$ az login
$ az account set -s <subscription_id>
For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
GCP: disabled
Reason: GCP tools are not installed. Run the following commands:
$ pip install google-api-python-client
$ conda install -c conda-forge google-cloud-sdk -y
Credentials may also need to be set. Run the following commands:
$ gcloud init
$ gcloud auth application-default login
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
Lambda: disabled
Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
https://cloud.lambdalabs.com/api-keys
to generate API key and add the line
api_key = [YOUR API KEY]
to ~/.lambda_cloud/lambda_keys
IBM: disabled
Reason: Missing credential file at /export/lab/.ibm/credentials.yaml.
Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
iam_api_key: <IAM_API_KEY>
resource_group_id: <RESOURCE_GROUP_ID>
SCP: disabled
Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
Generate API key and add the following line to ~/.scp/scp_credential:
access_key = [YOUR API ACCESS KEY]
secret_key = [YOUR API SECRET KEY]
project_id = [YOUR PROJECT ID]
OCI: disabled
Reason: `oci` is not installed. Install it with: pip install oci
For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
Cloudflare (for R2 object store): disabled
Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
$ pip install boto3
$ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
$ mkdir -p ~/.cloudflare
$ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.
Managed spot jobs
No in progress jobs. (See: sky spot -h)
Batching is critical for good compute utilization in ML. Assuming fn is written to accept a list of inputs, calling fn.batch(single_item, batch_size=10) should accumulate the inputs on the server and only call fn(list_of_items) when it has a full batch. Open questions:
From SyncLinear.com | KIT-71
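The accumulate-and-flush behavior described above could be sketched like this (BatchAccumulator and its method names are hypothetical illustrations, not a real Runhouse API):

```python
class BatchAccumulator:
    """Collect single inputs and only invoke fn(list_of_items) on a full batch."""

    def __init__(self, fn, batch_size=10):
        self.fn = fn
        self.batch_size = batch_size
        self._pending = []

    def add(self, item):
        """Queue one input; run fn on the whole batch once it's full."""
        self._pending.append(item)
        if len(self._pending) == self.batch_size:
            batch, self._pending = self._pending, []
            return self.fn(batch)  # fn receives a list of inputs
        return None  # still accumulating

acc = BatchAccumulator(fn=lambda xs: [x * 2 for x in xs], batch_size=3)
results = [acc.add(i) for i in (1, 2, 3)]
# the first two calls return None; the third flushes the batch
```

Open questions like timeouts for partial batches would layer on top of this same accumulator.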
Basic API ideas (WIP):
Create Run object (captures logs, inputs, outputs, other artifacts read or written within call, who ran, where):
res = fn(**kwargs, name="my_run")
A run is a folder (created inside local rh directory by default), and can be sent elsewhere to persist logs, results, artifact info, etc.:
rh.run(name="my_run").to("s3", path="runhouse/nlp_team/bert_ft/results")
Ideally, we can have a "default log store" setting in the user config so the logs from their runs can be sent to the same place by default when they save, rather than having to send each run one by one.
This could be the way for users to configure for artifacts/logs to flow to an existing MLFlow store, or to flow to W&B, Grafana, Datadog, etc.
Save the run to local or RNS (not all runs need to be saved)
rh.run(name="my_run").save()
Creates a run object by tracing the activity within the block - no inputs and outputs, but captures logs (perhaps several logfiles for different calls) and artifacts used:
with rh.run(name="my_run") as r:
Big feature, essentially the same as auto-caching in orchestrators - check if this run was already completed, and load results if so, otherwise run:
res = fn.get_or_run(name="yelp_review_preproc_test")
Create/name a CLI run:
r = my_cluster.run(["python test_bert.py --gpus 4 --model distilbert"], name="test_distilbert_ddp")
Inspiration: this MLFlow example
We can also support event (failure or completion) notifications through knocknock or pagerduty!
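The get_or_run auto-caching idea above could be sketched like this, assuming each run persists its result under a per-run folder (get_or_run, run_dir, and result.json are illustrative names, not the real API):

```python
import json
import os
import tempfile

def get_or_run(fn, name, run_dir, **kwargs):
    """Return the cached result for run `name` if it completed before, else run fn and cache it."""
    path = os.path.join(run_dir, name, "result.json")
    if os.path.exists(path):            # run already completed: reuse the result
        with open(path) as f:
            return json.load(f)
    result = fn(**kwargs)               # otherwise execute and persist
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result

calls = []
def expensive(x):
    calls.append(x)
    return x * 2

run_dir = tempfile.mkdtemp()
r1 = get_or_run(expensive, "yelp_review_preproc_test", run_dir, x=21)  # runs
r2 = get_or_run(expensive, "yelp_review_preproc_test", run_dir, x=21)  # cache hit
```

This mirrors what orchestrators do for step caching: the run name is the cache key, so the second call never re-executes the function.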
Cc @Caroline
From SyncLinear.com | KIT-67
I am trying Runhouse with a local pre-configured server. But that server needs the "ProxyCommand" option to SSH into it. Is there a way the ProxyCommand can be specified in the Cluster API (like in the ssh_creds dict)?
Typical way to SSH into the server is something like this:
ssh -i <keyfile> -o ProxyCommand="ssh -W %h:%p <user>@<frontendproxyhost>" <user>@<targethost>
I do have a workaround of adding the ProxyCommand to ~/.ssh/config, but it would be nice to specify it as params in the rh.cluster API for cases where the SSH commands are a bit dynamic (like in my case).
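For reference, the ~/.ssh/config workaround mentioned above looks roughly like this (hostnames and key path are placeholders):

```
# ~/.ssh/config — hop through the frontend proxy to reach the target host
Host <targethost>
    User <user>
    IdentityFile ~/.ssh/id_rsa
    ProxyCommand ssh -W %h:%p <user>@<frontendproxyhost>
```

With this in place, a plain `ssh <targethost>` (and anything Runhouse runs over SSH) picks up the proxy automatically.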
Please add support for Python 3.11
(some basic notes below, feel free to edit/comment)
GHA Setup
Types of testing, split using pytest.mark
Test profiling
Refactoring
From SyncLinear.com | KIT-85
From SyncLinear.com | KIT-82
Eliminate need to .write() a data object to the filesystem before returning from a function, which can be quite expensive - e.g. after I've preprocessed a dataset that's already backed by files in the filesystem, calling .write() is just copying them for no reason, including a costly partition.
Basically, we have the cluster object store, we should use it to avoid fs reads and writes we don't need (and save the user the trouble of knowing they need to .write() before using data remotely). This also saves us the trouble of finding places to write down data when the user doesn't feel like providing a path, or is just working with an anonymous data object (e.g. returning an rh.Table from a preprocessing fn). This will also clean up a sort of API wrinkle where a pinned object is markedly different from a blob (there doesn't need to be a real difference in terms of user intent), and the relationship between data passed to a resource constructor and the written-down data is a little unclear (e.g. if I do rh.table(my_ray_table, path="real/path/to/existing.parquet"), which data should fetch return?).
Basic API concepts:
rh.table(my_table) puts the table in the cluster's object store, with system=this_cluster and name=f"table_{random_hex}" (just like we do to generate random run_keys). The rns_address (whether random or user-provided) is the key in the object store.
.save just persists in the RNS that the object lives on that cluster in the object store. If the cluster goes down, the table is obviously gone.
There's no _data field, because there's no need for a local object store (nothing can .get the object from the local interpreter anyway).
rh.table(my_table).write() would actually save the table down (same as present behavior), but return a new table object with path set to the fs path. That eliminates the current ._cached_data ambiguity (multiple sources of truth), because the original object still holds the original data, and the new returned object just points to the fs data. rh.table(my_table).write(path="local/path.parquet") is clearer than the present constructor accepting both (we should probably throw an error if both are passed in, because it's ambiguous). One gotcha: if the user sets the name for the in-memory table and then writes it, should the new table have the same name? If they .save it, should we delete the existing object out of the object store so it's clear that there's only one table with that rns_address (and it's not really accessible anymore)? In general, if a user loads an object from_name, the one stored in RNS should be the source of truth, even if there's a local one in the object store.
my_table.fetch() and my_table.stream() from elsewhere should still work, but now via RPCs - the cluster's .get should already work for fetch, but we'd likely need a new one for stream. For fetch, the object needs to be pickleable (not cloudpickle-able) for us to be able to send it over the wire without dealing with python version mismatches (I don't think this is unreasonable).
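The "write() returns a new, fs-backed object" idea above can be sketched with a toy stand-in (this Table is illustrative, not the real runhouse class, and the actual persistence step is elided):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Table:
    data: object = None
    path: str = None

    def write(self, path):
        # (actually persisting self.data to `path` is elided here)
        return replace(self, data=None, path=path)  # new object, fs-backed

t = Table(data=[1, 2, 3])
t2 = t.write("local/path.parquet")  # t keeps its data; t2 just points at the path
```

Keeping the original object untouched and returning a new one is what removes the "multiple sources of truth" ambiguity: each object has exactly one backing (memory or filesystem).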
We need a way to tell for a given blob or table if we need to use the RPCs instead of the existing fs-based operations, and I'm leaning toward actually breaking out the folder-backed table or blob to be separate classes from the in-memory ones. It would probably make the most sense if the in-memory Blob/Table/KVstore etc. classes were actually the base classes, and the folder-backed ones were subclasses. There are a number of advantages to doing this, e.g. the in-memory base classes don't need a path field.
rh.blob(my_model) saves into the object store with key blob.name or f"blob_{randomhex}". rh.blob(my_model, name="my_model") and rh.Blob.from_name("my_model") should behave identically to rh.pin_to_memory("my_model", my_model) and rh.get_pinned_object("my_model") (except with rns_address as the obj_store key instead of name, but that's an implementation detail), and ideally replace them. The current pinning system isn't very elegant and eats too much user brainspace.
An immediate implication of the above (because we use pinning for storing results when a user calls fn.remote) is that fn.remote can just wrap the result in a blob before returning, instead of returning the run_key. Wrapping a result in rh.blob is common enough that it makes sense for .remote to mean "please return a remote object." The current .remote behavior of returning the run_key is actually "run this async and return a key to retrieve the result", which I think would make more sense to be called fn.async or fn.submit, considering the fact that most users don't seem to know we support async because the naming is unclear (submit could make it clearer that the function will continue to run in the background even if they kill the interpreter locally). Also, right now we need to INFO log a bunch of instructions for killing or retrieving for every .remote call, but this isn't necessary and looks ugly when the user just wants a remote object back.
Lastly, supporting remote in-memory objects opens the door to remote calls on those objects. We could pretty easily support this just by intercepting any call on the object, and if the rh.blob doesn't have that function/attr, we try RPCing the call over to the cluster. Like this:
class Blob(Resource):
    ...
    def __getattribute__(self, name):
        if not_a_blob_attr(name):
            remote_attr = self.get_attr_over_rpc(name)
            if name == "__call__" or hasattr(remote_attr, "__call__"):
                def newfunc(*args, **kwargs):
                    result = self.call_on_obj_via_rpc(name, *args, **kwargs)
                    return result if self.is_primitive(result) else rh.blob(result)
                return newfunc
            else:
                return remote_attr
        else:
            return super().__getattribute__(name)  # fall back to the local attribute
This would make our remote objects real remote objects, and save a lot of trouble creating one-off functions to send to the cluster to call methods on objects. You can do something crazy like:
model = rh.blob(my_model).to(gpu).cuda() # But can't use .to("cuda") because it'd call blob's .to
local_pil_image = model("my_input_string").fetch()
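For intuition, the interception pattern above can be demonstrated locally without any RPC. This simplified stand-in forwards unknown attribute lookups to a wrapped object, using __getattr__ (which only fires when normal lookup fails, and so sidesteps the recursion pitfalls of __getattribute__); the RPC hop in the real design would replace the plain getattr:

```python
class Proxy:
    """Forward attribute lookups that Proxy itself lacks to a wrapped object."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # __getattr__ only fires when normal lookup fails, so Proxy's own
        # attributes (like _obj) resolve locally without recursion
        return getattr(self._obj, name)

p = Proxy([1, 2, 3])
p.append(4)  # "append" isn't on Proxy, so it's forwarded to the wrapped list
```

The real version additionally has to wrap non-primitive results back into blobs, as the snippet above does with rh.blob(result).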
So overall the benefits of this change are:
From SyncLinear.com | KIT-83
Cc @Caroline
From SyncLinear.com | KIT-38
Integrate via REST API (slurmrestd)
SlurmCluster subclass which can submit jobs to an existing Slurm cluster
From SyncLinear.com | KIT-78
Hi, I consistently see my script hang when it copies a local package to the server. Is there any way, from the server side, to display which packages are actually being copied?
/work/rh/scripts/self-hosted.py
INFO | 2023-05-31 20:38:49,626 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-05-31 20:39:24,493 | Running command on rh-cluster: ray start --head
INFO | 2023-05-31 20:39:46,019 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
Warning: Identity file /home/ytang/.ssh/id_rsa not accessible: No such file or directory.
INFO | 2023-05-31 20:39:50,904 | Setting up Function on cluster.
INFO | 2023-05-31 20:39:51,059 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:39:51,127 | Authentication (publickey) successful!
2023-05-31 20:39:51,128| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-05-31 20:39:51,128 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-05-31 20:39:51,288 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:39:51,418 | Authentication (publickey) successful!
INFO | 2023-05-31 20:39:51,674 | Copying local package work to cluster
root@35c45fe5c801:/work/rh# cd /work/rh ; /usr/bin/env /usr/bin/python3 /root/.vscode-server/extensions/ms-python.python-2023.8.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 54577 -- /work/rh/scripts/self-hosted.py
INFO | 2023-05-31 20:46:22,533 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-05-31 20:46:27,993 | Running command on rh-cluster: ray start --head
INFO | 2023-05-31 20:46:28,686 | Running command on rh-cluster: mkdir -p ~/.rh; touch /.rh/cluster_config.yaml; echo '{"name": "/rh-cluster", "resource_type": "cluster", "resource_subtype": "Cluster", "ips": ["127.0.0.1"], "ssh_creds": {"ssh_user": "ytang", "ssh_private_key": "/home/ytang/.ssh/id_rsa"}}' > ~/.rh/cluster_config.yaml
Warning: Identity file /home/ytang/.ssh/id_rsa not accessible: No such file or directory.
INFO | 2023-05-31 20:46:29,852 | Setting up Function on cluster.
INFO | 2023-05-31 20:46:29,917 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:46:30,028 | Authentication (publickey) successful!
2023-05-31 20:46:30,028| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-05-31 20:46:30,028 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-05-31 20:46:30,081 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-05-31 20:46:30,157 | Authentication (publickey) successful!
INFO | 2023-05-31 20:46:30,413 | Copying local package work to cluster
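One quick local check for the "might be in use" half of the tunnel errors above (a hypothetical diagnostic, not part of Runhouse): try binding the port yourself and see whether it fails.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """True if we can bind (host, port), i.e. nothing else currently holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True

# demo: hold an ephemeral port open, then confirm the check sees it as busy
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))
busy_port = blocker.getsockname()[1]
```

If port_is_free(50052) returns False before Runhouse starts, something else (often a stale tunnel or server process) is already holding the port.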
Describe the bug
Hi, recently I constantly hit a BadStatusLine issue as follows; maybe it is related to a urllib library issue?
client@4c31ddeb9ade:/zip$ python test_self_hosted_llm.py
INFO | 2023-06-14 18:24:16,048 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-06-14 18:24:16,921 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-06-14 18:24:16,981 | Authentication (publickey) successful!
2023-06-14 18:24:16,982| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-06-14 18:24:16,982 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-06-14 18:24:17,115 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-06-14 18:24:17,174 | Authentication (publickey) successful!
INFO | 2023-06-14 18:24:17,224 | Running command on rh-cluster: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Identity file /home/server/.ssh/id_rsa not accessible: Permission denied.
pkill: killing pid 255251 failed: Operation not permitted
pkill: killing pid 255253 failed: Operation not permitted
INFO | 2023-06-14 18:24:17,274 | Running command on rh-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cluster.log 2>&1'
Warning: Identity file /home/server/.ssh/id_rsa not accessible: Permission denied.
INFO | 2023-06-14 18:24:20,324 | Running command on rh-cluster: ray start --head
WARNING | 2023-06-14 18:24:21,357 | /home/client/.local/lib/python3.10/site-packages/runhouse/rns/function.py:110: UserWarning: `reqs` and `setup_cmds` arguments has been deprecated. Please use `env` instead.
warnings.warn(
INFO | 2023-06-14 18:24:21,358 | Setting up Function on cluster.
INFO | 2023-06-14 18:24:21,495 | Installing packages on cluster rh-cluster: ['transformers', 'torch', 'Package: zip']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 300, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: ?ÿÿ?ÿÿ ?
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/dist-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.10/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.10/http/client.py", line 300, in _read_status
raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\x00\x00\x18\x04\x00\x00\x00\x00\x00\x00\x04\x00?ÿÿ\x00\x05\x00?ÿÿ\x00\x06\x00\x00 \x00þ\x03\x00\x00\x00\x01\x00\x00\x04\x08\x00\x00\x00\x00\x00\x00?\x00\x00'))
Versions
Please run the following and paste the output below.
wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
awscli==1.27.153
boto3==1.26.153
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse @ file:///tmp/runhouse-0.0.6-py3-none-any.whl
skypilot==0.3.1
sshfs==2023.4.1
sshtunnel==0.4.0
typer==0.9.0
wheel==0.38.4
sh: 1: sky: not found
sh: 1: sky: not found
Please help prioritize our roadmap! We have a long list of projects we'd like to complete to make Runhouse robust 🦾, comprehensive 🎨, and flexible 🙆♀️ across research and production usage. Please comment which items resonate for your use cases, or let us know if there are features we've missed!