
ray-legacy's Introduction

Ray


Ray is an experimental distributed extension of Python. It is under development and not ready to be used.

The goal of Ray is to make it easy to write machine learning applications that run on a cluster while providing the development and debugging experience of working on a single machine.

Before jumping into the details, here's a simple Python example for doing a Monte Carlo estimation of pi (using multiple cores or potentially multiple machines).

import ray
import numpy as np

# Start a scheduler, an object store, and some workers.
ray.init(start_ray_local=True, num_workers=10)

# Define a remote function for estimating pi.
@ray.remote
def estimate_pi(n):
  x = np.random.uniform(size=n)
  y = np.random.uniform(size=n)
  return 4 * np.mean(x ** 2 + y ** 2 < 1)

# Launch 10 tasks, each of which estimates pi.
result_ids = []
for _ in range(10):
  result_ids.append(estimate_pi.remote(100))

# Fetch the results of the tasks and print their average.
estimate = np.mean(ray.get(result_ids))
print "Pi is approximately {}.".format(estimate)

Within the for loop, each call to estimate_pi.remote(100) sends a message to the scheduler asking it to schedule the task of running estimate_pi with the argument 100. This call returns right away without waiting for the actual estimation of pi to take place. Instead of returning a float, it returns an object ID, which represents the eventual output of the computation (this is similar to a Future).

The call to ray.get(result_ids) takes the list of object IDs and returns the actual estimates of pi (waiting until the computations have finished if necessary).

Next Steps

Example Applications


ray-legacy's Issues

Ctrl-C causes IPC error

To reproduce this error, do the following

  1. Start a shell with python scripts/shell.py
  2. After the shell has started, press Ctrl-C.

This gives me the following error message (and kills the workers/scheduler/object store)

In [1]: [the output here is interleaved across several worker processes; the recoverable message, repeated by each of them, is]
fatal error occured: Check failed at line 290 in /Users/rkn/Workspace/ray/src/worker.cc: receive_queue_.receive(&message) with message error receiving over IPC

KeyboardInterrupt

replace fatal logging with check and dcheck

Lots of the code currently looks like

if (condition) {
  HALO_LOG(HALO_FATAL, "error message");
}

This should be replaced by something like

HALO_CHECK(condition, "error message");

Similar to the code in apache/arrow#33, we may also want the following

HALO_DCHECK(condition)
HALO_DCHECK_EQ(val1, val2)
HALO_DCHECK_NE(val1, val2)
HALO_DCHECK_LE(val1, val2)
HALO_DCHECK_LT(val1, val2)
HALO_DCHECK_GE(val1, val2)
HALO_DCHECK_GT(val1, val2)

These can be used to automatically generate error messages. We should also support custom error messages, and the generated messages should include the file and line number where the check fails.

scheduling more intelligently (brainstorm)

  • check if an identical task has already been executed
  • greedily check if there is a worker more "suitable" based on what objects are local already (L0: count, L1: size)
  • add a couple perf tests: task dependency graph structure, number of workers, task types.

time: one week.

System modifications in setup script

Should system modifications be removed from the setup script ./setup.sh? If yes, how should we handle the sudo apt-get install ..., sudo pip install ..., and brew install ... commands?

Support ObjectID arguments

Consider the following code:

@ray.remote([np.ndarray], [])
def f(x):
  pass

@ray.remote([np.ndarray], [])
def g(x):
  f(x)

If we have an ObjectID objectid to a numpy array, and we call g(objectid), then the scheduler will ship the numpy array to the worker that is assigned to execute g. Then the worker will call f and pass the array by value. This is bad because we have shipped the array unnecessarily (three times in fact: once to the worker, once to the scheduler when calling f by value, and once from the scheduler to the worker that executes f).

The code should actually be expressed as follows.

@ray.remote([np.ndarray], [])
def f(x):
  pass

@ray.remote([ObjectID[np.ndarray]], [])
def g(x):
  f(x)

A function that has ObjectID[np.ndarray] in the remote decorator can be used exactly the same way that a function with np.ndarray in the remote decorator can be used. The difference is that when the function is executed, the function is given an ObjectID instead of the numpy array.

This allows library developers to specify more clearly what they want, but the user of the function doesn't need to do anything differently.
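A rough sketch of the argument-handling logic this would imply on the worker side (in get_arguments_for_execution, which is mentioned in a later issue); the is_objectid_type helper is hypothetical:

def get_arguments_for_execution(arg_types, args, worker):
  # Resolve each argument before the function runs: fetch the value for
  # plain types, but hand the ObjectID through untouched when the declared
  # type is ObjectID[...].
  resolved = []
  for declared_type, arg in zip(arg_types, args):
    if is_objectid_type(declared_type):  # hypothetical check for ObjectID[T]
      resolved.append(arg)               # pass the ObjectID itself
    else:
      resolved.append(worker.get_object(arg))  # fetch the actual value
  return resolved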

bug in array_test.py

The following example segfaults on some machines (run in ray/test).

import ray
import ray.array.distributed as da

ray.services.start_ray_local(num_workers=2)

x = da.zeros([234, 432])
y = da.copy(x)
ray.get(da.assemble(y))

Inspecting the core dump gives

(lldb) bt
* thread #1: tid = 0x0000, 0x00000001043c452f libraylib.so`PyObjRef_dealloc(PyObjRef*) + 703, stop reason = signal SIGSTOP
  * frame #0: 0x00000001043c452f libraylib.so`PyObjRef_dealloc(PyObjRef*) + 703
    frame #1: 0x0000000103be9fdd multiarray.so`___lldb_unnamed_function3507$$multiarray.so + 484
    frame #2: 0x0000000103b52d8d multiarray.so`___lldb_unnamed_function21$$multiarray.so + 139
    frame #3: 0x000000010357c86f Python`___lldb_unnamed_function685$$Python + 129
    frame #4: 0x000000010359cb73 Python`___lldb_unnamed_function1155$$Python + 575
    frame #5: 0x000000010359247c Python`___lldb_unnamed_function974$$Python + 107
    frame #6: 0x000000010356d8f6 Python`___lldb_unnamed_function490$$Python + 110
    frame #7: 0x00000001035ca591 Python`PyEval_EvalCodeEx + 2047
    frame #8: 0x00000001035d04ae Python`___lldb_unnamed_function1476$$Python + 117
    frame #9: 0x00000001035cd30c Python`PyEval_EvalFrameEx + 11609
    frame #10: 0x00000001035ca3c1 Python`PyEval_EvalCodeEx + 1583
    frame #11: 0x00000001035d04ae Python`___lldb_unnamed_function1476$$Python + 117
    frame #12: 0x00000001035cd30c Python`PyEval_EvalFrameEx + 11609
    frame #13: 0x00000001035ca3c1 Python`PyEval_EvalCodeEx + 1583
    frame #14: 0x00000001035c9d8c Python`PyEval_EvalCode + 54
    frame #15: 0x00000001035e9a42 Python`___lldb_unnamed_function1599$$Python + 53
    frame #16: 0x00000001035e9ae5 Python`PyRun_FileExFlags + 133
    frame #17: 0x00000001035e9634 Python`PyRun_SimpleFileExFlags + 698
    frame #18: 0x00000001035fb011 Python`Py_Main + 3137
    frame #19: 0x00007fff82b025ad libdyld.dylib`start + 1
    frame #20: 0x00007fff82b025ad libdyld.dylib`start + 1

Properly check exceptions in raylib

Right now, some exceptions are only exposed if we use the value. For example, if you do the following in scripts/shell.py.

>>> ray.put(10 ** 100)
>>> ray.get(ray.put(10 ** 100))
-1L
>>> ref = ray.put(10 ** 100)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
error: serialization: long overflow

Also, if you start a regular python shell,

>>> import ray
>>> ray.serialization.serialize(ray.worker.global_worker.handle, 1)
(<capsule object "obj" at 0x106ed9db0>, [])
>>> x = ray.serialization.serialize(ray.worker.global_worker.handle, 1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: must be a 'worker' capsule

Handle driver submitting a task with a function that hasn't been registered.

Consider the script

import os

import ray
import ray.services as services
import test_functions

# assume that we run this script from inside ray/test/
test_dir = os.path.dirname(os.path.abspath(__file__))
test_path = os.path.join(test_dir, "test_worker.py")
services.start_singlenode_cluster(return_drivers=False, num_workers_per_objstore=1, worker_path=test_path)

x = test_functions.keyword_fct1(1)
xval = ray.pull(x) # xval should be "1 hello"

Two things happen:

  1. The line services.start_singlenode_cluster starts a worker process, which connects to the scheduler and registers all of the functions in the test_functions module (that is, it lets the scheduler know that it knows how to execute those functions).
  2. The line x = test_functions.keyword_fct1(1) submits a task to the scheduler for executing the function test_functions.keyword_fct1.

There are two possibilities.

  • If 1) happens before 2), then everything works great.
  • If 2) happens before 1), then things are trickier. The method SubmitTask currently throws a fatal error. The advantage of throwing a fatal error is that a lot of the time when 2) happens before 1) it will be because the user forgot to register a particular module, and we will want to draw this to the user's attention to make debugging easier. The disadvantage of throwing a fatal error is of course that this situation can happen legitimately. For example, if test_worker.py takes a while before registering its functions (e.g., it first creates a tensorflow model), then 2) will probably happen first. In this situation, what should we do? Perhaps the answer is that the call to x = test_functions.keyword_fct1(1) should block until the function has been registered on some worker. The advantages of this are
    • It should be easier to debug if the user forgot to import a module.
    • If we are waiting for a slow worker, then it won't fail.
    • The function SubmitTask needs to return the correct number of object references to the driver program. Right now, it gets that information when the function gets registered, so if we wait, then we will have that information. If we don't wait and want to return immediately, then that information will have to be in the SubmitTaskRequest, which feels a bit ugly.

Thoughts? @pcmoritz @mehrdadn @Wapaul1.
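One way to express the blocking behavior suggested above, as a minimal sketch; function_is_registered and submit_task are hypothetical stand-ins for the scheduler RPCs:

import time

def submit_when_registered(scheduler, func_name, args, timeout=30.0):
  # Poll the scheduler until some worker has registered func_name, then
  # submit the task. If the timeout expires, surface the likely cause.
  deadline = time.time() + timeout
  while not scheduler.function_is_registered(func_name):  # hypothetical RPC
    if time.time() > deadline:
      raise Exception("Function {} was never registered. Did a worker forget "
                      "to import its module?".format(func_name))
    time.sleep(0.1)
  return scheduler.submit_task(func_name, args)  # hypothetical RPC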

Attach shell (in cluster setting) without passing in addresses

Once we've started a ray cluster using cluster.py, we should be able to ssh to the head node and attach a shell to the cluster with python scripts/shell.py --attach without having to pass in the addresses of the scheduler, object store, and shell.

One way to do this is to have cluster.py save the addresses in a file somewhere on the head node and let shell.py read from it.

IPC error when attaching multiple drivers

This bug arose while I was testing cluster.py and restarting the workers multiple times.

To reproduce this error, do the following. In build, start a scheduler

./scheduler 127.0.0.1:10001

In build, start an object store

./objstore 127.0.0.1:10001 127.0.0.1:20001

Start a Python shell and do

>>> import ray
>>> ray.connect("127.0.0.1:10001", "127.0.0.1:20001", "127.0.0.1:30001", is_driver=True)
>>> exit()

Then start another Python shell and do

>>> import ray
>>> ray.connect("127.0.0.1:10001", "127.0.0.1:20001", "127.0.0.1:30002", is_driver=True)

The only difference is that I incremented the shell port number to avoid reusing the same address. This fails with

fatal error occured: Check failed at line 71 in /Users/rkn/Workspace/ray/src/ipc.cc: false with message boost::interprocess exception: No such file or directory

Note that if we omit the exit() command the first time, then this error does not happen.

Passing large objects by value

Objects can be passed to remote functions by reference or value. For example, consider this function.

@remote([np.ndarray], [])
def f(x):
  pass

We can pass the argument by value, as below.

x = np.zeros([1000, 1000, 1000])
f(x)

We can also pass the argument by object reference, as below.

x = np.zeros([1000, 1000, 1000])
xref = ray.put(x)
f(xref)

The first approach serializes the object into the task message that gets sent to the scheduler (and stored in the scheduler). The second approach stores the object in the object store and sends a very lightweight message to the scheduler. For large objects (like x), the second approach is preferable because it doesn't take up so much space in the scheduler. However, this distinction is relatively subtle, and users are likely to misuse it.

There are two possible solutions.

  1. Print a warning when the user attempts to pass a large object by value to a remote function.
  2. When the user attempts to pass a large object by value, automatically call ray.put and pass the object reference. That is, pass it by reference but hide the details from the user.

The disadvantage of the first approach is that the warning may originate from any node on the cluster and so bringing it to the user's attention can be a little unnatural. Also, warnings are easy to ignore.
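A minimal sketch of the second option, assuming an arbitrary size threshold; the wrapper and threshold are illustrative and not part of the current API:

import sys
import numpy as np
import ray

# Arguments larger than this many bytes get put in the object store instead
# of being serialized into the task message (threshold chosen arbitrarily).
LARGE_ARGUMENT_THRESHOLD = 100 * 1024

def wrap_large_argument(arg):
  # Replace a large by-value argument with an object reference so the
  # scheduler only sees a lightweight message.
  size = arg.nbytes if isinstance(arg, np.ndarray) else sys.getsizeof(arg)
  if size > LARGE_ARGUMENT_THRESHOLD:
    return ray.put(arg)  # pass by reference, hidden from the user
  return arg  # small arguments are still serialized into the task message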

check for RayFailedObject should happen in get_object, not get

If an object reference for a failed task is passed into a remote function, the task should throw an exception the same way that a call to ray.get would. E.g.,

@ray.remote([], [int])
def f():
  raise Exception("This task failed")

@ray.remote([int], [int])
def g(x):
  return 0

Then if we do

ref = f()
g(ref)

the task g(ref) should throw an exception saying that a task that created one of the arguments failed. Right now the type checking will fail (it expected an int and got a RayFailedObject).
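A sketch of what moving the check into get_object could look like, so that both ray.get and argument retrieval for tasks raise; this assumes RayFailedObject carries the original error message, and retrieve_from_objstore is a hypothetical helper:

def get_object(worker, objref):
  # Retrieve the object from the object store and raise if the task that
  # was supposed to create it failed (sketch; the check currently lives in
  # ray.get instead).
  value = worker.retrieve_from_objstore(objref)  # hypothetical helper
  if isinstance(value, RayFailedObject):
    raise Exception("The task that created object {} failed with error:\n"
                    "{}".format(objref, value.error_message))
  return value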

Killing workers with running tasks

We want the ability to kill workers with running tasks. Some things to think through and figure out beforehand:

  1. When workers notify the scheduler that the task has been completed, the scheduler should not mark the worker as available.
  2. Think through what to do if the scheduler receives a SubmitTask or PutObj from a worker that has been marked for dying.
  3. The driver or a worker may have called pull on an object that may never be delivered. (Maybe send some kind of Object Unavailable reply to the object stores so that they notify the workers?)

Key value store on workers for storing global variables

Workers are Python processes, and global variables defined in the worker processes can be shared between tasks that execute on the same worker process. There are two conflicting requirements:

  1. If users allow tasks to mutate global variables, then workers are stateful and the behavior of an application will depend on which nodes tasks get scheduled on. This will likely cause subtle bugs and prevent users from thinking about tasks in a purely functional way.
  2. Global variables whose state can be mutated may be required in some situations to amortize the cost of initializing these variables. For example, an OpenAI gym environment or a TensorFlow graph should probably only be created once per worker. When it is used, the gym environment is necessarily mutated (note that the mutated state is not in Python but in C). The user may have to manually reset the gym environment (by calling env.reset()).

Therefore, the solution should make it possible to have stateful global variables that can be mutated within a remote task, but it should feel wrong or hard to mutate them without resetting them at the end of a remote task. Additionally, it should make it easy to catch bugs that result from not resetting these global variables.

Possible approach:

  1. Allow global variables, but require them to be accessed through a key value store (e.g., through ray.put_global and ray.get_global).
  2. Allow ray.put_global to be called only in a special function worker_initialization_function, which runs on every worker (including workers created later on).

Possibility 1

import numpy as np
import gym
import ray

num_steps = 20

def initialize_env():
    return gym.make("Pong-v0") # env is not serializable, so this line must run on all workers.

def reinitialize_env(env):
    env.reset() # some use cases may require the ability to call put_worker_variable again

ray.register_worker_variable("env", initialize_env, reinitialize_env)

ray.services.start_ray_local() # Start Ray

@ray.remote([], [float])
def play_gym_game():
    env = ray.get_worker_variable("env") # This gets the global variable.
    cumulative_reward = 0
    for i in range(num_steps):
        action = np.random.randint(0, 5) # choose a random action
        _, reward, _, _ = env.step(action) # This mutates the state of the global variable "env"
        cumulative_reward += reward
    return cumulative_reward

Possibility 2

import numpy as np
import gym
import ray

num_steps = 20

def initialization_function(): # this takes no arguments, and returns nothing
    env = gym.make("Pong-v0") # env is not serializable, so this line must run on all workers.
    ray.put_worker_variable("env", env)

def reinitialization_function():
    ray.get_worker_variable("env").reset()

ray.services.start_ray_local(worker_initialization_function=initialization_function) # Start Ray and run initialization_function on every worker.

@ray.remote([], [float])
def play_gym_game():
    env = ray.get_worker_variable("env") # This gets the global variable.
    cumulative_reward = 0
    for i in range(num_steps):
        action = np.random.randint(0, 5) # choose a random action
        _, reward, _, _ = env.step(action) # This mutates the state of the global variable "env"
        cumulative_reward += reward
    return cumulative_reward

Possibility 3

import numpy as np
import gym
import ray

num_steps = 20

def initialization_function(): # this takes no arguments, and returns nothing
    return gym.make("Pong-v0") # env is not serializable, so this line must run on all workers.

def reinitialization_function(env):
    env.reset()

ray.reusable.env = (initialization_function, reinitialization_function)

ray.services.start_ray_local(worker_initialization_function=initialization_function) # Start Ray and run initialization_function on every worker.

@ray.remote([], [float])
def play_gym_game():
    env = ray.reusable.env # This gets the global variable.
    cumulative_reward = 0
    for i in range(num_steps):
        action = np.random.randint(0, 5) # choose a random action
        _, reward, _, _ = env.step(action) # This mutates the state of the global variable "env"
        cumulative_reward += reward
    return cumulative_reward

Fix serialization of python objects when passing by value

Right now, the method serialization.serialize lets a python object serialize itself if it knows how, otherwise it calls ray.lib.serialize_object, which calls ray.lib.serialize. However, we also call ray.lib.serialize from ray.lib.serialize_call, which gets called from worker.remote_call. This doesn't give python objects the opportunity to serialize themselves. Therefore, the following code fails:

x = da.zeros([10, 20])
da.assemble(ray.get(x))

though the following code succeeds:

x = da.zeros([10, 20])
da.assemble(x)

import causes segfault when script ends

Running the script

import tensorflow as tf
import ray

print "hello world"

produces the output

hello world
Segmentation fault (core dumped)

The print statement is not necessary to create the segfault; it just shows that the code in the body of the script runs and that the segfault happens when the script ends.

The problem goes away if we switch the order of the imports or remove the tensorflow import.

This bug was reported by @Wapaul1.

reference counting interacts with pull

In the following code, the result x will be garbage.

x = ray.pull(ra.zeros([10, 10]))

When ra.zeros is called, a worker will create an array of zeros and store it in an object store. An object reference to the output is returned. The call to ray.pull will not copy data from the object store process to the worker process, but will instead give the worker process a pointer to shared memory. After the ray.pull call completes, the object reference returned by ra.zeros will go out of scope, and the object it refers to will be deallocated from the object store. This will cause the memory that x points to to be garbage.

Currently, as a workaround (we do this in the tests), we can write this code as

xref = ra.zeros([10, 10])
x = ray.pull(xref)

assuming xref and x go out of scope at the same time.

One potential solution is for the call to pull to increment a reference count and to tell the destructor of x to decrement the reference count.
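A sketch of that solution: pull increments the reference count and ties a small guard object to the lifetime of the returned value; in practice the guard's role would be played by the RayDealloc object that already serves as the array's base, and get_object_from_store is a hypothetical helper:

class RefCountGuard(object):
  # Decrements the reference count for an objref when garbage collected.
  def __init__(self, worker, objref):
    self.worker = worker
    self.objref = objref
  def __del__(self):
    self.worker.decrement_reference_count(self.objref)

def pull(worker, objref):
  # Keep the object alive while the pulled value is in use.
  worker.increment_reference_count(objref)
  value = worker.get_object_from_store(objref)  # hypothetical helper
  # The guard is returned alongside the value only for illustration; the real
  # implementation would attach it as the base object of the returned array.
  return value, RefCountGuard(worker, objref)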

Enable running in Python mode

We should allow the user to run scripts in Python mode (perhaps as a driver_mode argument to start_single_node_cluster), in which case the code behaves like serial Python. It may be possible to do this by replacing each remote function call with the regular (blocking) function call and replacing put/get with no-ops.
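A minimal sketch of that substitution, assuming a module-level flag set from the driver_mode argument; PYTHON_MODE, submit_remote_task, store_in_objstore, and fetch_from_objstore are illustrative names for the flag and the normal code paths:

PYTHON_MODE = False  # hypothetically set by start_single_node_cluster(driver_mode=...)

def remote(arg_types, return_types):
  # Simplified stand-in for the remote decorator that runs functions inline
  # when Python mode is enabled.
  def decorator(func):
    def call(*args):
      if PYTHON_MODE:
        return func(*args)  # plain, blocking function call
      return submit_remote_task(func, args)  # hypothetical: the normal path
    return call
  return decorator

def put(value):
  return value if PYTHON_MODE else store_in_objstore(value)  # no-op in Python mode

def get(objref):
  return objref if PYTHON_MODE else fetch_from_objstore(objref)  # no-op in Python mode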

Support returning objrefs

We should be able to write code like the following:

@remote([], [np.ndarray])
def f():
  return np.zeros(5)

@remote([], [np.ndarray])
def g():
  return f()

However, currently, the call to f returns an ObjRef to an array, and so the call to g will return an ObjRef to an ObjRef.

Both the calls to f and g are non-blocking and immediately return ObjRefs to their outputs, so we can't get around the fact that we will end up with two separate ObjRefs. However, the outputs of the two functions should be the same object, so we need to support ObjRef aliasing.

When store_outputs_in_objstore is called for the function g, it should check whether one of the return values is an ObjRef (call it objref) and, if so, alias the corresponding return ObjRef of g to point to the same object as objref.
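A sketch of that check in store_outputs_in_objstore; alias_objrefs and put_object are hypothetical names for the aliasing and storing calls:

def store_outputs_in_objstore(objrefs, outputs, worker):
  # Store each task output, aliasing instead of storing when the output is
  # itself an ObjRef (e.g. the result of calling f() inside g()).
  for objref, output in zip(objrefs, outputs):
    if isinstance(output, ObjRef):
      worker.alias_objrefs(objref, output)  # make both refs point to the same object
    else:
      worker.put_object(objref, output)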

Error message when trying to attach a shell in one of the slave nodes

If we try to run the shell command that the cluster.py script prints at the end from one of the slave nodes, the following happens:

fatal error occured: Check failed at line 74 in /home/ubuntu/ray/src/ipc.cc: false with message boost::interprocess exception: No such file or directory

We should give a better error message here.

Make arrow serialization of composite Python types possible

This was reported by @Wapaul1

We need to be able, for example, to serialize the weights of neural networks quickly and without hitting the protobuf size limit.

This will require changing the NumBuf library so that most of the types that are currently only implemented in our protobuf serialization can be serialized using Arrow.

Implement serialization of lambdas

This is important for implementing map and reduce.

We need to think through the handling of imports; for example, in lambda a, b: np.dot(a, b), the np module may be imported under a different name on a different worker.

To what extent should we support closures?

Deep copying of Python objects backed by the object store

If we do a deep copy of a Python object (like a numpy array) whose memory is backed by the object store, it might happen that the destructor object (RayDealloc in worker.py) is copied but keeps the same handle and segmentid. This would result in the object getting unmapped twice! Similar problems might happen with the object reference that is stored in the object upon "pull". We should investigate this more.

Error handling for failed tasks

Development in Ray should be a good experience, which means that the user/developer has to be alerted about errors. Say the driver launches a task, that task launches another task, which launches another task, and so on, and the last task throws an exception.

Then what is the appropriate way to alert the user about the exception? This is hard because the driver (say a Python shell) launches tasks asynchronously without waiting for them to execute.

We probably want to implement a method called something like get_task_statuses() in worker.cc, which gets a list of tasks that are running, tasks that have failed, and maybe tasks that have completed (or recently completed). This could be called and displayed whenever the user submits a new task or calls get (here we should probably distinguish between the user doing this and a worker doing this). It could also be run continuously on the driver.
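A rough sketch of how the driver might surface this; get_task_statuses and the fields on its reply are hypothetical, since the RPC does not exist yet:

def print_failed_tasks(worker):
  # Query the scheduler for task statuses and print any failures so the user
  # sees them even though tasks were submitted asynchronously.
  statuses = worker.get_task_statuses()  # hypothetical call into worker.cc
  for task in statuses.failed_tasks:
    print "Remote task {} failed with error:\n{}".format(task.function_name,
                                                         task.error_message)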

segfault in runtest.py

Sometimes the Travis build segfaults in runtest.py on Mac OS X with the message

$ python runtest.py
scheduler: writing to log file /tmp/raylogs/2016-06-26=20:17:34-scheduler.log
D0626 20:17:34.866177000 140735276749568 ev_posix.c:101] Using polling engine: poll
object store: writing to log file /tmp/raylogs/2016-06-26=20:17:35-objstore-127.0.0.1:20001.log
D0626 20:17:35.026280000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:35.576193000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:36.901697000 140735276749568 ev_posix.c:101] Using polling engine: poll
Attempting to kill process at address 127.0.0.1:10001.
Successfully killed process at address 127.0.0.1:10001.
Attempting to kill process at address 127.0.0.1:20001.
Successfully killed process at address 127.0.0.1:20001.
Attempting to kill process at address 127.0.0.1:40001.
Successfully killed process at address 127.0.0.1:40001.
.scheduler: writing to log file /tmp/raylogs/2016-06-26=20:17:37-scheduler.log
D0626 20:17:37.347365000 140735276749568 ev_posix.c:101] Using polling engine: poll
object store: writing to log file /tmp/raylogs/2016-06-26=20:17:37-objstore-127.0.0.1:20002.log
D0626 20:17:37.460378000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:39.912683000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:40.093953000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:40.122660000 140735276749568 ev_posix.c:101] Using polling engine: poll
Attempting to kill process at address 127.0.0.1:10002.
E0626 20:17:40.251382000 140735276749568 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:10002': socket error: connection refused
Successfully killed process at address 127.0.0.1:10002.
Attempting to kill process at address 127.0.0.1:20002.
E0626 20:17:40.311871000 140735276749568 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:10002': socket error: connection refused
Successfully killed process at address 127.0.0.1:20002.
Attempting to kill process at address 127.0.0.1:40003.
Successfully killed process at address 127.0.0.1:40003.
Attempting to kill process at address 127.0.0.1:40004.
Successfully killed process at address 127.0.0.1:40004.
Attempting to kill process at address 127.0.0.1:40005.
Successfully killed process at address 127.0.0.1:40005.
.scheduler: writing to log file /tmp/raylogs/2016-06-26=20:17:40-scheduler.log
D0626 20:17:40.496432000 140735276749568 ev_posix.c:101] Using polling engine: poll
object store: writing to log file /tmp/raylogs/2016-06-26=20:17:40-objstore-127.0.0.1:20003.log
D0626 20:17:40.609605000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:41.120116000 140735276749568 ev_posix.c:101] Using polling engine: poll
D0626 20:17:42.252428000 140735276749568 ev_posix.c:101] Using polling engine: poll
Attempting to kill process at address 127.0.0.1:10003.
Successfully killed process at address 127.0.0.1:10003.
Attempting to kill process at address 127.0.0.1:20003.
Successfully killed process at address 127.0.0.1:20003.
Attempting to kill process at address 127.0.0.1:40007.
Successfully killed process at address 127.0.0.1:40007.
.scheduler: writing to log file /tmp/raylogs/2016-06-26=20:17:42-scheduler.log
D0626 20:17:42.844424000 140735276749568 ev_posix.c:101] Using polling engine: poll
object store: writing to log file /tmp/raylogs/2016-06-26=20:17:42-objstore-127.0.0.1:20004.log
D0626 20:17:42.950088000 140735276749568 ev_posix.c:101] Using polling engine: poll
object store: writing to log file /tmp/raylogs/2016-06-26=20:17:43-objstore-127.0.0.1:20005.log
D0626 20:17:43.465256000 140735276749568 ev_posix.c:101] Using polling engine: poll
Attempting to kill process at address 127.0.0.1:10004.
Successfully killed process at address 127.0.0.1:10004.
Attempting to kill process at address 127.0.0.1:20004.
Successfully killed process at address 127.0.0.1:20004.
Attempting to kill process at address 127.0.0.1:20005.
Successfully killed process at address 127.0.0.1:20005.
/Users/travis/build.sh: line 45: 19498 Segmentation fault: 11  python runtest.py
The command "python runtest.py" exited with 139.

I can reproduce this error on Mac OS X (not deterministically) by removing everything from runtest.py except for ObjStoreTest. Running the test in gdb seems to indicate that the error occurs when an object reference goes out of scope and its destructor calls decrement_reference_count, which in turn calls scheduler_stub_->DecrementRefCount(&context, request, &reply), which segfaults.

The gdb backtrace is

Program received signal SIGSEGV, Segmentation fault.
0x00000001011537ce in Scheduler::Stub::DecrementRefCount(grpc::ClientContext*, DecrementRefCountRequest const&, AckReply*) () from /Users/fortesting/ray/lib/python/ray/libraylib.so
#0  0x00000001011537ce in Scheduler::Stub::DecrementRefCount(grpc::ClientContext*, DecrementRefCountRequest const&, AckReply*) () from /Users/fortesting/ray/lib/python/ray/libraylib.so
#1  0x00000001010c2c5b in Worker::decrement_reference_count(std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >&) () from /Users/fortesting/ray/lib/python/ray/libraylib.so
#2  0x00000001010aff69 in PyObjRef_dealloc(PyObjRef*) () from /Users/fortesting/ray/lib/python/ray/libraylib.so
#3  0x0000000100044333 in list_dealloc () from /anaconda/lib/libpython2.7.dylib
#4  0x000000010003cd8d in frame_dealloc () from /anaconda/lib/libpython2.7.dylib
#5  0x00000001000c57e3 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#6  0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#7  0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#8  0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#9  0x00000001000c2538 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#10 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#11 0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#12 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#13 0x000000010001dbed in instancemethod_call () from /anaconda/lib/libpython2.7.dylib
#14 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#15 0x00000001000796aa in slot_tp_call () from /anaconda/lib/libpython2.7.dylib
#16 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#17 0x00000001000c0013 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#18 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#19 0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#20 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#21 0x00000001000c2538 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#22 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#23 0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#24 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#25 0x000000010001dbed in instancemethod_call () from /anaconda/lib/libpython2.7.dylib
#26 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#27 0x00000001000796aa in slot_tp_call () from /anaconda/lib/libpython2.7.dylib
#28 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#29 0x00000001000c0013 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#30 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#31 0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#32 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#33 0x00000001000c2538 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#34 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#35 0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#36 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#37 0x000000010001dbed in instancemethod_call () from /anaconda/lib/libpython2.7.dylib
#38 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#39 0x00000001000796aa in slot_tp_call () from /anaconda/lib/libpython2.7.dylib
#40 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#41 0x00000001000c0013 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#42 0x00000001000c57bf in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#43 0x00000001000c57bf in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#44 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#45 0x000000010003e100 in function_call () from /anaconda/lib/libpython2.7.dylib
#46 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#47 0x000000010001dbed in instancemethod_call () from /anaconda/lib/libpython2.7.dylib
#48 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#49 0x00000001000792d8 in slot_tp_init () from /anaconda/lib/libpython2.7.dylib
#50 0x0000000100076155 in type_call () from /anaconda/lib/libpython2.7.dylib
#51 0x000000010000c552 in PyObject_Call () from /anaconda/lib/libpython2.7.dylib
#52 0x00000001000c0013 in PyEval_EvalFrameEx () from /anaconda/lib/libpython2.7.dylib
#53 0x00000001000c83a3 in PyEval_EvalCodeEx () from /anaconda/lib/libpython2.7.dylib
#54 0x00000001000c84c6 in PyEval_EvalCode () from /anaconda/lib/libpython2.7.dylib
#55 0x00000001000ede5e in PyRun_FileExFlags () from /anaconda/lib/libpython2.7.dylib
#56 0x00000001000ee0fa in PyRun_SimpleFileExFlags () from /anaconda/lib/libpython2.7.dylib
#57 0x000000010010517d in Py_Main () from /anaconda/lib/libpython2.7.dylib
#58 0x0000000100000f54 in start ()
warning: (Internal error: pc 0x0 in read in psymtab, but not in symtab.)

Manual Killall after python exception

If regular Python code throws an exception, then the object store, scheduler, and Python worker processes are not automatically cleaned up and halted, forcing you to kill them manually in order to run the script again.

Fix corner case for opening and closing memory segments

See the following lines in runtest.py:

    # The following currently segfaults: The second "result = " closes the
    # memory segment as soon as the assignment is done (and the first result
    # goes out of scope).
    """
    data = np.zeros([10, 20])
    objref = ray.put(data)
    result = worker.get(objref)
    result = worker.get(objref)
    self.assertTrue(np.alltrue(result == data))
    """

Correctly handle passing ObjRefs by value to distributed functions

Consider a remote function

@ray.remote([ObjRef], [])
def f(objref):
    return

If we call f(objref), then the argument objref should be passed by value. However, currently, the scheduler will attempt to ship the corresponding object and get_arguments_for_execution in worker.py will attempt to pull the object.

Inspect object store

Users should be able to inspect the contents of an object store (something like ls).
