cook's People

Contributors

ahaysx, bolina, brianbao, calebhar12, cge0516, daowen, dependabot[bot], dgrnbrg, diegoalbertotorres, dposada, gerrymanoim, icexelloss, jhn, kathryn-zhou, laurameng, leifwalsh, lewisheadden, mayurjpatel, mforsyth, nsinkov, pschorf, rmanyari, samincheva, scrosby, shamsimam, sophaskins, sradack, wenbozhao, wyegelwel, yueri

cook's Issues

Document Scheduler configuration

Make sure that we have a sample dev config (should work out of the box) & prod config (should have comments to explain some choices).

This also should have details on the recommended production JVM options, and why to use them (Datomic using extra heap as cache, debugging GC pauses, etc).

All options should be documented in the asciidoc.

Add support for terminal task failure

Currently the Cook scheduler always retries a job when it fails. However, sometimes the executor can determine that a job has failed permanently, so there is no point in retrying. In this case, we should allow the executor to tell the Cook scheduler not to retry the job.

To implement this, we can leverage the data field in TaskStatus. We can start including metadata (a JSON map, maybe) along with TaskStatus, and this will just be a "terminal-failure": "true" entry.

The Cook scheduler can then simply set the job state to complete when it sees "terminal-failure": "true" in a task status.
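
The proposal above could be sketched roughly as follows, in Python for brevity. This is only an illustration under stated assumptions: the data field is assumed to carry a UTF-8 JSON map, the state strings are placeholders, and `is_terminal_failure` and `next_job_state` are hypothetical names, not part of Cook or Mesos.

```python
import json


def is_terminal_failure(data: bytes) -> bool:
    """Return True if the task status metadata marks the failure as terminal.

    Assumes `data` is the raw bytes of the TaskStatus data field and that it
    holds a JSON map; anything unparsable is treated as "not terminal".
    """
    if not data:
        return False
    try:
        metadata = json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return False
    if not isinstance(metadata, dict):
        return False
    # The issue proposes the string "true"; a JSON boolean is accepted too.
    return metadata.get("terminal-failure") in ("true", True)


def next_job_state(data: bytes, attempts: int, max_retries: int) -> str:
    """Decide the job state after a failed task (hypothetical helper)."""
    if is_terminal_failure(data) or attempts >= max_retries:
        return "completed"  # placeholder state name: do not retry
    return "waiting"  # placeholder state name: eligible for another attempt
```

Treating unparsable metadata as non-terminal keeps today's retry behavior for executors that do not send the new entry.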

Unable to start server from checkout

I checked out a8e1c67 and tried to run lein run dev-config.edn but it failed to run because of missing dependencies. It seems tags referenced on line 318 of components are not defined anywhere. I commented out the expression, but then I got another error about cook.reporter on line 324 and commented that expression out as well. After those changes I was able to run.

I was able to get it all running in less than 30 minutes, including pulling dependencies and debugging this. Thanks for making it easy.

Document how to build cook with datomic pro

Currently, the Datomic free edition jars are available in public Maven repos, so lein is happy to build against them. But to use Datomic Pro, one has to maven install the licensed jars into the local Maven repo before building. The documentation on that whole process is a little sparse. I found http://aan.io/datomic-pro-and-leiningen/ useful after googling around.

The current documentation suggests that switching to Datomic Pro is as simple as applying s/datomic-free/datomic-pro/g to project.clj.

Move to Metrics library in mesos/monitor.clj to standardize metrics reporting

In mesos/monitor.clj, we currently report metrics on each user's waiting/running jobs, CPUs, and memory by sending Riemann events directly. Instead, we should:

(1) Have a chime process query the database and store the results in atoms/async channels.
(2) Have a go-loop that watches the atoms/async channels and registers/deregisters gauges.
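
The register/deregister step in (2) could look roughly like the sketch below, written in Python rather than Clojure. `GaugeRegistry` is a hypothetical stand-in for a metrics registry, the metric names are made up, and the `stats` dict plays the role of the atom that the periodic query in (1) keeps up to date.

```python
from typing import Callable, Dict

Gauge = Callable[[], float]


class GaugeRegistry:
    """Tiny stand-in for a metrics registry (hypothetical, not a real API)."""

    def __init__(self) -> None:
        self.gauges: Dict[str, Gauge] = {}

    def register(self, name: str, gauge: Gauge) -> None:
        self.gauges[name] = gauge

    def deregister(self, name: str) -> None:
        self.gauges.pop(name, None)


def sync_gauges(registry: GaugeRegistry, stats: Dict[str, float]) -> None:
    """Register gauges for new keys and drop gauges whose key disappeared.

    Each gauge reads the latest value from `stats` whenever it is sampled,
    so only membership changes require touching the registry.
    """
    for name in set(registry.gauges) - set(stats):
        registry.deregister(name)
    for name in set(stats) - set(registry.gauges):
        # Bind `name` via a default argument so each gauge reads its own key.
        registry.register(name, lambda n=name: stats.get(n, 0.0))
```

Calling `sync_gauges` after each refresh keeps the set of gauges aligned with the users that currently have jobs.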

Per discussion with @dgrnbrg

Cannot start cook with dev/prod datomic

The issue is that dev/prod Datomic needs the metatransaction jar, but currently metatransaction lives inside the scheduler project, so I cannot compile a standalone metatransaction jar.

Cook shouldn't change instance status to failed unless it knows the task has failed

Instance status should reflect what actually happened. However, we currently change the instance status to failed in order to kill it. I think this is not ideal: when a user sees instance status = failed in Cook, it should mean that the task has indeed failed, that is, Cook received task-failed, task-error, or (maybe, in some cases) task-lost from Mesos.

The places we currently change instance status to failed are:
(1) To preempt a task
(2) To kill a task due to heartbeat timeout

@dgrnbrg let me know what you think

Document where libmesos is

libmesos.so / libmesos.dylib (Linux/Mac) is usually in /usr/lib or /usr/local/lib, but in my case it was in $MESOS_BUILD_DIR/src/.libs/. We need to understand why this was and document each edge case.
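
Until this is documented, a small lookup over the locations mentioned above can help when diagnosing a machine. The following Python sketch is only an illustration: `find_libmesos` is a hypothetical helper, and the search order is an assumption.

```python
import os
import sys
from typing import Iterable, Optional


def find_libmesos(extra_dirs: Iterable[str] = ()) -> Optional[str]:
    """Return the first existing libmesos path, or None if none is found."""
    # libmesos.dylib on macOS, libmesos.so elsewhere.
    name = "libmesos.dylib" if sys.platform == "darwin" else "libmesos.so"
    candidates = list(extra_dirs) + ["/usr/lib", "/usr/local/lib"]
    # A source build leaves the library under src/.libs of the build dir.
    build_dir = os.environ.get("MESOS_BUILD_DIR")
    if build_dir:
        candidates.append(os.path.join(build_dir, "src", ".libs"))
    for directory in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None
```

Calling `find_libmesos()` with no arguments checks only the two system directories and, if set, `$MESOS_BUILD_DIR/src/.libs`.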

Zookeeper needed for dev-config

When running lein run dev-config.edn I get:

2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [::1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client
2015-09-21 19:21:17,472:22178(0x116b06000):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:2181] zk retcode=-4, errno=61(Connection refused): server refused to accept the client

which resolves once I start running ZooKeeper locally:

2015-09-21 19:21:20,806:22178(0x116b06000):ZOO_INFO@check_events@1703: initiated connection to server [fe80::1:2181]
2015-09-21 19:21:21,191:22178(0x116b06000):ZOO_INFO@check_events@1750: session establishment complete on server [fe80::1:2181], sessionId=0x14ff236287a0000, negotiated timeout=10000
I0921 19:21:21.191696 327958528 group.cpp:313] Group process (group(1)@127.0.0.1:56667) connected to ZooKeeper
I0921 19:21:21.191776 327958528 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0921 19:21:21.191836 327958528 group.cpp:385] Trying to create path '/mesos' in ZooKeeper

Is this expected? From the documentation,

"Cook is written in Clojure. To develop Cook, all you need is a JVM and Mesos installed and configured. Cook will automatically start embedded copies of the rest of its dependencies."

I thought I would not need any dependencies when running in dev-mode.
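
A quick way to confirm whether anything is listening on the ZooKeeper port from the log above (2181) is a plain TCP probe. This is a generic Python sketch, not something Cook ships; `port_is_open` is a hypothetical helper.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("ZooKeeper reachable:", port_is_open("127.0.0.1", 2181))
```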

Provide an HTTP based job tracking endpoint

Users would like to be able to query Cook for the list of their running and waiting jobs. We've discussed this at length internally but I'd like to bring this to the open source for design, review and implementation.

Test protobuf <-> datomic roundtrips

This is meant to test that we can submit some JSON through the REST API, see it hit Datomic, then convert that to a protobuf, then follow the whole roundtrip back. This could catch otherwise unknown serialization/format-munging bugs, since we represent job data as Clojure data structures, Mesos protobufs, Datomic datoms, and JSON objects.
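
The general shape of such a test is "convert through every representation and compare with the input". The Python sketch below shows that shape with stand-in converters only: the (entity, attribute, value) triples merely imitate datoms, the protobuf leg is left out, and all function names are hypothetical.

```python
import json
from typing import Any, Dict, List, Tuple

Datom = Tuple[str, str, Any]  # (entity id, attribute, value)


def job_to_datoms(job: Dict[str, Any]) -> List[Datom]:
    """Flatten a job map into (entity, attribute, value) triples."""
    return [(job["uuid"], key, value) for key, value in job.items()]


def datoms_to_job(datoms: List[Datom]) -> Dict[str, Any]:
    """Rebuild the job map from its triples."""
    return {attribute: value for _, attribute, value in datoms}


def roundtrip(payload: str) -> str:
    """JSON -> job map -> triples -> job map -> JSON."""
    job = json.loads(payload)
    restored = datoms_to_job(job_to_datoms(job))
    return json.dumps(restored, sort_keys=True)


if __name__ == "__main__":
    job = {"uuid": "f76aa5bd-e4bb-4ef3-9ad4-5b2938efc0fd", "mem": 16, "cpus": 0.5}
    payload = json.dumps(job, sort_keys=True)
    # The roundtrip must not lose or alter any field.
    assert roundtrip(payload) == payload
```

A real test would replace the stand-ins with the actual REST, Datomic, and protobuf conversions and keep the final equality check.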

Add support for other databases

The first step here is determining the types of queries we do. This issue should be updated with the current list:

  • Find all jobs of a particular status
  • Find all non-terminal instances
  • Query a particular job or instance by ID

The status-related queries require secondary indices, but we could change instance IDs to be [jobid instanceid] pairs, so that we only need to implement lookup by job ID, and then we'd just store a "document" with the full job & instance state.
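
The [jobid instanceid] idea can be sketched as a store that is indexed only by job ID, with instance lookups resolved through the job document. This is an illustrative Python sketch; `JobStore` and the document layout are assumptions, not a proposed schema.

```python
from typing import Any, Dict, Optional, Tuple

InstanceId = Tuple[str, str]  # (job id, instance id)


class JobStore:
    """In-memory stand-in for a key-value store indexed only by job ID."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put_job(self, job_id: str, doc: Dict[str, Any]) -> None:
        # The document holds the full job state plus all of its instances.
        self._docs[job_id] = doc

    def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
        return self._docs.get(job_id)

    def get_instance(self, instance_id: InstanceId) -> Optional[Dict[str, Any]]:
        # The pair carries the job ID, so no secondary index is needed.
        job_id, instance_key = instance_id
        job = self.get_job(job_id)
        if job is None:
            return None
        return job.get("instances", {}).get(instance_key)
```

The status-based queries from the list above would still need a scan or a separate index; this sketch covers only the lookup-by-ID case.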

Still to be analyzed:

  • What's the impact on the metrics reporter's user stats?
  • How critical are the transaction functions? Could we change them to run locally, or all be CAS-based?
  • Can we refactor the use of the Datomic txn log tailer to be totally local, core.async, and per-process?

Running job status not updated in mesos 0.23 and Cook

I submitted a job via cook to a mesos 0.23 cluster. Everything seems to have worked fine, but the instances[0].status and framework_id are not getting set. On the mesos page, I do see the job as running and cook scheduler as a registered framework.

[
  {
    mem: 16,
    max_retries: 3,
    max_runtime: 86400000,
    name: "cookjob",
    command: "while [ true ]; do echo hello cook I am "$(whoami)" and MY_VAR="${MY_VAR}"; sleep 10; done",
    env: {
      MY_VAR: "foo1"
    },
    framework_id: null,
    instances: [
      {
        start_time: 1444169356373,
        task_id: "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        hostname: "some.host.domain.com",
        slave_id: "20151006-201511-738201772-5050-93146-S8",
        executor_id: "cd66e79b-9272-4d54-bbd3-e89cff8c78c0",
        status: "unknown"
      }
    ],
    priority: 50,
    status: "waiting",
    uuid: "f76aa5bd-e4bb-4ef3-9ad4-5b2938efc0fd",
    uris: null,
    cpus: 0.5
  }
]

lein uberjar from scheduler subdir failed

ljin@hsljin:~/ws/github/Cook/scheduler$ lein uberjar
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 5555; nested exception is:
java.net.BindException: Address already in use
Compilation failed: Subprocess failed

Add Spark parameters for configuring Cook binding

Besides setting the CPUs and Memory for each executor, we should be able to specify additional URIs or environment variables to retrieve for the executor, and the min threshold of running executors to wait for until we start computing.

Benchmark time to schedule a workload

This will give us an idea of how long it should take to start some number of jobs, of various sizes.

The motivation is to understand how long it should take to launch a Spark cluster, so that we can figure out how multitenancy affects this, and if something special is needed.

Add support for host constraints

This should be for things like "only on hosts w/ a specific attribute". This will enable things like GPU or machine class aware scheduling.

This will need to be added to the client-facing API, as well as to the scheduler & db.
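
On the scheduler side, the simplest form of such a constraint is an equality check against host attributes. The Python sketch below assumes offers are plain maps with `hostname` and `attributes` keys; that shape, the attribute names, and `matching_hosts` are all hypothetical.

```python
from typing import Any, Dict, Iterable, List


def matching_hosts(offers: Iterable[Dict[str, Any]],
                   constraints: Dict[str, str]) -> List[str]:
    """Return hostnames whose attributes satisfy every equality constraint."""
    return [
        offer["hostname"]
        for offer in offers
        if all(
            offer.get("attributes", {}).get(key) == value
            for key, value in constraints.items()
        )
    ]


if __name__ == "__main__":
    offers = [
        {"hostname": "a", "attributes": {"gpu": "true", "class": "large"}},
        {"hostname": "b", "attributes": {"class": "large"}},
    ]
    print(matching_hosts(offers, {"gpu": "true"}))
```

An empty constraint map matches every host, which keeps jobs without constraints schedulable anywhere.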

Change the way we load Mesos in travis to enable moving to travis container infra

This will require submitting a request here: https://github.com/travis-ci/apt-source-whitelist

Or we can download and install/unpack/build (or grab binaries of) Mesos ourselves.

But this also has the problem/downside that to get them added to the whitelist, they seem to need source packages. And to use the cache (necessary for building the package), we'd need to be a paying Travis customer.

This is trickier than I initially thought.

Add federation to Cook REST API

Here's an example of what the config file could look like:

 :federation {:remotes ["http://localhost:12322"]
              :priviledged-principal "admin"
              :threads 4
              :circuit-breaker {:failure-threshold 0
                                :lifetime-ms 60000
                                :response-timeout-ms 60000
                                :reset-timeout-ms 60000
                                :failure-logger-size 10000}}
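
For readers unfamiliar with the circuit-breaker settings, here is a minimal Python sketch of how failure-threshold and reset-timeout-ms might interact. The semantics are an assumption (the breaker opens once consecutive failures exceed the threshold, and allows a trial request after the reset timeout); the other keys are not modeled, and `CircuitBreaker` is a hypothetical class, not Cook's implementation.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after too many consecutive failures; retries after a timeout."""

    def __init__(self, failure_threshold: int, reset_timeout_ms: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_ms / 1000.0
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True when closed, or when open long enough to try the remote again."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        """Close the breaker and forget earlier failures."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        # Assumption: the breaker opens once failures exceed the threshold,
        # so a threshold of 0 opens it on the first failure.
        if self._failures > self.failure_threshold:
            self._opened_at = self._clock()
```

Under this reading, the example config (failure-threshold 0, reset-timeout-ms 60000) would stop calling a remote for 60 seconds after its first failure.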

Update spark to latest build

This includes using the new Cook environment variable and URI APIs, integrating the latest code into Spark 1.5, and documenting the instructions for building off Spark 1.5.

This should also add support so that the connection uses either Basic auth or Kerberos, depending on whether the URI is of the form cook://user:pass@host:port or simply cook://host:port.
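
The URI rule above can be sketched as follows. This is a Python illustration only (the Spark binding itself would not be written in Python), and `CookEndpoint` and `parse_cook_uri` are hypothetical names.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class CookEndpoint(NamedTuple):
    host: str
    port: Optional[int]
    auth: str  # "basic" or "kerberos"
    user: Optional[str]
    password: Optional[str]


def parse_cook_uri(uri: str) -> CookEndpoint:
    """Split a cook:// URI and pick the auth scheme from its form."""
    parts = urlsplit(uri)
    if parts.scheme != "cook" or not parts.hostname:
        raise ValueError(f"not a cook:// URI: {uri!r}")
    # Credentials in the URI select Basic auth; otherwise assume Kerberos.
    auth = "basic" if parts.username is not None else "kerberos"
    return CookEndpoint(parts.hostname, parts.port, auth,
                        parts.username, parts.password)
```

For example, `parse_cook_uri("cook://user:pass@host:12321")` selects Basic auth, while `parse_cook_uri("cook://host:12321")` falls back to Kerberos.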
