Distributed-Job-Scheduling

Goal

Implement an infrastructure to manage jobs submitted to a cluster of Executors. Each client may submit a job to any Executor and receives a job id as the return value. With this job id, the client may check (by contacting the same Executor it submitted the job to) whether the job has been executed, and may retrieve the results it produced. Executors communicate and coordinate among themselves to share the load, so that at any time every Executor is running the same number of jobs (or a number as close to that as possible). Assume links are reliable but processes (i.e., Executors) may fail and later resume, re-joining the system immediately afterwards. Use stable storage to cope with Executor failures.

Ring

Our choices

We implemented the infrastructure in Python, using Python sockets for Client-Executor and Executor-Executor communication. We decided to deploy the cluster of Executors as a ring in order to reduce message exchange. A Token passes through the Executors and manages load balancing by updating an internal dataframe that reports the number of jobs (executing and waiting) on each Executor.
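As a rough illustration of the idea, the sketch below passes a token once around a three-node ring and lets each Executor write its own job counts into it. It is not the project's code: the `Token` class, `next_in_ring`, the Executor ids, and the use of a plain dict instead of a dataframe are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical token: maps executor id -> (executing, waiting) counts."""
    load: dict = field(default_factory=dict)

    def report(self, executor_id, executing, waiting):
        # Each Executor overwrites its own row when the token arrives.
        self.load[executor_id] = (executing, waiting)

    def total_jobs(self, executor_id):
        executing, waiting = self.load.get(executor_id, (0, 0))
        return executing + waiting


def next_in_ring(ring, executor_id):
    """Return the Executor that receives the token after `executor_id`."""
    i = ring.index(executor_id)
    return ring[(i + 1) % len(ring)]


# Assumed example data: three Executors and their local job counts.
ring = ["E1", "E2", "E3"]
local_counts = {"E1": (1, 4), "E2": (1, 0), "E3": (0, 0)}

token = Token()
holder = "E1"
for _ in range(len(ring)):  # one full lap around the ring
    token.report(holder, *local_counts[holder])
    holder = next_in_ring(ring, holder)
```

After one lap the token holds a row for every Executor and is back at the Executor it started from, so each hop costs one message regardless of the cluster size.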

Token

By consulting this dataframe, the Token manages load balancing: it tells an Executor whether it has to send jobs to other Executors and, if so, to which ones. If jobs need to be forwarded, the two Executors establish a connection using Python sockets. The recipient adds the received job(s) to its waiting list (or directly executes the first one), and sends the result of a job back to the sender only once it has been computed. This again reduces message exchange, because the sender never has to ask for the status of a forwarded job.
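One way such a decision could be computed is sketched below. The greedy policy and the `plan_transfers` function are hypothetical, not taken from the project: given the total number of jobs per Executor, it moves jobs from the busiest Executors to the idlest ones until the counts differ by at most one. For simplicity it treats every job as movable, although in practice only waiting jobs could be forwarded.

```python
def plan_transfers(load):
    """Return (sender, recipient, n_jobs) moves that even out `load`.

    `load` maps executor id -> total jobs (executing + waiting).
    """
    total = sum(load.values())
    base, extra = divmod(total, len(load))
    # Busiest first; ties are broken by id so the plan is deterministic.
    by_load = sorted(load, key=lambda e: (-load[e], e))
    # Everyone targets `base` jobs; the `extra` busiest keep one more.
    target = {e: base + (1 if i < extra else 0) for i, e in enumerate(by_load)}

    surplus = [[e, load[e] - target[e]] for e in by_load if load[e] > target[e]]
    deficit = [[e, target[e] - load[e]] for e in by_load if load[e] < target[e]]

    moves = []
    while surplus and deficit:
        sender, recipient = surplus[0], deficit[0]
        n = min(sender[1], recipient[1])
        moves.append((sender[0], recipient[0], n))
        sender[1] -= n
        recipient[1] -= n
        if sender[1] == 0:
            surplus.pop(0)
        if recipient[1] == 0:
            deficit.pop(0)
    return moves


print(plan_transfers({"E1": 5, "E2": 1, "E3": 0}))
```

With the assumed counts above, `E1` would forward one job to `E2` and two jobs to `E3`, leaving two jobs on each Executor.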

To manage fault tolerance we use at-least-once semantics: messages may be duplicated, but not lost. Every time there is a connection (Client-Executor or Executor-Executor), the sender waits for an ACK from the receiver; we set a timeout, and if the ACK is not received before it expires, the sender sends the message again.
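The retransmission loop described above can be sketched with the standard `socket` module as follows. The function name `send_with_ack`, the `b"ACK"` reply, the `id:body` message layout, and the `Receiver` class are assumptions for illustration and are not taken from the project. Because at-least-once delivery can produce duplicates, the hypothetical receiver acknowledges every copy but processes each message id only once.

```python
import socket


def send_with_ack(sock, payload, timeout=1.0, max_retries=5):
    """Send `payload`, resending until b"ACK" arrives or retries run out.

    Returns the number of attempts that were needed.
    """
    sock.settimeout(timeout)
    for attempt in range(1, max_retries + 1):
        sock.sendall(payload)
        try:
            if sock.recv(16) == b"ACK":
                return attempt
        except socket.timeout:
            continue  # no ACK in time: send the same message again
    raise ConnectionError("no ACK received after %d attempts" % max_retries)


class Receiver:
    """Acknowledges every copy but processes each message id only once."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, message):
        # Assumed wire format: b"<message id>:<body>".
        msg_id, _, body = message.partition(b":")
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.processed.append(body)
        return b"ACK"
```

A sender would call `send_with_ack(sock, b"42:job")` on a connected socket, and the peer would reply with the return value of `Receiver.handle`. Resending the same message id is harmless because the receiver ignores copies it has already seen.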

Bash Script

To simulate failures we created a bash script (each machine executes one) that creates two Executors and a Client. After a random time, the script simulates the failure of an Executor in order to test fault tolerance: whether the state is stored and then restored correctly, and whether everything keeps working after the Executor re-joins.
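The store-and-restore step that this test exercises could look roughly like the sketch below. It is an assumption, not the project's implementation: the JSON format, the `save_state` and `load_state` names, and the shape of the state dictionary are all hypothetical. The snapshot is written to a temporary file and then renamed over the old one, so that a crash in the middle of a write should not leave a half-written state file behind.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write `state` to `path` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())    # push the bytes to disk before renaming
    os.replace(tmp_path, path)  # swap in the new snapshot in one step


def load_state(path):
    """Return the last saved state, or an empty one on first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        # Assumed initial state: no waiting jobs and no stored results.
        return {"waiting": [], "results": {}}
```

An Executor would call `save_state` whenever its job list or results change, and `load_state` once at start-up, before re-joining the ring.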

Demo

The demo version that you get by running the code in this repository consists of two machines, each of which must execute a bash script, so each machine has two Executors and one Client. The message recipients are set by default so as to show that connections between components on the same machine and on different machines both work correctly.

Team Members

Contributors

manuelsalamino
