wrench-project / wrench

WRENCH: Cyberinfrastructure Simulation Workbench

Home Page: https://wrench-project.org

License: GNU Lesser General Public License v3.0

CMake 0.89% C++ 84.53% C 0.09% Shell 0.03% HTML 14.39% Python 0.08%
batch-job distributed-computing distributed-systems hpc reproducible-research scheduling-simulator scientific-workflows simulation-framework simulation-modeling workflow workflow-management-system workflow-simulator

wrench's People

Contributors

code-factor, erick-orozco-ciprian, frs69wq, gjethwani, henricasanova, james-oeth, jesse-mcdonald, lpottier, mesurajpandey, pfdutot, rafaelfsilva, rileymiyamoto, rsreds, ryantanaka, spenceralbrecht, sukaryo-heilscher, wanyuzha, willkoch


wrench's Issues

BatchService: prediction error with batsched?

It seems that if one asks for a queue wait time prediction multiple times with the same key, then we get a "job already in the system" error. For instance:

Assertion '_jobs.count(job_id) == 0' failed (ERROR)
in file json_workload.cpp, line 62
function: void Workload::add_job_from_json_object(const Value &, const string &, double)
with message: Job 'config_XXXX' already exists in the Workload

If we generate distinct keys, then we don't get that error. So it's as if those prediction jobs are actually inserted into the workload... Is this a WRENCH issue or a BATSCHED issue?
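Until the root cause is found, a workaround along the lines described above is to never reuse a prediction key. A minimal sketch (the helper name is hypothetical, not a WRENCH API) appends a monotonically increasing counter to the user-supplied base key:

```cpp
#include <atomic>
#include <string>

// Hypothetical helper: make each queue-wait-time prediction key unique so
// that batsched never sees the same fake job id twice.
std::string makeUniquePredictionKey(const std::string &base) {
  static std::atomic<unsigned long> counter{0};
  return base + "_" + std::to_string(counter++);
}
```

With this, two predictions for the same configuration still produce distinct batsched job ids, which avoids the assertion failure.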

Needed Development: SimpleStorageService using S4U Storage

Update the implementation of the SimpleStorageService so that it uses the Storage abstraction provided by S4U, whenever available/documented. One issue to pay attention to is the pipelining of network transfers and disk writes (a pure "store-and-forward" approach is really not realistic).

Feature Request: Augment the set of simulation events in the simulation output

What simulation "events" to add:

  • Task start
  • Task failure
  • File copy begins
  • File copy ends
  • File copy failure

One concern is that with our current design we will sprinkle "add timestamp" calls everywhere in the code... is there a better way?

We should also augment the WorkflowTask object to keep track of detailed time info. For starters:

  • task_start_date (already in there)
  • task_computation_start_date
  • task_computation_end_date

Implement a Workflow::getReadyClusters()

If one doesn't use task clusters, it's annoying to get a map of cluster IDs when doing a Workflow::getReadyTasks(). So we should simply have getReadyTasks() return a vector of tasks, and getReadyClusters() a map of cluster IDs. A "ready cluster" is a cluster that contains only ready tasks.
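The proposed semantics can be sketched as follows (using a stand-in Task struct rather than the real wrench::WorkflowTask, so the names here are illustrative only): group tasks by cluster ID, then keep only the clusters in which every task is ready.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for wrench::WorkflowTask (illustrative only)
struct Task {
  std::string cluster_id;
  bool ready;
};

// A "ready cluster" contains ONLY ready tasks; tasks with no cluster id
// would be returned by getReadyTasks() instead and are skipped here.
std::map<std::string, std::vector<Task *>> getReadyClusters(std::vector<Task> &tasks) {
  std::map<std::string, std::vector<Task *>> clusters;
  std::map<std::string, bool> all_ready;
  for (auto &t : tasks) {
    if (t.cluster_id.empty()) continue;  // not in a cluster
    clusters[t.cluster_id].push_back(&t);
    auto it = all_ready.find(t.cluster_id);
    all_ready[t.cluster_id] = (it == all_ready.end() ? true : it->second) && t.ready;
  }
  // Drop clusters that contain at least one non-ready task
  for (auto it = clusters.begin(); it != clusters.end();) {
    if (!all_ready[it->first]) it = clusters.erase(it);
    else ++it;
  }
  return clusters;
}
```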

General Compute Service API refactoring

At the moment, every compute service has two boolean arguments ("support pilot jobs", "support standard jobs"), a double argument (the scratch space size), and a plist (property list). Wouldn't it make more sense to make ALL these arguments part of the plist, so that a compute service would only take a hostname and a list of compute resources as arguments, plus an optional plist?
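The refactoring could look like the sketch below (class and property names are hypothetical, not the actual WRENCH API): the constructor takes only a hostname, compute resources, and an optional property list, with defaults filled in for absent properties.

```cpp
#include <map>
#include <string>
#include <utility>

using PropertyList = std::map<std::string, std::string>;

// Illustrative sketch of the proposed constructor signature
struct ComputeServiceSketch {
  ComputeServiceSketch(const std::string &hostname,
                       const std::map<std::string, int> &compute_resources,
                       PropertyList plist = {})
      : plist(std::move(plist)) {
    // Defaults apply when a property is absent from the plist
    if (this->plist.find("SUPPORTS_STANDARD_JOBS") == this->plist.end())
      this->plist["SUPPORTS_STANDARD_JOBS"] = "true";
    if (this->plist.find("SUPPORTS_PILOT_JOBS") == this->plist.end())
      this->plist["SUPPORTS_PILOT_JOBS"] = "false";
    if (this->plist.find("SCRATCH_SPACE_SIZE") == this->plist.end())
      this->plist["SCRATCH_SPACE_SIZE"] = "0";
  }
  PropertyList plist;
};
```

This keeps the mandatory constructor arguments down to what truly identifies the service, while everything tunable lives in one place.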

Feature Request:"copy and register", "delete and unregister" abilities

It would be good to augment the DataMovementManager and other components (e.g., job executors perhaps) with the option to do a combined "create/copy a file AND add an entry in the FileRegistryService". Similarly, when removing a file from a storage service, it would be good to have a "remove and unregister". The objective is for a WMS developer who wants everything registered to not have to do tons of explicit, separate register/unregister operations.

First step:

  • Modify the StorageService::deleteFile() API to take in a FileRegistryService pointer. If a non-nullptr is passed, then update the file registry service ONLY if the delete operation is successful
  • Modify the DataMovementManager (synchronous and asynchronous) to take in a FileRegistryService pointer. If a non-nullptr is passed, then update the file registry service ONLY if the copy operation is successful.
    • This will require adding a "pending file copies" data structure to the DataMovementManager
    • Algorithm is:
      • For each file copy request keep track of "update file registry service" or "don't update"
      • When receiving a "file copy done or failed" message:
        • If not failed: (i) check whether file registry service should be updated; (ii) if yes, update it.
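The "pending file copies" bookkeeping described in the steps above can be sketched as follows (all names are hypothetical; a real implementation would call into the actual FileRegistryService instead of returning a boolean):

```cpp
#include <map>
#include <string>

// Sketch of the DataMovementManager bookkeeping: for each in-flight copy,
// remember whether the FileRegistryService should be updated on success.
struct DataMovementSketch {
  std::map<std::string, bool> pending_copies;  // copy id -> update registry?

  void initiateCopy(const std::string &copy_id, bool update_registry) {
    pending_copies[copy_id] = update_registry;
  }

  // Called on a "file copy done or failed" message.
  // Returns true iff the registry should be (and notionally is) updated.
  bool onCopyDoneOrFailed(const std::string &copy_id, bool failed) {
    auto it = pending_copies.find(copy_id);
    if (it == pending_copies.end()) return false;  // unknown copy
    bool update = it->second;
    pending_copies.erase(it);
    if (failed) return false;  // registry updated ONLY on success
    return update;             // here: registry->addEntry(file, storage)
  }
};
```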

BatchService: what about RAM?

At the moment, the way in which the BatchService is handling RAM is strange:
[DONE] Updated the constructor to handle heterogeneity

  • IGNORE RAM constraints (and document this)

Scratch space isolation between standard jobs

We need to clarify semantics for scratch space. Here is the proposal:

  • a compute service has a single scratch storage of a given size specified at construction time
  • When a standard job runs on the compute service (NOT within a pilot job), that standard job has its own temporary "directory" in scratch. As a result, the same file could be stored multiple times in scratch, each copy for a different standard job. This requires that a scratch storage service provide a bit more functionality than a normal storage service.
  • When standard jobs run within a pilot job, these standard jobs share a single scratch temporary directory (and if one of them wipes out a file another needs, then too bad: that's the WMS's fault).

So, in a nutshell, we need to extend the StorageService and/or SimpleStorageService API and implementation to include a "temp directories" abstraction, to be defined.
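A minimal sketch of that "temp directories" abstraction, assuming files are keyed by (job ID, file name) so that the same file can be stored once per standard job (all names here are hypothetical):

```cpp
#include <map>
#include <set>
#include <string>

// Sketch of per-job temp directories in scratch: the same file name can
// live in scratch multiple times, once per job.
struct ScratchSketch {
  std::map<std::string, std::set<std::string>> dirs;  // job id -> file names

  void store(const std::string &job_id, const std::string &file) {
    dirs[job_id].insert(file);
  }
  bool lookup(const std::string &job_id, const std::string &file) const {
    auto it = dirs.find(job_id);
    return it != dirs.end() && it->second.count(file) > 0;
  }
  // Implicit cleanup when a standard job (not within a pilot job) ends
  void wipeJob(const std::string &job_id) { dirs.erase(job_id); }
};
```

Pilot jobs would map all of their standard jobs to a single shared key, giving the "shared directory" semantics above.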

Software Engineering: use "const" more :)

In the current code, there is very little use of the const keyword, even though this is really a great feature of C++. As we go forward, adding const here and there is a good thing.

Weird Doxygen problem

The "internal documentation" shows the "...Event" classes outside the wrench namespace, which is not correct. The "developer" documentation, however, shows these classes correctly inside the namespace. Not sure what's happening here....

Needed Development: Integration of BatchService and BatSched (from BATSIM)

Batsched integration milestones:

  • Make sure that the integration works with the updated Batsched protocol (waiting for confirmation from the batsched people that the protocol documentation on github is up to date)

  • Using the wrench or fast_conservative branch of Batsched on GitLab, implement the QUERY/ANSWER feature

  • Modify the current wrench::BatchService API to add a getQueueWaitingTimeEstimate() function. That function will handle all messaging with the batch service. Once that's done, remove the ServiceInformationMessage handling in the Job Manager.

Feature Request: scratch space for compute services?

It would be useful to have a notion of scratch space for each compute service. Motivation: Files can be implicitly deleted from scratch.

ComputeService:

  • The constructor should always have an optional argument: the scratch space size in bytes
    • if specified: create a storage service (not visible to the whole world) that is "attached" to the compute service
  • NO MORE Default Storage Service

StandardJob:

  • "pre file copies":
    • Copies CAN BE to scratch (if there is some), even though scratch is not visible from the outside
  • tasks:
    • If a task is told to read/write a file from a particular storage service, then fine
    • If not, it looks for the file / creates it in scratch (if there is some)
  • "post file copies":
    • Copies CAN BE from scratch (if there is some), even though scratch is not visible from the outside
  • file deletions:
    • Explicit in whatever storage service, fine
    • Implicit in scratch (if there is some), but NOT for a standard job within a PILOT job

==== UGLY IMPLEMENTATION OPTION ====

{File, StorageService*, StorageService*}
{File, StorageService*, ComputeService::SCRATCH}

// In ComputeService: a sentinel pointer value meaning "use the scratch space"
// (a #define cannot carry a scoped name, hence a static constant)
static StorageService *const SCRATCH = (StorageService *)(unsigned long) 666;

StandardJobExecutor:
...
StorageService *src = std::get<1>(copy);
StorageService *dst = std::get<2>(copy);
if (dst == ComputeService::SCRATCH) {
  if (this->compute_service->hasScratch()) {
    dst = this->compute_service->getScratch();
  } else {
    // throw an exception
  }
}

Something to watch out for: using addresses as search keys can cause an ABA address-recycling bug

In many parts of the code we use addresses of objects to search for their presence in lists. This is susceptible to the ABA address-recycling bug. For instance:

  • I allocate a StandardJob, which has address 0xAAAA
  • I start an Alarm to let me know that that job has expired
  • The job completes normally well ahead of the expiration
  • I allocate ANOTHER StandardJob, which ALSO has address 0xAAAA because the heap allocator reuses the same location
  • That other job runs, and at some point I get the "job expired" message from the Alarm for job 0xAAAA

In this way, I am mistaking an "old message that I should ignore" for an "oh no, a job has expired" message.

The way to fix this: create a unique sequence number for each StandardJob (a static variable inside the constructor that gets incremented). Then, before sending the message, the Alarm could, for instance, check that the sequence number of the job at address 0xAAAA has not changed. Or, the message could be sent regardless, and the recipient of the message would then do the check. In essence, the check is: "yes, there is a job at that address you're telling me about, but let me check whether it's really the job you mean".
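The sequence-number fix can be sketched as follows (names hypothetical): each job records a unique id at construction, the Alarm records both the pointer and the sequence number, and the "job expired" message is acted upon only if both still match.

```cpp
// Sketch of the proposed fix for the ABA address-recycling bug
struct Job {
  unsigned long sequence_number;
  Job() {
    static unsigned long next = 0;  // incremented on every construction
    sequence_number = next++;
  }
};

// Check done by the recipient of the "job expired" message: is the job
// currently at this address really the job the Alarm meant?
bool isSameJob(const Job *job_at_address, unsigned long recorded_seq) {
  return job_at_address != nullptr &&
         job_at_address->sequence_number == recorded_seq;
}
```

Even if the heap allocator hands the second job the first job's old address, its sequence number differs, so the stale expiration message is recognized and ignored.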

BatchServiceTesting: FIRSTFIT, BESTFIT

Simplify the tests by relying on the WorkflowTask::getExecutionHost() method instead of reverse-engineering the schedule based on task completion times (just like what's done for ROUNDROBIN).

Feature Request: Memory specifications

I've been looking for a way to specify the amount of main memory per compute node, but haven't found one in here (like in SimGrid). Am I missing something, or is this intentional?

Workflow::getReadyClusters() Weirdness?

For some reason I looked at the code for Workflow::getReadyClusters(). I am a bit puzzled by this method and don't quite understand it (I never actually used the "cluster" feature). One thing that caught my eye first is that it calls setInternalState() and setState(). That seems really against our overall design: state updates are made by the services, the job manager, and by the WMS itself in waitForNextExecutionEvent(). The "get ready tasks" methods should just look at states, not update them. I am pasting the method below.

The last else clause in this method reads as follows:

   } else {
      if (task_map.find(task->getClusterID()) != task_map.end()) {
        if (task->getState() == WorkflowTask::State::NOT_READY) {
          task->setInternalState(WorkflowTask::InternalState::TASK_READY);
          task->setState(WorkflowTask::State::READY);
        }
        task_map[task->getClusterID()].push_back(task);
      }
    }

I have no idea why we need to do anything in that else clause in the first place, let alone why it does what it does... I commented out the entire else clause and all tests and examples run fine (but then, we don't use this method a lot).

@rafaelfsilva I believe you implemented this method? what do you think?

Consistent use of ComputeService::ALL_RAM and ComputeService::ALL_CORES

In class ComputeService we have the convenient constants ALL_RAM and ALL_CORES to specify "on that host use all RAM" and "on that host use all cores". This is used throughout the WRENCH code, and documented, but I just noticed that it's not used everywhere. For instance, the VirtualizedClusterService class is still using the "old way" of zero meaning "all". We should fix this before the release...
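The convention can be illustrated as below (the actual WRENCH constant values may differ; this sketch assumes max-value sentinels): explicit sentinel constants replace the old "zero means all" convention, which is both more readable and unambiguous when zero is a legal value.

```cpp
#include <limits>

// Illustrative sentinel constants (actual WRENCH values may differ)
struct ComputeServiceConstants {
  static constexpr unsigned long ALL_CORES =
      std::numeric_limits<unsigned long>::max();
  static constexpr double ALL_RAM = std::numeric_limits<double>::max();
};

// New style: an explicit sentinel, instead of 0 meaning "all cores"
unsigned long resolveNumCores(unsigned long requested, unsigned long available) {
  return (requested == ComputeServiceConstants::ALL_CORES) ? available
                                                           : requested;
}
```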

Feature Request: Ability to simulate multiple workflows with arrival times

At the moment, WRENCH only simulates one workflow execution. Users (e.g., Eddy Caron) have requested a much more powerful model in which multiple workflows can arrive dynamically throughout the simulation. This requires some software engineering (and likely some thought). Furthermore, there should be the possibility of multiple WMS instances running concurrently, OR a single WMS instance managing multiple arriving workflows.

Before we get there we need:

  1. Take the "shutdown all services" functionality out of the WMS (e.g., create a Terminator service that is given some termination condition by the user)
  2. Make it possible to create a WMS that runs on a constrained set of services
  3. Add a "start time" to a WMS
  4. At this point it should be easy to have concurrent WMS instances
  5. Then think about the WMS that can handle a stream of workflows

Problem with Workflow::loadFromJSON on Mac?

I haven't had time to look into it, but one of our users has written a small simulator, and loadFromJSON works on Linux, but not on Mac. I am attaching the JSON file that causes problems (I had to rename it .json.txt so that GitHub would allow me to attach it).
E1S51u.json.txt

Add timeouts to service API functions

In the design of most services, the API functions that "use" the service are as follows:
A) Check that the service is up
B) Send a message
C) Wait for a reply

It seems that:
A) is missing in some cases [TODO: add it]
B) is sometimes asynchronous, but synchronous is better [TODO: fix it]
C) often has no timeout (and thus may hang if the service has been killed in the meantime, which is a "feature" for a dumb implementation, but should likely be considered a bug) [TODO: add Service::setTimeout() and Service::getTimeout() methods!]
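A sketch of the proposed setTimeout()/getTimeout() API and of the A/B/C pattern above (the reply-waiting is simulated here with a delay parameter; a real implementation would wait on an S4U mailbox with a timeout):

```cpp
// Sketch of the proposed Service timeout API; 0 means "wait forever"
struct ServiceSketch {
  double network_timeout = 0;  // seconds
  void setTimeout(double t) { network_timeout = t; }
  double getTimeout() const { return network_timeout; }
};

// The A/B/C steps: check up, send synchronously, wait with a timeout.
// Returns true iff the interaction succeeds.
bool useService(ServiceSketch &s, bool service_is_up, double reply_delay) {
  if (!service_is_up) return false;  // A) check that the service is up
  // B) send the request message synchronously (elided)
  // C) wait for the reply, but only up to the configured timeout
  if (s.getTimeout() > 0 && reply_delay > s.getTimeout()) return false;
  return true;
}
```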

Batsched compilation optional?

Would it be useful/convenient to make the Batsched integration optional? There are so many dependencies that users who don't need Batsched still have to install many packages. Perhaps we don't care, though. Not a huge deal either way, I guess.

Task states updated before notifications are received

As I am writing WRENCH-based simulators, I am noticing something: task states are updated before notifications are received. Task states are tricky, which is why, a while back, I split the task state into "state" and "internal state". This was because, e.g., when a compute service sets a task state to completed, from the WMS's perspective the task is still pending until a notification is sent. This has made things much easier, but now another, similar issue is coming up. Here is a scenario:

  1. A compute service sends back a "job done" notification to a job manager for task T
  2. The job manager gets the notification and updates (non-internal) task states (i.e., T is now COMPLETED and some of T's children become READY)
  3. The job manager sends the notification to the WMS which will be an event

In the meantime, after 3) above but before the WMS calls waitForAndProcessNextEvent(), the WMS may do something like: "hmmm... what tasks are ready again?" By looking at task states, it will see some of T's children as ready. It may even see T as completed. And then later, it will be told "task T has completed", although it already knew that because it happened to look at task states on its own.

So far, in the simulators I've written, this has made the output look weird (which caused me to wonder "how could T's child be ready when T hasn't completed yet?", because I was only printing a "task completed" message upon receiving an actual event). For instance, my output could have been, for a T1->T2 workflow:

  • Submitting T1
  • T2 is ready
  • Submitting T2
  • T1 has completed!

which appears out-of-order, but is OK.

One question is: is this a bug or a feature?

I am thinking bug, because it seems more coherent to say that "task states cannot change arbitrarily in between job submissions/cancellations and calls to waitForAndProcessNextEvent()".

The fix wouldn't be super straightforward, since right now the logic in the Job Manager is, as mentioned above:

  1. Wait for a Job Completion (or Failure) message
  2. Update task states
  3. Send a message to the WMS (which will be caught by waitForAndProcessNextEvent())

So, now, 2) would have to happen in the waitForAndProcessNextEvent() method, which is awkward...

Anyway, something to discuss/think about. Distributed computing, even in simulation, is never easy, is it?

Desired Development: Decouple Service Creation from Service Start

In the current implementation, the constructor of a service also starts that service (i.e., it creates the S4U actor for it). This leads to a problem. For instance:

  • I create a WMS service
  • I launch the simulation
  • launch() throws, as it should, the exception "You should have at least one compute service"
  • I decide to terminate my program
  • SimGrid complains that there is a running actor (the WMS service)

The alternative is that the constructor of a service does not start the actor; a separate start() call is used instead. This way, launch() can first check that it has all it needs, and then start the services.

This seems like a better approach overall....
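The two-phase pattern can be sketched as follows (names hypothetical; start() stands in for creating the S4U actor): launch() validates the whole setup before any actor exists, so a thrown exception leaves nothing running.

```cpp
#include <stdexcept>
#include <vector>

// Sketch: the constructor only records configuration; start() would
// create the S4U actor.
struct ServiceTwoPhase {
  bool started = false;
  void start() { started = true; }  // actor creation happens here
};

void launch(std::vector<ServiceTwoPhase *> &wms_services,
            std::vector<ServiceTwoPhase *> &compute_services) {
  // Validate BEFORE starting anything, so a throw leaves no running actor
  if (compute_services.empty())
    throw std::runtime_error("You should have at least one compute service");
  for (auto *s : compute_services) s->start();
  for (auto *s : wms_services) s->start();
}
```

In the problematic scenario from the issue, the "at least one compute service" check now fires before the WMS actor exists, so terminating the program no longer triggers the SimGrid "running actor" complaint.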

Multiple namespaces?

It may be a good idea to have a namespace for "user" and a namespace for "developer".
