wrench-project / wrench
WRENCH: Cyberinfrastructure Simulation Workbench
Home Page: https://wrench-project.org
License: GNU Lesser General Public License v3.0
It seems that if one asks for a queue wait time prediction multiple times with the same key, then we get a "job already in the system" error. For instance:
Assertion '_jobs.count(job_id) == 0' failed (ERROR)
in file json_workload.cpp, line 62
function: void Workload::add_job_from_json_object(const Value &, const string &, double)
with message: Job 'config_XXXX' already exists in the Workload
If we generate distinct keys, then we don't get that error. So it's as if those prediction jobs are actually inserted into the workload... WRENCH issue? BATSCHED issue?
Update the implementation of the SimpleStorageService so that it uses the Storage abstraction provided by S4U, whenever available/documented. One issue to pay attention to is the pipelining of network transfers and disk writes (a "store-and-forward" approach is really not realistic).
What simulation "events" to add:
One thing is that, with our current design, we will sprinkle "add timestamp" calls everywhere in the code... is there a better way?
We should also augment the WorkflowTask object to keep track of detailed time info. For starters:
If one doesn't use task clusters, it's annoying to get a map keyed by cluster IDs when doing a Workflow::getReadyTasks(). So we should simply have getReadyTasks() return a vector of tasks, and getReadyClusters() return a map of cluster IDs. A "ready cluster" is a cluster that contains only ready tasks.
At the moment, every compute service has two boolean arguments ("supports pilot jobs", "supports standard jobs"), a double argument (the scratch space size), and a plist. Wouldn't it make more sense to have ALL of these arguments be part of the plist? (So that a compute service would only have a hostname and a list of compute resources as arguments, and then an optional plist.)
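A minimal sketch of what folding those arguments into the plist could look like. The class and property names below are made up for illustration; the point is that defaults are filled in for anything the caller does not set, so the constructor signature shrinks to hostname, resources, and an optional plist.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical sketch, not the real WRENCH ComputeService class.
class SketchComputeService {
public:
    SketchComputeService(const std::string &hostname,
                         std::map<std::string, std::string> plist = {})
            : hostname(hostname), properties(std::move(plist)) {
        // std::map::insert does not overwrite existing keys, so these
        // only take effect when the caller did not set the property
        properties.insert({"SUPPORTS_STANDARD_JOBS", "true"});
        properties.insert({"SUPPORTS_PILOT_JOBS", "false"});
        properties.insert({"SCRATCH_SPACE_SIZE", "0"});
    }

    std::string getProperty(const std::string &key) const {
        auto it = properties.find(key);
        if (it == properties.end()) {
            throw std::invalid_argument("Unknown property: " + key);
        }
        return it->second;
    }

private:
    std::string hostname;
    std::map<std::string, std::string> properties;
};
```

One design question this raises is type safety: booleans and sizes become strings that need parsing and validation, which the real implementation would have to handle in one place.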
It would be good to augment the DataMovementManager and other components (e.g., job executors perhaps) with the option to do a combined "create/copy a file AND add an entry in the FileRegistryService". Similarly, when removing a file from a storage Service, it would be good to have a "remove and unregister". The objective is for a WMS developer who wants everything to be registered to not have to do tons of explicit separate register/unregister operations.
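A rough sketch of the combined operations described above, using stand-in FileRegistry and Storage types (not the real WRENCH classes); the helper names copyAndRegister/deleteAndUnregister are made up for illustration:

```cpp
#include <map>
#include <set>
#include <string>

// Stand-in for the FileRegistryService: file -> set of locations
struct FileRegistry {
    std::map<std::string, std::set<std::string>> entries;
    void addEntry(const std::string &file, const std::string &location) {
        entries[file].insert(location);
    }
    void removeEntry(const std::string &file, const std::string &location) {
        entries[file].erase(location);
    }
};

// Stand-in for a storage service
struct Storage {
    std::string name;
    std::set<std::string> files;
};

// One call instead of separate "copy" + "register" operations
void copyAndRegister(const std::string &file, Storage &src, Storage &dst,
                     FileRegistry &registry) {
    if (src.files.count(file)) {
        dst.files.insert(file);            // the copy itself
        registry.addEntry(file, dst.name); // the bookkeeping the WMS would
    }                                      // otherwise do by hand
}

// The symmetric "remove and unregister"
void deleteAndUnregister(const std::string &file, Storage &ss,
                         FileRegistry &registry) {
    ss.files.erase(file);
    registry.removeEntry(file, ss.name);
}
```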
First step:
It would be nice at some point to implement the Vivaldi system as a Network Proximity Service so that it can be used out of the box by WRENCH developers.
At the moment, the way in which the BatchService is handling RAM is strange:
[DONE] Updated the constructor to handle heterogeneity
We need to clarify semantics for scratch space. Here is the proposal:
So, in a nutshell, we need to extend the StorageService and/or SimpleStorageService API and implementation to include a "temp directories" abstraction, to be defined.
In the current code, there is very little use of the const keyword, even though this is really a great feature of C++. As we go forward, adding const here and there is a good thing.
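A small example of the kind of const-correctness worth adding (illustrative names, not actual WRENCH code): accessors that do not modify state declared const, and reference parameters marked const unless mutated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class Task {
public:
    explicit Task(const std::string &id) : id(id) {}

    // const member function: callable on const Task objects,
    // and promises not to modify the task
    const std::string &getId() const { return id; }

private:
    std::string id;
};

// const reference parameter: promises not to modify the vector
std::size_t countTasks(const std::vector<Task> &tasks) {
    return tasks.size();
}
```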
The "internal documentation" shows the "...Event" classes outside the wrench namespace, which is not correct. The "developer" documentation, however, shows these classes correctly inside the namespace. Not sure what's happening here....
Should the number of concurrent connections be a constructor argument (as it is now) or just an optional property (default: unlimited)? Although I implemented the former, I now think the latter is better.
We should evolve WRENCH so that it exposes the "energy" functionality in SimGrid
Evolve VirtualizedClusterService so that it:
Batsched integration milestones:
Make sure that the integration works with the updated Batsched protocol (waiting for confirmation from the batsched people that the protocol documentation on github is up to date)
Using the wrench or fast_conservative branch of Batsched on GitLab, implement the QUERY/ANSWER feature.
Modify the current wrench::BatchService API to add a getQueueWaitingTimeEstimate() function. That function will handle all messaging with the batch service. Once that's done, remove the ServiceInformationMessage handling in the Job Manager.
It would be useful to have a notion of scratch space for each compute service. Motivation: Files can be implicitly deleted from scratch.
ComputeService:
StandardJob:
"pre file copies"
- Copies CAN be to scratch (if there is some), even though scratch is not visible from the outside
tasks:
- If a task is told to read/write a file from a particular SS, then fine
- If not, it looks for it / creates it in the scratch (if there is some)
file deletions:
==== UGLY IMPLEMENTATION OPTION ===
{File, StorageService*, StorageService*}
{File, StorageService*, ComputeService::Scratch}
#define ComputeService::Scratch ((StorageService *)((unsigned long) 666))
StandardJobExecutor:
...
StorageService *src = std::get<1>(copy);
StorageService *dst = std::get<2>(copy);
if (dst == ComputeService::Scratch) {
    if (this->compute_service->hasScratch()) {
        dst = this->compute_service->getScratch();
    } else {
        // EXCEPTION
    }
}
so that we no longer have to use flops=1 as the reference speed, which is confusing
At the moment, BatchService implements FIRST_FIT and BEST_FIT. We should augment this with a ROUND_ROBIN option that will spread jobs across hosts as much as possible, which is something some real systems do.
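A sketch of what a ROUND_ROBIN policy could look like: keep a cursor and hand out hosts in cyclic order, so consecutive jobs land on different hosts. The class and method names are illustrative, not the actual BatchService internals (which would also have to account for per-host core/RAM availability).

```cpp
#include <cstddef>
#include <string>
#include <vector>

class RoundRobinAllocator {
public:
    // Precondition: hosts is non-empty
    explicit RoundRobinAllocator(std::vector<std::string> hosts)
            : hosts(std::move(hosts)), next(0) {}

    // Returns the next host in cyclic order, spreading jobs across hosts
    const std::string &pickHost() {
        const std::string &h = hosts[next];
        next = (next + 1) % hosts.size();
        return h;
    }

private:
    std::vector<std::string> hosts;
    std::size_t next;
};
```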
In many parts of the code we use addresses of objects to search for their presence in lists. This is susceptible to the ABA address-recycling bug. For instance:
In this way, I am mistaking an "old message that I should ignore" for an "oh no, a job has expired" message.
The way to fix this: create a unique sequence number for each StandardJob (static variable inside the constructor that gets incremented). Then, before sending the message, the Alarm could, for instance, check that the sequence number of the job at address 0xAAAAA has not changed. Or, the message could be sent regardless, and the recipient of the message would then do the check. In essence, the check is: "yes, there is a job at that address you're telling me about, but let me check whether it's really the job you mean".
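The sequence-number check described above could be sketched as follows (illustrative names, not the real WRENCH classes): each job draws an id from a static counter, a message carries both the job pointer and the sequence number captured at send time, and the recipient drops the message if they no longer match (i.e., the address was recycled).

```cpp
#include <cstdint>

class StandardJob {
public:
    // Each job gets a unique, ever-increasing sequence number
    StandardJob() : sequence_number(next_sequence_number++) {}
    std::uint64_t getSequenceNumber() const { return sequence_number; }

private:
    static std::uint64_t next_sequence_number;
    std::uint64_t sequence_number;
};

std::uint64_t StandardJob::next_sequence_number = 0;

struct JobExpiredMessage {
    StandardJob *job;
    std::uint64_t job_sequence_number; // captured when the message was sent
};

// "Yes, there is a job at that address, but is it the job you mean?"
bool messageRefersToLiveJob(const JobExpiredMessage &msg) {
    return msg.job->getSequenceNumber() == msg.job_sequence_number;
}
```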
Implement:
--help
--help-simgrid
--version
handle the differences between WRENCH help and SimGrid help.
We need to add a creation overhead property
Simplify the tests by relying on the WorkflowTask::getExecutionHost() method instead of reverse-engineering the schedule based on task completion times (just like what's done for ROUNDROBIN).
We need to develop pedagogic modules that can be used stand-alone and/or integrated into university courses to teach concepts related to workflows, HPC, and distributed systems.
And then enable the TerminateStandardJobsTest in the BATSCHED case
I've been looking for a way to specify the amount of main memory per compute node, but haven't found one in here (like in SimGrid). Am I missing something, or is this intentional?
For some reason I looked at the code for Workflow::getReadyClusters(). I am a bit puzzled by this method and don't quite understand it (I never actually used the "cluster" feature). One thing that caught my eye first is that it calls both setInternalState() and setState(). That seems really against our overall design. The state updates are made by the services, the job manager, and by the WMS itself in waitForNextExecutionEvent(). Instead, the "get ready tasks" methods should just look at states, not update them. I am cut-and-pasting the method below.
The last else clause in this method is as:
} else {
    if (task_map.find(task->getClusterID()) != task_map.end()) {
        if (task->getState() == WorkflowTask::State::NOT_READY) {
            task->setInternalState(WorkflowTask::InternalState::TASK_READY);
            task->setState(WorkflowTask::State::READY);
        }
        task_map[task->getClusterID()].push_back(task);
    }
}
I have no idea why we need to do anything in that else in the first place, and definitely not what's in there... I commented out this entire else clause and all tests and examples run fine (but then, we don't use this method a lot).
@rafaelfsilva I believe you implemented this method? what do you think?
S4U does/will provide a way to daemonize actors, which could be used to simplify the WRENCH code a bit. For instance, the Alarm services?
[Formerly: We should evolve WRENCH so that it handles host failures as dictated by SimGrid availability traces]
Was changed to a SimGrid bug issue, and copied to a general issue.
In class ComputeService we have the convenient constants ALL_RAM and ALL_CORES to specify "on that host use all ram" and "on that host use all cores". This is used throughout the WRENCH code, and documented, but I just noticed that it's not used everywhere. For instance, in the VirtualizedClusterService class, we're still using the "old way" of using zero to mean "all". We should fix this before the release...
At the moment, WRENCH only simulates one workflow execution. Users (e.g., Eddy Caron) have requested a much more powerful model in which multiple workflows can arrive dynamically throughout the simulation. This requires some software engineering (and likely some thought). Furthermore, there should be the possibility of multiple WMS instances running concurrently, OR a single WMS instance managing multiple arriving workflows.
Before we get there we need:
I haven't had time to look into it, but one of our users has written a small simulator, and the loadFromJSON works on Linux, but not on Mac. I am attaching here the JSON file that causes problems. (I had to rename it .json.txt so that GitHub would allow me to attach it.)
E1S51u.json.txt
In the design of most services, the API functions that "use" the service are as follows:
A) Check that the service is up
B) Send a message
C) Wait for a reply
It seems that:
A) is missing in some cases [TODO: add it]
B) is sometimes asynchronous, but synchronous is better [TODO: fix it]
C) is often without a timeout (and thus may hang if the service has been killed in the meantime, which is a "feature" for a dumb implementation, but should likely be a bug) [TODO: add Service::setTimeout() and Service::getTimeout() methods!]
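A minimal sketch of what the fixed A/B/C pattern could look like, using a stand-in Service class (not the real WRENCH one). The setTimeout()/getTimeout() names follow the TODO above; the reply callback stands in for blocking on the actual reply mailbox.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

class Service {
public:
    // replyFn stands in for waiting on the reply mailbox; it returns
    // false when no reply arrives within the timeout
    explicit Service(std::function<bool(std::string &)> replyFn)
            : getReply(std::move(replyFn)) {}

    void setTimeout(double seconds) { timeout = seconds; }
    double getTimeout() const { return timeout; }
    bool isUp() const { return up; }
    void shutDown() { up = false; }

    std::string query(const std::string &request) {
        // A) check that the service is up
        if (!isUp()) {
            throw std::runtime_error("cannot query: service is down");
        }
        // B) send the request (synchronously), then
        // C) wait for the reply, but for at most getTimeout() seconds
        std::string reply;
        if (!getReply(reply)) {
            throw std::runtime_error("no reply within "
                                     + std::to_string(timeout) + " seconds");
        }
        return reply;
    }

private:
    std::function<bool(std::string &)> getReply;
    double timeout = 30.0; // arbitrary default
    bool up = true;
};
```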
Would it be useful/convenient to make the Batsched integration optional? This is because there are so many dependencies and users who don't need Batsched then have to install so many packages. Perhaps we don't care though. Not a huge deal either way I guess.
As I am writing WRENCH-based simulators, I am noticing something: task states are updated before notifications are received. Task states are tricky, which is why, a while back, I split the task state into "state" and "internal state". This was because, e.g., when a compute service sets a task state to completed, from the WMS's perspective the task is still pending until a notification is sent. This has made things much easier, but now another, similar issue is coming up. Here is a scenario:
In the meantime, after 3) above but before the WMS does a waitForAndProcessNextEvent(), the WMS is doing something like: "hmmm... what tasks are ready again?" And by looking at task states, it will see some of T's children as ready. It may even see T as completed. And then later, it will be told "task T has completed", although it already knew that because it happened to look at the task states on its own.
So far, in the simulators I've written, it's been weird in terms of the output I see (which caused me to wonder: "how could T's child be ready when T hasn't completed yet?", because I was only printing a "task completed" message upon receiving an actual event). For instance, my output could have been, for a T1->T2 workflow:
which appears out-of-order, but is ok.
One question is : is this a bug or a feature?
I am thinking bug, because it seems more coherent to say that "task states cannot change arbitrarily in between job submissions/cancellations and calls to waitForAndProcessNextEvent()".
The fix wouldn't be super straightforward, since right now the logic in the Job Manager is, as mentioned above:
So, now, 2) has to happen in the waitForAndProcessNextEvent() method, which is awkward...
anyway, something to discuss/think about. Distributed computing, even in simulation, is never easy, is it?
It would be useful for the FileRegistryService to have an option to not just get the list of replicas for a file, but instead to pick one based on whatever network proximity services are running, if any.
After all, this would make things very consistent, and would be pretty simple.
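A sketch of a "lookup and pick the closest replica" variant: consult a proximity estimate and return the location closest to a reference host. The proximity table below is a stand-in for whatever NetworkProximityService instances are running; the function name is made up for illustration.

```cpp
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns the replica location closest to referenceHost according to the
// given (host, host) -> distance table, or "" if no replica is known.
std::string pickClosestReplica(
        const std::vector<std::string> &replicaLocations,
        const std::string &referenceHost,
        const std::map<std::pair<std::string, std::string>, double> &proximity) {
    std::string best;
    double bestDistance = std::numeric_limits<double>::infinity();
    for (const auto &loc : replicaLocations) {
        auto it = proximity.find({referenceHost, loc});
        // Pairs with no proximity estimate are treated as infinitely far
        double d = (it != proximity.end())
                           ? it->second
                           : std::numeric_limits<double>::infinity();
        if (d < bestDistance) {
            bestDistance = d;
            best = loc;
        }
    }
    return best;
}
```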
For now, we should likely enforce homogeneity of hosts
Later, think about how to support heterogeneity
It would be useful to have a "Developer 101" page that provides a bit of guidance for people wanting to implement a WMS.
One day, S4U will provide a clean exception hierarchy, at which point we'll need to revisit/fix the low-level S4U calls in WRENCH so as to clean up and robustify our own code.
Simple objective: no longer have any use for the xbt_ex class/structure, only simgrid::*Exception classes.
In the current implementation, the constructor of a service also starts that service (i.e., it creates the S4U actor for it). This leads to a problem. For instance:
The alternative is that the constructor of a service does not start the actor. A separate start() call is used. This way, launch() can first check that it has all it needs, and then start the services.
This seems like a better approach overall....
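The two-phase approach could be sketched as follows (illustrative names, not the real WRENCH Service class): the constructor only records configuration, and start() validates it and then would create the S4U actor.

```cpp
#include <stdexcept>
#include <string>

class SketchService {
public:
    // Construction only records configuration; nothing runs yet
    explicit SketchService(std::string hostname)
            : hostname(std::move(hostname)) {}

    // launch() can validate before any service actually starts
    void start() {
        if (hostname.empty()) {
            throw std::invalid_argument("cannot start service: no hostname");
        }
        running = true; // a real implementation would create the S4U actor here
    }

    bool isRunning() const { return running; }

private:
    std::string hostname;
    bool running = false;
};
```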
It may be a good idea to have a namespace for "user" and a namespace for "developer".