
hpccloud's Introduction

HPCCloud


Goal

Web interface to the HPCCloud infrastructure, which abstracts the simulation environments and the resources on which you can run those simulations.

Installation

Follow the instructions for HPCCloud-Deploy.

Development

$ git clone https://github.com/Kitware/HPCCloud.git
$ cd HPCCloud
$ npm install
$ npm start

Troubleshooting

(With the VM from HPCCloud-Deploy running)

$ vagrant ssh
$ sudo -iu hpccloud

Fixing celery Girder URL

$ vi /opt/hpccloud/cumulus/cumulus/conf/config.json
  +-> Fix host to be localhost
  +-> baseUrl: "http://localhost:8080/api/v1",
$ sudo service celeryd restart

Documentation

See the documentation in this repository for a getting started guide, advanced documentation, and workflow descriptions.

Licensing

HPCCloud is licensed under Apache 2.

Getting Involved

Fork our repository and do great things. At Kitware, we've been contributing to open-source software for 15 years and counting, and we want to make hpc-cloud useful to as many people as possible.

hpccloud's People

Contributors

aronhelser, chetnieter, cjh1, jourdain, notiv-nt, patrickoleary, psavery, sbhtw, scottwittenburg, tristanwright

hpccloud's Issues

In the simulation start dialog, cap the number of instances in the cluster

We have the following two endpoints:

GET /user/{id}/aws/profiles/{profileId}/maxinstances
GET /user/{id}/aws/profiles/{profileId}/runninginstances

The maximum number of instances that a user can create in a cluster should be the result of the first request minus the result of the second.
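As a rough sketch of the arithmetic above: the helper names below are hypothetical, and the two numbers are assumed to have already been extracted from the responses of the `maxinstances` and `runninginstances` endpoints (their response shapes are not given here).

```javascript
// Sketch only: `maxInstances` and `runningInstances` are assumed to be plain
// numbers already read from the two endpoint responses.
function availableInstances(maxInstances, runningInstances) {
  // Never report a negative cap if the profile is already at or over its limit.
  return Math.max(0, maxInstances - runningInstances);
}

// Clamp the cluster size typed into the dialog to the range [1, cap].
// Returns 0 when no instance can be launched at all.
function clampClusterSize(requested, cap) {
  if (cap < 1) {
    return 0;
  }
  return Math.min(Math.max(1, requested), cap);
}
```

The dialog could call `clampClusterSize` whenever the size field changes, so the user can never request more than the remaining quota.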

The stderr and stdout files from SGE should be viewable when a simulation is complete

When a simulation is complete, these files are uploaded to Girder. We should provide a mechanism for viewing them in the client. They will be particularly important for debugging setup issues on a traditional cluster. For example, this is the output of a job where the hydra executable was incorrectly defined:

--------------------------------------------------------------------------
mpiexec was unable to launch the specified application as it could not access
or execute an executable:

Executable: /hydra
Node: ulmus

while attempting to start process rank 0.
--------------------------------------------------------------------------

Register button is a little confusing

I keep clicking on it when trying to log in. The arrow icon should probably say 'Sign in' or something like that, and the register button should be a smaller link.

Add new parameter to trad sim run dialog

It should be called:

"Output directory"
This should be an optional full path to the directory where the output of a job will be written.

The value should be passed to /tasks/<task_id>/run as 'jobOuputDir'
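A minimal sketch of how the dialog might add the optional value to the run request. `buildRunParams` and `baseParams` are hypothetical names, the check for a leading `/` assumes a POSIX-style cluster filesystem, and the parameter key is spelled exactly as it appears above.

```javascript
// Hypothetical helper: builds the parameters sent to /tasks/<task_id>/run.
// `baseParams` stands for whatever the dialog already sends.
function buildRunParams(baseParams, outputDir) {
  var params = Object.assign({}, baseParams);
  var dir = (outputDir || '').trim();
  if (dir !== '') {
    // Assumption: "full path" means an absolute POSIX path.
    if (dir.charAt(0) !== '/') {
      throw new Error('Output directory must be a full path: ' + dir);
    }
    params.jobOuputDir = dir; // key copied verbatim from the issue text
  }
  return params;
}
```

Leaving the field empty sends no output directory at all, which keeps the parameter optional.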

AWS profile region restrictions

AWS creates a default VPC per region. By default, Starcluster creates its instances in this VPC so that the cluster instances are not accessible from the outside world. For this to work, Starcluster must be run from a machine that is in the VPC. Starcluster also has the option to assign public IPs to all nodes in the cluster, which removes this restriction, although potentially at a greater security risk. So what does this mean for the client? We need to expose a check box indicating whether to use public IPs on cluster nodes. This check box will behave differently depending on whether the HPCCloud stack is deployed on EC2.

EC2 deployment

  • Check box is checked - We configure Starcluster to use public IPs, and the full list of AWS regions is available to choose from.
  • Check box is unchecked - We have to restrict the usable regions to the region the stack has been deployed in, so that Starcluster has access to the default VPC. The region will be configured in a JSON file as part of the deployment process.

Non EC2 deployment

  • In this case the only option is to use public IP for every node as Starcluster will be running outside EC2.
  • The check box should be checked and disabled, and all regions will be available to the user to create clusters in.
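The two deployment cases above could be captured in one small function. This is only a sketch: `regionOptions` and its argument names are hypothetical, and `deploymentRegion` stands for the value read from the JSON file written at deployment time.

```javascript
// Sketch: decide the check box state and the selectable regions.
// Returns { checked, disabled, regions }.
function regionOptions(deployedOnEc2, usePublicIp, allRegions, deploymentRegion) {
  if (!deployedOnEc2) {
    // Starcluster runs outside EC2: public IPs are the only option.
    return { checked: true, disabled: true, regions: allRegions.slice() };
  }
  if (usePublicIp) {
    // Public IPs requested: any region can be used.
    return { checked: true, disabled: false, regions: allRegions.slice() };
  }
  // Private IPs: only the region hosting the stack can reach the default VPC.
  return { checked: false, disabled: false, regions: [deploymentRegion] };
}
```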

Look at supporting Google OAuth login

Girder has a plugin that supports this, and it might be nice to add support for it in the UI. I have done this for another project, so I have a bit of an idea of what we need to do.

Use SSE event for status updates

Once Kitware/cumulus#35 is merged, the backend will produce server-sent events for status updates, which should reduce the amount of polling that we need to do in the UI. There might be a nice Angular way of accessing these events, but here is some sample code in raw JavaScript.

        if (window.EventSource) {
            // The timeout is the number of seconds to wait without an event before closing the connection ...
            var url = girder.apiRoot + '/notification/stream?timeout=300';

            var eventSource = new window.EventSource(url);

            eventSource.onmessage = function (e) {
                var obj;
                try {
                    obj = window.JSON.parse(e.data);
                } catch (err) {
                    console.error('Invalid JSON from SSE stream: ' + e.data + ',' + err);
                    return;
                }
                /* Do something useful with the event. See description of event object below */
            };
        } else {
            console.error('EventSource is not supported on this platform.');
        }

The event object has the following form:

{
  "type": "<profile.status|job.status|cluster.status|task.status>",
  "data": {
    "_id": "<the id of the resource>",
    "status": "the new status"
  }
  // There are other properties, but these can be ignored for our purposes
}
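One possible way to "do something useful" with a parsed event of this form is to keep a map of the latest status per resource. The sketch below is an assumption about how the client might organize that state; `applyStatusEvent` and the `statuses` object are hypothetical names.

```javascript
// Hypothetical in-memory store: statuses[kind][id] = latest status string.
// Returns true when the event was a status update that has been applied.
function applyStatusEvent(statuses, event) {
  var match = /^(profile|job|cluster|task)\.status$/.exec(event && event.type);
  if (!match || !event.data || !event.data._id) {
    return false; // not a status event of the form described above
  }
  var kind = match[1];
  statuses[kind] = statuses[kind] || {};
  statuses[kind][event.data._id] = event.data.status;
  return true;
}
```

Calling this from the `onmessage` handler with the parsed object would let views read the current status from the map instead of polling for it.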

gulp serve throws an error

@TristanWright Did we ever resolve this issue?

[12:18:50] 'watch' errored after 196 ms
[12:18:50] TypeError: Cannot read property 'prototype' of undefined
    at Object.exports.inherits (util.js:556:43)
    at Object.<anonymous> (/home/cjh/work/source/HPCCloud/node_modules/browser-sync/node_modules/http-proxy/lib/http-proxy/index.js:111:17)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/home/cjh/work/source/HPCCloud/node_modules/browser-sync/node_modules/http-proxy/lib/http-proxy.js:4:17)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/home/cjh/work/source/HPCCloud/node_modules/browser-sync/node_modules/http-proxy/index.js:13:18)

Split RESTful endpoints into separate services

kw.girder.js is getting unreasonably large for a single service, much less a single file. We should split it up by endpoint (e.g. /user, /tasks, etc.) and inject the resulting services as they are needed. Further splitting them into Girder-specific and plugin-specific endpoints would help a lot too.

Add logging information to simulation

The task resource has a log property that we should expose to the user. The property is of the following form:

  "log": [
    {
      "$ref": "http://ulex:9001/api/v1/jobs/56030771f657105ab7913b18/log", ...
    }

This is basically an aggregate of the logs of all the resources created by the task, for example the cluster and the jobs that are submitted to it. So the pseudocode for tailing the task log would be as follows:

while task is running:
    fetch task
    for each log in task['log']:
        fetch log, providing the offset of the lines already fetched
        add the new lines to the log in the UI
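The pseudocode could translate into JavaScript roughly as follows. This is a sketch under stated assumptions: `fetchTask`, `fetchLog`, `onLines` and `sleep` are hypothetical callbacks supplied by the caller, the `'running'` status value is assumed, and how the offset is actually passed to the log endpoint is not specified here.

```javascript
// Sketch of the tailing loop. All four callbacks are hypothetical:
//   fetchTask(taskId)     -> Promise of the task object
//   fetchLog(url, offset) -> Promise of the entries after `offset`
//   onLines(entries)      -> appends entries to the log shown in the UI
//   sleep()               -> Promise that resolves after the polling delay
async function tailTaskLog(taskId, fetchTask, fetchLog, onLines, sleep) {
  var offsets = {}; // log URL -> number of entries already shown
  var running = true;
  while (running) {
    var task = await fetchTask(taskId);
    running = task.status === 'running'; // assumed status value
    var logs = task.log || [];
    for (var i = 0; i < logs.length; i++) {
      var url = logs[i].$ref;
      var entries = await fetchLog(url, offsets[url] || 0);
      if (entries.length > 0) {
        offsets[url] = (offsets[url] || 0) + entries.length;
        onLines(entries);
      }
    }
    if (running) {
      await sleep();
    }
  }
}
```

Unlike the pseudocode, the loop fetches the logs one last time after the task stops running, so that entries written just before completion are not lost.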

The log entries from the resources are of the following form:

{
      "args": [],
      "created": 1442957579.78141,
      "exc_info": null,
      "exc_text": null,
      "filename": "job.py",
      "funcName": "_tail_output",
      "levelname": "INFO",
      "levelno": 20,
      "lineno": 288,
      "module": "job",
      "msecs": 781.4099788665771,
      "msg": "Skipping tail of 5601c8e4f657105a8f7ce936/output/stat.txt as file doesn't currently exist",
      "name": "starcluster",
      "pathname": "/home/cjh/work/source/cumulus/cumulus/starcluster/tasks/job.py",
      "process": 22983,
      "processName": "Worker-2",
      "relativeCreated": 1952257.1358680725,
      "thread": 139862483904320,
      "threadName": "MainThread"
}

For now I think we want to present the following information:

With created converted to a human-readable format. It might be nice to allow a user to double-click on an entry, which would reveal all the log information.
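A small sketch of the `created` conversion. The log entries have the shape of Python logging records, where `created` is seconds since the Unix epoch. Which fields to display is not settled above, so showing the timestamp, `levelname` and `msg` is only an assumption, and `formatLogEntry` is a hypothetical name.

```javascript
// Sketch: render one log entry as a single line of text.
function formatLogEntry(entry) {
  // `created` is in seconds since the Unix epoch; Date expects milliseconds.
  var when = new Date(entry.created * 1000).toISOString();
  return when + ' [' + entry.levelname + '] ' + entry.msg;
}
```

The full entry object can be kept alongside the formatted line so that a double-click can reveal the remaining fields.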
