
cyclecloud-lsf's Introduction

LSF

CycleCloud project for Spectrum LSF.

Azure CycleCloud is integrated with the LSF resource connector (LSF RC) as a resource provider.

See the IBM docs for details. https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_connector/lsf_rc_cycle_config.html

Prerequisites

IBM Spectrum LSF

This product requires LSF FP9 (532214).

To use the fully automated cluster or the VM image builder in this project, the LSF binaries and entitlement file must be added to the blobs/ directory:

  • lsf10.1_lnx310-lib217-x86_64-532214.tar.Z
  • lsf10.1_lnx310-lib217-x86_64.tar.Z
  • lsf10.1_lsfinstall_linux_x86_64.tar.Z
  • lsf_std_entitlement.dat

To use the install automation in this project, add these files (or the packages matching your kernel) to the blobs/ directory.
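As a quick pre-flight check, the file list above can be compared against the contents of blobs/. The sketch below is only an illustration: the function name `missing_blobs` is made up for this example, and the file names must be adjusted if you use different kernel packages.

```shell
# Print the name of every expected file that is missing from a blobs directory.
# Usage: missing_blobs [dir]   (dir defaults to ./blobs)
missing_blobs() {
    dir=${1:-blobs}
    for f in \
        lsf10.1_lnx310-lib217-x86_64-532214.tar.Z \
        lsf10.1_lnx310-lib217-x86_64.tar.Z \
        lsf10.1_lsfinstall_linux_x86_64.tar.Z \
        lsf_std_entitlement.dat
    do
        # Only report files that are absent; print nothing when all are present.
        [ -f "$dir/$f" ] || printf '%s\n' "$f"
    done
}
```

Running `missing_blobs blobs/` prints nothing when all four files are in place.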

Azure CycleCloud

This project requires running Azure CycleCloud version 7.7.4 or later.

Supported Scenarios

Externally Managed Master Node (Scenario 1)

The most common introductory approach is to manually configure LSF master nodes to work with the CycleCloud LSF cluster type. The cluster type is available in the CycleCloud new cluster menu. The CycleCloud LSF cluster type does not include any master nodes; they are assumed to be pre-existing resources.

This cluster also requires that the user create a VM image with LSF pre-installed in a slave configuration. To facilitate this, we supply some automation that uses Packer to create the image. These tools can be found in the /vm-image directory.

Fully Managed Cluster (Scenario 2)

CycleCloud has an example project that can deploy a fully managed LSF cluster to Azure.

LSF Configurations for CycleCloud Provider

LSF Resources for CycleCloud

The CycleCloud LSF cluster is designed to support a number of compute scenarios, including tightly-coupled MPI jobs, high-throughput parallel tasks, GPU-accelerated workloads, and low-priority virtual machines.

To enable these scenarios, Azure recommends configuring a number of custom shared resource types.

Add these resource definitions to lsf.shared:

   cyclecloudhost     Boolean  ()  ()  (instances from Azure CycleCloud)
   cyclecloudmpi      Boolean  ()  ()  (instances that support MPI placement)
   cyclecloudlowprio  Boolean  ()  ()  (instances that are low priority / interruptible from Azure CycleCloud)
   nodearray          String   ()  ()  (nodearray from CycleCloud)
   placementgroup     String   ()  ()  (id used to note locality of machines)
   instanceid         String   ()  ()  (unique host identifier)

A Special Note on PlacementGroups

Azure datacenters have Infiniband network capability for HPC scenarios. These networks, unlike the normal Ethernet, have limited span. The extent of each Infiniband network is described by a "PlacementGroup". If VMs reside in the same placement group and are special Infiniband-enabled VM types, then they will share an Infiniband network.

These placement groups necessitate special handling in LSF and CycleCloud.

Here is an example LSF template for CycleCloud from cyclecloudprov_templates.json:

{
  "templateId": "ondemandmpi-1",
  "attributes": {
    "nodearray": ["String", "ondemandmpi" ],
    "zone": [  "String",  "westus2"],
    "mem": [  "Numeric",  8192.0],
    "ncpus": [  "Numeric",  2],
    "cyclecloudmpi": [  "Boolean",  1],
    "placementgroup": [  "String",  "ondemandmpipg1"],
    "ncores": [  "Numeric",  2],
    "cyclecloudhost": [  "Boolean",  1],
    "type": [  "String",  "X86_64"],
    "cyclecloudlowprio": [  "Boolean",  0]
  },
  "maxNumber": 40,
  "nodeArray": "ondemandmpi",
  "placementGroupName": "ondemandmpipg1",
  "priority": 448,
  "customScriptUri": "https://aka.ms/user_data.sh",
  "userData" : "nodearray_name=ondemandmpi;placement_group_id=ondemandmpipg1"
}

The placementGroupName in this file can be anything but will determine the name of the placementGroup in CycleCloud. Any nodes borrowed from CycleCloud from this template will reside in this placementGroup and, if they're Infiniband-enabled VMs, will share an IB network.

Note that placementGroupName matches the host attribute placementgroup; this is intentional and necessary. Note also that placement_group_id is set in userData so that user_data.sh can use it at host start time. The additional ondemandmpi attribute is used to prevent this job from matching on hosts where placementGroup is undefined.

We advise this template be used with a RES_REQ as follows:

-R "span[ptile=2] select[nodearray=='ondemandmpi' && cyclecloudmpi] same[placementgroup]" my_job.sh

Inspect cyclecloudprov_templates.json and user_data.sh to see how GPU jobs, both MPI and parallel, can be supported. For example, for an MPI job:

-R "span[ptile=1] select[nodearray=='gpumpi' && cyclecloudmpi] same[placementgroup]" -gpu "num=1:mode=shared:j_exclusive=yes"

or for a parallel job (no placement group needed):

-R "select[nodearray=='gpu' && !cyclecloudmpi]" -gpu "num=2:mode=shared:j_exclusive=yes"
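The requirement strings above follow two patterns: MPI jobs pin all hosts to one placement group, while parallel jobs exclude MPI hosts. As a rough sketch, a small helper can build either form. The function name `res_req` and its arguments are hypothetical and not part of this project.

```shell
# Print an LSF resource requirement string for a nodearray.
# $1: nodearray name
# $2: "mpi" to keep all hosts in one placement group; anything else excludes MPI hosts
# $3: processes per host (ptile), used for MPI only; defaults to 1
res_req() {
    nodearray=$1 mode=$2 ptile=${3:-1}
    if [ "$mode" = mpi ]; then
        printf "span[ptile=%s] select[nodearray=='%s' && cyclecloudmpi] same[placementgroup]\n" \
            "$ptile" "$nodearray"
    else
        printf "select[nodearray=='%s' && !cyclecloudmpi]\n" "$nodearray"
    fi
}
```

A submission could then look like `bsub -n 4 -q ondemandmpi -R "$(res_req ondemandmpi mpi 2)" my_job.sh`.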

Additional LSF Template Attributes for CycleCloud

The only strictly required attributes in an LSF template are:

  • templateId
  • nodeArray

Others are inferred from the CycleCloud configuration, can be omitted, or aren't necessary at all.

  • imageId - Azure VM image, e.g. "/subscriptions/xxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxxxxx/resourceGroups/my-images-rg/providers/Microsoft.Compute/images/lsf-execute-201910230416-80a9a87f". Overrides the CycleCloud cluster configuration.
  • subnetId - Azure subnet, e.g. "resource_group/vnet/subnet". Overrides the CycleCloud cluster configuration.
  • vmType - e.g. "Standard_HC44rs". Overrides the CycleCloud cluster configuration.
  • keyPairLocation - e.g. "~/.ssh/id_rsa_beta". Overrides the CycleCloud cluster configuration.
  • customScriptUri - e.g. "http://10.1.0.4/user_data.sh". No script is run if not specified.
  • userData - e.g. "nodearray_name=gpumpi;placement_group_id=gpumpipg1". Empty if not specified.

Environment Variables for user_data.sh

CycleCloud/LSF automatically sets certain variables in the run environment of user_data.sh. These variables are:

  • rc_account
  • template_id
  • providerName (default: cyclecloud)
  • clustername
  • cyclecloud_nodeid
  • anything specified in the userData template attribute.
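To illustrate how these variables might be consumed, the sketch below turns nodearray_name and placement_group_id (both taken from the userData example earlier) into a host resource string. It assumes hosts advertise their attributes through the LSF_LOCAL_RESOURCES setting in lsf.conf, and that a host with a placement group should also carry cyclecloudmpi. The function name is hypothetical, and the user_data.sh shipped with this project may do this differently.

```shell
# Sketch: derive a value for LSF_LOCAL_RESOURCES from the user_data.sh environment.
# Assumption: "[resource name]" marks a Boolean resource and
# "[resourcemap value*name]" assigns a string value to a resource.
build_local_resources() {
    res="[resource cyclecloudhost]"
    if [ -n "${nodearray_name:-}" ]; then
        res="$res [resourcemap ${nodearray_name}*nodearray]"
    fi
    if [ -n "${placement_group_id:-}" ]; then
        # Assumption: hosts in a placement group are the MPI-capable ones.
        res="$res [resource cyclecloudmpi] [resourcemap ${placement_group_id}*placementgroup]"
    fi
    printf '%s\n' "$res"
}
```

Hosts started without a placement_group_id get neither cyclecloudmpi nor placementgroup, which is what keeps MPI jobs from matching them.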

Setup involving LSF prerequisites

  • Choose an LSF install location; eg. LSF_TOP=/grid/lsf and use throughout.
  • Create a VM image with LSF installed
    • Add installers and entitlement file to the /blobs directory.
    • Follow instructions found in the vm_image directory.
  • Configure the cyclecloud host provider on the LSF Master.
  • Edit user_data.sh script to appropriately set MASTER_LIST.
  • Host the updated script at a URL that allows anonymous access. An Azure Storage Account in public mode works well.

Setup Cluster in CycleCloud

  • Create an LSF cluster in the CycleCloud UI
    • Along with VM types, Networking, and ImageId, set the LSF_TOP for the execute nodes when configuring.
  • Start the cluster
  • Restart mbatchd on the master node and LSF should be integrated with the CycleCloud cluster.
  • Start a job requesting resources from cyclecloudprov_templates.json

Setup the Fully-Managed LSF Cluster Type

This repo contains the cyclecloud project. The fully-managed LSF cluster is a completely automated cluster which will start a filesystem for LSF_TOP, high-availability LSF master nodes, as well as all the LSF configuration files, and worker nodes.

The cluster template for this scenario is lsf-full.txt. To prepare the environment to run this cluster:

  1. Copy LSF installers into the blobs/ directory.
  2. Upload the LSF binaries to the CycleCloud locker, e.g.: pogo sync blobs/ az://<storage-account>/cyclecloud/blobs/lsf/
  3. Import the cluster as a service offering: cyclecloud import_cluster LSF-full -c lsf -f lsf-full.txt -t
  4. Add the cluster to your managed cluster list in the CycleCloud UI with the +add cluster button.
  5. Follow the configuration menu, save the cluster and START it.
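Steps 2 and 3 can be wrapped in a small script so they are easy to review and repeat. This is a sketch only: the `run` and `prepare_lsf_full` helpers and the DRY_RUN switch are made up for this example, and the storage account name is a placeholder you must supply.

```shell
# Run a command, or just print it when DRY_RUN=1 (the default in this sketch).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Upload the blobs to the locker and import the cluster template.
# $1: name of the storage account backing the CycleCloud locker
prepare_lsf_full() {
    storage_account=$1
    run pogo sync blobs/ "az://${storage_account}/cyclecloud/blobs/lsf/"
    run cyclecloud import_cluster LSF-full -c lsf -f lsf-full.txt -t
}
```

Call `prepare_lsf_full <storage-account>` first to inspect the commands, then set DRY_RUN=0 to execute them.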

NOTE : to avoid race conditions in HA master setup, transient software installation failures with recovery are expected.

NOTE : cyclecloudprov_templates.json is not automatically updated. The automation will initialize this file, but if you change the machine type then the host attributes (mem, ncpus, etc) will need to be updated and mbatchd restarted.

Submit jobs

Once the cluster is running you can log into one of the master nodes and submit jobs to the scheduler:

  1. cyclecloud connect master-1 -c my-lsf-cluster
  2. bsub sleep 300
  3. You'll see an ondemand node start up and prepare to run jobs.
  4. When the job queue is cleared, nodes will autoscale back down.

There are a number of default queue types in the CycleCloud LSF cluster.

QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
ondemand         30  Open:Active       -    -    -    -     0     0      0     0
ondemandmpi      30  Open:Active       -    -    -    -     0     0      0     0
lowprio          30  Open:Active       -    -    -    -     0     0      0     0
gpu              30  Open:Active       -    -    -    -     0     0      0     0
gpumpi           30  Open:Active       -    -    -    -     0     0      0     0
  • ondemand - a general queue (default), for pleasantly parallel jobs.
  • ondemandmpi - a queue for tightly-coupled jobs.
  • lowprio - a queue for pre-emptible jobs which will run on low priority machines.
  • gpu - a queue for parallel jobs that need a GPU co-processor.
  • gpumpi - a queue for MPI jobs that need GPUs.

Examples of supported job submissions:

  • bsub -J "testArr[100]" my-job.sh (ondemand is default)
  • bsub -n 4 -q ondemandmpi -R "span[ptile=2]" my-job.sh
  • bsub -gpu "num=2:mode=shared:j_exclusive=yes" -q gpu my-job.sh

Start a submit-only host

There is a nodearray dedicated to login hosts. These hosts don't run jobs, but can submit jobs to the queue. Start them from the UI by going to Actions -> Add and follow the menu to add submit-type hosts.

NOTE : The submit hosts will be visible with the lshosts command but not the bhosts command.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

cyclecloud-lsf's People

Contributors

aditigaur4, anhoward, bwatrous, dpwatrous, edwardsp, hmeiland, jamesongithub, microsoft-github-policy-service[bot], microsoftopensource, msftgits, mvrequa, ryanhamel, staer

cyclecloud-lsf's Issues

NPE when creating nodes

$ cyclecloud -v
CycleCloud 7.9.6-1280

$ uname -a
Linux cyclecloud 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

LSF version: 10.1.0.9 (SPK9)

When creating a node from LSF, CycleCloud immediately throws an NPE:

2020-07-12 14:15:54,177 INFO [com.cyclecomputing.web.RequestHandler request-1367] - **BEGIN POST request to /clusters/lsf-cluster/nodes/create
2020-07-12 14:15:54,371 INFO [com.cyclecomputing.appsupport.action.ActionExecutor request-1367] - Running action Add:Cloud.Node...
2020-07-12 14:15:54,379 ERROR [com.cyclecomputing.grid.common.plugin.PluginServlet request-1367] - Error while processing request
java.lang.NullPointerException
        at com.cyclecomputing.cloud.node.CachedQuotaTracker$CachedReservation.getRegionCache(CachedQuotaTracker.java:422)
        at com.cyclecomputing.cloud.node.CachedQuotaTracker$CachedReservation.getEffectiveQuota(CachedQuotaTracker.java:429)
        at com.cyclecomputing.cloud.node.CachedQuotaTracker$CachedReservation.getAvailableCount(CachedQuotaTracker.java:347)
        at com.cyclecomputing.cloud.nodearray.NodeCreator$SetValidator.validateSet(NodeCreator.java:410)
        at com.cyclecomputing.cloud.nodearray.NodeCreator.createNodes(NodeCreator.java:189)
        at com.cyclecomputing.cloud.node.NodeArrayController.createNodes(NodeArrayController.java:91)
        at com.cyclecomputing.cloud.node.rest.CreateNodesPlugin.execute(CreateNodesPlugin.java:371)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.cyclecomputing.apex.plugin.java.JavaPluginFunction.call(JavaPluginFunction.java:157)
        at com.cyclecomputing.apex.plugin.PluginUtilities.call(PluginUtilities.java:151)
        at com.cyclecomputing.apex.plugin.PluginBase.call(PluginBase.java:96)
        at com.cyclecomputing.appsupport.action.ActionExecutor.runAction(ActionExecutor.java:245)
        at com.cyclecomputing.appsupport.restlet.action.ActionHandlerProvider$CustomHandler.execute(ActionHandlerProvider.java:205)
        at com.cyclecomputing.appsupport.restlet.action.ActionHandlerProvider$CustomHandler.acceptRepresentation(ActionHandlerProvider.java:139)
        at org.restlet.resource.Resource.post(Resource.java:683)
        at org.restlet.resource.Resource.handlePost(Resource.java:537)
        at com.cyclecomputing.apex.rest.plugin.FilteredHandler.handleInternal(FilteredHandler.java:119)
        at com.cyclecomputing.apex.rest.plugin.FilteredHandler.access$000(FilteredHandler.java:22)
        at com.cyclecomputing.apex.rest.plugin.FilteredHandler$1.run(FilteredHandler.java:80)
        at com.cyclecomputing.apex.rest.plugin.FilteredHandler.handle(FilteredHandler.java:90)
        at com.cyclecomputing.apex.rest.plugin.FilteredHandler.handlePost(FilteredHandler.java:63)
        at org.restlet.Finder.handle(Finder.java:357)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Router.handle(Router.java:504)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at com.noelios.restlet.StatusFilter.doHandle(StatusFilter.java:130)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at com.noelios.restlet.ChainHelper.handle(ChainHelper.java:124)
        at com.noelios.restlet.application.ApplicationHelper.handle(ApplicationHelper.java:112)
        at org.restlet.Application.handle(Application.java:341)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Router.handle(Router.java:504)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at org.restlet.Router.handle(Router.java:504)
        at org.restlet.Filter.doHandle(Filter.java:150)
        at org.restlet.Filter.handle(Filter.java:195)
        at com.noelios.restlet.ChainHelper.handle(ChainHelper.java:124)
        at org.restlet.Component.handle(Component.java:673)
        at org.restlet.Server.handle(Server.java:331)
        at com.noelios.restlet.ServerHelper.handle(ServerHelper.java:68)
        at com.noelios.restlet.http.HttpServerHelper.handle(HttpServerHelper.java:147)
        at com.noelios.restlet.ext.servlet.ServerServlet.service(ServerServlet.java:881)
        at com.cyclecomputing.grid.common.plugin.PluginServlet.service(PluginServlet.java:123)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
        at org.springframework.web.servlet.handler.SimpleServletHandlerAdapter.handle(SimpleServletHandlerAdapter.java:63)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:857)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:792)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:475)
        at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:440)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:660)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
        at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
        at com.cyclecomputing.web.RequestHandler$CustomRunnable.delegate(RequestHandler.java:155)
        at com.cyclecomputing.web.RequestHandler$CustomRunnable.run(RequestHandler.java:127)
        at com.cyclecomputing.apex.plugin.impl.PluginTaskAdapter$PythonRunnable.run(PluginTaskAdapter.java:37)
        at com.cyclecomputing.apex.ad.auth.impl.ThreadContextTaskAdapter$CustomRunnable.run(ThreadContextTaskAdapter.java:42)
        at com.cyclecomputing.apex.event.impl.EventLogRunnable.run(EventLogRunnable.java:23)
        at com.cyclecomputing.core.concurrency.logging.LoggingRunnable.run(LoggingRunnable.java:31)
        at com.cyclecomputing.web.RequestHandler.doFilter(RequestHandler.java:68)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:138)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:185)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:608)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:408)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:764)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1388)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
2020-07-12 14:15:54,380 INFO [com.cyclecomputing.web.RequestHandler request-1367] - **END   POST request to /clusters/lsf-cluster/nodes/create, 500, 0.2 sec
2020-07-12 14:15:57,269 INFO [chef.host_timeout timer-1233] - Deleting Chef hosts that have not been heard from in 86400.000000 seconds
2020-07-12 14:15:57,288 INFO [chef.host_timeout timer-1233] - Finished deleting timed-out chef hosts

project.ini-entitled invalid

Line 7 is a single filename, which should be added to the files list on line 6. The current file is invalid and gives this error when uploading the project:
cyclecloud project upload my-storage
**** Error: Unexpected command failure: No project at specified path: /mnt/c/Users/

LSF_UNIT_FOR_LIMITS=MB can affect tcl runtime

This project overrides the KB default with LSF_UNIT_FOR_LIMITS=MB in lsf.conf.

bsub -Is /user/user1/sample.tcl

<<Waiting for dispatch ...>>
<>
Tcl_InitNotifier: unable to start notifier thread

The sample.tcl script is simply:

#!/usr/local/bin/tclsh
puts "Hello World"

Custom script can fail in several ways - too hard to disambiguate for user

The curl -L command in the custom script can fail if:

  1. resource doesn't exist
  2. authentication invalid

1:
ResourceNotFound: The specified resource does not exist.
RequestId:e5436447-901e-0047-087d-ee38af000000
Time:2020-02-28T21:22:02.1247028Z

Status: Error [Software Configuration] (retrying)
Start Time: 2020-02-18T16:06:02.190Z

Description: Failed to execute cluster-init script '/mnt/cluster-init/lsf/execute/scripts/02-run-custom-script-uri.sh' in project 'lsf' (return code: 2)

Detail:
Script output:
running https://storage4lge.blob.core.windows.net/cyclecloud/projects/lsf/3.0.3/default/cluster-init/files/user_data-full.sh...

Try to add sensible error handling in this area that decodes what the issue is.
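One possible shape for that error handling is sketched below. It captures the HTTP status with curl's -w '%{http_code}' and maps it to a message the user can act on. The function names and message texts are made up for this example; they are not part of the project's 02-run-custom-script-uri.sh.

```shell
# Map an HTTP status code to a short, actionable explanation.
explain_http_status() {
    case $1 in
        2??)     echo "ok" ;;
        401|403) echo "authentication failed: check the credentials or access level of the URL" ;;
        404)     echo "resource not found: check the customScriptUri path" ;;
        000)     echo "no HTTP response: check DNS and network access" ;;
        *)       echo "unexpected HTTP status $1" ;;
    esac
}

# Download the custom script to a file and explain any failure on stderr.
# $1: URL of the script, $2: destination file
fetch_custom_script() {
    url=$1 dest=$2
    # curl prints 000 as the status when it never received an HTTP response.
    status=$(curl -sS -L -o "$dest" -w '%{http_code}' "$url") || status=${status:-000}
    msg=$(explain_http_status "$status")
    if [ "$msg" != ok ]; then
        echo "custom script download failed ($status): $msg" >&2
        return 1
    fi
}
```

Checking the status before running the downloaded file also avoids executing an error document as if it were the script.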

How to import a template into a cyclecloud server?

Hi guys, I have a cyclecloud template and I also have a cyclecloud server. How do I import the cyclecloud template into my cyclecloud server? I know the command is something like "cyclecloud import_template ...".
What should my full command be, and how do I make sure the template is imported on my cyclecloud server?

Please help!

Autostop process uses a pgrep flag that is not valid in all distros

The pgrep -c option used for auto stopping does not work in some versions of pgrep. For example, on RH 6.8:

pgrep: invalid option -- 'c'
Usage: pgrep [-flvx] [-d DELIM] [-n|-o] [-P PPIDLIST] [-g PGRPLIST] [-s SIDLIST]
    [-u EUIDLIST] [-U UIDLIST] [-G GIDLIST] [-t TERMLIST] [PATTERN]

We need an alternative way to figure out whether the node is idle. A stopgap is `pgrep -x res | wc -l`.

https://github.com/Azure/cyclecloud-lsf/blob/5de080bfc31a27a8313892dd03b5670f21d96fb6/specs/default/chef/site-cookbooks/lsf/files/autostop.rb#L18
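The stopgap can be wrapped in a small function so the rest of the idle check does not depend on which pgrep flags are available. The name `count_procs` is hypothetical; the sketch only assumes that pgrep supports -x, which appears in the usage text above.

```shell
# Count processes whose name is exactly $1, without relying on `pgrep -c`.
count_procs() {
    # One PID per line from pgrep; wc -l counts them, tr strips padding
    # that some wc implementations add.
    pgrep -x "$1" | wc -l | tr -d ' '
}
```

An idle test could then read `[ "$(count_procs res)" -eq 0 ] && echo idle`.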

CC can exceed quota with two VM sizes in a node array

VM0 1000 vCPU quota
VM1 10000 vCPU quota
Two templates in cyclecloudprov_templates with MaxNumber set. Initially MaxNumber was set slightly higher where vCPU * MaxNumber > vCPU Quota for VM0. MaxCoreCount on nodearray set to 11000.

The jobs specify which VM size they run on, so this is not a matter of template priority. They were still able to exceed the quota for VM0.
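The condition described above (vCPU * MaxNumber > vCPU quota) is simple integer arithmetic and can be checked before editing a template. In this sketch the function names are made up, and 44 vCPUs per node is only an example value.

```shell
# Largest node count that still fits in a vCPU quota.
# $1: vCPU quota, $2: vCPUs per node
max_nodes_within_quota() {
    echo $(( $1 / $2 ))
}

# Succeeds (exit 0) when maxNumber nodes would exceed the quota.
# $1: maxNumber, $2: vCPUs per node, $3: vCPU quota
exceeds_quota() {
    [ $(( $1 * $2 )) -gt "$3" ]
}
```

With a 1000 vCPU quota and 44 vCPUs per node, `max_nodes_within_quota 1000 44` prints 22, and `exceeds_quota 25 44 1000` succeeds because 25 nodes would need 1100 vCPUs.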

Expose package name and version in the cluster template

Currently, if the customer has a different LSF installer package, they can edit blobs.ini to get the right blob uploaded, but it's not clear that they need to specify lsf.{version,kernel,arch} in their cluster template. We should probably put those in the template and expose them in the parameters section.

jobs are not dispatched on execute nodes with v 3.2.0

Standing up execute nodes works and the nodes are added to the queue in LSF, but jobs are not dispatched when using the lsf-full.txt template. The problem is that the user's environment does not point to the correct LSF_ENVDIR. An export needs to be set when restarting daemons on execute nodes:


needs to be:
export LSF_ENVDIR=$LSF_ENVDIR_LOCAL

Also, if a user logs into the compute node interactively, LSF_ENVDIR won't be set. This needs an update to /etc/profile.d.
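A minimal sketch of the /etc/profile.d part is shown below. The function name, the default file name lsf_envdir.sh, and the example path are hypothetical; the directory passed in stands for whatever LSF_ENVDIR_LOCAL points to on the execute node.

```shell
# Write a profile snippet that exports LSF_ENVDIR for interactive logins.
# $1: local LSF configuration directory (the value of LSF_ENVDIR_LOCAL)
# $2: target file; the default name is a placeholder for this sketch
write_lsf_envdir_profile() {
    envdir=$1
    target=${2:-/etc/profile.d/lsf_envdir.sh}
    printf 'export LSF_ENVDIR="%s"\n' "$envdir" > "$target"
}
```

For example, `write_lsf_envdir_profile /grid/lsf/conf` (run as root) would make new login shells pick up that directory.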

Master size 1 array is failing with invalid host-list.

[2018-08-27T15:26:17+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2018-08-27T15:26:17+00:00] ERROR: ruby_block[check_valid_masterlist] (lsf::master line 13) had an error: RuntimeError: Hostname mismatch in master list.
[2018-08-27T15:26:17+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

[DOC] cyclecloud import_cluster or import_template

Hello

I am following the readme, and I am wondering in the section "Start an LSF cluster" if you want to run
cyclecloud import_cluster LSF -f lsf.txt -t

or

cyclecloud import_template LSF -f lsf.txt -t

Thanks,
Alex
