qtop / qtop

qtop (pronounced queue-top) is a tool for monitoring the state of queueing systems, along with related information relevant to HPC & grid clusters. At present it supports the **PBS, SGE & OAR** families. The earlier shell version of the tool is documented, for historical reference, at its former CERN home page:

Home Page: https://cern.ch/fotis/QTOP

License: MIT License

Shell 0.99% Python 98.83% HTML 0.16% Makefile 0.03%
lsf oar pbs pbs-torque pbspro sge sgecore slurm slurm-cluster slurm-job-scheduler

qtop's Introduction

qtop.py

qtop: the fast text mode way to monitor your cluster’s utilization and status; the time has come to take back control of your cluster’s scheduling business

Python port by Sotiris Fragkiskos / Original bash version by Fotis Georgatos

Summary

qtop.py is the Python rewrite of qtop, a tool to monitor Torque, PBS, OAR or SGE clusters, etc. This release provides the instant replay feature, which is handy for debugging scheduling mishaps as they occur. qtop is and will remain a work-in-progress project; it is intended to be built upon and extended - please come along ;)

This is an initial release of the source code, and work continues to make it better. We hope to build an active open source community that drives the future of this tool, both by providing feedback and by actively contributing to the source code.

This program is currently in pre-release mode, with experimental features. If it works, peace :)

Installation

To install qtop, you can either do

git clone https://github.com/qtop/qtop.git
cd qtop
./qtop --version

or

pip install qtop --user ## run it without --user to install it as root
$HOME/.local/bin/qtop --version

Usage

To run a demo, just run

./qtop -b demo -FGTw  ## show demo, -F for full node names, -T to transpose the matrix, -G for full GECOS field, and -w for watch mode

Otherwise, for daily usage you can run

./qtop -b sge -FGw ## replace sge with pbs or oar, depending on your setup (this is often picked up automagically) 

Try --help for all available options.

Documentation

Documentation/tutorial here.

Profile

Description: the fast text mode way to monitor your cluster’s utilization and status; the time has come to take back control of your cluster’s scheduling business
License: MIT
Version: 0.9.20161222 / Date: 2016-12-22
Homepage: https://github.com/qtop/qtop

qtop's People

Contributors

fgeorgatos, lookfwd, phillipnordwall, sfranky

qtop's Issues

Bugs with and without `-c`, there is something for everybody!

[gef@ui01 qtop]$ ./qtop.py -s /home/gef/arena/qtop-sge2/results/gef_J70JeQFSYYOW9jDg6XfxzQ -c
PBS report tool. All bugs added by [email protected]. Cross fingers now...
Please try: watch -d + /home/gef/arena/qtop-sge3/qtop/qtop.py -s /home/gef/arena/qtop-sge2/results/gef_J70JeQFSYYOW9jDg6XfxzQ

===> Job accounting summary <=== (Rev: 3000 $) 2015-09-25 17:23:30.798377 WORKDIR = /home/gef/arena/qtop-sge3/qtop
Usage Totals:   17/17    Nodes | 61/408  Cores |   61+30 jobs (R + Q) reported by qstat -q
Queues: |  alice: 60 | biomed: 1 | cta: 0 | ctaprod: 0 | drihm: 0 | dteam.cg: 0 | gisela.cg: 0 | ibchem.cg: 0 | ibearth.cg: 0 | ibeng.cg: 0 | iber.cg: 0 | ibhpc.cg: 0 | ibict.cg: 0 | iblife.cg: 0 | ibops.cg: 0 | ibphys.cg: 0 | ibsocial.cg: 0 | ibtut.cg: 0 | ops.cg: 0 | sqm.test: 0 | Pending: 0 + 30| * implies blocked

===> Worker Nodes occupancy <=== (you can read vertically the node IDs; nodes in free state are noted with - )
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111={_Worker_}
000000000111111111122222222223333333333444444444455555555556666666666777777777788888888889999999999000000000011111111112222222222333333333344444444445555555555666666={__Node__}
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345={___ID___}
Traceback (most recent call last):
  File "./qtop.py", line 1022, in <module>
    _func(*args)
  File "./qtop.py", line 749, in display_wn_occupancy
    display_selected_occupancy_parts(print_char_start, print_char_stop, wn_vert_labels, core_user_map, pattern_of_id, workernodes_occupancy)
  File "./qtop.py", line 589, in display_selected_occupancy_parts
    fn(*args, **kwargs)
  File "./qtop.py", line 601, in print_single_attr_line
    attr_line = insert_separators(line, SEPARATOR, options.WN_COLON) + '={}'.format(label)
ValueError: zero length field name in format
[gef@ui01 qtop]$
[gef@ui01 qtop]$ ./qtop.py -s /home/gef/arena/qtop-sge2/results/gef_J70JeQFSYYOW9jDg6XfxzQ
PBS report tool. All bugs added by [email protected]. Cross fingers now...
Please try: watch -d + /home/gef/arena/qtop-sge3/qtop/qtop.py -s /home/gef/arena/qtop-sge2/results/gef_J70JeQFSYYOW9jDg6XfxzQ

Traceback (most recent call last):
  File "./qtop.py", line 1022, in <module>
    _func(*args)
  File "./qtop.py", line 208, in display_job_accounting_summary
    print colorize('===> ', '#') + colorize('Job accounting summary', 'Normal') + colorize(' <=== ', '#') + colorize(
  File "./qtop.py", line 75, in colorize
    if ((not options.NOCOLOR) and pattern != 'account_not_coloured' and text != ' ') else text
ValueError: zero length field name in format
[gef@ui01 qtop]$
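
The message "ValueError: zero length field name in format" is what Python 2.6 raises when str.format meets an auto-numbered `{}` field; explicit indices such as `{0}` are accepted by 2.6 and later. A minimal sketch of the change, assuming the host above runs Python 2.6 (the helper name `label_suffix` is hypothetical and not taken from qtop):

```python
def label_suffix(label):
    """Build the '=<label>' suffix with an explicit field index.

    '={}'.format(label) fails on Python 2.6, while '={0}' works on
    2.6, 2.7 and 3.x alike.
    """
    return '={0}'.format(label)


print(label_suffix('{_Worker_}'))
```

The same substitution would apply to any other `{}` placeholder on the code path that `colorize` goes through.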

Allow multiple concurrent runs

Reentrant code - how to ensure that multiple users can run the code concurrently (even if from the same unix account)
Improve code quality.

Parameter priority

Parameters accepted by qtop should be prioritized in the following order:

  1. cmd line switches
  2. environment variables
  3. config file
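
The ordering above could be sketched as a single lookup helper. This is only an illustration: the function name `resolve_option` and the `QTOP_` prefix for environment variables are assumptions, not existing qtop conventions.

```python
import os


def resolve_option(name, cli_args, config, environ=None, default=None):
    """Return the value for `name`: command line first, then environment, then config file."""
    environ = os.environ if environ is None else environ
    # 1. command line switches (None means "switch not given")
    if cli_args.get(name) is not None:
        return cli_args[name]
    # 2. environment variables (hypothetical naming scheme: QTOP_<NAME>)
    env_key = 'QTOP_' + name.upper()
    if env_key in environ:
        return environ[env_key]
    # 3. config file, already parsed into a dict
    return config.get(name, default)
```

For example, `resolve_option('batch', {'batch': 'sge'}, {'batch': 'pbs'})` would pick the command line value over the one from the config file.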

[VIS] Remapping file

Users should be able to decide on node naming by providing a remapping file, such as:
torvalds --> tor
stroustrup --> strou
chomsky --> chm
workernode250.blahblah-workernode450.blahblah --> wn001-wn250
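
A minimal sketch of how such a file might be read, assuming one `source --> target` rule per line. The function names, and the treatment of blank lines and `#` comments, are assumptions. Range rules like the last example are kept as a plain pair here; expanding them is left open.

```python
def parse_remap_lines(lines):
    """Turn 'source --> target' lines into a list of (source, target) pairs."""
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):  # assumed: blanks and comments are skipped
            continue
        source, arrow, target = line.partition('-->')
        if not arrow:
            raise ValueError('missing "-->" in remap rule: %r' % raw)
        rules.append((source.strip(), target.strip()))
    return rules


def remap(name, rules):
    """Return the short name for `name`, or `name` itself if no rule matches exactly."""
    for source, target in rules:
        if name == source:
            return target
    return name
```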

Non-interactive runs break out with an error

This occurred while sending qtop to be executed on a grid machine:

2015-12-17 01:24:14 - ERROR - Uncaught exception
Traceback (most recent call last):
  File "./qtop.py", line 1572, in <module>
    with raw_mode(sys.stdin):
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "./qtop.py", line 42, in raw_mode
    old_attrs = termios.tcgetattr(file.fileno())
error: (25, 'Inappropriate ioctl for device')
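
The traceback shows termios.tcgetattr being called on a stdin that is not a terminal, which is the usual situation in batch or grid jobs. One possible guard is to check isatty() first and skip the terminal setup otherwise. The sketch below is an assumption about how `raw_mode` could look, not the actual qtop code; it uses cbreak mode as a stand-in for whatever mode qtop sets.

```python
import contextlib
import termios
import tty


@contextlib.contextmanager
def raw_mode(stream):
    """Switch `stream` to cbreak mode if it is a terminal; otherwise do nothing.

    Yields True when the terminal was reconfigured, False for non-interactive runs.
    """
    if not stream.isatty():
        # No controlling terminal (cron, grid job, pipe): leave the stream alone.
        yield False
        return
    fd = stream.fileno()
    old_attrs = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)
        yield True
    finally:
        # Always restore the original terminal settings.
        termios.tcsetattr(fd, termios.TCSADRAIN, old_attrs)
```

A caller could then write `with raw_mode(sys.stdin) as interactive:` and only enable keyboard handling when `interactive` is true.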

strict checking

Add a --strict-checking switch that cross-checks all input files for job consistency (e.g. does the number of R+Q jobs counted in the Accounting Summary exactly match the number displayed in the Worker Nodes Occupancy table?)

Refactoring

Rename some variables, move constants and text strings to the proper place

introduce -vvv combination with -d

A -vvv (triple verbosity) switch, when combined with a -d (debug) switch, should send debug messages both to the log file AND to the screen, in between normal output
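
With the standard logging module this could amount to attaching a second handler. A rough sketch, in which the function name `build_logger`, the logger name, and the exact threshold of three `-v` flags are all assumptions:

```python
import logging
import sys


def build_logger(log_path, debug=False, verbosity=0, stream=None):
    """Log to a file always; also echo to the screen when -d and -vvv are combined."""
    logger = logging.getLogger('qtop_sketch')  # hypothetical logger name
    for handler in list(logger.handlers):      # start clean on repeated calls
        logger.removeHandler(handler)
    logger.propagate = False
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.addHandler(logging.FileHandler(log_path))
    if debug and verbosity >= 3:
        # Second handler: the same records also reach the screen.
        logger.addHandler(logging.StreamHandler(stream or sys.stderr))
    return logger
```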

where to place configuration file

Do we supply a full example qtopconf.yaml in the current dir (how would it be placed elsewhere?)
and at the same time check for a user-placed conf file in $HOME/.local/qtop, which should override the default one?
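
One way to read this proposal is "use the user's file if it exists, otherwise fall back to the shipped one". A small sketch under that assumption; the function name `find_config` is hypothetical, and merging the two files key by key would be an alternative design.

```python
import os


def find_config(default_dir, home=None, filename='qtopconf.yaml'):
    """Return the user's config under $HOME/.local/qtop if present, else the shipped default."""
    home = home or os.path.expanduser('~')
    user_conf = os.path.join(home, '.local', 'qtop', filename)
    if os.path.isfile(user_conf):
        return user_conf
    return os.path.join(default_dir, filename)
```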

tarball creation

Input files (fetched by issuing the appropriate queueing commands, qstat, oarstat, etc) and intermediate yaml files should be gathered in a directory and put into a tarball.
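
Assuming the files have already been gathered into one directory, the standard tarfile module is enough to pack them. The function name `bundle_inputs` is an illustration only:

```python
import os
import tarfile


def bundle_inputs(source_dir, tarball_path):
    """Pack everything under `source_dir` into a gzip-compressed tarball."""
    # Store paths relative to the directory name, not the absolute path.
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tarball_path, 'w:gz') as tar:
        tar.add(source_dir, arcname=top_level)
    return tarball_path
```

The tarball should be written outside `source_dir`, so that it does not end up inside itself.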

Sorting discrepancy

file: gef_8e3OrbuWti6wAR59UMqR7Q/QTOP.OUT

check unix account llr199
fotis: id: B, 0+2/2
me: id: E, 0+2/2, but much lower in the ranking.
fotis also has other 0+2/2 ids much later

[VIS] Modularity of sections

The 3 separate qtop sections displayed should be modular; the user should be able to display additional information/sections, as specified in a config file

fix buggy attitude under watch, when combined with `|head`

No need to solve this quickly; just reporting the issue here:

[fgeorgatos@ui qtop]$ watch -d './qtop.py -c|head -20'
close failed in file object destructor:
                                       sys.excepthook is missing
                                                                lost sys.stderr
[fgeorgatos@ui qtop]$ watch -d './qtop.py -c|head -20'
close failed in file object destructor:
                                       sys.excepthook is missing
                                                                lost sys.stderr
[fgeorgatos@ui qtop]$
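
This message typically appears when the reader of a pipe (here `head`) exits while Python is still writing to stdout. A sketch of one way to stop writing quietly, assuming output goes through a single helper; the name `write_lines` is hypothetical.

```python
import errno
import sys


def write_lines(lines, out=None):
    """Write lines to `out`; return False instead of failing if the reader has gone away."""
    out = sys.stdout if out is None else out
    try:
        for line in lines:
            out.write(line + '\n')
        out.flush()
    except IOError as exc:  # on Python 3 this also covers BrokenPipeError
        if exc.errno != errno.EPIPE:
            raise
        return False
    return True
```

When this returns False the caller would stop producing output. Data still buffered at that point can trigger the same complaint at interpreter shutdown; the Python documentation for the signal module suggests redirecting stdout to os.devnull before exiting in that case.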

vertical stabilizer detached from the airplane...

                                                                  Decided Remapping: True
INFO - DISPLAY AREA
ERROR - Uncaught exception
Traceback (most recent call last):
  File "./qtop.py", line 1412, in <module>
    display_func(*args) if not sections_off[idx] else None
  File "./qtop.py", line 860, in display_wn_occupancy
    display_matrix(workernodes_occupancy)
  File "./qtop.py", line 616, in display_matrix
    fn(*args, **kwargs)
  File "./qtop.py", line 496, in display_wnid_lines
    color_func=highlight_alternately, args=(ALT_LABEL_HIGHLIGHT_COLORS))
  File "./qtop.py", line 507, in print_wnid_lines
    wn_id_str = insert_separators(d[line_nr][start:stop], SEPARATOR, config['vertical_separator_every_X_columns'])
KeyError: 'vertical_separator_every_X_columns'
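
The crash comes from a configuration key that is absent from the loaded config. A possible remedy is to fall back to built-in defaults and otherwise fail with a clearer message. In the sketch below the helper name `get_setting` and the default value 0 are assumptions, not values from qtop:

```python
# Hypothetical built-in defaults; the value 0 is only a placeholder.
DEFAULTS = {'vertical_separator_every_X_columns': 0}


def get_setting(config, key, defaults=DEFAULTS):
    """Look up `key` in the user config, then in the defaults, else fail with a clear message."""
    if key in config:
        return config[key]
    if key in defaults:
        return defaults[key]
    raise KeyError('setting %r is missing from the qtop configuration' % key)
```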

Consider a DSL for either/both visualization and testing purposes

In the future, it would be great if some ideas from other teams could make it here,
in order both to reduce the probability of bugs and to take advantage of testing concepts:

Last one, a good read on DSLs! :
http://safehammad.com/downloads/domain-specific-languages-and-python-2011-04-21.pdf

if possible, do not rely on intermediate file

In fact, we should not only avoid relying upon an intermediate yaml file (it could be dumped on the fs for debugging, yet kept in memory as a data structure for actual use), we should also do away with:

  • qstat -F -xml -u "*" >/tmp/q && mv /tmp/q qstat.F.xml.stdout

The above expression is currently necessary, in order to update the information in an atomic manner.
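
The same write-then-rename pattern can be expressed inside Python, which might be one step towards dropping the shell wrapper. A minimal sketch, assuming Python 3.3+ for os.replace; the function name `atomic_write` is hypothetical. The temporary file is created in the destination directory because a rename is only atomic within one filesystem.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write `data` to `path` so that readers see either the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as handle:
            handle.write(data)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except Exception:
        os.unlink(tmp_path)  # do not leave the temporary file behind
        raise
```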

Bugaki, node state should consume one char or, be presented in an alternate way

(e.g. node01 below has state au, but node02 is clean)

Every 2.0s: ./qtop.py -a                                                                                                     Fri Oct  9 21:35:57 2015

Log file created in /home/fgeorgatos/.local/qtop/qtop_20003.log

=== WARNING: --- Remapping WN names and retrying heuristics... good luck with this... ---
PBS report tool. All bugs added by [email protected]. Cross fingers now...
Please try: watch -d /home/fgeorgatos/git/qtop/qtop.py -s /home/fgeorgatos/git/qtop

===> Job accounting summary <=== (Rev: 3000 $) 2015-10-09 21:35:57.825648 WORKDIR = /home/fgeorgatos/git/qtop
Usage Totals:   14/14    Nodes | 0/332  Cores |   0+0 jobs (R + Q) reported by qstat -q
Queues: |  slice: 0 | transfer: 0 | whole: 0 | Pending: 0 | * implies blocked

===> Worker Nodes occupancy <=== (you can read vertically the node IDs; nodes in free state are noted with - )
00000000011111={_Worker_}
12345678901234={__Node__}
au-----auau---=Node state
______________=Core0
______________=Core1
######________=Core2
######________=Core3
######________=Core4
######________=Core5
[...]

Why not bail out (or bork out!) with better error messages?!

I overloaded a queue and things started looking fishy; it ended up crashing at the command prompt:
(fine with that, but let's make it clearer what went wrong!)

[fg@hello qtop]$ qstat -F -xml>qstat.F.xml.stdout && ./qtop.py -s . -c
===> Worker Nodes occupancy <=== (you can read vertically the node IDs; nodes in free state are noted with - )
12345={__WNID__}
----d=Node state
_____=Core0
_____=Core1
_____=Core2
_____=Core3
_____=Core4
_____=Core5
_____=Core6
_____=Core7
_____=Core8
_____=Core9
_____=Core10
_____=Core11
_____=Core12
_____=Core13
_____=Core14
_____=Core15
_____=Core16
_____=Core17
_____=Core18
_____=Core19
_____=Core20
_____=Core21
_____=Core22
_____=Core23
_____=Core24
_____=Core25
_____=Core26
_____=Core27
_____=Core28
_____=Core29
_____=Core30
_____=Core31

===> User accounts and pool mappings <===   ('all' also includes those in C and W states, as reported by qstat)
id|    R +    Q /  all |    unix account | Grid certificate DN (info only available under elevated privileges)

Thanks for watching!

[fg@hello qtop]$ watch -d 'qstat -F -xml>qstat.F.xml.stdout && ./qtop.py -s . -c'
[fg@hello qtop]$ Traceback (most recent call last):
  File "./qtop.py", line 16, in <module>
    from pbs import *
  File "/home/fgeorgatos/arena/qtop/pbs.py", line 4, in <module>
    import yaml
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/site-packages/yaml/__init__.py", line 8, in <module>
    from loader import *
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/site-packages/yaml/loader.py", line 4, in <module>
    from reader import *
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/site-packages/yaml/reader.py", line 45, in <module>
    class Reader(object):
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/site-packages/yaml/reader.py", line 137, in Reader
    NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/re.py", line 240, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/sre_compile.py", line 504, in compile
    code = _code(p, flags)
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/sre_compile.py", line 486, in _code
    _compile_info(code, p, flags)
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/sre_compile.py", line 464, in _compile_info
    _compile_charset(charset, flags, code)
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/sre_compile.py", line 183, in _compile_charset
    for op, av in _optimize_charset(charset, fixup):
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/sre_compile.py", line 226, in _optimize_charset
    return _optimize_unicode(charset, fixup)
  File "/home/fgeorgatos/.local/easybuild/software/Python/2.7.3-goolf-1.4.10/lib/python2.7/sre_compile.py", line 303, in _optimize_unicode
    import array
KeyboardInterrupt

[fg@hello qtop]$

Anonymize data

There should be a cmdline option to anonymize specific data items, e.g. node names and queue names. This is related to issue #60 (creating a tar file of qtop's output for debugging etc.) and should make communication with "sensitive" clusters much easier.
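
One simple approach is to replace every distinct name with a stable numbered alias, so that the structure of the output survives while the real names do not. A sketch under that assumption; `make_anonymizer` and the alias format are illustrative only.

```python
def make_anonymizer(prefix):
    """Return a function that maps each distinct name to a stable alias like 'node001'."""
    mapping = {}

    def anonymize(name):
        if name not in mapping:
            # Aliases are numbered in order of first appearance.
            mapping[name] = '%s%03d' % (prefix, len(mapping) + 1)
        return mapping[name]

    return anonymize


anonymize_node = make_anonymizer('node')
print(anonymize_node('torvalds'), anonymize_node('chomsky'), anonymize_node('torvalds'))
```

Separate anonymizers for node names and queue names would keep the two alias spaces apart.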
