colmet's People

Contributors

adfaure, augu5te, bzizou, elegaanz, joachimff, lambertrocher, salemharrache, yanng23

colmet's Issues

Some graphs are cumulative by default

Some graphs in the dashboard seem to be cumulative (summed) by default, which gives counter-intuitive results. Maybe changing the default behaviour would make them more intuitive?
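To illustrate why a cumulative display is confusing, here is a minimal sketch (hypothetical sample values, not colmet code) converting a cumulative counter into the per-interval rate that users usually expect to see:

```python
# Hypothetical cumulative counter samples (e.g. bytes read), taken every 5 s.
timestamps = [0, 5, 10, 15, 20]
cumulative = [0, 100, 250, 250, 600]

# Per-interval rate: difference between consecutive samples divided by the
# elapsed time -- usually what users expect a graph to show.
rates = [(c1 - c0) / float(t1 - t0)
         for (t0, c0), (t1, c1) in zip(zip(timestamps, cumulative),
                                       zip(timestamps[1:], cumulative[1:]))]
print(rates)  # [20.0, 30.0, 0.0, 70.0]
```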

Colmet stopped recording data into the HDF5 file with "bad heap free list"; file corrupted?

I noticed that my colmet was no longer updating the HDF5 file, with an error in the log file (below).
I can reproduce the same "bad heap free list" error if I try to read data inside the HDF5 file using h5py.

Error logged:

Traceback (most recent call last):
  File "/applis/ciment/v2/stow/x86_64/gcc_4.6.2/python_2.7.2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/var/lib/colmet/colmet-0.3.0/colmet/common/backends/zeromq.py", line 48, in <lambda>
    intern_callback = lambda x: self.unpack(x, callback)
  File "/var/lib/colmet/colmet-0.3.0/colmet/common/backends/zeromq.py", line 45, in unpack
    callback(counters)
  File "/var/lib/colmet/colmet-0.3.0/colmet/collector/main.py", line 58, in collect
    self.push()
  File "/var/lib/colmet/colmet-0.3.0/colmet/collector/main.py", line 47, in push
    backend.push(self.counters_list)
  File "/var/lib/colmet/colmet-0.3.0/colmet/collector/hdf5.py", line 222, in push
    jobstat.append_stats(c_list)
  File "/var/lib/colmet/colmet-0.3.0/colmet/collector/hdf5.py", line 361, in append_stats
    self._init_job_file_if_needed(stats[0].metric_backend)
  File "/var/lib/colmet/colmet-0.3.0/colmet/collector/hdf5.py", line 332, in _init_job_file_if_needed
    if table_path not in self.job_file:
  File "/applis/ciment/v2/stow/x86_64/gcc_4.6.2/python_2.7.2/lib/python2.7/site-packages/tables-2.4.0-py2.7-linux-x86_64.egg/tables/file.py", line 1506, in __contains__
    self.getNode(path)
  File "/applis/ciment/v2/stow/x86_64/gcc_4.6.2/python_2.7.2/lib/python2.7/site-packages/tables-2.4.0-py2.7-linux-x86_64.egg/tables/file.py", line 1112, in getNode
    node = self._getNode(nodePath)
  File "/applis/ciment/v2/stow/x86_64/gcc_4.6.2/python_2.7.2/lib/python2.7/site-packages/tables-2.4.0-py2.7-linux-x86_64.egg/tables/file.py", line 1057, in _getNode
    node = self.root._g_loadChild(nodePath)
  File "/applis/ciment/v2/stow/x86_64/gcc_4.6.2/python_2.7.2/lib/python2.7/site-packages/tables-2.4.0-py2.7-linux-x86_64.egg/tables/group.py", line 1131, in _g_loadChild
    childClass = self._g_getChildLeafClass(childName, warn=True)
  File "/applis/ciment/v2/stow/x86_64/gcc_4.6.2/python_2.7.2/lib/python2.7/site-packages/tables-2.4.0-py2.7-linux-x86_64.egg/tables/group.py", line 291, in _g_getChildLeafClass
    childCID = self._g_getLChildAttr(childName, 'CLASS')
  File "hdf5Extension.pyx", line 708, in tables.hdf5Extension.Group._g_getLChildAttr (tables/hdf5Extension.c:6793)

HDF5ExtError: HDF5 error back trace

  File "H5D.c", line 334, in H5Dopen2
    not found
  File "H5Gloc.c", line 430, in H5G_loc_find
    can't find object
  File "H5Gtraverse.c", line 861, in H5G_traverse
    internal path traversal failed
  File "H5Gtraverse.c", line 596, in H5G_traverse_real
    can't look up component
  File "H5Gobj.c", line 1156, in H5G_obj_lookup
    can't locate object
  File "H5Gstab.c", line 892, in H5G_stab_lookup
    unable to protect symbol table heap
  File "H5HL.c", line 490, in H5HL_protect
    unable to load heap data block
  File "H5AC.c", line 1322, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3567, in H5C_protect
    can't load entry
  File "H5C.c", line 7957, in H5C_load_entry
    unable to load entry
  File "H5HLcache.c", line 670, in H5HL_datablock_load
    unable to destroy local heap data block
  File "H5HLint.c", line 401, in H5HL_dblk_dest
    can't unpin local heap prefix
  File "H5AC.c", line 1456, in H5AC_unpin_entry
    can't unpin entry
  File "H5C.c", line 5058, in H5C_unpin_entry
    Entry isn't pinned
  File "H5HLcache.c", line 657, in H5HL_datablock_load
    can't initialize free list
  File "H5HLcache.c", line 154, in H5HL_fl_deserialize
    bad heap free list

End of HDF5 error back trace

Code to reproduce the error, on a random job_id:

#!/usr/bin/env python
# Python 2 syntax, matching the environment in the traceback above
import h5py

colmetfile = "/scratch/bzizou/colmet/froggy.2015-02-16.hdf5"
f = h5py.File(colmetfile, "r")
for name in f['5200156']:
    print name

Error reproduced:

Traceback (most recent call last):
  File "./colmet_hdf5_reader.py", line 9, in <module>
    for name in f['5200156']:
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/applis/site/src/h5py-2.4.0/h5py/_objects.c:2705)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/applis/site/src/h5py-2.4.0/h5py/_objects.c:2662)
  File "build/bdist.linux-x86_64/egg/h5py/_hl/group.py", line 160, in __getitem__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/applis/site/src/h5py-2.4.0/h5py/_objects.c:2705)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/applis/site/src/h5py-2.4.0/h5py/_objects.c:2662)
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/applis/site/src/h5py-2.4.0/h5py/h5o.c:3756)
KeyError: 'Unable to open object (Bad heap free list)'
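Since the corruption only surfaces when a damaged group is actually opened, a scan over all job groups can locate which ones are affected. This is a sketch (`find_corrupted_jobs` is a hypothetical helper, not colmet code), relying on the fact that h5py raises `KeyError` when opening a corrupted node, as shown above:

```python
import h5py

def find_corrupted_jobs(path):
    """Return the job-group names that raise on access (e.g. bad heap free list)."""
    bad = []
    with h5py.File(path, "r") as f:
        for job_id in f:
            try:
                for _ in f[job_id]:   # touching children forces the heap lookup
                    pass
            except KeyError:          # h5py raises KeyError on corrupted nodes
                bad.append(job_id)
    return bad
```

Groups not listed by such a scan could then be copied out to salvage the readable part of the file.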

Total data sent divided by two on Omni-Path

perfquery -x

PortSelect:......................1
CounterSelect:...................0x0000
PortXmitData:....................15217586
PortRcvData:.....................33204887641
PortXmitPkts:....................3745479
PortRcvPkts:.....................65084876
PortUnicastXmitPkts:.............0
PortUnicastRcvPkts:..............0
PortMulticastXmitPkts:...........15

# send 10000 * 100000 bytes = 1 gigabyte:

ib_write_bw dahu-31.grenoble.grid5000.fr -s 10000 --iters 100000

perfquery -x

PortSelect:......................1
CounterSelect:...................0x0000
PortXmitData:....................15355105
PortRcvData:.....................33331290814
PortXmitPkts:....................3778876
PortRcvPkts:.....................65384944
PortUnicastXmitPkts:.............0
PortUnicastRcvPkts:..............0
PortMulticastXmitPkts:...........15
PortMulticastRcvPkts:............16

Variation of PortRcvData:

(33331290814 - 33204887641) * 4 / 1000000000 = 0.505612692 gigabytes, instead of the expected 1 gigabyte.

Tested on dahu.

Not redoing data access request when zooming in

Given a page that is already loaded (job id and time range specified):
when I zoom in on a sub-time-range of a graph, a new data access request seems to be issued, judging by the "loading" icons and the time it takes to show the zoomed-in graph.
The application should already have the data for that sub-time-range, so it shouldn't have to redo the request.

I'm not sure if this is a Grafana task or a colmet one.

Wrong report of memory for multithreaded apps

We are using colmet on our cluster and I have noticed a problem with multi-threaded applications:

colmet reports a wrong amount of memory usage: if my application uses 3.6 GB and has 16 threads, colmet reports almost 50 GB. It looks like it multiplies the memory used by the application by the number of threads, which is wrong (confirmed with 'top' and 'free').
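One plausible explanation, sketched below under the assumption that colmet sums per-task samples: all threads of a process share one address space, so each thread reports the full process RSS, and summing per-thread values multiplies it by the thread count. Deduplicating by process id (tgid) avoids that (`total_rss` and the sample format are hypothetical, not colmet's actual data structures):

```python
def total_rss(tasks):
    """tasks: list of (tgid, tid, rss_bytes) samples, one per thread."""
    per_process = {}
    for tgid, tid, rss in tasks:
        per_process[tgid] = rss   # every thread reports the whole process RSS
    return sum(per_process.values())

# 16 threads of one process, each reporting the full 3.6 GB:
tasks = [(4242, 4242 + i, 3600000000) for i in range(16)]
print(total_rss(tasks))  # 3600000000, not 16 * 3.6 GB = 57.6 GB
```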

Automatically determine job start/end time

When entering a job ID for a job executed in the past, one must manually fetch, copy and paste the start and end times from the oarstat command.
Would it be possible to do this automatically?
In the case of multiple job IDs being submitted at once, take [min(startTimes); max(stopTimes)]?
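The multi-job case boils down to taking the widest enclosing range. A minimal sketch, assuming start/stop times have already been fetched per job id (e.g. parsed from oarstat output; the dict layout and field names here are illustrative, not oarstat's actual JSON schema):

```python
# Hypothetical per-job times, as Unix timestamps.
jobs = {
    "4991288": {"startTime": 1573902000, "stopTime": 1573905600},
    "4991289": {"startTime": 1573903800, "stopTime": 1573908000},
}

# [min(startTimes); max(stopTimes)] for the whole set of jobs:
range_start = min(j["startTime"] for j in jobs.values())
range_stop = max(j["stopTime"] for j in jobs.values())
print(range_start, range_stop)  # 1573902000 1573908000
```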

Crash on "RuntimeError: dictionary changed size during iteration"

We have frequent crashes of colmet-node, on the implementation_rapl_perfhw branch. Here's the backtrace:

Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Gathering the metrics
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991288
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4990850
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991289
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991222
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991263
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - INFO - no taks in this cgroup
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4990583
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - INFO - no taks in this cgroup
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4990632
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148800 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148801 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148802 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 144963 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148804 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148805 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148806 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148807 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148808 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148809 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148810 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148811 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148816 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148803 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 148815 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 145018 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - Process/Task 144956 no more exists.Removing from the Job /dev/cpuset/oar/cbardel_4991293
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991224
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991287
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991285
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991295
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991282
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - INFO - no taks in this cgroup
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - DEBUG - pull job :4991225
Nov 16 12:56:00 dahu70 oar-node[1860]: 16/11/2019 12:56:00 - CRITICAL - Error not handled 'RuntimeError'
Nov 16 12:56:00 dahu70 oar-node[1860]: Traceback (most recent call last):
Nov 16 12:56:00 dahu70 oar-node[1860]:   File "/applis/site/colmet/bin/colmet-node", line 10, in <module>
Nov 16 12:56:00 dahu70 oar-node[1860]:     sys.exit(main())
Nov 16 12:56:00 dahu70 oar-node[1860]:   File "/applis/site/colmet/lib/python3.5/site-packages/colmet/node/main.py", line 236, in main
Nov 16 12:56:00 dahu70 oar-node[1860]:     Task(sys.argv[0], args).start()
Nov 16 12:56:00 dahu70 oar-node[1860]:   File "/applis/site/colmet/lib/python3.5/site-packages/colmet/node/main.py", line 73, in start
Nov 16 12:56:00 dahu70 oar-node[1860]:     self.loop()
Nov 16 12:56:00 dahu70 oar-node[1860]:   File "/applis/site/colmet/lib/python3.5/site-packages/colmet/node/main.py", line 95, in loop
Nov 16 12:56:00 dahu70 oar-node[1860]:     pulled_counters = backend.pull()
Nov 16 12:56:00 dahu70 oar-node[1860]:   File "/applis/site/colmet/lib/python3.5/site-packages/colmet/node/backends/taskstats.py", line 47, in pull
Nov 16 12:56:00 dahu70 oar-node[1860]:     for job in self.jobs.values():
Nov 16 12:56:00 dahu70 oar-node[1860]: RuntimeError: dictionary changed size during iteration
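The traceback points at iterating `self.jobs.values()` while entries are deleted from the same dict (the "Process/Task ... no more exists. Removing from the Job" lines just above). The usual fix for this error, sketched here with a toy dict rather than the actual colmet patch, is to iterate over a snapshot:

```python
# Sketch of the usual fix for "dictionary changed size during iteration":
# iterate over a snapshot of the keys, so deletions don't break the loop.
jobs = {"4991293": "job-a", "4991295": "job-b"}

for job_id in list(jobs):      # list() takes a snapshot of the keys
    if job_id == "4991293":    # e.g. all of its tasks disappeared
        del jobs[job_id]       # safe: we are not iterating `jobs` itself

print(sorted(jobs))  # ['4991295']
```

In colmet's `taskstats.py`, the equivalent change would be iterating over `list(self.jobs.values())` instead of `self.jobs.values()`.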

Being able to superimpose two graphs

Users sometimes try to match events across different metrics, to see whether they happened at the same time or one before the other.
This would be easier if multiple metric graphs could be superimposed (with a different colour for each metric).
