hepcloud / decisionengine

HEPCloud Decision Engine framework

License: Apache License 2.0

Shell 0.58% Python 92.67% Dockerfile 3.17% Jsonnet 3.55% Jinja 0.03%

decisionengine's Issues

unit tests involving postgresql fixture started to fail

The GitHub Actions unit tests recently started to fail as shown below. No particular code change seems to correlate with this error:

__________________ ERROR at setup of test_start_from_nothing ___________________
Traceback (most recent call last):
  File "/github/workspace/decisionengine/framework/engine/tests/fixtures.py", line 89, in de_server_factory
    proc_fixture = request.getfixturevalue(pg_prog_name)
  File "/github/workspace/venv/lib/python3.6/site-packages/_pytest/fixtures.py", line 572, in getfixturevalue
    fixturedef = self._get_active_fixturedef(argname)
  File "/github/workspace/venv/lib/python3.6/site-packages/_pytest/fixtures.py", line 592, in _get_active_fixturedef
    self._compute_fixture_value(fixturedef)
  File "/github/workspace/venv/lib/python3.6/site-packages/_pytest/fixtures.py", line 676, in _compute_fixture_value
    fixturedef.execute(request=subrequest)
  File "/github/workspace/venv/lib/python3.6/site-packages/_pytest/fixtures.py", line 1057, in execute
    result = hook.pytest_fixture_setup(fixturedef=self, request=request)
  File "/github/workspace/venv/lib/python3.6/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/github/workspace/venv/lib/python3.6/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/github/workspace/venv/lib/python3.6/site-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/github/workspace/venv/lib/python3.6/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/github/workspace/venv/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/github/workspace/venv/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/github/workspace/venv/lib/python3.6/site-packages/_pytest/fixtures.py", line 1111, in pytest_fixture_setup
    result = call_fixture_func(fixturefunc, request, kwargs)
  File "/github/workspace/venv/lib/python3.6/site-packages/_pytest/fixtures.py", line 908, in call_fixture_func
    fixture_result = next(generator)
  File "/github/workspace/venv/lib/python3.6/site-packages/pytest_postgresql/factories.py", line 164, in postgresql_proc_fixture
    with postgresql_executor:
  File "/github/workspace/venv/lib/python3.6/site-packages/mirakuru/base.py", line 172, in __enter__
    return self.start()
  File "/github/workspace/venv/lib/python3.6/site-packages/pytest_postgresql/executor.py", line 124, in start
    self.init_directory()
  File "/github/workspace/venv/lib/python3.6/site-packages/pytest_postgresql/executor.py", line 164, in init_directory
    subprocess.check_output(init_directory, env=env)
  File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib64/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/pgsql-11/bin/pg_ctl', 'initdb', '--pgdata', '/tmp/postgresqldata.20577', '-o', '--username=postgres --auth=trust']' returned non-zero exit status 1.
---------------------------- Captured stderr setup -----------------------------
sh: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
initdb: invalid locale settings; check LANG and LC_* environment variables
pg_ctl: database system initialization failed
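
The captured stderr suggests that the C.UTF-8 locale requested through LC_ALL is not available in the CI image, so initdb rejects it. One possible workaround, sketched below, is to give the subprocess an environment whose locale variables point at the "C" locale, which is always present. The helper name env_with_fallback_locale and the list of variables are assumptions for illustration; whether pytest_postgresql lets the caller pass such an environment through has not been checked here.

```python
import os

# Locale variables that may influence initdb; this list is an assumption.
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def env_with_fallback_locale(env=None, fallback="C"):
    """Return a copy of env with the locale variables forced to fallback."""
    new_env = dict(os.environ if env is None else env)
    for name in LOCALE_VARS:
        new_env[name] = fallback
    return new_env
```

The returned dictionary could be handed to subprocess.check_output(..., env=...) in place of the inherited environment.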

Log invocations of importlib

When importlib changes our loaded modules, it should log which modules are being added.
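
A minimal sketch of one way to do this, assuming it is enough to compare sys.modules before and after the import. The wrapper name import_and_log is hypothetical and is not part of the framework.

```python
import importlib
import logging
import sys


def import_and_log(name, logger=None):
    """Import a module by name and log which modules the import added."""
    logger = logger or logging.getLogger(__name__)
    before = set(sys.modules)
    module = importlib.import_module(name)
    # Anything present now but not before was pulled in by this import.
    added = sorted(set(sys.modules) - before)
    logger.info("import of %s added modules: %s", name, added)
    return module, added
```

If the module was already loaded, the list of added modules is simply empty.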

Have de-client --print-product return different error message if product does not exist

Right now, if you run

de-client --print-product foo

and foo does not exist, the output is just a blank data block, the same as if the product really were there and empty. Could we flag a non-existent data block when it is not there?
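
A rough sketch of the distinction being asked for, using a plain dictionary as a stand-in for the data block. The function name describe_product and the message wording are hypothetical.

```python
def describe_product(data_block, product):
    """Return a message that separates 'missing' from 'present but empty'."""
    if product not in data_block:
        return f"Product '{product}' does not exist in the data block"
    value = data_block[product]
    if len(value) == 0:
        return f"Product '{product}' exists but is empty"
    return str(value)
```

For example, describe_product({"bar": []}, "foo") reports a missing product, while describe_product({"bar": []}, "bar") reports an empty one.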

Support applying dataframe query in de_client

--print-product returns a single dataframe for a given data product. Allow the user to apply a filter or dataframe query to the resulting dataframe. The pandas DataFrame supports a query() API, which may be useful for this.
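
As an illustration of the idea, the sketch below applies an optional query string to a data frame with DataFrame.query(). The function name filter_product is hypothetical; how the query string would reach de-client (for instance through a new command-line option) is left open.

```python
import pandas as pd


def filter_product(frame, query=None):
    """Return the frame unchanged, or filtered by a pandas query string."""
    if not query:
        return frame
    return frame.query(query)
```

For a frame with a numeric column named a, filter_product(frame, "a > 1") keeps only the rows where a exceeds 1.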

Under certain circumstances the fetch of the "consumes" information fails but the channel does not go offline

This was observed in the production 1.1 decisionengine release.

We observed that glideclient classads were not being sent to the factory and traced the problem to the following issue:

2020-10-11 13:23:33,077 - root - TaskManager - 110564 - GlideinRequestManifests - INFO - transform GlideinRequestManifests did not get consumes data in 300 seconds. Exiting

This was on cmsde01 in cms_resource_request.log

Even though the key transform exited, the channel was not set offline.

We observed that in time the resource_request channel, which should run every 5 minutes, was only running every 30 minutes or so because the system was so slowed down.

This error cleared on restart.

We do not know where the process was caught.

This is the first time we have observed the 100% cpu. We had one previous instance where glideclient classads were not being advertised to the factory.

We will look for this in the 1.4 release with higher debug levels enabled this time.

Sources should have a default value for the schedule

If a source is configured without a schedule, it is run only once. This is not obvious. We need a default value for the schedule when it is not set in the channel configuration.

Default schedule value: 5 min
If the schedule is set to 0, the admin wants to run the source only once; this needs to be configured explicitly.
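
The proposed rule could be sketched as follows, assuming the schedule is expressed in seconds as in the channel configurations (for example "schedule": 360). The names DEFAULT_SCHEDULE, resolve_schedule and runs_once are hypothetical.

```python
DEFAULT_SCHEDULE = 300  # seconds, i.e. the proposed 5 minute default


def resolve_schedule(source_config):
    """Return the configured schedule, or the default when none is set."""
    schedule = source_config.get("schedule", DEFAULT_SCHEDULE)
    if schedule < 0:
        raise ValueError(f"schedule must not be negative, got {schedule}")
    return schedule


def runs_once(source_config):
    """A schedule of exactly 0 is the explicit request to run only once."""
    return resolve_schedule(source_config) == 0
```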

Consider partitioning large tables in the DB

We worked with the database group to see how much disk space is freed up now that we have started doing periodic reaping of old records. A full dump and restore of the DEV database took us to 581 GB as compared to 649 GB, took ~2 hours to dump and 8.5 hours to restore.
We are now testing a FULL VACUUM of the 649 GB database to see how long that takes.

From Olga Vlasova
"Is there any plan to partition the table dataproduct,
or even better to split it between the smaller tables?!

Or if you have any archive/historical data you can create and copy it to the tables with the same schema for easy access.

Currently, the size of the table (in dev 600G) and in production 1733GB is too large for the proper maintenance.
We cannot quickly complete backup and restore, and vacuum will take significant time as well locking your one big table - you will not have access during the vacuum process.

You need to take into account all this timing and difficulties in maintaining such large table, and think how to improve it without additional downtime as we have to do it now."

Taskmanager errors with "No objects to concatenate"

2019-02-24 16:08:07,168 - decision_engine - TaskManager - MainThread - ERROR - error in decision cycle(logic engine) No objects to concatenate
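
The message text matches the ValueError that pandas.concat raises when it is given an empty sequence. Whether that is what happened in this decision cycle is an assumption; if it is, a guard along the following lines would avoid the exception. The helper name concat_or_empty is hypothetical.

```python
import pandas as pd


def concat_or_empty(frames):
    """Concatenate data frames, returning an empty frame when there are none."""
    frames = list(frames)
    if not frames:
        # pd.concat([]) would raise ValueError: No objects to concatenate
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```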

log_level issues in new decisionengine 1.2.0-1

I am trying to test the new functionality which was implemented for issue !84.

The logger section of the /etc/decisionengine/decision_engine.conf is below

'logger' : {'log_file': '/var/log/decisionengine/decision_engine_log',
'max_file_size': 200*1000000,
'max_backup_count': 6,
'log_level': "DEBUG",
},

But although I am running seven channels I do not see any DEBUG entries in any of the logs.

Is any further configuration necessary?

Also, what is the syntax to set the log level on a channel-by-channel basis?

This is set up on fermicloud117.fnal.gov right now, I can give root login if needed.

Steve Timm

Feature: Allow to dump the types of values in a PANDAS data frame

It often happens, as pandas data frames are filled, that one column ends up holding several different types of data. It is often necessary to isolate a single value that is of type "object" rather than type "string".
There are pandas functions which can print the type. This is a feature request to have the de-client utility dump that information: the pandas type of each column or, if necessary, of each value.
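
A small sketch of what such a dump could compute, assuming the data product is available as a pandas data frame. The function name column_value_types is hypothetical. Per-column dtypes are available directly as frame.dtypes; the helper below goes one level deeper and lists the Python types of the individual values in each column.

```python
import pandas as pd


def column_value_types(frame):
    """Map each column name to the sorted names of the value types it holds."""
    return {
        column: sorted({type(value).__name__ for value in frame[column]})
        for column in frame.columns
    }
```

A column that reports more than one type name is a candidate for the mixed-type problem described above.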

Add option to de-client --print-product to only print the column names in a data block and-or to print one or more records in key/value format.

For certain data blocks it is very difficult to check the number of columns because the tabulated printout is so wide that each row wraps over many lines. It would be very helpful to be able to print out only the column names, or to print records one field per line
(similar to the \G option of mysql).
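
Both output styles can be sketched with pandas alone. The function names and the exact record layout below are assumptions, loosely modelled on the vertical output of mysql.

```python
import pandas as pd


def column_names(frame):
    """Return only the column names of a data frame."""
    return list(frame.columns)


def records_as_key_value(frame, count=1):
    """Format the first count records one field per line."""
    lines = []
    for index, row in frame.head(count).iterrows():
        lines.append(f"*** record {index} ***")
        for name, value in row.items():
            lines.append(f"{name}: {value}")
    return "\n".join(lines)
```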

systemctl start decision engine doesn't work

When using the systemctl command as supplied in the rpm, the
start command never exits, and thus after two minutes systemctl declares a timeout.
Commands that are executed in the "start" step are supposed to exit in order to work with systemctl.

Reload config functionality

Add functionality to reload channel configs

TaskManager should log all created workers

It would be good to log the various created workers for debugging.

de-client hangs under certain circumstances in version 1.4 and greater (race condition)

This is related to issue #189 but different.
Initially you could not stop a channel that was in ERROR state.
I have observed once, but only once, that de-client hung when trying to start a channel.

i.e.

de-client --stop-channel resource_request
worked

but
de-client --start-channel resource_request

just hung indefinitely, and then all other de-client commands in separate windows hung as well.

Attempts to reproduce this on other machines and on the same machine were not successful, but
it is a pointer to a possible race condition and so we should leave this issue open and probably close #189.

This may also be related to issue #209 when we saw a similar hanging condition in the 1.1 branch which we were running in production at the time.

Compress datablock stored in the database

Addressed in #33

Remove the channel config files from the framework

These files are already available in the respective modules and should not be provided by the framework.

Feature request: have configuration file reflect PRODUCES, CONSUMES

The PRODUCES and CONSUMES fields of sources and transforms are not evident in the configuration file. It would be nice to not have to examine the code of the modules to figure this out.

More generally, it would be good to have some kind of syntax and content checker for decision channel config files, which are written in Python.
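
A content checker could start out as small as the sketch below. The required keys are taken from channel configuration snippets that appear in these issues ("module", "name", "parameters"); treating exactly these as mandatory, and the function name check_module_config, are assumptions.

```python
# Keys assumed to be mandatory for every source or transform entry.
REQUIRED_KEYS = ("module", "name", "parameters")


def check_module_config(label, config):
    """Return a list of human-readable problems found in one module entry."""
    if not isinstance(config, dict):
        return [f"{label}: expected a dict, got {type(config).__name__}"]
    problems = [
        f"{label}: missing required key '{key}'"
        for key in REQUIRED_KEYS
        if key not in config
    ]
    if "parameters" in config and not isinstance(config["parameters"], dict):
        problems.append(f"{label}: 'parameters' must be a dict")
    return problems
```

An empty list means no problems were found; the same structure could later be extended to declare and verify PRODUCES and CONSUMES entries.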

de-client.py doesn't work with new decisionengine framework 0.3.4

I upgraded the decisionengine rpm from 0.3.3 to 0.3.4. The de-client.py code itself appears to be the same as what I was running before. However, we now get the following:

[root@hepcsvc03 decisionengine]# de-client --status
Traceback (most recent call last):
  File "/usr/bin/de-client", line 75, in 
    print s.status()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in call
    return self.send(self.name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in request
    verbose=self.verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1306, in single_request
    return self.parse_response(response)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1482, in parse_response
    return u.close()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 794, in close
    raise Fault(**self._stack[0])


xmlrpclib.Fault: :'module' object has no attribute '_state_names'">

Decision Engine Logger output fragments to multiple files

We note that under certain circumstances, namely when the decision_engine_log or decision_engine_debug_log is rotated for size, or when a single decision channel is restarted, the logs can be split to several different files being written simultaneously. Developers have said that to fix this requires a complete redesign of the decision engine logger, which at the moment is just the default python logger, and that the issue is that each decision channel runs as a separate process and not just a thread.
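
One pattern from the Python standard library for logging from several processes is to have each process send its records through a queue to a single listener that owns the log files. The sketch below shows the two halves using logging.handlers.QueueHandler and QueueListener. It is an illustration of that general pattern, not a description of how the decision engine logger works; the function names are hypothetical, and in a multi-process setup the queue would be a multiprocessing.Queue.

```python
import logging
import logging.handlers


def start_log_listener(queue, *handlers):
    """Start the single listener that writes queued records to the handlers."""
    listener = logging.handlers.QueueListener(queue, *handlers)
    listener.start()
    return listener


def channel_logger(queue, channel_name):
    """Return a logger for one channel that only forwards records to the queue."""
    logger = logging.getLogger(channel_name)
    logger.handlers.clear()
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.propagate = False  # avoid writing the same record twice
    logger.setLevel(logging.DEBUG)
    return logger
```

With this layout only the listener rotates files, so a rotation or a channel restart does not leave several processes writing to different files.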

Flexible log level per channel

Currently the log level of decision engine is hardcoded to WARN.

The initial log level for all channels must be configurable and defined in the configuration.
The log level should also be modifiable at run time, per channel, like so:

   de-client --log-level=DEBUG
   de-client --log-level=DEBUG <channel name>
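
On the server side, the core of such a command could look like the sketch below. It assumes that each channel logs through a logger named after the channel, which is not confirmed here, and the function name set_log_level is hypothetical. Because channels run as separate processes, the request would also have to be forwarded to the process that owns the logger.

```python
import logging


def set_log_level(level_name, channel=None):
    """Set the level of one channel logger, or of the root logger."""
    logger = logging.getLogger(channel)  # None selects the root logger
    # setLevel accepts a level name and raises ValueError for unknown names.
    logger.setLevel(level_name.upper())
    return logger.level
```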

Decision Channel goes offline when a source fails even if nothing CONSUMES the output of that source

In our Nersc channel there is a NerscJobInfo source which produces a data block that nothing consumes.
Nevertheless, when this source failed, the whole channel was taken offline. The framework should fail the channel only if the datablock is needed elsewhere.
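
The requested check amounts to a set intersection between what the failed source produces and what the remaining modules consume. A minimal sketch, with hypothetical names and plain collections standing in for the PRODUCES and CONSUMES declarations:

```python
def source_failure_is_fatal(failed_produces, consumes_by_module):
    """Return True if any module consumes a product of the failed source."""
    needed = set()
    for consumes in consumes_by_module.values():
        needed.update(consumes)
    return bool(needed & set(failed_produces))
```

With this rule a source whose output nobody consumes could fail and be logged without taking the channel offline.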

Logic Engine call leads to immediate taskmanager segfault exit

We have observed in two recent builds against trunk, the following behavior:

4 decision engine channels go to ERROR state and their task managers exit, with the last thing in their log being:

2020-12-17 10:16:18,102 - root - LogicEngine - 24669 - MainThread - INFO - LE: calling evaluate_facts
2020-12-17 10:16:18,102 - root - LogicEngine - 24669 - MainThread - INFO - Evaluated Fact: allow_aws_config -> Value: True -> TypeOf(Value): <class 'bool'>
2020-12-17 10:16:18,102 - root - LogicEngine - 24669 - MainThread - INFO - LE: calling execute

Evidently the name of the c++ logic engine library was recently changed along with the manner in which it was built.

Early analysis by Pat Riehecky indicates there is a segfault showing in dmesg at the point of failure.

Error when trying to run reaper in version 1.4.0

de-reaper
Traceback (most recent call last):
File "/usr/bin/de-reaper", line 23, in
config_file = policies.global_config_filename()
AttributeError: module 'decisionengine.framework.config.policies' has no attribute 'global_config_filename'

What should the attribute "global_config_filename" be set to?
In which section of the config file does it belong?

Change install script of decisionengine rpm

In the decisionengine rpm the decisionengine user is created with home directory /var/lib/decisionengine and no shell.
It should be a home directory of /home/decisionengine and a /bin/bash shell.

pytest fails in mysterious ways

Currently on the trunk we have pytest failing like so:

https://travis-ci.com/github/HEPCloud/decisionengine/jobs/380414417

------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------
2020-09-02 13:39:26,237 - decision_engine - DecisionEngine - MainThread - ERROR - Channel test_channel failed to start : can't pickle module objects
2020-09-02 13:39:26,238 - decision_engine - DecisionEngine - MainThread - ERROR - Exception
Traceback (most recent call last):
  File "/home/litvinse/decisionengine/framework/engine/DecisionEngine.py", line 292, in start_channels
    self.start_channel(ch)
  File "/home/litvinse/decisionengine/framework/engine/DecisionEngine.py", line 275, in start_channel
    worker.start()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
=============================================================================== FAILURES ===============================================================================
___________________________________________________________ TestChannel.test_client_can_get_de_server_status ___________________________________________________________
Traceback (most recent call last):
  File "/home/litvinse/decisionengine/framework/tests/test_channel.py", line 99, in test_client_can_get_de_server_status
    msg="Channel not in STEADY state")
  File "/usr/lib64/python3.6/unittest/case.py", line 1106, in assertIn
    self.fail(self._formatMessage(msg, standardMsg))
  File "/usr/lib64/python3.6/unittest/case.py", line 687, in fail
    raise self.failureException(msg)
AssertionError: 'STEADY' not found in 'Channel test_channel is in ERROR state\n\nreaper:\n\tstate: State.STARTING\n\tretention_interval: 370\n' : Channel not in STEADY state

The following stack trace is seen:

 File "/home/litvinse/decisionengine/framework/engine/DecisionEngine.py", line 292, in start_channels
    self.start_channel(ch)
  File "/home/litvinse/decisionengine/framework/engine/DecisionEngine.py", line 275, in start_channel
    worker.start()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects

The error occurs on an attempt to start the worker process that runs a channel.
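
The popen_spawn_posix.py frames in the trace show that the spawn start method was in use, which pickles the process object before launching the child. Module objects cannot be pickled, so any module reference reachable from the worker triggers this TypeError. The sketch below demonstrates the limitation and one common way around it, which is to carry the module name and import it again on the other side. Whether the channel worker actually holds a module reference, and where, is not established here; the helper names are hypothetical.

```python
import importlib
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


def load_module(spec):
    """Import the module named in a picklable spec dictionary."""
    return importlib.import_module(spec["module"])
```

A spec such as {"module": "json"} can cross the process boundary, while the json module object itself cannot.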

The error happens when pytest runs on all tests.

Just running

pytest -v --tb=native tests/test_channel.py

produces:

tests/test_channel.py::TestChannel::test_client_can_get_de_server_status PASSED                                                                                  [100%]

Running the following also succeeds:

(venv) [litvinse@fermicloud371 decisionengine]$ pytest  -v --tb=native ./tests/test_channel.py ./dataspace/tests/test_datablock.py ./dataspace/tests/test_Reaper.py ./dataspace/datasources/tests/test_postgresql.py ./util/tests/test_tsort.py ./configmanager/tests/test_configmanager.py ./logicengine/tests/test_pandas_fact.py ./logicengine/tests/test_simple_configuration.py ./logicengine/tests/test_rule_with_negated_fact.py ./logicengine/tests/test_cascaded_rules.py ./logicengine/tests/test_facts.py ./logicengine/tests/test_construction.py

But this fails:

(venv) [litvinse@fermicloud371 decisionengine]$ #pytest  -v --tb=native ./tests/test_channel.py ./dataspace/tests/test_datablock.py ./dataspace/tests/test_Reaper.py ./dataspace/datasources/tests/test_postgresql.py ./util/tests/test_tsort.py ./configmanager/tests/test_configmanager.py ./logicengine/tests/test_pandas_fact.py ./logicengine/tests/test_simple_configuration.py ./logicengine/tests/test_rule_with_negated_fact.py ./logicengine/tests/test_cascaded_rules.py ./logicengine/tests/test_facts.py ./logicengine/tests/test_construction.py ./engine/tests/test_runtime.py

The only difference between the two lists is that the second one is complete, while the first one is missing ./engine/tests/test_runtime.py.

That is, to distill it down, this fails like so:

(venv) [litvinse@fermicloud371 framework]$ pytest  -v --tb=native ./tests/test_channel.py ./engine/tests/test_runtime.py 
========================================================================= test session starts ==========================================================================
platform linux -- Python 3.6.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1 -- /home/litvinse/venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/litvinse/decisionengine/framework
plugins: postgresql-2.4.1
collected 9 items                                                                                                                                                      

tests/test_channel.py::TestChannel::test_client_can_get_de_server_status FAILED                                                                                  [ 11%]
tests/test_channel.py::TestChannel::test_client_can_get_de_server_status ERROR                                                                                   [ 11%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_reaper_start_delay PASSED                                                    [ 22%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_reaper_status PASSED                                                         [ 33%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_reaper_stop PASSED                                                           [ 44%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_reload_config PASSED                                                         [ 55%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_show_channel_logger_level PASSED                                             [ 66%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_show_config PASSED                                                           [ 77%]
engine/tests/test_runtime.py::TestClientServerPython::test_client_can_get_de_server_show_logger_level PASSED                                                     [ 88%]
engine/tests/test_runtime.py::TestClientServerPython::test_global_channel_log_level_in_config PASSED                                                             [100%]

================================================================================ ERRORS ================================================================================
________________________________________________ ERROR at teardown of TestChannel.test_client_can_get_de_server_status _________________________________________________
Traceback (most recent call last):
  File "/home/litvinse/decisionengine/framework/tests/test_channel.py", line 81, in tearDown
    self.de_client_request("--stop")
  File "/home/litvinse/decisionengine/framework/tests/test_channel.py", line 93, in de_client_request
    *args])
  File "/home/litvinse/decisionengine/framework/engine/de_client.py", line 200, in main
    return execute_command_from_args(args, socket)
  File "/home/litvinse/decisionengine/framework/engine/de_client.py", line 183, in execute_command_from_args
    return xmlrpcsocket.stop()
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1170, in single_request
    return self.parse_response(resp)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1342, in parse_response
    return u.close()
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 656, in close
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1: "<class 'AttributeError'>:'NoneType' object has no attribute 'task_manager'">

I failed to understand why this is happening. Fortunately, we can just move the functionality of test_runtime.py into test_channel.py, because all that test_runtime.py does is test client commands against a server which does not run any channels, whereas test_channel.py runs a channel, so client/server interaction can be tested there.

Unless, of course, we understand what the root cause is.

Restructure unittest directories

As per the code review feedback

separate the python and c/c++ scripts
use same directory structure as that of the actual codebase

Reaper error message on DE

After a reboot on cmsde01 we saw the following error in startup.log:

Reaper.reap() failed with FATAL: remaining connection slots are reserved for non-replication superuser connections

It is not clear why the reaper would have failed. We have never seen this error message before.

SourceProxy.py not compatible with new configuration features in Framework

I installed the CI rpms of the framework as built from the master branch on July 2. I installed the new framework rpm and left the decisionengine_modules RPM as it was, but in this particular case this is a pure Framework issue.
The main change of the framework is the configuration change to the .jsonnet format.

We initially got this dump:

2020-07-08 20:16:11,925 - root - TaskManager - 13588 - SourceProxy - ERROR - Exception running source SourceProxy : operator does not exist: text = text[]
LINE 4: ...kmanager_id=156 AND foo.generation_id=1424 AND key=ARRAY['aw...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 298, in run_source
data = src.worker.acquire()
File "/usr/lib/python3.6/site-packages/decisionengine/framework/modules/SourceProxy.py", line 152, in acquire
self._get_data(data_block, k_in))
File "/usr/lib/python3.6/site-packages/decisionengine/framework/modules/SourceProxy.py", line 94, in _get_data
data = data_block.get(key)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 296, in get
return self.getitem(key, default=default)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 363, in getitem
self.generation_id, key)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/dataspace.py", line 127, in get_dataproduct
return self.datasource.get_dataproduct(taskmanager_id, generation_id, key)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datasources/postgresql.py", line 241, in get_dataproduct
return self._select_dictresult(q, (taskmanager_id, generation_id, key))[0]
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datasources/postgresql.py", line 445, in _select_dictresult
sql_query, values, cursor_factory=psycopg2.extras.RealDictCursor)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datasources/postgresql.py", line 337, in _select
colnames, res = self.__query(query_string, values, cursor_factory)
File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datasources/postgresql.py", line 349, in __query
cursor.execute(query_string, values)
File "/usr/local/lib/python3.6/site-packages/DBUtils/SteadyDB.py", line 605, in tough_method
result = method(*args, **kwargs) # try to execute
File "/usr/local/lib64/python3.6/site-packages/psycopg2/extras.py", line 248, in execute
return super(RealDictCursor, self).execute(query, vars)
psycopg2.errors.UndefinedFunction: operator does not exist: text = text[]
LINE 4: ...kmanager_id=156 AND foo.generation_id=1424 AND key=ARRAY['aw...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.

2020-07-08 20:16:12,828 - root - TaskManager - 13588 - MainThread - ERROR - Error occured during initial run of sources. Task Manager AWS_Calculations_with_source_proxy.jsonnet exits

There is nothing involved from the modules here--just the base framework and the SourceProxy.py which is also part of the source framework.

This is the configuration that calls it

"AWSJobLimits": {
  "module": "decisionengine.framework.modules.SourceProxy",
  "name": "SourceProxy",
  "parameters": {
    "channel_name": "channel_aws_config_data",
    "Dataproducts": [
      [
        "aws_instance_limits",
        "Job_Limits"
      ]
    ],
    "retries": 3,
    "retry_timeout": 20
  },
  "schedule": 360
}

In the process of converting the config files Vito noticed that this source originally
had a tuple in the config.

Originally the old style config looked like this:

"AWSJobLimits" : {
    "module" : "decisionengine.framework.modules.SourceProxy",
    "name"   : "SourceProxy",
    "parameters": {"channel_name": "channel_aws_config_data",
	      "Dataproducts":[("aws_instance_limits", "Job_Limits")],
	      "retries": 3,
	      "retry_timeout": 20,
    	      },
   	"schedule": 360,
}

The purpose of a tuple in the config was to indicate that the data block channel_aws_config_data / aws_instance_limits should be brought into this channel with the name Job_Limits.
This is the only place in the whole set of modules where this is done.

Vito modified the code to take a tuple or a list, but that had no effect; we are still getting the same error as before.
So the error is not caused by the config changes or code changes that Vito is making.
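
The ARRAY['aw... fragment in the failing SQL suggests that the whole [source key, local name] pair from the Dataproducts setting reached the query as the key; psycopg2 adapts a Python list to a PostgreSQL array, which would explain the text = text[] comparison. If that reading is right, the pair has to be split before the lookup so that only the first element is used as the database key. The sketch below shows such a split; the function name split_dataproduct is hypothetical and this is not the actual SourceProxy code.

```python
def split_dataproduct(entry):
    """Return (source_key, local_name) for one Dataproducts entry.

    A two-element list or tuple renames the product; a plain string keeps
    the same name on both sides.
    """
    if isinstance(entry, (list, tuple)):
        if len(entry) != 2:
            raise ValueError(f"expected [source_key, local_name], got {entry!r}")
        return entry[0], entry[1]
    return entry, entry
```

Because lists and tuples are handled the same way, the helper works for both the old Python-style configuration and the new .jsonnet one.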

Logic Engine doesn't handle missing values gracefully

The only non-trivial logic engine config we have is in the resource_request channel. As you can see, there are several places which rely on pandas queries to get values to compare in the logical expressions. If for whatever reason the row is missing from the PANDAS data frame (for instance there is a key error in looking up AWS_Billing_Rate[AWS_Billing_Rate['accountName'] == 'Fermilab']), then the logic engine throws an error, fortunately not a fatal one.

We need either a way to incorporate a test for existence of the value we need, or some way to help the DE recover in a suitable way when the value is not available.

facts: {
  publish_requests: "(True)",
  allow_grid: "(True)",
  allow_lcf: "(True)",
  allow_gce: "(True)",
  allow_aws: "(True)",
  awswithininstburnrate: "financial_params.iloc[0].target_aws_vm_burn_rate>AWS_Burn_Rate.iloc[0].BurnRate",
  awswithinbillburnrate: "financial_params.iloc[0].target_aws_bill_burn_rate>AWS_Billing_Rate[AWS_Billing_Rate['accountName']=='Fermilab'].iloc[0].costRatePerHourInLastSixHours",
  awsabovebalance: "financial_params.iloc[0].target_aws_balance<AWS_Billing_Info[AWS_Billing_Info['AccountName']=='Fermilab'].iloc[0].Balance",
  gcewithininstburnrate: "financial_params.iloc[0].target_gce_vm_burn_rate>GCE_Burn_Rate.iloc[0].BurnRate",
  gceabovebalance: "financial_params.iloc[0].target_gce_balance<GCE_Billing_Info.iloc[0].Balance",
  uscmsnerscbelowlimit: "Nersc_Allocation_Info[Nersc_Allocation_Info['uname']=='uscms'].iloc[0].user_amount_charged<Nersc_Allocation_Info[Nersc_Allocation_Info['uname']=='uscms'].iloc[0].user_limit",
}
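
One way to make such lookups tolerant of missing rows is a small helper that returns a default instead of raising. The sketch below is only an illustration: the name first_value is hypothetical, and whether the logic engine can call helper functions from within fact expressions is not established here.

```python
import pandas as pd


def first_value(frame, column, default=None, **match):
    """Return column from the first row matching all key=value filters.

    Falls back to default when a column is absent or no row matches.
    """
    needed = [column, *match]
    if any(name not in frame.columns for name in needed):
        return default
    for name, wanted in match.items():
        frame = frame[frame[name] == wanted]
    if frame.empty:
        return default
    return frame.iloc[0][column]
```

For instance, first_value(AWS_Billing_Rate, 'costRatePerHourInLastSixHours', accountName='Fermilab') would return None rather than raise when the Fermilab row is absent.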

Framework gives ambiguous error message when transform reads incomplete data block

When a transform is called with an incomplete data block, either one that is totally null or
one that is missing fields that the transform needs, the error message is inaccurate.

2018-11-01 16:52:26,708 - decision_engine - TaskManager - JobClustering - ERROR - exception from JobClustering: name 'V' is not defined

We were not requesting any field called "V" although we were requesting one called "VO"

It would be nice to have a better message for debugging. Eventually I figured out that the modified
transform was using two fields that the source was not putting in the data frame (but neither
one of them had a "V" in it).
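
A check along the following lines, run before the transform is called, would name the missing inputs directly. This is a sketch only: `missing_fields` and `check_inputs` are hypothetical helpers, the data block is modelled as a plain dict of product name to columns, and the product and field names other than "VO" are made up for the example.

```python
def missing_fields(data_block, required):
    """Return a sorted list of products or 'product.field' names that are absent.

    data_block: dict mapping product name -> dict of column name -> values
    required:   dict mapping product name -> iterable of column names
    """
    problems = []
    for product, fields in required.items():
        columns = data_block.get(product)
        if columns is None:
            # The whole product is missing from the data block.
            problems.append(product)
            continue
        problems.extend(f"{product}.{field}" for field in fields if field not in columns)
    return sorted(problems)


def check_inputs(transform_name, data_block, required):
    """Raise a descriptive error before the transform touches the data."""
    missing = missing_fields(data_block, required)
    if missing:
        raise KeyError(f"{transform_name}: data block is missing {', '.join(missing)}")


# Hypothetical data block: one product present, one column and one product absent.
block = {"job_manifests": {"VO": ["cms"], "RequestCpus": [1]}}
needed = {"job_manifests": ["VO", "RequestMemory"], "job_totals": ["Count"]}
print(missing_fields(block, needed))
```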

New race condition in de-client

After a system reboot of cmsde01.fnal.gov, currently running framework version 1.4.1 and modules version 1.4.2,
the cms_resource_request channel did not start correctly. A stop/start of the channel worked but then all de-client
commands after that hung. We cannot leave it in this state since it is a production system.

Yum update on decisionengine rpm doesn't restart the service

Channel debug info now leaks into startup.log

I am observing that a huge amount of debug information from the resource_request channel
(and possibly other channels as well) is leaking into startup.log.
This cannot come from a logging configuration. startup.log only catches stuff that is getting written to standard out or standard error.

This is from the 1.4.0rc1 rpm. This was not happening before.

Decisionengine not starting as correct user

Packaging needs to be changed to move the init script to /usr/sbin/decision-engine. The current location causes the service to start as root instead of as the decisionengine user. Also, the default config file in the init script should be decision_engine.conf and not decisionengine.conf.

Can't restart resource_request channel with de-client --stop-channel / de-client --start-channel

de-client -v --stop-channel resource_request
An error occurred while trying to access a DE server at 'http://localhost:8888'
Please ensure that the host and port names correspond to a running DE instance.
<Fault 1: "<class 'AttributeError'>:'NoneType' object has no attribute 'task_manager'">

channel: AWS_Calculations_with_source_proxy , id = 36276E7F-FB68-4A29-B736-413E97947BD4, state = STEADY
channel: sample_gcebilling , id = 602F0A9C-3C34-46E9-B161-49C0D370CD43, state = STEADY
channel: Gce , id = 817B8402-B4A2-4874-A6CB-033C5A9DD459, state = STEADY
channel: Nersc , id = 8BEB2096-D39F-4869-AB12-EEBCAB6318E6, state = STEADY
channel: channel_aws_config_data , id = 9416C0BE-DAEA-4081-B6C8-68B5532F6308, state = STEADY
channel: AWSbilling , id = 7AB6FAF0-D993-4E46-8B8D-E6A6787EE2CC, state = STEADY
Channel resource_request is in ERROR state
channel: job_classification , id = 4F4995F3-24CE-480C-ABC8-DDEAC91D4CB3, state = STEADY
state: State.STARTING

I also note that the
"Channel resource_request is in ERROR state" message has changed not only the format of the
previous message, but also the reported state (ERROR vs. OFFLINE). Operations has tooling that greps this output.
Is it going to stay this way? If so, we will have to readjust the monitoring.

Kyle reports he already has a patch.

For the record, the channel was expected to go into offline/error state in this configuration because we
had the timeout for the AWSBillingInfo SourceProxy set to 3 retries / 20 s, which is known to be not enough time.

Feature: Develop a way to probe datablock values over a time series

It would be very good to be able to go back in the database and dump a value from a given datablock as it changed as a function of time. Maybe with the de-client but also under program control. Eventually would like to make a transform that can make predictions based on the history of one value in a data block.

ChannelConfigHandler should note if a config passed validation

As a debug log, ChannelConfigHandler should note if a config passed validation so we can isolate where things break.

Create and automate CI/CD pipeline

Configuration language examples

A recommendation will be made to the HEPCloud developers on June 4, 2020 regarding the configuration language. Please download the tarball here to compare current configurations with those represented in the new language.

yum update on decision engine rpm from python2 to python3 doesn't undo the symlinks

The decision engine rpm (python2 version), makes 2 symlinks

/usr/sbin/decisionengine -> /usr/lib/python2.7/site-packages/decisionengine/framework/engine/DecisionEngine.py
/usr/bin/de-client -> /usr/lib/python2.7/site-packages/decisionengine/framework/engine/de_client.py

When the framework is updated with "yum update decisionengine",
the symlinks do not get removed.

On a clean install of the python3.6 rpms the symlinks do get made correctly.

Spurious message to rename config files

Please rename '/etc/decisionengine/config.d/job_classification.jsonnet' to '/etc/decisionengine/config.d/job_classification.jsonnet'.
Please rename '/etc/decisionengine/decision_engine.conf' to '/etc/decisionengine/decision_engine.jsonnet'.
Please rename '/etc/decisionengine/config.d/AWS_Calculations_with_source_proxy.jsonnet' to '/etc/decisionengine/config.d/AWS_Calculations_with_source_proxy.jsonnet'.
Please rename '/etc/decisionengine/config.d/sample_gcebilling.jsonnet' to '/etc/decisionengine/config.d/sample_gcebilling.jsonnet'.
Please rename '/etc/decisionengine/config.d/Gce.jsonnet' to '/etc/decisionengine/config.d/Gce.jsonnet'.
Please rename '/etc/decisionengine/config.d/Nersc.jsonnet' to '/etc/decisionengine/config.d/Nersc.jsonnet'.
Please rename '/etc/decisionengine/config.d/channel_aws_config_data.jsonnet' to '/etc/decisionengine/config.d/channel_aws_config_data.jsonnet'.
Please rename '/etc/decisionengine/config.d/AWSbilling.jsonnet' to '/etc/decisionengine/config.d/AWSbilling.jsonnet'.
Please rename '/etc/decisionengine/config.d/resource_request.jsonnet' to '/etc/decisionengine/config.d/resource_request.jsonnet'.
2020-07-08 20:16:14,019 - decision_engine - ConfigManager - MainThread - ERROR - resource_request.jsonnet No module named 'glideinwms', REMOVING the channel
Please rename '/etc/decisionengine/config.d/job_classification.jsonnet' to '/etc/decisionengine/config.d/job_classification.jsonnet'.

The framework is telling me to rename all the files to *.jsonnet even though I have already done so.
Please fix that.
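
The expected behaviour can be sketched as a suffix check that only produces the message when the name actually needs to change. `rename_hint` is a hypothetical helper written for this illustration, not the framework's code.

```python
from pathlib import PurePosixPath


def rename_hint(path, wanted_suffix=".jsonnet"):
    """Return a rename message, or None when the file already has the suffix."""
    p = PurePosixPath(path)
    if p.suffix == wanted_suffix:
        return None
    return f"Please rename '{p}' to '{p.with_suffix(wanted_suffix)}'."


for name in (
    "/etc/decisionengine/config.d/Gce.jsonnet",
    "/etc/decisionengine/decision_engine.conf",
):
    hint = rename_hint(name)
    if hint:
        print(hint)
```

With this check, only the decision_engine.conf line from the output above would still be printed.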

SourceProxy doesn't respect the retry_timeout field

All SourceProxies have a retry_timeout that can be configured and is supposed to be measured in seconds.
We observe that in fact the retries are happening, but are happening immediately, not after the configured 20 seconds.

Please investigate.

This is being observed in decisionengine_modules/AWS/sources/BillingInfoSourceProxy.py,
which is configured with a limit of 100 retries and a retry_timeout of 20 seconds. That is sometimes enough
to let it retry until the data arrives, and sometimes not. Other SourceProxies that use the SourceProxy base class also have this problem. We are currently working around this by configuring them to have an obscenely high number of retries.
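
For reference, the behaviour we expect from retry_timeout looks roughly like the loop below. It is a sketch and not the SourceProxy code: `call_with_retries`, `fetch` and the injectable `sleep` argument are hypothetical names chosen so the waits can be observed without actually sleeping.

```python
import time


def call_with_retries(fetch, max_retries, retry_timeout, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting retry_timeout seconds between tries.

    Makes at most max_retries attempts and re-raises the last error.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise
            # The wait between attempts is the step that appears to be skipped.
            sleep(retry_timeout)


# Demonstration with a source that only has data on the third attempt.
waits = []
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("data not ready")
    return "billing data"


result = call_with_retries(flaky, max_retries=5, retry_timeout=20, sleep=waits.append)
```

Here `waits` records one 20-second wait per failed attempt, which is what the configuration implies should happen.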

Version 1.1.0-1 decisionengine rpm has bad requires

rpm -qp --requires decisionengine-1.1.0-1_py2.7.x86_64.rpm
/bin/bash
/bin/sh
/bin/sh
/bin/sh
/sbin/service
/usr/bin/env
/usr/sbin/useradd
boost-python2.7 >= 1.53.0
boost-python2.7-devel >= 1.53.0
boost-regex >= 1.53.0
boost-system >= 1.53.0
config(decisionengine) = 1.1.0-1_py2.7
libLogicEngine.so()(64bit)
libboost_python.so.1.53.0()(64bit)
libboost_regex.so.1.53.0()(64bit)
libboost_system.so.1.53.0()(64bit)
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libgcc_s.so.1()(64bit)
libgcc_s.so.1(GCC_3.0)(64bit)
libm.so.6()(64bit)
libpython2.7.so.1.0()(64bit)
libstdc++.so.6()(64bit)
libstdc++.so.6(CXXABI_1.3)(64bit)
libstdc++.so.6(GLIBCXX_3.4)(64bit)
libstdc++.so.6(GLIBCXX_3.4.15)(64bit)
libstdc++.so.6(GLIBCXX_3.4.9)(64bit)
python(abi) = 2.7
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PartialHardlinkSets) <= 4.0.4-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rtld(GNU_HASH)

In particular, it requires boost-python2.7 >= 1.53.0,
which is not supplied by the stock boost-python rpm in SL7.
It should just require boost-python.

Make list of ignored warnings consistent between GitHub actions and Travis (pep8speak)

I am getting these warnings:

Line 1129:13: W503 line break before binary operator

pylint as run by GitHub Actions does not generate these. Please make the pep8speak list of ignored pep8 errors consistent with GitHub Actions.

Current setting:

build/scripts/run_pep8.sh:    PEP8_OPTIONS="--ignore=E261,E265,E302,E303,E501,E129,E221,E241,E272,E731,E1004,W503,W504,F999,N801,N813,N814"

This needs to be done for both repos: decisionengine and decisionengine_modules.

yum doesn't correctly find the latest version of the rpm

"yum update decisionengine" picked decisionengine-1.3.0rc1 instead of decisionengine-1.3.0-1
as the latest decisionengine rpm available. Similarly it picked the 1.2 version of decisionengine-standard-library instead of the 1.3 version which was available in the yum repo.

Furthermore the CI rpms for decisionengine-standard-library are still tagged with a 1.2 version release tag rather than a 1.3 version release tag.
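
The cause has not been confirmed, but one plausible explanation for the rc1 pick is how RPM orders version strings: the string is split into numeric and alphabetic segments, and when all shared segments are equal the string with segments left over counts as newer. The sketch below is a simplified re-implementation of that ordering for illustration only. `compare_versions` is a hypothetical name, and epoch, release, tilde and caret handling are deliberately left out.

```python
import re

_SEGMENT = re.compile(r"\d+|[A-Za-z]+")


def compare_versions(a, b):
    """Return 1 if a sorts newer, -1 if b sorts newer, 0 if equal (simplified)."""
    segs_a = _SEGMENT.findall(a)
    segs_b = _SEGMENT.findall(b)
    for x, y in zip(segs_a, segs_b):
        if x.isdigit() and y.isdigit():
            x, y = int(x), int(y)
        elif x.isdigit():
            return 1  # a numeric segment sorts newer than an alphabetic one
        elif y.isdigit():
            return -1
        if x != y:
            return 1 if x > y else -1
    # All shared segments match: the version with segments left over is newer.
    return (len(segs_a) > len(segs_b)) - (len(segs_a) < len(segs_b))


print(compare_versions("1.3.0rc1", "1.3.0"))
```

Under this ordering 1.3.0rc1 sorts after 1.3.0, which would match what yum did. If that is the cause, keeping the rc marker out of the version field would avoid it.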

DEBUG level not taking effect in main framework

I am testing the framework version as set in Master as built July 2.

"logger": {
"log_file": "/var/log/decisionengine/decision_engine_log",
"max_file_size": 200000000,
"max_backup_count": 6,
"log_level": "DEBUG"
},

de-client --print-engine-loglevel correctly shows DEBUG, but there are no DEBUG messages in decision_engine_log or decision_engine_log_debug (we are only seeing INFO).
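
I have not confirmed where the messages are dropped. One common way to get this symptom with the Python logging module is a logger set to DEBUG whose handler is still at a higher level. The sketch below shows the effect with in-memory streams; `build_logger` and the logger names are made up for the illustration and are not the framework's logging setup.

```python
import io
import logging


def build_logger(name, logger_level, handler_level, stream):
    """Create a logger with a single stream handler at the given levels."""
    logger = logging.getLogger(name)
    logger.setLevel(logger_level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setLevel(handler_level)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.handlers = [handler]
    return logger


# Logger at DEBUG, handler at INFO: the logger reports DEBUG but drops the message.
filtered = io.StringIO()
log = build_logger("sketch.filtered", logging.DEBUG, logging.INFO, filtered)
log.debug("hidden")
log.info("shown")

# Logger and handler both at DEBUG: the message is written.
passing = io.StringIO()
log = build_logger("sketch.passing", logging.DEBUG, logging.DEBUG, passing)
log.debug("now visible")
```

In the first case the logger's effective level is DEBUG, as the de-client reports, yet only the INFO line reaches the stream.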

Add self tests for rpc_print_product types=True

With the merge of #216 we should make sure to test the new feature on the public API.

Test channel 'NOOP' doesn't respond to shutdown signal

The noop channels do not listen to the shutdown signal sent from the RPC call stop_channels.
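
A loop that honours a shutdown signal could look like the sketch below. It assumes the signal is delivered as a `threading.Event`, which may not be how the framework does it, and `run_channel` and `do_work` are hypothetical names.

```python
import threading


def run_channel(stop_event, do_work, interval=0.05):
    """Run do_work() repeatedly until stop_event is set; return the cycle count."""
    cycles = 0
    while not stop_event.is_set():
        do_work()
        cycles += 1
        # wait() returns as soon as the event is set, so shutdown is not
        # delayed by a full sleep interval.
        stop_event.wait(interval)
    return cycles


stop = threading.Event()
results = []
worker = threading.Thread(target=lambda: results.append(run_channel(stop, lambda: None)))
worker.start()
stop.set()  # stands in for what an RPC such as stop_channels would trigger
worker.join(timeout=5)
```

The point of the design is that the loop checks the event on every cycle and also waits on it, instead of sleeping unconditionally.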

/usr/lib/systemd/system/decision-engine.service unit file still has bad dependency

/usr/lib/systemd/system/decision-engine.service as shipped in version 0.3.4 of the decisionengine rpm
has an incorrect dependency on the condor service. The condor service does not run on a decision engine.
I suggest replacing it with a dependency on httpd, which does. I have made the edit by hand on hepcsvc03.
