
jobcontrol's Introduction

Job Control

https://raw.githubusercontent.com/rshk/jobcontrol/develop/.misc/banner.png

Job scheduling and tracking library.

Provides a base interface for scheduling, running, tracking and retrieving results for "jobs".

Each job definition is simply any Python callable, along with arguments to be passed to it.

Tracking includes storing:

  • the function return value
  • any exception raised
  • log messages produced during task execution
  • optionally a "progress" report, if the task supports it

The status storage is completely decoupled from the main application.

The project "core" currently includes two storage implementations:

  • MemoryStorage -- keeps all data in memory, useful for development / testing.
  • PostgreSQLStorage -- keeps all data in a PostgreSQL database, meant for production use.

Project status

Travis CI build status

Branch Status
master https://travis-ci.org/rshk/jobcontrol.svg?branch=master
develop https://travis-ci.org/rshk/jobcontrol.svg?branch=develop

Source code

Source is hosted on GitHub: https://github.com/rshk/jobcontrol/

And can be cloned with:

git clone https://github.com/rshk/jobcontrol.git

Python Package Index

The project can be found on PyPI here: https://pypi.python.org/pypi/jobcontrol


Project documentation

Documentation is hosted on GitHub pages:

http://rshk.github.io/jobcontrol/docs/

A mirror copy is hosted on ReadTheDocs (compiled automatically from the Git repository; uses RTD theme; supports multiple versions):

http://jobcontrol.rtfd.org/

Concepts

Jobs are simple definitions of tasks to be executed, in terms of a Python function, with arguments and keywords.

They also allow defining dependencies (and seamless passing of return values from dependencies as call arguments), cleanup functions, and other nice stuff.

The library itself is responsible for keeping track of job execution ("build") status: start/end times, return value, whether it raised an exception, all the log messages produced during execution, the progress report of the execution, ...

It also features a web UI (and web APIs are planned) to get an overview of build statuses, launch new builds, and so on.


jobcontrol's Issues

Reverse dependency building

We need to decide how to handle this, to avoid the risk of infinite loops, ...

  • Create some view listing all the "outdated" builds
  • Allow some option when running the build to enable building (some) revdeps too, upon successful build.

Add some kind of "signalling" system to allow parallel building and decide when something needs to be built

Example jobs:

A -> B
B -> C
B -> D
C -> E
D -> E

  • we start by building E
  • once E is built, we can start (in parallel) both C and D
  • once both C and D are done, we can start B
  • once B is done, we can finally start A

We need some way to keep track of this and react on "events" such as "job X completed"
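The leveled ordering described above can be computed with a variant of Kahn's topological sort. A sketch (assuming the graph is available as a job -> dependencies mapping; names are illustrative, not the jobcontrol API):

```python
def build_levels(deps):
    """Group jobs into levels: everything in level N depends only on
    jobs in earlier levels, so each level can be built in parallel.

    `deps` maps each job to the jobs it depends on (arrows point *to*
    dependencies, as in the example above).
    """
    deps = {job: set(d) for job, d in deps.items()}
    levels = []
    while deps:
        # Jobs whose dependencies are all already built
        ready = {job for job, d in deps.items() if not d}
        if not ready:
            raise ValueError('Circular dependency detected')
        levels.append(sorted(ready))
        deps = {job: d - ready for job, d in deps.items()
                if job not in ready}
    return levels

# The example graph: "A -> B" means "A depends on B"
graph = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'], 'E': []}
print(build_levels(graph))  # [['E'], ['C', 'D'], ['B'], ['A']]
```

Everything within one level (here C and D) can be dispatched to workers in parallel; the "job X completed" events then only need to trigger a check of whether the next level is ready.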

We don't really care about abrupt termination of jobs (eg. the worker process being killed), as we won't continue anyway -- ⚠️ but still, we should mark a build as failed if a dependency build failed!
(allow starting builds "as a dependency of ..."? how to store updated build ids as soon as new builds become available?)

Add support for "protected" jobs

Eg. for importing data in production: we want to make sure the job is not executed by mistake:

  • remove quick run button from main page list
  • the "build" button on the job page should have an "arm" button, to toggle disabled state

Dependency resolution + loop detection *before* job run

Add some way to detect circular dependencies at job configuration time -> add some warning to inform the user earlier (before running).

For the initial release no automatic build of dependencies is done -- jobcontrol will simply check that all dependencies are met before creating the build.
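Detection at configuration time can be a plain depth-first search over the dependency graph. A sketch (assuming the graph is available as a job -> dependencies mapping):

```python
def find_cycle(deps):
    """Return a list of jobs forming a dependency cycle, or None.
    Recursive DFS with three colors (unvisited / in progress / done)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {job: WHITE for job in deps}

    def visit(job, stack):
        color[job] = GRAY
        stack.append(job)
        for dep in deps.get(job, ()):
            if color.get(dep, WHITE) == GRAY:
                # Found a back edge: return the cycle itself
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep, stack)
                if cycle:
                    return cycle
        color[job] = BLACK
        stack.pop()
        return None

    for job in deps:
        if color[job] == WHITE:
            cycle = visit(job, [])
            if cycle:
                return cycle
    return None

print(find_cycle({'a': ['b'], 'b': ['c'], 'c': ['a']}))
# ['a', 'b', 'c', 'a']
```

Returning the cycle itself (not just a boolean) makes it easy to show the offending jobs in the warning message.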

Better form CSRF protection

See also: https://code.djangoproject.com/wiki/CsrfProtection

Right now, this method is in place: http://flask.pocoo.org/snippets/3/

but this prevents the user from submitting a form that was opened before another form was submitted.
Example:

  • GET form1.html
  • GET form2.html
  • POST form1.html (form submit) -> 403, as the token in session changed

If we reuse the same token for the whole user session, we're potentially vulnerable to session fixation.

Note: if the login is handled by the webserver, and there is no way for an anonymous user to get a session, the risk of session fixation is greatly reduced.
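One way to keep multiple open forms working is to store a small, bounded set of valid one-time tokens in the session instead of a single rotating token. A sketch (the function names are made up, and a plain dict stands in for the dict-like flask.session):

```python
import secrets

MAX_TOKENS = 20  # bound the list so the session cookie stays small

def generate_csrf_token(session):
    """Mint a fresh token and remember it in the session."""
    token = secrets.token_urlsafe(32)
    tokens = session.setdefault('_csrf_tokens', [])
    tokens.append(token)
    del tokens[:-MAX_TOKENS]  # evict the oldest tokens past the limit
    return token

def validate_csrf_token(session, token):
    """Accept a token at most once; other open forms stay valid."""
    tokens = session.get('_csrf_tokens', [])
    if token in tokens:
        tokens.remove(token)  # one-time use
        return True
    return False

session = {}  # stands in for flask.session
t1 = generate_csrf_token(session)  # GET form1.html
t2 = generate_csrf_token(session)  # GET form2.html
assert validate_csrf_token(session, t2)      # POST form2.html: ok
assert validate_csrf_token(session, t1)      # POST form1.html: still ok
assert not validate_csrf_token(session, t1)  # replay: rejected
```

Since each token is single-use and the set is bounded, this avoids both the 403-on-second-tab problem and unbounded session growth.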

Add support for job "groups"

...to allow better views; groups should support some way to specify "hierarchical" levels. What to use as a separator? Something "uncommon", e.g. ::?

Job configuration in YAML

Support job configuration in YAML.

  • user-friendly configuration format
  • flexible, allows us to easily extend serializable objects

See also #9

Proposed configuration example:

module: package_name.module_name
function: function_name
args:
    - one
    - two
kwargs:
    three: 3
    four: !retval 3
dependencies: [3]
# ... support for custom extensions here ...

It would also be nice to have a "natural key" field, allowing it to be kept in sync with an external configuration file...
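The !retval tag in the proposed configuration could be implemented with a custom PyYAML constructor. A sketch (the Retval placeholder class is hypothetical, not part of jobcontrol):

```python
import yaml

class Retval:
    """Placeholder for "return value of dependency build <job_id>"."""
    def __init__(self, job_id):
        self.job_id = job_id
    def __repr__(self):
        return 'Retval({!r})'.format(self.job_id)

def retval_constructor(loader, node):
    # The explicit !retval tag suppresses implicit typing,
    # so the scalar arrives as a plain string
    return Retval(loader.construct_scalar(node))

yaml.SafeLoader.add_constructor('!retval', retval_constructor)

config = yaml.safe_load("""
module: package_name.module_name
function: function_name
kwargs:
    four: !retval 3
dependencies: [3]
""")
print(config['kwargs']['four'])  # Retval('3')
```

At build time, every Retval placeholder found in args/kwargs would be substituted with the stored return value of the corresponding dependency build.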

Add support for cleanup functions

This can be accomplished in two ways:

  • Define a "cleanup" function in the configuration (THIS IS THE CHOSEN WAY):
jobs:
  - id: myjob
    function: mymodule:myjob
    cleanup_function: mymodule:myjob_cleanup
  • Use a decorator to convert the job function to a class, which exposes a decorator to register cleanup functions. Example:
@jobcontrol.job
def myjob(foo, bar):
    pass

@myjob.cleanup
def myjob_cleanup(build):
    pass  # Remove resource pointed by build['retval']

The decorator can be just a class acting as a proxy for the actual call:

import functools

class JobFunction(object):
    def __init__(self, func):
        self._cleanup_functions = []
        self.func = func
        # Copy __name__, __doc__, ... so documentation
        # introspection keeps working on the wrapped object:
        functools.update_wrapper(self, func)

    def __call__(self, *a, **kw):
        return self.func(*a, **kw)

    def cleanup(self, func):
        self._cleanup_functions.append(func)
        return func  # return it, so the decorated name stays usable

Note: we want to either copy metadata (documentation, argument spec, ...) onto the wrapper, or handle this as a special case and retrieve it from obj.func instead.

Keep reference of dependency build ids

We want to keep reference to the exact build for a dependency a given job has been built upon.

  • on build creation, add a copy of:
    • job configuration (if it later changes, we will still know exactly which configuration was used)
    • ids of latest builds of dependencies (something like {job_id: build_id})

💡 For the build-id selection, we need a "nested" structure, as we might also want to specify build numbers for dependencies of dependencies. This might create problems if we want to build against two different versions at once, but that's an unlikely corner case.

โ— We also want to be able to say "create a new build of this job"

Example:

  • job dependencies: 1,2,3
  • build dep. spec: {1: 11, 2: 21, 3: None}

Means:

  • use build 11 of job 1, build 21 of job 2, and a new build of job 3
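The resolution rule above can be sketched as follows (all names are illustrative, not the actual jobcontrol API):

```python
def resolve_dep_builds(dependencies, pinned, latest_builds):
    """Pick a build id for every dependency of a job.

    `pinned` maps job_id -> build_id; a value of None means "create a
    new build", a missing key means "use the latest build".
    """
    _UNPINNED = object()  # sentinel distinct from None
    resolved = {}
    for job_id in dependencies:
        build_id = pinned.get(job_id, _UNPINNED)
        if build_id is _UNPINNED:
            build_id = latest_builds[job_id]  # default to latest build
        resolved[job_id] = build_id  # None survives: "create a new build"
    return resolved

spec = resolve_dep_builds([1, 2, 3], {1: 11, 2: 21, 3: None},
                          latest_builds={})
print(spec)  # {1: 11, 2: 21, 3: None}
```

The resolved {job_id: build_id} mapping is exactly what would be copied onto the build record at creation time.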

See also #16

Add function autocomplete / documentation in job create/edit form

  • Autocomplete the function name with a widget like this:
+-------------------+-------------------------+
| module:func1      |                         |
| module:func2      |  function documentation |
| module:func3      |                         |
+-------------------+-------------------------+

The left column contains a list of candidate completions; the right one contains the documentation for the selected completion, if any.

Pass YAML configuration through jinja

Mostly to allow:

  • defining variables and reusing them in multiple places (e.g. for a database URL "root")
  • defining macros to avoid needless repetition (e.g. to harvest different instances of the same software)
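The preprocessing step itself is small: render the file with jinja2 first, then parse the result as plain YAML. A sketch (the template contents are made up):

```python
import jinja2
import yaml

# A made-up configuration template: a shared variable plus a loop
# standing in for macro-style reuse.
template = """\
{% set db_root = 'postgresql://localhost/harvest' %}
jobs:
{% for name in ['foo', 'bar'] %}
  - id: crawl_{{ name }}
    kwargs:
      output_url: "{{ db_root }}_{{ name }}"
{% endfor %}
"""

# Step 1: jinja preprocessing; step 2: plain YAML parsing.
rendered = jinja2.Template(template).render()
config = yaml.safe_load(rendered)

print([job['id'] for job in config['jobs']])
# ['crawl_foo', 'crawl_bar']
```

Since jinja runs before the YAML parser ever sees the text, custom tags like !retval keep working unchanged.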

Better dependency graph manipulation / analysis functions

We need to:

  • sort the graph topologically, with "levels" (to allow "parallel" building in the future [which in turn requires some kind of "semaphore" system..])
  • detect loops in the graph -> allow rendering them in a different color on the Graphviz visualization
  • provide the required functionality to render the "reverse" dependencies in a different way on the SVG visualization

Also, we want a nicer visualization of the dependency graph: nodes should be in "sorted" order, making it clear which should be built first. The arrows point to dependencies, which can sometimes be misleading, although it is the most "consistent" way to represent a dependency graph.

Allow interrupting builds

This can be tricky: in some cases we can "just" send a TERM/INT/KILL signal to the process, but:

  • it is not guaranteed that a process is running only one build at a time (threads, greenlets, ..)
  • it is not guaranteed that the PID has not been reallocated

On a side note, we also want some way to track & report sudden death of builder processes (maybe some method to check worker status? -> Celery should provide mechanisms for this..)

Add support for "multi-level" progress reporting

Allow the progress to be reported on "multiple levels", eg:

{
    'Spam': {'current': 30, 'total': 60},
    'Eggs': {'current': 10, 'total': 20},
    'Bacon': {'current': 0, 'total': 20},
}

The UI will then render the progress bar like this:

[^] Total:   [========            ] 40/100 (40%)
 |-- Spam    [==========          ] 30/60 (50%)
 |-- Eggs    [==========          ] 10/20 (50%)
 '-- Bacon   [                    ] 0/20 (0%)
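Aggregating the sub-progress reports into the "Total" bar can be sketched as (function names are illustrative):

```python
def aggregate_progress(parts):
    """Sum per-part progress into an overall (current, total) pair."""
    current = sum(p['current'] for p in parts.values())
    total = sum(p['total'] for p in parts.values())
    return current, total

def render_bar(current, total, width=20):
    """Render a text progress bar like the mockup above."""
    filled = int(width * current / total) if total else 0
    pct = 100 * current // total if total else 0
    return '[{}{}] {}/{} ({}%)'.format(
        '=' * filled, ' ' * (width - filled), current, total, pct)

progress = {
    'Spam': {'current': 30, 'total': 60},
    'Eggs': {'current': 10, 'total': 20},
    'Bacon': {'current': 0, 'total': 20},
}
cur, tot = aggregate_progress(progress)
print('Total:', render_bar(cur, tot))
# Total: [========            ] 40/100 (40%)
for name, part in progress.items():
    print(name, render_bar(part['current'], part['total']))
```

The storage only needs to keep the per-part dict; both the total and the bars are derived at render time.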

Error while launching the web app

I have problems launching the web app during startup.

The output follows:

marcomb@debianmc:/opt/ckan/envs/jobcontrol/bin$ ./jobcontrol-cli --config-file /opt/ckan/envs/conf/conf.yaml web --port 5060 --debug
Traceback (most recent call last):
  File "./jobcontrol-cli", line 9, in <module>
    load_entry_point('jobcontrol==0.1a', 'console_scripts', 'jobcontrol-cli')()
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/jobcontrol-0.1a-py2.7.egg/jobcontrol/cli.py", line 337, in main
    cli_main_grp(obj={})
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 610, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 590, in main
    rv = self.invoke(ctx)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 936, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 782, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 416, in invoke
    return callback(*args, **kwargs)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/jobcontrol-0.1a-py2.7.egg/jobcontrol/cli.py", line 292, in web
    app.config.update(jc.config['webapp'])
TypeError: 'NoneType' object is not iterable
marcomb@debianmc:/opt/ckan/envs/jobcontrol/bin$ ./jobcontrol-cli --config-file /opt/ckan/envs/conf/conf.yaml web --port 5060 > debug_jobcontrol.txt
(same traceback as above)
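The traceback ends in app.config.update(jc.config['webapp']), so the most likely cause is that the webapp key resolves to None -- e.g. an empty webapp: section in conf.yaml, which YAML parses as null. A defensive sketch (not the actual jobcontrol code):

```python
import yaml

# An empty `webapp:` section parses as None:
conf = yaml.safe_load("webapp:")
assert conf == {'webapp': None}

# ...and dict.update(None) raises exactly
# "TypeError: 'NoneType' object is not iterable".
webapp_conf = conf.get('webapp') or {}  # treat None / missing as empty

app_config = {}                   # stands in for flask's app.config
app_config.update(webapp_conf)    # now safe even for an empty section
print(app_config)  # {}
```

Filling in the webapp: section of conf.yaml (or applying a guard like the above in cli.py) should get past this error.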

Better traceback for exceptions

We want to have more information, such as values of local variables, etc.

Maybe pickling the whole thing wouldn't be a great idea, but we can store at least (limited?) representations of the objects.

This can be accomplished with a custom version of traceback.extract_tb, to include local variables (from tb.tb_frame.f_locals) [note: include globals too?] -> but only the repr() of values, as there might be problems with pickling / unpickling of arbitrary objects.
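The approach described above can be sketched with a custom extractor built on the standard traceback module (the output structure is illustrative):

```python
import sys
import traceback

def extract_tb_with_locals(tb):
    """Like traceback.extract_tb(), but attach repr() of each frame's
    local variables -- repr only, so the result stays picklable."""
    frames = []
    while tb is not None:
        summary = traceback.extract_tb(tb, limit=1)[0]
        frames.append({
            'filename': summary.filename,
            'lineno': summary.lineno,
            'name': summary.name,
            'line': summary.line,
            # repr() avoids pickling problems with arbitrary objects
            'locals': {k: repr(v)
                       for k, v in tb.tb_frame.f_locals.items()},
        })
        tb = tb.tb_next
    return frames

def fail():
    spam = 42
    raise ValueError('boom')

try:
    fail()
except ValueError:
    frames = extract_tb_with_locals(sys.exc_info()[2])

print(frames[-1]['name'], frames[-1]['locals'])
# fail {'spam': '42'}
```

Globals could be captured the same way from tb.tb_frame.f_globals, though they would likely need filtering to stay reasonably small.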

Add CLI command to dump the "compiled" configuration

Useful for testing jinja macros: it should output the YAML configuration as generated after the jinja preprocessing.

It might also be useful to add some command to show the structure as processed through the configuration manager, i.e. after reading it from YAML...

Create build before running

We need to change the "run" process a bit:

  • Create build, with job_id and configuration (pinned builds, ...)
  • Launch the execution of the build, via celery
    • we should keep track of the process / worker running the thing, in order to check the status / stop the execution / ...

Note: dependency build should happen inside the build execution, so we can mark it as failed if a dependency build failed.

Replacement on storage URLs is failing

Example: mongodb://database.local/harvester_141103/{id}_pat_statistica -> {id} is not getting replaced (presumably a problem in the harvester function..)
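For reference, the placeholder substitution itself is just str.format(); if the harvester passes the URL through unmodified, the replacement never happens. A sketch (expand_storage_url is a hypothetical helper):

```python
# Hypothetical helper: fill URL placeholders right before use.
def expand_storage_url(url, **context):
    return url.format(**context)

url = 'mongodb://database.local/harvester_141103/{id}_pat_statistica'
print(expand_storage_url(url, id=42))
# mongodb://database.local/harvester_141103/42_pat_statistica
```

If the URL instead reaches the storage layer with {id} still literal, the substitution step is being skipped somewhere in the harvester.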

Add support for task "grouping"

Tree structure vs. tags + "batch build" button

  • with a tree, we can better unclutter the UI using a "folder-like" structure
  • with tags, we can have multiple categorizations

Example categorizations:

  • Crawl FOO -> harvester, crawl, foo
  • Crawl BAR -> harvester, crawl, bar
  • Convert FOO -> harvester, convert, foo
  • Convert BAR -> harvester, convert, bar

BUT the above tasks are all part of the "harvester" group

Add support for custom "return value formatters"

We need this for things like creating links to storage explorer / displaying stuff / downloading return values / ...

Use cases

  • crawler jobs, containing connection to db:
    • "view" -> link to storage explorer app
    • "download" -> zip / tar archive containing the whole database contents
  • "import preview" jobs: we need to precalculate changes that will be made -> store somewhere -> display them to the user in a nice interface.

Configuration

Many functions can be "generic" and then accept some configuration; for example, the "link to storage" formatter requires the URL of the storage explorer application.

Configuration of formatters might look like:

retval_formatters:
  - cls: mypkg.mymodule:MyClass
    formatter: myotherpkg.myothermodule:my_formatter_function
  - cls: mypkg.mymodule:MyClass2
    formatter: myotherpkg.myothermodule:my_formatter_function2

Formatters can then be pre-processed with functools.partial:

retval_formatters:
  - cls: mypkg.mymodule:MyClass
    formatter: myotherpkg.myothermodule:my_formatter_function
    kwargs:
      base_url: "http://example.com/..."
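The functools.partial pre-processing could look like this (link_formatter and its configuration are made up for illustration):

```python
import functools

# Hypothetical formatter taking configuration via keyword arguments:
def link_formatter(retval, base_url):
    return '<a href="{}/{}">view</a>'.format(base_url.rstrip('/'), retval)

# What the configuration loader could do with the `kwargs` section:
conf_kwargs = {'base_url': 'http://example.com/explorer'}
formatter = functools.partial(link_formatter, **conf_kwargs)

# At display time only the return value needs to be passed:
print(formatter('table_A'))
# <a href="http://example.com/explorer/table_A">view</a>
```

Binding the configuration once at load time keeps the per-return-value call signature uniform across all formatters.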

Parametrized builds (like travis' "build matrix")

Rationale: a job might return a "collection" of items, each of which needs independent processing; if the build of some "derived" builds fail, we might still want to preserve the results of the successful ones.

Example

  • Job 1: download data from http://...
    • Build 1: returned a collection of three items, A, B, C
  • Job 2: import files from job 1 to database
    • Build 2 [A]: Import A to database (returns table_A)
    • Build 2 [B]: Import B to database (returns table_B)
    • Build 2 [C]: Import C to database (returns table_C) FAILED
  • Job 3: convert field names to uppercase in tables from job 2
    • Build 3 [A]: convert table_A to uppercase
    • Build 3 [B]: convert table_B to uppercase
    • Build 3 [C]: convert table_C to uppercase SKIPPED
  • Job 4: merge stuff from all tables from job 3
    • The build should decide whether to continue with just two successful dependency builds or not.

Use cases

  • download list of resources from the geocatalogo
  • import each dataset in its own PostGIS table

Possible alternative

Subdivide builds into "steps", each one can have separate progress / status / return value; we still need some way to "run this build for X items".
