
jobcontrol's Introduction

Job Control

https://raw.githubusercontent.com/rshk/jobcontrol/develop/.misc/banner.png

Job scheduling and tracking library.

Provides a base interface for scheduling, running, tracking and retrieving results for "jobs".

Each job definition is simply any Python callable, along with arguments to be passed to it.

Tracking includes storing:

  • the function return value
  • any exception raised
  • log messages produced during task execution
  • optionally a "progress" report, if the task supports it

The status storage is completely decoupled from the main application.

The project "core" currently includes two storage implementations:

  • MemoryStorage -- keeps all data in memory, useful for development / testing.
  • PostgreSQLStorage -- keeps all data in a PostgreSQL database, meant for production use.

Project status

Travis CI build status

Branch Status
master https://travis-ci.org/rshk/jobcontrol.svg?branch=master
develop https://travis-ci.org/rshk/jobcontrol.svg?branch=develop

Source code

Source is hosted on GitHub: https://github.com/rshk/jobcontrol/

And can be cloned with:

git clone https://github.com/rshk/jobcontrol.git

Python Package Index

The project can be found on PyPI here: https://pypi.python.org/pypi/jobcontrol


Project documentation

Documentation is hosted on GitHub pages:

http://rshk.github.io/jobcontrol/docs/

A mirror copy is hosted on ReadTheDocs (compiled automatically from the Git repository; uses RTD theme; supports multiple versions):

http://jobcontrol.rtfd.org/

Concepts

Jobs are simple definitions of tasks to be executed, in terms of a Python function, with arguments and keywords.

They also allow defining dependencies (and seamless passing of return values from dependencies as call arguments), cleanup functions, and other nice stuff.

The library itself is responsible for keeping track of job execution ("build") status: start/end times, return value, whether it raised an exception, all the log messages produced during execution, the progress report of the execution, ...

It also features a web UI (and web APIs are planned) to get an overview of build statuses, launch new builds, and so on.


jobcontrol's Issues

Reverse dependency building

We need to decide how to handle this, to avoid the risk of infinite loops, ...

  • Create some view listing all the "outdated" builds
  • Allow some option when running the build to enable building (some) revdeps too, upon successful build.

Add some kind of "signalling" system to allow parallel building and decide when something needs to be built

Example jobs:

A -> B
B -> C
B -> D
C -> E
D -> E

  • we start by building E
  • once E is built, we can start (in parallel) both C and D
  • once both C and D are done, we can start B
  • once B is done, we can finally start A

We need some way to keep track of this and react on "events" such as "job X completed"
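The leveled ordering described above can be computed with a variant of Kahn's topological sort. A sketch (assuming the graph is available as a job -> dependencies mapping; names are illustrative, not the jobcontrol API):

```python
def build_levels(deps):
    """Group jobs into levels: everything in level N depends only on
    jobs in earlier levels, so each level can be built in parallel.

    `deps` maps each job to the jobs it depends on (arrows point *to*
    dependencies, as in the example above).
    """
    deps = {job: set(d) for job, d in deps.items()}
    levels = []
    while deps:
        # Jobs whose dependencies are all already built
        ready = {job for job, d in deps.items() if not d}
        if not ready:
            raise ValueError('Circular dependency detected')
        levels.append(sorted(ready))
        deps = {job: d - ready for job, d in deps.items()
                if job not in ready}
    return levels

# The example graph: "A -> B" means "A depends on B"
graph = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'], 'E': []}
print(build_levels(graph))  # [['E'], ['C', 'D'], ['B'], ['A']]
```

Everything within one level (here C and D) can be dispatched to workers in parallel; the "job X completed" events then only need to trigger a check of whether the next level is ready.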

We don't really care about abrupt termination of jobs (eg. the worker process being killed), as we won't continue anyway -- ⚠️ but still, we should mark a build as failed if a dependency build failed!
(allow starting builds "as a dependency of ..."? how to store updated build ids as soon as new builds become available?)

Add support for "protected" jobs

Eg. for importing data in production: we want to make sure the job is not executed by mistake:

  • remove quick run button from main page list
  • the "build" button on the job page should have an "arm" button, to toggle disabled state

Dependency resolution + loop detection *before* job run

Add some way to detect circular dependencies at job configuration time -> add some warning to inform the user earlier (before running).

For the initial release no automatic build of dependencies is done -- jobcontrol will simply check that all dependencies are met before creating the build.
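Detection at configuration time can be a plain depth-first search over the dependency graph. A sketch (assuming the graph is available as a job -> dependencies mapping):

```python
def find_cycle(deps):
    """Return a list of jobs forming a dependency cycle, or None.
    Recursive DFS with three colors (unvisited / in progress / done)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {job: WHITE for job in deps}

    def visit(job, stack):
        color[job] = GRAY
        stack.append(job)
        for dep in deps.get(job, ()):
            if color.get(dep, WHITE) == GRAY:
                # Found a back edge: return the cycle itself
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep, stack)
                if cycle:
                    return cycle
        color[job] = BLACK
        stack.pop()
        return None

    for job in deps:
        if color[job] == WHITE:
            cycle = visit(job, [])
            if cycle:
                return cycle
    return None

print(find_cycle({'a': ['b'], 'b': ['c'], 'c': ['a']}))
# ['a', 'b', 'c', 'a']
```

Returning the cycle itself (not just a boolean) makes it easy to show the offending jobs in the warning message.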

Better form CSRF protection

See also: https://code.djangoproject.com/wiki/CsrfProtection

Right now, this method is in place: http://flask.pocoo.org/snippets/3/

but this prevents the user from submitting a form that was opened before another form was submitted.
Example:

  • GET form1.html
  • GET form2.html
  • POST form1.html (form submit) -> 403, as the token in session changed

If we reuse the same token for the whole user session, we're potentially vulnerable to session fixation.

Note: if the login is handled by the webserver, and there is no way for an anonymous user to get a session, the risk of session fixation is greatly reduced.
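One way to keep multiple open forms working is to store a small, bounded set of valid one-time tokens in the session instead of a single rotating token. A sketch (the function names are made up, and a plain dict stands in for the dict-like flask.session):

```python
import secrets

MAX_TOKENS = 20  # bound the list so the session cookie stays small

def generate_csrf_token(session):
    """Mint a fresh token and remember it in the session."""
    token = secrets.token_urlsafe(32)
    tokens = session.setdefault('_csrf_tokens', [])
    tokens.append(token)
    del tokens[:-MAX_TOKENS]  # evict the oldest tokens past the limit
    return token

def validate_csrf_token(session, token):
    """Accept a token at most once; other open forms stay valid."""
    tokens = session.get('_csrf_tokens', [])
    if token in tokens:
        tokens.remove(token)  # one-time use
        return True
    return False

session = {}  # stands in for flask.session
t1 = generate_csrf_token(session)  # GET form1.html
t2 = generate_csrf_token(session)  # GET form2.html
assert validate_csrf_token(session, t2)      # POST form2.html: ok
assert validate_csrf_token(session, t1)      # POST form1.html: still ok
assert not validate_csrf_token(session, t1)  # replay: rejected
```

Since each token is single-use and the set is bounded, this avoids both the 403-on-second-tab problem and unbounded session growth.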

Add support for job "groups"

...to allow better views; groups should support some way to specify "hierarchical" levels. What to use as a separator? Something "uncommon", e.g. ::?

Job configuration in YAML

Support job configuration in YAML.

  • user-friendly configuration format
  • flexible, allows us to easily extend serializable objects

See also #9

Proposed configuration example:

module: package_name.module_name
function: function_name
args:
    - one
    - two
kwargs:
    three: 3
    four: !retval 3
dependencies: [3]
# ... support for custom extensions here ...

It would also be nice to have a "natural key" field, allowing it to be kept in sync with an external configuration file...
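The !retval tag in the proposed configuration could be implemented with a custom PyYAML constructor. A sketch (the Retval placeholder class is hypothetical, not part of jobcontrol):

```python
import yaml

class Retval:
    """Placeholder for "return value of dependency build <job_id>"."""
    def __init__(self, job_id):
        self.job_id = job_id
    def __repr__(self):
        return 'Retval({!r})'.format(self.job_id)

def retval_constructor(loader, node):
    # The explicit !retval tag suppresses implicit typing,
    # so the scalar arrives as a plain string
    return Retval(loader.construct_scalar(node))

yaml.SafeLoader.add_constructor('!retval', retval_constructor)

config = yaml.safe_load("""
module: package_name.module_name
function: function_name
kwargs:
    four: !retval 3
dependencies: [3]
""")
print(config['kwargs']['four'])  # Retval('3')
```

At build time, every Retval placeholder found in args/kwargs would be substituted with the stored return value of the corresponding dependency build.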

Add support for cleanup functions

This can be accomplished in two ways:

  • Define a "cleanup" function in the configuration (THIS IS THE CHOSEN WAY):
jobs:
  - id: myjob
    function: mymodule:myjob
    cleanup_function: mymodule:myjob_cleanup
  • Use a decorator to convert the job function to a class, which exposes a decorator to register cleanup functions. Example:
@jobcontrol.job
def myjob(foo, bar):
    pass

@myjob.cleanup
def myjob_cleanup(build):
    pass  # Remove resource pointed by build['retval']

The decorator can be just a class acting as a proxy for the actual call:

import functools

class JobFunction(object):
    def __init__(self, func):
        self._cleanup_functions = []
        self.func = func
        # Copy __name__, __doc__, ... so documentation
        # introspection keeps working on the wrapped object:
        functools.update_wrapper(self, func)

    def __call__(self, *a, **kw):
        return self.func(*a, **kw)

    def cleanup(self, func):
        self._cleanup_functions.append(func)
        return func  # return it, so the decorated name stays usable

Note: we want to either copy metadata (documentation, argument spec, ...) onto the wrapper, or handle this as a special case and retrieve it from obj.func instead.

Keep reference of dependency build ids

We want to keep reference to the exact build for a dependency a given job has been built upon.

  • on build creation, add a copy of:
    • job configuration (if it later changes, we will still know exactly which configuration was used)
    • ids of latest builds of dependencies (something like {job_id: build_id})

💡 For the build-id selection, we need a "nested" structure, as we might also want to specify build numbers for dependencies of dependencies. This might create problems if we want to build against two different versions at once, but that's an unlikely corner case.

โ— We also want to be able to say "create a new build of this job"

Example:

  • job dependencies: 1,2,3
  • build dep. spec: {1: 11, 2: 21, 3: None}

Means:

  • use build 11 of job 1, build 21 of job 2, and a new build of job 3
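The resolution rule above can be sketched as follows (all names are illustrative, not the actual jobcontrol API):

```python
def resolve_dep_builds(dependencies, pinned, latest_builds):
    """Pick a build id for every dependency of a job.

    `pinned` maps job_id -> build_id; a value of None means "create a
    new build", a missing key means "use the latest build".
    """
    _UNPINNED = object()  # sentinel distinct from None
    resolved = {}
    for job_id in dependencies:
        build_id = pinned.get(job_id, _UNPINNED)
        if build_id is _UNPINNED:
            build_id = latest_builds[job_id]  # default to latest build
        resolved[job_id] = build_id  # None survives: "create a new build"
    return resolved

spec = resolve_dep_builds([1, 2, 3], {1: 11, 2: 21, 3: None},
                          latest_builds={})
print(spec)  # {1: 11, 2: 21, 3: None}
```

The resolved {job_id: build_id} mapping is exactly what would be copied onto the build record at creation time.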

See also #16

Add function autocomplete / documentation in job create/edit form

  • Autocomplete the function name with a widget like this:
+-------------------+-------------------------+
| module:func1      |                         |
| module:func2      |  function documentation |
| module:func3      |                         |
+-------------------+-------------------------+

The left column contains a list of candidate completions; the right one contains the documentation for the selected completion, if any.

Pass YAML configuration through jinja

Mostly to allow:

  • defining variables and reusing them in multiple places (e.g. for a database URL "root")
  • defining macros to avoid needless repetition (e.g. to harvest different instances of the same software)
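The preprocessing step itself is small: render the file with jinja2 first, then parse the result as plain YAML. A sketch (the template contents are made up):

```python
import jinja2
import yaml

# A made-up configuration template: a shared variable plus a loop
# standing in for macro-style reuse.
template = """\
{% set db_root = 'postgresql://localhost/harvest' %}
jobs:
{% for name in ['foo', 'bar'] %}
  - id: crawl_{{ name }}
    kwargs:
      output_url: "{{ db_root }}_{{ name }}"
{% endfor %}
"""

# Step 1: jinja preprocessing; step 2: plain YAML parsing.
rendered = jinja2.Template(template).render()
config = yaml.safe_load(rendered)

print([job['id'] for job in config['jobs']])
# ['crawl_foo', 'crawl_bar']
```

Since jinja runs before the YAML parser ever sees the text, custom tags like !retval keep working unchanged.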

Better dependency graph manipulation / analysis functions

We need to:

  • sort the graph topologically, with "levels" (to allow "parallel" building in the future [which in turn requires some kind of "semaphore" system..])
  • detect loops in the graph -> allow rendering them in a different color on the Graphviz visualization
  • provide the required functionality to render the "reverse" dependencies in a different way on the SVG visualization

Also, we want a nicer visualization of the dependency graph: nodes should be in "sorted" order, making it clear which should be built first. The arrows point to dependencies, which can sometimes be misleading, although it is the most "consistent" way to represent a dependency graph.

Allow interrupting builds

This can be tricky: in some cases we can "just" send a TERM/INT/KILL signal to the process, but:

  • it is not guaranteed that a process is running only one build at a time (threads, greenlets, ..)
  • it is not guaranteed that the PID has not been reallocated

On a side note, we also want some way to track & report sudden death of builder processes (maybe some method to check worker status? -> Celery should provide mechanisms for this..)

Add support for "multi-level" progress reporting

Allow the progress to be reported on "multiple levels", eg:

{
    'Spam': {'current': 30, 'total': 60},
    'Eggs': {'current': 10, 'total': 20},
    'Bacon': {'current': 0, 'total': 20},
}

The UI will then render the progress bar like this:

[^] Total:   [========            ] 40/100 (40%)
 |-- Spam    [==========          ] 30/60 (50%)
 |-- Eggs    [==========          ] 10/20 (50%)
 '-- Bacon   [                    ] 0/20 (0%)
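Aggregating the sub-progress reports into the "Total" bar can be sketched as (function names are illustrative):

```python
def aggregate_progress(parts):
    """Sum per-part progress into an overall (current, total) pair."""
    current = sum(p['current'] for p in parts.values())
    total = sum(p['total'] for p in parts.values())
    return current, total

def render_bar(current, total, width=20):
    """Render a text progress bar like the mockup above."""
    filled = int(width * current / total) if total else 0
    pct = 100 * current // total if total else 0
    return '[{}{}] {}/{} ({}%)'.format(
        '=' * filled, ' ' * (width - filled), current, total, pct)

progress = {
    'Spam': {'current': 30, 'total': 60},
    'Eggs': {'current': 10, 'total': 20},
    'Bacon': {'current': 0, 'total': 20},
}
cur, tot = aggregate_progress(progress)
print('Total:', render_bar(cur, tot))
# Total: [========            ] 40/100 (40%)
for name, part in progress.items():
    print(name, render_bar(part['current'], part['total']))
```

The storage only needs to keep the per-part dict; both the total and the bars are derived at render time.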

Error while launching the web app

I have problems launching the web app during startup.

The output follows:

marcomb@debianmc:/opt/ckan/envs/jobcontrol/bin$ ./jobcontrol-cli --config-file /opt/ckan/envs/conf/conf.yaml web --port 5060 --debug
Traceback (most recent call last):
  File "./jobcontrol-cli", line 9, in <module>
    load_entry_point('jobcontrol==0.1a', 'console_scripts', 'jobcontrol-cli')()
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/jobcontrol-0.1a-py2.7.egg/jobcontrol/cli.py", line 337, in main
    cli_main_grp(obj={})
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 610, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 590, in main
    rv = self.invoke(ctx)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 936, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 782, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/click-3.3-py2.7.egg/click/core.py", line 416, in invoke
    return callback(*args, **kwargs)
  File "/opt/ckan/envs/jobcontrol/local/lib/python2.7/site-packages/jobcontrol-0.1a-py2.7.egg/jobcontrol/cli.py", line 292, in web
    app.config.update(jc.config['webapp'])
TypeError: 'NoneType' object is not iterable
marcomb@debianmc:/opt/ckan/envs/jobcontrol/bin$ ./jobcontrol-cli --config-file /opt/ckan/envs/conf/conf.yaml web --port 5060 > debug_jobcontrol.txt
(same traceback as above)
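The traceback ends in app.config.update(jc.config['webapp']), so the most likely cause is that the webapp key resolves to None -- e.g. an empty webapp: section in conf.yaml, which YAML parses as null. A defensive sketch (not the actual jobcontrol code):

```python
import yaml

# An empty `webapp:` section parses as None:
conf = yaml.safe_load("webapp:")
assert conf == {'webapp': None}

# ...and dict.update(None) raises exactly
# "TypeError: 'NoneType' object is not iterable".
webapp_conf = conf.get('webapp') or {}  # treat None / missing as empty

app_config = {}                   # stands in for flask's app.config
app_config.update(webapp_conf)    # now safe even for an empty section
print(app_config)  # {}
```

Filling in the webapp: section of conf.yaml (or applying a guard like the above in cli.py) should get past this error.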

Better traceback for exceptions

We want to have more information, such as values of local variables, etc.

Maybe pickling the whole thing wouldn't be a great idea, but we can store at least (limited?) representations of the objects.

This can be accomplished with a custom version of traceback.extract_tb, to include local variables (from tb.tb_frame.f_locals) [note: include globals too?] -> but only the repr() of values, as there might be problems with pickling / unpickling of arbitrary objects.
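The approach described above can be sketched with a custom extractor built on the standard traceback module (the output structure is illustrative):

```python
import sys
import traceback

def extract_tb_with_locals(tb):
    """Like traceback.extract_tb(), but attach repr() of each frame's
    local variables -- repr only, so the result stays picklable."""
    frames = []
    while tb is not None:
        summary = traceback.extract_tb(tb, limit=1)[0]
        frames.append({
            'filename': summary.filename,
            'lineno': summary.lineno,
            'name': summary.name,
            'line': summary.line,
            # repr() avoids pickling problems with arbitrary objects
            'locals': {k: repr(v)
                       for k, v in tb.tb_frame.f_locals.items()},
        })
        tb = tb.tb_next
    return frames

def fail():
    spam = 42
    raise ValueError('boom')

try:
    fail()
except ValueError:
    frames = extract_tb_with_locals(sys.exc_info()[2])

print(frames[-1]['name'], frames[-1]['locals'])
# fail {'spam': '42'}
```

Globals could be captured the same way from tb.tb_frame.f_globals, though they would likely need filtering to stay reasonably small.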

Add CLI command to dump the "compiled" configuration

Useful for testing jinja macros: it should output the YAML configuration as generated after the jinja preprocessing.

It might also be useful to add some command to show the structure as processed through the configuration manager, i.e. after reading it from YAML...

Create build before running

We need to change the "run" process a bit:

  • Create build, with job_id and configuration (pinned builds, ...)
  • Launch the execution of the build, via celery
    • we should keep track of the process / worker running the thing, in order to check the status / stop the execution / ...

Note: dependency build should happen inside the build execution, so we can mark it as failed if a dependency build failed.

Replacement on storage URLs is failing

Example: mongodb://database.local/harvester_141103/{id}_pat_statistica -> {id} is not getting replaced (presumably a problem in the harvester function..)
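For reference, the placeholder substitution itself is just str.format(); if the harvester passes the URL through unmodified, the replacement never happens. A sketch (expand_storage_url is a hypothetical helper):

```python
# Hypothetical helper: fill URL placeholders right before use.
def expand_storage_url(url, **context):
    return url.format(**context)

url = 'mongodb://database.local/harvester_141103/{id}_pat_statistica'
print(expand_storage_url(url, id=42))
# mongodb://database.local/harvester_141103/42_pat_statistica
```

If the URL instead reaches the storage layer with {id} still literal, the substitution step is being skipped somewhere in the harvester.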

Add support for task "grouping"

Tree structure vs. tags + "batch build" button

  • with a tree, we can better unclutter the UI using a "folder-like" structure
  • with tags, we can have multiple categorizations

Example categorizations:

  • Crawl FOO -> harvester, crawl, foo
  • Crawl BAR -> harvester, crawl, bar
  • Convert FOO -> harvester, convert, foo
  • Convert BAR -> harvester, convert, bar

BUT the above tasks are all part of the "harvester" group

Add support for custom "return value formatters"

We need this for things like creating links to storage explorer / displaying stuff / downloading return values / ...

Use cases

  • crawler jobs, containing connection to db:
    • "view" -> link to storage explorer app
    • "download" -> zip / tar archive containing the whole database contents
  • "import preview" jobs: we need to precalculate changes that will be made -> store somewhere -> display them to the user in a nice interface.

Configuration

Many functions can be "generic" and then accept some configuration; for example, the "link to storage" formatter requires the URL of the storage explorer application.

Configuration of formatters might look like:

retval_formatters:
  - cls: mypkg.mymodule:MyClass
    formatter: myotherpkg.myothermodule:my_formatter_function
  - cls: mypkg.mymodule:MyClass2
    formatter: myotherpkg.myothermodule:my_formatter_function2

Formatters can then be pre-processed with functools.partial:

retval_formatters:
  - cls: mypkg.mymodule:MyClass
    formatter: myotherpkg.myothermodule:my_formatter_function
    kwargs:
      base_url: "http://example.com/..."
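The functools.partial pre-processing could look like this (link_formatter and its configuration are made up for illustration):

```python
import functools

# Hypothetical formatter taking configuration via keyword arguments:
def link_formatter(retval, base_url):
    return '<a href="{}/{}">view</a>'.format(base_url.rstrip('/'), retval)

# What the configuration loader could do with the `kwargs` section:
conf_kwargs = {'base_url': 'http://example.com/explorer'}
formatter = functools.partial(link_formatter, **conf_kwargs)

# At display time only the return value needs to be passed:
print(formatter('table_A'))
# <a href="http://example.com/explorer/table_A">view</a>
```

Binding the configuration once at load time keeps the per-return-value call signature uniform across all formatters.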

Parametrized builds (like travis' "build matrix")

Rationale: a job might return a "collection" of items, each of which needs independent processing; if the build of some "derived" builds fail, we might still want to preserve the results of the successful ones.

Example

  • Job 1: download data from http://...
    • Build 1: returned a collection of three items, A, B, C
  • Job 2: import files from job 1 to database
    • Build 2 [A]: Import A to database (returns table_A)
    • Build 2 [B]: Import B to database (returns table_B)
    • Build 2 [C]: Import C to database (returns table_C) FAILED
  • Job 3: convert field names to uppercase in tables from job 2
    • Build 3 [A]: convert table_A to uppercase
    • Build 3 [B]: convert table_B to uppercase
    • Build 3 [C]: convert table_C to uppercase SKIPPED
  • Job 4: merge stuff from all tables from job 3
    • The build should decide whether to continue with just two successful dependency builds or not.

Use cases

  • download list of resources from the geocatalogo
  • import each dataset in its own PostGIS table

Possible alternative

Subdivide builds into "steps", each one can have separate progress / status / return value; we still need some way to "run this build for X items".
