vogt4nick / dequindre

Dequindre /de-KWIN-der/ (n.): A minimalist scheduler.

License: MIT License

Python 100.00%

python scheduler

dequindre's Issues

Allow Dequindre to restart from point of failure

Per Reddit

If something goes wrong in a 5-script pipeline, say in the 3rd script, does the pipeline rerun from the very beginning after the problem is fixed, or can it start where the failure occurred?

I bet a .dequindre.json file would be enough to track which tasks have and have not run. Actually, it would only have to track which ones errored out. We can uniquely identify DAGs and Dequindres if we want. This may be a solution.
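
A rough sketch of how the state file could work. Everything here is an assumption for illustration: `run_with_resume` is a hypothetical helper, tasks are plain (name, callable) pairs rather than dequindre Tasks, and it records completed tasks instead of failed ones because that is simpler to resume from.

```python
import json
from pathlib import Path

STATE_FILE = Path('.dequindre.json')  # file name taken from the issue text


def run_with_resume(tasks, state_file=STATE_FILE):
    """Run (name, callable) pairs in order, skipping ones already recorded.

    Hypothetical helper, not part of dequindre.
    """
    done = set()
    if state_file.exists():
        done = set(json.loads(state_file.read_text())['completed'])

    for name, func in tasks:
        if name in done:
            continue  # finished in an earlier run
        func()  # an exception here leaves the state file as it was
        done.add(name)
        state_file.write_text(json.dumps({'completed': sorted(done)}))

    # clean finish: remove the file so the next run starts from scratch
    if state_file.exists():
        state_file.unlink()
```

If the 3rd of 5 scripts raises, the file lists the first two; rerunning after the fix skips them and picks up at the 3rd.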

Rework schedule methods

Dequindre.get_task_priorities and Dequindre.get_priorities are a little opaque.

Proposal A

We rename get_task_priorities to get_task_schedules, and rename get_priorities to get_schedules. They'll return the same values, but the new names will be more obvious to users.

Proposal B

We keep get_task_priorities and get_priorities as they are. We introduce get_schedules which returns an ordered list of when the tasks will be run at runtime.

Proposal A sounds best right now, but we should stew on it a bit before making a final decision.

Document with "dq" instead of "dd"

dd = Dequindre(dag)  # not this
dq = Dequindre(dag)  # like this

Define exception for when cycles are detected
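
One possible shape for this, as a sketch only. `CyclicGraphError` and `check_for_cycles` are hypothetical names, and the edges are assumed to be a dict of upstream -> set of downstream with string task names.

```python
class CyclicGraphError(Exception):
    """Raised when dependencies loop back on themselves (hypothetical name)."""


def check_for_cycles(edges):
    """Raise CyclicGraphError if `edges` (upstream -> downstream set) loops."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}

    def visit(node, path):
        colour[node] = GREY
        for child in edges.get(node, ()):
            state = colour.get(child, WHITE)
            if state == GREY:
                # child is already on the current path, so we found a loop
                raise CyclicGraphError(' -> '.join(path + [child]))
            if state == WHITE:
                visit(child, path + [child])
        colour[node] = BLACK

    for node in list(edges):
        if colour.get(node, WHITE) == WHITE:
            visit(node, [node])
```

A dedicated exception lets users catch a bad configuration separately from a task that fails at runtime.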

Consider renaming "dedge" to dependency or something more intuitive for users

Some big questions:

if not dedges, what do we call it?
should we encourage users to access the dedges directly?
what should the API look and feel like?

dedges is a dict where each key is an upstream node and each value is a set of downstream nodes.
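
To make the structure concrete, here is a sketch of turning that dict into run rounds. It is not dequindre's implementation: `get_schedules` is written from scratch here, uses plain strings as tasks, and assumes the graph has no cycles (it would loop forever otherwise).

```python
from collections import defaultdict


def get_schedules(dedges):
    """Group tasks into numbered rounds from an upstream -> {downstream} dict.

    Illustrative only; assumes the graph has no cycles.
    """
    priority = {}

    # every task starts in round 1
    for upstream, downstreams in dedges.items():
        priority.setdefault(upstream, 1)
        for task in downstreams:
            priority.setdefault(task, 1)

    # push each downstream task at least one round after its upstream task,
    # and repeat until nothing moves
    changed = True
    while changed:
        changed = False
        for upstream, downstreams in dedges.items():
            for task in downstreams:
                if priority[task] <= priority[upstream]:
                    priority[task] = priority[upstream] + 1
                    changed = True

    schedules = defaultdict(set)
    for task, level in priority.items():
        schedules[level].add(task)
    return dict(schedules)
```

Whatever we end up calling dedges, users mostly care about this derived view, which is an argument for keeping the raw dict out of the public API.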

Can dequindre work with virtualenv too?

No idea. I don't use venv, but it'd be a great add for v1.0 if it can. Explore.

Support Concurrent Tasks

Following up #75. asyncio or threading will support concurrency.

Maybe at a later date we can use multithreading to support parallel tasks, but I'm still opposed to it. If parallelism is a breaking point, then Airflow is almost always going to be a better tool for the job.
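
A minimal sketch of the threading route, assuming tasks in the same round are independent and spend most of their time waiting (on a subprocess, say). `run_concurrently` is a hypothetical function, and the callables stand in for real Tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(schedules, max_workers=4):
    """Run each round's callables in threads; rounds still run in order.

    `schedules` maps a round number to a list of zero-argument callables.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in sorted(schedules):
            futures = [pool.submit(task) for task in schedules[level]]
            # .result() blocks until the task ends and re-raises its exception
            results[level] = [f.result() for f in futures]
    return results
```

Waiting on every future before moving to the next round keeps the dependency guarantees intact.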

Allow users to "add" DAGs together

Large workflows can usually be broken up into parts. It'd be nice to define these parts separately and then add the DAGs together by combining all Tasks and dedges. The DAGs don't even have to be connected, since the DAG class allows for multiple graphs.

# define tasks
buy_groceries = DAG(tasks=tasks)
# define tasks
make_breakfast = DAG(tasks=tasks)

schedule = buy_groceries + make_breakfast
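
Here is roughly what `+` could do under the hood. The `DAG` below is a stripped-down stand-in written for this sketch, not the real class, and it assumes edges are stored as upstream -> set of downstream.

```python
class DAG:
    """Stripped-down stand-in for dequindre's DAG, for illustration only."""

    def __init__(self, tasks=None, edges=None):
        self.tasks = set(tasks or ())
        # upstream task -> set of downstream tasks
        self.edges = {k: set(v) for k, v in (edges or {}).items()}

    def __add__(self, other):
        if not isinstance(other, DAG):
            return NotImplemented
        # copy so neither operand is modified
        merged = {k: set(v) for k, v in self.edges.items()}
        for upstream, downstreams in other.edges.items():
            merged.setdefault(upstream, set()).update(downstreams)
        return DAG(self.tasks | other.tasks, merged)
```

Returning a new DAG keeps `buy_groceries` and `make_breakfast` reusable on their own after the addition.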

Default to base env when no env is selected

Mostly useful for short-hand demos.

It lets users get away with the poor practice of never specifying an environment, but I guess they can already do that.

protect DAG.edges from end users

Rename edges to _edges. Users shouldn't be accessing edges unless they know what they're doing. The naming should also reflect that edges are an internal attribute.

Start documenting with sphinx

Sphinx

Show only filename in Task repr

Filenames will rarely be reused (except main.py, I guess). Let's show the filename only when printing Tasks.
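
Something like the following would do it. The `Task` here is a minimal stand-in that only assumes a `loc` attribute holding the script path.

```python
from pathlib import PurePath


class Task:
    """Minimal stand-in for dequindre's Task; `loc` is the script path."""

    def __init__(self, loc):
        self.loc = loc

    def __repr__(self):
        # show only the final path component, e.g. Task(boil_water.py)
        return f'{type(self).__name__}({PurePath(self.loc).name})'
```

The full path stays available on `task.loc` for anyone who needs to tell two main.py files apart.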

Start tracking docstring examples with doctest

Verify or update all examples in docstrings

Offer option to config pipeline with yaml?

It might look like this:

env: {"HOMEDIR": path/to/homedir}
docker_image: continuum/miniconda3
task:
  name: my_task
  runfile: /path/to/runfile.py
  stage: 1
  conda_env: python36
  depends: [your_task, their_task]

Is this a good alternative to dynamic configuration?
How much work would it take?
Can we keep dynamic configuration without muddying the standard?

Add option to fail hard or soft

If a task fails, Dequindre will keep going as normal. We should have an option to accept this behavior or abort the whole thing.
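
A sketch of what the option might look like. `fail_hard` is a hypothetical parameter name, and tasks are plain callables here rather than real Tasks.

```python
def run_tasks(tasks, fail_hard=False):
    """Run zero-argument callables in order and return the failures.

    `fail_hard` is a hypothetical flag: True aborts on the first error,
    False records it and keeps going (the current behaviour per this issue).
    """
    failures = []
    for task in tasks:
        try:
            task()
        except Exception as err:
            if fail_hard:
                raise  # abort the whole run
            failures.append((task, err))
    return failures
```

Returning the failures in soft mode at least gives the caller something to inspect once the run ends.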

Document virtualenv functionality

Related to #51

Define dedges as downstream: upstream

Start using bumpversion

bumpversion

Right now we can get away with not tagging development builds because it's only one person. That will (hopefully) change.

Should we support asyncio?

asyncio has been in the standard library since Python 3.4, and the async/await syntax arrived in Python 3.5.

The DAG should catch cycles before they get to Dequindre

Annotate which tasks are running when `run_tasks` is called

something simple like

print(f'\nRunning {task.loc}\n')

Add class.qualname to repr

For all classes

Remove `stage` from repo

The stage param is adding unnecessary complexity for now. Bin it.

Allow user to define dependencies when DAG is initiated

write a test for it too.

Whiteboard the dynamic configuration API

what should the API look and feel like?

The API is somewhat limited by the simple structure of Task, DAG, and Dequindre as the node-, edge-, and scheduler-classes respectively.

Above all, dequindre exists to simplify things. It's baby's first scheduler. It's a learning tool. Priorities are:

to make the dag configuration easy to skim,
to make the dag easy to configure, and
to make the dag easy to debug.

Debugging is the last priority because dequindre is so minimalist that there really shouldn't be any puzzling errors.

With that in mind, here are some possible dependency configurations for the API:

current path

tea_dag = DAG(empty=True)
tea_dag.add_dependencies({
    pour_water: {get_mug, boil_water},
    steep_tea: pour_water,
    drink_tea: steep_tea
})

Reference last key as current value?

tea_dag = DAG(empty=True)
tea_dag.add_dependencies(
    pour_water={get_mug, boil_water},
    steep_tea=.,
    drink_tea=.
)

airflow-like

tea_dag = DAG(empty=True)
tea_dag.set_downstream(boil_water, pour_water)
tea_dag.set_downstream(pour_water, steep_tea)
tea_dag.set_downstream(steep_tea, drink_tea)

We could add dags together by overriding the bitshift operators.

make_tea >> go_to_work  # all of make_tea must finish before any of go_to_work starts
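
A sketch of how `>>` could be wired up. The `DAG` below is a stand-in written for this example, with edges assumed to be upstream -> set of downstream, and it links every task on the left to every task on the right, which is the bluntest way to get the "all before any" guarantee.

```python
class DAG:
    """Stripped-down stand-in for dequindre's DAG, for illustration only."""

    def __init__(self, tasks=None, edges=None):
        self.tasks = set(tasks or ())
        # upstream task -> set of downstream tasks
        self.edges = {k: set(v) for k, v in (edges or {}).items()}

    def __rshift__(self, other):
        """self >> other: every task in self runs before any task in other."""
        if not isinstance(other, DAG):
            return NotImplemented
        if self.tasks & other.tasks:
            raise ValueError('shared tasks would create a cycle')
        merged = {k: set(v) for k, v in self.edges.items()}
        for upstream, downstreams in other.edges.items():
            merged.setdefault(upstream, set()).update(downstreams)
        # make every task on the right depend on every task on the left
        for upstream in self.tasks:
            merged.setdefault(upstream, set()).update(other.tasks)
        return DAG(self.tasks | other.tasks, merged)
```

Linking only the last tasks of the left DAG to the first tasks of the right one would produce fewer edges, at the cost of a little more bookkeeping.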

Align readme.md with docs/

flush prints

Right now we get output like this:

I am preparing the infuser...
I am pouring water...
I am boiling water...
I am steeping tea...

Running Task(./tea-tasks/prep_infuser.py)
Running Task(./tea-tasks/pour_water.py)
Running Task(./tea-tasks/boil_water.py)
Running Task(./tea-tasks/steep_tea.py)

It should be like

Running Task(./tea-tasks/prep_infuser.py)
I am preparing the infuser...
Running Task(./tea-tasks/pour_water.py)
I am pouring water...
Running Task(./tea-tasks/boil_water.py)
I am boiling water...
Running Task(./tea-tasks/steep_tea.py)
I am steeping tea...
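
A possible fix, assuming the cause is the parent process holding the "Running ..." lines in its stdout buffer while the child scripts write straight through. `run_script` is a hypothetical helper, not the current run_tasks code.

```python
import subprocess
import sys


def run_script(path):
    """Announce a script, then run it with the current interpreter."""
    # flush=True pushes the line out before the child process starts writing;
    # without it the parent's buffered output can appear after the child's
    print(f'Running Task({path})', flush=True)
    return subprocess.run([sys.executable, path], check=False).returncode
```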

Remove activate_env_cmd parameter

We can make assumptions about the conda infrastructure, notably where to find the python. That means we can just call the environment directly.

setup tox to test multiple python versions

Document example usage of dequindre

get_priorities is important. So is max_stage. Nothing references that in the docker demo.

Check that conda environments exist

Right now it's easy to give dummy conda environments. That shouldn't slip by in prod.

Consider merging DAG and Dequindre functionality as Workflow

Suppose we had a different class structure built on Tasks and Workflows. Workflows would basically combine the functionality of DAGs and Dequindres.

from dequindre import Task, Workflow

def my_workflow():
    ## define tasks and environments
    pour_water = Task('pour_water.py')
    prep_infuser = Task('prep_infuser.py')
    boil_water = Task('boil_water.py')
    steep_tea = Task('steep_tea.py')
    drink_tea = Task('drink_tea.py')

    ## define runtime dependencies
    make_tea = Workflow(dependencies={
        boil_water: pour_water,
        steep_tea: {boil_water, prep_infuser}
    })

    ## run tasks
    make_tea.get_schedules()
    # defaultdict(<class 'set'>, {
    #     1: {Task(prep_infuser.py), Task(pour_water.py)},  
    #     2: {Task(boil_water.py)},  
    #     3: {Task(steep_tea.py)}})
        
    make_tea.run_tasks()

I think that UI looks better and more intuitive. It still plays nice with schedules too:

import time
from dequindre import schedule

if __name__ == '__main__':
    stoptime = time.time() + 3600 # seconds
    schedule.every(10).minutes.do(my_workflow)
    while time.time() < stoptime:
        schedule.run_pending()
        time.sleep(1)  # avoid a busy loop between checks

Where is the "schedule" part of the tool? Like if I want boil_water to run daily at 10 AM.

Supporting true scheduling is out of the question; other libraries do it better. What we can do is change the tagline to better reflect Dequindre's capabilities.

Maybe "dependency scheduler" or "minimalist pipeline manager"?