vogt4nick / dequindre
Dequindre /de-KWIN-der/ (n.): A minimalist scheduler.
License: MIT License
Right now we can get away with not tagging development builds because it's only one person. That will (hopefully) change.
If something goes wrong in a 5-script pipeline, say the problem is with the 3rd script, then after the problem is fixed, does the pipeline run from the very beginning, or can it start where the failure occurred?
I bet a .dequindre.json file would be enough to track which tasks have and have not run. Actually, it would only have to track which ones errored out. We can uniquely identify DAGs and Dequindres if we want. This may be a solution.
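A minimal sketch of how such a state file could work, assuming the pipeline is a flat, ordered list of script names. The JSON layout (a single "failed" key) and the helper names load_failed, save_failed, and tasks_to_run are hypothetical and not part of dequindre.

```python
import json
from pathlib import Path

STATE_FILE = Path(".dequindre.json")  # file name from the note; the schema is assumed


def load_failed(state_file=STATE_FILE):
    """Return the set of task names recorded as failed, or an empty set."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()).get("failed", []))


def save_failed(failed, state_file=STATE_FILE):
    """Record failed task names; remove the file when nothing failed."""
    if failed:
        state_file.write_text(json.dumps({"failed": sorted(failed)}))
    elif state_file.exists():
        state_file.unlink()


def tasks_to_run(ordered_tasks, state_file=STATE_FILE):
    """Resume from the first previously failed task, else run everything."""
    failed = load_failed(state_file)
    for index, name in enumerate(ordered_tasks):
        if name in failed:
            return list(ordered_tasks[index:])
    return list(ordered_tasks)
```

With a 5-script pipeline where the 3rd script was recorded as failed, tasks_to_run would return the 3rd, 4th, and 5th scripts. A real DAG would need to resume by dependency rather than by list position.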
write a test for it too.
Large workflows can usually be broken up into parts. It'd be nice to define these parts separately and then add the DAGs together by combining all Tasks and dedges. The DAGs don't even have to be connected, since the DAG class allows for multiple graphs.
# define tasks
buy_groceries = DAG(tasks=tasks)
# define tasks
make_breakfast = DAG(tasks=tasks)
schedule = buy_groceries + make_breakfast
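One way the addition could be implemented is sketched here. SketchDAG is a hypothetical stand-in, not dequindre's actual DAG class, and it assumes edges are stored as a dict from an upstream task to the set of its downstream tasks.

```python
from collections import defaultdict


class SketchDAG:
    """Hypothetical stand-in for dequindre's DAG, for illustration only."""

    def __init__(self, tasks=None, edges=None):
        self.tasks = set(tasks or ())
        # upstream task -> set of downstream tasks
        self.edges = defaultdict(set)
        for upstream, downstream in (edges or {}).items():
            self.edges[upstream] |= set(downstream)

    def __add__(self, other):
        if not isinstance(other, SketchDAG):
            return NotImplemented
        # Copy the left operand, then merge in the right operand's edges.
        combined = SketchDAG(self.tasks | other.tasks, self.edges)
        for upstream, downstream in other.edges.items():
            combined.edges[upstream] |= downstream
        return combined


buy_groceries = SketchDAG({"go_to_store", "pay"}, {"go_to_store": {"pay"}})
make_breakfast = SketchDAG({"fry_eggs", "eat"}, {"fry_eggs": {"eat"}})
schedule = buy_groceries + make_breakfast
```

The two operands are left unchanged, and the result may contain disconnected subgraphs.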
For all classes
They're pretty garbage right now.
Following up #75. asyncio or threading will support concurrency.
Maybe at a later date we can use multithreading to support parallel tasks, but I'm still opposed to it. If parallelism is a breaking point, then Airflow is almost always going to be a better tool for the job.
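A rough sketch of what the asyncio route could look like, assuming each task is launched as a command in a subprocess. run_script and run_concurrently are hypothetical names, and asyncio.run needs Python 3.7 or later. The event loop itself stays single-threaded; the child processes are scheduled by the operating system.

```python
import asyncio
import sys


async def run_script(args):
    """Run one command in a subprocess and return its exit code."""
    process = await asyncio.create_subprocess_exec(*args)
    return await process.wait()


async def run_concurrently(commands):
    """Start all commands at once and wait for every one to finish."""
    return await asyncio.gather(*(run_script(cmd) for cmd in commands))
```

Calling asyncio.run(run_concurrently(commands)) with a list of argument lists returns the exit codes in the same order as the commands.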
Dequindre.get_task_priorities and Dequindre.get_priorities are a little opaque.
Proposal A: We rename get_task_priorities to get_task_schedules, and rename get_priorities to get_schedules. They'll return the same values, but the new names will be more obvious to users.
Proposal B: We keep get_task_priorities and get_priorities as they are. We introduce get_schedules, which returns an ordered list of when the tasks will be run at runtime.
Proposal A sounds best right now, but we should stew on it a bit before making a final decision.
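For reference, this is roughly the kind of value a get_schedules function could return: tasks grouped by the earliest round in which they can run. It is a standalone sketch, not dequindre's implementation, and it assumes edges are given as a dict from an upstream task to the set of its downstream tasks.

```python
from collections import defaultdict


def get_schedules(tasks, dedges):
    """Group tasks by the earliest round in which they can run.

    `tasks` must contain every task; `dedges` maps an upstream task to
    the set of its downstream tasks. Tasks caught in a cycle never
    become ready and are left out of the result.
    """
    upstream_count = {task: 0 for task in tasks}
    for downstream_tasks in dedges.values():
        for task in downstream_tasks:
            upstream_count[task] += 1

    schedules = defaultdict(set)
    ready = {task for task, count in upstream_count.items() if count == 0}
    priority = 1
    while ready:
        schedules[priority] = ready
        next_ready = set()
        for task in ready:
            for downstream in dedges.get(task, ()):
                upstream_count[downstream] -= 1
                if upstream_count[downstream] == 0:
                    next_ready.add(downstream)
        ready = next_ready
        priority += 1
    return schedules


tea = get_schedules(
    {"prep_infuser", "pour_water", "boil_water", "steep_tea"},
    {
        "pour_water": {"boil_water"},
        "boil_water": {"steep_tea"},
        "prep_infuser": {"steep_tea"},
    },
)
# tea[1] == {"prep_infuser", "pour_water"}, tea[2] == {"boil_water"}, tea[3] == {"steep_tea"}
```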
asyncio comes standard in Python 3.5 and later.
Related to #51
If a task fails, Dequindre will keep going as normal. We should have an option to accept this behavior or abort the whole thing.
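A sketch of what such an option could look like, assuming tasks are run one after another as subprocess commands. run_all and abort_on_error are hypothetical names, not existing dequindre API.

```python
import subprocess
import sys


def run_all(commands, abort_on_error=False):
    """Run commands in order; return a list of (command, returncode) pairs.

    With abort_on_error=True, stop after the first non-zero exit code.
    """
    results = []
    for command in commands:
        returncode = subprocess.run(command).returncode
        results.append((command, returncode))
        if returncode != 0 and abort_on_error:
            break
    return results
```

Defaulting abort_on_error to False keeps the current keep-going behavior, so existing pipelines would not change unless the caller opts in.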
No idea. I don't use venv, but it'd be a great add for v1.0 if it can. Explore.
Suppose we had a different class structure built on Tasks and Workflows. Workflows would basically combine the functionality of DAGs and Dequindres.
from dequindre import Task, Workflow

def my_workflow():
    ## define tasks and environments
    prep_infuser = Task('prep_infuser.py')
    pour_water = Task('pour_water.py')
    boil_water = Task('boil_water.py')
    steep_tea = Task('steep_tea.py')
    drink_tea = Task('drink_tea.py')
    ## define runtime dependencies
    make_tea = Workflow(dependencies={
        boil_water: pour_water,
        steep_tea: {boil_water, prep_infuser}
    })
    ## run tasks
    make_tea.get_schedules()
    # defaultdict(<class 'set'>, {
    #     1: {Task(prep_infuser.py), Task(pour_water.py)},
    #     2: {Task(boil_water.py)},
    #     3: {Task(steep_tea.py)}})
    make_tea.run_tasks()
I think that UI looks better and more intuitive. It still plays nice with schedules too:
import time
from dequindre import schedule

if __name__ == '__main__':
    stoptime = time.time() + 3600  # seconds
    schedule.every(10).minutes.do(my_workflow)
    while time.time() < stoptime:
        schedule.run_pending()
        time.sleep(1)  # avoid a busy loop between checks
The built-in sched should be some help here. We may want to introduce a command line utility for this.
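For a sense of what the standard-library sched module offers, here is a small sketch that queues an action several times at a fixed interval. run_repeatedly is a hypothetical helper. sched works with delays (enter) or absolute times (enterabs), so something like "daily at 10 AM" would still need the delay computed by the caller.

```python
import sched
import time


def run_repeatedly(action, interval, repeats):
    """Call `action` `repeats` times, `interval` seconds apart, then return."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    for i in range(repeats):
        # enter(delay, priority, action): the delay is relative to now
        scheduler.enter(i * interval, 1, action)
    scheduler.run()  # blocks until the queue is empty
```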
get_priorities is important. So is max_stage. Nothing references that in the docker demo.
We can make assumptions about the conda infrastructure, notably where to find the python. That means we can just call the environment directly.
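As an illustration of that assumption: in a default conda install, named environments live under the envs directory of the install prefix, with the interpreter at bin/python on Unix-like systems and python.exe at the environment root on Windows. conda_python below is a hypothetical helper, and environments created elsewhere (for example via a custom envs_dirs setting) would not be found this way.

```python
import os
import sys


def conda_python(conda_prefix, env_name, platform=sys.platform):
    """Build the expected interpreter path for a named conda environment."""
    env_dir = os.path.join(conda_prefix, "envs", env_name)
    if platform.startswith("win"):
        return os.path.join(env_dir, "python.exe")
    return os.path.join(env_dir, "bin", "python")
```

A task could then be launched by passing that path and the script path to subprocess.run, without activating the environment first.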
Where is the "schedule" part of the tool? Like if I want boil_water to run daily at 10 AM.
Supporting true scheduling is out of the question; other libraries do it better. What we can do is change the tagline to better reflect Dequindre's capabilities.
Maybe "dependency scheduler" or "minimalist pipeline manager"?
Right now it's easy to give dummy conda environments. That shouldn't slip by in prod.
Right now we get output like this:
I am preparing the infuser...
I am pouring water...
I am boiling water...
I am steeping tea...
Running Task(./tea-tasks/prep_infuser.py)
Running Task(./tea-tasks/pour_water.py)
Running Task(./tea-tasks/boil_water.py)
Running Task(./tea-tasks/steep_tea.py)
It should be like
Running Task(./tea-tasks/prep_infuser.py)
I am preparing the infuser...
Running Task(./tea-tasks/pour_water.py)
I am pouring water...
Running Task(./tea-tasks/boil_water.py)
I am boiling water...
Running Task(./tea-tasks/steep_tea.py)
I am steeping tea...
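One plausible cause (an assumption, not confirmed here) is that the parent's "Running ..." lines sit in a buffer while each child process writes straight to the shared output, so the children's text appears first. Flushing the header before starting the child would fix that case. run_with_header is a hypothetical helper.

```python
import subprocess
import sys


def run_with_header(label, command, stream=None):
    """Print a header, flush it, then run the command on the same stream.

    `stream` must be a real file object if given; None means the
    parent's own standard output.
    """
    out = sys.stdout if stream is None else stream
    # flush=True pushes the header out before the child writes anything
    print(f"Running {label}", file=out, flush=True)
    return subprocess.run(command, stdout=stream).returncode
```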
Verify or update all examples in docstrings
what should the API look and feel like?
The API is somewhat limited by the simple structure of Task, DAG, and Dequindre as the node-, edge-, and scheduler-classes respectively.
Above all, dequindre exists to simplify things. It's baby's first scheduler. It's a learning tool. Priorities are:
Debugging is the last priority because dequindre is so minimalist that there really shouldn't be any puzzling errors.
With that in mind, here are some possible dependency configurations for the API:
current path
tea_dag = DAG(empty=True)
tea_dag.add_dependencies({
    pour_water: {get_mug, boil_water},
    steep_tea: pour_water,
    drink_tea: steep_tea
})
Reference last key as current value?
tea_dag = DAG(empty=True)
tea_dag.add_dependencies(
    pour_water={get_mug, boil_water},
    steep_tea=.,
    drink_tea=.
)
airflow-like
tea_dag = DAG(empty=True)
tea_dag.set_downstream(boil_water, pour_water)
tea_dag.set_downstream(pour_water, steep_tea)
tea_dag.set_downstream(steep_tea, drink_tea)
We could add DAGs together by overriding the bitshift operators.
make_tea >> go_to_work # all of make_tea must finish before any of go_to_work starts
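A sketch of how >> could be given that meaning: link every task with nothing downstream in the left DAG to every task with nothing upstream in the right DAG. SketchDAG, sinks, and sources are hypothetical names rather than dequindre API, and the sketch assumes the two DAGs share no tasks.

```python
from collections import defaultdict


class SketchDAG:
    """Hypothetical stand-in for dequindre's DAG, for illustration only."""

    def __init__(self, tasks, edges=None):
        self.tasks = set(tasks)
        self.edges = defaultdict(set)  # upstream -> downstream tasks
        for upstream, downstream in (edges or {}).items():
            self.edges[upstream] |= set(downstream)

    def sinks(self):
        """Tasks with nothing downstream of them."""
        return {task for task in self.tasks if not self.edges.get(task)}

    def sources(self):
        """Tasks with nothing upstream of them."""
        has_upstream = set().union(*self.edges.values())
        return self.tasks - has_upstream

    def __rshift__(self, other):
        # Merge both graphs, then make every sink of self precede
        # every source of other.
        combined = SketchDAG(self.tasks | other.tasks, self.edges)
        for upstream, downstream in other.edges.items():
            combined.edges[upstream] |= downstream
        for sink in self.sinks():
            combined.edges[sink] |= other.sources()
        return combined


make_tea = SketchDAG({"boil_water", "steep_tea"}, {"boil_water": {"steep_tea"}})
go_to_work = SketchDAG({"pack_bag", "drive"}, {"pack_bag": {"drive"}})
morning = make_tea >> go_to_work
```

Because every task in make_tea is upstream of (or is) one of its sinks, and every task in go_to_work is downstream of (or is) one of its sources, the added edges are enough to order the two DAGs completely.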
It might look like this:
env: {"HOMEDIR": path/to/homedir}
docker_image: continuum/miniconda3
task: my_task
runfile: /path/to/runfile.py
stage: 1
conda_env: python36
depends: [your_task, their_task]
Filenames will rarely be reused (except main.py, I guess). Let's show the filename only when printing Tasks.
The stage param is adding unnecessary complexity for now. Bin it.
Rename edges to _edges. Users shouldn't be accessing edges unless they know what they're doing. The naming should also reflect that edges are an internal attribute.
Mostly useful for short-hand demos.
It allows users to get away with poor practice and never specifying an environment, but I guess they can already do that.
dd = Dequindre(dag) # not this
dq = Dequindre(dag) # like this
Some big questions:
dedges, what do we call it?
dedges directly?
dedges is a dict where each key is an upstream node and each value is a set of downstream nodes.
something simple like
print(f'\nRunning {task.loc}\n')