
synpp's Introduction

Synthetic Population Pipeline (synpp)


The synpp module is a tool to chain different stages of a (population) synthesis pipeline. This means that self-contained pieces of code can be run, which are dependent on the outputs of other self-contained pieces of code. Those pieces, or steps, are called stages in this module.

The following will describe the components of the pipeline and how it can be set up and configured. Scroll to the bottom to find a full example of such a pipeline which automatically downloads NYC taxi data sets, merges them together and calculates the average vehicle occupancy during a predefined period.

Installation

The synpp package releases can be installed via pip:

pip install synpp

Currently, version 1.5.1 is the active release version. Alternatively, you can clone the develop branch of this repository to use the development version. It can be installed by calling

pip install .

inside of the repository directory.

Concepts

A typical chain of stages could, for instance, be: (C1) load raw census data, (C2) clean raw census data (dependent on C1), (H1) load raw household travel survey data, (H2) clean survey data (dependent on H1), (P1) merge cleaned census (C2) and survey (H2) data, (P2) generate a synthetic population from the merged data (P1).

In synpp each stage is defined by:

  • A descriptor, which contains the stage logic.
  • Configuration options that parameterize each stage.

Defining a descriptor

A descriptor can be defined in its compact form or in its full form. Both work in the same way and can be used interchangeably in most cases.

In this README the full form is preferred for explaining each of synpp's features, as it is more expressive, but towards the end a closer look at the compact form is also provided.

A descriptor in its full form looks like:

def configure(context):
  pass

def execute(context):
  pass

def validate(context):
  pass

These functions are provided in a Python object, such as a module, a class or a class instance. synpp expects either a string containing the path to the object, such as "pkg.subpkg.module", or the instantiated object directly.

In its compact form, the stage is defined as a function, and looks like:

@synpp.stage
def stage_to_run():
  pass

The @synpp.stage decorator tells synpp that it should treat this function as a stage and how it should do so.

Configuration and Parameterization

Whenever the pipeline explores a stage, configure is called first. Note that in the example above we use a Python module, but the same procedure would work analogously with a class. In configure one can tell the pipeline what the stage expects in terms of other input stages and in terms of configuration options:

def configure(context):
  # Expect an output directory
  value = context.config("output_path")

  # Expect a random seed
  value = context.config("random_seed")

  # Expect a certain stage (no return value)
  context.stage("my.pipeline.raw_data")

We could now add the stage defined above (let's call it my.pipeline.raw_data) as a dependency of another stage. However, as we did not define a default value for random_seed with the config method, the depending stage needs to set one explicitly, like so:

def configure(context):
  context.stage("my.pipeline.raw_data", { "random_seed": 1234 })

Note that it is even possible to build recursive chains of stages using only one stage definition:

def configure(context):
  i = context.config("i")

  if i > 0:
    context.stage("this.stage", { "i": i - 1 })

Configuration options can also be defined globally for the pipeline. If no default value is given for an option in configure and no specific value is passed to the stage by a depending stage, the value is looked up in the pipeline's global configuration.
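For illustration, a stage could declare one option with a default and one without; the latter must then be provided either by a depending stage or via the global configuration passed to synpp.run (a sketch reusing the option names from above; the default keyword follows the usage shown in the compact form section further below):

def configure(context):
  # Option with an explicit default value
  context.config("random_seed", default = 1234)

  # Option without a default: it must be passed by a depending stage or be
  # present in the global configuration, e.g.
  # synpp.run([...], config = { "output_path": "/path/to/output" })
  context.config("output_path")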

Execution

The requested configuration values and stages are afterwards available to the execute step of a stage. There those values can be used to do the "heavy work" of the stage. As the configure step already defined what kind of values to expect, we can be sure that those values and dependencies are present once execute is called.

def execute(context):
  # Load some data from another stage
  df = context.stage("my.pipeline.census.raw")

  df = df.dropna()
  df["age"] = df["age"].astype(int)

  # We could access some values if we wanted
  value = context.config("...")

  return df

Note that the execute step returns a value. This value will be pickled (see pickle package of Python) and cached on the hard drive. This means that whenever the output of this stage is requested by another stage, it doesn't need to be run again. The pipeline can simply load the cached result from hard drive.

If one has a very complex pipeline with many stages, this means that a change in one stage will likely not require re-running the whole pipeline, but only a fraction of it. The synpp framework includes intelligent exploration algorithms which automatically figure out which stages need to be re-run.

Running a pipeline

A pipeline can be started using the synpp.run method. A typical run would look like this:

config = { "random_seed": 1234 }
working_directory = "~/pipeline/cache"

synpp.run([
    { "descriptor": "my.pipeline.final_population" },
    { "descriptor": "my.pipeline.paper_analysis", "config": { "font_size": 12 } }
], config = config, working_directory = working_directory)

Here we call the stage defined by the module my.pipeline.final_population, which should be available in the Python path. And we also want to run the my.pipeline.paper_analysis stage with a font size parameter of 12. Note that in both cases we could also have passed the bare Python module objects instead of strings.
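For instance, passing the module object directly could look like this (a sketch; it assumes my.pipeline.final_population is importable on the Python path):

import synpp
import my.pipeline.final_population as final_population

synpp.run([
    { "descriptor": final_population } # module object instead of the string name
], config = config, working_directory = working_directory)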

The pipeline will now figure out how to run those stages. Probably they have dependencies and the analysis stage may even depend on the other one. Therefore, synpp explores the tree of dependencies as follows:

  • Consider the requested stages (two in this case)
  • Step by step, go through the dependencies of those stages
  • Then again, go through the dependencies of all added stages, and so on

In this way the pipeline traverses the whole tree of dependencies as they are defined by the configure steps of all stages. At the same time it collects information about which configuration options and parameters are required by each stage. Note that a stage can occur twice in this dependency tree if it is requested with different parameters, as the following sketch illustrates.
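For illustration, a stage requesting the same dependency twice with different parameters could look like this (the year parameter is made up for this sketch):

def configure(context):
  # Both parameterizations appear as separate nodes in the dependency tree
  context.stage("my.pipeline.census.raw", { "year": 2015 })
  context.stage("my.pipeline.census.raw", { "year": 2016 })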

After constructing a tree of stages, synpp devalidates some of them according to the following scheme. A stage is devalidated if ...

  • ... it is requested by the run call (and rerun_required is set to True, the default)
  • ... it is new (no meta data from a previous call is present)
  • ... the code of the stage has changed (verified with inspection)
  • ... at least one of the requested configuration options has changed
  • ... at least one dependency has been re-run since the last run of the stage
  • ... the list of dependencies has changed
  • ... manual validation of the stage has failed (see below)
  • ... any ascendant of the stage has been devalidated

This list of conditions makes sure that in almost any case of pipeline modification we end up in a consistent situation (though we cannot prove it). The only measure that may be important to enforce 'by convention' is to always run a stage after the code has been modified. Though even this can be automated.

Validation

Each stage has an additional validate step, which also receives the configuration options and the parameters. Its purpose is to return a hash value that represents the environment of the stage. To learn about the concept in general, search for "md5 hash", for instance. The idea is the following: After the execute step, the validate step is called and it will return a certain value. Next time the pipeline is resolved the validate step is called during devalidation, i.e. before the stage is actually executed. If the return value of validate now differs from what it was before, the stage will be devalidated.

This is useful to check the integrity of data that is not generated inside of the pipeline but comes from the outside, for instance:

import os

def configure(context):
  context.config("input_path")

def validate(context):
  path = context.config("input_path")
  filesize = os.path.getsize(path)

  # If the file size has changed, the file must have changed,
  # hence we want to run the stage again.
  return filesize

def execute(context):
  pass # Do something with the file

Cache paths

Sometimes, results of a stage are not easily representable in Python. Even more, stages may call Java or Shell scripts which simply generate an output file. For these cases each stage has its own cache path. It can be accessed through the stage context:

def execute(context):
  # In this case we write a file to the cache path of the current stage
  with open("%s/myfile.txt" % context.path()) as f:
    f.write("my content")

  # In this case we read a file from the cache path of another stage
  with open("%s/otherfile.txt" % context.path("my.other.stage")) as f:
    value = f.read()

As the example shows, we can also access cache paths of other stages. The pipeline will make sure that you only have access to the cache path of stages that have been defined as dependencies before. Note that the pipeline cannot enforce that one stage is not corrupting the cache path of another stage. Therefore, by convention, a stage should never write to the cache path of another stage.

Aliases

Once a pipeline has been defined, the structure is relatively rigid as stages are referenced by their names. To provide more flexibility, it is possible to define aliases, for instance:

synpp.run(..., aliases = {
  "my.pipeline.final_population": "my.pipeline.final_population_replacement"
})

Whenever my.pipeline.final_population is requested, my.pipeline.final_population_replacement will be used instead. Note that this makes it possible to define entirely virtual stages that are referenced from other stages and which are only bound to a concrete stage when running the pipeline, as the sketch below shows.
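A minimal sketch of such a virtual stage (all names are made up for illustration): an analysis stage depends on a name for which no module needs to exist, and the name is only bound when the pipeline is run:

# analysis.py -- depends on the purely virtual name "my.pipeline.population"
def configure(context):
  context.stage("my.pipeline.population")

def execute(context):
  df = context.stage("my.pipeline.population")
  # ... analyze df ...

# At run time the virtual name is bound to a concrete stage:
synpp.run([
  { "descriptor": "analysis" }
], aliases = {
  "my.pipeline.population": "my.pipeline.final_population"
}, working_directory = working_directory)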

Parallel execution

The synpp package comes with some simplified ways of parallelizing code, which are built on top of the multiprocessing package. A parallel routine can be set up with the following pattern:

def run_parallel(context, x):
  return x**2 + context.data("y")

def execute(context):
  data = { "y": 5 }

  with context.parallel(data) as parallel:
    result = parallel.map(run_parallel, [1, 2, 3, 4, 5])

This approach looks similar to the Pool object of multiprocessing but has some simplifications. First, the first argument of the parallel routine is a context object, which provides access to configuration options and parameters. Furthermore, it provides the data that has been passed to context.parallel in the execute function. This simplifies passing data to all worker processes considerably compared to the more flexible approach in multiprocessing. Otherwise, the parallel object provides most of the functionality of Pool, like map, async_map, imap, and unordered_imap.
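Since the parallel context also exposes configuration options, a stage can combine both mechanisms (a sketch; the factor option is made up for illustration):

def configure(context):
  context.config("factor")

def run_parallel(context, x):
  # Both the configuration and the shared data dictionary are available here
  return context.config("factor") * x + context.data("y")

def execute(context):
  with context.parallel({ "y": 5 }) as parallel:
    return parallel.map(run_parallel, [1, 2, 3, 4, 5])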

Info

While running the pipeline a lot of additional information may be interesting, like how many samples of a data set have been discarded in a certain stage. However, they often would only be used at the very end of the pipeline when maybe a paper, a report or some explanatory graphics are generated. For that, the pipeline provides the set_info method:

def execute(context):
  # ...
  context.set_info("dropped_samples", number_of_dropped_samples)
  # ...

The information can later be retrieved from another stage (which has the stage in question as a dependency):

def execute(context):
  # ...
  value = context.get_info("my.other.stage", "dropped_samples")
  # ...

Note that the info functionality should only be used for light-weight information like integers, short strings, etc.

Progress

The synpp package provides functionality to show the progress of a stage similar to tqdm. However, tqdm tends to spam the console output which is especially undesired if pipelines have long runtimes and run, for instance, in Continuous Integration environments. Therefore, synpp provides its own functionality, although tqdm could still be used:

def execute(context):
  # Used as a context manager, updating the progress manually:
  with context.progress(label = "My progress...", total = 100) as progress:
    i = 0

    while i < 100:
      progress.update()
      i += 1

  # Or used as a wrapper around an iterable:
  for i in context.progress(range(100)):
    pass

Compact stage definition

As quickly introduced before, stages can also be defined in a compact form. The example offered is the simplest possible, where a stage takes no configuration parameters. Consider now the more elaborate setting:

@synpp.stage(loaded_census="my.pipeline.census.raw", sample_size="census_sample_size")
def clean_census(loaded_census, sample_size=0.1):
    ...

When synpp sees clean_census, it will convert it under the hood to a stage in its full form. Basically, @synpp.stage says how the stage should be configured, and the function defines the stage's logic. Concretely, the stage above is converted by synpp into something like:

def configure(context):
  context.stage("my.pipeline.census.raw")
  context.config("census_sample_size", default=0.1)

def execute(context):
  loaded_census = context.stage("my.pipeline.census.raw")
  sample_size = context.config("census_sample_size")
  return clean_census(loaded_census, sample_size)

As you may have noticed, census_sample_size is a config option defined in the config file, and in case it isn't found, synpp will simply use the function's default. Notice also that the following wouldn't work: @synpp.stage(..., sample_size=0.2), since synpp would try to look up a config option named "0.2", which doesn't exist.

In case a parameterized stage must be passed as a dependency, this can be done in a similar way, by simply wrapping the stage with synpp.stage(). Following the previous example, we may replace the first argument with loaded_census=synpp.stage("my.pipeline.census.raw", file_path="path/to/alternative/census"), as sketched below.
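Putting this together, the decorated stage from above could then look like this (using the file_path option from the previous paragraph):

@synpp.stage(
    loaded_census=synpp.stage("my.pipeline.census.raw", file_path="path/to/alternative/census"),
    sample_size="census_sample_size"
)
def clean_census(loaded_census, sample_size=0.1):
    ...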

This compact way of defining stages does not support all functionality, for instance custom stage devalidation, but functionality that requires the context object is still available via the helper method synpp.get_context().

Command-line tool

The synpp pipeline comes with a command line tool, which can be called like

python3 -m synpp [config_path]

If the config path is not given, it will assume config.yml. This file should contain everything to run a pipeline. A simple version would look like this:

# General pipeline settings
working_directory: /path/to/my/working_directory

# Requested stages
run:
  - my_first_module.my_first_stage
  - my_first_parameterized_stage:
      param1: 123
      param2: 345

# These are configuration options that are used in the pipeline
config:
  my_option: 123

It receives the working directory, a list of stages (which may be parameterized) and all configuration options. The stages listed above should be available as Python modules or classes. Furthermore, aliases can be defined as a top-level element of the file.
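A sketch of what such an alias block could look like, assuming it mirrors the Python dictionary shown earlier (the exact YAML layout is an assumption):

aliases:
  my.pipeline.final_population: my.pipeline.final_population_replacement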

NYC Taxi Example

This repository contains an example of the pipeline. To run it, you will need pandas as an additional Python dependency. For testing, you can clone this repository to any directory on your machine. Inside the repository directory you can find the examples directory. If you did not install synpp yet, you can do this by executing

pip install .

inside of the repository directory. Afterwards, open examples/config.yml and adjust the working_directory path. This path should exist on your machine and it should be empty. The easiest is to simply create a new folder and put its path into config.yml.

You can now go to examples and call the pipeline code:

cd examples
python3 -m synpp

It will automatically discover config.yml (but you could pass a different config file path manually as a command line argument). It will then download the NYC taxi data for January, February and March 2018 (see the configuration options in config.yml). Note that this happens in one stage, for which you can find the code in nyc_taxi.download. It is parameterized by a month and a year to download the respective data set. These data sets then go into nyc_taxi.aggregate, where they are merged together. Finally, an average occupancy value is printed out in nyc_taxi.print_occupancy. So the dependency structure is as follows:

nyc_taxi.aggregate depends on multiple nyc_taxi.download(year, month)
nyc_taxi.print_occupancy depends on nyc_taxi.aggregate
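For illustration, the configure step of the aggregation stage could request the parameterized download stages roughly like this (a sketch; the actual code and config option names live in the examples directory and may differ):

def configure(context):
  year = context.config("year")
  months = context.config("months")

  for month in months:
    context.stage("nyc_taxi.download", { "year": year, "month": month })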

After one successful run of the pipeline you can start it again. You will notice that the pipeline does not download the data again, because nothing has changed for those stages. However, if you change the requested months in config.yml, the pipeline will download the additional data sets.

synpp's People

Contributors

ainar, ctchervenkov, davibicudo, nitnelav, sebhoerl


synpp's Issues

Expose pipeline network and status through zmq

Just an idea, because I like ZeroMQ and because it's already a dependency :)

We could expose the state of the pipeline through a zmq API. The network structure could be exposed with a REQ-REP pattern and the various nodes' states through a PUB-SUB pattern.
The REQ-REP pattern could also be used for "control" requests like: run X, invalidate the cache for X, pause after X, etc.
That would enable any HMI technology to be used on top of it...

Decorator as a stage wrapper

I'm considering refactoring existing code to use synpp, and to avoid changing the code too much, one option that came to my mind would be to use decorator functions for converting existing functions to pipeline stages, in a similar way to pytest's parametrize.
Something like:

@synpp.stage(df1="module.stage1", df2="module.stage2", param="config_key")
def old_function(df1, df2, param): ...

This would be converted during process_stages (i.e. before running the pipeline) to something like:

class old_function:
    def configure(self, context):
        context.stage("module.stage1")
        context.stage("module.stage2")
        context.config("config_key")

    def execute(self, context):
        return old_function(context.stage("module.stage1"), context.stage("module.stage2"), context.config("config_key"))

Do you think this could work?

Improving cache management

Currently, if a working directory is given, all cached results are dropped from memory after each stage execution.
We could accelerate the execution of the whole pipeline by avoiding re-loading results that have been used or produced during the preceding stage.

This is how it would work:
We would use a cache map for all stage executions. At each stage execution, this cache is filled with dependency results/caches and, in the end, with the stage result.
This is already the case when no working directory is given. So technically, we could use the same cache.
Then, at the beginning of each stage execution, we delete the cache that will not be used during the subsequent execution.

This reduces the execution time of the whole pipeline significantly depending on the cache file sizes. This may be explained by the fact that a topological sort gives execution order: for each stage, its result is probably used during the next stage.

There is one major drawback: the cache is mutable, so any modification of a retrieved cache will be transmitted to the following stages using this same retrieved cache (which does not happen currently and which we generally do not want). I do not know how to prevent that, because efficiently detecting a modification of an object, or making it immutable, depends on the kind of object. A heavy way to do that is to compare the serialization before and after execution, but then we may lose most of the speedup, depending on the cost of a serialization.
Let me know if you see any solution apart from warning the developer or if you note any other drawbacks.

Maybe we could make my suggestion an optional behavior?

I already implemented it in my fork (diffs). If you agree with my suggestion, should I wait for #81 to be merged to open a new PR? Or may you want to implement it by yourself?

Details of input data

Hi, Thank you for the nice repo.
Could you please add more details about the input data requirements for running this pipeline?
I imagine it is more or less the same as what is written under the "Synthetic population of Switzerland" GitLab repo.
However, especially for an external user, more explicit information can be very helpful. For example, the GitLab repo documents the need for structural_survey data, but it does not specify at which level (NUTS2, NUTS3, etc.) it should be provided, nor which socio-demographic variables the dataset should contain. It is the same for the other inputs as well. Do you think it is possible to give a clearer picture of this?

Increase verbosity of parameters

We should make sure that the pipeline throws an error if a config option / parameter is passed to a stage which is not requested.

Avoid monolithic pipeline state file

Currently, the pipeline.json file containing the state of all stages is one big monolithic file. This comes with problems, for instance when one wants to run the pipeline multiple times in parallel, e.g. with different random seeds. This can lead to race conditions in which pipeline.json is updated by one process, read by another one, etc.

Ideally, meta information about stages could be distributed in the relevant folders of the stages.

Difference between synpp and other tools

I discovered synpp through https://github.com/eqasim-org/ile-de-france.
This tool is handy. Doing some research, I found that this kind of framework is widespread. They are called "data pipeline" frameworks.
We can find it in bioinformatics or, more generally, in data science research works.

So here are my questions:

  • What are the differences between synpp and the other pipeline frameworks that are far more used?
  • Does synpp have specific and mandatory features for population synthesis?

For example:

I think the reason for synpp is that it is more straightforward than the other tools I listed (or I am just used to it). Because of that, I think synpp should stay as simple as possible and not reinvent the wheel with each new feature.
Do you have another opinion? What were your thoughts about alternatives, if you considered any?

That makes me wonder, can synpp be officially generalized for other works non-related to population synthesis?

Jupyter support

It would be nice to have functions to run things with synpp from Jupyter, for instance stage by stage and to examine the results directly, e.g.

pipeline = synpp.setup("config_ile_de_france.yml")
df_census = pipeline.run("data.census.cleaned")
# Do something with df_census

df_census = pipeline.run("data.census.cleaned", stages = {
  "data.census.raw": SOME_DATA # Override stage output
})

Default value of rerun_required when running a pipeline is False and cannot be changed

Hello,

After upgrading from 1.3.1 (the version of synpp used for https://github.com/eqasim-org/ile-de-france) to 1.5.0 (the latest version available today), I noticed a change of behavior.
The stages in the run section of the YML file are no longer "devalidated" if they are cached. I would like to always "devalidate" stages in the run section, but there is no parameter to do that from the YML file. Am I missing something?

I found that the rerun_required parameter of run_pipeline is set to False by default.

def run_pipeline(self, definitions=None, rerun_required=False, dryrun=None, verbose=False, flowchart_path=None):

It confused me, as it is written:

A stage is devalidated if ...

  • ... it is requested by the run call (and rerun_required is set to True, the default)

I understand that putting a stage name in the run section of the YML is not equivalent to requesting a run call on it. There is maybe something to clarify here.

Aina

Optional intermediary stages

Hi @sebhoerl

Considering a pipeline with stages like:

LOAD_DATA -> PROCESS1 -> (PROCESS2) -> WRITE_DATA

what would be the best way to handle process2? write_data depends on either process1 or process2 (both are possible).

One way to solve it could be setting the chain as if all are mandatory, but passing a flag to process2 named run_process2, which is defined in the config. This is not very elegant, since ideally we could simply add process2 in the run-list.
Another option is combining process1 and process2, but this also requires some flag and if each is a long-running complex stage, this is also not ideal.

Do you have another suggestion on how to deal with this issue?

Run stages in parallel

Often, the tree structure of the pipelines makes it possible to run stages in parallel. Right now the pipeline runs one stage at a time. To make use of parallel computing power, a couple of steps are necessary:

  1. Let the user define resource availability via configuration, e.g.

     resources:
       - cpu: 8
       - memory: 10

  2. Let the user define resource requirements, e.g.

     def configure(context):
       context.resource("cpu", 4)

  3. Run stages in parallel in the pipeline. There is a caveat: we cannot start slave processes from within slave processes. This means that if a stage makes use of the parallel() context, it must not already run in a child process! Therefore, we need to put some thought into intelligent management of the process pool. (In particular, it would need to be managed centrally by the pipeline instead of per ParallelMasterContext object.)

Verbosity of executed/loaded stages

Hi Sebastian,

I had a good experience with this pipeline code at ETH and was happy to later see that you published it in a separate repository.
We are using it here for setting up a simple data analysis pipeline (nothing to do with Synpop^^), and one thing I missed when debugging was knowing which stage is being executed/loaded at what time and with which parameters. Currently we add a logging message inside execute(), but I believe this could be done more easily and better in the library code itself.
Do you agree? If you can add this feature, or point me to where to add it, that would be great!

Best,
Davi

💡 Implement a plugin system?

WIP: This is just an idea for now...

It would be interesting to have a plugin system that allows users to replace a stage with an external one.
This could be set in the config file in a plugins section.

For example:
stage A depends on stage B. The file structure should look like this:

root
| -- A.py
| -- B.py
| -- config.yml

A.py

def configure(context):
    context.stage("B")

def execute(context):
    B = context.stage("B")
    # ...

B.py

def configure(context):
    context.config("bar")

def execute(context):
    foo = context.config("bar")
    # ...

config.yml

run:
    - A

config:
    bar: "foo"
    
plugins:
    B: /path/to/external/B.py

In this example, the external B.py is executed instead of the "internal" B.py. It is the responsibility of the external B.py to match the data structure expected from the internal B.py's outputs.

This would be interesting for big projects like https://github.com/eqasim-org/ile-de-france where some stages can be solved by various algorithms or data sources. Down the line, this would avoid specifying a lot of "selected" stages with conditional switches like in: https://github.com/eqasim-org/ile-de-france/blob/develop/data/hts/selected.py

Can it be done? How? How to manage the external dependency tree then? What if a dependency C is defined in both trees?
I don't know. I can work on it if you think it's feasible without breaking everything :)

Make possible to explicitly request "parameters"

Currently, a stage like this will obtain either a value that has directly been passed by a child stage or a value that comes from the global pipeline configuration:

def configure(context):
   value = context.config("abc")

In some cases, we don't want to rely on the global pipeline configuration (because we want to explicitly make sure that this stage is called with a well-chosen argument). Currently, there is no way to enforce that a parameter is passed from another stage rather than from the global configuration. Therefore, I'd propose to have:

def configure(context):
   value = context.config("abc", parameter = True)

In this case the option becomes a "parameter" to the stage. This requires some moving around of code internally in the pipeline, but should not be awfully complicated.

Improve configuration

I think it would be nice to clean up the configuration format a bit. I imagine something like this:

synpp:
  working_directory: /path/to/working_directory

  processes: 4

  run:
    - stage1
    - stage2
    - etc

population:
  settings:
    hts: entd
    sampling_rate: 0.05

  data_path: /path/to/my/data
  output_path: /path/to/my/output

java:
  binary: /path/to/java
  memory: 14G

There is a synpp namespace which controls the execution of the pipeline, everything else is individual to the use case. All options in the synpp namespace should be validated for consistency and lead to early errors if something is wrong.

Note that the rest is hierarchical. I would like to request the java binary in the pipeline as java.binary and other values in the form of population.settings.hts.

Pass result to validate

The validate step should have access to the result of the stage. Proposal:

context.result()

Make devalidation less sensitive in some cases

Hi !

We've been working with synpp in a project, and it's been quite useful so far. It is basically an MVP of a data processing interface. In particular, its caching is super helpful in that it avoids re-running time-consuming data processing every time. It is also quite simple and intuitive in comparison to other pipeline libraries, so also a pleasure to work with :)
Sometimes I note, however, that synpp's devalidation might be too sensitive. For example:

  • If a stage is a class, then module_hash devalidates the stage even if the changes happened outside the class (elsewhere in the same module). An option here would be to hash inspect.getsource(foo) whenever the stage is a class, and to rename module_hash to stage_hash. This happens often here since we defined several stages as classes (this avoids an excess of files and is helpful for setting fields in configure(), e.g. self.foo = context.config('bar')).
  • Some config options don't break the calculations, but may be included to control output or verbosity. It would be very useful to have something like context.config('write_output', ignore_in_devalidation=True).
  • Sometimes asking synpp for a stage doesn't mean necessarily we want to re-run it, i.e. the cache would suffice. Something like synpp.run(descriptors, rerun_cached=False) would also be nice.

What do you think? I could try to implement/test here (although I must admit time is not abundant at the moment...). I'd have an idea how to start, except for the second case where I'm not sure how the hash is checked there.

Best,
Davi

Increment pickle protocol to 4

There are currently 6 different protocols which can be used for pickling. The higher the protocol used, the more recent the version of Python needed to read the pickle produced.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. It is the default protocol starting with Python 3.8.

Thus, we need to specify to use this version in our pipeline to be able to deal with potentially large objects.
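For reference, pinning the protocol when pickling would look roughly like this (a generic pickle sketch, not synpp's actual caching code):

import pickle

def dump_with_protocol_4(obj, path):
    # Protocol 4 (available since Python 3.4) supports objects larger than 4 GiB
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol = 4)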

Example broken: object has no attribute 'reset'

The nyc_taxi example crashes because it calls the progress.reset function which is no longer present in the current version of synpp. (See DownloadProgress in download.py).

AttributeError: 'ProgressContext' object has no attribute 'reset'

403 Error while trying to run NYC example

Hey,

I recently tried running the NYC example from the repository to better understand synpp. However, I'm getting a 403 error while following the steps to run the NYC data from the README.

Could you please advise me on resolving this error?

Thanks in advance.

Lazy-load results of other stages

Right now the pipeline loads all the data of dependency stages. However, it would be much better if this worked in a lazy manner, i.e. only loading stuff when it is actually needed. We can even print warnings if stages are requested that are never used.
