
Comments (3)

1025KB commented on August 27, 2024

The previous execution results are cached if nothing has changed. If you check the runtime of each component, it should be shorter than on the first run, because a cached run skips the executor part within each component.
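
To make the "skip the executor on a cache hit" idea concrete, here is a minimal sketch. It is not TFX's implementation: the in-memory `_cache` dict, `_fingerprint`, and `run_component` are hypothetical names invented for illustration, and a real pipeline would consult its metadata store instead.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for a metadata store; not TFX's actual API.
_cache = {}


def _fingerprint(component_name, inputs, params):
    """Builds a stable key from everything that should invalidate the cache."""
    payload = json.dumps(
        {"component": component_name, "inputs": inputs, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_component(component_name, inputs, params, executor):
    """Runs `executor` only when no cached result exists for these inputs.

    Returns (outputs, executed), where `executed` is False on a cache hit.
    """
    key = _fingerprint(component_name, inputs, params)
    if key in _cache:
        # Cache hit: reuse the prior outputs and skip the expensive executor.
        return _cache[key], False
    outputs = executor(inputs, params)
    _cache[key] = outputs
    return outputs, True
```

Calling `run_component` twice with identical arguments runs the executor only the first time; changing any input or parameter produces a new key and a fresh execution.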

from tfx.

krazyhaas commented on August 27, 2024

Hi Matt. There are two possibilities.

  1. Do you have 'enable_cache=True' set in the PipelineDecorator? It must be enabled for a component to reuse artifacts created by prior runs. Also, caching is not supported with BigQuery.

     @PipelineDecorator(
         pipeline_name='chicago_taxi_simple',
         enable_cache=True,
         metadata_db_root=_metadata_db_root,

  2. However, I expect you are experiencing a different issue. Airflow will dutifully run the components even if the artifacts have been cached. When you look at the Airflow DAG in the UI, you only see the component, not the internal decisions made by the component. If you look at the Airflow logs, you should see a message that the executor (the module that does the actual work) is skipped.

Example log record from uncached run:

[2019-03-11 15:58:12,494] {logging_mixin.py:95} INFO - [2019-03-11 15:58:12,494] {airflow_adapter.py:136} INFO - No cached execution found. Starting executor.
[2019-03-11 15:58:12,494] {python_operator.py:104} INFO - Done. Returned value was: chicago_taxi_simple.Trainer.exec
[2019-03-11 15:58:12,495] {python_operator.py:132} INFO - Following branch chicago_taxi_simple.Trainer.exec

Example log record from cached run:

[2019-03-11 16:09:28,516] {logging_mixin.py:95} INFO - [2019-03-11 16:09:28,516] {base_driver.py:164} INFO - Found cache from previous run.
[2019-03-11 16:09:28,516] {logging_mixin.py:95} INFO - [2019-03-11 16:09:28,516] {airflow_adapter.py:118} INFO - All artifacts found. Publishing to pipeline and skipping executor.
[2019-03-11 16:09:28,526] {python_operator.py:104} INFO - Done. Returned value was: chicago_taxi_simple.Trainer.publishcache
[2019-03-11 16:09:28,526] {python_operator.py:132} INFO - Following branch chicago_taxi_simple.Trainer.publishcache
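
If you want to check this across many task logs rather than by eye, a small helper along these lines could do it. This is only a sketch: `classify_run` and `component_branch` are hypothetical names, and the message strings are copied from the log records above, so it assumes your version emits the same wording.

```python
import re

# Message fragments copied from the log records above.
CACHE_HIT = "Publishing to pipeline and skipping executor"
CACHE_MISS = "No cached execution found. Starting executor."

# The branch name after "Following branch" identifies the component,
# e.g. "chicago_taxi_simple.Trainer.exec" -> ("Trainer", "exec").
_BRANCH = re.compile(r"Following branch (\S+)\.(\w+)\.(exec|publishcache)\b")


def classify_run(log_text):
    """Returns 'cached', 'uncached', or 'unknown' for one task's log text."""
    if CACHE_HIT in log_text:
        return "cached"
    if CACHE_MISS in log_text:
        return "uncached"
    return "unknown"


def component_branch(log_text):
    """Returns (component, branch) from the 'Following branch' line, or None."""
    match = _BRANCH.search(log_text)
    if match is None:
        return None
    return match.group(2), match.group(3)
```

Feeding it the text of one Airflow task log tells you whether that component's executor ran or was skipped, which the DAG view alone does not show.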

On the small dataset, the executors finish almost as quickly as skipping them entirely, which is why you won't see an observable difference between cached and uncached runs. With a bigger dataset, say 100M rows, you should see a difference with caching, because the bulk of the time will be spent in the executor and relatively less between components (i.e., Airflow scheduler polling).
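
A rough way to see why: the time between components is paid on every run, and only the executor time is saved by caching. The sketch below uses made-up timings (30 s of overhead, 3 s versus 3000 s of executor work) purely to show the shape of the effect; `cached_speedup` is a hypothetical helper, not a measurement of any real pipeline.

```python
def cached_speedup(overhead_s, executor_s):
    """Ratio of uncached to cached wall time for one component.

    overhead_s: time spent outside the executor (scheduling, polling, driver),
                paid on both cached and uncached runs.
    executor_s: time spent inside the executor, skipped on a cached run.
    """
    if overhead_s <= 0:
        raise ValueError("overhead_s must be positive")
    return (overhead_s + executor_s) / overhead_s


# Hypothetical numbers, chosen only to illustrate the effect.
small = cached_speedup(overhead_s=30.0, executor_s=3.0)      # about 1.1x
large = cached_speedup(overhead_s=30.0, executor_s=3000.0)   # about 101x
```

With the small executor time the cached run is barely faster; with the large one the saving dominates.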

PRs to help clarify this in the docs would definitely be welcomed!

-K


MattMorgis commented on August 27, 2024

Thank you for that great explanation @krazyhaas!

> However, I expect you are experiencing a different issue. Airflow will dutifully run the components even if the artifacts have been cached.

This is precisely what we were experiencing.

I'm going to close this issue, but we should have a few incoming PRs by the end of the week as we continue to work through it.
