tensorflow / transform

Input pipeline framework

License: Apache License 2.0

Python 100.00%

transform's Introduction

TensorFlow Transform

TensorFlow Transform is a library for preprocessing data with TensorFlow. tf.Transform is useful for data that requires a full-pass, such as:

Normalize an input value by mean and standard deviation.
Convert strings to integers by generating a vocabulary over all input values.
Convert floats to integers by assigning them to buckets based on the observed data distribution.

TensorFlow has built-in support for manipulations on a single example or a batch of examples. tf.Transform extends these capabilities to support full-passes over the example data.
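The split between a full pass and a per-example step can be illustrated in plain Python. This is only a sketch of the idea, not the tf.Transform API: the helper names (analyze_numeric, z_score, analyze_vocabulary) and the sample data are made up for illustration, and ordering the vocabulary by descending frequency is an assumption of the sketch.

```python
import statistics
from collections import Counter

def analyze_numeric(values):
    """Full pass: these statistics need every example before they are known."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}

def z_score(value, stats):
    """Per-example step: only uses constants produced by the full pass."""
    return (value - stats["mean"]) / stats["std"]

def analyze_vocabulary(strings):
    """Full pass: assign integer ids to strings, most frequent first."""
    counts = Counter(strings)
    return {token: index for index, (token, _) in enumerate(counts.most_common())}

# Hypothetical sample data.
raw_numbers = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
raw_strings = ["hello", "hello", "test", "random"]

# Analyze phase: one pass over the whole dataset.
stats = analyze_numeric(raw_numbers)
vocab = analyze_vocabulary(raw_strings)

# Transform phase: each example is mapped independently using the constants.
z_scores = [z_score(v, stats) for v in raw_numbers]
token_ids = [vocab[s] for s in raw_strings]
```

Once the analyze phase has run, the transform phase needs only the stored constants, which is why the same transformation can be replayed at serving time on a single example.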

The output of tf.Transform is exported as a TensorFlow graph to use for training and serving. Using the same graph for both training and serving can prevent skew since the same transformations are applied in both stages.

For an introduction to tf.Transform, see the tf.Transform section of the TFX Dev Summit talk on TFX (link).

Installation

The tensorflow-transform PyPI package is the recommended way to install tf.Transform:

pip install tensorflow-transform

Build TFT from source

To build from source, first create a virtual environment, then clone the repository and build the wheel by running the following commands:

python3 -m venv <virtualenv_name>
source <virtualenv_name>/bin/activate
pip3 install setuptools wheel
git clone https://github.com/tensorflow/transform.git
cd transform
python3 setup.py bdist_wheel

This builds the TFT wheel in the dist directory. To install the wheel from the dist directory, run the following commands:

cd dist
pip3 install tensorflow_transform-<version>-py3-none-any.whl

Nightly Packages

TFT also hosts nightly packages at https://pypi-nightly.tensorflow.org on Google Cloud. To install the latest nightly package, please use the following command:

pip install --extra-index-url https://pypi-nightly.tensorflow.org/simple tensorflow-transform

This also installs the nightly packages for the major dependencies of TFT, such as TensorFlow Metadata (TFMD) and TFX Basic Shared Libraries (TFX-BSL).

Notable Dependencies

TensorFlow is required.

Apache Beam is required; it's the way that efficient distributed computation is supported. By default, Apache Beam runs in local mode but can also run in distributed mode using Google Cloud Dataflow and other Apache Beam runners.

Apache Arrow is also required. TFT uses Arrow to represent data internally in order to make use of vectorized numpy functions.

Compatible versions

The following table lists the tf.Transform package versions that are compatible with each other. These combinations are determined by our testing framework, but other untested combinations may also work.

tensorflow-transform	apache-beam[gcp]	pyarrow	tensorflow	tensorflow-metadata	tfx-bsl
GitHub master	2.47.0	10.0.0	nightly (2.x)	1.15.0	1.15.1
1.15.0	2.47.0	10.0.0	2.15	1.15.0	1.15.1
1.14.0	2.47.0	10.0.0	2.13	1.14.0	1.14.0
1.13.0	2.41.0	6.0.0	2.12	1.13.1	1.13.0
1.12.0	2.41.0	6.0.0	2.11	1.12.0	1.12.0
1.11.0	2.41.0	6.0.0	1.15.5 / 2.10	1.11.0	1.11.0
1.10.0	2.40.0	6.0.0	1.15.5 / 2.9	1.10.0	1.10.0
1.9.0	2.38.0	5.0.0	1.15.5 / 2.9	1.9.0	1.9.0
1.8.0	2.38.0	5.0.0	1.15.5 / 2.8	1.8.0	1.8.0
1.7.0	2.36.0	5.0.0	1.15.5 / 2.8	1.7.0	1.7.0
1.6.1	2.35.0	5.0.0	1.15.5 / 2.8	1.6.0	1.6.0
1.6.0	2.35.0	5.0.0	1.15.5 / 2.7	1.6.0	1.6.0
1.5.0	2.34.0	5.0.0	1.15.2 / 2.7	1.5.0	1.5.0
1.4.1	2.33.0	4.0.1	1.15.2 / 2.6	1.4.0	1.4.0
1.4.0	2.33.0	4.0.1	1.15.2 / 2.6	1.4.0	1.4.0
1.3.0	2.31.0	2.0.0	1.15.2 / 2.6	1.2.0	1.3.0
1.2.0	2.31.0	2.0.0	1.15.2 / 2.5	1.2.0	1.2.0
1.1.1	2.29.0	2.0.0	1.15.2 / 2.5	1.1.0	1.1.1
1.1.0	2.29.0	2.0.0	1.15.2 / 2.5	1.1.0	1.1.0
1.0.0	2.29.0	2.0.0	1.15 / 2.5	1.0.0	1.0.0
0.30.0	2.28.0	2.0.0	1.15 / 2.4	0.30.0	0.30.0
0.29.0	2.28.0	2.0.0	1.15 / 2.4	0.29.0	0.29.0
0.28.0	2.28.0	2.0.0	1.15 / 2.4	0.28.0	0.28.1
0.27.0	2.27.0	2.0.0	1.15 / 2.4	0.27.0	0.27.0
0.26.0	2.25.0	0.17.0	1.15 / 2.3	0.26.0	0.26.0
0.25.0	2.25.0	0.17.0	1.15 / 2.3	0.25.0	0.25.0
0.24.1	2.24.0	0.17.0	1.15 / 2.3	0.24.0	0.24.1
0.24.0	2.23.0	0.17.0	1.15 / 2.3	0.24.0	0.24.0
0.23.0	2.23.0	0.17.0	1.15 / 2.3	0.23.0	0.23.0
0.22.0	2.20.0	0.16.0	1.15 / 2.2	0.22.0	0.22.0
0.21.2	2.17.0	0.15.0	1.15 / 2.1	0.21.0	0.21.3
0.21.0	2.17.0	0.15.0	1.15 / 2.1	0.21.0	0.21.0
0.15.0	2.16.0	0.14.0	1.15 / 2.0	0.15.0	0.15.0
0.14.0	2.14.0	0.14.0	1.14	0.14.0	n/a
0.13.0	2.11.0	n/a	1.13	0.12.1	n/a
0.12.0	2.10.0	n/a	1.12	0.12.0	n/a
0.11.0	2.8.0	n/a	1.11	0.9.0	n/a
0.9.0	2.6.0	n/a	1.9	0.9.0	n/a
0.8.0	2.5.0	n/a	1.8	n/a	n/a
0.6.0	2.4.0	n/a	1.6	n/a	n/a
0.5.0	2.3.0	n/a	1.5	n/a	n/a
0.4.0	2.2.0	n/a	1.4	n/a	n/a
0.3.1	2.1.1	n/a	1.3	n/a	n/a
0.3.0	2.1.1	n/a	1.3	n/a	n/a
0.1.10	2.0.0	n/a	1.0	n/a	n/a

Questions

Please direct any questions about working with tf.Transform to Stack Overflow using the tensorflow-transform tag.

transform's Issues

Using `scale_to_0_1` on list of numbers

Given a dataset with a numerical column containing a list of numbers, is it possible to normalize each element of the list in the [0,1] range?

I successfully used the 'scale_to_0_1' function to normalize numerical columns containing a single value, but cannot apply it to a list of numbers.

On the following toy dataset

student id, previous grades
1, [0, 5, 6]
2, [7, 8, 10]

I would like to obtain the following transformed dataset

student id, previous grades
1, [0, 0.5, 0.6]
2, [0.7, 0.8, 1]
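The arithmetic behind the desired output can be sketched in plain Python, assuming the minimum and maximum are taken over every element of every list (here 0 and 10). This only illustrates the expected result; it does not show how tf.Transform handles list-valued columns, and the variable names are made up for illustration.

```python
# Toy dataset from the question: student id -> previous grades.
rows = {1: [0, 5, 6], 2: [7, 8, 10]}

# Analyze: one pass over every element of every list.
all_values = [v for grades in rows.values() for v in grades]
lo, hi = min(all_values), max(all_values)

# Transform: scale each element with the global minimum and maximum.
scaled = {
    student_id: [(v - lo) / (hi - lo) for v in grades]
    for student_id, grades in rows.items()
}
```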

"Table not initialized" error

Consider the simple example:

graph = tf.Graph()
with graph.as_default():
    comma_separated = tf.constant(["hello, hello, test, random"], dtype=tf.string)
    words = tf.string_split(comma_separated, delimiter = ", ")
    indices = tft.string_to_int(words, top_k = 1)
    
sess = tf.InteractiveSession(graph=graph)
print(sess.run([indices]))

Results in the error:

FailedPreconditionError (see above for traceback): Table not initialized.
	 [[Node: hash_table_Lookup = LookupTableFind[Tin=DT_STRING, Tout=DT_INT64, _class=["loc:@string_to_index/hash_table"], _device="/job:localhost/replica:0/task:0/cpu:0"](string_to_index/hash_table, StringSplit:1, string_to_index/hash_table/Const)]]

I just ran pip install tensorflow-transform this morning so I believe tf is up to date. I'm guessing that maybe I am not up to date with the most recent tensorflow version that is needed? I am using tensorflow 1.2 because Google Cloud ML Engine does not yet support tensorflow 1.3. Is this the reason for the issue? I am trying to use TFT as a preprocessing script for my GCMLE experiment.

I separately got a warning that may also be related to the issue:

"WARNING:tensorflow:From /Users/user/anaconda/lib/python2.7/site-packages/tensorflow_transform/mappers.py:305: string_to_index_table_from_tensor (from tensorflow.contrib.lookup.lookup_ops) is deprecated and will be removed after 2017-04-10.
Instructions for updating:
Use index_table_from_tensor"

Any ideas what may be causing the issue?

NaN values lead to unexpected results in scale_to_z_score

The current implementations of mean and variance in analyzers.py don't seem to be able to handle NaN values. This results in the scale_to_z_score function giving unexpected results if the data contains any NaNs. For example:

import tempfile
import tensorflow as tf
import numpy as np
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

size = 10000
values = np.random.normal(10.0, 10.0, size=size) # original has a mean of 10 and std of 10
values[np.random.rand(size) < 0.2] = np.nan  # but 20% are NaN

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    x_zscore = tft.scale_to_z_score(x)
    return {
        'x_zscore': x_zscore,
     }

raw_data = [{'x': v} for v in values]
raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
    'x': dataset_schema.ColumnSchema(
        tf.float32, [], dataset_schema.FixedColumnRepresentation())
}))

with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn)
    )
    transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

transformed = np.array([d['x_zscore'] for d in transformed_data])


print(np.nanmean(transformed))  # not close to zero
print(np.nanvar(transformed))  # not close to one

You end up with the transformed data having a mean and variance much greater than 0 and 1 respectively. I am pretty sure this is happening because tf.reduce_sum is giving NaN for all batches that contain at least one NaN value here: https://github.com/tensorflow/transform/blob/master/tensorflow_transform/tf_utils.py#L99

I just closed a PR [#68] that fixed this in an older implementation. I'd be happy to make a PR to fix this in the current implementation as well. I think a reasonable default behavior would be to take the mean and variance of all non-NaN values and then set any NaNs to zero.
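The proposed default behavior can be sketched in plain Python. This is not the analyzers.py implementation; nan_z_score is a hypothetical helper, and the sketch assumes at least one non-NaN value with a non-zero spread.

```python
import math

def nan_z_score(values):
    """Z-score using statistics of the non-NaN values; NaNs map to 0.0."""
    finite = [v for v in values if not math.isnan(v)]
    mean = sum(finite) / len(finite)
    variance = sum((v - mean) ** 2 for v in finite) / len(finite)
    std = math.sqrt(variance)
    return [0.0 if math.isnan(v) else (v - mean) / std for v in values]

nan = float("nan")
data = [2.0, nan, 4.0, 4.0, 4.0, 5.0, 5.0, nan, 7.0, 9.0]

# A plain sum is poisoned by a single NaN, which is the failure mode described above.
naive_sum = sum(data)

scaled = nan_z_score(data)
```

Mapping NaNs to zero places them at the mean of the scaled distribution, which is one possible convention rather than the only reasonable one.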

Support for (weighted) N-hot encoder

I am working on a use case where we have a categorical feature associated with a weight. We would like to transform it to a "weighted N-Hot encoder". The use case is the following:
In the field of music taste, we have a feature called "genre affinities".
For one example, it could look like this:

{
"rock": 0.75,
"pop": 0.12,
"rap": 0.88
}

There could be a variable number of entries (possibly 0).

The "vocabulary" of possible keys/genres is not explicitly known beforehand.
We would like a set of analyzers and transformers that compute the vocabulary of possible keys and then build a sparse tensor representing this N-hot encoded feature, with the weights in place of ones.

For example, given the previous example and a second one presented this way:

{
"rock": 0.13,
"latin": 0.96,
"k-pop": 0.76,
"country": 0.08
}

then an analyzed and transformed batch of these would give:

tf.SparseTensor(
    indices = [[0, 0], [0, 4], [0, 5], [1, 0], [1, 1], [1, 2], [1, 3]],
    values = [0.75, 0.12, 0.88, 0.13, 0.96, 0.76, 0.08], 
    dense_shape = [2, 6])

We're not sure how to build a tf.Example to present the input, since this transformation does not exist yet.
Any ideas of how that could be done?

A basic N-hot encoder would be great to have as well (the same thing, but with only ones as the values).
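The requested behavior can be sketched in plain Python as two steps: a full pass that builds the vocabulary, then a per-batch step that emits indices, values and a dense shape. The helper names are hypothetical, and the vocabulary here is ordered by first appearance, so the column ids differ from the ordering used in the expected output above.

```python
examples = [
    {"rock": 0.75, "pop": 0.12, "rap": 0.88},
    {"rock": 0.13, "latin": 0.96, "k-pop": 0.76, "country": 0.08},
]

def build_vocabulary(batch):
    """Full pass: give each distinct key an id in order of first appearance."""
    vocab = {}
    for example in batch:
        for key in example:
            vocab.setdefault(key, len(vocab))
    return vocab

def to_sparse(batch, vocab, weighted=True):
    """Return (indices, values, dense_shape) for a weighted or plain N-hot batch."""
    indices, values = [], []
    for row, example in enumerate(batch):
        # Sort by column id so the indices are in row-major order.
        for key, weight in sorted(example.items(), key=lambda kv: vocab[kv[0]]):
            indices.append([row, vocab[key]])
            values.append(weight if weighted else 1.0)
    return indices, values, [len(batch), len(vocab)]

vocab = build_vocabulary(examples)
indices, values, dense_shape = to_sparse(examples, vocab)
plain_indices, plain_values, _ = to_sparse(examples, vocab, weighted=False)
```

An example with zero entries simply contributes no indices, which covers the "possibly 0" case mentioned above.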

How can we export the statistics about the dataset?

One typical use case for transformation is to compute summary statistics about the dataset (e.g. the quantile values for the continuous features) and then, at training time, bucketize the continuous features using those quantile values.

But it seems like the statistics object is not implemented.

How can we use the statistics?
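The workflow being asked about can be sketched with the Python standard library alone: compute quantile boundaries in one pass, serialize them, and reuse them later to bucketize. This is not a tf.Transform feature; the sample values, the JSON format, and the bucketize helper are assumptions made for illustration.

```python
import bisect
import json
import statistics

# Hypothetical continuous feature observed during the analysis pass.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Full pass: three cut points split the data into four buckets.
boundaries = statistics.quantiles(train_values, n=4)

# "Export" the statistics so that training code can load them later.
exported = json.dumps({"boundaries": boundaries})

def bucketize(value, cut_points):
    """Return the index of the bucket that value falls into."""
    return bisect.bisect_right(cut_points, value)

# At training time: load the exported boundaries and apply them per example.
loaded = json.loads(exported)["boundaries"]
buckets = [bucketize(v, loaded) for v in [1.0, 3.0, 5.0, 8.0, 100.0]]
```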

Error when dataset_schema.FixedColumnRepresentation has default_value == 0

It's caused by this piece of code:

if spec.default_value is not None:
      raise ValueError(
          'feature "{}" had default_value {}, but FixedLenFeature must have '
          'default_value=None'.format(name, spec.default_value))

Link to source

What's the rationale behind enforcing default_value=None? Usually a default value corresponding to the data type should be OK. It seems like there is a bug in the condition and it should be without not:
if spec.default_value is None:
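To see why a default of 0 trips the quoted check, here is a tiny stand-alone sketch of the two conditions. Both functions are hypothetical stand-ins for illustration, not code from the library.

```python
def quoted_check_raises(default_value):
    """Mirrors the quoted condition: any value other than None is rejected."""
    return default_value is not None

def suggested_check_raises(default_value):
    """Mirrors the suggested condition: only a missing default is rejected."""
    return default_value is None

# 0 is falsy, but it is still a value other than None.
results = {
    "quoted, default 0": quoted_check_raises(0),
    "quoted, default None": quoted_check_raises(None),
    "suggested, default 0": suggested_check_raises(0),
    "suggested, default None": suggested_check_raises(None),
}
```

Under the quoted condition a default of 0 is treated like any other explicit default and rejected; dropping the "not" would instead reject columns that have no default at all.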

support for regular expressions

Hi all,
I would like to use regular expressions as part of preprocessing text data - so that it is usable by serving input function. I did quite a bit of searching, but could not find anything. Is this something currently not supported? If it is, I would appreciate your help.

installing on Datalab is not possible due to pytz incompatibility

I created a Datalab instance in GCP and installed tensorflow-gpu.
When I tried to install tensorflow-transform, I received the following error:

Installing collected packages: pytz, monotonic, fasteners, google-apitools, proto-google-cloud-pubsub-v1, google-cloud-core, google-cloud-bigquery, grpc-google-iam-v1, ply, google-gax, gapic-google-cloud-pubsub-v1, google-cloud-pubsub, apache-beam, tensorflow-transform
Found existing installation: pytz 2016.7
Cannot uninstall 'pytz'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Location of data for examples?

The census and sentiment examples require data files that aren't located in this git repo, and there's no documentation (at least not that I saw) about where to find them.

Should there be a test data folder? Or if the data is too big, a cloud bucket perhaps?

tf transform's "exported model" use in tf serving issue

I ran census_example.py and checked the exported model files.
https://github.com/tensorflow/transform/blob/master/examples/census_example.py

I can see the ServingInputReceiver below:
print("serving_input_receiver:", serving_input_receiver)
==>
ServingInputReceiver(features={
'capital-loss': <tf.Tensor 'ParseExample/ParseExample:2' shape=(?,) dtype=float32>,
'relationship': <tf.Tensor 'ParseExample/ParseExample:10' shape=(?,) dtype=string>,
'age': <tf.Tensor 'ParseExample/ParseExample:0' shape=(?,) dtype=float32>,
...
receiver_tensors={'examples': <tf.Tensor 'input_example_tensor:0' shape=(?,) dtype=string>}, receiver_tensors_alternatives=None)

Then I ran TensorFlow Serving using the model in the exported model directory.
When I sent a curl request, I got an error:

curl -d '{"instances": [{"age":50, "workclass":"Self-emp-not-inc", "education":"Bachelors", "education-num":13, "marital-status":"Married-civ-spouse", "occupation":"Exec-managerial", "relationship":"Husband", "race":"White", "sex":"Male", "capital-gain":0, "capital-loss":0, "hours-per-week":13, "native-country":"United-States"}]}'
-X POST http://localhost:8501/v1/models/census:predict

{ "error": "Failed to process element: 0 key: age of 'instances' list. Error: Invalid argument: JSON object: does not have named input: age" }

How should I call the model?

chain SavedModel in tf.Transform using Beam

Can we use the pretrained_models.apply_saved_model() function to chain a transform_fn?
How can the SavedModel be combined into the transformation graph?

Basically, I want to chain models: I have one SavedModel (the 1st model), and I want to train a 2nd model based on the output of the 1st. My current idea is to combine the 1st model (a SavedModel) into the 2nd model's preprocessing pipeline. The target model (1st model + 2nd model) will be exported for final TF Serving.

I generated the combined transform_fn, and even the final combined SavedModel. But testing the final model's serving function throws an error: ValueError: Attempted to map inputs that were not found in graph_def: [input_example_tensor:0]

Also, the final model graph (visualized in TensorBoard) is overly complicated (not what I expected).

The code I used to combine the SavedModel into the 2nd model's preprocessing pipeline (using Beam):

        input_function = lambda inputs: self.transform_fn(inputs, self.params) # which call pretrained_models.apply_saved_model
        self.tensorflow_transform = (dataset.pcollection, dataset.dataset_schema.convert_dict_to_tft_schema()) | \
                                    self.transform_fn.func_name + "GetTransformationFunction" >> \
                                    impl.AnalyzeDataset(input_function)
        _ = self.tensorflow_transform | self.transform_fn.func_name + "WriteTransformFn" >> \
            transform_fn_io.WriteTransformFn(path=transform_export_path)

pip install does not work on Python 3

I can confirm that pip install tensorflow-transform goes through fine on Python 2.7.13. The following error message shows up for Python 3.5.3 and Python 3.6.0, however:

Could not find a version that satisfies the requirement tensorflow-transform (from versions: ) No matching distribution found for tensorflow-transform

I see that the package on PyPI is for Python 2 only. Is this intentional? If so, will there be an upcoming Python 3 release in the near future?

Transform without having to write to disk

Is it possible to transform features on the fly while feeding the training input pipeline, similar to how 'normalizer_fn' works in tf.feature_column.numeric_column? In all the examples I've seen, everything is transformed, written to disk, and then read back. I'd like to be able to use (or wrap) the PCollection as a tf.data.Dataset.

Python 3 syntax error in tensorflow_transform/beam/impl_test.py

See #1 (comment)

Migrate examples and documentation to TF core

Is there some reason these examples are still using tf.contrib.learn?
e.g. https://github.com/tensorflow/transform/blob/master/examples/census_example.py#L31

(seems like that's not going to help demystify tft for users..)

Update: similarly re: mentions of "tf.Learn" in the docs, e.g. here: https://github.com/tensorflow/transform/blob/master/getting_started.md
I think this will just confuse people.

'tft.coders' usage in examples doesn't work with packaged release?

In trying to run the examples, I get an error accessing tft.coders, which I see was added to the samples 15 days ago. I'm using the 'pip install' latest version of TFT, which is 0.6.
What version is necessary to run the examples now? (Am I missing something?)

(my dev rel opinion: user-facing examples should sync with the publicly released packages so that the examples always work).

Cannot process dataset with 800+ features: Job graph is too large

Hi,

I'm trying to submit a job to process a dataset (~850 features) in Cloud Dataflow.
The preprocessing_fn looks like this:

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    outputs = {}

    for key in _discrete_features(): # 395
        x = inputs[key]
        tft.uniques(tf.as_string(x), vocab_filename=key, store_frequency=True)
        outputs[key] = tft.scale_to_z_score(x)

    for key in _continuous_features(): # 216
        x = inputs[key]
        nanmean = t.nanmean(x)
        x = tf.where(tf.is_nan(x), tf.fill(tf.shape(x), nanmean), x)
        outputs[key] = tft.scale_to_z_score(x)

    for key in _float_features(): # 59
        outputs[key] = tft.scale_to_z_score(inputs[key])

    for key in _string_features(): # 191
        outputs[key] = tft.string_to_int(inputs[key], vocab_filename=key)

    outputs[LABEL_KEY] = inputs[LABEL_KEY]

    return outputs

After a few minutes the job submission fails claiming that "The job graph is too large."
Has anyone seen this before? How can I work around it?

Detailed logs below:

INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Finished the size estimation of the input at 1 files. Estimation took 0.172410011292 seconds
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/11f97ee8c0fd4197bca9d2c0361ebf49/saved_model.pb
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/11f97ee8c0fd4197bca9d2c0361ebf49/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/66e87beb6715476288308d47028d92e6/saved_model.pb
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/66e87beb6715476288308d47028d92e6/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/5493bd9dd95845aba7924d331a416bf3/saved_model.pb
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/5493bd9dd95845aba7924d331a416bf3/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/48aa71f6a4664a6da46ded569656bb7c/saved_model.pb
INFO:tensorflow:SavedModel written to: gs://bucket/datasets/v2.0-rc4/tmp/tftransform_tmp/48aa71f6a4664a6da46ded569656bb7c/saved_model.pb
INFO:root:Starting the size estimation of the input
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Finished the size estimation of the input at 10 files. Estimation took 0.0985808372498 seconds
INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pipeline.pb...
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pipeline.pb
INFO:root:Executing command: ['/home/user/npd/venv/bin/python', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmpt5flm5']

(...)

INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/workflow.tar.gz...
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/workflow.tar.gz
INFO:root:Starting GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pickled_main_session...
INFO:root:Completed GCS upload to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/pickled_main_session
INFO:root:Staging the SDK tarball from PyPI to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/dataflow_python_sdk.tar
INFO:root:Executing command: ['/home/user/npd/venv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpt5flm5', 'google-cloud-dataflow==2.4.0', '--no-binary', ':all:', '--no-deps']
Collecting google-cloud-dataflow==2.4.0
  Using cached https://files.pythonhosted.org/packages/3b/6b/165eb940a26b16ee27cee2643938e23955c54f6042e7e241b2d6afea8cea/google-cloud-dataflow-2.4.0.tar.gz
  Saved /tmp/tmpt5flm5/google-cloud-dataflow-2.4.0.tar.gz
Successfully downloaded google-cloud-dataflow
INFO:root:file copy from /tmp/tmpt5flm5/google-cloud-dataflow-2.4.0.tar.gz to gs://bucket/df/data/tft-top-100-20180514202806.1526329784.381043/dataflow_python_sdk.tar.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
Traceback (most recent call last):
  File "preprocess/transform_v2.py", line 62, in <module>
    main()
  File "preprocess/transform_v2.py", line 57, in main
    transform_data(pipeline_options, known_args.input_dir, known_args.output_dir, top_features)
  File "preprocess/transform_v2.py", line 31, in transform_data
    | CreateSegmentDataset(segment_name, converter, output_dir, make_preprocessing_fn(top=top_features), RAW_DATA_METADATA))
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 389, in __exit__
    self.run().wait_until_finish()
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 369, in run
    self.to_runner_api(), self.runner, self._options).run(False)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 382, in run
    return self.runner.run_pipeline(self)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 324, in run_pipeline
    self.dataflow_client.create_job(self.job), self)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 475, in create_job
    return self.submit_job_description(job)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 523, in submit_job_description
    response = self._client.projects_locations_jobs.Create(request)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 643, in Create
    config, request, global_params=global_params)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 722, in _RunMethod
    return self.ProcessHttpResponse(method_config, http_response, request)
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 728, in ProcessHttpResponse
    self.__ProcessHttpResponse(method_config, http_response, request))
  File "/home/user/npd/venv/local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 599, in __ProcessHttpResponse
    http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/jobs?alt=json>: response: <{'status': '400', 'content-length': '229', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Mon, 14 May 2018 20:30:37 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "code": 400,
    "message": "(73447f6f3d9a02de): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "status": "INVALID_ARGUMENT"
  }
}

Regarding writing the files to TFRecord there is an error saying no attribute schema when the transformed schema is provided

Even when the transformed schema is provided, an error occurs while writing the transformed data to TFRecord.

The error that is thrown looks like this:

'BeamDatasetMetadata' object has no attribute 'schema' [while running 'AnalyzeAndTransformDataset/TransformDataset/ConvertAndUnbatch']

Please find the code snippet below:

raw_data = (
          pipeline
          | 'ReadTrainData' >> textio.ReadFromText(train_data_file)
          | 'FilterTrainData' >> beam.Filter(
              lambda line: line and line != 'app_category,connection_type,creative_id,day_of_week,device_size,geo,hour_of_day,num_of_connects,num_of_conversions,opt_bid,os_version')
          | 'FixCommasTrainData' >> beam.Map(
              lambda line: line.replace(', ', ','))
          | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))

      # Combine data and schema into a dataset tuple.  Note that we already used
      # the schema to read the CSV data, but we also need it to interpret
      # raw_data.

      raw_dataset = (raw_data, RAW_DATA_METADATA)
      transformed_dataset, transform_fn = (
          raw_dataset | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))

      transformed_data, transformed_metadata = transformed_dataset

      transformed_data_coder = example_proto_coder.ExampleProtoCoder(transformed_metadata.schema)

      _ = (
          transformed_data
          | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)
          | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(
              os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))

      # Now apply transform function to test data.  In this case we also remove
      # the header line from the CSV file and the trailing period at the end of
      # each line.
      raw_test_data = (
         pipeline
          | 'ReadTestData' >> textio.ReadFromText(test_data_file, skip_header_lines=1)
          | 'FixCommasTestData' >> beam.Map(
              lambda line: line.replace(', ', ','))
          | 'DecodeTestData' >> beam.Map(converter.decode))

      raw_test_dataset = (raw_test_data, RAW_DATA_METADATA)

      transformed_test_dataset = ((raw_test_dataset, transform_fn) | beam_impl.TransformDataset())
      # Don't need transformed data schema, it's the same as before.
      transformed_test_data, _ = transformed_test_dataset

      _ = (
          transformed_test_data
          | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)
          | 'WriteTestData' >> tfrecordio.WriteToTFRecord(
             os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))

      _ = (
          transform_fn
          | 'WriteTransformFn' >>
          transform_fn_io.WriteTransformFn(working_dir))

@KesterTong I hope you can help me out here, as the error seems to occur only intermittently while applying the transformation to the data.

PS:
Tensorflow version - 1.4
Tensorflow transform version - 0.4.0

string_to_int() got an unexpected keyword argument 'vocab_filename'

I am trying to use tft.string_to_int() with a vocab_filename, but when I ran this I got an error

TypeError: string_to_int() got an unexpected keyword argument 'vocab_filename'. However, the mappers.py function shown here clearly has vocab_filename listed as an argument. Is this vocab_filename no longer supported?

My vocab filename is a .txt file with each vocab word on a newline and I was hoping that this would be usable to create the string_to_int function on the fly. Is there some step that I am missing?

AttributeError: 'thread._local' object has no attribute 'state'

I'm having a thread issue with the following code. The traceback and pip freeze are below as well.

I'm on a Mac, running beam locally. Already tested for tensorflow==1.9.0 and had the same error. Any ideas?

Code

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform.beam import impl as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
import apache_beam as beam

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        'sku': tf.FixedLenFeature([], tf.string),
}))

categorical_columns = [
    'sku',
]

def preprocessing_tft_fn(inputs):
    for key in categorical_columns:
        tft.vocabulary(inputs[key], vocab_filename=key)

    return inputs

with beam.Pipeline() as p:
    raw_data = (p
        | beam.Create([{'sku': 'a'}, {'sku': 'b'}])
    )

    (transformed_data, transformed_metadata), transform_fn = (
        (raw_data, raw_data_metadata)
        | 'AnalyzeAndTransformTrain' >> tft_beam.AnalyzeAndTransformDataset(preprocessing_tft_fn)
    )

Traceback

Traceback (most recent call last):
  File "test.py", line 30, in <module>
    | 'AnalyzeAndTransformTrain' >> tft_beam.AnalyzeAndTransformDataset(preprocessing_tft_fn)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 831, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 488, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/pipeline.py", line 468, in apply
    return self.apply(transform, pvalueish)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/pipeline.py", line 504, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 193, in apply
    return m(transform, input)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 199, in apply_PTransform
    return transform.expand(input)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 862, in expand
    dataset | 'AnalyzeDataset' >> AnalyzeDataset(self._preprocessing_fn))
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 831, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 488, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/pipeline.py", line 468, in apply
    return self.apply(transform, pvalueish)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/pipeline.py", line 504, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 193, in apply
    return m(transform, input)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 199, in apply_PTransform
    return transform.expand(input)
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 717, in expand
    base_temp_dir = Context.create_base_temp_dir()
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 207, in create_base_temp_dir
    state = cls._get_topmost_state_frame()
  File "/Users/jonathangarcialima/.virtualenvs/tft_testing_env/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 200, in _get_topmost_state_frame
    if cls._thread_local.state.frames:
AttributeError: 'thread._local' object has no attribute 'state'

Pip Freeze

absl-py==0.5.0
apache-beam==2.6.0
astor==0.7.1
avro==1.8.2
backports.weakref==1.0.post1
boto==2.49.0
cachetools==2.1.0
certifi==2018.8.24
chardet==3.0.4
crcmod==1.7
dill==0.2.8.2
docopt==0.6.2
enum34==1.1.6
fasteners==0.14.1
funcsigs==1.0.2
future==0.16.0
futures==3.2.0
gapic-google-cloud-pubsub-v1==0.15.4
gast==0.2.0
google-api-core==1.5.0
google-api-python-client==1.7.4
google-apitools==0.5.20
google-auth==1.5.1
google-auth-httplib2==0.0.3
google-cloud-bigquery==0.25.0
google-cloud-core==0.25.0
google-cloud-dataflow==2.5.0
google-cloud-pubsub==0.26.0
google-cloud-storage==1.13.0
google-compute-engine==2.8.3
google-gax==0.15.16
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
googledatastore==7.0.1
grpc-google-iam-v1==0.11.4
grpcio==1.15.0
h5py==2.8.0
hdfs==2.1.0
httplib2==0.11.3
idna==2.7
Keras-Applications==1.0.6
Keras-Preprocessing==1.0.5
Markdown==2.6.11
mock==2.0.0
monotonic==1.5
numpy==1.14.5
oauth2client==4.1.3
pandas==0.23.4
pbr==4.2.0
ply==3.8
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-pubsub-v1==0.15.4
protobuf==3.6.1
psycopg2==2.7.5
pyasn1==0.4.4
pyasn1-modules==0.2.2
pydot==1.2.4
pyparsing==2.2.1
python-dateutil==2.7.3
pytz==2018.4
PyVCF==0.6.8
PyYAML==3.13
requests==2.19.1
retrying==1.3.3
rsa==4.0
sh==1.12.14
six==1.11.0
SQLAlchemy==1.2.12
tensorboard==1.10.0
tensorflow==1.10.0
tensorflow-metadata==0.9.0
tensorflow-transform==0.9.0
termcolor==1.1.0
typing==3.6.6
uritemplate==3.0.0
urllib3==1.23
Werkzeug==0.14.1

string_to_index_table_from_tensor to be deprecated warning

WARNING:tensorflow:From /Users/user/anaconda/lib/python2.7/site-packages/tensorflow_transform/mappers.py:305: string_to_index_table_from_tensor (from tensorflow.contrib.lookup.lookup_ops) is deprecated and will be removed after 2017-04-10.
Instructions for updating:
Use index_table_from_tensor

I am getting this warning after a pip install, but I don't see the deprecated function being used in the source code. Is the pip install out of date?

Imputing missing values

I'd like to use Tensorflow Transform to impute missing values in a training dataset...it seems like this should be possible, correct? I believe I should save off some analyzers to do this imputing during training, but I'm not certain how to pull out those analyzers and work with them during training. Can someone point me in the right direction?
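
The general idea can be sketched without any tf.Transform calls: one full pass over the training data computes a fill value, and the same value is then reused whenever data is transformed, for training or for serving. This is a plain-Python illustration under the assumption that missing values are represented as `None` and that mean imputation is wanted; `analyze_mean` and `impute` are hypothetical helpers, not library functions.

```python
def analyze_mean(values):
    """Full pass: mean of the values that are present (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def impute(values, fill_value):
    """Per-example step: replace each missing value with the precomputed fill value."""
    return [fill_value if v is None else v for v in values]


train = [1.0, None, 3.0, None, 8.0]
mean = analyze_mean(train)                  # computed once, on training data only
train_filled = impute(train, mean)
serving_filled = impute([None, 2.0], mean)  # the same constant is reused at serving time
print(mean, train_filled, serving_filled)
```

Reusing the constant computed on the training data, rather than recomputing it at serving time, is what keeps the two stages consistent.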

AttributeError: 'module' object has no attribute 'Context'

When I run the sentiment_example.py script, AttributeError: 'module' object has no attribute 'Context' occurs. Has anybody else encountered this problem?

Fail to pip install tensorflow-transform==0.6.0 with docker base image 'python:2.7-slim'

Previously, everything worked fine. This issue appeared very recently.

DockerFile is the following:

FROM python:2.7-slim
RUN pip install tensorflow-transform==0.6.0
...

The procedure stops with the following error:

  Running setup.py bdist_wheel for fastavro: started
  Running setup.py bdist_wheel for fastavro: finished with status 'error'
  Complete output from command /usr/local/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-xzjqeH/fastavro/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-9O9XdV --python-tag cp27:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-2.7
  creating build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/validation.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_schema_common.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_read_common.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_validation_py.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/const.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/read.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_read_py.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/schema.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_timezone.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/write.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_schema_py.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_write_py.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/__init__.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/_validate_common.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/six.py -> build/lib.linux-x86_64-2.7/fastavro
  copying fastavro/__main__.py -> build/lib.linux-x86_64-2.7/fastavro
  running build_ext
  building 'fastavro._read' extension
  creating build/temp.linux-x86_64-2.7
  creating build/temp.linux-x86_64-2.7/fastavro
  gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/local/include/python2.7 -c fastavro/_read.c -o build/temp.linux-x86_64-2.7/fastavro/_read.o
  unable to execute 'gcc': No such file or directory
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
  **Failed building wheel for fastavro**
  Running setup.py clean for fastavro

Currently, I work around this problem by using the Ubuntu base Docker image with the following installation:

RUN apt-get update
RUN apt-get -y install build-essential python-pip python2.7

More docs for apply_function needed?

I am implementing a conversion from characters to a list of character indices (int). To perform the conversion, I wrote a function convert_character to split the input string and perform a table lookup for every character.

The conversion function runs fine when I convert a tf.constant('Test string') in a TF graph. The conversion returns something like [45, 4, 18, 19, 70, 18, 19, 17, 8, 13, 6].

If I apply the same conversion method in my TF Transform example, every character gets converted individually and the output looks like this:

[{u'indicies': 45},
 {u'indicies': 4},
 {u'indicies': 18},
 {u'indicies': 19},
 {u'indicies': 70},
 {u'indicies': 18},
 {u'indicies': 19},
 {u'indicies': 17},
 {u'indicies': 8},
 {u'indicies': 13},
 {u'indicies': 6}]

I would have expected the output to look like

[{u'indicies': [45, 4, 18, 19, 70, 18, 19, 17, 8, 13, 6]},
 {u'indicies': [.....]},
]

Is the problem in my case the usage of apply_function? I noticed other users struggled with the correct use of the mapper function too (#58).

I also tried replacing apply_function with tf.map_fn, but the result was the same. This makes sense to me since map_fn is applied to every character in the string.

If anyone can point me in the right direction on how to use apply_function correctly, I am happy to extend the docs to prevent similar mistakes in the future.

Here is my example character conversion code:

import pprint
import tempfile
import numpy as np

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

def convert_character(input_string):
    input_characters = tf.string_split(input_string, delimiter="") 
    characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"\'?!.,:; '
    mapping_characters = tf.string_split([characters], delimiter="")
    table = tf.contrib.lookup.index_table_from_tensor(
        mapping=mapping_characters.values, default_value=0) 
    return table.lookup(input_characters.values)

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""    
    return {'indicies': tft.apply_function(convert_character, inputs['sentence'])}


def main():
    
    raw_data = [
      {'sentence': 'Test string'},
      {'sentence': 'String Test'},
    ]

    raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema({
      'sentence': dataset_schema.ColumnSchema(
          tf.string, [], dataset_schema.FixedColumnRepresentation())
    }))

    with beam_impl.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
            (raw_data, raw_data_metadata) | beam_impl.AnalyzeAndTransformDataset(
                preprocessing_fn))

        transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

        pprint.pprint(transformed_data)

main()
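
To make the expected result concrete, here is a plain-Python sketch of the per-sentence conversion, with no TensorFlow involved. It reuses the character table and the default value of 0 from the code above; `CHAR_TO_INDEX` and `convert_sentence` are hypothetical names used only for this illustration.

```python
# Same character table as in convert_character above.
CHARACTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"\'?!.,:; '
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(CHARACTERS)}


def convert_sentence(sentence, default_value=0):
    """Return one list of indices per sentence; unknown characters map to default_value."""
    return [CHAR_TO_INDEX.get(ch, default_value) for ch in sentence]


raw_data = [{'sentence': 'Test string'}, {'sentence': 'String Test'}]
expected = [{'indicies': convert_sentence(row['sentence'])} for row in raw_data]
print(expected[0])  # {'indicies': [45, 4, 18, 19, 70, 18, 19, 17, 8, 13, 6]}
```

Each input row yields exactly one output row holding a list, which is the shape the pipeline output above is missing.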

support for reduce_instance_dims in quantiles

Please support reduce_instance_dims in quantiles, just as for min, max, etc. in analyzers.py.

Request: nightly/dev release

Could we please have a nightly/dev release? FYI, using a git reference in pip requirements seems to fail on the Dataflow runner. Overall it seems like a good idea to make it easier to try new features. What do you think?

train_transformed* does not exist after 'WriteTrainData'

I am trying to use TF Transform to save a model and restore it.

I wrote this function (copied from the examples):

    _ = (
        transformed_train_data
        | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(
            transformed_train_filebase,
            coder=example_proto_coder.ExampleProtoCoder(
                transformed_metadata.schema)))

    ...
    _ = (transformed_metadata
         | 'WriteTransformedMetadata' >> beam_metadata_io.WriteMetadata(transformed_metadata_dir, pipeline=pipeline))

After starting the script I see this error: ValueError: No files match /tmp/aclImdb/tmpm6E1MH/train_transformed*.

The file really does not exist, but raw and metadata do:

aclImdb/tmpm6E1MH$ ls -a
. .. metadata raw

What is the problem? How can I debug the WriteTrainData step?

apply tft.TFTransformOutput.transformed_feature_spec to tf.data

dataset = tf.contrib.data.make_batched_features_dataset(
    file_pattern=transformed_examples,
    batch_size=batch_size,
    features=tf_transform_output.transformed_feature_spec(),
    reader=tf.data.TFRecordDataset,
    shuffle=True)

The tft examples use a contrib API, which is not official. Can I use the following instead?

dataset = tf.data.TFRecordDataset(filenames_list)
dataset = dataset.map(_parse_proto)
dataset.map(tf_transform_output.transformed_feature_spec())

tf.Transform and Google DataFlow Templates Integration

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.

We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to get tensorflow_transform.beam.impl AnalyzeAndTransformDataset to specify which PipelineRunner to use as follows:

import apache_beam as beam
from tensorflow_transform.beam import impl as tft
pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)

TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized operations - see here: https://cloud.google.com/dataflow/docs/templates/overview

We could leverage this to dynamically generate a tf.Transform CsvCoder.

The steps of using DataFlow Templates are as follows:

A) in PipelineOptions derived types, change option types to ValueProvider
B) change runner to TemplatingDataflowPipelineRunner
C) mvn archetype:generate to store template in GCS (python way: a yaml file like TF Hypertune ???)
D) gcloud beta dataflow jobs run --gcs-location --parameters

For (A), we could define UserOptions subclassed from PipelineOptions and use the add_value_provider_argument API to add specific arguments to be parameterized:

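from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Templated argument: its value is supplied when the template is run.
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Regular argument: its value is fixed when the pipeline is constructed.
        parser.add_argument('--non_value_provider_arg', default='some_other_value')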

The question is: can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner (B and C)?

Looking at the Java TemplatingDataflowPipelineRunner class, it encapsulates DataflowPipelineRunner. How can we create a custom Python runner that encapsulates the Apache Beam Python API class DataflowRunner and provides the functionality of the Java TemplatingDataflowPipelineRunner?

InvalidArgumentError tft.compute_and_apply_vocabulary to dense vector

Hi!

I am trying to convert the sparse output vector from compute_and_apply_vocabulary to a dense vector using sparse_tensor_to_dense_with_shape.

def preprocessing_fn(inputs):
    words = tf.string_split(inputs['tweet'], DELIMITERS)
    int_representation = tft.compute_and_apply_vocabulary(words, top_k=10000)
    int_representation = tft.sparse_tensor_to_dense_with_shape(int_representation, [None, 43])
    outputs = inputs
    outputs["int_representation"] = int_representation
    return outputs

On small samples it works great, but on a bigger batch job using Dataflow it crashes with the following log:

Caused by op u'transform/transform/SparseToDense', defined at:
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", line 86, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", line 82, in main
batchworker.BatchWorker(properties, sdk_pipeline_options).run()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 839, in run
deferred_exception_details=deferred_exception_details)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
op.start()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/impl.py", line 396, in process
lambda: self._make_graph_state(saved_model_dir))
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/shared.py", line 221, in acquire
return _shared_map.acquire(self._key, constructor_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/shared.py", line 183, in acquire
result = control_block.acquire(constructor_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/shared.py", line 85, in acquire
result = constructor_fn()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/impl.py", line 396, in <lambda>
lambda: self._make_graph_state(saved_model_dir))
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/impl.py", line 372, in _make_graph_state
self._exclude_outputs, tf_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/beam/impl.py", line 287, in __init__
saved_model_dir, {}))
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/saved/saved_transform_io.py", line 360, in partially_apply_saved_transform_internal
saved_model_dir, logical_input_map, tensor_replacement_map)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/saved/saved_transform_io.py", line 218, in _partially_apply_saved_transform_impl
input_map=input_map)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1666, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1688, in _import_meta_graph_with_return_elements
**kwargs))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3438, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3297, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[11913] = [942,43] is out of bounds: need 0 <= index < [10000,43]
[[{{node transform/transform/SparseToDense}} = SparseToDense[T=DT_INT64, Tindices=DT_INT64, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](transform/transform/StringSplit, transform/transform/SparseToDense/output_shape, transform/transform/compute_and_apply_vocabulary/apply_vocab/hash_table_Lookup, transform/transform/SparseToDense/default_value)]] [while running 's54']

Thankful for any help solving this!
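
The bounds check named in the error message can be reproduced in plain Python. The message requires 0 <= index < [10000, 43] in every dimension; the column index 43 fails because valid columns run from 0 to 42, which suggests that row 942 holds at least 44 tokens. The sketch below only mirrors that check and is not TensorFlow code; `first_out_of_bounds` and the sample indices other than [942, 43] are hypothetical.

```python
def first_out_of_bounds(indices, dense_shape):
    """Return the first index that violates 0 <= index < dense_shape, else None."""
    for index in indices:
        if any(not 0 <= i < size for i, size in zip(index, dense_shape)):
            return index
    return None


dense_shape = [10000, 43]
# Row 942 with token positions 0..43, i.e. 44 tokens in one tweet.
indices = [[942, col] for col in range(44)]
offending = first_out_of_bounds(indices, dense_shape)
print(offending)  # [942, 43]
```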

tft.quantiles() returns fewer than num of buckets -1 values

This happens when the input has fewer than (number of buckets - 1) distinct values.
For example, if the inputs are 30 negative ones, 40 zeros, and 30 ones, quantiles() with 10 buckets returns only [-1, 0, 1].
Similarly, when the values are all zeros, quantiles() returns only [0].
This behavior is inconsistent with NumPy.
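
The difference can be illustrated with a plain-Python sketch. It uses a simple nearest-rank rule, which is an assumption made for illustration and not the algorithm used by tf.Transform or NumPy; `quantile_boundaries` is a hypothetical helper. With 10 buckets it always returns 9 boundaries, and dropping repeated boundaries yields the shorter lists described above.

```python
def quantile_boundaries(values, num_buckets):
    """Nearest-rank boundaries: always num_buckets - 1 values, repeats allowed."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // num_buckets)] for i in range(1, num_buckets)]


mixed = [-1] * 30 + [0] * 40 + [1] * 30
boundaries = quantile_boundaries(mixed, 10)
print(boundaries)               # [-1, -1, 0, 0, 0, 0, 1, 1, 1]
print(sorted(set(boundaries)))  # [-1, 0, 1], the de-duplicated form

zeros = quantile_boundaries([0] * 100, 10)
print(zeros)                    # nine zeros
```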

NumPy support

I wonder if you've considered adding support for NumPy as a data processing framework. It would be really useful to be able to use the vast amount of preprocessing functionality already written in NumPy, run that on a dataset and get it included in the TensorFlow graph for deployment. I realize that this is a really hard task, but perhaps a subset of the most common NumPy operations could be supported at least (ndarray methods)?

simple example did not work

here is my setup:

conda create -n tftransform python=2.7
source activate tftransform
pip install tensorflow
pip install tensorflow-transform
pip install dill==0.2.6
git clone https://github.com/tensorflow/transform.git
cd transform/
python setup.py install    # for good measure ...

I then try to execute simple_example:
python examples/simple_example.py

I get the following stacktrace:

No handlers could be found for logger "oauth2client.contrib.multistore_file"
Traceback (most recent call last):
  File "examples/simple_example.py", line 64, in <module>
    preprocessing_fn, tempfile.mkdtemp()))
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 439, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/pipeline.py", line 249, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 162, in apply
    return m(transform, input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in apply_PTransform
    return transform.expand(input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 597, in expand
    self._output_dir)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 439, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/pipeline.py", line 249, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 162, in apply
    return m(transform, input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 168, in apply_PTransform
    return transform.expand(input)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 328, in expand
    self._preprocessing_fn, input_schema)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/impl_helper.py", line 416, in run_preprocessing_fn
    inputs = _make_input_columns(schema)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/impl_helper.py", line 218, in _make_input_columns
    placeholders = schema.as_batched_placeholders()
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 87, in as_batched_placeholders
    for key, column_schema in self.column_schemas.items()}
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 87, in <dictcomp>
    for key, column_schema in self.column_schemas.items()}
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 133, in as_batched_placeholder
    return self.representation.as_batched_placeholder(self)
  File "/Users/XXX/anaconda/envs/tftransform/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 330, in as_batched_placeholder
    return tf.placeholder(column.domain.dtype,
AttributeError: 'DType' object has no attribute 'dtype'

Can tf.transform handle viewfs:// path?

I found that since Apache Beam 2.5.0, HDFS paths are recognized by Beam's Python SDK.
However, in "/python2.7/site-packages/apache_beam/io/hadoopfilesystem.py", the only supported scheme is "hdfs". When I loaded a viewfs path, I got this error:

Could anyone give me some advice on using tf.transform to read a file in a viewfs filesystem? I would really appreciate it.
Thank you!

Error using apply_function

Hi,

I'm trying to apply the function below:

def _group_ethnic(original):
    unknown_cat = tf.constant(['UNCODABLE', 'UNCODED', 'ASIAN AMERICAN 2', 'NULL'])
    caucasian_cat = tf.constant(['SCANDINAVIAN', 'MEDITERRANEAN', 'WESTERN EUROPEAN'])
    east_asian_cat = tf.constant(['POLYNESIAN', 'EAST ASIAN'])
    zero = tf.constant(0, dtype=tf.int64)
    return tf.case({
        tf.greater(tf.count_nonzero(tf.equal(unknown_cat, original)), zero): lambda: tf.constant('Unknown'),
        tf.greater(tf.count_nonzero(tf.equal(caucasian_cat, original)), zero): lambda: tf.constant('CAUCASIAN (NON-HISPANIC)'),
        tf.greater(tf.count_nonzero(tf.equal(east_asian_cat, original)), zero): lambda: tf.constant('EAST ASIAN/PACIFIC ISLANDER')
    }, default=lambda: original, exclusive=True, name='group_ethnic')

using tft.apply_function like this:

new_ethnic_group = tft.apply_function(_group_ethnic, inputs['ethnic'])
outputs['ethnic'] = tft.string_to_int(new_ethnic_group, vocab_filename='ethnic')

You can also see it running on colab.

But the following error is being raised:

File "/home/user/.virtualenvs/env1/local/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 670, in expand
  outputs = self._preprocessing_fn(impl_helper.copy_tensors(inputs))
File "/home/user/Workspaces/env1/proj/preprocess/transform.py", line 153, in preprocessing_fn
  new_ethnic_group = tft.apply_function(_group_ethnic, inputs['ethnic'])
File "/home/user/.virtualenvs/env1/local/lib/python2.7/site-packages/tensorflow_transform/api.py", line 161, in apply_function
  return FunctionApplication(fn, args).user_output
File "/home/user/.virtualenvs/env1/local/lib/python2.7/site-packages/tensorflow_transform/api.py", line 89, in __init__
  output = fn(*args)
File "/home/user/Workspaces/env1/proj/preprocess/transform.py", line 148, in _group_ethnic
  tf.greater(tf.count_nonzero(tf.equal(unknown_cat, original)), zero): lambda: tf.constant('Unknown'),
File "/home/user/.virtualenvs/env1/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1489, in equal
  "Equal", x=x, y=y, name=name)
File "/home/user/.virtualenvs/env1/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 519, in _apply_op_helper
  repr(values), type(values).__name__))
TypeError: Expected string passed to parameter 'y' of op 'Equal', got <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fbdc112a090> of type 'SparseTensor' instead.

Does anyone know what I'm missing?

Thanks

vocabulary_file no longer set

As of 0.8.0 vocabulary_file is no longer set in schema.json. One of my projects was relying on this to read the relevant vocabulary filenames and manage some index-to-token mapping outside of TensorFlow.

What is the expected way to find the relevant vocabulary filename after v0.8.0? I presume it's written in saved_model somewhere, but it's not human-readable and harder to access programmatically than JSON. What was the reason for this change / what should vocabulary_file in schema.json be used for if not for this purpose?

Any help or pointers to relevant code is much appreciated! Thanks for all the work on TFT so far.

csv_coder.py 'encode' error

I have an adapted version of the 'chicago taxi' tft pipeline (from the TFMA ex), where after applying:
beam.Map(csv_coder.decode)
I later in the same pipeline apply the following to raw_data:
beam.Map(csv_coder.encode)

This used to work for a previous version of TFT, but for the current version (0.11.0) I get the following error on encode (guessing it's for the case where a field is missing). I think it may just need additional handling in _utf8(s) for when s == None.

File "apache_beam/runners/common.py", line 677, in apache_beam.runners.common.DoFnRunner.process
    self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 414, in apache_beam.runners.common.SimpleInvoker.invoke_process
    windowed_value, self.process_method(windowed_value.value))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/core.py", line 1068, in <lambda>
    wrapper = lambda x: [fn(x)]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/coders/csv_coder.py", line 498, in encode
    return self._encoder.encode_record(string_list)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/coders/csv_coder.py", line 385, in encode_record
    self._writer.writerow(_to_string(record))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/coders/csv_coder.py", line 38, in _to_string
    return list(map(_utf8, x)) if isinstance(x, (list, np.ndarray)) else _utf8(x)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/coders/csv_coder.py", line 33, in _utf8
    return s if isinstance(s, bytes) else s.encode('utf-8')

AFAICT, something like this will fix things:

def _utf8(s):
  if s is None:
    return ''
  else:
    return s if isinstance(s, bytes) else s.encode('utf-8')

Add tensorflow_transform.version

Hi! For running Dataflow jobs it is sometimes required for us to ensure that Dataflow workers have certain required packages installed. For this reason, it would be useful to be able to determine the tensorflow_transform version at runtime.

Ideally, this would follow the common Python convention of a __version__ in the top-level module, as tensorflow-data-validation does:

In [1]: import tensorflow_data_validation
In [2]: tensorflow_data_validation.__version__
Out[2]: '0.11.0'
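
Until such an attribute exists, the installed version can be read from the package metadata instead. The sketch below assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if tensorflow-transform is installed, otherwise None.
print(installed_version("tensorflow-transform"))
```

Returning `None` instead of raising lets a worker startup check decide for itself how to react to a missing package.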

tf.SequenceExample support

After a short look into the code of tensorflow_transform.coders.ExampleProtoCoder, it looks like tf.SequenceExample record files are not supported yet. Is this already on the list, or did I miss something and this is already supported by tf.Transform?

Thank you in advance!

min/max values of transformed metadata

Assume I am applying the string_to_int function to a categorical column. In the transformed schema this column is mapped to INT, where the min and max entries of the domain are the min and max values of INT:

"domain": {
      "ints": {
            "isCategorical": false,
            "max": "9223372036854775807",
            "min": "-9223372036854775808"
        }
 },

I would expect instead the actual min (-1) and max (size(vocab)-1) values of this transformed column. Is there any workaround for this?

Beam WriteTransformFn leads to RuntimeError: AlreadyExistsError: file already exists

Hi,

while writing out the transformation function:

(transform_fn | 'WriteTransformFn' >> WriteTransformFn(tft_dir))

we get the following error:

RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']

It seems that multiple saved_model.pb files are generated in the tft_beam.Context temp dir. WriteTransformFn then tries to copy them, leading to the above error.

Tensorflow: 1.9.0
Tensorflow Transform: 0.9.0
Apache Beam: 2.6.0

All TensorInfo protos used in the SignatureDefs must have the name field set

Hello.

I am getting this error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 609, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 54, in dataflow_worker.native_operations.NativeReadOperation.start
    self.output(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 159, in apache_beam.runners.worker.operations.Operation.output
    cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
    cython.cast(Operation, consumer).process(windowed_value)
  File "apache_beam/runners/worker/operations.py", line 392, in apache_beam.runners.worker.operations.DoOperation.process
    with self.scoped_process_state:
  File "apache_beam/runners/worker/operations.py", line 393, in apache_beam.runners.worker.operations.DoOperation.process
    self.dofn_receiver.receive(o)
  File "apache_beam/runners/common.py", line 488, in apache_beam.runners.common.DoFnRunner.receive
    self.process(windowed_value)
  File "apache_beam/runners/common.py", line 496, in apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 537, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    six.raise_from(new_exn, original_traceback)
  File "/usr/local/lib/python2.7/dist-packages/six.py", line 718, in raise_from
    raise value
AssertionError: All TensorInfo protos used in the SignatureDefs must have the name field set: dtype: DT_STRING
tensor_shape {
  dim {
    size: -1
  }
  dim {
    size: -1
  }
}
coo_sparse {
  values_tensor_name: "transform/inputs/label/values:0"
  indices_tensor_name: "transform/inputs/label/indices:0"
  dense_shape_tensor_name: "transform/inputs/label/shape:0"
}

when running TFT on the Dataflow Runner. My label in preprocessing_fn is a sparse tensor of strings. Surprisingly, I don't see this error in DirectRunner! I'm using the current master version 3014617.

This might be related to tensorflow/tensorflow#6110

If you take a look at build_tensor_info (which is used by TFT):

@tf_export("saved_model.utils.build_tensor_info")
def build_tensor_info(tensor):
  """Utility function to build TensorInfo proto.

  Args:
    tensor: Tensor or SparseTensor whose name, dtype and shape are used to
        build the TensorInfo. For SparseTensors, the names of the three
        constitutent Tensors are used.

  Returns:
    A TensorInfo protocol buffer constructed based on the supplied argument.
  """
  tensor_info = meta_graph_pb2.TensorInfo(
      dtype=dtypes.as_dtype(tensor.dtype).as_datatype_enum,
      tensor_shape=tensor.get_shape().as_proto())
  if isinstance(tensor, sparse_tensor.SparseTensor):
    tensor_info.coo_sparse.values_tensor_name = tensor.values.name
    tensor_info.coo_sparse.indices_tensor_name = tensor.indices.name
    tensor_info.coo_sparse.dense_shape_tensor_name = tensor.dense_shape.name
  else:
    tensor_info.name = tensor.name
  return tensor_info

In this TF code, no name is set for a SparseTensor, which would explain the error and the overall issue. But then again, why would that work in DirectRunner? Also, what is the recommendation for using a SparseTensor in preprocessing_fn?

Example of Dataflow job: 2018-06-26_10_13_22-11943278300951017592

`string_to_int` on multiple columns

Given a dataset with two categorical columns of the same feature type, is it possible to map them to an integer value using the same vocabulary? For example by applying string_to_int on the following toy dataset

previous_occupation, current_occupation
programmer, analyst
analyst, programmer

we get the transformed dataset

previous_occupation, current_occupation
0,0
1,1

I would like to get the following instead, given that the two columns contain the same feature type.

previous_occupation, current_occupation
0,1
1,0
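The desired behaviour amounts to building one vocabulary over the values of both columns and then applying it to each column. Below is a minimal plain-Python sketch of that idea, not tf.Transform code; it assumes ids are assigned in order of first appearance (a real vocabulary is often ordered by frequency instead), and build_shared_vocabulary is a hypothetical helper.

```python
def build_shared_vocabulary(*columns):
    """Assign one integer id per distinct value across all given columns.

    Ids follow order of first appearance when reading row by row.
    Hypothetical helper for illustration only.
    """
    vocab = {}
    for row in zip(*columns):
        for value in row:
            if value not in vocab:
                vocab[value] = len(vocab)
    return vocab


previous_occupation = ["programmer", "analyst"]
current_occupation = ["analyst", "programmer"]

vocab = build_shared_vocabulary(previous_occupation, current_occupation)
encoded = [(vocab[p], vocab[c])
           for p, c in zip(previous_occupation, current_occupation)]
print(encoded)  # [(0, 1), (1, 0)]
```

Because both columns share the same mapping, "programmer" gets the same id wherever it appears.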

Help with apply_function

I am trying to calculate the length of the input text as part of the preprocessing_fn and can't figure out the right way to do it:

    outputs['text'] = tft.mappers.string_to_int(outputs['text'])
    outputs['text_length'] = tft.apply_function(lambda i: tf.cast(tf.convert_to_tensor(tf.size(i)), tf.int64), outputs['text'])
    # essentially want to do 
    outputs['text_length'] = tft.apply_function(lambda i: tf.size(i), outputs['text'])

but i get the following error:

File "clause_type_raw_to_tf.py", line 453, in <module>
    main()
  File "clause_type_raw_to_tf.py", line 450, in main
    transform_data(train_data_file, test_data_file, working_dir, pipeline_args)
  File "clause_type_raw_to_tf.py", line 378, in transform_data
    raw_dataset | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 488, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 479, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 174, in apply
    return m(transform, input)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 180, in apply_PTransform
    return transform.expand(input)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 825, in expand
    dataset | 'AnalyzeDataset' >> AnalyzeDataset(self._preprocessing_fn))
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 820, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 488, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 443, in apply
    return self.apply(transform, pvalueish)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 479, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 174, in apply
    return m(transform, input)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 180, in apply_PTransform
    return transform.expand(input)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/tensorflow_transform/beam/impl.py", line 764, in expand
    schema=impl_helper.infer_feature_schema(outputs))
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/tensorflow_transform/impl_helper.py", line 60, in infer_feature_schema
    for name, tensor in six.iteritems(tensors)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/tensorflow_transform/impl_helper.py", line 60, in <dictcomp>
    for name, tensor in six.iteritems(tensors)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 562, in infer_column_schema_from_tensor
    remove_batch_dimension=True)
  File "/home/madhav/ml-engine-cnn/beam-pipelines/env/local/lib/python2.7/site-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 591, in _shape_to_axes
    raise ValueError('Expected tf_shape to have rank >= 1')
ValueError: Expected tf_shape to have rank >= 1

This seems to be caused by the batching that tft performs automatically. I couldn't find any documentation or guidelines on how to do such a thing. Can someone help with an example? I am sure the community would find it useful.
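The ValueError is consistent with the output being a scalar: a size taken over the whole batch has rank 0, while each output is expected to keep a leading batch dimension. The difference can be illustrated in plain Python, assuming a batch is represented as a list of token-id lists; both function names are hypothetical and this is not tf.Transform code.

```python
def total_size(batch):
    """One number for the whole batch (rank 0): the batch dimension is lost."""
    return sum(len(tokens) for tokens in batch)


def per_example_length(batch):
    """One number per example (rank 1): the batch dimension is kept."""
    return [len(tokens) for tokens in batch]


batch = [[4, 7, 1], [2], [9, 9]]  # three examples of different lengths
print(total_size(batch))          # 6
print(per_example_length(batch))  # [3, 1, 2]
```

Under this reading, the preprocessing function would need the per-example form, so that the result has one entry for each row in the batch.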

I can not use Estimator from tf.estimator package TF 1.3

I have this bundle:

TF: 1.3.0
TFT: 0.1.10
Python: 2.7.12

I read TF 1.3 docs here about Estimators:

Note: TensorFlow also provides an Estimator class at tf.contrib.learn.Estimator, which you should not use.

I tried to replace learn.Estimator with tf.estimator.Estimator and got an error (all TFT examples work with learn.Estimator):

local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 440, in export_savedmodel
serving_input_receiver.receiver_tensors,
AttributeError: 'InputFnOps' object has no attribute 'receiver_tensors'

I use export_strategy for serving the model through tf.contrib.learn.Experiment, and it should support tf.estimator.Estimator, according to the docs:

estimator: Object implementing Estimator interface, which could be a combination of ${tf.contrib.learn.Trainable} and ${tf.contrib.learn.Evaluable} (deprecated), or ${tf.estimator.Estimator}.

Can you explain what version of TFT/TF I should use, or what I am doing wrong?

Best practices for interfacing tf.Transform and tf.data.Dataset

I was looking for the best way to replicate the training data preprocessing at serving time. It looks like tf.Transform is the way to go, but it's unclear what the best practices are for interfacing it with tf.data.Dataset pipelines (which also have dataset mappers, etc.).

Also, going forward, will tf.Transform and tf.data.Dataset co-exist, or will one succeed the other?

"pip install git+https://github.com/tensorflow/transform.git" fault

pip install git+https://github.com/tensorflow/transform.git

NameError: name 'execfile' is not defined
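This NameError is what Python 3 raises when a script calls execfile, which exists only as a built-in in Python 2. Assuming that is the cause here, a version-independent replacement can be sketched as follows; execfile_compat and the temporary version file are hypothetical and not taken from the repository.

```python
import os
import tempfile


def execfile_compat(path, namespace):
    """Run the Python file at `path` inside `namespace` and return it.

    Hypothetical stand-in for Python 2's execfile(); works on Python 3.
    """
    with open(path) as f:
        code = compile(f.read(), path, "exec")
    exec(code, namespace)
    return namespace


# Demonstration with a throwaway file that defines a version string.
with tempfile.TemporaryDirectory() as tmp_dir:
    version_path = os.path.join(tmp_dir, "version.py")
    with open(version_path, "w") as f:
        f.write("__version__ = '1.2.3'\n")
    loaded = execfile_compat(version_path, {})

print(loaded["__version__"])  # 1.2.3
```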

Normalize subgroups

Hi,

What I'd like to be able to do is normalize a column per subgroup, so per key that is contained in another column. Something like .groupby('column').apply(normalizefunc) in pandas, and I guess I can implement it in Beam using GroupByKey but the prospect of having an automatically generated transform graph for use during inference is just very attractive.

I looked through the source and can't find this functionality but it looks like you guys are doing similar things with tf-idf and similar computations. Did I overlook this groupby-scale functionality, is it in the works or something you envision implementing in the near future?
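For reference, the requested computation can be written as two passes over the data: one to collect a mean and standard deviation per key, and one to scale each value with the statistics of its own key. This is a plain-Python sketch of that logic only, not a tf.Transform or Beam implementation; the function names are hypothetical, and it assumes the population standard deviation, with 0.0 returned for groups that have zero variance.

```python
import math
from collections import defaultdict


def group_stats(keys, values):
    """First pass: return {key: (mean, std)} using the population std."""
    grouped = defaultdict(list)
    for key, value in zip(keys, values):
        grouped[key].append(value)
    stats = {}
    for key, group in grouped.items():
        mean = sum(group) / len(group)
        variance = sum((v - mean) ** 2 for v in group) / len(group)
        stats[key] = (mean, math.sqrt(variance))
    return stats


def normalize_by_group(keys, values):
    """Second pass: z-score each value with the statistics of its key."""
    stats = group_stats(keys, values)
    normalized = []
    for key, value in zip(keys, values):
        mean, std = stats[key]
        # Groups with zero variance map to 0.0 to avoid division by zero.
        normalized.append((value - mean) / std if std > 0 else 0.0)
    return normalized


keys = ["a", "a", "b", "b"]
values = [1.0, 3.0, 10.0, 30.0]
print(normalize_by_group(keys, values))  # [-1.0, 1.0, -1.0, 1.0]
```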

running sentiment_example.py model as a server

Hi guys,

I am new to Tensorflow, so bear with me if I am doing something completely wrong :-) I am following the text classification example at https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py
Model development worked as expected. I am working on running the developed model on google ml engine environment.

I added the following lines to "train_and_evaluate" function to export the model

from tensorflow.contrib.learn.python.learn.utils import input_fn_utils
from tensorflow.contrib.layers import create_feature_spec_for_parsing

feature_spec = create_feature_spec_for_parsing(train_input_fn)
serving_input_fn = input_fn_utils.build_parsing_serving_input_fn(feature_spec)
estimator.export_savedmodel(job_dir, serving_input_fn)

I receive the following error on a classification request for the sample sentence "nice piece of work ." The payload looks like this: {"inputs": "nice piece of work ."}

{
  "error": "Prediction failed: Exception during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details=\"Could not parse example input, value: 'nice piece of work .'\n\t [[Node: ParseExample/ParseExample = ParseExample[Ndense=0, Nsparse=2, Tdense=[], _output_shapes=[[-1,2], [-1,2], [-1], [-1], [2], [2]], dense_shapes=[], sparse_types=[DT_INT64, DT_FLOAT], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](_recv_input_example_tensor_0, ParseExample/ParseExample/names, ParseExample/ParseExample/sparse_keys_0, ParseExample/ParseExample/sparse_keys_1)]]\")"
}

Am I getting this error because the model object is expecting integerized tensors? If so, I attempted to use build_parsing_transforming_serving_input_fn function at https://github.com/tensorflow/transform/blob/master/tensorflow_transform/saved/input_fn_maker.py
to perform transformation at run time, it appears that I need a transform_savedmodel_dir that embodies the transformation model with the parsing logic. I figure, this is achieved by using write_saved_transform_from_session at https://github.com/tensorflow/transform/blob/master/tensorflow_transform/saved/saved_transform_io.py

Can you guys share example code that exports a transform model?