ibm / data-prep-kit

Open source project for data preparation of LLM application builders

Home Page: https://ibm.github.io/data-prep-kit/

License: Apache License 2.0

Languages: Makefile 7.43%, Python 88.00%, Dockerfile 2.85%, Shell 1.72%
Topics: data-preparation, finetuning, llm, llmapps, data, data-prep, data-preprocessing, data-preprocessing-pipelines, datacuration, large-language-models

data-prep-kit's Introduction

Data Prep Kit


Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers are faced with the enormous challenge of preparing use case-specific unstructured data to fine-tune or instruct-tune the LLMs. As the variety of use cases grows, so does the need to support:

  • New modalities of data (code, language, speech, visual)
  • New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case.
  • A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Kit offers implementations of commonly needed data transformations, called modules, for both Code and Language modalities. The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks.

๐Ÿ“ Table of Contents

📖 About

Data Prep Kit is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning or instruction-tuning. Data Prep Kit provides a set of modules that developers can start from to easily build data pipelines suitable for their use case. These modules have been tested while producing pre-training datasets for the Granite open models, here and here.

The modules are built on a common framework (for Spark and Ray), called the data processing library, that allows developers to build new custom modules that readily scale across a variety of runtimes. Eventually, Data Prep Kit will offer consistent APIs and configurations across the following underlying runtimes.

  1. Python runtime
  2. Ray runtime (local and distributed)
  3. Spark runtime (local and distributed)
  4. Kubeflow Pipelines (local and distributed, wrapping Ray)

The current matrix for the combination of modules and supported runtimes is shown in the table below. Contributors are welcome to add new modules as well as add runtime support for existing modules!

| Modules | Python-only | Ray | Spark | KFP on Ray |
|---|---|---|---|---|
| No-op / template | ✅ | ✅ | ✅ | ✅ |
| Doc ID annotation | ✅ | ✅ | ✅ | ✅ |
| Programming language annotation | ✅ | ✅ | | ✅ |
| Exact dedup filter | | ✅ | | ✅ |
| Fuzzy dedup filter | | ✅ | | ✅ |
| Code quality annotation | ✅ | ✅ | | ✅ |
| Malware annotation | ✅ | ✅ | | ✅ |
| Filter on annotations | ✅ | ✅ | ✅ | ✅ |
| Language identification | ✅ | ✅ | | ✅ |
| Code (from zip) to Parquet | ✅ | ✅ | | ✅ |
| Profiler | | ✅ | | ✅ |
| Tokenizer | ✅ | ✅ | | ✅ |

Features of the toolkit:

  • It aims to accelerate unstructured data prep for the "long tail" of LLM use cases.
  • It offers a growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
  • It provides a growing set of sample pipelines developed for real enterprise use cases.
  • It provides the Data processing library to enable contribution of new custom modules targeting new use cases.
  • It uses Kubeflow Pipelines-based workflow automation.

Data modalities supported:

  • Code - support for code datasets as downloaded .zip files of GitHub repositories converted to parquet files.
  • Language - Future releases will provide transforms specific to natural language, and like the code transformations, will operate on parquet files.

Support for additional data modalities is expected in the future.

Data Processing Library

A Python-based library that provides ready-to-use transforms that can be run across a variety of runtimes. We use the popular parquet format to store the data (code or language). Every parquet file follows a set schema. Data is converted from its raw form (e.g., zip files of GitHub repositories) to parquet files by the ingest2parquet tool, which also adds the necessary fields to the schema. A user can then use one or more of the available transforms to process their data.
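To make the flow concrete, here is a minimal, self-contained sketch (not the ingest2parquet implementation; the column names and file names are illustrative assumptions) of writing raw records to a parquet file with a fixed schema and reading them back, which is the form in which transforms consume the data:

import pyarrow as pa
import pyarrow.parquet as pq

# Pretend these rows came from files inside a downloaded GitHub .zip archive.
table = pa.table({
    "document_id": ["repo/a.py", "repo/b.py"],
    "contents": ["print('a')", "print('b')"],
    "language": ["Python", "Python"],
})
pq.write_table(table, "sample.parquet")

# Transforms then read tables back from parquet and write new parquet files.
read_back = pq.read_table("sample.parquet")
print(read_back.schema)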

Transform Design

A transform can follow one of two patterns: annotator or filter (see the sketch after this list).

  • Annotator An annotator transform adds information during processing by adding one or more columns to the parquet files. The annotator design also allows a user to verify the results of the processing before the actual filtering of the data.

  • Filter A filter transform processes the data and outputs the transformed data, e.g., exact deduplication. A general-purpose SQL-based filter transform enables a powerful mechanism for identifying columns and rows of interest for downstream processing. For a new module to be added, a user can pick the right design based on the processing to be applied. More details here.
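As an illustration of the two patterns (a sketch only; the column names are assumptions, not the toolkit's schema), an annotator appends derived columns while a filter uses them to drop rows:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"contents": ["def f(): pass", "", "import os"]})

# Annotator: add one or more columns of derived information, keep all rows.
annotated = table.append_column("contents_length", pc.utf8_length(table["contents"]))

# Filter: use the annotation to keep only the rows of interest downstream.
filtered = annotated.filter(pc.greater(annotated["contents_length"], 0))
print(annotated.num_rows, "rows annotated ->", filtered.num_rows, "rows after filtering")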

Scaling of Transforms

To enable processing of large data volumes on multi-node clusters, Ray and Spark wrappers are provided to readily scale out the Python implementations. A generalized workflow is shown here.
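Conceptually, the scale-out looks like the following sketch (plain Ray, not the toolkit's wrapper; the file names are illustrative assumptions): the same per-file Python logic is fanned out as Ray tasks, one per parquet file.

import ray
import pyarrow.parquet as pq

@ray.remote
def process_file(path: str) -> int:
    # Placeholder for the per-file transform logic.
    table = pq.read_table(path)
    return table.num_rows

ray.init()  # local cluster; pass an address to connect to a remote one
files = ["part-0.parquet", "part-1.parquet"]  # hypothetical inputs
row_counts = ray.get([process_file.remote(f) for f in files])
print(sum(row_counts), "rows processed")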

Bring Your Own Transform

One can add new transforms by bringing in Python-based processing logic and using the Data Processing Library to build and contribute transforms. More details on the data processing library are here.

Automation

The toolkit also supports transform execution automation based on Kubeflow Pipelines (KFP), tested on a locally deployed Kind cluster and on external OpenShift clusters. There is automation to create a Kind cluster and deploy all required components on it. The KFP implementation is based on the KubeRay Operator for creating and managing the Ray cluster, and the KubeRay API server for interacting with the KubeRay operator. An additional framework along with several KFP components is used to simplify the pipeline implementation.

🚀 Getting Started

There are various entry points that you can choose based on your use case. Each entry point has its own prerequisites and setup steps. The common parts are:

Prerequisites

  • Python 3.10 or 3.11
  • Docker/Podman

Two important development tools will also be installed using the steps below:

Installation Steps

# Install the development tools used by the repository
pip install pre-commit
pip install twine
...
# Clone the repository and enable the pre-commit hooks
git clone git@github.com:IBM/data-prep-kit.git
cd data-prep-kit
pre-commit install

Please note that there are further installation steps for running the transforms in general, as documented here, and for running them on a local Kind cluster or an existing Kubernetes cluster, as documented here.

Below are a few demos to get you started.

Build Your Own Transforms

Follow the documentation here to build your own transform and run it in either the Python or Ray runtime.

Run a Single Transform on Local Ray

Get started by running the "noop" transform that performs an identity operation by following the tutorial and associated noop implementation.

Run a Jupyter Notebook on a Local Ray Cluster

Get started by building a Jupyter notebook that executes a sequence of transforms with our example pipeline that can run on your machine. This implementation can also be extended to connect to a remote Ray cluster.

Automate a Pipeline

The data preprocessing can be automated by running transforms as a Kubeflow pipeline (KFP). The project facilitates the creation of a local Kind cluster with all the required software and test data, or the deployment of the required software on an existing cluster. See Set up a Kubernetes cluster for KFP execution.

A simple transform pipeline tutorial explains pipeline creation and execution. In addition, if you want to combine several transforms in a single pipeline, you can look at the multi-step pipeline.

When you finish working with the cluster and want to clean up or destroy it, see Clean up the cluster.

How to Navigate and Use the Repository

See the documentation on repository structure and its use.

๐Ÿค How to Contribute

See the contribution guide

โญ Acknowledgements

Thanks to the BigCode Project, from which the code quality metrics were borrowed.

data-prep-kit's People

Contributors

blublinsky, bytes-explorer, cmadam, d-sai-venkatesh, daw3rd, deanwampler, dtsuzuku-ibm, eltociear, himapatel1, jitendrasinghibm, mohammad-nassar10, nirmdesai, param-s, revit13, roytman, santoshborse, sapthasurendran, shahrokhdaijavad, shivdeep-singh-ibm, takuyagt, ykoyfman, yuanchi2807


data-prep-kit's Issues

[Feature] Clarity in Readme file for noop transform

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/noop

Feature

Clarity requested by Dean in: data-prep-kit/transforms/universal/noop/README.md

Can you provide more explanation about what the configuration values do? E.g.:
• noop_sleep_sec: specifies the number of seconds to sleep during table transformation. What does "during" mean? Is it really before or after processing starts? Is this something "real" transforms might want, e.g., to allow something to happen in the background before reading starts?
• noop_pwd: specifies a dummy password not included in metadata. I assume this is for accessing the file. Is a user name also required?

Of course, you said these are examples, but I would go to extra effort to clarify specific behaviors, to encourage users to do the same.
As a general note about configuration parameter names, would it be better to use a hierarchy, like noop.sleep.sec, so that in the noop context you would just look at sleep.sec in the config file? I made sleep a parent in case you added more options, e.g.,
noop.sleep.enable = false
noop.sleep.sec = 2
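For illustration, a small Python sketch of how such scoping could work (this is only a sketch of the naming idea above, not existing toolkit behavior):

config = {"noop.sleep.enable": False, "noop.sleep.sec": 2, "other.sleep.sec": 9}

def scoped(cfg: dict, prefix: str) -> dict:
    # Return only the keys under one transform's prefix, with the prefix stripped.
    return {k[len(prefix) + 1:]: v for k, v in cfg.items() if k.startswith(prefix + ".")}

print(scoped(config, "noop"))  # {'sleep.enable': False, 'sleep.sec': 2}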

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Develop the testing strategy for new NLP modules being added to the kit

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

We are adding document quality and spoken language id NLP modules, as well as new code modules for HAP, license filtering, and PII, to the kit. We need testing similar (or better!) to what was done for the initial set of code modules.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Change example notebook to use a location other than test-data to store intermediate files

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

The example notebook uses the directory test-data to hold both the allowed-languages.txt file and the files generated while running the notebook. This makes it hard to start over. In general, the convention is to treat the test-data directory as read-only. So the ask is:

  1. Put the files generated while running the notebook somewhere else. .../examples/tempdir or something is fine.
  2. Add the removal of the intermediate files to the make clean target.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Build a transform to remove dead code from code files

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/code/code_quality

Feature

Goal is to remove dead code from code files. The routine should work across 100+ programming languages and should be easily extensible to more.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Provide instructions to deploy and execute the project on a real Kubernetes cluster

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

Feature

The currently provided Kind deployment is suitable for testing/development, but for processing larger data sets, users should execute the project on a real Kubernetes cluster.
We have to provide instructions on how to do it.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Unify format of data_sets and files_to_use in DataAccessFactory

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core

Feature

According to the way the data-processing-lib parses the parameter data_data_sets, the datasets should be passed as a single string with the dataset names separated by ",", for example dataset1,dataset2.
The parameter data_files_to_use, on the other hand, should be passed as a list (with []) of quoted strings, for example ['.ext1', '.ext2'].

This difference is caused by how these two parameters are parsed in the library. I think we should unify the parsing by changing data_data_sets to use type=ast.literal_eval. For example:

parser.add_argument(
    f"--{self.cli_arg_prefix}data_sets",
    type=ast.literal_eval,
    default=None,
    help="List of data sets",
)
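A small self-contained sketch of the contrast described above and the proposed unification (the argument names follow the issue text; this is not the library's actual parser setup):

import argparse
import ast

parser = argparse.ArgumentParser()
# Today: data_sets arrives as a plain comma-separated string...
parser.add_argument("--data_data_sets", type=str, default=None)
# ...while files_to_use is parsed as a Python literal list.
parser.add_argument("--data_files_to_use", type=ast.literal_eval, default=None)

args = parser.parse_args(["--data_data_sets", "dataset1,dataset2",
                          "--data_files_to_use", "['.ext1', '.ext2']"])
print(args.data_data_sets.split(","))  # manual split needed: ['dataset1', 'dataset2']
print(args.data_files_to_use)          # already a list: ['.ext1', '.ext2']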

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

KFP/kind setup - errors due to files not found

make setup in data-prep-lab/kind outputs these errors due to missing files.

mc: <ERROR> Unable to validate source `/Users/yan/vmshared/kubuntu14/data/src/fm/data-prep-lab/kind/../transforms/language/doc_quality/test-data/input/`.
mc: <ERROR> Unable to validate source `/Users/yan/vmshared/kubuntu14/data/src/fm/data-prep-lab/kind/../transforms/language/language_id/test-data/input/`.
mc: <ERROR> Unable to validate source `/Users/yan/vmshared/kubuntu14/data/src/fm/data-prep-lab/kind/../transforms/universal/blocklist/test-data/input/`.
mc: <ERROR> Unable to validate source `/Users/yan/vmshared/kubuntu14/data/src/fm/data-prep-lab/kind/../transforms/universal/blocklist/test-data/domains/`.
mc: <ERROR> Unable to validate source `/Users/yan/vmshared/kubuntu14/data/src/fm/data-prep-lab/kind/../transforms/universal/resize/test-data/input/`.
setup-cluster completed

[Feature] Add support to ingest PDF files

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

We would like to add the ability to read PDF files and convert them to parquet files, which can then go through other processing modules like dedup, filtering, etc. Today we have support for .zip files via the ingest2parquet module.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] Wrong test data path

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

What happened + What you expected to happen

After moving the transform code into the ray subdirectory in PR #143, the test data path in kind/hack/populate_minio.sh is wrong.

Reproduction script

run make setup and see the results of data population

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

ingest2parquet Readme file needs to be cleaned up

  1. There are still references to IBM bluepile in the Readme (command line options)
    AST string containing input/output paths. input_folder: Path to input folder of files to
    be processed output_folder: Path to output folder of processed files Example: {
    'input_folder': '/cos-optimal-llm-pile/bluepile-
    processing/rel0_8/cc15_30_preproc_ededup', 'output_folder': '/cos-optimal-llm-
    pile/bluepile-processing/rel0_8/cc15_30_preproc_ededup/processed' }

  2. In the section "Run the script via command-line", shouldn't it be python ingest2parquet_local.py instead of python ingest2parque.py?

[Bug] make publish in kfp_ray_components is not logging into quay.io

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

What happened + What you expected to happen

When using the makefile to publish the kfp-data-processing image, it fails because I'm not authorized, yet I have the DPK_DOCKER_REGISTRY/USER env vars set, as supported by transform publishing. The makefile should be logging the user in with these env vars.

Reproduction script

$ cd kfp/kfp_ray_components
$ make image
$ make publish
docker push quay.io/dataprep1/data-prep-kit/kfp-data-processing:0.1.0
Getting image source signatures
Copying blob sha256:8b2a55f07696d6b9f3367d7d77577a5d8c589d358ca6c89cb8ddf9ea9bdb80d4
Copying blob sha256:6d6a68f624bbd6e78237ef8ee1d6e4ae37c105c17bdc1c9d0c61028592e4cce0
Copying blob sha256:28da0445c4497f3ecb56288bd74d91ed1ff6f86578d1d0f6f9cb2781915163b1
Copying blob sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef
Copying blob sha256:696418cefb060fa658e29b0fe7cf074a7581b558e859d2360cbb6cb90935fbad
Copying blob sha256:0f8b5c9f2bc803c9ef57537e42bf1f977855314e8d463080ab5a20b80f9cc910
Copying blob sha256:9d43689335e2980880d13a36c4aeb151080ff818c52de116f30128470fbe7f96
Copying blob sha256:067112d5bafe7f67682dd678380f426761fe83d1fe17f23ad89d8cb7d34d9ae1
Copying blob sha256:ce61675dbc5e9669b25c4d5b5d753287910ce550b50cec16fd7322e905990e3a
Copying blob sha256:5cc6f1238fac5ae20d628a3b0928c1251c18ad58ca1c3405d1ebf64060a3ffe1
Copying blob sha256:0166a942e7ce9422b37d51ffddef5f0254adc395504b7b4d87fa97615eb75ffe
Copying blob sha256:f900ef5b975c529938bb9c8da081f25c5e33370dce2103883589e8c29cb500ec
Copying blob sha256:e2d425d70ee0fd0d997ab5b84c36cc36927238cc2bc0e5cdcc4195d9b769c25e
Error: writing blob: initiating layer upload to /v2/dataprep1/data-prep-kit/kfp-data-processing/blobs/uploads/ in quay.io: unauthorized: access to the requested resource is not authorized
make: *** [publish] Error 125

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Python implementation for exact dedup

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/ededup

Feature

Python version for exact deduplication logic that can work across both code and NLP.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Add support to process HTML file format

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

We would like to add the ability to read HTML files and convert them to parquet files, which can then go through other processing modules like dedup, filtering, etc.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Enhance supported platforms

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

Feature

We have validated the execution of the data-prep-kit Kind deployment on several platforms (see below) and have faced Ray job execution issues on the others. We need to extend the supported platforms:

| Operating System | Container Agent | Support | Priority | Comments |
|---|---|---|---|---|
| RHEL 7 | any | - | - | Kind doesn't support RHEL 7 |
| RHEL 8 | | ? | | |
| RHEL 9.4 | Docker | Yes | - | |
| RHEL 9.4 | Podman | No | ? | Issues with Ray job executions |
| Ubuntu 24-04 | Docker | Yes | - | (Validate nginx deployment) |
| Ubuntu 24-04 | Podman | ? | | |
| Windows WSL2 | Docker | Yes | - | |
| Windows WSL2 | Podman | ? | | |
| MacOS amd64 | Docker | Yes | - | |
| MacOS amd64 | Podman | ? | | |
| MacOS arm64 | Docker | ? | | |
| MacOS arm64 | Podman | No | ? | Issues with Ray job executions |

In the meantime, we are not testing or investigating other operating systems and container agents, such as Rancher or Colima.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] Example notebook finding no input files in ededup

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

Getting error messages during the ededup section saying there are no input files.

Reproduction script

Download zip from data-prep-kit repo into /Users/dawood/Downloads/data-prep-kit-dev.zip

mkdir /tmp/example
cp data-prep-kit-dev.zip /tmp/example
git clone ...
cd data-prep-kit/examples
make venv
make jupyter

Edit notebook

zip_input_folder = "/tmp/example"

Run the notebook through the ededup section and get logged messages saying there are no input files:

3:26:29 INFO - Running locally
13:26:29 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
13:26:29 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
13:26:29 INFO - data factory data_ max_files -1, n_sample -1
13:26:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
13:26:29 INFO - number of workers 3 worker options {'num_cpus': 0.8}
13:26:29 INFO - pipeline id pipeline_id; number workers 3
13:26:29 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:26:29 INFO - code location None
13:26:29 INFO - actor creation delay 0
2024-05-14 13:26:31,387	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(orchestrate pid=24207) 13:26:32 INFO - orchestrator started at 2024-05-14 13:26:32
(orchestrate pid=24207) 13:26:32 ERROR - No input files to process - exiting
13:26:42 INFO - Completed execution in 0.21104646523793538 min, execution result 0

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Investigate Podman on RHEL9 for KFP

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

Feature

We have not been able to test a transform through KFP successfully on the RHEL VM machine with Podman. It works well with Docker.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Add spark support for filter module

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

We want to support users who work with Spark, Ray and pure python. The goal is to enable the filter framework on Spark.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Fuzzy deduplication on Spark

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/fdedup

Feature

Spark version of fuzzy deduplication that can work across code and language.

  • Incremental logging and progress indicators
  • Checkpointing
  • Resource utilization estimation for network/compute/memory

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Problem with MkDocs creation of GitHub pages

MkDocs is having problems with certain MD file constructs, like "matrix" (MD table) on the root readme file.

Compare https://github.com/IBM/data-prep-lab with https://ibm.github.io/data-prep-lab/ and see the problem.

Nirmit seems to have found a solution here: https://squidfunk.github.io/mkdocs-material/reference/data-tables/

Maybe, as the link says, just adding the following lines to mkdocs.yml will fix the problem?

markdown_extensions:
  - tables

minio is not able to start

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

What happened + What you expected to happen

After installing the prerequisite software manually, I got this when running make setup:

creating minio alias
mc: Configuration written to `/root/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/root/.mc/share`.
mc: Initialized share uploads `/root/.mc/share/uploads.json` file.
mc: Initialized share downloads `/root/.mc/share/downloads.json` file.
Added `kfp` successfully.
creating test bucket
Bucket created successfully `kfp/test`.
copying data
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/code/code_quality/test-data/input/sample_1.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:50442->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/code/proglang_select/test-data/input/test1.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:52100->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/code/proglang_select/test-data/languages/allowed-code-languages.txt`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:53306->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/code/malware/test-data/input/sample.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:40052->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/universal/ededup/test-data/input/sample1.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:41522->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/universal/noop/test-data/input/test1.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:39912->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/universal/tokenization/test-data/ds01/input/lang=en/dataset=cybersecurity_v2.0/version=2.3.2/pq03.snappy.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:50202->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/universal/tokenization/test-data/ds01/input/lang=en/dataset=empty/dpv08_cc02.snappy.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:50180->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/universal/tokenization/test-data/ds01/input/lang=en/dataset=empty/dpv08_cc01.snappy.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:50186->127.0.0.1:8080: i/o timeout
mc: <ERROR> Failed to copy `/root/test/data-prep-lab/transforms/universal/tokenization/test-data/ds01/input/lang=en/pq02.parquet`. Get "http://localhost:8080/test/?location=": read tcp 127.0.0.1:50206->127.0.0.1:8080: i/o timeout
make[2]: *** [Makefile:28: populate-data] Error 1
make[2]: Leaving directory '/root/test/data-prep-lab/kind'
make[1]: *** [Makefile:37: cluster-deploy] Error 2
make[1]: Leaving directory '/root/test/data-prep-lab/kind'
make: *** [Makefile:23: setup] Error 2

Reproduction script

make setup

Anything else

No response

OS

Redhat

Python

Other

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Add spark support for document id module

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/doc_id

Feature

Support users who have Spark set up in the environments

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Endurance Test!

As discussed in issue #388 in the internal repo, we want to test whether the framework has a memory leak. To do this, we use the noop transform and a large dataset (like test set 3, which has 1500 zip files): 1) run ingest2parquet on it first, and 2) run the noop transform while monitoring the memory usage of the laptop to see whether it reaches a flat plateau.

[Bug] Notebook example not finding input files during ededup when a file, not a directory, is specified

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

To be able to run the example notebook successfully.

Reproduction script

Download zip from data-prep-kit repo into /Users/dawood/Downloads/data-prep-kit-dev.zip

git clone ...
cd data-prep-kit/examples
make venv
make jupyter

Edit notebook

zip_input_folder = "/Users/dawood/Downloads/data-prep-kit-dev.zip"

Run through to ededup section to get

13:12:31 INFO - Running locally
13:12:31 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
13:12:31 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
13:12:31 INFO - data factory data_ max_files -1, n_sample -1
13:12:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
13:12:31 INFO - number of workers 3 worker options {'num_cpus': 0.8}
13:12:31 INFO - pipeline id pipeline_id; number workers 3
13:12:31 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:12:31 INFO - code location None
13:12:31 INFO - actor creation delay 0
2024-05-14 13:12:33,666	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(orchestrate pid=23909) 13:12:34 INFO - orchestrator started at 2024-05-14 13:12:34
(orchestrate pid=23909) 13:12:34 ERROR - No input files to process - exiting
13:12:44 INFO - Completed execution in 0.21483153502146404 min, execution result 0

Anything else

In the end, this was a user error, in that I set input_folder to the name of the zip file. Perhaps add a check to make sure input_folder is a directory and not a file?

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] Wrong Link in Readme file

Search before asking

  • I searched the issues and found no similar issues.

Component

Documentation

What happened + What you expected to happen

In the Readme file of https://github.com/IBM/data-prep-lab/tree/dev/transforms , the link for "transforms.version" is broken. I think the link should go to the ".make.versions" in the repo root directory, correct?

Reproduction script

The problematic line: The transform versions are managed in the repo root named .make.versions.

Anything else

No response

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Build a transform to remove headers from code files

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

Code files often have headers. These do not contain information relevant to LLMs and may also contain PII. We want to build a new transform to remove this header information from code files. This transform should be built in such a way that it can work across 300+ programming languages. One possible approach is for the transform to take as input a configuration file with programming language names and the characters used for comments in each language. It should then identify the header information in the various programming languages specified in the input configuration file and edit the files to remove it. A hypothetical sketch of this idea follows.
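A hypothetical Python sketch of the configuration-driven approach (the comment-prefix map and function are illustrative assumptions, not an existing transform):

COMMENT_PREFIX = {"Python": "#", "Java": "//", "C": "//"}  # assumed config contents

def strip_header(contents: str, language: str) -> str:
    prefix = COMMENT_PREFIX.get(language)
    if prefix is None:
        return contents
    lines = contents.splitlines()
    i = 0
    # Skip the leading block of comment/blank lines that typically holds the header.
    while i < len(lines) and (not lines[i].strip() or lines[i].lstrip().startswith(prefix)):
        i += 1
    return "\n".join(lines[i:])

sample = "# Copyright 2024 Example Corp\n# Author: Jane Doe\n\ndef main():\n    pass\n"
print(strip_header(sample, "Python"))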

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Add license filtering for code modules

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/code/code_quality

Feature

Capability to filter by permissive licenses for any new code data as a new module.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] Extend ability to recognise PL for JCL/PLI

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

Add JCL/PLI to the list of languages that can be recognised by ingest2parquet module

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

ingest2parquet Makefile needs new run-*-sample targets similar to transforms

At least

run-local-sample
run-s3-sample

Like the transforms, the run-s3-sample should start and load the minio server and then ask the user to stop the minio server.
Starting/loading/stopping minio should be done with the following make targets, like the transforms

minio-start - start and load test-data/input
minio-stop - stops the server

.make.defaults has support for these.

Github doc pages hyperlinks and formatting issues

The hyperlinks on this doc page end up downloading the .py files instead of navigating the browser to the specific code file in the repo: https://ibm.github.io/data-prep-lab/data-processing-lib/doc/overview/

Same issue exists for this page and probably other such pages: https://ibm.github.io/data-prep-lab/data-processing-lib/doc/transform-tutorials/

The numbered list on this page is not correctly formatted, and the PowerPoint diagram looks weird with grammar-mistake highlights: https://ibm.github.io/data-prep-lab/data-processing-lib/doc/architecture/

[Bug] Issues with Ray execution on the M1 CPU

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/kfp

What happened + What you expected to happen

In this release, we moved from the CodeFlare SDK to the RayAPIServer, and I observe different error/warning messages in the Ray logs. See below.
The messages may stem from wrong API parameters or from the internal RayAPIServer implementation.

From the RayAPIServer Pod logs:

  • W0504 07:30:13.041268 1 interceptor.go:17] Get compute template failure: NotFoundError: Compute template noop-kfp--78783-head-template not found: configmaps "noop-kfp--78783-head-template" not found. (It looks like the server tries to access the template before it was created)

  • W0504 07:56:26.660498 1 warnings.go:70] unknown field "spec.headGroupSpec.template.metadata.creationTimestamp"
    W0504 07:56:26.660565 1 warnings.go:70] unknown field "spec.workerGroupSpecs[0].template.metadata.creationTimestamp"
    W0504 07:56:26.660585 1 warnings.go:70] unknown field "status.desiredCPU"
    W0504 07:56:26.660599 1 warnings.go:70] unknown field "status.desiredGPU"
    W0504 07:56:26.660630 1 warnings.go:70] unknown field "status.desiredMemory"
    W0504 07:56:26.660648 1 warnings.go:70] unknown field "status.desiredTPU"
    W0504 07:56:26.680745 1 cluster_server.go:43] Failed to get cluster's event, cluster: kubeflow/noop-kfp--1d2d3, err: No Event with RayCluster name noop-kfp--1d2d3

  • I0504 07:57:47.189239 1 interceptor.go:14] /proto.RayJobSubmissionService/SubmitRayJob handler starting
    {"level":"info","v":0,"logger":"jobsubmissionservice","message":"RayJobSubmissionService submit job"}
    [controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
    Detected at:

    goroutine 1775 [running]:
    runtime/debug.Stack()
    /usr/lib/golang/src/runtime/debug/stack.go:24 +0x65
    sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/log.go:60 +0xcd
    sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithValues(0xc00042e3c0, {0x0, 0x0, 0x0})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:168 +0x54
    github.com/go-logr/logr.Logger.WithValues(...)
    /opt/app-root/src/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:323
    sigs.k8s.io/controller-runtime/pkg/log.FromContext({0x1d4b538?, 0xc000a86030?}, {0x0, 0x0, 0x0})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/log.go:98 +0xfd
    github.com/ray-project/kuberay/ray-operator/controllers/ray/utils.(*RayDashboardClient).SubmitJobReq(0xc000751100, {0x1d4b538, 0xc000a86030}, 0xc000ac96f0?, 0x0)
    /workspace/ray-operator/controllers/ray/utils/dashboard_httpclient.go:299 +0x91
    github.com/ray-project/kuberay/apiserver/pkg/server.(*RayJobSubmissionServiceServer).SubmitRayJob(0xc000164090, {0x1d4b538, 0xc000a86030}, 0xc00049a0f0)
    /workspace/apiserver/pkg/server/ray_job_submission_service_server.go:89 +0x484
    github.com/ray-project/kuberay/proto/go_client._RayJobSubmissionService_SubmitRayJob_Handler.func1({0x1d4b538, 0xc000a86030}, {0x1975500?, 0xc00049a0f0})
    /workspace/proto/go_client/job_submission_grpc.pb.go:166 +0x78
    github.com/ray-project/kuberay/apiserver/pkg/interceptor.ApiServerInterceptor({0x1d4b538, 0xc000a86030}, {0x1975500, 0xc00049a0f0}, 0xc00044e6e0, 0xc0008
    > .....

  • A successfully finished Ray job returns:
    > 00:59:16 INFO - Exception running ray remote orchestration
    Initialization failure from server:
    Traceback (most recent call last):
    File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 711, in Datapath
    raise RuntimeError(
    RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

There are no errors in ray_client_server_23000.err, but in ray_client_server.err we can see some info:

ray_client_server.err.zip

Reproduction script

Run the noop pipeline and check the Ray server logs

Anything else

No response

OS

Other

Python

3.11

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] "Simplest Transform Tutorial" hard-codes IBM Cloud information

Search before asking

  • I searched the issues and found no similar issues.

Component

Documentation

What happened + What you expected to happen

Running the "simplest" tutorial assumes IBM Cloud:

python noop_main.py --noop_sleep_msec 2 \
  --run_locally True  \
  --s3_cred "{'access_key': 'KEY', 'secret_key': 'SECRET', 'cos_url': 'https://s3.us-east.cloud-object-storage.appdomain.cloud'}" \
  --s3_config "{'input_folder': 'cos-optimal-llm-pile/test/david/input/', 'output_folder': 'cos-optimal-llm-pile/test/david/output/'}"

Can you provide some dummy test data in the repo and use that instead? What would be the command-line flags for local storage?

It's fine with me to leave in an S3 (true S3) example for those users so equipped, but otherwise, this is hardly a "local" example and I'm currently blocked.

I would try submitting a PR, but I'm not sure what the changed CLI flags would be for local use.

Reproduction script

N/A

Anything else

No response

OS

MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] KFP Workflow step is green although task failed

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

What happened + What you expected to happen

The code_quality task is failing, but the KFP GUI shows the execution step as green, indicating that the step completed successfully.


Reproduction script

Run make PYTHON=python3.10 build && make PYTHON=python3.10 test under the kfp/transform_workflows/code/code_quality directory.

Anything else

No response

OS

Windows WSL

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Failed Ray server left orphan templates [Bug]

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/kfp

What happened + What you expected to happen

When Ray cluster creation fails, its template is not removed.

Reproduction script

Run e.g. a noop pipeline and set the number of replicas to 1 (with min_replicas = 2), so the Ray cluster creation fails with the message:

raise RuntimeError(f"min_replicas {min_replicas} is can't be greater then replicas {replicas} ")
RuntimeError: min_replicas 2 is can't be greater then replicas 1

Another question: why do we need to specify replicas, min_replicas, and max_replicas?

Anything else

No response

OS

Other

Python

3.11

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Clarification in the simple_transform_pipeline.md file

Search before asking

  • I searched the issues and found no similar issues.

Component

KFP workflows

Feature

I think many first-time users of KFP will start from the simple_transform_pipeline.md file without having read the kind readme file, in which manual installation of software like helm and kind is mentioned. So, when they get to preparing the Kind cluster in this file, they think make setup will install the prerequisite software too. I suggest adding a link in the simple_transform_pipeline.md file that takes the user to the kind readme file, like this:

Assuming you have pre-installed the software components specified here (a link to the kind readme file), you can create a Kind cluster with all required software installed using the following command: make setup

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Fix the github doc pages formatting

Several pages do not seem to render properly at the moment, e.g., https://ibm.github.io/data-prep-lab/transforms

Also, section headings with a trailing ":" look weird when you see the table of contents. The trailing ":" should be removed from section headings.

The attached screenshots (transforms_page_bottom, transforms_page) show what I see in Safari and Chrome; one other person also confirmed that this happens.

[Feature] Change the Kind cluster ingress configuration

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

The current ingress configuration is that MinIO takes the default root URL path. KFP requests should start from kfp/. The trailing slash is important; without it, Nginx forwards user requests to MinIO, which leads to user confusion.
From a user perspective, the KFP entry point is not just important, it's the main entry point. This emphasis guides users and enhances the overall usage of the system.

Note: nginx has the Ray API Server ingress as well.
Note: the changes require updates in the installation scripts and in the documentation.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] make help giving empty output

Search before asking

  • I searched the issues and found no similar issues.

Component

Makefiles

What happened + What you expected to happen

On RHEL 9.4, running make help at the top of the repo gives no results.

Reproduction script

make help

Anything else

No response

OS

Red Hat Enterprise Linux (RHEL)

Python

Other

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Feature] New transform to do PII detection on code files

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

PII detection and redaction is an important step in preparing data for LLMs. The request is to build a new transform that can perform PII detection on code entities.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

[Bug] ingest2parquet issuing ERROR messages, but WARNING is preferred.

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

Testing of ingest2parquet shows lots of ERROR messages, but does not fail the test.

Reproduction script

Lots of error messages without failing the run of ingest2parquet_local.py. Perhaps these can be changed to WARNINGs?

cd tools/ingest2parquet
make venv
make test-src

gets

...
Executing: python src/ingest2parquet_local.py
13:02:37 INFO - data factory data_ is using local data accessinput_folder - /home/dawood/git/fm-data-engineering/tools/ingest2parquet/test-data/input output_folder - /home/dawood/git/fm-data-engineering/tools/ingest2parquet/src/../test-data/output
13:02:37 INFO - data factory data_ max_files -1, n_sample -1
13:02:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.zip']
Number of files is 2 
filepath /home/dawood/git/fm-data-engineering/tools/ingest2parquet/src/utils/lang_extensions.json
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x80 in position 11: invalid start byte
 skipping environments-master/cfortunes/diebenkorn_notes.dat Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xc3 in position 7: invalid continuation byte
 skipping environments-master/cfortunes/obliquestrategies.dat Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xfc in position 10: invalid start byte
 skipping application-java/lib/application-java.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xe5 in position 14: invalid continuation byte
 skipping application-java/lib/fabric-gateway-java-2.1.1.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xf9 in position 10: invalid start byte
 skipping application-java/lib/fabric-sdk-java-2.1.1.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-protobuf-1.23.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xe1 in position 10: invalid continuation byte
 skipping application-java/lib/protobuf-java-util-3.10.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xaa in position 11: invalid start byte
 skipping application-java/lib/api-common-1.9.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xba in position 25: invalid start byte
 skipping environments-master/commands/grel Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode bytes in position 40-41: invalid continuation byte
 skipping environments-master/commands/ldid Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xb7 in position 10: invalid start byte
 skipping application-java/lib/milagro-crypto-java-0.4.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-stub-1.23.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-netty-1.23.0.jar Error: No contents decoded
output_file_name /home/dawood/git/fm-data-engineering/tools/ingest2parquet/src/../test-data/output/https___github.com_00000o1_environments_archive_refs_heads_master.parquet
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-core-1.23.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-protobuf-lite-1.23.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-api-1.23.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xfe in position 50: invalid start byte
 skipping application-java/lib/guava-29.0-jre.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xfe in position 50: invalid start byte
 skipping application-java/lib/failureaccess-1.0.1.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xf3 in position 50: invalid continuation byte
 skipping application-java/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xe7 in position 12: invalid continuation byte
 skipping application-java/lib/perfmark-api-0.17.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xf0 in position 10: invalid continuation byte
 skipping application-java/lib/jsr305-3.0.2.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xac in position 10: invalid start byte
 skipping application-java/lib/checker-qual-2.11.1.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x82 in position 12: invalid start byte
 skipping application-java/lib/error_prone_annotations-2.3.4.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x99 in position 53: invalid start byte
 skipping application-java/lib/j2objc-annotations-1.3.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xf6 in position 10: invalid start byte
 skipping application-java/lib/cloudant-client-2.19.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x9d in position 89: invalid start byte
 skipping application-java/lib/netty-tcnative-boringssl-static-2.0.30.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa8 in position 10: invalid start byte
 skipping application-java/lib/netty-codec-http2-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa7 in position 10: invalid start byte
 skipping application-java/lib/protobuf-java-3.10.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x9c in position 11: invalid start byte
 skipping application-java/lib/bcpkix-jdk15on-1.62.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xc5 in position 10: invalid continuation byte
 skipping application-java/lib/httpclient-4.5.12.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/commons-logging-1.2.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xaa in position 14: invalid start byte
 skipping application-java/lib/commons-cli-1.4.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xca in position 14: invalid continuation byte
 skipping application-java/lib/commons-compress-1.20.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xf5 in position 10: invalid start byte
 skipping application-java/lib/cloudant-http-2.19.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xcf in position 15: invalid continuation byte
 skipping application-java/lib/commons-io-2.6.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xc5 in position 10: invalid continuation byte
 skipping application-java/lib/apache-log4j-extras-1.2.17.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa6 in position 12: invalid start byte
 skipping application-java/lib/log4j-1.2.17.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xfd in position 10: invalid start byte
 skipping application-java/lib/futures-extra-4.2.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xb3 in position 10: invalid start byte
 skipping application-java/lib/javax.json-1.1.4.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xe7 in position 10: invalid continuation byte
 skipping application-java/lib/snakeyaml-1.26.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x97 in position 10: invalid start byte
 skipping application-java/lib/jaxb-api-2.3.1.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xc9 in position 10: invalid continuation byte
 skipping application-java/lib/javax.annotation-api-1.3.2.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/gson-2.8.5.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xaa in position 10: invalid start byte
 skipping application-java/lib/commons-codec-1.11.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xfb in position 10: invalid start byte
 skipping application-java/lib/netty-handler-proxy-4.1.38.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x88 in position 11: invalid start byte
 skipping application-java/lib/proto-google-common-protos-1.12.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x88 in position 10: invalid start byte
 skipping application-java/lib/netty-codec-http-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
 skipping application-java/lib/netty-handler-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xb2 in position 10: invalid start byte
 skipping application-java/lib/netty-codec-socks-4.1.38.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
 skipping application-java/lib/netty-codec-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
 skipping application-java/lib/netty-transport-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
 skipping application-java/lib/netty-buffer-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte
 skipping application-java/lib/netty-resolver-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xcc in position 10: invalid continuation byte
 skipping application-java/lib/netty-common-4.1.49.Final.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x9b in position 11: invalid start byte
 skipping application-java/lib/bcprov-jdk15on-1.62.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x94 in position 16: invalid start byte
 skipping application-java/lib/httpcore-4.4.13.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xac in position 10: invalid start byte
 skipping application-java/lib/auto-value-annotations-1.7.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa5 in position 89: invalid start byte
 skipping application-java/lib/commons-math3-3.6.1.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa8 in position 10: invalid start byte
 skipping application-java/lib/javax.activation-api-1.2.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x9b in position 11: invalid start byte
 skipping application-java/lib/annotations-4.1.1.4.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x86 in position 10: invalid start byte
 skipping application-java/lib/opencensus-contrib-grpc-metrics-0.21.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte
 skipping application-java/lib/opencensus-api-0.21.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte
 skipping application-java/lib/grpc-context-1.23.0.jar Error: No contents decoded
13:02:37 ERROR - Error -> 'utf-8' codec can't decode byte 0x8e in position 10: invalid start byte
 skipping application-java/lib/animal-sniffer-annotations-1.17.jar Error: No contents decoded
output_file_name /home/dawood/git/fm-data-engineering/tools/ingest2parquet/src/../test-data/output/application-java.parquet
processing stats generated {'total_files_given': 2, 'total_files_processed': 2, 'total_files_failed_to_processed': 0, 'total_no_of_rows': 54, 'total_bytes_in_memory': 79661, 'failure_details': []}
Metadata file stored - response: {'name': '/home/dawood/git/fm-data-engineering/tools/ingest2parquet/src/../test-data/output/metadata.json', 'size': 445}
[dawood@data-engineering1 ingest2parquet]$ 

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
