Comments (6)
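Please first check whether the model required by the language_id_score_filter operator has been downloaded to your local machine, and confirm that it is complete and correct. The default model directory is ~/.cache/data_juicer/models, and the model this operator needs should be the lid.176.bin file in that directory. Its size is 131,266,198 bytes and its md5 is 01810bc59c6a3d2b79c79e6336612f65.
If the model turns out to be faulty, delete the problem file and run dj again; it will download the model automatically (this may take some time).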
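The size and md5 check described in this comment can be scripted. The following is a minimal sketch using only the Python standard library; the expected size, md5 and default path are the values quoted in the comment, while the helper names `file_md5` and `check_model` are made up for illustration and are not part of data-juicer.

```python
import hashlib
from pathlib import Path

# Values quoted in the comment above.
EXPECTED_SIZE = 131_266_198
EXPECTED_MD5 = "01810bc59c6a3d2b79c79e6336612f65"
DEFAULT_MODEL_PATH = (
    Path.home() / ".cache" / "data_juicer" / "models" / "lid.176.bin"
)


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_model(path=DEFAULT_MODEL_PATH,
                expected_size=EXPECTED_SIZE,
                expected_md5=EXPECTED_MD5):
    """Hypothetical helper: return (ok, reason) for the model file."""
    path = Path(path)
    if not path.is_file():
        return False, f"missing: {path}"
    size = path.stat().st_size
    if size != expected_size:
        return False, f"size mismatch: {size} != {expected_size}"
    md5 = file_md5(path)
    if md5 != expected_md5:
        return False, f"md5 mismatch: {md5} != {expected_md5}"
    return True, "ok"


if __name__ == "__main__":
    ok, reason = check_model()
    print("model check:", reason)
```

If the check reports a mismatch, deleting the file and re-running dj, as suggested above, should trigger a fresh download.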
from data-juicer.
I deleted the file manually, let it download again automatically, and re-ran:
ok
python tools/process_data.py --config configs/demo/process.yaml
error
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
from data-juicer.
- https://github.com/alibaba/data-juicer/blob/main/data_juicer/core/ray_executor.py#L67
- https://docs.ray.io/en/master/data/api/doc/ray.data.Dataset.map_batches.html#ray.data.Dataset.map_batches
Consider whether using dataset.map_batches would be more efficient.
from data-juicer.
- perplexity_filter: this does not seem to work in Ray mode either
AttributeError: 'NoneType' object has no attribute 'score'
from data-juicer.
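We are fixing the problem that these model-dependent OPs are unavailable in Ray mode in #100. Once it passes review and is merged into the main branch, it should work, and we will let you know then.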
from data-juicer.
It still does not seem to work. Could you verify this, @HYLcool? The config is the default one:
python tools/process_data.py --config demos/process_on_ray/configs/demo.yaml
2023-12-12 11:07:12.314 | INFO | data_juicer.core.ray_executor:run:62 - Processing data...
2023-12-12 11:07:20.569 | INFO | data_juicer.core.ray_executor:run:83 - Op [alphanumeric_filter] Done. Left 11 samples.
2023-12-12 11:07:20.915 | INFO | data_juicer.core.ray_executor:run:83 - Op [average_line_length_filter] Done. Left 10 samples.
2023-12-12 11:07:21.632 | INFO | data_juicer.core.ray_executor:run:83 - Op [character_repetition_filter] Done. Left 10 samples.
2023-12-12 11:07:22.428 | INFO | data_juicer.core.ray_executor:run:83 - Op [flagged_words_filter] Done. Left 10 samples.
2023-12-12 11:07:23.321 | INFO | data_juicer.core.ray_executor:run:83 - Op [language_id_score_filter] Done. Left 3 samples.
2023-12-12 11:07:24.115 | INFO | data_juicer.core.ray_executor:run:83 - Op [maximum_line_length_filter] Done. Left 3 samples.
2023-12-12 11:07:24.898 | INFO | data_juicer.core.ray_executor:run:83 - Op [perplexity_filter] Done. Left 3 samples.
2023-12-12 11:07:25.818 | INFO | data_juicer.core.ray_executor:run:83 - Op [special_characters_filter] Done. Left 3 samples.
2023-12-12 11:07:26.631 | INFO | data_juicer.core.ray_executor:run:83 - Op [stopwords_filter] Done. Left 3 samples.
2023-12-12 11:07:27.464 | INFO | data_juicer.core.ray_executor:run:83 - Op [text_length_filter] Done. Left 3 samples.
2023-12-12 11:07:28.243 | INFO | data_juicer.core.ray_executor:run:83 - Op [words_num_filter] Done. Left 1 samples.
2023-12-12 11:07:29.052 | INFO | data_juicer.core.ray_executor:run:83 - Op [word_repetition_filter] Done. Left 1 samples.
2023-12-12 11:07:29.053 | INFO | data_juicer.core.ray_executor:run:87 - Exporting dataset to disk...
2023-12-12 11:07:31.917 | ERROR | __main__:<module>:19 - An error has been caught in function '<module>', process 'MainProcess' (41651), thread 'MainThread' (140511941588800):
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
File "python/ray/_raylet.pyx", line 4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
File "python/ray/_raylet.pyx", line 447, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_data_ready
meta = ray.get(next(self._streaming_gen))
│ │ │ └ <ray._raylet.StreamingObjectRefGenerator object at 0x7fc8746a4ca0>
│ │ └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
│ └ <function get at 0x7fc88c7e7820>
└ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
File "python/ray/_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
File "python/ray/_raylet.pyx", line 353, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
> File "tools/process_data.py", line 19, in <module>
main()
└ <function main at 0x7fcb7b0f14c0>
File "tools/process_data.py", line 15, in main
executor.run()
│ └ <function RayExecutor.run at 0x7fc88c7e7dc0>
└ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 88, in run
dataset.write_json(self.cfg.export_path, force_ascii=False)
│ │ │ │ └ './outputs/demo/demo-processed'
│ │ │ └ Namespace(add_suffix=False, alphanumeric_filter=Namespace(image_key=None, max_ratio=9223372036854775807, min_ratio=0.25, text...
│ │ └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>
│ └ <function Dataset.write_json at 0x7fc88c4b61f0>
└ Dataset(
num_blocks=192,
num_rows=1,
schema={
text: string,
__dj...: struct<alnum_ratio: double, avg_lin...
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2821, in write_json
self.write_datasource(
│ └ <function Dataset.write_datasource at 0x7fc88c4b6940>
└ Dataset(
num_blocks=192,
num_rows=1,
schema={
text: string,
__dj...: struct<alnum_ratio: double, avg_lin...
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 3457, in write_datasource
self._write_ds = Dataset(plan, logical_plan).materialize()
│ │ │ │ └ <ray.data._internal.logical.interfaces.logical_plan.LogicalPlan object at 0x7fc8747ade20>
│ │ │ └ ExecutionPlan(dataset_uuid=ae6c733618d049d495d01de1f9bd255e, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
│ │ └ <class 'ray.data.dataset.Dataset'>
│ └ None
└ Dataset(
num_blocks=192,
num_rows=1,
schema={
text: string,
__dj...: struct<alnum_ratio: double, avg_lin...
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4502, in materialize
copy._plan.execute(force_read=True)
│ │ └ <function ExecutionPlan.execute at 0x7fc88c5485e0>
│ └ ExecutionPlan(dataset_uuid=9cc683010ce54ac1b990d2aacc2f72af, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
└ Write
+- MaterializedDataset(
num_blocks=192,
num_rows=1,
schema={
text: string,
_...: st...
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 599, in execute
blocks = execute_to_legacy_block_list(
└ <function execute_to_legacy_block_list at 0x7fc88c55a3a0>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
block_list = _bundles_to_block_list(bundles)
│ └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
└ <function _bundles_to_block_list at 0x7fc88c55a700>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 356, in _bundles_to_block_list
for ref_bundle in bundles:
│ └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
└ RefBundle(blocks=((ObjectRef(f1e4ccbdc9f0fac3ffffffffffffffffffffffff0500000002000000), BlockMetadata(num_rows=1, size_bytes=...
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
return self.get_next()
│ └ <function StreamingExecutor.execute.<locals>.StreamIterator.get_next at 0x7fc87475f700>
└ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 141, in get_next
raise item
└ RayTaskError(FileNotFoundError)(FileNotFoundError(2, "Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4a...
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 201, in run
while self._scheduling_loop_step(self._topology) and not self._shutdown:
│ │ │ │ │ └ True
│ │ │ │ └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
│ │ │ └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
│ │ └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
│ └ <function StreamingExecutor._scheduling_loop_step at 0x7fc88c519790>
└ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 252, in _scheduling_loop_step
process_completed_tasks(topology, self._backpressure_policies)
│ │ │ └ []
│ │ └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
│ └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
└ <function process_completed_tasks at 0x7fc88c519160>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 365, in process_completed_tasks
num_blocks_read = task.on_data_ready(
│ └ <function DataOpTask.on_data_ready at 0x7fc88c7050d0>
└ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_data_ready
ex = ray.get(block_ref)
│ │ └ ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000)
│ └ <function get at 0x7fc88c7e7820>
└ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
│ │ └ {}
│ └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
└ <function get at 0x7fc88c856790>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
│ │ └ {}
│ └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
└ <function get at 0x7fc88c856700>
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
│ └ <function RayTaskError.as_instanceof_cause at 0x7fc88cec3700>
└ RayTaskError('ray.data._internal.execution.operators.map_operator._map_task', 'Traceback (most recent call last):\n File "py...
ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Write() (pid=42507, ip=10.23.4.252)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 416, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 232, in __call__
yield from self._block_fn(input, ctx)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_write_op.py", line 27, in fn
{"write_result": [datasource.write(blocks, ctx, **write_args)]}
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 289, in write
with _open_file_with_retry(
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 881, in _open_file_with_retry
raise e from None
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 863, in _open_file_with_retry
return open_file()
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 291, in <lambda>
lambda: fs.open_output_stream(write_path, **open_stream_args),
File "pyarrow/_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4af1a1c63c57d3dc2875_000000_000000.json'. Detail: [errno 2] No such file or directory
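The traceback ends with a FileNotFoundError for a file under the relative export path './outputs/demo/demo-processed'. One possible workaround, assuming the cause is that this relative directory does not exist where the Ray write task runs (the thread does not confirm this), is to resolve the export path to an absolute path and create the directory before writing. The helper name `prepare_export_dir` below is hypothetical and is not part of data-juicer or Ray.

```python
import os


def prepare_export_dir(export_path):
    """Hypothetical helper: make the export directory absolute and ensure it exists.

    Assumption: the write fails because the relative directory is missing
    from the working directory of the process that opens the output file.
    """
    abs_path = os.path.abspath(os.path.expanduser(export_path))
    # exist_ok=True makes the call safe to repeat across runs.
    os.makedirs(abs_path, exist_ok=True)
    return abs_path


if __name__ == "__main__":
    # Same relative path as in the traceback above.
    print(prepare_export_dir("./outputs/demo/demo-processed"))
```

The returned path could then be passed in place of the relative one in the call shown in the traceback, `dataset.write_json(self.cfg.export_path, force_ascii=False)`. On a multi-node cluster a local directory is only visible on one machine, so this sketch applies to the single-machine case.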
from data-juicer.