[Bug]: RAY error about data-juicer HOT 6 CLOSED

modelscope commented on May 18, 2024

[Bug]: RAY error

from data-juicer.

Comments (6)

HYLcool commented on May 18, 2024

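Hi @simplew2011,

Please first check that the model required by the language_id_score_filter operator was downloaded to your local machine successfully, and verify its integrity and correctness. Models are stored in ~/.cache/data_juicer/models by default. The model this operator needs should be the lid.176.bin file in that directory; its size is 131,266,198 bytes and its md5 is 01810bc59c6a3d2b79c79e6336612f65.

If you find a problem with the model, delete the affected file and run dj again; it will download the model automatically (this may take some time).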
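
The check described above can be scripted. The following is a minimal sketch, assuming the default model directory and the size and md5 quoted in this comment; the helper names `file_md5` and `check_model` are hypothetical and are not part of data-juicer.

```python
import hashlib
from pathlib import Path

# Values quoted in the comment above; the path is the stated default location.
EXPECTED_SIZE = 131_266_198
EXPECTED_MD5 = "01810bc59c6a3d2b79c79e6336612f65"
DEFAULT_MODEL_PATH = Path.home() / ".cache" / "data_juicer" / "models" / "lid.176.bin"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_model(path=DEFAULT_MODEL_PATH,
                expected_size=EXPECTED_SIZE,
                expected_md5=EXPECTED_MD5):
    """Return a list of problems; an empty list means the file looks intact."""
    path = Path(path)
    if not path.is_file():
        return [f"missing file: {path}"]
    problems = []
    size = path.stat().st_size
    if size != expected_size:
        problems.append(f"size mismatch: got {size}, expected {expected_size}")
    md5 = file_md5(path)
    if md5 != expected_md5:
        problems.append(f"md5 mismatch: got {md5}, expected {expected_md5}")
    return problems


if __name__ == "__main__":
    issues = check_model()
    print("model looks intact" if not issues else "\n".join(issues))
```

If the script reports a mismatch, deleting the file and rerunning dj, as suggested above, should trigger a fresh download.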

simplew2011 commented on May 18, 2024

I deleted the file manually, let it download again automatically, and reran.

ok

python tools/process_data.py --config configs/demo/process.yaml

error

python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

simplew2011 commented on May 18, 2024

Would it be more efficient to use dataset.map_batches?

simplew2011 commented on May 18, 2024

perplexity_filter: this does not seem to work in RAY mode either

AttributeError: 'NoneType' object has no attribute 'score'

HYLcool commented on May 18, 2024

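Hi @simplew2011,

We are fixing the problem that these model-dependent OPs are unavailable in RAY mode in #100. It should work once that passes review and is merged into the main branch; we will let you know then.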

simplew2011 commented on May 18, 2024

It still does not seem to work. Could you verify this, @HYLcool? The config is the default one:
python tools/process_data.py --config demos/process_on_ray/configs/demo.yaml

outputs.zip

2023-12-12 11:07:12.314 | INFO     | data_juicer.core.ray_executor:run:62 - Processing data...
2023-12-12 11:07:20.569 | INFO     | data_juicer.core.ray_executor:run:83 - Op [alphanumeric_filter] Done. Left 11 samples.
2023-12-12 11:07:20.915 | INFO     | data_juicer.core.ray_executor:run:83 - Op [average_line_length_filter] Done. Left 10 samples.
2023-12-12 11:07:21.632 | INFO     | data_juicer.core.ray_executor:run:83 - Op [character_repetition_filter] Done. Left 10 samples.
2023-12-12 11:07:22.428 | INFO     | data_juicer.core.ray_executor:run:83 - Op [flagged_words_filter] Done. Left 10 samples.
2023-12-12 11:07:23.321 | INFO     | data_juicer.core.ray_executor:run:83 - Op [language_id_score_filter] Done. Left 3 samples.
2023-12-12 11:07:24.115 | INFO     | data_juicer.core.ray_executor:run:83 - Op [maximum_line_length_filter] Done. Left 3 samples.
2023-12-12 11:07:24.898 | INFO     | data_juicer.core.ray_executor:run:83 - Op [perplexity_filter] Done. Left 3 samples.
2023-12-12 11:07:25.818 | INFO     | data_juicer.core.ray_executor:run:83 - Op [special_characters_filter] Done. Left 3 samples.
2023-12-12 11:07:26.631 | INFO     | data_juicer.core.ray_executor:run:83 - Op [stopwords_filter] Done. Left 3 samples.
2023-12-12 11:07:27.464 | INFO     | data_juicer.core.ray_executor:run:83 - Op [text_length_filter] Done. Left 3 samples.
2023-12-12 11:07:28.243 | INFO     | data_juicer.core.ray_executor:run:83 - Op [words_num_filter] Done. Left 1 samples.
2023-12-12 11:07:29.052 | INFO     | data_juicer.core.ray_executor:run:83 - Op [word_repetition_filter] Done. Left 1 samples.
2023-12-12 11:07:29.053 | INFO     | data_juicer.core.ray_executor:run:87 - Exporting dataset to disk...
2023-12-12 11:07:31.917 | ERROR    | __main__:<module>:19 - An error has been caught in function '<module>', process 'MainProcess' (41651), thread 'MainThread' (140511941588800):
Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 447, in ray._raylet.check_status

ray.exceptions.ObjectRefStreamEndOfStreamError


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_data_ready
    meta = ray.get(next(self._streaming_gen))
           │   │        │    └ <ray._raylet.StreamingObjectRefGenerator object at 0x7fc8746a4ca0>
           │   │        └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
           │   └ <function get at 0x7fc88c7e7820>
           └ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
  File "python/ray/_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 353, in ray._raylet.StreamingObjectRefGenerator._next_sync

StopIteration


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

> File "tools/process_data.py", line 19, in <module>
    main()
    └ <function main at 0x7fcb7b0f14c0>

  File "tools/process_data.py", line 15, in main
    executor.run()
    │        └ <function RayExecutor.run at 0x7fc88c7e7dc0>
    └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>

  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 88, in run
    dataset.write_json(self.cfg.export_path, force_ascii=False)
    │       │          │    │   └ './outputs/demo/demo-processed'
    │       │          │    └ Namespace(add_suffix=False, alphanumeric_filter=Namespace(image_key=None, max_ratio=9223372036854775807, min_ratio=0.25, text...
    │       │          └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>
    │       └ <function Dataset.write_json at 0x7fc88c4b61f0>
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...

  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2821, in write_json
    self.write_datasource(
    │    └ <function Dataset.write_datasource at 0x7fc88c4b6940>
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 3457, in write_datasource
    self._write_ds = Dataset(plan, logical_plan).materialize()
    │    │           │       │     └ <ray.data._internal.logical.interfaces.logical_plan.LogicalPlan object at 0x7fc8747ade20>
    │    │           │       └ ExecutionPlan(dataset_uuid=ae6c733618d049d495d01de1f9bd255e, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
    │    │           └ <class 'ray.data.dataset.Dataset'>
    │    └ None
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4502, in materialize
    copy._plan.execute(force_read=True)
    │    │     └ <function ExecutionPlan.execute at 0x7fc88c5485e0>
    │    └ ExecutionPlan(dataset_uuid=9cc683010ce54ac1b990d2aacc2f72af, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
    └ Write
      +- MaterializedDataset(
            num_blocks=192,
            num_rows=1,
            schema={
               text: string,
               _...: st...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 599, in execute
    blocks = execute_to_legacy_block_list(
             └ <function execute_to_legacy_block_list at 0x7fc88c55a3a0>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
                 │                      └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
                 └ <function _bundles_to_block_list at 0x7fc88c55a700>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 356, in _bundles_to_block_list
    for ref_bundle in bundles:
        │             └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
        └ RefBundle(blocks=((ObjectRef(f1e4ccbdc9f0fac3ffffffffffffffffffffffff0500000002000000), BlockMetadata(num_rows=1, size_bytes=...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
           │    └ <function StreamingExecutor.execute.<locals>.StreamIterator.get_next at 0x7fc87475f700>
           └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 141, in get_next
    raise item
          └ RayTaskError(FileNotFoundError)(FileNotFoundError(2, "Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4a...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 201, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
          │    │                     │    │                  │    └ True
          │    │                     │    │                  └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
          │    │                     │    └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
          │    │                     └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
          │    └ <function StreamingExecutor._scheduling_loop_step at 0x7fc88c519790>
          └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 252, in _scheduling_loop_step
    process_completed_tasks(topology, self._backpressure_policies)
    │                       │         │    └ []
    │                       │         └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
    │                       └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
    └ <function process_completed_tasks at 0x7fc88c519160>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 365, in process_completed_tasks
    num_blocks_read = task.on_data_ready(
                      │    └ <function DataOpTask.on_data_ready at 0x7fc88c7050d0>
                      └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_data_ready
    ex = ray.get(block_ref)
         │   │   └ ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000)
         │   └ <function get at 0x7fc88c7e7820>
         └ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           │   │       └ {}
           │   └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
           └ <function get at 0x7fc88c856790>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
           └ <function get at 0x7fc88c856700>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
          │     └ <function RayTaskError.as_instanceof_cause at 0x7fc88cec3700>
          └ RayTaskError('ray.data._internal.execution.operators.map_operator._map_task', 'Traceback (most recent call last):\n  File "py...

ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Write() (pid=42507, ip=10.23.4.252)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 416, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 232, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_write_op.py", line 27, in fn
    {"write_result": [datasource.write(blocks, ctx, **write_args)]}
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 289, in write
    with _open_file_with_retry(
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 881, in _open_file_with_retry
    raise e from None
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 863, in _open_file_with_retry
    return open_file()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 291, in <lambda>
    lambda: fs.open_output_stream(write_path, **open_stream_args),
  File "pyarrow/_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4af1a1c63c57d3dc2875_000000_000000.json'. Detail: [errno 2] No such file or directory
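
The traceback above fails while opening a file under the relative path './outputs/demo/demo-processed'. One possible cause, not confirmed anywhere in this thread, is that the Ray worker resolves that relative path against a working directory different from the driver's, or that the directory does not exist on the worker's filesystem. The sketch below shows a workaround under that assumption; `prepare_export_dir` is a hypothetical helper, not a data-juicer or Ray function.

```python
import os


def prepare_export_dir(export_path):
    """Resolve export_path to an absolute path and create the directory.

    Assumption: the write fails because a relative path is resolved against
    a different working directory than the one the job was launched from.
    """
    abs_path = os.path.abspath(os.path.expanduser(export_path))
    os.makedirs(abs_path, exist_ok=True)
    return abs_path


# Hypothetical usage, mirroring the call shown in the traceback:
#   dataset.write_json(prepare_export_dir(cfg.export_path), force_ascii=False)
```

On a multi-node cluster this only helps if the resolved directory is on storage that every worker can reach, such as a shared filesystem.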
