Comments (6)

HYLcool commented on May 18, 2024

@simplew2011

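Please first check whether the model required by the language_id_score_filter operator was downloaded to your machine successfully, and verify that it is complete and correct. Models are stored in ~/.cache/data_juicer/models by default. The model this operator needs should be the lid.176.bin file in that directory; its size is 131,266,198 bytes and its md5 is 01810bc59c6a3d2b79c79e6336612f65.

If you find a problem with the model, delete the bad file and run dj again; it will download the model automatically (this may take some time).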
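The check described above can be scripted. The following is a minimal sketch using only the Python standard library; the path, size, and md5 are the values quoted in this comment, while the helper names `file_md5` and `check_model` are hypothetical and not part of data-juicer.

```python
import hashlib
import os

# Values quoted in the comment above.
EXPECTED_SIZE = 131_266_198
EXPECTED_MD5 = "01810bc59c6a3d2b79c79e6336612f65"
DEFAULT_MODEL_PATH = os.path.expanduser("~/.cache/data_juicer/models/lid.176.bin")


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_model(path=DEFAULT_MODEL_PATH,
                expected_size=EXPECTED_SIZE,
                expected_md5=EXPECTED_MD5):
    """Hypothetical helper: return (ok, message) for the model file at `path`."""
    if not os.path.isfile(path):
        return False, f"missing: {path}"
    size = os.path.getsize(path)
    if size != expected_size:
        return False, f"size mismatch: {size} != {expected_size}"
    md5 = file_md5(path)
    if md5 != expected_md5:
        return False, f"md5 mismatch: {md5} != {expected_md5}"
    return True, "ok"


if __name__ == "__main__":
    ok, message = check_model()
    print(message)
```

If the script reports anything other than `ok`, deleting the file and rerunning dj should trigger a fresh download, as described above.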

from data-juicer.

simplew2011 commented on May 18, 2024

image

I deleted the file manually, let it download again automatically, and reran.

This works:

python tools/process_data.py --config configs/demo/process.yaml

This fails:

python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

simplew2011 commented on May 18, 2024

Would using dataset.map_batches be more efficient?

simplew2011 commented on May 18, 2024
  • perplexity_filter: this doesn't seem to work in Ray mode either

AttributeError: 'NoneType' object has no attribute 'score'

HYLcool commented on May 18, 2024

@simplew2011

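We are fixing the problem of these model-dependent OPs being unavailable in Ray mode in #100. Once it passes review and is merged into the main branch, it should work, and we'll let you know then.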

simplew2011 commented on May 18, 2024

It still doesn't seem to work. Could you verify this, @HYLcool? The config is the default one:
python tools/process_data.py --config demos/process_on_ray/configs/demo.yaml

outputs.zip

2023-12-12 11:07:12.314 | INFO     | data_juicer.core.ray_executor:run:62 - Processing data...
2023-12-12 11:07:20.569 | INFO     | data_juicer.core.ray_executor:run:83 - Op [alphanumeric_filter] Done. Left 11 samples.
2023-12-12 11:07:20.915 | INFO     | data_juicer.core.ray_executor:run:83 - Op [average_line_length_filter] Done. Left 10 samples.
2023-12-12 11:07:21.632 | INFO     | data_juicer.core.ray_executor:run:83 - Op [character_repetition_filter] Done. Left 10 samples.
2023-12-12 11:07:22.428 | INFO     | data_juicer.core.ray_executor:run:83 - Op [flagged_words_filter] Done. Left 10 samples.
2023-12-12 11:07:23.321 | INFO     | data_juicer.core.ray_executor:run:83 - Op [language_id_score_filter] Done. Left 3 samples.
2023-12-12 11:07:24.115 | INFO     | data_juicer.core.ray_executor:run:83 - Op [maximum_line_length_filter] Done. Left 3 samples.
2023-12-12 11:07:24.898 | INFO     | data_juicer.core.ray_executor:run:83 - Op [perplexity_filter] Done. Left 3 samples.
2023-12-12 11:07:25.818 | INFO     | data_juicer.core.ray_executor:run:83 - Op [special_characters_filter] Done. Left 3 samples.
2023-12-12 11:07:26.631 | INFO     | data_juicer.core.ray_executor:run:83 - Op [stopwords_filter] Done. Left 3 samples.
2023-12-12 11:07:27.464 | INFO     | data_juicer.core.ray_executor:run:83 - Op [text_length_filter] Done. Left 3 samples.
2023-12-12 11:07:28.243 | INFO     | data_juicer.core.ray_executor:run:83 - Op [words_num_filter] Done. Left 1 samples.
2023-12-12 11:07:29.052 | INFO     | data_juicer.core.ray_executor:run:83 - Op [word_repetition_filter] Done. Left 1 samples.
2023-12-12 11:07:29.053 | INFO     | data_juicer.core.ray_executor:run:87 - Exporting dataset to disk...
2023-12-12 11:07:31.917 | ERROR    | __main__:<module>:19 - An error has been caught in function '<module>', process 'MainProcess' (41651), thread 'MainThread' (140511941588800):
Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 347, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4643, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 447, in ray._raylet.check_status

ray.exceptions.ObjectRefStreamEndOfStreamError


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_data_ready
    meta = ray.get(next(self._streaming_gen))
           │   │        │    └ <ray._raylet.StreamingObjectRefGenerator object at 0x7fc8746a4ca0>
           │   │        └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
           │   └ <function get at 0x7fc88c7e7820>
           └ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
  File "python/ray/_raylet.pyx", line 302, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 353, in ray._raylet.StreamingObjectRefGenerator._next_sync

StopIteration


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

> File "tools/process_data.py", line 19, in <module>
    main()
    └ <function main at 0x7fcb7b0f14c0>

  File "tools/process_data.py", line 15, in main
    executor.run()
    │        └ <function RayExecutor.run at 0x7fc88c7e7dc0>
    └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>

  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 88, in run
    dataset.write_json(self.cfg.export_path, force_ascii=False)
    │       │          │    │   └ './outputs/demo/demo-processed'
    │       │          │    └ Namespace(add_suffix=False, alphanumeric_filter=Namespace(image_key=None, max_ratio=9223372036854775807, min_ratio=0.25, text...
    │       │          └ <data_juicer.core.ray_executor.RayExecutor object at 0x7fc88e9177f0>
    │       └ <function Dataset.write_json at 0x7fc88c4b61f0>
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...

  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2821, in write_json
    self.write_datasource(
    │    └ <function Dataset.write_datasource at 0x7fc88c4b6940>
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 3457, in write_datasource
    self._write_ds = Dataset(plan, logical_plan).materialize()
    │    │           │       │     └ <ray.data._internal.logical.interfaces.logical_plan.LogicalPlan object at 0x7fc8747ade20>
    │    │           │       └ ExecutionPlan(dataset_uuid=ae6c733618d049d495d01de1f9bd255e, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
    │    │           └ <class 'ray.data.dataset.Dataset'>
    │    └ None
    └ Dataset(
         num_blocks=192,
         num_rows=1,
         schema={
            text: string,
            __dj...: struct<alnum_ratio: double, avg_lin...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4502, in materialize
    copy._plan.execute(force_read=True)
    │    │     └ <function ExecutionPlan.execute at 0x7fc88c5485e0>
    │    └ ExecutionPlan(dataset_uuid=9cc683010ce54ac1b990d2aacc2f72af, run_by_consumer=False, in_blocks=LazyBlockList(owned_by_consumer...
    └ Write
      +- MaterializedDataset(
            num_blocks=192,
            num_rows=1,
            schema={
               text: string,
               _...: st...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 599, in execute
    blocks = execute_to_legacy_block_list(
             └ <function execute_to_legacy_block_list at 0x7fc88c55a3a0>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
                 │                      └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
                 └ <function _bundles_to_block_list at 0x7fc88c55a700>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 356, in _bundles_to_block_list
    for ref_bundle in bundles:
        │             └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
        └ RefBundle(blocks=((ObjectRef(f1e4ccbdc9f0fac3ffffffffffffffffffffffff0500000002000000), BlockMetadata(num_rows=1, size_bytes=...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
           │    └ <function StreamingExecutor.execute.<locals>.StreamIterator.get_next at 0x7fc87475f700>
           └ <ray.data._internal.execution.streaming_executor.StreamingExecutor.execute.<locals>.StreamIterator object at 0x7fc8746f1910>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 141, in get_next
    raise item
          └ RayTaskError(FileNotFoundError)(FileNotFoundError(2, "Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4a...
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 201, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
          │    │                     │    │                  │    └ True
          │    │                     │    │                  └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
          │    │                     │    └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
          │    │                     └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
          │    └ <function StreamingExecutor._scheduling_loop_step at 0x7fc88c519790>
          └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 252, in _scheduling_loop_step
    process_completed_tasks(topology, self._backpressure_policies)
    │                       │         │    └ []
    │                       │         └ <StreamingExecutor(StreamingExecutor-4858e152dee04093abb26996a58fd5d0, stopped daemon 140498538833664)>
    │                       └ {InputDataBuffer[Input]: <ray.data._internal.execution.streaming_executor_state.OpState object at 0x7fc874686e20>, InputDataB...
    └ <function process_completed_tasks at 0x7fc88c519160>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 365, in process_completed_tasks
    num_blocks_read = task.on_data_ready(
                      │    └ <function DataOpTask.on_data_ready at 0x7fc88c7050d0>
                      └ <ray.data._internal.execution.interfaces.physical_operator.DataOpTask object at 0x7fc8746a4340>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_data_ready
    ex = ray.get(block_ref)
         │   │   └ ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000)
         │   └ <function get at 0x7fc88c7e7820>
         └ <module 'ray' from '/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/__init__.py'>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
           │   │       └ {}
           │   └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
           └ <function get at 0x7fc88c856790>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (ObjectRef(cf74d1b865704d22ffffffffffffffffffffffff0500000001000000),)
           └ <function get at 0x7fc88c856700>
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
          │     └ <function RayTaskError.as_instanceof_cause at 0x7fc88cec3700>
          └ RayTaskError('ray.data._internal.execution.operators.map_operator._map_task', 'Traceback (most recent call last):\n  File "py...

ray.exceptions.RayTaskError(FileNotFoundError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Map(compute_stats)->Filter(process)->Write() (pid=42507, ip=10.23.4.252)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 416, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 232, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_write_op.py", line 27, in fn
    {"write_result": [datasource.write(blocks, ctx, **write_args)]}
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 289, in write
    with _open_file_with_retry(
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 881, in _open_file_with_retry
    raise e from None
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 863, in _open_file_with_retry
    return open_file()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 291, in <lambda>
    lambda: fs.open_output_stream(write_path, **open_stream_args),
  File "pyarrow/_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file './outputs/demo/demo-processed/c4e26ca89c5d4af1a1c63c57d3dc2875_000000_000000.json'. Detail: [errno 2] No such file or directory
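The final error shows the write task failing to open a file under the relative path './outputs/demo/demo-processed'. One possible workaround, offered only as an untested sketch and not as the project's fix, is to resolve the export path to an absolute path and create the directory before the write. The helper name `prepare_export_dir` is hypothetical, and the assumption that the relative path is what breaks the write has not been confirmed in this thread.

```python
import os


def prepare_export_dir(export_path):
    """Hypothetical helper: return an absolute export directory that exists.

    Assumption: the write fails because the relative directory is missing
    from the point of view of the process that performs the write.
    """
    abs_path = os.path.abspath(os.path.expanduser(export_path))
    # exist_ok=True makes the call safe when the directory is already there.
    os.makedirs(abs_path, exist_ok=True)
    return abs_path


if __name__ == "__main__":
    # Same relative path that appears in the traceback above.
    print(prepare_export_dir("./outputs/demo/demo-processed"))
```

The returned path could then be passed to `dataset.write_json` in place of `self.cfg.export_path`, the call shown in the traceback. On a multi-node cluster, a directory created on one machine would still not exist on the others, so this may only help in a single-machine run.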
