
Comments (9)

peteryuX avatar peteryuX commented on May 22, 2024

Hi @reeen115, you can prepare your own dataset with the structure below (like the original training dataset, MS-Celeb-1M).

/your/path/to/dataset/
    -> 0
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> 1
        -> ...
    -> 2
        -> ...

However, I think your task (pen anomalies) might not be well suited to ArcFace loss, because
anomaly detection is usually applied to unlabeled data. A paper that targets anomaly detection specifically may be more helpful for you.
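
To check a folder against the layout described above before converting it, something like the following sketch may help. The function name `summarize_dataset` is hypothetical and not part of arcface-tf2; it assumes every entry inside a class folder is one image, and it reports the class and sample counts that the config asks for.

```python
import os


def summarize_dataset(dataset_root):
    """Return (num_classes, num_samples) for a folder-per-class dataset.

    Hypothetical helper: assumes each sub-folder is one class and each
    entry inside a class folder is one image.
    """
    class_dirs = [
        name for name in sorted(os.listdir(dataset_root))
        if os.path.isdir(os.path.join(dataset_root, name))
    ]
    # Class folders are expected to be named with integers (0, 1, 2, ...).
    bad_names = [name for name in class_dirs if not name.isdigit()]
    if bad_names:
        raise ValueError(f"class folders must be integers, got: {bad_names}")
    num_samples = sum(
        len(os.listdir(os.path.join(dataset_root, name))) for name in class_dirs
    )
    return len(class_dirs), num_samples
```

Calling `summarize_dataset("/your/path/to/dataset")` would return the two numbers to compare against `num_classes` and `num_samples` in the YAML config.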

from arcface-tf2.

logicmixtape avatar logicmixtape commented on May 22, 2024

Thanks for your advice.
I edited the ./configs/*.yaml files.

sub_name = arc_res50_pen
train_dataset = ./data/train_pen
num_classes = 2 (OK , NG)
num_samples = 1000 (total of OK images and NG images)
test_dataset = ./data/test_pen

train
python train.py --mode 'eager_tf' --cfg_path "./configs/arc_res50_pen.yaml"

and test
python test.py --cfg_path "./configs/arc_res50_pen.yaml"
Is this all I should do?


peteryuX avatar peteryuX commented on May 22, 2024

@reeen115
The training part seems okay; take care when tuning your hyperparameters. (BTW, the number of samples seems really small, so training might not be effective.)

For the testing part, you need to additionally modify lines 50~70 in test.py to fit your needs. The original testing dataset contains samples structured like (img1, img2, is_same). You would probably compute the distance between the embedding vectors of the OK and NG samples (see the related code in ./modules/evaluations.py), which helps you understand the performance.
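
The pair-based evaluation described here can be sketched roughly as follows. This is not the code in ./modules/evaluations.py; the function names and the fixed `threshold` are assumptions for illustration, and the embeddings are assumed to be non-zero vectors already produced by the model.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def embedding_distance(emb_a, emb_b):
    """Squared Euclidean distance between two L2-normalized embeddings."""
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pair_accuracy(pairs, threshold):
    """Fraction of (emb1, emb2, is_same) pairs classified correctly.

    A pair is predicted "same" when its distance is below the threshold
    (hypothetical value; it would normally be tuned on held-out pairs).
    """
    correct = 0
    total = 0
    for emb1, emb2, is_same in pairs:
        predicted_same = embedding_distance(emb1, emb2) < threshold
        correct += int(predicted_same == is_same)
        total += 1
    return correct / total
```

For the pen task, an (OK, OK) pair would carry `is_same=True` and an (OK, NG) pair `is_same=False`; a large gap between the two groups' distances suggests the embedding separates them.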


logicmixtape avatar logicmixtape commented on May 22, 2024

I did
python train.py --cfg_path "./configs/arc_res50_pen.yaml"

I got this

2019-11-13 16:41:36.668726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-13 16:41:38.464480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-13 16:41:38.501135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:41:38.507440: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:41:38.511371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:41:38.514595: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-11-13 16:41:38.519785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:41:38.526468: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:41:38.532074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:41:39.134768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 16:41:39.139410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-13 16:41:39.141551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-13 16:41:39.144632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4608 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to
================================================================================
input_image (InputLayer)  [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model)          (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model)       (None, 512)       16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer)        [(None,)]         0
________________________________________________________________________________
ArcHead (Model)           (None, 2)         1024     OutputLayer[1][0]
                                                     label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1113 16:41:44.670018  6156 train.py:42] load my dataset.
2019-11-13 16:49:02.024762: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-13 16:49:03.756048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-13 16:49:03.786320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:49:03.793860: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:49:03.798482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:49:03.801963: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-11-13 16:49:03.808455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:49:03.814553: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:49:03.819148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:49:04.410137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 16:49:04.414202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-13 16:49:04.416258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-13 16:49:04.419851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4608 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to
================================================================================
input_image (InputLayer)  [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model)          (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model)       (None, 512)       16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer)        [(None,)]         0
________________________________________________________________________________
ArcHead (Model)           (None, 2)         1024     OutputLayer[1][0]
                                                     label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1113 16:49:09.734443 18968 train.py:42] load ms1m dataset.
[*] training from scratch.
Train for 59 steps
Epoch 1/5
2019-11-13 16:49:22.018330: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: NewRandomAccessFile failed to Create/Open: ./data/train_data : Access denied.
; Input/output error
         [[{{node IteratorGetNext}}]]
2019-11-13 16:49:22.549401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-13 16:49:24.644105: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: NewRandomAccessFile failed to Create/Open: ./data/train_data :Access denied.
; Input/output error
         [[{{node IteratorGetNext}}]]
         [[IteratorGetNext/_8]]
 1/59 [..............................] - ETA: 11:22Traceback (most recent call last):
  File "train.py", line 136, in <module>
    app.run(main)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 132, in main
    initial_epoch=epochs - 1)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\execute.py", line 61, in quick_execute
    num_outputs)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 120: invalid start byte


peteryuX avatar peteryuX commented on May 22, 2024

Did you run the related code to convert the data to tfrecord files for training, like the original implementation in this repository?

# Binary Image: convert really slow, but loading faster when training.
python data/convert_train_binary_tfrecord.py --dataset_path "/path/to/ms1m_align_112/imgs" --output_path "./data/ms1m_bin.tfrecord"

# Online Decoding: convert really fast, but loading slower when training.
python data/convert_train_tfrecord.py --dataset_path "/path/to/ms1m_align_112/imgs" --output_path "./data/ms1m.tfrecord"


logicmixtape avatar logicmixtape commented on May 22, 2024

I forgot to mention that I ran this:
python data/convert_train_binary_tfrecord.py --dataset_path "./data/train_data" --output_path "./data/pen.tfrecord"

2019-11-13 17:49:17.610647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
I1113 17:49:19.314568 10432 convert_train_binary_tfrecord.py:48] Loading ./data/train_data
I1113 17:49:19.315593 10432 convert_train_binary_tfrecord.py:51] Reading data list...
100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 167.52it/s]
I1113 17:49:19.333978 10432 convert_train_binary_tfrecord.py:59] Writing tfrecord file...
  0%|                                                                               | 0/950 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "data/convert_train_binary_tfrecord.py", line 70, in <module>
    app.run(main)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "data/convert_train_binary_tfrecord.py", line 63, in main
    source_id=int(id_name),
ValueError: invalid literal for int() with base 10: 'OK'


peteryuX avatar peteryuX commented on May 22, 2024

Convert

/your/path/to/dataset/
    -> OK
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> NG
        -> ...

to

/your/path/to/dataset/
    -> 0
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> 1
        -> ...

This bug is an int() conversion error: the converter calls int() on each folder name, so the folder names must be integers. You can find the details by searching online.
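
The rename described above could be scripted along these lines. This is a minimal sketch, not part of arcface-tf2: `rename_class_dirs` is a hypothetical helper, and the mapping of OK to 0 and NG to 1 simply follows the example layout.

```python
import os


def rename_class_dirs(dataset_root, mapping):
    """Rename class folders under dataset_root to integer names.

    mapping: dict from the current folder name to its integer class id,
    for example {"OK": 0, "NG": 1} (hypothetical helper and mapping).
    """
    for old_name, class_id in mapping.items():
        src = os.path.join(dataset_root, old_name)
        dst = os.path.join(dataset_root, str(class_id))
        if not os.path.isdir(src):
            raise FileNotFoundError(f"missing class folder: {src}")
        if os.path.exists(dst):
            raise FileExistsError(f"target already exists: {dst}")
        # The images inside the folder are left untouched.
        os.rename(src, dst)
```

For the dataset in this thread the call would be `rename_class_dirs("./data/train_data", {"OK": 0, "NG": 1})`, run once before the tfrecord conversion script.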


logicmixtape avatar logicmixtape commented on May 22, 2024

After various trials, whenever I try to train the model I cannot get past this error.
Do you know of any solutions?
I would also like to know the details of your execution environment.

`2019-11-14 19:00:40.234637: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-14 19:00:42.276039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-14 19:00:42.305700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-14 19:00:42.312380: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 19:00:42.317878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 19:00:42.320481: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-11-14 19:00:42.326245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-14 19:00:42.331967: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 19:00:42.337288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 19:00:42.927794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-14 19:00:42.931382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-14 19:00:42.933716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-14 19:00:42.936699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4606 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to
================================================================================
input_image (InputLayer)  [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model)          (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model)       (None, 512)       16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer)        [(None,)]         0
________________________________________________________________________________
ArcHead (Model)           (None, 2)         1024     OutputLayer[1][0]
                                                     label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1114 19:00:47.251486 12776 train.py:42] load ms1m dataset.
[*] training from scratch.
2019-11-14 19:00:47.767252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-11-14 19:00:49.075699: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-11-14 19:00:49.079945: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "train.py", line 136, in <module>
    app.run(main)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 78, in main
    logist = model(inputs, training=True)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
    convert_kwargs_to_constants=base_layer_utils.call_context().saving)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
    output_tensors = layer(computed_tensors, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\layers\convolutional.py", line 197, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 1134, in __call__
    return self.conv_op(inp, filter)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 639, in __call__
    return self.call(inp, filter)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 238, in __call__
    name=self.name)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
    name=name)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1031, in conv2d
    data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1130, in conv2d_eager_fallback
    ctx=_ctx, name=name)
  File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]`


peteryuX avatar peteryuX commented on May 22, 2024

It seems like a problem with cuDNN version incompatibility.

Take a look at this solution; I hope it solves your problem.
tensorflow/tensorflow#24828 (comment)

My environment:

  • nvidia driver 436.48
  • CUDA 10.0
  • cudnn 7.6.3
  • Tensorflow-gpu 2.0.0
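
The linked comment is not reproduced in this thread. One commonly suggested workaround for `CUDNN_STATUS_ALLOC_FAILED` on TensorFlow 2.0 is to enable GPU memory growth, so TensorFlow allocates GPU memory on demand instead of reserving it all up front. Whether that is what the linked comment proposes is an assumption, and the helper name below is hypothetical.

```python
def enable_gpu_memory_growth():
    """Enable memory growth on every visible GPU; return how many were set.

    Hypothetical helper. Returns 0 when TensorFlow is not installed or
    no GPU is visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not available in this environment
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before any op initializes the GPU; otherwise
        # TensorFlow raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

If this approach is tried, the call would need to happen at the very start of train.py, before the model is built.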

