Comments (9)
Hi @reeen115, you can prepare your own dataset with the structure below (the same layout as the original training dataset, MS-Celeb-1M).
/your/path/to/dataset/
    -> 0
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> 1
        -> ...
    -> 2
        -> ...
However, I think your task (pen anomalies) might not be well suited to ArcFace loss, because
anomaly detection is usually applied to unlabeled data. A paper that specifically targets anomaly detection may be more helpful for you.
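A minimal sketch for sanity-checking a dataset laid out as above. It counts the class folders and the image files inside them, which is the kind of information a training config usually needs. The function name `summarize_dataset` and the list of image extensions are assumptions made for illustration; they are not part of arcface-tf2.

```python
import os

# Assumed set of image extensions; adjust to match your files.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def summarize_dataset(dataset_path):
    """Return (num_classes, num_samples) for a folder-per-class dataset."""
    # Every sub-directory of dataset_path is treated as one class.
    class_dirs = sorted(
        name for name in os.listdir(dataset_path)
        if os.path.isdir(os.path.join(dataset_path, name))
    )
    num_samples = 0
    for name in class_dirs:
        files = os.listdir(os.path.join(dataset_path, name))
        # Count only files whose extension looks like an image.
        num_samples += sum(f.lower().endswith(IMAGE_EXTENSIONS) for f in files)
    return len(class_dirs), num_samples
```

Calling `summarize_dataset("/your/path/to/dataset/")` returns a tuple of the class count and the image count, which can be compared against the values written in the config file.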
from arcface-tf2.
Thanks for your advice.
I edited the ./configs/*.yaml files.
sub_name = arc_res50_pen
train_dataset = ./data/train_pen
num_classes = 2 (OK , NG)
num_samples = 1000 (total of OK images and NG images)
test_dataset = ./data/test_pen
train
python train.py --mode 'eager_tf' --cfg_path "./configs/arc_res50_pen.yaml"
and test
python test.py --cfg_path "./configs/arc_res50_pen.yaml"
Is this all I should do?
from arcface-tf2.
@reeen115
The training part seems okay, but take care when tuning your hyperparameters. (BTW, the number of samples seems really small, so training might not be effective.)
For testing, you also need to modify lines 50~70 of test.py to suit your needs. The original test datasets consist of samples structured as (img1, img2, is_same). In your case, this would probably mean computing the distance between the embedding vectors of OK and NG samples (see the related code in ./modules/evaluations.py), which will help you understand the performance.
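A rough sketch of the pair-based check described above, assuming embeddings are plain lists of floats and that pairs follow the (img1, img2, is_same) pattern after the images have been turned into embeddings. The helper names and the choice of squared Euclidean distance on L2-normalized vectors are assumptions for illustration; the actual metric in ./modules/evaluations.py may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pair_distance(emb1, emb2):
    """Squared Euclidean distance between the L2-normalized embeddings."""
    a, b = l2_normalize(emb1), l2_normalize(emb2)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pair_accuracy(pairs, threshold):
    """Fraction of (emb1, emb2, is_same) pairs classified correctly.

    A pair is predicted 'same' when its distance is below the threshold.
    """
    pairs = list(pairs)
    correct = sum(
        (pair_distance(e1, e2) < threshold) == bool(is_same)
        for e1, e2, is_same in pairs
    )
    return correct / len(pairs)
```

With OK/OK pairs labeled as same and OK/NG pairs labeled as different, sweeping `threshold` over a range of values gives a simple picture of how well the embeddings separate the two groups.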
from arcface-tf2.
I did
python train.py --cfg_path "./configs/arc_res50_pen.yaml"
I got this
2019-11-13 16:41:36.668726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-13 16:41:38.464480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-13 16:41:38.501135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:41:38.507440: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:41:38.511371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:41:38.514595: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-11-13 16:41:38.519785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:41:38.526468: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:41:38.532074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:41:39.134768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 16:41:39.139410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-11-13 16:41:39.141551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2019-11-13 16:41:39.144632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4608 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
================================================================================
input_image (InputLayer) [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model) (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model) (None, 512) 16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer) [(None,)] 0
________________________________________________________________________________
ArcHead (Model) (None, 2) 1024 OutputLayer[1][0]
label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1113 16:41:44.670018 6156 train.py:42] load my dataset.
2019-11-13 16:49:02.024762: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-13 16:49:03.756048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-13 16:49:03.786320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:49:03.793860: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:49:03.798482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:49:03.801963: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-11-13 16:49:03.808455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-13 16:49:03.814553: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-13 16:49:03.819148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-13 16:49:04.410137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-13 16:49:04.414202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-11-13 16:49:04.416258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2019-11-13 16:49:04.419851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4608 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
================================================================================
input_image (InputLayer) [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model) (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model) (None, 512) 16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer) [(None,)] 0
________________________________________________________________________________
ArcHead (Model) (None, 2) 1024 OutputLayer[1][0]
label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1113 16:49:09.734443 18968 train.py:42] load ms1m dataset.
[*] training from scratch.
Train for 59 steps
Epoch 1/5
2019-11-13 16:49:22.018330: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: NewRandomAccessFile failed to Create/Open: ./data/train_data : Access denied.
; Input/output error
[[{{node IteratorGetNext}}]]
2019-11-13 16:49:22.549401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-13 16:49:24.644105: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: NewRandomAccessFile failed to Create/Open: ./data/train_data :Access denied.
; Input/output error
[[{{node IteratorGetNext}}]]
[[IteratorGetNext/_8]]
1/59 [..............................] - ETA: 11:22Traceback (most recent call last):
File "train.py", line 136, in <module>
app.run(main)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "train.py", line 132, in main
initial_epoch=epochs - 1)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 324, in fit
total_epochs=epochs)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
batch_outs = execution_function(iterator)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
distributed_function(input_fn))
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 520, in _call
return self._stateless_fn(*args, **kwds)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1823, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in _filtered_call
self.captured_inputs)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
ctx=ctx)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\tensorflow_core\python\eager\execute.py", line 61, in quick_execute
num_outputs)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 120: invalid start byte
from arcface-tf2.
Did you run the conversion script to turn the data into tfrecord files for training, as in the original implementation in this repository?
# Binary Image: conversion is really slow, but loading is faster when training.
python data/convert_train_binary_tfrecord.py --dataset_path "/path/to/ms1m_align_112/imgs" --output_path "./data/ms1m_bin.tfrecord"
# Online Decoding: conversion is really fast, but loading is slower when training.
python data/convert_train_tfrecord.py --dataset_path "/path/to/ms1m_align_112/imgs" --output_path "./data/ms1m.tfrecord"
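The earlier log shows `NewRandomAccessFile failed to Create/Open: ./data/train_data`, which suggests the training path may point at an image folder rather than a converted tfrecord file. That reading is an interpretation and is not confirmed in the thread. A small pre-flight check along these lines can catch the problem before training starts; `check_tfrecord_path` is a hypothetical helper, not something provided by the repository.

```python
import os


def check_tfrecord_path(path):
    """Return a short diagnosis for a path that should be a tfrecord file."""
    if os.path.isdir(path):
        # A folder of images was given instead of the converted file.
        return "directory"
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # The conversion may have failed before writing any record.
        return "empty"
    return "ok"
```

Any result other than `"ok"` for the training dataset path in the config is a hint that the conversion step was skipped or that the config points at the wrong location.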
from arcface-tf2.
I forgot to mention that I did run the conversion:
python data/convert_train_binary_tfrecord.py --dataset_path "./data/train_data" --output_path "./data/pen.tfrecord"
2019-11-13 17:49:17.610647: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
I1113 17:49:19.314568 10432 convert_train_binary_tfrecord.py:48] Loading ./data/train_data
I1113 17:49:19.315593 10432 convert_train_binary_tfrecord.py:51] Reading data list...
100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 167.52it/s]
I1113 17:49:19.333978 10432 convert_train_binary_tfrecord.py:59] Writing tfrecord file...
0%| | 0/950 [00:00<?, ?it/s]
Traceback (most recent call last):
File "data/convert_train_binary_tfrecord.py", line 70, in <module>
app.run(main)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\user\Anaconda3\envs\arcface-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "data/convert_train_binary_tfrecord.py", line 63, in main
source_id=int(id_name),
ValueError: invalid literal for int() with base 10: 'OK'
from arcface-tf2.
Convert
/your/path/to/dataset/
    -> OK
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> NG
        -> ...
to
/your/path/to/dataset/
    -> 0
        -> image_1.jpg
        -> image_2.jpg
        -> ...
    -> 1
        -> ...
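This bug is an int() conversion error: the script converts each class folder name to an integer, so a folder named 'OK' fails. You can find more details by searching for the error message.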
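The renaming can also be scripted. The sketch below assumes an explicit mapping from folder names to integer ids (here OK to 0 and NG to 1, matching the layout above); `rename_class_folders` is a hypothetical helper and it renames folders in place, so try it on a copy of the data first.

```python
import os


def rename_class_folders(dataset_path, label_map):
    """Rename each class folder to its integer id, e.g. 'OK' -> '0'."""
    for name, class_id in label_map.items():
        src = os.path.join(dataset_path, name)
        dst = os.path.join(dataset_path, str(class_id))
        if not os.path.isdir(src):
            raise FileNotFoundError(src)
        if os.path.exists(dst):
            # Refuse to overwrite or merge into an existing folder.
            raise FileExistsError(dst)
        os.rename(src, dst)
```

For example, `rename_class_folders("/your/path/to/dataset/", {"OK": 0, "NG": 1})` produces folders named `0` and `1`, which `int()` can parse.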
from arcface-tf2.
After various attempts, I still cannot get past this error whenever I try to train the model.
Do you know of any solution?
I would also like to know the details of your execution environment.
`2019-11-14 19:00:40.234637: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-14 19:00:42.276039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-14 19:00:42.305700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-14 19:00:42.312380: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 19:00:42.317878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 19:00:42.320481: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-11-14 19:00:42.326245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-11-14 19:00:42.331967: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 19:00:42.337288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 19:00:42.927794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-14 19:00:42.931382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-11-14 19:00:42.933716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2019-11-14 19:00:42.936699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4606 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "arcface_model"
________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
================================================================================
input_image (InputLayer) [(None, 112, 112, 0
________________________________________________________________________________
resnet50 (Model) (None, 4, 4, 2048 23587712 input_image[0][0]
________________________________________________________________________________
OutputLayer (Model) (None, 512) 16787968 resnet50[1][0]
________________________________________________________________________________
label (InputLayer) [(None,)] 0
________________________________________________________________________________
ArcHead (Model) (None, 2) 1024 OutputLayer[1][0]
label[0][0]
================================================================================
Total params: 40,376,704
Trainable params: 40,318,464
Non-trainable params: 58,240
________________________________________________________________________________
I1114 19:00:47.251486 12776 train.py:42] load ms1m dataset.
[*] training from scratch.
2019-11-14 19:00:47.767252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-11-14 19:00:49.075699: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-11-14 19:00:49.079945: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Traceback (most recent call last):
File "train.py", line 136, in <module>
app.run(main)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "train.py", line 78, in main
logist = model(inputs, training=True)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 708, in call
convert_kwargs_to_constants=base_layer_utils.call_context().saving)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 860, in _run_internal_graph
output_tensors = layer(computed_tensors, **kwargs)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\engine\base_layer.py", line 891, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\keras\layers\convolutional.py", line 197, in call
outputs = self._convolution_op(inputs, self.kernel)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 1134, in __call__
return self.conv_op(inp, filter)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 639, in __call__
return self.call(inp, filter)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 238, in __call__
name=self.name)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
name=name)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1031, in conv2d
data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1130, in conv2d_eager_fallback
ctx=_ctx, name=name)
File "C:\Users\user\Anaconda3\envs\ac-tf2\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]`
from arcface-tf2.
It looks like a cuDNN version incompatibility problem.
Take a look at this solution; I hope it solves your problem.
tensorflow/tensorflow#24828 (comment)
My environment:
- nvidia driver 436.48
- CUDA 10.0
- cudnn 7.6.3
- Tensorflow-gpu 2.0.0
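A commonly reported workaround for `CUDNN_STATUS_ALLOC_FAILED` on TensorFlow 2.0 is to enable GPU memory growth before any model is built. Whether this matches the linked comment exactly is an assumption, and the wrapper function below is only an illustrative sketch; the `tf.config.experimental` calls it uses are part of TensorFlow 2.0.

```python
def enable_gpu_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand.

    Returns the number of GPUs that were configured.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so there is nothing to configure.
        return 0
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any op touches the GPU; otherwise TensorFlow
        # raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

The call belongs at the very start of the training script, before the model or dataset is created. If the error persists, comparing the installed CUDA and cuDNN versions against the list above is the next thing to check.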
from arcface-tf2.