Comments (14)
That's strange. What is your dataset, and how big is it? Normally it should print loss, precision, recall, F1 score, and training speed every few iterations (depending on the settings in project.json). If the total dataset size is smaller than the minimum logging interval, it only prints "Saving checkpoint" like yours.
from deepdanbooru.
The dataset is the entire Danbooru2020 set. I just filtered out some of the useless tags.
from deepdanbooru.
Can you show your project.json?
from deepdanbooru.
Yeah, I'll post all of the contents, except the checkpoints folder.
Ugh, it only lets me attach .txt files, not .json. Whatever; categories, tags_log, and project are JSONs.
tags.txt
tags-character.txt
tags-general.txt
tags_log.txt
categories.txt
project.txt
from deepdanbooru.
Hmm, I have it set to 200 MB per checkpoint, and my I/O can process several times that per second. Could that somehow be related? I expected to be bottlenecked by GPU power.
from deepdanbooru.
From your settings, a checkpoint is saved every 3200 samples (or every epoch) and a log line is printed every 320 samples. If your dataset has fewer than 320 images, it only prints "saving checkpoint", but that is normal. Wait for completion.
If your dataset has many images, check your SQLite database for the count of images whose tag_count_general value is larger than the minimum_tag_count in project.json.
from deepdanbooru.
SELECT Count(1) FROM posts WHERE posts.tag_count_general >= 10
3956484
I don't think that's the issue.
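For anyone running the same sanity check from Python rather than the sqlite3 shell, the count query above can be wrapped with the standard library (a minimal sketch; the posts table layout is assumed from this thread's query, and the function name and path are placeholders, not DeepDanbooru API):

```python
import sqlite3

def count_trainable_posts(db_path: str, minimum_tag_count: int = 10) -> int:
    """Count posts whose general tag count meets the training threshold."""
    conn = sqlite3.connect(db_path)
    try:
        # Same query as above, with the threshold parameterized.
        (count,) = conn.execute(
            "SELECT COUNT(1) FROM posts WHERE tag_count_general >= ?",
            (minimum_tag_count,),
        ).fetchone()
        return count
    finally:
        conn.close()
```

If this returns a large number (here, 3956484), the dataset itself is not the problem.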
Oh, I think I know the issue. I didn't have the database in the directory with the images folder, but one above it. Do I just delete the checkpoints folder to restart?
from deepdanbooru.
da3dsoul@THE-THRONE:/media/da3dsoul/Golias/DeepDanbooru$ deepdanbooru train-project /media/da3dsoul/Golias/DeepDanbooru/unbooru_model/
Using Adam optimizer ...
Loading tags ...
Creating model (resnet_custom_v2) ...
Model : (None, 299, 299, 3) -> (None, 14176)
Loading database ...
No checkpoint. Starting new training ... (2021-08-31 00:47:17.383431)
Shuffling samples (epoch 0) ...
Trying to change learning rate to 0.001 ...
Learning rate is changed to <tf.Variable 'learning_rate:0' shape=() dtype=float32, numpy=0.001> ...
2021-08-31 00:47:31.829958: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost' in binary running on THE-THRONE. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
MIOpen(HIP): Warning [ParseAndLoadDb] File is unreadable: /opt/rocm-4.3.0/miopen/share/miopen/db/gfx803_32.HIP.fdb.txt
Traceback (most recent call last):
  File "/usr/local/bin/deepdanbooru", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/__main__.py", line 52, in train_project
    dd.commands.train_project(project_path, source_model)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py", line 196, in train_project
    step_result = model.train_on_batch(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1727, in train_on_batch
    logs = self.train_function(iterator)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet_custom_v2/batch_normalization_63/FusedBatchNormV3 (defined at home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Func/assert_greater_equal/Assert/AssertGuard/else/_1/input_control_node/_62/_83]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node resnet_custom_v2/batch_normalization_63/FusedBatchNormV3 (defined at home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_52220]
Function call stack:
train_function -> train_function
That definitely did something different, but that looks like a me problem with my ROCm setup.
Can you add a simple check with an error and exit when the database isn't in the same folder as the images folder? It's a little thing that would help save a headache.
from deepdanbooru.
Can you add a simple check with an error and exit when the database isn't in the same folder as the images folder? It's a little thing that would help save a headache.
DeepDanbooru simply ignores image file I/O errors, because an extremely large dataset will contain a number of incorrect/broken files, and they should not disturb the entire training process.
Anyway, I recommend reducing the minibatch size, because your log shows an OOM (out of memory) error.
from deepdanbooru.
I can try, thanks. It's an 8 GB GPU, which is reasonable, but this isn't exactly a reasonable workload, so fair.
from deepdanbooru.
I see, it eats RAM by the gig and asks for seconds. I had to turn the batch size down to 8 to make it not crash, but it's still over 10x faster on the RX 570 than on the Ryzen 3700X. Thanks much for all of the help.
When I said run a check, I meant something as simple as
image_dir = ...  # derive it from the database directory
if not os.path.isdir(image_dir):
    raise Exception("Missing images folder!")
It's been a while since I've done Python, so the exact syntax isn't the point there...
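A slightly fuller version of that check might look like the following (a sketch only: the layout where an images/ folder sits next to the SQLite file is inferred from this thread, and the function name is hypothetical, not part of DeepDanbooru):

```python
import os

def check_images_folder(database_path: str) -> str:
    """Fail fast if the images/ folder does not sit next to the database file."""
    image_dir = os.path.join(os.path.dirname(database_path), "images")
    if not os.path.isdir(image_dir):
        raise FileNotFoundError(
            f"Missing images folder: expected {image_dir!r} next to the database."
        )
    return image_dir
```

Failing at startup with a clear message beats silently skipping millions of unreadable images and printing nothing but "Saving checkpoint".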
from deepdanbooru.
Out of curiosity, what's an acceptable sample rate? I'm getting about 12 samples/s
from deepdanbooru.
When I said run a check, I meant something as simple as
Ah, I'll add it.
what's an acceptable sample rate?
In my case, I got ~20 samples/s using a Ryzen 1700X + GeForce RTX 2080 Ti.
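To put those rates in perspective, simple arithmetic with the ~3.96M-post count from the SQL query earlier in the thread gives the wall-clock cost of a single epoch (an estimate only; it ignores checkpointing and I/O stalls):

```python
def epoch_days(sample_count: int, samples_per_second: float) -> float:
    """Wall-clock days for one full pass over the dataset at a given throughput."""
    return sample_count / samples_per_second / 86400  # 86400 seconds per day

POSTS = 3_956_484  # count returned by the SQL query above

print(f"{epoch_days(POSTS, 12):.1f} days/epoch at 12 samples/s")  # ~3.8
print(f"{epoch_days(POSTS, 20):.1f} days/epoch at 20 samples/s")  # ~2.3
```

So even the faster setup needs multiple days per epoch on the full Danbooru2020 set, which makes the per-checkpoint and per-log intervals discussed above the only visible signs of progress for long stretches.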
from deepdanbooru.
Ok, a 2080 Ti is way faster than an RX 570, so that's fair. Thanks.
from deepdanbooru.