bertsky commented on August 22, 2024

Hi @BartChris,

thanks a lot for your report and questions!

As to 1., please bear in mind that the data is not actually copied each time:

  • copy the images from Kitodo process folder to the "WORKDIR" which is located on the manager server

This uses reflink copies if the filesystem supports them. Essentially, a reflink is a lazy copy (just another inode pointing to the same blocks).

Of course, if the user runs Kitodo and the Manager on distinct physical filesystems, then the copy will incur the full I/O cost.
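To illustrate what a reflink copy looks like in practice, here is a minimal sketch. It assumes GNU `cp` is on the PATH and falls back to an ordinary copy otherwise; the function name `lazy_copy` is made up for this example and is not necessarily how the Manager itself performs the copy.

```python
import shutil
import subprocess
from pathlib import Path


def lazy_copy(src: Path, dst: Path) -> str:
    """Copy src to dst, sharing data blocks when the filesystem allows it.

    Returns "cp" if GNU cp handled the copy, "shutil" if we fell back.
    """
    try:
        # --reflink=auto clones the blocks where the filesystem supports it
        # (e.g. Btrfs, XFS) and otherwise performs a normal full copy.
        subprocess.run(
            ["cp", "--reflink=auto", str(src), str(dst)],
            check=True,
            capture_output=True,
        )
        return "cp"
    except (OSError, subprocess.CalledProcessError):
        # Assumption: no GNU cp available (e.g. macOS or BusyBox),
        # so pay the full I/O cost with a plain copy.
        shutil.copy2(src, dst)
        return "shutil"
```

On a filesystem without reflink support, or across two different filesystems, the result is the same file content but with the full copy cost described above.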

  • copy the images from the "WORKDIR" to the "REMOTE_DIR" on the processing server

That's unavoidable. (But if you configure the Controller and Manager to use the same volume on the same host, then rsync will not do any actual work.)

  • after the OCR is done copy the whole OCR data back to the "WORKDIR"

Yes, but it will skip all data that is already there (including full-size images). Retrieving the results from the remote side (where storage might be fast but short-lived) is also unavoidable.
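The skipping behaviour can be sketched roughly as follows. By default, rsync's quick check treats a file as unchanged when size and modification time match on both sides; the code below mimics only that decision in a simplified form. The helper names and the directory layout are invented for illustration, and the actual rsync options used by the Manager may differ.

```python
import shutil
from pathlib import Path


def needs_transfer(src: Path, dst: Path) -> bool:
    """Simplified version of rsync's default quick check (size + mtime)."""
    if not dst.exists():
        return True
    s, d = src.stat(), dst.stat()
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_tree(src_root: Path, dst_root: Path) -> list[str]:
    """Copy only files that are missing or changed; return what was copied."""
    copied = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        if needs_transfer(src, dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            # copy2 preserves the mtime, so the next run skips this file.
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

With the full-size images already present on the receiving side, only the newly produced OCR results would show up in the returned list.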

  • copy the OCR results (ALTO) from the "WORKDIR" to the Kitodo process folder

Yes, for the moment it's only the ALTO files. In the future, we might try to add more (like structMap or other file formats), perhaps in a later workflow stage, when Production already exported the final METS.

The ALTO files are small, so that should do no harm.

The actual OCR-D workspace, on the other hand, will be preserved – it might be needed for re-processing with another workflow, or visual inspection in the Monitor. We have not decided yet when to delete these workspaces from the Manager. (It would probably make sense to tie them to the lifetime of the process in Kitodo, and then archive or delete.)

What is the rationale behind the "WORKDIR" for example and why does the data have to be copied so many times? I reduced the number of copy processes by using shared volumes between the servers and e.g. copied directly from the process folder to the remote folder, but I would like to be sure that I am not violating some deeper architectural ideas here.

See the answers above. And you don't need to change any code to reduce the amount of copying/synchronization: just set up your environment variables (see make help and .env) to suit your all-local use case.

2. I am running the ocrd_manager standalone right now. For that I run docker compose up for the ocrd_manager component and for the ocrd_monitor component separately. The idea is probably that both services are using a shared volume to store job data

Yes, that's exactly the reason.

But right now I have two folders in /var/lib/docker/volume named ocrd_manager_shared and ocrd_monitor_shared. What can I do so that both services actually use the same shared folder?

You probably did not use the top-level repo https://github.com/markusweigelt/kitodo_production_ocrd for the integrated docker-compose. The top-level is where we provide most documentation and the easiest makefile entrypoints. The submodules only have very limited documentation and flexibility. (In this case, you would need to combine docker-compose.yml and ocrd_monitor/docker-compose.yml as one compose call, so they get the same network and volumes.)
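For the standalone case, combining the two files into one compose call could look roughly like the sketch below. It only builds and prints the command line; the helper `compose_command` is hypothetical, the two file paths are the ones named above, and `up -d` is just one possible action.

```python
import shlex


def compose_command(files: list[str], *args: str) -> list[str]:
    """Build a single `docker compose` call that merges several compose files."""
    cmd = ["docker", "compose"]
    for f in files:
        # Each -f adds another file to the same project, so services from
        # all files are created under one project name.
        cmd += ["-f", f]
    return cmd + list(args)


cmd = compose_command(
    ["docker-compose.yml", "ocrd_monitor/docker-compose.yml"], "up", "-d"
)
print(shlex.join(cmd))
# prints: docker compose -f docker-compose.yml -f ocrd_monitor/docker-compose.yml up -d
```

Because both files then belong to one compose project, a volume declared under the same name in both should resolve to a single Docker volume instead of two project-prefixed ones. Note that with multiple `-f` files, Compose resolves relative paths against the first file, so the call is best made from the directory that holds docker-compose.yml.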

Hope that helps.

from ocrd_manager.

BartChris commented on August 22, 2024

yes, sounds good, i think you can close here. Thanks!


bertsky commented on August 22, 2024

@BartChris so I think the main issue here is slub/ocrd_kitodo#35

Would you agree? Can we close here?
