browse-ocrd's People

Contributors

bertsky, hnesk, kba, lgtm-migrator, sulzbals

browse-ocrd's Issues

show physical page ids instead of thumbnail image file ids

In the navigation view on the left, where the thumbnail images are displayed and pages can be selected, pages can only be identified via their file ID (and the contiguous integer in between the navigation buttons).

However, I would find it more useful to display the physical page ID there. (Image file IDs can still be shown as text overlay in the image view.)

This became more pressing with METS files from Kitodo.Production v3, which uses UUIDs instead of contiguous numbers for pages and files, e.g. http://digital.slub-dresden.de/id1883722624-19520000 (to browse this, you first need to create an OCR-D workspace from the OAI link via ocrd workspace clone, then download ORIGINAL and FULLTEXT files, convert the latter to PAGE via ocrd-fileformat-transform -P from-to "alto page", finally use this script to fix the empty @imageFilenames).

PageView: PNG-Export

@bertsky in #30 (comment)

"Save view to PNG" (like a screengrabber for the current view, perhaps even including the mouse overlays)...
Shouldn't be too hard, the result already is a PIL.Image, except for the mouse-overlays, that are drawn on top via cairo

False warning about number of images per grp/page

images = [image for image in file_index.values() if
          image.static_page_id == page_id and image.fileGrp == file_group]

This does not actually single out files with .mimetype.startswith('image/'). So we end up with a false warning:

ocrd_browser.model.document.Document.get_image_paths - Found 2 images for PAGE PHYS_0001 and fileGrp BIN, expected 1

Could also be worse than just a false warning...
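
One possible fix would be to add the MIME type check to the same list comprehension. The following is only a minimal sketch: FileEntry is a hypothetical stand-in for the objects held in file_index, and images_for_page is an illustrative name, not a function from ocrd_browser.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileEntry:
    """Hypothetical stand-in for the file objects held in file_index."""
    static_page_id: str
    fileGrp: str
    mimetype: str


def images_for_page(file_index: Dict[str, FileEntry], page_id: str, file_group: str) -> List[FileEntry]:
    # Same selection as in the quoted code, plus a check that the file is an image.
    return [f for f in file_index.values()
            if f.static_page_id == page_id
            and f.fileGrp == file_group
            and f.mimetype.startswith('image/')]


# A fileGrp that holds both a derived image and a PAGE file for the same page.
file_index = {
    'BIN_0001_IMG': FileEntry('PHYS_0001', 'BIN', 'image/png'),
    'BIN_0001_XML': FileEntry('PHYS_0001', 'BIN', 'application/vnd.prima.page+xml'),
}
images = images_for_page(file_index, 'PHYS_0001', 'BIN')
```

With the extra condition, only the PNG is counted, so the "expected 1" check would no longer trip on the PAGE file.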

add PAGE annotation view

Without re-inventing the wheel for displaying PAGE annotations, there is a lot of added-value in having one view option for this here. PageViewer does not know of OCR-D's relative path convention and can only show isolated pages, LAREX cannot cope with METS and OCR-D directory structures.

We already discussed integrating PageViewer loosely by just triggering a command-line call (in the simplest case, using --resolve-dir workspace-directory), or adding some IPC capability to PageViewer itself and then remote-controlling it from ocrd_browser.

Alternatively, one might be able to integrate nw-page-editor's HTML via GTK's WebKit component.

AlternativeImage selection glitch

Since the recent changes regarding image selection, I am getting the following:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/page.py", line 102, in get_image
    page_image, page_coords, page_image_info = ws.image_from_page(self.page, self.id, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ocrd/workspace.py", line 730, in image_from_page
    filename, page_id))
Exception: Found no AlternativeImage that satisfies all requirements filename="." in page "PHYS_0001"

This goes away when I explicitly select an image version in the feature configurator. So the problem is the initialization.

I believe the cause is in b19ae33, where view.page.ImageVersion.list_from_page will serialize an uninitialized pathlib.Path (which happens to be . as str) instead of the empty string or None (which would have worked with the existing code in ImageVersionSelector and ViewPage).
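
The underlying pathlib behaviour can be reproduced in isolation. The helper below is a hypothetical sketch (path_to_str is not part of ocrd_browser) of how an uninitialized path could be mapped back to the empty string before serialization.

```python
from pathlib import Path
from typing import Optional


def path_to_str(path: Optional[Path]) -> str:
    # Path() is the same as Path('.'), and str() of it is '.', not ''.
    # Treat both None and the empty/current-directory path as "no file".
    if path is None or path == Path():
        return ''
    return str(path)


empty = str(Path())            # this is the '.' that ends up as filename="."
fixed = path_to_str(Path())    # '' instead
```

Note that this also treats an explicit Path('.') as empty, which seems acceptable here because a directory is never a valid image file name.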

PageView: component menu not editable

For me sometimes the Gtk.Menu of the PageFeaturesSelector is grayed out (so I cannot de/activate components).

I have not found a reason or workaround yet myself. (Tried item.set_sensitive(True) without effect...)

support path names with spaces

When opening a workspace with spaces anywhere in the directory names, ocrd_browser fails:

browse-ocrd PRImA\ Layout\ Analysis\ Dataset/mets.xml
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: /daten/PRImA%20Layout%20Analysis%20Dataset/mets.xml

The cause is not on the OCR-D side AFAICS, but here:

mets_url = cls._strip_local(mets_url)

which in turn does

def _strip_local(mets_url: Union[Path, str], disallow_remote: bool = True) -> str:
    result = urlparse(str(mets_url))
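
The snippet does not show where the %20 is introduced, so the following is only a sketch under assumptions: strip_local is a hypothetical variant that leaves plain filesystem paths untouched and percent-decodes the path of file:// URLs. The error message text is made up for illustration.

```python
from pathlib import Path
from typing import Union
from urllib.parse import unquote, urlparse


def strip_local(mets_url: Union[Path, str], disallow_remote: bool = True) -> str:
    """Return a plain filesystem path for local inputs (hypothetical variant)."""
    if isinstance(mets_url, Path):
        return str(mets_url)
    result = urlparse(mets_url)
    if result.scheme == 'file':
        # file:// URLs carry percent-encoded paths; decode them.
        return unquote(result.path)
    if result.scheme in ('http', 'https'):
        if disallow_remote:
            raise ValueError('remote URL not allowed: ' + mets_url)
        return mets_url
    # No (known) scheme: treat the input as a filesystem path and keep it as-is.
    return mets_url
```

The key point is that a path with spaces is never run through URL quoting, and a quoted file:// URL is unquoted exactly once.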

allow changing relative width of views

It would be wonderful if the vertical divider between multiple views was a slider that could be clicked and dragged to change their relative width (as can already be done for the preview pane on the left).

cannot run in Python pre 3.7.2 anymore

With the recent additions, I get:

   File "/lib/python3.6/site-packages/ocrd_browser/util/config.py", line 5, in <module>
    from typing import List, Optional, OrderedDict as OrderedDictType
ImportError: cannot import name 'OrderedDict'

(This is Python 3.6.7 – typing.OrderedDict was only introduced in 3.7.2.)

How can we avoid this dependency?
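
One common way around this is to import the typing alias only for static type checkers and to use a string annotation at runtime. This is a sketch, not the project's actual fix; make_settings is a hypothetical helper used only to show the annotation.

```python
from collections import OrderedDict
from typing import TYPE_CHECKING, List, Optional

if TYPE_CHECKING:
    # Seen only by static type checkers; never imported at runtime,
    # so interpreters older than 3.7.2 do not fail here.
    from typing import OrderedDict as OrderedDictType


def make_settings(keys: List[str]) -> 'OrderedDictType[str, Optional[str]]':
    # Hypothetical helper: the string annotation is not evaluated at runtime.
    return OrderedDict((key, None) for key in keys)


settings = make_settings(['fileGrp', 'view'])
```

The runtime object is still a collections.OrderedDict; only the annotation refers to the typing alias.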

Segmentation fault

When I open a PAGE view on the attached workspace and click on the first TextRegion browse-ocrd segfaults.

I'm not sure if I'm using the program correctly, so I have no idea what it should do instead ;-)

AttributeError: 'EntryPoints' object has no attribute 'get'

What is going wrong here? Native installation or installation using pip ends up producing the same error.

mm@MM-Notebook:~$ sudo docker run -it --rm -v /home/mm/Desktop/workflows/:/data -p 8085:8085 -p 8080:8080 hnesk/ocrd_browser
+ python serve.py -p 8080 -P 8085 -d /data
+ broadwayd :5
Listening on /root/.cache/broadway6.socket
/data/ocrd-ws-serial-cleaned/mets.xml
/data/ocrd-ws-serial/mets.xml
/data/ocrd-ws-pagewise/mets.xml
/data/ocrd-ws-pagewise-cleaned/mets.xml
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET / HTTP/1.1" 200 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:05] "GET /favicon.ico HTTP/1.1" 403 -
172.17.0.1 - - [30/Nov/2022 23:02:10] "GET /browse/ocrd-ws-serial-cleaned/mets.xml HTTP/1.1" 303 -
Traceback (most recent call last):
  File "/usr/local/bin/browse-ocrd", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/main.py", line 63, in main
    app = OcrdBrowserApplication()
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/application.py", line 28, in __init__
    self.view_registry = ViewRegistry.create_from_entry_points()
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/view/registry.py", line 21, in create_from_entry_points
    for entry_point in entry_points().get('ocrd_browser_view', []):
AttributeError: 'EntryPoints' object has no attribute 'get'
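
The error is consistent with a newer entry points API in which entry_points() no longer returns a dict. A version-tolerant lookup could look like the sketch below. It uses the standard library importlib.metadata (Python 3.8+), whereas the traceback above runs on Python 3.7 and therefore presumably relies on the importlib_metadata backport; the function name entry_points_for_group is an assumption.

```python
from importlib.metadata import entry_points
from typing import Any, List


def entry_points_for_group(group: str) -> List[Any]:
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Newer API: selectable EntryPoints object.
        return list(eps.select(group=group))
    # Older API: plain dict mapping group names to entry points.
    return list(eps.get(group, []))


views = entry_points_for_group('ocrd_browser_view')
```

If the package is not installed, views is simply an empty list, which matches the default of the original .get(..., []) call.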

Support remote images

We frequently have the use-case where some (or even all) the file references have not been downloaded yet.

But these URL references for images make OcrdBrowser stumble:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/window.py", line 92, in _open
    self.page_list.set_document(self.document)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_browser.py", line 39, in set_document
    self.model = PageListStore(self.document)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/ui/page_store.py", line 57, in __init__
    file_lookup = document.get_image_paths(self.file_group)
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 275, in get_image_paths
    image_paths[page_id] = self.path(images[0])
  File "/usr/local/lib/python3.7/site-packages/ocrd_browser/model/document.py", line 169, in path
    return self.directory.joinpath(other.local_filename)
  File "/usr/local/lib/python3.7/pathlib.py", line 922, in joinpath
    return self._make_child(args)
  File "/usr/local/lib/python3.7/pathlib.py", line 704, in _make_child
    drv, root, parts = self._parse_args(args)
  File "/usr/local/lib/python3.7/pathlib.py", line 658, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

That's because in …

if isinstance(other, OcrdFile):
    return self.directory.joinpath(other.local_filename)

… we do not differentiate between an OcrdFile's .local_filename (which may be empty) and its .url. The latter could still be downloaded into the document.directory under some name and returned here.

Or perhaps one could somehow make this downloading a lazy operation only to be triggered when actually needed.
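
The lazy variant could be sketched roughly as follows. FileRef is a hypothetical stand-in for OcrdFile with only the two attributes discussed here, and download is an assumed callable that fetches a URL into the directory and returns the relative file name; neither is the actual OCR-D API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class FileRef:
    """Hypothetical stand-in for OcrdFile (only the attributes discussed)."""
    local_filename: Optional[str]
    url: Optional[str]


def resolve_path(directory: Path, file: FileRef,
                 download: Callable[[str, Path], str]) -> Path:
    if file.local_filename:
        # Already present locally: nothing to fetch.
        return directory.joinpath(file.local_filename)
    if file.url:
        # Fetch only now, when the path is actually needed, and remember the result.
        file.local_filename = download(file.url, directory)
        return directory.joinpath(file.local_filename)
    raise FileNotFoundError('file has neither local_filename nor url')
```

Because the result is stored back on the file object, a second call for the same file does not trigger another download.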

Application crashes on launch

I installed the latest version of browse-ocrd from the master branch of the git repository. When launching, I get the following error message:

Traceback (most recent call last):
  File "/usr/local/bin/browse-ocrd", line 5, in <module>
    from ocrd_browser.main import main
  File "/usr/local/lib/python3.10/site-packages/ocrd_browser/main.py", line 26, in <module>
    resources = Gio.resource_load(str(BASE_PATH / "ui.gresource"))
gi.repository.GLib.GError: g-file-error-quark: Failed to open file “/usr/local/lib/python3.10/site-packages/ocrd_browser/ui.gresource”: open() failed: No such file or directory (4)

It seems that ui.gresource is missing. I also cannot find it in the repository, maybe it wasn't pushed with the last update?

page view: add baselines if available

For handwritten text, as for any segmentation produced via baseline detection, it is usually helpful to see the detected baselines directly. We should make that available next to

Feature.LINES: FeatureDescription('𝌆', 'lines', '/page:PcGts/page:Page/*//page:TextLine/page:Coords'),

by something like

    Feature.BASELINES: FeatureDescription('–', 'baselines', '/page:PcGts/page:Page/*//page:TextLine/page:Baseline'), 

These coordinates are linestrings instead of linear rings / polygons, and we may need to use a stronger line width when drawing to make it visible. But other than that, it should be fairly simple.
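
The difference between the two geometry types can be illustrated with a small sketch. PAGE stores coordinates in the points attribute as space-separated "x,y" pairs; the function names parse_points and segments are made up for this example and do not refer to existing ocrd_browser code.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def parse_points(points: str) -> List[Point]:
    # PAGE stores coordinates as "x1,y1 x2,y2 ..." in the @points attribute.
    result = []
    for pair in points.split():
        x, y = pair.split(',')
        result.append((int(x), int(y)))
    return result


def segments(points: List[Point], closed: bool) -> List[Tuple[Point, Point]]:
    # A polygon (Coords) closes back to its first point; a linestring (Baseline) does not.
    pairs = list(zip(points, points[1:]))
    if closed and len(points) > 2:
        pairs.append((points[-1], points[0]))
    return pairs


baseline = parse_points('10,50 100,52 200,49')
```

A drawing routine would then call segments(baseline, closed=False) for baselines and closed=True for region or line outlines.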

clean temporary files

ocrd_browser needs to store preview images in the OS' temporary files location. This can easily sum up to a few gigabytes for large workspaces.

IMHO it would be better to remove these files when closing (even if they could be re-used) to prevent partitions (or RAM disks) from filling up.
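
A minimal sketch of such a cleanup, assuming the previews live in one dedicated directory: PreviewCache and the ocrd-browser- prefix are hypothetical names, not the current implementation.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


class PreviewCache:
    """Hypothetical holder for preview images in a private temporary directory."""

    def __init__(self) -> None:
        self.directory = Path(tempfile.mkdtemp(prefix='ocrd-browser-'))
        # Remove the directory on normal interpreter exit as a safety net.
        atexit.register(self.cleanup)

    def cleanup(self) -> None:
        # Safe to call more than once, e.g. on window close and again at exit.
        shutil.rmtree(self.directory, ignore_errors=True)


cache = PreviewCache()
```

Calling cache.cleanup() from the window's close handler would free the space immediately, while the atexit hook covers normal exits that bypass the handler.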

chdir to workspace

Since merging editable, there's a new problem: If I pass a filesystem path for METS on the command line that does not resolve to the current working directory, or if I select a METS in the open dialog that is in another directory, then as soon as I try to open the PageView or TextView, it crashes with the following trace:

  File "ocrd_browser/view/base.py", line 66, in <lambda>
    configurator.connect('changed', lambda _source, *value: self.config_changed(name, value))
  File "ocrd_browser/view/text.py", line 45, in config_changed
    self.reload()
  File "ocrd_browser/view/base.py", line 86, in reload
    self.current = self.document.page_for_id(self.page_id, self.use_file_group)
  File "ocrd_browser/model/document.py", line 356, in page_for_id
    image, _, _ = self.workspace.image_from_page(pcgts.get_Page(), page_id)
  File "ocrd/workspace.py", line 419, in image_from_page
    page_image_info = self.resolve_image_exif(page.imageFilename)
  File "ocrd/workspace.py", line 271, in resolve_image_exif
    ocrd_exif = exif_from_filename(image_filename)
  File "ocrd_modelfactory/__init__.py", line 32, in exif_from_filename
    with Image.open(image_filename) as pil_img:
  File "PIL/Image.py", line 2878, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] Datei oder Verzeichnis nicht gefunden: 'OCR-D-IMG/OCR-D-IMG_0001.tif'

This looks like we should chdir to the workspace directory after opening it (as all processors do; most core classes were written for them).
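
If a permanent chdir is undesirable in a GUI application, a temporary one around the calls into core might be enough. The context manager below is a generic sketch; working_directory is an illustrative name (Python 3.11 offers contextlib.chdir for the same purpose).

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Union


@contextmanager
def working_directory(directory: Union[Path, str]) -> Iterator[None]:
    previous = os.getcwd()
    os.chdir(str(directory))
    try:
        yield
    finally:
        # Always restore the previous directory, even if the body raises.
        os.chdir(previous)
```

Relative names such as 'OCR-D-IMG/OCR-D-IMG_0001.tif' would then resolve against the workspace directory for the duration of the with block.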

show fileGrp names exploiting full window width

OCR-D fileGrp USE strings can be quite long. In the drop-down list of the fileGrp dialog, longer names extruding from the pre-allocated width of the button are abbreviated with ellipses, which is a good thing. But why does the width of that button not utilise the full room that the parent panel provides?

(screenshot: ocrd_browser_ellipsis)

other MIME types

I have not dug into why exactly, but trying to open the PAGE-XML view on a workspace with ALTO files (text/xml) gives this:

  File "ocrd_browser/view/base.py", line 66, in <lambda>
    configurator.connect('changed', lambda _source, *value: self.config_changed(name, value))
  File "ocrd_browser/view/xml.py", line 50, in config_changed
    self.reload()
  File "ocrd_browser/view/base.py", line 86, in reload
    self.current = self.document.page_for_id(self.page_id, self.use_file_group)
  File "ocrd_browser/model/document.py", line 356, in page_for_id
    image, _, _ = self.workspace.image_from_page(pcgts.get_Page(), page_id)
  File "ocrd/workspace.py", line 384, in image_from_page
    page_image = self._resolve_image_as_pil(page.imageFilename)
  File "ocrd/workspace.py", line 295, in _resolve_image_as_pil
    pil_image = Image.open(image_filename)
  File "PIL/Image.py", line 2930, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file 'FULLTEXT/FILE_0001_FULLTEXT'

Looks like it tried to interpret this as an image (and make a PAGE-XML for it).

ViewPage: ignore AlternativeImage if not retrievable

Currently, ocrd_browser.view.page tries to add an image version for each AlternativeImage referenced in the page. But that can lead to an uncaught FileNotFoundError during

Image.open(path).size,
if the file happens to be missing.

Furthermore, even when I catch this, if the missing path is the last image, it resurfaces as preferred version during

page_image, page_coords, page_image_info = ws.image_from_page(self.page, self.id, transparency=True, feature_selector=feature_selector, feature_filter=feature_filter)
This happens because the selection mechanism for image_from_page works by features, not by file path.

IMHO this should be more robust: An image file reference (irrespective of whether it is derived or original) that is nowhere to be found in the filesystem should simply be rendered as an all-white canvas of the same size.

base file group other than OCR-D-IMG

I have a METS here which does contain a fileGrp OCR-D-IMG, but not comprising all physical pages. This gives me:

INFO ocrd.resolver.workspace_from_nothing - Writing METS to /tmp/ocrd-core-ttugo0kk/mets.xml
Traceback (most recent call last):
  File "ocrd_browser/ui/window.py", line 88, in _open
    self.page_list.set_document(self.document)
  File "ocrd_browser/ui/page_browser.py", line 39, in set_document
    self.model = PageListStore(self.document)
  File "ocrd_browser/ui/page_store.py", line 56, in __init__
    file = str(file_lookup[page_id])
KeyError: 'f00037100714864

So I dug into ocrd_browser.ui.page_store and thought it might be sufficient to just check page_id in file_lookup before appending a row to the Gtk list. But this raises bigger questions:

  1. Why should the initial view be restricted to pages contained in OCR-D-IMG at all? This could easily just be empty. With practical library systems, the initial image fileGrp could realistically be called MAX, ORIGINAL or something else instead. My understanding of this program is that it should try to present a view of all physical pages (at least initially, before selecting a fileGrp explicitly). So how about presenting all structMap entries sorted by their @ORDER (if present) or @ID with the first fptr that shows up?

  2. How do you change to a different fileGrp? ui.view.base has a View.use_file_group property fixed to OCR-D-IMG.

Integrate page-xml-draw

There is a new library that can generate an opencv image from Page-Xml.
It would be great to use it for visualizing page-xml in browse-ocrd
Ideas (from simple to complex):

  1. Use the generated image directly in an image view
  2. Wrap the image in HTML with an imagemap with actionable areas and display it via a Webkit view.
  3. Write a special view for interacting with the library, that renders shapely-"polygons" and tests for actionable areas

Feature Request: Scroll lock panels

When comparing PAGE-XML and, to a lesser extent, zoomed-in images, it would be neat if one could link two panels (e.g. by clicking on a lock symbol in the panel's toolbar), so that scrolling in one would scroll the other one automatically.

For a purely textual comparison, dinglehopper or a future diff-view (#13) is best but I sometimes use it to compare things like the generated IDs of elements or to spot if any text is missing or not consistent across TextEquiv levels and not having to scroll two panels manually would help.

add OCR alignment and difference view

This is clearly a desideratum here, but how do we approach it?

Considerations:

  1. The additional view would need 2 FileGroupSelectors instead of 1
  2. There are 2 cases:
    • A: equal segmentation but different recognition results: character alignment and difference highlighting within lines only
    • B: different segmentation and recognition results: textline alignment and difference highlighting within larger chunks
  3. The actual alignment code needs to be fast and reliable. The underlying problem of global sequence alignment (Needleman-Wunsch algorithm) has O(n²) complexity (or O(n³) under arbitrary weights). There are many different packages for this on PyPI with various levels of features (including cost functions or weights) and efficiency (including C library backends). But not all of them are
    • suited for Unicode (or arbitrary lists of objects),
    • robust (both in terms of crashes and glitches on strange input and heap/stack restrictions),
    • actually efficient (in terms of average complexity or best case complexity)
    • well maintained and packaged.
  4. For historical text specifically, one must treat grapheme clusters as single objects to compare, and probably normalize certain sequences (or at least reduce their distance/cost to the normalized equivalent), e.g. aͤ vs ä or ﬅ vs ſt or even ſ vs s.
  5. It would therefore seem natural to delegate to one of the existing OCR-D processors for OCR evaluation (or its backend library modules), i.e. ocrd-dinglehopper and ocrd-cor-asv-ann-evaluate, which have quite a few differences:
| ocrd-dinglehopper | ocrd-cor-asv-ann-evaluate |
| --- | --- |
| CER and WER and visualization | only CER (currently) |
| only single pages | aggregates over all pages |
| result is HTML with visual diff + JSON report | result is logging |
| alignment written in Python (slow) | difflib.SequenceMatcher (fast; I tried many libraries on lots of data for robustness and speed, and decided to revert to that as a consequence) |
| uniseg.graphemeclusters to get alignment+distances on graphemes (lists of objects) | calculates alignment on codepoints (faster) but then post-processes to join combining sequences with their base character, so distances are almost always on graphemes as well |
| a set of normalizations that (roughly) target OCR-D GT transcription guidelines level 3 to level 2 (which is laudable) | offers plain Levenshtein for GT level 3, NFC/NFKC/NFKD/NFD for GT level 2, and a custom normalization (called historic_latin) that targets GT level 1 (because NFKC is both quite incomplete and too much already) |
| text alignment of complete page text concatenated (suitable for A or B) | text alignment on identical textlines (suitable for B only) |
| compares 1:1 | compares 1:N |
  6. Whatever module we choose, and whatever method to integrate its core functionality (without the actual OCR-D processor), we need to visualise the difference with Gtk facilities. For GtkSource.LanguageManager, an off-the-shelf highlighter that would lend itself is diff (coloring diff -u line output). But this does not colorize within the lines (like git diff --word-diff, wdiff, dwdiff etc), which is the most important contribution IMHO. So perhaps we need to use some existing word-diff syntax and write our own highlighter after all. Or we integrate dinglehopper's HTML and display it via WebKit directly.
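
To make the within-line case (A) concrete, here is a small sketch using difflib.SequenceMatcher from the standard library, as mentioned for ocrd-cor-asv-ann-evaluate. It is not code from either processor: char_diff is a hypothetical helper, it only applies NFC normalization, and it compares code points rather than full grapheme clusters.

```python
import difflib
import unicodedata
from typing import List, Tuple


def char_diff(a: str, b: str) -> List[Tuple[str, str, str]]:
    # Normalize first so that base character + combining mark equals the precomposed form.
    a = unicodedata.normalize('NFC', a)
    b = unicodedata.normalize('NFC', b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    # Each opcode becomes (tag, chunk of a, chunk of b); tag is one of
    # 'equal', 'replace', 'delete' or 'insert'.
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


diff = char_diff('Haus', 'Hans')
# diff == [('equal', 'Ha', 'Ha'), ('replace', 'u', 'n'), ('equal', 's', 's')]
```

A Gtk view could map the non-equal chunks to text tags with different background colours, which would give the intra-line highlighting that the plain diff syntax lacks.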

list index out of range on non-XML fileGrp

When I try to open a fileGrp which contains only images, but no PAGE files, and if it does not contain the original image (like OCR-D-IMG) but only derived images (referenced by other fileGrps via AlternativeImage, e.g. OCR-D-IMG-BIN), then browse-ocrd collapses with the following error:

Traceback (most recent call last):
  File "ocrd_browser/view/base.py", line 66, in <lambda>
    configurator.connect('changed', lambda _source, *value: self.config_changed(name, value))
  File "ocrd_browser/view/images.py", line 40, in config_changed
    self.reload()
  File "ocrd_browser/view/images.py", line 69, in reload
    self.pages.append(self.document.page_for_id(display_id, self.use_file_group))
  File "ocrd_browser/model/document.py", line 233, in page_for_id
    pcgts = self.page_for_file(page_files[0])
IndexError: list index out of range

(Otherwise display of derived page images works fine.)

Oh, and could you please set the ocr-d topic for this repo to increase its visibility with OCR-D users?

use last fileGrp as default

def use_file_group(self) -> str:
    return 'OCR-D-IMG'

This often does not exist in real life (DFG profile has DEFAULT or MAX or ORIGINAL), and has empty PAGE XML anyway. So IMO it would make more sense to pick the last fileGrp with PAGE-XML mimetype by default.

(Or even better, make this configurable as well, like the default image fileGrp.)
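
The proposed default could be computed along these lines. This is a sketch over plain (fileGrp, mimetype) pairs in METS order rather than the real workspace API; default_file_group and its fallback parameter are assumptions, while the MIME type string is the one OCR-D uses for PAGE-XML.

```python
from typing import Iterable, Optional, Tuple

MIMETYPE_PAGE = 'application/vnd.prima.page+xml'


def default_file_group(files: Iterable[Tuple[str, str]],
                       fallback: Optional[str] = None) -> Optional[str]:
    # files: (fileGrp, mimetype) pairs in METS document order.
    result = fallback
    for file_group, mimetype in files:
        if mimetype == MIMETYPE_PAGE:
            # Later groups overwrite earlier ones, so the last PAGE group wins.
            result = file_group
    return result


files = [
    ('ORIGINAL', 'image/tiff'),
    ('OCR-D-SEG', MIMETYPE_PAGE),
    ('OCR-D-OCR', MIMETYPE_PAGE),
    ('OCR-D-IMG-BIN', 'image/png'),
]
chosen = default_file_group(files)
```

A configured value could be passed as fallback, so that workspaces without any PAGE files still get a sensible default.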

Make the page browser just another View

At the moment the page browser (ocrd_browser.ui.PagePreviewList) is a hardcoded part of the UI, with a pre-set file_group and position in the UI.
It would be nice if it would behave the same as the other views, so that the user would be able to instantiate, configure and close it.
This opens the possibility to have more than one page browser showing more than one file_group at once, e.g. seeing the binarized and cropped versions of all pages next to each other.
Also, a user-selected file_group would avoid restricting the view to preconfigured file_group names like 'OCR-D-IMG.*', as discussed in #7.
This can be achieved by changing the page browser to extend and utilize ocrd_browser.view.View with an ocrd_browser.view.FileGroupSelector
