
Duplicate strings assigned to the same cell about camelot HOT 12 CLOSED

vinayak-mehta commented on July 26, 2024

Duplicate strings assigned to the same cell

from camelot.

Comments (12)

vinayak-mehta commented on July 26, 2024

@edugonza Thank you for fixing this! The PR looked good! Thank you for adding a test too 👍

I'll start working on a release soon.


edugonza commented on July 26, 2024

Hi guys, I sent a PR with a working solution to the issue. I added a unit test that uses the PDF file mentioned in the first comment.


davidkong0987 commented on July 26, 2024

I believe this occurs when bold text is produced by drawing duplicate characters on top of each other instead of widening the glyph. I've often seen 4 copies of each character, although in your example it is 2. I think this happens at the PDF level, because these bold characters don't differ from regular ones in font or any other characteristic.
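
A minimal sketch of what this looks like at the character level, assuming each glyph is reduced to its text and bounding box. The `Glyph` tuple, the `dedupe_glyphs` helper, and the 1pt tolerance are hypothetical names and values chosen for illustration; they are not part of camelot or pdfminer:

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    # Hypothetical stand-in for a pdfminer LTChar: text plus bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def dedupe_glyphs(glyphs: List[Glyph], tol: float = 1.0) -> List[Glyph]:
    """Keep the first of any glyphs with the same text and near-identical boxes."""
    kept: List[Glyph] = []
    for g in glyphs:
        is_dup = any(
            g.text == k.text
            and abs(g.x0 - k.x0) < tol
            and abs(g.y0 - k.y0) < tol
            and abs(g.x1 - k.x1) < tol
            and abs(g.y1 - k.y1) < tol
            for k in kept
        )
        if not is_dup:
            kept.append(g)
    return kept


# "Hi" drawn twice with a tiny horizontal offset to fake bold.
fake_bold = [
    Glyph("H", 10.0, 20.0, 16.0, 30.0),
    Glyph("H", 10.2, 20.0, 16.2, 30.0),
    Glyph("i", 16.0, 20.0, 19.0, 30.0),
    Glyph("i", 16.2, 20.0, 19.2, 30.0),
]
print("".join(g.text for g in dedupe_glyphs(fake_bold)))  # Hi
```

Because the copies share the same font and size, position and text are the only signals available to tell them apart from ordinary characters.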


davidkong0987 commented on July 26, 2024

In addition, this is made worse by the fact that the LTHorizontal object splits the line in two for some of the duplicates but not for others.


TheNetJedi commented on July 26, 2024

Yep, facing the same issue.
And yes, this only occurs with bold characters AFAIK.
Any workaround for this apart from fixing the PDFs?


davidkong0987 commented on July 26, 2024

There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.


TheNetJedi commented on July 26, 2024

@davidkong0987

Can you please guide me on how I would do that?
I'm a noob.


davidkong0987 commented on July 26, 2024

You need to change the source code, so this isn't a great task if you're not comfortable with programming.

Wherever you see horizontals = get_text_objects(ltype=LThorizontal), you can add the following code right after it to remove the duplicate horizontals.

        from pdfminer.layout import LTAnno

        def text_bbox(line):
            # Bounding box of the visible (non-whitespace) characters in a text line.
            chars = [t for t in line
                     if not isinstance(t, LTAnno) and t.get_text().strip()]
            if not chars:
                return None
            return (min(t.x0 for t in chars), min(t.y0 for t in chars),
                    max(t.x1 for t in chars), max(t.y1 for t in chars))

        deletes = []
        for i in horizontals:
            box_i = text_bbox(i)
            if box_i is None or i in deletes:
                continue
            for obj in horizontals:
                box_obj = text_bbox(obj)
                if obj is i or obj in deletes or box_obj is None:
                    continue
                # obj lies inside i (with 1pt of slack), so treat it as a duplicate.
                if (box_obj[0] > box_i[0] - 1 and box_obj[1] > box_i[1] - 1 and
                        box_obj[2] < box_i[2] + 1 and box_obj[3] < box_i[3] + 1):
                    deletes.append(obj)
                    # Flag the surviving line and its characters as bold.
                    i.customBold = True
                    for char in i:
                        char.customBold = True
        horizontals = [obj for obj in horizontals if obj not in deletes]

If anyone notices cases that this does not cover, please let me know.
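
To see how a containment test of this kind treats the split-line case mentioned earlier in the thread, here is a small self-contained sketch that works on plain `(x0, y0, x1, y1)` tuples instead of pdfminer objects. The `contains` and `drop_contained` helpers and the sample coordinates are hypothetical and only illustrate the idea:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: Box, inner: Box, slack: float = 1.0) -> bool:
    """True if `inner` lies within `outer`, allowing `slack` points of overshoot."""
    return (
        inner[0] > outer[0] - slack
        and inner[1] > outer[1] - slack
        and inner[2] < outer[2] + slack
        and inner[3] < outer[3] + slack
    )


def drop_contained(boxes: List[Box], slack: float = 1.0) -> List[Box]:
    """Remove every box that is contained in another box that is still kept."""
    deleted = set()
    for i, outer in enumerate(boxes):
        if i in deleted:
            continue
        for j, inner in enumerate(boxes):
            if j != i and j not in deleted and contains(outer, inner, slack):
                deleted.add(j)
    return [b for k, b in enumerate(boxes) if k not in deleted]


full_line = (50.0, 700.0, 250.0, 712.0)   # one copy, kept as a single line
left_half = (50.0, 700.0, 140.0, 712.0)   # the other copy, split in two
right_half = (150.0, 700.0, 250.0, 712.0)
other_row = (50.0, 680.0, 250.0, 692.0)   # unrelated text on the next row

print(drop_contained([left_half, full_line, right_half, other_row]))
```

Both halves of the split copy fall inside the unsplit line, so only the full line and the unrelated row survive.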


TheNetJedi commented on July 26, 2024

@davidkong0987

Thanks, I'll try this out and get back to you!


davidkong0987 commented on July 26, 2024

Sometimes text is stacked on top of other text intentionally; this fix doesn't account for that.
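
One possible guard, sketched here as an assumption rather than anything camelot does, is to drop an overlapping line only when it also repeats the text of the line that contains it. The `Line` tuple and both helpers are hypothetical:

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    # Hypothetical stand-in for a text line: its string and bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def is_duplicate(outer: Line, inner: Line, slack: float = 1.0) -> bool:
    """`inner` sits inside `outer` AND repeats (part of) its text."""
    inside = (
        inner.x0 > outer.x0 - slack
        and inner.y0 > outer.y0 - slack
        and inner.x1 < outer.x1 + slack
        and inner.y1 < outer.y1 + slack
    )
    stripped = inner.text.strip()
    # Ignore whitespace-only lines, since "" is a substring of everything.
    return inside and bool(stripped) and stripped in outer.text


def drop_duplicates(lines: List[Line]) -> List[Line]:
    deleted = set()
    for i, outer in enumerate(lines):
        if i in deleted:
            continue
        for j, inner in enumerate(lines):
            if j != i and j not in deleted and is_duplicate(outer, inner):
                deleted.add(j)
    return [ln for k, ln in enumerate(lines) if k not in deleted]


lines = [
    Line("Total", 50.0, 700.0, 90.0, 712.0),
    Line("Total", 50.3, 700.0, 90.3, 712.0),  # fake-bold copy: dropped
    Line("_____", 50.0, 700.0, 90.0, 712.0),  # deliberate overlay: kept
]
print([ln.text for ln in drop_duplicates(lines)])  # ['Total', '_____']
```

This still cannot tell a fake-bold copy from identical text that was stacked on purpose, so it narrows the problem rather than solving it.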


vinayak-mehta commented on July 26, 2024

> There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.

Yes! Let me see if I can get this into the library. Would you like to raise a PR with a corresponding test with the example PDF?

> sometimes text is stacked on top of each other intentionally, this doesn't adjust for that

Yes.


rain01 commented on July 26, 2024

Can't wait. Any idea when it will be released?
