
Comments (12)

vinayak-mehta commented on July 26, 2024

@edugonza Thank you for fixing this! The PR looked good! Thank you for adding a test too 👍

I'll start working on a release soon.

from camelot.

edugonza commented on July 26, 2024

Hi guys, I sent a PR with a working solution to the issue. I added a unit test with the PDF file mentioned in the first comment.

from camelot.

davidkong0987 commented on July 26, 2024

I believe this occurs when bold text is created by drawing duplicate characters on top of each other instead of widening the character. I've noticed it often creates 4 copies of each character, although in your example it is 2. I think this happens at the PDF level, because these bold characters don't differ from regular ones in font or other characteristics.
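One way to check whether a PDF fakes bold like this is to count how often the same character shows up at (nearly) the same position. This is only a rough sketch: `duplicate_counts` and the `(text, x0, y0)` tuples are hypothetical simplifications of pdfminer's character objects, and rounding the coordinates is an assumed way of treating near-identical positions as equal.

```python
from collections import Counter


def duplicate_counts(chars, ndigits=0):
    """Count occurrences of each (text, rounded x0, rounded y0) triple.

    `chars` is an iterable of (text, x0, y0) tuples, a hypothetical
    stand-in for the characters extracted from one page.
    """
    return Counter(
        (text, round(x0, ndigits), round(y0, ndigits))
        for text, x0, y0 in chars
    )


# A "T" drawn twice at almost the same spot, and an "o" drawn once.
sample = [("T", 10.0, 20.0), ("T", 10.1, 20.0), ("o", 16.0, 20.0)]
counts = duplicate_counts(sample)
repeated = {key: n for key, n in counts.items() if n > 1}
```

If `repeated` is non-empty for the bold runs only, that would be consistent with the duplicates coming from the PDF itself rather than from the parser.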

from camelot.

davidkong0987 commented on July 26, 2024

In addition, this is made worse by the fact that in some duplicates the LTHorizontal object splits the line in two, while in others the line is not split.

from camelot.

TheNetJedi commented on July 26, 2024

Yep, facing the same issue.
And yes, this only occurs with bold characters AFAIK.
Any workaround for this apart from fixing the PDFs?

from camelot.

davidkong0987 commented on July 26, 2024

There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.

from camelot.

TheNetJedi commented on July 26, 2024

@davidkong0987

Can you please guide me on how I would do that?
I'm a noob.

from camelot.

davidkong0987 commented on July 26, 2024

You need to change the source code, so this isn't a great task if you're not comfortable with programming.

Wherever you see horizontals = get_text_objects(ltype=LThorizontal), you can add the following code right after it to delete the duplicate horizontals.

        from pdfminer.layout import LTAnno

        def text_bbox(line):
            """Bounding box of the non-blank characters in a line, or None if there are none."""
            chars = [t for t in line if not isinstance(t, LTAnno) and t.get_text().strip()]
            if not chars:
                return None
            return (min(t.x0 for t in chars), min(t.y0 for t in chars),
                    max(t.x1 for t in chars), max(t.y1 for t in chars))

        boxes = [text_bbox(line) for line in horizontals]
        deletes = []  # indices of lines that are fully covered by another line
        for i, outer in enumerate(boxes):
            if outer is None or i in deletes:
                continue
            for j, inner in enumerate(boxes):
                if j == i or inner is None or j in deletes:
                    continue
                # inner lies within outer, with a tolerance of 1 unit on every side
                if (inner[0] > outer[0] - 1 and inner[1] > outer[1] - 1
                        and inner[2] < outer[2] + 1 and inner[3] < outer[3] + 1):
                    deletes.append(j)
                    # mark the surviving line and its characters as bold
                    horizontals[i].customBold = True
                    for char in horizontals[i]:
                        char.customBold = True
        horizontals = [line for k, line in enumerate(horizontals) if k not in deletes]

If anyone notices cases that this does not cover, please let me know.

from camelot.

TheNetJedi commented on July 26, 2024

@davidkong0987

Thanks, I'll try this out and get back to you!

from camelot.

davidkong0987 commented on July 26, 2024

sometimes text is stacked on top of each other intentionally, this doesn't adjust for that
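One possible guard against that, as a sketch only: drop an overlapping line only when its text also matches the line that covers it. The `Line` class, `contains`, and `drop_duplicates` below are hypothetical stand-ins rather than anything from camelot or pdfminer, and the tolerance of 1 unit is simply an assumed default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """Hypothetical stand-in for a text line: bounding box plus its text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def contains(outer, inner, tol=1.0):
    """True if inner's box lies within outer's box, allowing tol units of slack."""
    return (inner.x0 > outer.x0 - tol and inner.y0 > outer.y0 - tol
            and inner.x1 < outer.x1 + tol and inner.y1 < outer.y1 + tol)


def drop_duplicates(lines, tol=1.0):
    """Keep a line unless an already kept line covers it and has the same text."""
    kept = []
    for line in lines:
        is_dup = any(
            contains(k, line, tol) and k.text.strip() == line.text.strip()
            for k in kept
        )
        if not is_dup:
            kept.append(line)
    return kept


bold_a = Line(10, 10, 50, 20, "Total")
bold_b = Line(10.2, 10, 50.2, 20, "Total")   # second copy used to fake bold
stacked = Line(12, 11, 40, 19, "DRAFT")      # different text drawn on top
result = drop_duplicates([bold_a, bold_b, stacked])
```

Here `result` keeps `bold_a` and `stacked` and drops `bold_b`. The trade-off is that a duplicate which was split into two lines no longer has matching text, so this stricter check would leave those cases untouched.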

from camelot.

vinayak-mehta commented on July 26, 2024

> There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.

Yes! Let me see if I can get this into the library. Would you like to raise a PR with a corresponding test with the example PDF?

> sometimes text is stacked on top of each other intentionally, this doesn't adjust for that

Yes.

from camelot.

rain01 commented on July 26, 2024

Can't wait. Any idea when it will be released?

from camelot.
