
Comments (3)

JorjMcKie avatar JorjMcKie commented on July 2, 2024

This is no bug.

As explained in the documentation, naive text extraction always returns text in the order in which it is stored in the page's appearance source.
So in this case, the PDF creator first stored "体重指数增⾼", and then "01".

You have to develop code that also extracts the text coordinates and uses them to sort the text pieces accordingly.
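A minimal sketch of such coordinate-based sorting, not PyMuPDF's own algorithm: it assumes each text piece is a tuple starting with `(x0, y0, x1, y1, text)`, which is the leading part of what `page.get_text("words")` returns. The function name `sort_reading_order`, the `y_tolerance` value and the sample coordinates are all hypothetical and only serve the illustration.

```python
def sort_reading_order(spans, y_tolerance=3.0):
    """Return spans ordered top-to-bottom, then left-to-right.

    Each span is a tuple (x0, y0, x1, y1, text, ...). Spans whose top
    edge (y0) lies within y_tolerance of the first span of a line are
    treated as belonging to that line. The tolerance is an assumption
    and needs tuning per document.
    """
    # First pass: order by top edge so lines can be built sequentially.
    by_top = sorted(spans, key=lambda s: (s[1], s[0]))

    lines = []  # list of (reference_y0, [spans on that line])
    for span in by_top:
        if lines and abs(span[1] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(span)
        else:
            lines.append((span[1], [span]))

    # Second pass: within each line, order by left edge.
    ordered = []
    for _, line_spans in lines:
        ordered.extend(sorted(line_spans, key=lambda s: s[0]))
    return ordered


# Hypothetical coordinates: the heading is stored first in the page
# source, although "01" appears to its left on the rendered page.
stored_order = [
    (100.0, 50.0, 200.0, 62.0, "体重指数增⾼"),
    (60.0, 51.0, 80.0, 63.0, "01"),
    (60.0, 80.0, 150.0, 92.0, "next line"),
]
texts = [s[4] for s in sort_reading_order(stored_order)]
print(texts)
```

With PyMuPDF, the input could come from `page.get_text("words")`. Note that this simple approach assumes a single-column, left-to-right layout.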

But a solution for this problem already exists: install the package pymupdf4llm via `python -m pip install pymupdf4llm` and then do this:

```python
import pathlib
import pymupdf4llm

# Convert the PDF to a Markdown string in reading order.
data = pymupdf4llm.to_markdown("2024_5_.pdf")
# Store the Markdown text as a UTF-8 encoded file.
pathlib.Path("2024_5_.pdf.md").write_bytes(data.encode())
```

This extracts the text as Markdown in the correct reading sequence and also identifies header lines from the font sizes present in the document. The result is this:
2024_5_.pdf.md

from pymupdf.

wencan avatar wencan commented on July 2, 2024

@JorjMcKie
I suggest that the program's layout processing should be closer to human reading habits. In my reading habits, "01" is to the left of "number", not below it.


JorjMcKie avatar JorjMcKie commented on July 2, 2024

First of all, there is no such thing as a universal "reading habit": just think of Arabic, Persian, Hebrew and other right-to-left scripts, or even worse, documents that mix right-to-left with left-to-right text.

There are also writing systems that run top to bottom, character by character.

Above all, there are numerous situations where exact information is needed about the "physical" sequence of objects (including text objects), e.g. when determining which object covers which other one.

Last but not least: I just showed you a way to extract text in a top-left to bottom-right sequence.

PyMuPDF lets you choose among multiple alternatives.

