When extracting a numbered list, the result is not as expected.

wencan commented on September 25, 2024

When extracting a numbered list, the result is not as expected.

from pymupdf.

Comments (3)

JorjMcKie commented on September 25, 2024

This is not a bug.

As explained in the documentation, plain text extraction always delivers text in the sequence in which it is stored in the page's appearance source.
So in this case, the PDF creator first stored "体重指数增⾼", and then "01".

You have to develop code that also extracts the text coordinates and uses them to sort the text pieces accordingly.
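A minimal sketch of such coordinate-based sorting, not a definitive implementation. It relies on `page.get_text("words")`, which returns tuples starting with `(x0, y0, x1, y1, text, ...)`. The helper names `words_to_lines` and `extract_in_reading_order` and the `y_tolerance` value are assumptions made for illustration and would need tuning for a real document.

```python
from typing import List, Sequence


def words_to_lines(words: Sequence[Sequence], y_tolerance: float = 3.0) -> List[str]:
    """Rebuild visual lines from (x0, y0, x1, y1, text, ...) word tuples."""
    lines = []  # each entry: [bottom of the line's first word, words on that line]
    # Visit words top to bottom (by bottom edge y1), then left to right (x0).
    for word in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(word[3] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)  # close enough vertically: same line
        else:
            lines.append([word[3], [word]])  # start a new line
    # Inside each line, order strictly left to right and join the text.
    return [
        " ".join(w[4] for w in sorted(line_words, key=lambda w: w[0]))
        for _, line_words in lines
    ]


def extract_in_reading_order(pdf_path: str) -> str:
    """Apply words_to_lines() to every page of a PDF (hypothetical helper)."""
    import pymupdf  # imported here so words_to_lines() is usable without it

    with pymupdf.open(pdf_path) as doc:
        pages = ["\n".join(words_to_lines(page.get_text("words"))) for page in doc]
    return "\n".join(pages)
```

For simple layouts, `page.get_text(sort=True)` may already be enough, since it orders the output by vertical and then horizontal position. The sketch above only adds a tolerance so that a list number and its text land on the same line even when their bottoms differ slightly.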

But a ready-made solution for this problem already exists: install the package pymupdf4llm via python -m pip install pymupdf4llm and then do this:

import pathlib
import pymupdf4llm

# Convert the whole PDF to Markdown text.
data = pymupdf4llm.to_markdown("2024_5_.pdf")
# Save the result as UTF-8 encoded bytes.
pathlib.Path("2024_5_.pdf.md").write_bytes(data.encode())

This extracts the text in Markdown format and in the correct reading sequence, and also determines header lines from the font sizes present in the document. The result is this:
2024_5_.pdf.md

wencan commented on September 25, 2024

@JorjMcKie
I suggest that the program's layout processing should be closer to human reading habits. In my reading habits, "01" is to the left of "number", not below it.

JorjMcKie commented on September 25, 2024

First of all, there is no such thing as universal "reading habits": just think of Arabic / Persian / Hebrew and other right-to-left scripts ... or even worse, documents that mix right-to-left with left-to-right text.

There are also writing systems that run top-to-bottom, character by character.

Above all, there are numerous situations where exact information is needed about the "physical" sequence of objects (including text objects), e.g. when it comes to determining which object covers which other one.
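To illustrate why the physical sequence matters, here is a small sketch that finds the object painted last at a given point. It assumes a draw log shaped as a list of `(type, (x0, y0, x1, y1))` entries in painting order, which is roughly what `Page.get_bboxlog()` delivers. The function name `topmost_at` is hypothetical, and the sketch ignores transparency and clipping.

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # x0, y0, x1, y1


def topmost_at(draw_log: Sequence[Tuple[str, Rect]], x: float, y: float) -> Optional[int]:
    """Return the index of the last-drawn entry whose rectangle contains (x, y)."""
    hit = None
    for index, (_kind, (x0, y0, x1, y1)) in enumerate(draw_log):
        if x0 <= x <= x1 and y0 <= y <= y1:
            hit = index  # later entries are painted over earlier ones
    return hit


# Made-up sample data: a background, some text, then a shape drawn over part of it.
sample_log = [
    ("fill-path", (0, 0, 100, 100)),
    ("fill-text", (10, 10, 60, 30)),
    ("fill-path", (50, 0, 150, 50)),
]
covering = topmost_at(sample_log, 55, 20)  # index of whatever is on top at (55, 20)
```

If the log were sorted by position instead of kept in painting order, this question could no longer be answered.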

Last but not least: I just showed you a way to extract text in a top-left to bottom-right sequence.

PyMuPDF lets you choose among multiple alternatives.
