Comments (3)
This is not a bug.
As explained in the documentation, plain text extraction always returns text in the order in which it is stored in the page's appearance source.
So in this case, the PDF creator first stored "体重指数增⾼", and then "01".
You have to develop code that also extracts the text coordinates and uses them to sort the text pieces accordingly.
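A minimal sketch of such coordinate-based sorting, assuming word tuples in the layout that page.get_text("words") returns, i.e. (x0, y0, x1, y1, text, ...). The helper names sort_words and page_lines_in_reading_order, and the 3-point line tolerance, are illustrative assumptions and not part of PyMuPDF:

```python
def sort_words(words, line_tolerance=3.0):
    """Group word tuples into lines and order them top-to-bottom, left-to-right.

    Each word is assumed to look like (x0, y0, x1, y1, text, ...).
    Words whose bottom edge (y1) is within line_tolerance of the first
    word of the current line are treated as belonging to that line.
    """
    ordered = sorted(words, key=lambda w: (w[3], w[0]))
    lines = []
    for w in ordered:
        if lines and abs(w[3] - lines[-1][0][3]) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Within each line, order by the left edge (x0) and join the texts.
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
            for line in lines]


def page_lines_in_reading_order(path, page_number=0):
    """Hypothetical wrapper: read one page's words and sort them."""
    import pymupdf  # imported lazily; older installs may need "import fitz"

    with pymupdf.open(path) as doc:
        words = doc[page_number].get_text("words")
    return sort_words(words)


# Example with made-up coordinates: the heading is stored first,
# but the number sits to its left on the same line.
sample = [
    (100, 10, 160, 22, "heading"),
    (50, 9, 70, 21, "01"),
    (50, 40, 120, 52, "next"),
]
result = sort_words(sample)
```

This only handles simple left-to-right, single-column pages; the tolerance may need tuning for other font sizes.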
But a solution for this problem already exists: install the package pymupdf4llm via python -m pip install pymupdf4llm
and then run this:
import pathlib
import pymupdf4llm
data = pymupdf4llm.to_markdown("2024_5_.pdf")
pathlib.Path("2024_5_.pdf.md").write_bytes(data.encode())
This extracts the text in Markdown format and in the correct reading sequence, and it also determines header lines from the font sizes present in the document. The result is this:
2024_5_.pdf.md
@JorjMcKie
I suggest that the program's layout processing should be closer to human reading habits. In my reading habits, "01" is to the left of "number", not below it.
First of all, there is no such thing as a universal "reading habit": just think of Arabic / Persian / Hebrew and other right-to-left scripts ... or even worse, documents that mix right-to-left with left-to-right text.
There are also writing systems that run top-to-bottom, character by character.
Above all, there are numerous situations where exact information is needed about the "physical" sequence of objects (including text objects), e.g. when determining which object covers which other one.
Last but not least: I just showed you a way to extract text in a top-left to bottom-right sequence.
PyMuPDF lets you choose among multiple alternatives.