Comments (6)
I think there is a basic misconception here:
PyMuPDF text extraction does not care at all whether text pieces are in table cells or not!
Whether pieces of text are regarded as being on the same line is decided by criteria such as font size and inter-character and inter-word distances.
In other words, your red arrows do not point to a bug.
from pymupdf.
It looks like you want to locate and extract text from table cells.
This is supported, but you have to find the table first via page.find_tables() and then use the table-related methods to extract text from single cells.
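A minimal sketch of that approach, assuming a PyMuPDF version that provides `Page.find_tables()` and `Table.extract()`. The helper name `table_cell_texts` and the file name in the usage comment are placeholders for illustration, not part of the library.

```python
def table_cell_texts(page):
    """Return, for each table detected on the page, its rows as lists of cell strings.

    `page` is expected to be a PyMuPDF Page object; find_tables() returns a
    finder whose `.tables` list holds the detected tables.
    """
    finder = page.find_tables()
    # Table.extract() yields a list of rows, each row a list of cell texts.
    return [table.extract() for table in finder.tables]


# Usage (requires PyMuPDF; "input.pdf" is a placeholder file name):
# import fitz
# with fitz.open("input.pdf") as doc:
#     for page in doc:
#         for rows in table_cell_texts(page):
#             for row in rows:
#                 print(row)
```

Whether the table is detected at all depends on the page's graphics and the detection strategy, so the result should be checked against the actual document.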
No, it seems I haven't been clear enough:
Text extraction knows nothing about cells. All it knows about is text. Whether text pieces are joined to form one span is decided independently of whether there are any lines, background colors, or anything else.
Just imagine that the page contains no lines, no background, nothing of that sort: only text is present.
I made a picture for you: This is what text extraction sees:
Thank you so much, now it is clearer to me what you mean.
ADDED:
I now use 'rawdict' instead of 'dict' as the parameter for page.get_text(), so the script gives better results for those two columns, but case number 2 is still there:
import matplotlib.pyplot as plt
import fitz

# auxiliary function to plot a closed rectangle
def plot_poly(x1, y1, x2, y2, color='k', linewidth=1):
    plt.plot(
        [x1, x2, x2, x1, x1],  # four x vertices, closed
        [y1, y1, y2, y2, y1],  # four y vertices, closed
        color=color,
        linewidth=linewidth
    )

pdf_file = "North 02_Minieh_Record 01.pdf"
with fitz.open(pdf_file) as doc:
    for page in doc:
        dic = page.get_text('rawdict')
        # the text
        for block in dic['blocks']:
            if block['type'] == 0:  # text block
                for line in block['lines']:
                    X1, Y1, X2, Y2 = line['bbox']  # bbox of each line
                    #plot_poly(X1, Y1, X2, Y2, color = 'r') #no need for it now as previous code
                    ''' #irrelevant for code now
                    if line['dir'][1] == -1: #Rotated text
                        angle = 90
                    else: #Normal text orientation
                        angle = 0
                    '''
                    for span in line['spans']:
                        ascender = span['ascender']
                        descender = span['descender']
                        size = span['size']
                        font = span['font']
                        #text = span['text']
                        x0, y0 = span['origin']
                        plt.plot(x0, y0, 'o', markersize=1, color='b')  # origin x, y of each span
                        spx1, spy1, spx2, spy2 = span['bbox']
                        plot_poly(spx1, spy1, spx2, spy2, color='r')
        # the layout
        # one can comment this block out, but it helps to see the table grid
        drawings = page.get_drawings()
        for drawing in drawings:
            for item in drawing['items']:
                if item[0] != 're':  # only rectangle items unpack into four coordinates
                    continue
                Xr1, Yr1, Xr2, Yr2 = item[1]
                width = Xr2 - Xr1
                height = Yr2 - Yr1
                if width > 1 and height < 1:  # thin horizontal rule
                    plot_poly(Xr1, Yr1, Xr2, Yr2)
                elif width < 1 and height > 1:  # thin vertical rule
                    plot_poly(Xr1, Yr1, Xr2, Yr2)
        # PDF y-coordinates grow downwards, so flip the axis; one figure per page
        ax = plt.gca()
        ax.invert_yaxis()
        plt.show()
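
The script above already collects both character-level boxes (via 'rawdict') and the thin vertical rules of the grid. One possible next step, sketched here as an assumption rather than an established PyMuPDF feature, is to split each extracted line at the x-positions of those vertical rules. The helper name `split_line_at_rules` is hypothetical; it only relies on each character being a dict with the "c" and "bbox" keys that 'rawdict' provides, and it assumes horizontal left-to-right text.

```python
from bisect import bisect_right


def split_line_at_rules(chars, rule_xs):
    """Group the characters of one line by the column they fall into.

    chars   -- character dicts as found in span['chars'] of a 'rawdict' page
    rule_xs -- x-positions of the vertical rules (assumed to be taken from the
               thin vertical rectangles returned by page.get_drawings())
    Returns a dict mapping column index to the text in that column.
    """
    xs = sorted(rule_xs)
    columns = {}
    for ch in chars:
        x0, _, x1, _ = ch["bbox"]
        center = (x0 + x1) / 2
        # Number of rules to the left of the character's center = column index.
        col = bisect_right(xs, center)
        columns.setdefault(col, []).append(ch)
    return {
        col: "".join(c["c"] for c in sorted(members, key=lambda c: c["bbox"][0]))
        for col, members in sorted(columns.items())
    }
```

For each line, the characters could be gathered with `[c for span in line['spans'] for c in span['chars']]` before calling the helper.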
Thanks for the reminder! Yes, I am already familiar with page.find_tables(), but I am doing something else.
So if I understand you correctly, you mean that those three adjacent cells are treated as one cell because their text may share the same y-coordinate and meet the other criteria, right?
In the illustration below, just for testing purposes, I used Foxit Editor to select the text of the middle cell and shifted it vertically a little bit. When I then ran fitz on the result, the text was handled as three separate cells instead of one single cell.
Can I, for example, pass some parameter to get_text(), such as a distance tolerance between words, so that text is handled as another cell when the gap is greater than some number?
![Untitled](https://private-user-images.githubusercontent.com/33861815/339973021-4bf64b8f-a880-48da-9dd2-b807a647bbed.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTkyNzg2MTYsIm5iZiI6MTcxOTI3ODMxNiwicGF0aCI6Ii8zMzg2MTgxNS8zMzk5NzMwMjEtNGJmNjRiOGYtYTg4MC00OGRhLTlkZDItYjgwN2E2NDdiYmVkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjI1VDAxMTgzNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWYxN2I0ZWY2N2I4NTZkMTdkZThkNDFmMjRjYzRhNzk2YjNlYzExNmZkYmRmYzAyNDk5ZjhiODQ4OTgxY2JjNDUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.vS8o3mt4XW02KE3utcbzItvDLHBvsGIKXEhG6IO6Dok)
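
Independent of what get_text() itself offers, a gap tolerance like the one asked about can be applied after extraction. The following is a minimal sketch under stated assumptions: the characters come from span['chars'] of a 'rawdict' page (each a dict with "c" and "bbox"), the text runs horizontally from left to right, and the function name `split_chars_by_gap` and its `max_gap` parameter are hypothetical names chosen for this illustration.

```python
def split_chars_by_gap(chars, max_gap=5.0):
    """Split one line's characters into segments at large horizontal gaps.

    A new segment starts whenever the distance between the right edge of the
    previous character and the left edge of the next one exceeds max_gap.
    Returns a list of (text, (x0, y0, x1, y1)) tuples, one per segment.
    """
    segments = []
    current = []
    prev_x1 = None
    for ch in sorted(chars, key=lambda c: c["bbox"][0]):
        x0, _, x1, _ = ch["bbox"]
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            segments.append(current)
            current = []
        current.append(ch)
        prev_x1 = x1
    if current:
        segments.append(current)

    result = []
    for seg in segments:
        text = "".join(c["c"] for c in seg)
        # Union of the character boxes gives the segment's bounding box.
        bbox = (
            min(c["bbox"][0] for c in seg),
            min(c["bbox"][1] for c in seg),
            max(c["bbox"][2] for c in seg),
            max(c["bbox"][3] for c in seg),
        )
        result.append((text, bbox))
    return result
```

A suitable `max_gap` depends on the font size and column spacing of the document at hand, so it would need to be tuned per file; rotated text would need the comparison done along the line's writing direction instead of the x-axis.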