Describe the bug I have this PDF, table on the third page (Hungari

Multiple letters extracted on PDF table by using extract_text about pdfplumber HOT 5 CLOSED

ervinwirth commented on August 22, 2024

Multiple letters extracted on PDF table by using extract_text

from pdfplumber.

Comments (5)

jsvine commented on August 22, 2024

Great, thanks for checking/confirming. Not exactly an "error" (which I'd associate more with incorrect encodings) and more of a redundancy or quirk. Some PDFs (for reasons sometimes unknown), such as this one, write multiple instances of the same character. This is relatively rare, but common enough that it seemed worth adding the .dedupe_chars(...) method.

jsvine commented on August 22, 2024

Have you tried using page.dedupe_chars().extract_text()? Some details on the dedupe_chars(...) method here: https://github.com/jsvine/pdfplumber?tab=readme-ov-file#extracting-text

ervinwirth commented on August 22, 2024

Hmm, it solves the issue. Not sure how:
"Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within tolerance x/y) as other characters — removed."

Was there an error with the PDF?
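
To make the quoted description more concrete, here is a minimal sketch of the deduplication idea in plain Python. It is not pdfplumber's actual implementation: `dedupe_chars_sketch` and `make_char` are hypothetical helpers, and the dictionary keys (`text`, `fontname`, `size`, `x0`, `top`) are assumed to mirror the character attributes mentioned in the quote.

```python
def dedupe_chars_sketch(chars, tolerance=1.0):
    """Keep only the first of any characters that share text, fontname and
    size and lie within `tolerance` of each other in both x and y.

    Simplified illustration only; pdfplumber's real logic may differ.
    """
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and ch["fontname"] == k["fontname"]
            and ch["size"] == k["size"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


def make_char(text, x0, top, fontname="Helvetica", size=10.0):
    """Build a hypothetical character record for the demo below."""
    return {"text": text, "x0": x0, "top": top,
            "fontname": fontname, "size": size}


# "Hi" where the PDF wrote each glyph twice at (almost) the same position.
chars = [
    make_char("H", 10.0, 50.0),
    make_char("H", 10.2, 50.0),
    make_char("i", 17.0, 50.0),
    make_char("i", 17.1, 50.1),
]

raw_text = "".join(c["text"] for c in chars)
clean_text = "".join(c["text"] for c in dedupe_chars_sketch(chars))

print(raw_text)    # HHii
print(clean_text)  # Hi
```

A genuine double letter (for example the "ll" in "hello") survives this filter because the two glyphs sit at clearly different x positions. With pdfplumber itself, the equivalent step is the call suggested in this thread, `page.dedupe_chars().extract_text()`.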

ervinwirth commented on August 22, 2024

Thank you :), is it possible to ask you about another 'interesting' case?

It's more of a feature request, or something like that.

jsvine commented on August 22, 2024

Yes! Feel free to open a discussion or feature-request-tagged issue.
