Comments (12)
@edugonza Thank you for fixing this! The PR looked good! Thank you for adding a test too 👍
I'll start working on a release soon.
from camelot.
Hi guys, I sent a PR with a working solution to the issue. I added a unit test with the PDF file mentioned in the first comment.
from camelot.
I believe this occurs when bold characters are created by drawing duplicate characters instead of widening the character. I've noticed it often creates 4 copies of each, although in your example it is 2. I think this happens at the PDF level, because these bold characters don't differ from regular ones in font or other characteristics.
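To make the idea concrete, here is a minimal sketch of how duplicated glyphs could be spotted. It does not use pdfminer or camelot: `Glyph`, `collapse_duplicates` and the tolerance `tol` are hypothetical names chosen for illustration, and the 1-unit tolerance is an assumption that may not suit every PDF.

```python
from typing import List, NamedTuple, Tuple


class Glyph(NamedTuple):
    # Hypothetical stand-in for a PDF character: its text and lower-left corner.
    text: str
    x0: float
    y0: float


def collapse_duplicates(glyphs: List[Glyph], tol: float = 1.0) -> List[Tuple[Glyph, int]]:
    """Merge glyphs with the same text drawn within `tol` units of each other.

    Returns (glyph, copies) pairs; copies > 1 suggests "fake" bold.
    """
    kept: List[Tuple[Glyph, int]] = []
    for g in glyphs:
        for idx, (k, n) in enumerate(kept):
            same_spot = abs(g.x0 - k.x0) <= tol and abs(g.y0 - k.y0) <= tol
            if g.text == k.text and same_spot:
                kept[idx] = (k, n + 1)  # another copy of an already-seen glyph
                break
        else:
            kept.append((g, 1))  # first time this glyph is seen at this spot
    return kept


glyphs = [
    Glyph("T", 10.0, 50.0), Glyph("T", 10.3, 50.0),  # doubled "T"
    Glyph("o", 16.0, 50.0), Glyph("o", 16.3, 50.0),  # doubled "o"
    Glyph("x", 30.0, 50.0),                          # drawn once
]
result = collapse_duplicates(glyphs)
print("".join(g.text for g, _ in result))  # Tox
print([n for _, n in result])              # [2, 2, 1]
```

A copy count above 1 could then be used as a bold flag, since the font itself gives no hint.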
from camelot.
In addition, this is made worse by the fact that for some duplicates the LTHorizontal object splits the line into two, and for others it does not.
from camelot.
Yep, facing the same issue.
And yes, this only occurs with bold characters AFAIK.
Any workaround for this apart from fixing the PDFs?
from camelot.
There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.
from camelot.
Can you please guide me on how I would do that?
I'm a noob.
from camelot.
You need to change the source code, so this isn't a great task if you're not comfortable with programming.
Wherever you see horizontals = get_text_objects(ltype=LThorizontal), you can run the following code to delete the duplicate horizontals.
from pdfminer.layout import LTAnno

def text_bbox(line):
    # Bounding box of the visible (non-whitespace) characters in a text line.
    chars = [t for t in line if not isinstance(t, LTAnno) and t.get_text().strip()]
    return (min(t.x0 for t in chars), min(t.y0 for t in chars),
            max(t.x1 for t in chars), max(t.y1 for t in chars))

deletes = []
for i in horizontals:
    if i in deletes:
        continue
    for obj in horizontals:
        if obj is i:
            continue
        try:
            ix0, iy0, ix1, iy1 = text_bbox(i)
            ox0, oy0, ox1, oy1 = text_bbox(obj)
        except ValueError:
            # One of the lines has no visible characters; skip this pair.
            continue
        # obj lies inside i, allowing 1 unit of tolerance on every side.
        if ox0 > ix0 - 1 and oy0 > iy0 - 1 and ox1 < ix1 + 1 and oy1 < iy1 + 1:
            print('largest', i)
            print('delete', obj)
            deletes.append(obj)
            # Mark the surviving line and its characters as bold.
            i.customBold = True
            for char in i:
                char.customBold = True
horizontals = [obj for obj in horizontals if obj not in deletes]
If anyone notices cases that this does not cover, please let me know.
from camelot.
Thanks, I'll try this out and get back to you!
from camelot.
Sometimes text is stacked on top of each other intentionally; this doesn't account for that.
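One possible way to handle that case, sketched here under assumptions rather than taken from camelot, is to drop a contained line only when its text also appears in the line that contains it. `Line`, `contains` and `drop_duplicate_lines` are hypothetical names, and the bounding boxes are plain numbers instead of pdfminer objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    # Hypothetical stand-in for a horizontal text line and its bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer: Line, inner: Line, tol: float = 1.0) -> bool:
    """True if inner's box lies inside outer's box, within tol on each side."""
    return (inner.x0 > outer.x0 - tol and inner.y0 > outer.y0 - tol
            and inner.x1 < outer.x1 + tol and inner.y1 < outer.y1 + tol)


def drop_duplicate_lines(lines: List[Line], tol: float = 1.0) -> List[Line]:
    """Drop a line only if a kept line contains its box AND its text."""
    kept: List[Line] = []
    # Visit the longest text first so a full line survives over its fragments.
    for line in sorted(lines, key=lambda l: len(l.text.strip()), reverse=True):
        text = line.text.strip()
        is_dup = any(contains(k, line, tol) and text in k.text for k in kept)
        if not is_dup:
            kept.append(line)
    kept_ids = {id(k) for k in kept}
    return [l for l in lines if id(l) in kept_ids]  # restore original order


lines = [
    Line("Total", 10.0, 50.0, 40.0, 60.0),
    Line("Total", 10.3, 50.0, 40.3, 60.0),  # duplicate used for bold
    Line("Note", 12.0, 51.0, 38.0, 59.0),   # different text, stacked on purpose
]
kept = drop_duplicate_lines(lines)
print([l.text for l in kept])  # ['Total', 'Note']
```

The substring check also covers duplicates that were split into two shorter lines, since each fragment's text appears inside the full line. It would still wrongly drop intentionally stacked text that happens to repeat the same words.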
from camelot.
> There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.
Yes! Let me see if I can get this into the library. Would you like to raise a PR with a corresponding test with the example PDF?
> sometimes text is stacked on top of each other intentionally, this doesn't adjust for that
Yes.
from camelot.
Can't wait. Any idea when it will be released?
from camelot.