page.search("text", regex = True) is magnitudes slower in 0.10.4 compared to 0.10.3.

mikejokic commented on July 21, 2024

page.search("text", regex = True) is magnitudes slower in 0.10.4 compared to 0.10.3.

Comments (5)

jsvine commented on July 21, 2024

Hi @mikejokic. Thank you for flagging. Unfortunately, I can't seem to reproduce your findings. If anything, it runs slightly faster on 0.10.4 than 0.10.3 for me. Here's the exact code I'm running:

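import sys
import time

import pdfplumber

start = time.time()
# Read the PDF from stdin so the same script works for any file.
with pdfplumber.open(sys.stdin.buffer) as pdf:
    for page in pdf.pages:
        results = page.search(r"word", regex=True, return_chars=False)
        # Use page.close() when this version provides it; otherwise fall back to flush_cache().
        if hasattr(page, "close"):
            page.close()
        else:
            page.flush_cache()
elapsed = time.time() - start
print(round(elapsed, 3))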

I then run python test.py < documentation.pdf. On 0.10.3, I'm seeing times of around 7.9 seconds; on 0.10.4, I'm seeing closer to 7.6 seconds.

If you run the same, what do you see?

mikejokic commented on July 21, 2024

Thanks for the reply, @jsvine. I ran your code block in Docker and found results similar to yours. However, I have been able to reproduce my issue with the provided PDF.

Here is the code I ran in Docker, changing only the pdfplumber version number.

I look for a set of relevant keywords/regex patterns (the keywords are repeated here for simplicity) and also take the surrounding line. 0.10.3 runs in around 30-36 seconds, and 0.10.4 takes around 90-96 seconds.

keywords = ['capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS''capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 
'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS','capabilities','BHRS', 'Intervention', 'assessment','IMD', 'Affidavit', 'subclass', 'IHBS']


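import time

import pdfplumber

start = time.time()
with pdfplumber.open("documentation.pdf") as pdf:
    for page in pdf.pages:
        print(page, flush=True)
        # Run one case-insensitive search per keyword; each match is the full line around it.
        for key in keywords:
            results = page.search(
                r".*\b" + key + r"\b.*", regex=True, case=False, return_chars=False
            )
        # Use page.close() when this version provides it; otherwise fall back to flush_cache().
        if hasattr(page, "close"):
            page.close()
        else:
            page.flush_cache()
elapsed = time.time() - start
print(round(elapsed, 3))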
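
As an aside, the loop above issues one search per keyword per page. A minimal sketch of one possible alternative, assuming the keywords are plain literals, is to fold them into a single alternation pattern so that each page is searched once. The helper name build_line_pattern is hypothetical, and the sketch exercises the pattern with Python's re module on sample text rather than with pdfplumber itself; whether it helps in a real pipeline would need measuring.

```python
import re


def build_line_pattern(keywords):
    """Return one regex that matches any full line containing a keyword."""
    # Deduplicate while preserving order, and escape each literal keyword.
    unique = list(dict.fromkeys(keywords))
    alternation = "|".join(re.escape(k) for k in unique)
    return r".*\b(?:" + alternation + r")\b.*"


# Hypothetical sample data, including a repeated keyword.
keywords = ["capabilities", "BHRS", "Intervention", "BHRS"]
pattern = build_line_pattern(keywords)

sample_text = (
    "An initial review was made.\n"
    "The BHRS intervention began.\n"
    "No match here."
)
# Without re.DOTALL, "." stops at newlines, so each match is a single line.
matches = [m.group(0) for m in re.finditer(pattern, sample_text, flags=re.IGNORECASE)]
print(matches)
```

The resulting pattern string could presumably be passed to page.search(pattern, regex=True, case=False, return_chars=False) in place of the inner loop. One trade-off is that a single combined search no longer tells you directly which keyword produced each match.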

jsvine commented on July 21, 2024

Big thanks, @mikejokic — that extra detail about looping through a bunch of .search(...) calls per page helped me (a) reproduce your observation, (b) figure out what the problem was, and (c) fix it.

Turns out 0bfffc2 introduced a bug in which the page layout calculations (necessary for .search(...)) were no longer getting cached. The fix in efca277 resolves that, restoring the prior performance. It is now available on the develop branch and will be in the next release.

mikejokic commented on July 21, 2024

Thanks @jsvine. Out of curiosity, does .search() run .extract_text() on each run or is the text also cached?

jsvine commented on July 21, 2024

.search(...) uses the text-layout cache, which is based on the layout-dependent parameters you pass. E.g., if you run page.search("q1", x_tolerance=5) and page.search("q2", x_tolerance=5), then the .extract_text(...) is only run once, on the first search; but if you then call page.search("q2", x_tolerance=10), then .extract_text(...) is called again.
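
To make the caching behaviour described above concrete, here is a toy model. It is not pdfplumber's actual implementation: the class ToyPage, its _extract_layout method, and the layout_calls counter are hypothetical names invented for illustration. Only the idea, a cache keyed on the layout-dependent parameters, comes from the explanation above.

```python
import re


class ToyPage:
    """Toy stand-in for a page whose text layout is expensive to compute."""

    def __init__(self, text):
        self._text = text
        self._layout_cache = {}  # maps layout parameters -> computed layout
        self.layout_calls = 0    # counts how often the expensive step runs

    def _extract_layout(self, x_tolerance):
        # Placeholder for the expensive layout step; here it just returns the text.
        self.layout_calls += 1
        return self._text

    def search(self, pattern, x_tolerance=3):
        # The cache key holds only layout-dependent parameters, so different
        # patterns searched with the same tolerance share one computed layout.
        key = ("x_tolerance", x_tolerance)
        if key not in self._layout_cache:
            self._layout_cache[key] = self._extract_layout(x_tolerance)
        return re.findall(pattern, self._layout_cache[key])


page = ToyPage("q1 and q2 appear here")
page.search("q1", x_tolerance=5)   # computes the layout
page.search("q2", x_tolerance=5)   # reuses the cached layout
page.search("q2", x_tolerance=10)  # new tolerance, so the layout is recomputed
print(page.layout_calls)
```

If the cache lookup in this toy were removed so that every search called _extract_layout, the counter would grow with the number of searches, which mirrors the slowdown reported when many keywords are searched on each page.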
