nataliia-obraztsova commented on August 24, 2024

Adding fitz.TOOLS.store_shrink(100) after pix = None actually helped a lot. Here is a link to an older issue, which I missed at first: #130
I still see a gradual increase, so I'll leave the issue open for now.

from pymupdf.

JorjMcKie commented on August 24, 2024

Can you please provide printouts with numbers updated after the mentioned adjustments?

In general, if a permanently low memory footprint is desired (for whatever reason), shrink the store generously.
There are two reasons for this:

  1. MuPDF's strategy is to keep things in memory, especially objects that tend to be large, such as images and fonts.
  2. Deleting Python objects is only one side of the coin: the underlying C object in MuPDF is not necessarily removed as well in every case.
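
To make the two points above concrete, here is a minimal sketch of a render loop that first drops the Python references and then asks MuPDF to empty its store. The function name render_pages_to_png and its mupdf parameter are hypothetical (the parameter only exists so the loop can be exercised with a stand-in object), and Pixmap.tobytes("png") is used here in place of the PIL route shown elsewhere in this thread. How much memory actually goes back to the operating system will still depend on the allocator.

```python
def render_pages_to_png(pdf_bytes, mupdf=None):
    """Render every page to PNG bytes, shrinking the MuPDF store after each page."""
    if mupdf is None:
        # PyMuPDF, imported lazily; a stand-in object can be passed for testing.
        import fitz as mupdf

    images = []
    doc = mupdf.open(stream=pdf_bytes, filetype="pdf")
    try:
        for i in range(doc.page_count):
            page = doc.load_page(i)
            pix = page.get_pixmap()
            # Copy the encoded image out before releasing the pixmap.
            images.append(pix.tobytes("png"))
            # Drop the Python references first ...
            pix = None
            page = None
            # ... then ask MuPDF to free 100% of its store.
            mupdf.TOOLS.store_shrink(100)
    finally:
        doc.close()
        mupdf.TOOLS.store_shrink(100)
    return images
```

Keeping only the encoded bytes means no Pixmap or Page object outlives its loop iteration, so the store can be emptied once per page rather than once per document.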

nataliia-obraztsova commented on August 24, 2024

Below is the memory profile after the adjustments. Interestingly, while processing file f0, fitz.TOOLS.store_shrink(100) in line 47 seems to have made no difference: memory usage grew by only about 7 MiB, but it did not shrink back to the initial value. While processing file f1, fitz.TOOLS.store_shrink(100) in line 47 reduced memory usage considerably, but not all of it was released: another 20.12 MB remained after the function returned. After that, usage seems to plateau.

P.S. I have upgraded PyMuPDF to 1.24.7

memory profiling after adjustments

processing file f0

Memory usage before function: 53.28 MB

Line # Mem usage Increment Occurrences Line Contents

34     53.5 MiB     53.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     53.7 MiB      0.1 MiB           1       file_stream = read_file(file_name)
37     56.0 MiB      2.4 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     56.0 MiB      0.0 MiB           1       try:
39     56.0 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     67.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     67.4 MiB      0.2 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     67.4 MiB      7.0 MiB           3               pix = page.get_pixmap()
45     67.4 MiB      3.5 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     67.4 MiB      0.0 MiB           3               pix = None
47     67.4 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     67.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     67.4 MiB      0.6 MiB           3               img.save(img_byte_buff, format='JPEG')
51     67.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     67.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     67.4 MiB      0.0 MiB           1           doc.close()
60     67.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 60.41 MB
Memory usage difference total: 7.13 MB

processing file f1

Memory usage before function: 60.41 MB

Line # Mem usage Increment Occurrences Line Contents

34     60.4 MiB     60.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     65.7 MiB      5.2 MiB           1       file_stream = read_file(file_name)
37     65.7 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     65.7 MiB      0.0 MiB           1       try:
39     65.7 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    100.4 MiB    -70.7 MiB          33           for i in range(number_of_pages):
41    100.4 MiB    -56.1 MiB          32               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    145.3 MiB    194.0 MiB          32               pix = page.get_pixmap()
45    145.3 MiB   -289.6 MiB          32               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    145.3 MiB   -289.6 MiB          32               pix = None
47    100.4 MiB   -519.4 MiB          32               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49    100.4 MiB    -70.7 MiB          32               img_byte_buff = BytesIO()
50    100.4 MiB    -70.7 MiB          32               img.save(img_byte_buff, format='JPEG')
51    100.4 MiB    -70.7 MiB          32               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54    100.4 MiB    -70.7 MiB          32               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     85.8 MiB    -14.6 MiB           1           doc.close()
60     85.8 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 20.12 MB

processing file f2

Memory usage before function: 80.53 MB

Line # Mem usage Increment Occurrences Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB

processing file f3

Memory usage before function: 80.53 MB

Line # Mem usage Increment Occurrences Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB
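
As a quick cross-check of the figures above, the per-file growth can be recomputed from the reported "Memory usage before/after function" values. This is only arithmetic on the numbers printed in this comment; the helper name per_file_growth is made up for illustration.

```python
# "Memory usage before/after function" values (MB) as reported above for f0..f3.
before_first = 53.28
after = {"f0": 60.41, "f1": 80.53, "f2": 80.53, "f3": 80.53}

def per_file_growth(start, after_by_file):
    """Return the growth in MB attributed to each file, in processing order."""
    growth = {}
    previous = start
    for name, value in after_by_file.items():
        growth[name] = round(value - previous, 2)
        previous = value
    return growth

growth = per_file_growth(before_first, after)
total = round(sum(growth.values()), 2)
print(growth)
print(total)
```

The growth for f0 and f1 matches the reported "difference total" lines (7.13 MB and 20.12 MB), and f2 and f3 add nothing, which is consistent with a plateau rather than unbounded growth over these four files.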

JorjMcKie commented on August 24, 2024

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?

I'm looking for this PDF. I'll share it once I find it.

Your "issue" seems to come down to the fact that Page.get_image_info() places no restriction on the image bboxes: it includes any image, even one that only intersects the MediaBox without being fully contained in it.
Text extraction, by contrast, restricts its results (text or images) to objects contained in the MediaBox.
If you need the two results to agree for whatever reason, make sure to set the clip to the same value in both cases.
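
One way to get matching results, sketched under the assumption that each image entry is a dict with a "bbox" key holding (x0, y0, x1, y1), as in the list returned by Page.get_image_info(): filter the entries down to those fully contained in the clip rectangle you also use for text extraction. The helper names bbox_inside and images_within are hypothetical.

```python
def bbox_inside(bbox, clip):
    """True if bbox (x0, y0, x1, y1) lies fully inside clip (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    cx0, cy0, cx1, cy1 = clip
    return x0 >= cx0 and y0 >= cy0 and x1 <= cx1 and y1 <= cy1

def images_within(image_infos, clip):
    """Keep only the image entries whose bbox is fully contained in clip."""
    return [info for info in image_infos if bbox_inside(info["bbox"], clip)]

# Hypothetical usage with PyMuPDF (not executed here):
#   clip = tuple(page.rect)
#   visible = images_within(page.get_image_info(), clip)
```

Images that merely touch or overlap the clip edge are dropped by this filter, which is the stricter of the two behaviors described above.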

yoliax commented on August 24, 2024

I encountered the same issue! Memory leak!
I wrote a service that uses PyMuPDF to parse PDFs. Despite calling fitz.TOOLS.store_shrink(100) every time, the service crashes because of a memory leak after running for a while.

try:
    with fitz.Document(stream=data, filetype="pdf") as doc:
        ...
except Exception as e:
    logging...
finally:
    fitz.TOOLS.store_shrink(100)
    gc.collect()

other code:

zoom_x = request.imgsz / page_width
zoom_y = request.imgsz  / page_height
zoom = min(zoom_x, zoom_y)

mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, colorspace="rgb", alpha=False)
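
For reference, the zoom calculation above scales the page so that its longer side fits the target size. The sketch below repeats that calculation as a plain function and adds a rough estimate of the resulting pixmap size, assuming 3 bytes per pixel (RGB, no alpha). The names fit_zoom, estimated_rgb_bytes and target_size are made up for illustration (target_size stands in for request.imgsz), and the pixel dimensions MuPDF actually produces may differ slightly because of rounding.

```python
def fit_zoom(page_width, page_height, target_size):
    """Uniform zoom factor that fits the page's longer side into target_size."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return min(target_size / page_width, target_size / page_height)

def estimated_rgb_bytes(page_width, page_height, zoom):
    """Approximate size of the pixel buffer of an RGB pixmap without alpha."""
    width = round(page_width * zoom)
    height = round(page_height * zoom)
    return width * height * 3

zoom = fit_zoom(612, 792, 792)                 # US Letter page, target of 792 pixels
print(zoom, estimated_rgb_bytes(612, 792, zoom))
```

An estimate like this helps judge whether the memory held per page is plausible for the chosen target size before suspecting a leak.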

yoliax commented on August 24, 2024

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?

I'm looking for this PDF. I'll share it once I find it.

JorjMcKie commented on August 24, 2024

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?

I'm looking for this PDF. I'll share it once I find it.

Please do not mix different things in the same report!
If you find that example please open a separate issue.

yoliax commented on August 24, 2024

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
I'm looking for this PDF. I'll share it once I find it.

Your "issue" seems to come down to the fact that Page.get_image_info() places no restriction on the image bboxes: it includes any image, even one that only intersects the MediaBox without being fully contained in it. Text extraction, by contrast, restricts its results (text or images) to objects contained in the MediaBox. If you need the two results to agree for whatever reason, make sure to set the clip to the same value in both cases.

Thank you very much, I will give it a try.
