Description of the bug: When processing larger PDF files the page.g

Memory Retention with fitz.page.get_pixmap() (pymupdf issue, open)

nataliia-obraztsova commented on August 24, 2024

Memory Retention with fitz.page.get_pixmap()

from pymupdf.

Comments (8)

nataliia-obraztsova commented on August 24, 2024

Adding fitz.TOOLS.store_shrink(100) after pix = None actually helped a lot. Here is a link to an older issue that I missed at first: #130.
I still see some gradual increase, so I'll leave the issue open for now.
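The adjustment described above might look roughly like the following sketch. It is not the poster's code: `render_pages_base64` and its injectable `shrink` parameter are hypothetical names introduced here so the loop can be tried without PyMuPDF installed, and the PNG output via `Pixmap.tobytes()` is a simplification of the Pillow/JPEG route shown in the profiled function. Only `page_count`, `load_page()`, `get_pixmap()`, `tobytes()` and `fitz.TOOLS.store_shrink()` are PyMuPDF names.

```python
import base64


def render_pages_base64(doc, shrink=None):
    """Render every page of an open document to base64-encoded PNG strings.

    `doc` is expected to behave like a fitz.Document. `shrink` is called with
    100 after each page; by default it is fitz.TOOLS.store_shrink.
    """
    if shrink is None:
        import fitz  # PyMuPDF, imported lazily so the helper has no hard dependency

        shrink = fitz.TOOLS.store_shrink

    images = []
    for i in range(doc.page_count):
        page = doc.load_page(i)
        pix = page.get_pixmap()
        png_bytes = pix.tobytes("png")
        # Drop the Python references first, then ask MuPDF to empty its store.
        pix = None
        page = None
        shrink(100)
        images.append(base64.b64encode(png_bytes).decode("utf-8"))
    return images
```

With PyMuPDF available, the function could be called on a document opened with `fitz.open(stream=file_stream, filetype="pdf")`, followed by `doc.close()` as in the profiled code.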

JorjMcKie commented on August 24, 2024

Can you please provide printouts with numbers updated after the mentioned adjustments?

In general, if a permanently low memory footprint is desired (for whatever reason), store shrinking should be used generously.
There are a number of reasons for this:

- MuPDF's strategy is to keep things in memory, especially objects that tend to be large, such as images and fonts.
- Deleting Python objects is only one side of the coin: the shadowing C object in MuPDF is not necessarily removed as well in each case.
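The effect of store shrinking can be observed through the store counters. The sketch below assumes the `store_size`, `store_maxsize` and `store_shrink()` members of `fitz.TOOLS` as described in the PyMuPDF documentation; `shrink_store` is a hypothetical helper name, and it takes the tools object as a parameter so that it can also be tried with a stand-in object.

```python
def shrink_store(tools, percent=100):
    """Shrink MuPDF's store and report its size before and after, in bytes.

    `tools` is expected to behave like fitz.TOOLS.
    """
    before = tools.store_size
    after = tools.store_shrink(percent)  # documented to return the new store size
    return {"before": before, "after": after, "max": tools.store_maxsize}
```

With PyMuPDF installed this would be called as `shrink_store(fitz.TOOLS)`. If the store reports a small size while the process footprint stays high, the remaining memory is probably held elsewhere, for example by Python objects or by the allocator, which may not return freed memory to the operating system.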

nataliia-obraztsova commented on August 24, 2024

Below you can see the memory profiling after the adjustments. Interestingly, while processing file f0, fitz.TOOLS.store_shrink(100) in line 47 seems to have made no difference: memory usage increased by only about 7 MiB, but it did not shrink back to the initial value. While processing file f1, fitz.TOOLS.store_shrink(100) in line 47 reduced memory usage a lot, but still not all of it: an additional 20.12 MB remained. After that, usage seems to plateau.

P.S. I have upgraded PyMuPDF to 1.24.7
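The thread does not show how the "Memory usage before function" and "after function" figures in the printouts below were collected. One way such figures could be produced with the standard library alone is sketched here. Everything in it is an assumption: the helper names `rss_mb` and `measure` are hypothetical, the division by 1024 * 1024 is a guess at the unit, and reading `/proc/self/statm` works on Linux only.

```python
import os


def rss_mb():
    """Return the current resident set size of this process (Linux only)."""
    with open("/proc/self/statm") as fh:
        resident_pages = int(fh.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def measure(func, *args, **kwargs):
    """Call func and print resident memory before and after the call."""
    before = rss_mb()
    result = func(*args, **kwargs)
    after = rss_mb()
    print(f"Memory usage before function: {before:.2f} MB")
    print(f"Memory usage after function: {after:.2f} MB")
    print(f"Memory usage difference total: {after - before:.2f} MB")
    return result, after - before
```

A call such as `measure(render_page_to_image, file_name)` would wrap the profiled function. Resident set size covers the whole process, so it includes memory held by MuPDF's C layer as well as by Python objects.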

memory profiling after adjustments

processing file f0

Memory usage before function: 53.28 MB

Line #    Mem usage    Increment  Occurrences   Line Contents

34     53.5 MiB     53.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     53.7 MiB      0.1 MiB           1       file_stream = read_file(file_name)
37     56.0 MiB      2.4 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     56.0 MiB      0.0 MiB           1       try:
39     56.0 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     67.4 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     67.4 MiB      0.2 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     67.4 MiB      7.0 MiB           3               pix = page.get_pixmap()
45     67.4 MiB      3.5 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     67.4 MiB      0.0 MiB           3               pix = None
47     67.4 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     67.4 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     67.4 MiB      0.6 MiB           3               img.save(img_byte_buff, format='JPEG')
51     67.4 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     67.4 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     67.4 MiB      0.0 MiB           1           doc.close()
60     67.4 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 60.41 MB
Memory usage difference total: 7.13 MB

processing file f1

Memory usage before function: 60.41 MB

Line #    Mem usage    Increment  Occurrences   Line Contents

34     60.4 MiB     60.4 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     65.7 MiB      5.2 MiB           1       file_stream = read_file(file_name)
37     65.7 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     65.7 MiB      0.0 MiB           1       try:
39     65.7 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40    100.4 MiB    -70.7 MiB          33           for i in range(number_of_pages):
41    100.4 MiB    -56.1 MiB          32               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44    145.3 MiB    194.0 MiB          32               pix = page.get_pixmap()
45    145.3 MiB   -289.6 MiB          32               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46    145.3 MiB   -289.6 MiB          32               pix = None
47    100.4 MiB   -519.4 MiB          32               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49    100.4 MiB    -70.7 MiB          32               img_byte_buff = BytesIO()
50    100.4 MiB    -70.7 MiB          32               img.save(img_byte_buff, format='JPEG')
51    100.4 MiB    -70.7 MiB          32               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54    100.4 MiB    -70.7 MiB          32               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     85.8 MiB    -14.6 MiB           1           doc.close()
60     85.8 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 20.12 MB

processing file f2

Memory usage before function: 80.53 MB

Line #    Mem usage    Increment  Occurrences   Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB

processing file f3

Memory usage before function: 80.53 MB

Line #    Mem usage    Increment  Occurrences   Line Contents

34     80.5 MiB     80.5 MiB           1   @profile
35                                         def render_page_to_image(file_name):
36     80.5 MiB      0.0 MiB           1       file_stream = read_file(file_name)
37     80.5 MiB      0.0 MiB           1       doc = fitz.open(stream=file_stream, filetype="pdf")
38     80.5 MiB      0.0 MiB           1       try:
39     80.5 MiB      0.0 MiB           1           number_of_pages = doc.page_count
40     80.5 MiB      0.0 MiB           4           for i in range(number_of_pages):
41     80.5 MiB      0.0 MiB           3               page = doc.load_page(i)
42                                         
43                                                     # Render the page to a pixmap (an image)
44     80.5 MiB      0.0 MiB           3               pix = page.get_pixmap()
45     80.5 MiB      0.0 MiB           3               img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
46     80.5 MiB      0.0 MiB           3               pix = None
47     80.5 MiB      0.0 MiB           3               fitz.TOOLS.store_shrink(100)
48                                                     # Convert the PIL Image to a bytes-like object
49     80.5 MiB      0.0 MiB           3               img_byte_buff = BytesIO()
50     80.5 MiB      0.0 MiB           3               img.save(img_byte_buff, format='JPEG')
51     80.5 MiB      0.0 MiB           3               img_byte_arr = img_byte_buff.getvalue()
52                                         
53                                                     # Encode the image bytes in base64 and decode to UTF-8 string
54     80.5 MiB      0.0 MiB           3               rendered_image = base64.b64encode(img_byte_arr).decode('utf-8')
55                                         
56                                             except Exception as e:
57                                                 raise Exception(e.args)
58                                             finally:
59     80.5 MiB      0.0 MiB           1           doc.close()
60     80.5 MiB      0.0 MiB           1           fitz.TOOLS.store_shrink(100)

Memory usage after function: 80.53 MB
Memory usage difference total: 0.00 MB
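As a quick check of the totals, the per-file growth can be recomputed from the "before" and "after" figures printed above. The readings are copied from the printouts; the function name `per_file_growth` is only illustrative.

```python
# Resident memory in MB before f0 and after each file, copied from the printouts.
readings = [
    ("start", 53.28),
    ("f0", 60.41),
    ("f1", 80.53),
    ("f2", 80.53),
    ("f3", 80.53),
]


def per_file_growth(readings):
    """Return {label: growth in MB} between consecutive readings."""
    growth = {}
    for (_, previous), (label, current) in zip(readings, readings[1:]):
        growth[label] = round(current - previous, 2)
    return growth


growth = per_file_growth(readings)
total = round(readings[-1][1] - readings[0][1], 2)
print(growth)
print(total)
```

The growth comes out as 7.13 MB for f0, 20.12 MB for f1 and 0.0 MB for f2 and f3, for a total of 27.25 MB retained over the four files. This matches the plateau described at the top of this comment.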

JorjMcKie commented on August 24, 2024

> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
>
> I'm looking for this PDF. I'll share it once I find it.

It seems that your "issue" comes from the fact that Page.get_image_info() places no restriction on the image bboxes and will include any image, even one that only intersects the MediaBox (and is not fully contained in it).
Text extractions, in contrast, restrict their results (text or image) to objects contained in the MediaBox.
If for whatever reason you need the two results to coincide, make sure to set the clip to the same value in both cases.
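One way to bring the image list in line with what text extraction reports is to filter the image entries afterwards. The sketch below assumes that each entry is a dict with a "bbox" key holding (x0, y0, x1, y1), which is the shape documented for the items returned by Page.get_image_info(). The helper name `images_within` is hypothetical and not part of PyMuPDF.

```python
def images_within(image_infos, clip):
    """Keep only the entries whose bbox lies entirely inside clip.

    image_infos: list of dicts with a "bbox" key holding (x0, y0, x1, y1).
    clip: (x0, y0, x1, y1), for example tuple(page.rect).
    """
    cx0, cy0, cx1, cy1 = clip
    kept = []
    for info in image_infos:
        x0, y0, x1, y1 = info["bbox"]
        if x0 >= cx0 and y0 >= cy0 and x1 <= cx1 and y1 <= cy1:
            kept.append(info)
    return kept
```

With PyMuPDF this could be called as `images_within(page.get_image_info(), tuple(page.rect))`. Entries that only partly overlap the page, or lie outside it altogether, are dropped.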

yoliax commented on August 24, 2024

I encountered the same issue: a memory leak!
I wrote a service using PyMuPDF to parse PDFs. Despite calling fitz.TOOLS.store_shrink(100) each time, the service crashes due to a memory leak after running for a while.

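import gc
import logging

import fitz  # PyMuPDF

try:
    with fitz.Document(stream=data, filetype="pdf") as doc:
        ...  # per-page processing (elided in the original post)
except Exception:
    # The original logging call was elided; this message is a placeholder.
    logging.exception("PDF processing failed")
finally:
    # Empty MuPDF's store, then run Python's garbage collector.
    fitz.TOOLS.store_shrink(100)
    gc.collect()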

other code:

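# Uniform zoom so that the longer page side is about request.imgsz pixels
zoom_x = request.imgsz / page_width
zoom_y = request.imgsz / page_height
zoom = min(zoom_x, zoom_y)

# Render as RGB without an alpha channel
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, colorspace="rgb", alpha=False)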
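To get a feeling for how much memory each rendered page needs at a given zoom, the pixmap size can be estimated from the page dimensions. This is a rough sketch: `fit_zoom` and `estimated_rgb_bytes` are hypothetical helper names, and the estimate assumes 3 bytes per pixel (RGB, no alpha) with simple rounding, so the exact pixmap size in PyMuPDF may differ slightly.

```python
def fit_zoom(page_width, page_height, target_px):
    """Uniform zoom factor that makes the longer page side about target_px pixels."""
    return min(target_px / page_width, target_px / page_height)


def estimated_rgb_bytes(page_width, page_height, zoom):
    """Approximate sample size of an RGB pixmap without alpha: width * height * 3."""
    width = round(page_width * zoom)
    height = round(page_height * zoom)
    return width * height * 3


# Example: a US Letter page (612 x 792 points) rendered to fit 1024 pixels.
zoom = fit_zoom(612, 792, 1024)
size = estimated_rgb_bytes(612, 792, zoom)
```

For this example the zoom is about 1.29 and the estimate is 791 * 1024 * 3 = 2,429,952 bytes, so roughly 2.4 million bytes of samples per page before any copies made by later conversion steps.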

yoliax commented on August 24, 2024

Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?

I'm looking for this PDF. I'll share it once I find it.

JorjMcKie commented on August 24, 2024

> Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
>
> I'm looking for this PDF. I'll share it once I find it.

Please do not mix different things in the same report!
If you find that example, please open a separate issue.

yoliax commented on August 24, 2024

> > Another issue: why does calling the page.get_image_rects function return a large number of images (over 40,000), when there are no visible images on that PDF page?
> > I'm looking for this PDF. I'll share it once I find it.
>
> It seems that your "issue" comes from the fact that Page.get_image_info() places no restriction on the image bboxes and will include any image, even one that only intersects the MediaBox (and is not fully contained in it). Text extractions, in contrast, restrict their results (text or image) to objects contained in the MediaBox. If for whatever reason you need the two results to coincide, make sure to set the clip to the same value in both cases.

Thank you very much, I will give it a try.
