I'm trying to convert pdf to png, do some stuff with image and then upload result to s

Memory leaking with pdf about pyvips HOT 21 OPEN

kar-pev commented on July 19, 2024

Memory leaking with pdf

from pyvips.

Comments (21)

jcupitt commented on July 19, 2024

Hi @kar-pev,

You'll need to make a complete test program and a sample PDF before I can reproduce your problem.

I tried here with this benchmark:

#!/usr/bin/python3

import sys
import pyvips

image = pyvips.Image.pdfload(sys.argv[1])
for i in range(image.get("n-pages")):
    # load to fill a 2048 x 2048 box
    page = pyvips.Image.thumbnail(sys.argv[1] + f"[page={i}]", 2048)[:3]
    page = page.gravity("centre", 2048, 2048, extend="white")
    data = page.write_to_buffer(".png")
    print(f"rendered page {i} as {len(data)} bytes of PNG")

Don't use thumbnail_image, it's only for emergencies, instead load directly with thumbnail. It'll calculate a DPI for you and render at the correct size immediately.

With this PDF I see:

$ /usr/bin/time -f %M:%e ./try297.py ~/pics/nipguide.pdf 
rendered page 0 as 60533 bytes of PNG
rendered page 1 as 17417 bytes of PNG
rendered page 2 as 170833 bytes of PNG
rendered page 3 as 208890 bytes of PNG
rendered page 4 as 162163 bytes of PNG
rendered page 5 as 17625 bytes of PNG
rendered page 6 as 70013 bytes of PNG
rendered page 7 as 17891 bytes of PNG
rendered page 8 as 200018 bytes of PNG
rendered page 9 as 27530 bytes of PNG
rendered page 10 as 881459 bytes of PNG
rendered page 11 as 857038 bytes of PNG
rendered page 12 as 658160 bytes of PNG
rendered page 13 as 726185 bytes of PNG
rendered page 14 as 851090 bytes of PNG
rendered page 15 as 494613 bytes of PNG
rendered page 16 as 496943 bytes of PNG
rendered page 17 as 471977 bytes of PNG
rendered page 18 as 387558 bytes of PNG
rendered page 19 as 581401 bytes of PNG
rendered page 20 as 433491 bytes of PNG
rendered page 21 as 499961 bytes of PNG
rendered page 22 as 557588 bytes of PNG
rendered page 23 as 30114 bytes of PNG
rendered page 24 as 413044 bytes of PNG
rendered page 25 as 548221 bytes of PNG
rendered page 26 as 618224 bytes of PNG
rendered page 27 as 470931 bytes of PNG
rendered page 28 as 503389 bytes of PNG
rendered page 29 as 534815 bytes of PNG
rendered page 30 as 410243 bytes of PNG
rendered page 31 as 112163 bytes of PNG
rendered page 32 as 384436 bytes of PNG
rendered page 33 as 443909 bytes of PNG
rendered page 34 as 490428 bytes of PNG
rendered page 35 as 450758 bytes of PNG
rendered page 36 as 450702 bytes of PNG
rendered page 37 as 316399 bytes of PNG
rendered page 38 as 354259 bytes of PNG
rendered page 39 as 387184 bytes of PNG
rendered page 40 as 254847 bytes of PNG
rendered page 41 as 426819 bytes of PNG
rendered page 42 as 186205 bytes of PNG
rendered page 43 as 400066 bytes of PNG
rendered page 44 as 380871 bytes of PNG
rendered page 45 as 388221 bytes of PNG
rendered page 46 as 277809 bytes of PNG
rendered page 47 as 399227 bytes of PNG
rendered page 48 as 212261 bytes of PNG
rendered page 49 as 366544 bytes of PNG
rendered page 50 as 467029 bytes of PNG
rendered page 51 as 518713 bytes of PNG
rendered page 52 as 420412 bytes of PNG
rendered page 53 as 206056 bytes of PNG
rendered page 54 as 86764 bytes of PNG
rendered page 55 as 27297 bytes of PNG
rendered page 56 as 379708 bytes of PNG
rendered page 57 as 164827 bytes of PNG
528840:8.94

So 58 pages in 9s, with a peak memory use of about 530 MB (the %M field of /usr/bin/time is in kilobytes).

Rather than saving to a buffer and then uploading, you can upload directly to S3 with page.write_to_target(), there's some sample code in examples/. It might save a little time and memory.
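A minimal sketch of the write_to_target() idea, assuming a custom target whose write callback passes each encoded chunk to an uploader. ChunkSink and render_page_to_sink are hypothetical names, and the sink only collects chunks in memory as a stand-in for a real S3 multipart upload. The sample code in examples/ is the authoritative version.

```python
class ChunkSink:
    """Hypothetical stand-in for an uploader: receives PNG bytes chunk by chunk."""

    def __init__(self):
        self.chunks = []
        self.total = 0

    def write(self, chunk):
        # copy the chunk, since the underlying buffer belongs to libvips
        data = bytes(chunk)
        self.chunks.append(data)
        self.total += len(data)
        # the write handler reports how many bytes it consumed
        return len(data)


def render_page_to_sink(filename, page, sink, size=2048):
    """Render one PDF page and stream the PNG into sink instead of a buffer."""
    # imported here so ChunkSink can be exercised without libvips installed
    import pyvips

    target = pyvips.TargetCustom()
    target.on_write(sink.write)

    # same pipeline as the benchmark above: load at size, drop alpha, pad
    image = pyvips.Image.thumbnail(f"{filename}[page={page}]", size)[:3]
    image = image.gravity("centre", size, size, extend="white")
    image.write_to_target(target, ".png")
    return sink.total
```

In a real uploader, write() would forward each chunk to the storage client instead of keeping it, so the whole PNG never has to sit in memory at once.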

kar-pev commented on July 19, 2024

Thanks, maybe the problem was also that I wasn't using pdfload. I'll try it myself and report back soon.

jcupitt commented on July 19, 2024

pdfload won't make any difference, I was just trying to be clear. If you use new_from_file it'll work for any multi-page format, eg. GIF etc.

kar-pev commented on July 19, 2024

Actually, about thumbnail, I see in the docs: static thumbnail(filename, ...).
It takes a filename, so I'd need to have the image in memory and provide its path, but with thumbnail_image I could use an object. Is there a way to use an object with the regular thumbnail?

jcupitt commented on July 19, 2024

There's thumbnail_buffer() to thumbnail image data held as a string-like object, if that's what you mean.

jcupitt commented on July 19, 2024

Don't resize, just load at the correct size with thumbnail, then do any padding with gravity. You'll get better quality, lower memory use, and it'll be quicker too.

kar-pev commented on July 19, 2024

I've tried it and it works: max memory usage is program size + PDF size, which seems awesome. The issue was probably with trying to keep each page as an object in memory. Speed increased too, even without writing to a target.
Thanks a lot

jcupitt commented on July 19, 2024

Great!

(tangent, but it wasn't a memory leak -- that implies a memory reference has been lost due to a bug -- you were just seeing unexpectedly high memuse due to your program design)

kar-pev commented on July 19, 2024

Sorry, I hadn't tried this earlier, but I ran your code in a container and saw 2+ GB of memory use. I think it could be a Docker daemon problem, but I'm not sure, and I have no idea how to fix it from pyvips.
Memory use grows exactly at the write_to_buffer call; without it, memory use is fine for my task.

jcupitt commented on July 19, 2024

Maybe your container isn't using poppler to load PDFs, but is falling back to imagemagick? But that's just a guess, you need to share complete examples I can try before I can help.

kar-pev commented on July 19, 2024

docker-compose.yaml file:

services:
  servise_name:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: name
    image: name

Dockerfile:

FROM python:3.9-slim

RUN apt update && apt install -y \
    libmupdf-dev \
    libfreetype6-dev \
    libjpeg-dev \
    libglib2.0-0 \
    libgl1-mesa-glx \
    libpq-dev \
    libvips-dev --no-install-recommends \
    poppler-utils \
    gcc

WORKDIR /app

COPY . .

RUN python3 -m pip install --no-cache-dir -r requirements.txt

CMD ["python3", "main.py"]

It's enough to have only pyvips in requirements.txt for this example.

main.py file:

import pyvips


def main():
    image = pyvips.Image.pdfload("file.pdf")
    for i in range(image.get("n-pages")):
        # load to fill a 2048 x 2048 box
        page = pyvips.Image.thumbnail("file.pdf" + f"[page={i}]", 2048)[:3]
        page = page.gravity("centre", 2048, 2048, extend="white")
        data = page.write_to_buffer(".png")
        print(f"rendered page {i} as {len(data)} bytes of PNG")


if __name__ == "__main__":
    main()

That's all I'm using to get 2 GB+ of memory use (plus the PDF file).

jcupitt commented on July 19, 2024

There's a lot of stuff you don't need in that dockerfile, I'd just have:

FROM python:3.9-slim

RUN apt-get update \
  && apt-get install -y \
        build-essential \
        pkg-config 

RUN apt-get -y install --no-install-recommends libvips-dev
    
RUN pip install pyvips

I made a test dir:

https://github.com/jcupitt/docker-builds/tree/master/pyvips-python3.9

If I run:

docker build -t pyvips-python3.9 .
docker run -it --rm -v $PWD:/data pyvips-python3.9 ./main.py nipguide.pdf

I see a fairly steady c. 400mb in top.

kar-pev commented on July 19, 2024

I copied your code (with CMD ["python", "main.py"] at the end of the Dockerfile) and I'm getting 1.5 GB+.
Could you please try with my file?

jcupitt commented on July 19, 2024

Yes, I see about 1.5g with that file too. It has huge image overlays on every page, so I think that's to be expected. It's just a very heavy PDF.

kar-pev commented on July 19, 2024

But the images plus the PDF use much less memory than that. Why is the rest of the memory still reserved and not freed? It looks like some cache that hasn't been cleared, because memory use stays at ~1 GB after processing.

jcupitt commented on July 19, 2024

I would guess it's memory fragmentation from handling those huge overlays.

jcupitt commented on July 19, 2024

How are you measuring memory use? RES in top is probably the most useful.
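For comparison with top, here is a small sketch that reports memory from inside the Python process itself, using only the standard library. The function names are made up for illustration. It assumes Linux for the current-RSS reading, which comes from /proc/self/status and corresponds to the RES column in top.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux but in bytes on macOS
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def current_rss_mb():
    """Current resident set size in megabytes, or None if /proc is unavailable."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    # the value is reported in kB
                    return int(line.split()[1]) / 1024
    except OSError:
        pass
    return None
```

Printing current_rss_mb() after each page in the loop would show whether memory keeps climbing or levels off, and peak_rss_mb() at exit gives the same figure as the %M field of /usr/bin/time.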

kar-pev commented on July 19, 2024

I'm using docker stats.
Could I clear or free some of this allocated memory? I'm trying to reduce memory limits for container

jcupitt commented on July 19, 2024

I guess you could use the cli instead:

#!/bin/bash

pdf=$1
n_pages=$(vipsheader -f n-pages "$pdf")

for ((i=0; i < n_pages; i++)); do 
  echo "processing page $i ..."
  # quote the filename so the [page=N] suffix is not treated as a glob
  vipsthumbnail "$pdf[page=$i]" --size 2048 -o t1.v
  vips extract_band t1.v t2.v 0 --n 3
  vips gravity t2.v "page-$i.png" centre 2048 2048 --extend white
done

rm t1.v t2.v

It's a bit slower though, and you'll see a lot of block IO.

jcupitt commented on July 19, 2024

Another option would be to use a malloc that avoids fragmentation, like jemalloc.

https://jemalloc.net/

But that's harder to set up.
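One common way to try jemalloc without rebuilding anything is to preload it with LD_PRELOAD. The sketch below is a hypothetical launcher, not something from this thread: the library path is an assumption (it varies by distro and architecture, and the package has to be installed first), and the function names are invented for illustration.

```python
import os

# Assumed location of the jemalloc shared library; adjust for your image
JEMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"


def jemalloc_env(base_env, lib_path=JEMALLOC_PATH):
    """Return a copy of base_env with lib_path prepended to LD_PRELOAD."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def exec_with_jemalloc(argv, lib_path=JEMALLOC_PATH):
    """Replace this process with argv, preloading jemalloc if it is installed."""
    if os.path.exists(lib_path):
        env = jemalloc_env(os.environ, lib_path)
    else:
        # fall back to the default allocator rather than failing to start
        env = dict(os.environ)
    os.execvpe(argv[0], argv, env)
```

A wrapper script could then call exec_with_jemalloc(["python3", "main.py"]). Whether this actually lowers the resident size for a given PDF is something to measure rather than assume.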

kar-pev commented on July 19, 2024

Ok, thanks, I'll try one of those
