Comments (21)
Hi @kar-pev,
You'll need to make a complete test program and a sample PDF before I can reproduce your problem.
I tried here with this benchmark:
#!/usr/bin/python3
import sys
import pyvips
image = pyvips.Image.pdfload(sys.argv[1])
for i in range(image.get("n-pages")):
    # load to fill a 2048 x 2048 box
    page = pyvips.Image.thumbnail(sys.argv[1] + f"[page={i}]", 2048)[:3]
    page = page.gravity("centre", 2048, 2048, extend="white")
    data = page.write_to_buffer(".png")
    print(f"rendered page {i} as {len(data)} bytes of PNG")
Don't use thumbnail_image, it's only for emergencies. Instead, load directly with thumbnail. It'll calculate a DPI for you and render at the correct size immediately.
With this PDF I see:
$ /usr/bin/time -f %M:%e ./try297.py ~/pics/nipguide.pdf
rendered page 0 as 60533 bytes of PNG
rendered page 1 as 17417 bytes of PNG
rendered page 2 as 170833 bytes of PNG
rendered page 3 as 208890 bytes of PNG
rendered page 4 as 162163 bytes of PNG
rendered page 5 as 17625 bytes of PNG
rendered page 6 as 70013 bytes of PNG
rendered page 7 as 17891 bytes of PNG
rendered page 8 as 200018 bytes of PNG
rendered page 9 as 27530 bytes of PNG
rendered page 10 as 881459 bytes of PNG
rendered page 11 as 857038 bytes of PNG
rendered page 12 as 658160 bytes of PNG
rendered page 13 as 726185 bytes of PNG
rendered page 14 as 851090 bytes of PNG
rendered page 15 as 494613 bytes of PNG
rendered page 16 as 496943 bytes of PNG
rendered page 17 as 471977 bytes of PNG
rendered page 18 as 387558 bytes of PNG
rendered page 19 as 581401 bytes of PNG
rendered page 20 as 433491 bytes of PNG
rendered page 21 as 499961 bytes of PNG
rendered page 22 as 557588 bytes of PNG
rendered page 23 as 30114 bytes of PNG
rendered page 24 as 413044 bytes of PNG
rendered page 25 as 548221 bytes of PNG
rendered page 26 as 618224 bytes of PNG
rendered page 27 as 470931 bytes of PNG
rendered page 28 as 503389 bytes of PNG
rendered page 29 as 534815 bytes of PNG
rendered page 30 as 410243 bytes of PNG
rendered page 31 as 112163 bytes of PNG
rendered page 32 as 384436 bytes of PNG
rendered page 33 as 443909 bytes of PNG
rendered page 34 as 490428 bytes of PNG
rendered page 35 as 450758 bytes of PNG
rendered page 36 as 450702 bytes of PNG
rendered page 37 as 316399 bytes of PNG
rendered page 38 as 354259 bytes of PNG
rendered page 39 as 387184 bytes of PNG
rendered page 40 as 254847 bytes of PNG
rendered page 41 as 426819 bytes of PNG
rendered page 42 as 186205 bytes of PNG
rendered page 43 as 400066 bytes of PNG
rendered page 44 as 380871 bytes of PNG
rendered page 45 as 388221 bytes of PNG
rendered page 46 as 277809 bytes of PNG
rendered page 47 as 399227 bytes of PNG
rendered page 48 as 212261 bytes of PNG
rendered page 49 as 366544 bytes of PNG
rendered page 50 as 467029 bytes of PNG
rendered page 51 as 518713 bytes of PNG
rendered page 52 as 420412 bytes of PNG
rendered page 53 as 206056 bytes of PNG
rendered page 54 as 86764 bytes of PNG
rendered page 55 as 27297 bytes of PNG
rendered page 56 as 379708 bytes of PNG
rendered page 57 as 164827 bytes of PNG
528840:8.94
So 58 pages in 9s, with a peak memory use of about 530 MB (%M reports kilobytes).
Rather than saving to a buffer and then uploading, you can upload directly to S3 with page.write_to_target(), there's some sample code in examples/. It might save a little time and memory.
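As a rough sketch of the write_to_target() route: pyvips lets you attach a write callback to a pyvips.TargetCustom, so encoded PNG bytes can be forwarded chunk by chunk instead of being held in one buffer. ChunkSink and save_page are hypothetical names made up for illustration, and the sink here just writes to a file-like object. A real S3 uploader would replace it and is not shown.

```python
import io


class ChunkSink:
    """Hypothetical sink: forwards each encoded chunk to a file-like object."""

    def __init__(self, fileobj=None):
        # default to an in-memory buffer so the class can be tried on its own
        self.fileobj = fileobj if fileobj is not None else io.BytesIO()
        self.bytes_written = 0

    def write(self, chunk):
        # the write handler should return the number of bytes it consumed
        data = bytes(chunk)
        self.fileobj.write(data)
        self.bytes_written += len(data)
        return len(data)


def save_page(filename, page, sink, size=2048):
    """Render one page and stream it as PNG into sink (needs pyvips)."""
    import pyvips  # imported here so ChunkSink is usable without libvips

    image = pyvips.Image.thumbnail(f"{filename}[page={page}]", size)[:3]
    image = image.gravity("centre", size, size, extend="white")

    target = pyvips.TargetCustom()
    target.on_write(sink.write)
    image.write_to_target(target, ".png")
    return sink.bytes_written
```

For example, save_page("file.pdf", 0, ChunkSink(open("page-0.png", "wb"))) would write the first page to disk without building the whole PNG in memory first.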
from pyvips.
Thanks, maybe the problem was also that I wasn't using pdfload. I'll try it myself and report back soon.
from pyvips.
pdfload won't make any difference, I was just trying to be clear. If you use new_from_file it'll work for any multi-page format, e.g. GIF.
from pyvips.
Actually, about thumbnail, as I see in the docs: static thumbnail(filename, ...).
It takes a filename, so I'd have to provide a path to the image, but with thumbnail_image I could use an object. Is there a way to use an object with the regular thumbnail?
from pyvips.
There's thumbnail_buffer() to thumbnail image data held as a string-like object, if that's what you mean.
from pyvips.
Don't resize, just load at the correct size with thumbnail, then do any padding with gravity. You'll get better quality, lower memory use, and it'll be quicker too.
from pyvips.
I've tried it and it works. Max memory usage is program size + PDF size, which seems awesome. The issue was probably exactly that I was trying to keep each page as an object in memory. Speed increased too, even without the target write.
Thanks a lot
from pyvips.
Great!
(tangent, but it wasn't a memory leak -- that implies a memory reference has been lost due to a bug -- you were just seeing unexpectedly high memuse due to your program design)
from pyvips.
Sorry, I hadn't tried it earlier, but I've now run your code in a container and saw 2+ GB of memory use. I think it could be a Docker daemon problem, but I'm not sure, and I don't know how to fix it from pyvips.
Memory use grows exactly at the write_to_buffer call. Without it, memory use is fine for the task.
from pyvips.
Maybe your container isn't using poppler to load PDFs, but is falling back to imagemagick? But that's just a guess, you need to share complete examples I can try before I can help.
from pyvips.
docker-compose.yaml file:
services:
  servise_name:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: name
    image: name
Dockerfile:
FROM python:3.9-slim
RUN apt update && apt install -y \
    libmupdf-dev \
    libfreetype6-dev \
    libjpeg-dev \
    libglib2.0-0 \
    libgl1-mesa-glx \
    libpq-dev \
    libvips-dev --no-install-recommends \
    poppler-utils \
    gcc
WORKDIR /app
COPY . .
RUN python3 -m pip install --no-cache-dir -r requirements.txt
CMD ["python3", "main.py"]
It's enough to have only pyvips in requirements.txt for this example.
main.py file:
import pyvips
def main():
    image = pyvips.Image.pdfload("file.pdf")
    for i in range(image.get("n-pages")):
        # load to fill a 2048 x 2048 box
        page = pyvips.Image.thumbnail("file.pdf" + f"[page={i}]", 2048)[:3]
        page = page.gravity("centre", 2048, 2048, extend="white")
        data = page.write_to_buffer(".png")
        print(f"rendered page {i} as {len(data)} bytes of PNG")
if __name__ == "__main__":
    main()
It's all I'm using to get 2G+ of memuse (+ pdf file)
from pyvips.
There's a lot of stuff you don't need in that dockerfile, I'd just have:
FROM python:3.9-slim
RUN apt-get update \
    && apt-get install -y \
        build-essential \
        pkg-config
RUN apt-get -y install --no-install-recommends libvips-dev
RUN pip install pyvips
I made a test dir:
https://github.com/jcupitt/docker-builds/tree/master/pyvips-python3.9
If I run:
docker build -t pyvips-python3.9 .
docker run -it --rm -v $PWD:/data pyvips-python3.9 ./main.py nipguide.pdf
I see a fairly steady c. 400mb in top.
from pyvips.
I copied your code (with CMD ["python", "main.py"] at the end of the Dockerfile) and I'm getting 1.5 GB+.
Could you please try with my file?
from pyvips.
Yes, I see about 1.5g with that file too. It has huge image overlays on every page, so I think that's to be expected. It's just a very heavy PDF.
from pyvips.
But the images + PDF use much less memory than that, so why is the rest of the memory still reserved and not freed? It looks like some cache that hasn't been cleared, because memory use stays at ~1 GB after processing.
from pyvips.
I would guess it's memory fragmentation from handling those huge overlays.
from pyvips.
How are you measuring memory use? RES in top is probably the most useful.
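If watching top by hand is awkward inside a container, one option is to log the same number from within the script. This is only a sketch and assumes Linux, where the VmRSS line of /proc/self/status reports the resident set size (the figure top shows as RES). The helper names parse_vmrss_kb and current_rss_kb are made up for illustration.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def current_rss_kb():
    """Resident set size of this process in kB (Linux only)."""
    with open("/proc/self/status") as f:
        return parse_vmrss_kb(f.read())
```

Calling current_rss_kb() after each write_to_buffer() and printing the result would show whether memory climbs page by page or jumps on particular pages.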
from pyvips.
I'm using docker stats.
Could I clear or free some of this allocated memory? I'm trying to reduce memory limits for container
from pyvips.
I guess you could use the cli instead:
#!/bin/bash
pdf=$1
n_pages=$(vipsheader -f n-pages "$pdf")
for ((i=0; i < n_pages; i++)); do
  echo "processing page $i ..."
  vipsthumbnail "$pdf[page=$i]" --size 2048 -o t1.v
  vips extract_band t1.v t2.v 0 --n 3
  vips gravity t2.v page-$i.png centre 2048 2048 --extend white
done
rm t1.v t2.v
It's a bit slower though, and you'll see a lot of block IO.
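If you'd rather keep the driver in Python, the same per-page commands could be launched with subprocess, so that each step still runs in its own short-lived process and its memory goes back to the OS when it exits. This is a sketch that mirrors the script above rather than a tested recipe. page_commands and render_pdf are hypothetical names, and the output size is assumed to be a 2048 x 2048 box.

```python
import subprocess
from pathlib import Path


def page_commands(pdf, page, size=2048):
    """Build the three CLI invocations used for one page of the PDF."""
    return [
        ["vipsthumbnail", f"{pdf}[page={page}]", "--size", str(size), "-o", "t1.v"],
        ["vips", "extract_band", "t1.v", "t2.v", "0", "--n", "3"],
        ["vips", "gravity", "t2.v", f"page-{page}.png", "centre",
         str(size), str(size), "--extend", "white"],
    ]


def render_pdf(pdf):
    """Run the pipeline page by page, one external process per step."""
    header = subprocess.run(
        ["vipsheader", "-f", "n-pages", pdf],
        check=True, capture_output=True, text=True,
    )
    n_pages = int(header.stdout.strip())
    for page in range(n_pages):
        print(f"processing page {page} ...")
        for argv in page_commands(pdf, page):
            subprocess.run(argv, check=True)
    # remove the intermediate files, as the shell version does
    for tmp in ("t1.v", "t2.v"):
        Path(tmp).unlink(missing_ok=True)
```

Passing argument lists instead of a shell string avoids quoting problems with the [page=N] suffix.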
from pyvips.
Another option would be to use a malloc that avoids fragmentation, like jemalloc.
But that's harder to set up.
from pyvips.
Ok, thanks, I'll try one of those
from pyvips.