Comments (3)
You rarely need PDF-to-PDF conversion at all. Previously, a valid motivation was to convert annotations and fields into permanent parts of the pages. That motivation is gone now that we have Document.bake().
To get just a bytes object from the reduced PDF (some pages omitted), simply use Document.tobytes(...), which is nothing else but a save() to memory instead of disk.
from pymupdf.
- We can only accept bug posts that can be reproduced. Your post has no reproducible data like a reproducing file.
- For building / copying page range subsets of a given PDF, you are using an inadequate method! What you are doing instead is a PDF-to-PDF conversion. As documented, this will work only if the source PDF contains no errors. Obviously this is not the case for your file - of course I am forced to guess here, given the circumstances.
I suggest you use one of the following approaches:
- Directly create a subset of the page numbers you are interested in by specifying a list of the relevant 0-based page numbers, for example doc.select([0, 2, 4, 8, 4711]). This will strip down the PDF accordingly, keeping intact things like the relevant part of the Table of Contents and more. As a side note: the page numbers must be in valid range, but they may contain duplicates and they need not be ascending.
- Make a new (target) PDF and execute one or more target.insert_pdf() calls, specifying the desired page ranges. This will lead to a target PDF without Table of Contents or other document-wide source PDF properties.
When done, don't forget to save the resulting PDF with maximum compression, i.e. execute method ez_save(...).
Thanks for the detailed reply @JorjMcKie!
I have now tried it with a couple of other PDFs and it works with them, so it seems to be an issue with the PDFs I have here. They are PDFs of concatenated scans in varying orientations. Sorry for not being able to provide them here for reproducibility.
However, I am getting the same error with target.insert_pdf() for these files. And since I was interested in getting a Python bytes object of the page in question (for DB storage and upload to an API), I still feel that doc.convert_to_pdf() is a fitting option for this use case, or am I missing something?
But as this seems to be an error in the PDF, I suppose this issue can be closed.