unidoc / unipdf Goto Github PK

Golang PDF library for creating and processing PDF files (pure go)

License: Other

Go 100.00%

golang pdf pdf-library pdf-generation pdf-document-processor text-extraction pdf-manipulation pdf-compression pdf-reports signing

unipdf's Introduction

UniPDF - PDF for Go

UniDoc UniPDF is a PDF library for Go (golang) with capabilities for creating and reading, processing PDF files. The library is written and supported by FoxyUtils.com, where the library is used to power many of its services.

Features

Create PDF reports. Example output: unidoc-report.pdf.
Table PDF reports. Example output: unipdf-tables.pdf.
Invoice creation
Styled paragraphs
Merge PDF pages
Split PDF pages and change page order
Rotate pages
Extract text from PDF files
Text extraction support with size, position and formatting info
PDF to CSV illustrates extracting tabular data from PDF.
Extract images with coordinates
Images to PDF
Add images to pages
Compress and optimize PDF
Watermark PDF files
Advanced page manipulation: Put 4 pages on 1
Load PDF templates and modify
Form creation
Fill and flatten forms
Fill out forms and FDF merging
Unlock PDF files / remove password
Protect PDF files with a password
Digital signing validation and signing
CCITTFaxDecode decoding and encoding support
JBIG2 decoding support

Multiple examples are provided in our example repository https://github.com/unidoc/unipdf-examples.

Installation

With modules:

go get github.com/unidoc/unipdf/v3

License key

This software package (unipdf) is a commercial product and requires a license code to operate.

To Get a Metered License API Key in for free in the Free Tier, sign up on https://cloud.unidoc.io

How can I convince myself and my boss to buy unipdf rather using a free alternative?

The choice is yours. There are multiple respectable efforts out there that can do many useful things.

In UniDoc, we work hard to provide production quality builds taking every detail into consideration and providing excellent support to our customers. See our testimonials for example.

Security. We take security very seriously and we restrict access to github.com/unidoc/unipdf repository with protected branches and only the founders have access and every commit is reviewed prior to being accepted.

The profits are invested back into making unipdf better. We want to make the best possible product and in order to do that we need the best people to contribute. A large fraction of the profits made goes back into developing unipdf. That way we have been able to get many excellent people to work and contribute to unipdf that would not be able to contribute their work for free.

Contributing

If you are interested in contributing, please contact us.

Go Version Compatibility

Officially we support three latest Go versions, but internally we would test the build with up to five latest Go versions in our CI runner.

Support and consulting

Please email us at [email protected] for any queries.

If you have any specific tasks that need to be done, we offer consulting in certain cases. Please contact us with a brief summary of what you need and we will get back to you with a quote, if appropriate.

License agreement

The use of this software package is governed by the end-user license agreement (EULA) available at: https://unidoc.io/eula/

unipdf's People

Contributors

Stargazers

Watchers

Forkers

gunnsth jangocheng boutros kucjac gpsbird freman shammishailaj qsdj isgasho koalacxr wudi zanooc peterwilliams97 sait fjorks jangocity hy05190134 stephenmiracle r2pgl daniel-007 innermond xintangli scripbox junftnt priestd09 shnwang pengzhenpin backwardn charygao becoded virtuald johanneskaufmann nickshen3 dontbesad tinygg austinmcrane knadh mrosen95 grailbio-external pyjamaroutines falconandy gabriel-vasile simonxing tonywubo rizalgowandy sprucehealth ericdotwang snwfdhmp juruen theassyrian divan yuzic go-app-cloud pingyuan7 yiqideren iamtekeste zhanghongnian tsdrm lanre-ade taark databill86 fbngrm oscarpfernandez hexennacht pzduniak bitmaskit-forks horsley lu4p igaisin liguoqinjim lianggx6 affordablemobiles timdrysdale wb253 muqp changsongyang rafaelsanzio shuttlestone mgirouard kokizzu rodrickmakore up1 thelinker ahed91 b2nil donjayamanne drizzef binhnguyenduc 4thel00z hakwolf kaunge-fork qichengzhang shouldend safetyculture jiushicheng oliverpool unidoc-build cywforce joicemjoseph kellerli

unipdf's Issues

Memory management for large PDFs

Currently it isn't possible to parse/edit/write PDFs if their size approaches the available memory on the computer. Unidoc parses the object tree and stores them in memory. Based on the architecture of a PDF file we should be able to parse (and for example extract text) from arbitrarily large PDFs that greatly exceed the memory capacity of the server. #128 could resolve a lot of this but would require pages/objects to be able to be freed from memory when no longer needed.

Additionally when writing large PDFs there would need to be something implemented to either cache objects to disk before writing the completed PDF or stream the PDF writing while each page is completed and releasing the memory back. This would get tricky with shared objects.

Compress: Encode binary images as CCITTFaxDecode

The CCITTFaxDecode filter (and JBIG2 which is under development) is particularly well suited for encoding binary images (0/1). Images for optimization that are binary (DeviceGray and/or BitsPerComponent 1) should be encoded with CCITTFaxDecode filter.

Color transformation issues

Issues with certain test files:

Black in background

000008.pdf Black outline box appears around "Mar 27, 2017, 2K50 pm AEDT"
000011.pdf Light text columns on page 2 have become dark e.g. "Technical Support"
000023.pdf Black box appears at top left of page 1
000058.orig.pdf Black background appears below "A marketing email ... "

Other issues

000040.pdf Blue text has same gray level as black text
000050.pdf Blend on left of page 1 is not printed

PDF image file processing issue: bad RST marker error

Reported by Peter Williams:

I get an invalid JPEG format: bad RST marker error on this

This looks like an error in the Go JPEG library.

MFP Scan.pdf

Configure Rotation Origin

We have code to scale down PDFs to A4 documents and center them, as part of this we want to output a portrait PDF. If the input PDF was landscape this means they end up much smaller so we want to make sure they are 90 degrees rotate so they benefit from having as much space as possible.

In the current API it rotates from some sort of origin the library defaults to, we looked through the code and it looks like the only way to control this is to hack it by doing some translation before rotating.

I noticed you have a TODO to control the origin which we would be interested in as we want to be able to control it so we can simply rotate by 90 degrees and center it (The centering code completely breaks at the moment as it assumes the origin is in the middle.

Add support for Crypt filter in streams

Current state

PDF has a feature where streams can be encrypted with a crypt filter that is specified via DecodeParms that refers to the /CF dictionary of the /Encrypt dictionary. If DecodeParms is missing the Identity filter is used (raw data unchanged)

From 7.6.5 Crypt Filters (PDF32000_2008.PDF):

A stream filter type, the Crypt filter (see 7.4.10, "Crypt Filter") can be specified for any stream in the
document to override the default filter for streams.

For example here is a case where /Crypt is specified and DecodeParms is missing (Identity filter) so the data is left in tact. Seems to be used for metadata sometimes.

165 0 obj<</Length 3575/Filter[/Crypt]/Type/Metadata/Subtype/XML>>stream^M
<?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>

Another example from PDF32000_2008.PDF (p. 77)

5 0 obj
<< /Title ($#*#%*$#^&##) >> % Info dictionary: encrypted text string
endobj
6 0 obj
<< /Type /Metadata
/Subtype /XML
/Length 15
/Filter [/Crypt] % Uses a crypt filter
/DecodeParms % with these parameters
<< /Type /CryptFilterDecodeParms
/Name /Identity % Indicates no encryption
>>
>>
stream
XML metadata % Unencrypted metadata
endstream
endobj
8 0 obj % Encryption dictionary
<< /Filter /MySecurityHandlerName
/V 4 % Version 4: allow crypt filters
/CF % List of crypt filters
<< /MyFilter0
<< /Type /CryptFilter
/CFM V2 >> % Uses the standard algorithm
>>
/StrF /MyFilter0 % Strings are decrypted using /MyFilter0
/StmF /MyFilter0 % Streams are decrypted using /MyFilter0
... % Private data for /MySecurityHandlerName
/MyUnsecureKey (12345678)
/EncryptMetadata false
>>
endobj

Proposed change

Add support for supporting the Crypt filter in a similar fashion as other stream filters in core/stream.go

Support JBIG2Decode filter

Add support for decoding and encoding with the JBIG2 standard.

See section 7.4.7 JBIG2Decode Filter in the PDF reference (PDF32000_2008):

The JBIG2Decode filter decodes monochrome (1 bit per pixel) image data
that has been encoded using JBIG2 encoding.

JBIG stands for the Joint Bi-Level Image Experts Group, a group within the ISO that developed the format. JBIG2 is the second version of a standard originally released as JBIG1.
JBIG2 encoding which provides for both lossy and lossless compression, is only useful monochrome images, not for color images, grayscale images or general data.
The algorithms are described in ISO/IEC 11544 published standard for the current JBIG2 specification.
In general JBIG2 provides considerably better compression than the existing CCITT standard.

The optional parameters for JBIG2Decode filter in PDF are:

JBIG2Globals - a stream containing the JBIG2 global (page 0) segments.

See also Example 1 in the standard which can be used as a testcase.

Implementation

Makes sense to implement as a package jbig2 that can be included internally in unidoc. Should be licensed with the same license as the unidoc project.
Start by focusing on decoding, can use the example provided in the PDF reference, and extract some JBIG2Decoded data from PDF files
Code should follow the unidoc style guide
Encoding should also be implemented in the package

Notes

I am currently not aware of any golang implementations of JBIG2. However, there are a few open source implementions in other languages that might be a good reference.

model.PdfAcroForm should have a Merge function to merge forms

This should also include a test case with merging PDFs with forms and annotations writing out, loading and checking that it is as expected.

Support for Type0 CIDType0 fonts

Currently text extraction fails on some text using this font type. Need to add support for it to properly work for extraction.

Compress: Expand colorspace handling for images and make processing more generic

Currently the colorspace handling only supports DeviceGray and DeviceRGB and the handling is simplistic only looping through the images in XObject and compressing all of those. If any image was never used in the contentstream it would still not be removed for example.
Also this means that inline images are not handled.

The handling should be made more generic and use the ContentStreamProcessor to process the contents. The colorspace handling should also be more generic and fall back to alternative colorspaces in cases where not properly supported. The handling should be similar as for example in:
https://github.com/unidoc/unidoc-examples/blob/v3/pdf/advanced/pdf_grayscale_transform.go

although we can ignore handling of patterns and shadings at the moment.

Take care to remove resources that are not used. Perhaps that should be its own optimization as it makes sense to remove images, fonts, colorspaces or any other resources that are not actually used.

Change WriteString signature to a Write method (with io.Reader)

The string type is really not the right one for the writer and the eventual output is always []byte. Would make sense to change the signature to Write(io.Reader) instead of the WriteString. Easy to use the interface for any kinds of testing as well. May improve performance.

Compress: Detect binary images and encode with a binary image encoding algorithm

Images with only 2 values (min/max) are suitable for encoding with CCITTFaxDecode and JBIG2 (when encoding support is ready).

Add a flag BinaryImageOptimization = true by default which looks at the image values and detects whether it is suitable for encoding with CCITTFaxDecode. Note that this is slightly different from #428 as this involves looking at the values (color histogram), for example a BitPerComponent = 8 using only color values 0/255 would be suitable for encoding as a binary image.

It might also make sense to define some threshold so that the number of pixels outside the min/max bins could be defined so that a certain percentage of pixels could fall outside and would be interpolated to the closest bin. For example if 99% of pixels fall into 0, 255 and then there are some pixels with values 1,5,9,230, 150, 250, those would be interpolated to the closest value 0,0,0,255,255,255 so that there are only 2 values prior to feeding to the CCITTFaxDecode algorithm. The threshold should be defined as BinaryImageOptimizationThreshold = 0.99 (float64).

Inspect method not picking up embedded JavaScript

Hey guys, really enjoying using unidoc, just spotted something (and apologies if this is an oversight)

The pdfReader.Inspect method doesn't appear to pull the JavaScript out of the following file:

%PDF-1.7
4 0 obj
<<
/Length 0
>>
stream

endstream endobj
5 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 3 0 R
/Contents 4 0 R
/MediaBox [ 0 0 612 792 ]

>>
 endobj
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 5 0 R /Fit ]
  /Names << % the Javascript entry
    /JavaScript <<
      /Names [
        (EmbeddedJS)
        <<
          /S /JavaScript
          /JS (
            app.alert('Hello, World!');
          )
        >>
      ]
    >>
  >> % end of the javascript entry
>>
 endobj
2 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 5 0 R ]

>>
 endobj
3 0 obj
<<
>>
 endobj
xref
0 6
0000000000 65535 f 
0000000166 00000 n 
0000000244 00000 n 
0000000305 00000 n 
0000000009 00000 n 
0000000058 00000 n 
trailer <<
/Size 6
/Root 1 0 R
>>
startxref
327
%%EOF

It correctly identifies the number of pages and etc, just doesn't pick up the JS around line 25

Additional question: do you guys support stripping this kind of stuff from the file? The example does a great job of explaining how to find JS/Flash/Video in a PDF, but not how to remove (if possible)

Thanks!

Reduce package size - Move large testdata to a separate package

Currently the size of a fresh checkout of unidoc v3 is over 50mb.
Go through each testdata file and determine if it needs to be in unidoc testing or can go into private testdata and be activated via environment variable.

Create model.PdfCatalog type

Create a model.PdfCatalog type

// PdfCatalog represents the root Catalog dictionary (section 7.7.2 p. 79)
type PdfCatalog struct {
    Type *core.PdfObjectName
    Version *core.PdfObjectName
etc...
}

Will make it easier to work with externally and simplify usage within the model package.

Unable to extract and add again a font to a new pdf document

Hello to all, I am trying to extract certain text from a PDF page, and writing it back on a new PDF document.
I am experimenting with v3 branch, that have new api tro extract vectorized text from the page.
In order to return all the text marks -that are private in the current version of the api- I created a new convencience struct returning from a new getter in PageText struct.

The text extraction works well, the problem is that I am unable to set the font for a new paragraph element, created iterating returning marks, the same as the original one.

I also tried to add the font before in the page and then to the paragraph, but I have only errors, such as:

[DEBUG] simple.go:56 ERROR: NewSimpleTextEncoder. Unknown encoding "default"
[DEBUG] simple.go:56 ERROR: NewSimpleTextEncoder. Unknown encoding "custom"
error is: unsupported font encoding

or, with another PDF files:

[DEBUG] ttfparser.go:527 parseCmapVersion: format=0 length=262 language=0
[DEBUG] ttfparser.go:732 No PostScript name information is provided for the font.

This is the test code I am using: Gist
You can pull the library with changes here: Repo Link

I attach the PDF files I'm testing on.

Thank you in advance!
newspaper.pdf

Add test cases for FlateDecode post-decode prediction

Should test:

TIFF encoding prediction
PNG prediction

None, Sub, Up, Average, Paeth, Combination

Include cases with 1,3,4 color components.

Can assume BitsPerComponent = 8 for now. Support for more BPC will be included in a separate ticket which will include making tests.

Advanced text extraction on columns, tables, equations

To properly extract certain text in PDF, it may be necessary to detect/group lines, identify tables, equations. This may either be done post-extraction of objects or before, depending on what is easier to implement and gives good results.

Also need to assemble a solid corpus for testing, as well as an API prototyping. Tabular extraction may need a different approach than equations and possibly a different API.

At this point we are collecting input so that we can define this issue better.

Support JPXDecode Filter (JPEG2000)

Add support for decoding and encoding with the JPEG2000 standard.

See section 7.4.9 JPXDecode Filter (PDF32000_2008):

The JPXDecode filter decodes data that has been encoded using the JPEG2000 compression
method, an ISO standard for the compression and packaging of image data.

JPEG2000 defines a wavelet-based method for image compression that gives somewhat better size reduction than other methods such as regular JPEG or CCITT.
Only applied to image XObjects, not inline images
Suitable for both images with a single or multiple color components and bits per sample ranging between 1-38, so it is quite flexible.
Very flexible in terms of colorspaces etc, see 7.4.9 for more information.
See http://www.jpeg.org/jpeg2000/ and ISO/IEC 15444-2, Information Technology, JPEG 2000 Image Coding System: Extensions.

Implementation

Makes sense to implement as a package jpeg2000 that can be included internally in unidoc. Should be licensed with the same license as the unidoc project.
Start by focusing on decoding, can use the example provided in the PDF reference, and extract some JPXDecode data from PDF files
Code should follow the unidoc style guide
Encoding should also be implemented in the package such that go/pdf image data can be reencoded with JPEG2000 and embedded in the PDF.
Make some benchmarks to set a baseline and fix obvious performance issues

Notes

I am currently not aware of any golang implementations of JPEG2000. However, there are a few open source implementions in other languages that might be a good reference in addition to the standard.

support convert pdf to image?

is plan support convert pdf to image?

Compress: Defer to original image if optimized image gets bigger

Playing around with the compression I found that in some cases the "optimized" images can get bigger than the original ones. In some of the cases this was for example because a CCITTFaxDecoded image was converted to JPEG (DCT). That will be handled in a separate ticket. However, also saw that a JPEG image with exactly the same parameters could become bigger after decoding and reencoding.

In any case, it is always better to go back to the original object for the cases when the optimized image is bigger.

Loading error: Encrypt dict null object

Loading certain PDFs from a test corpus leads to the following error:

[TRACE]  parser.go:1647 Trailer: Dict("Size": 53, "Info": Ref(29 0), "Encrypt": null, "Root": Ref(35 0), "Prev": 256729, "ID": [qXv�b, ��$6"h- %], )
[TRACE]  parser.go:1681 Checking encryption dictionary!
[TRACE]  parser.go:1686 Is encrypted!
[DEBUG]  pdf_passthrough_bench.go:282 Reader create error unsupported type: *core.PdfObjectNull

/tmp/encrypt-dict-null/263071.pdf - fail unsupported type: *core.PdfObjectNull
                    263071.pdf  0.3     false   1.6     Error: unsupported type: *core.PdfObjectNull

Attachment: failing files
encrypt-dict-null.zip

Add compare methods for dictionaries

Make shallow/deep compare methods for dictionaries & indirect objects. This will allow unneccessary duplication of objects that are identical, for example when creating resource dictionaries etc.

Font subsetting for embedded truetype fonts

Current state

Currently, font files are embedded their entirety. This can be somewhat wasteful, as often only a small portion of glyphs are used, and font files can be large especially for unicode fonts with large numbers of glyphs.

There are two use cases:

Creating reports containing large fonts (typically CJK fonts can be very big).
(2. Optimizing already created PDF files that contain embedded fonts.)

Those two cases may require slightly different approaches to be done efficiently. So it is probably best to keep them separate. Here we will focus on the first use case (for creating PDFs).

Proposed changes

This requires.

Identifying fonts for subsetting. Probably best if user marks the font for subsetting since subsetting may not be desired in all cases.
Identifying which glyphs to keep. Perhaps the encoder could track all glyphs/runes that are referenced (Encode func).
Creating subsetted fonts and labelling them as such (postscript naming convention). We will
use https://github.com/unidoc/unitype to do the subsetting.
This would probably be best to do at serialization time, if the font had been marked for subsetting along with which glyphs to keep. Example use case:

fnt, _ := NewCompositePdfFontFromTTFFile("largefnt.ttf")
fnt.Subset(true) // Marks font for subsetting on write
// then use fnt as normally.
// Each call to the font's encoder Encode will record use of glyph to be used.

Expected results

Significantly smaller generated PDF files using TTF fonts.

References

Section 9.6.4 Font Subsets (PDF32000_2008):

PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor
that describe a font subset are slightly different from those of ordinary fonts. These differences 
allow a conforming reader to recognize font subsets and to merge documents containing different
subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)

For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the 
font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall
consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the
same PDF file shall have different tags.

EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font

And in section 9.9 (Embedded Font Programs) it states:

A TrueType font program may be used as part of either a font or a CIDFont. Although the basic
font file format is the same in both cases, there are different requirements for what information
shall be present in the font program. These TrueType tables shall always be present if present in 
the original TrueType font program:
    “head”, “hhea”, “loca”, “maxp”, “cvt”, “prep”, “glyf”, “hmtx”, and “fpgm”. 
If used with a simple font dictionary, the
font program shall additionally contain a cmap table defining one or more encodings, 
as discussed in 9.6.6.4, "Encodings for TrueType Fonts". If used with a CIDFont dictionary,
the cmap table is not needed and shall not be present, since the mapping from character codes 
to glyph descriptions is provided separately.

Section 9.6.6.4 (Encodings for TrueType fonts) additionally describes how TrueType cmaps and font dictionary's Encoding are used to map between character codes and glyph descriptions.

Improve readability of test cases in core package

The goal is to make the tests more clear and easier to read as well as improve coverage.
Steps:

Go through the tests and for each function define the inputs and expected outputs.
Use testify to improve readability
Process each testcase (t.Run) and compare to expected value with assert.Equal / assert.NoError etc
Go through and add tests if any obvious test cases that can be added.

Currently the unit test coverage is [v3]:
$ go test -cover .
ok github.com/unidoc/unidoc/pdf/core 0.214s coverage: 48.7% of statements

whereas with cross-package tests the coverage is ~63.85% according to codecov.io.

Core: Refactor parser to use bufferedReadSeeker type

Based on discussion in #441. The proposal is to create a buffered reader type which encapsulates a ReadSeeker and a buffered Reder (bufio.Reader). Rather than accessing and working with both separately.

The suggested type is something as follows:

// bufferedReadSeeker offers buffered read access to a seekable source (such as a file) with seek capabilities.
type bufferedReadSeeker struct {
rs io.ReadSeeker
reader *bufio.Reader
}
// Implement the Read and Seek methods.

The reason for using the bufio.Reader is purely based on performance during parsing as it has a buffer. Any time we change the offset position of rs a new reader must be constructed with a new buffer (and use the buffered information to correct for offsets as done in parser.GetPosition).

Validation Standards PDF/A

Improper handling of 1 bit images when converting to native go images

Extracting 1 bit grayscale images from a pdf and using ToGoImage() result in 8bit golang grayscale images with 0 for black (correct) but 1 for white (instead of 255). The rest of the extraction appears to be correct.

Code in ToGoImage():

if this.ColorComponents == 1 {
    if this.BitsPerComponent == 16 {
        val := uint16(samples[i])<<8 | uint16(samples[i+1])
        c = gocolor.Gray16{val}
    } else {
        val := uint8(samples[i] & 0xff)
        c = gocolor.Gray{val}
    }
}

Switching val := uint8(samples[i] & 0xff) to val := uint8(samples[i] * uint32(256 / this.BitsPerComponent-1)) resolves the issue for 1 bit images but haven't tested it for other cases.

Loading error: Outline root should be an indirect object

Problem with loading files leads to the error:

[TRACE]  reader.go:274 Outline root: Object stream 375: Dict("Filter": FlateDecode, "Length": Ref(376 0), )
[DEBUG]  reader.go:246 ERROR: Failed to build outline tree (outline root should be an indirect object)
[DEBUG]  pdf_passthrough_bench.go:282 Reader create error outline root should be an indirect object

/tmp/outline-root/005168.pdf - fail outline root should be an indirect object
                    005168.pdf  2.6     false   0.7     Error: outline root should be an indirect object

Attachments:
outline-root.zip

v3: Generated file 2_p_multi.pdf has error: The font 'FreeSans' contains bad /Widths in Adobe Reader

No errors in ghostscript. Needs investigation.

Text redaction support

Does the library support the primitives required to implement some kind of redaction function? I'm figuring it would require:

producing a stream of text content which contains sufficient information to map text sequences back to their underlying tokens
a method to calculate a rectangle around a sequence of text (for the purpose of creating a rectangular annotation over the region on the page where the text is)
Some way of altering the original text it can't be recovered, but is replaced with something that doesn't disturb the layout of the rest of the page.

Support for search replace

Extraction of diagonal text is problematic

@peterwilliams97 Can you provide a specific example of something that does not work? Would be good to have a snippet from the content stream

Any support for embedded files?

Is there currently any support for embedded files?

From the looks of it, I would need to read in content streams not attached to a specific page.

Thanks

pdf_describe doesn't handle 4 bpc images

A simple neural network module for relational reasoning.pdf

ExtractText and Titles and paragraphs

The extractor ExtractText method doesn't seem to extract paragraphs or take titles into account. It would be great to have access to titles and paragraphs of text while able to ignore headers and footers.

This JavaScript implementation of PDF to Markdown is actually quite good:

Is this possible using UniDoc?

Go over lengthy examples in unipdf-examples and consider adding higher level functions in unipdf

Overview of long examples with potential enhancements:

Example	#Lines	Comment
pages/pdf_merge_advanced.go	334	Advanced merging - done in unicli/pdf
page/pdf_list_images.go	253	Reimplement with extractor
image/pdf_extract_images.go	253	Reimplement with extractor
forms/pdf_form_list_fields.go	198	High level interface - fjson/extractor/form
analysis/pdf_fonts.go	189	Extractor should have capability to extract fonts (page basis)
metadata/pdf_metadata_get_xml.go	181	High level metadata interface
barcode/pdf_add_barcode.go	173	Could be easier to add an image to a PDF page of existing document?
forms/pdf_form_add.go	163	High level forms interface (creator/form/fjson)

Optimization issue: Output PDF larger and looks worse

Problem file: 095121_v02.pdf

Options:

unioptimize.Options{
    CombineDuplicateDirectObjects:   true,
    CombineIdenticalIndirectObjects: true,
    ImageUpperPPI: 100.0,
    CombineDuplicateStreams:         true,
    CompressStreams:                 true,
    UseObjectStreams:                true,
    ImageQuality:                    80,
}

Output PDF (1083kB) is larger than the original PDF (672kB) and lookse more blurry. Needs some investigation. Could be related to BitsPerComponent as this is a scanned image probably low bit depth.

Support arbitrary bits per component in FlateEncoder (both encoding, decoding)

Current state

Currently our support is limited to BitsPerComponent = 8 (BPC).
According to PDF32000_2008 the value for BitsPerComponent:

The number of bits used to represent each colour component in a sample. Valid values
are 1, 2, 4, 8, and (PDF 1.5) 16. Default value: 8.

In practice, BPC=8 is by far the most common. However, for completeness we need to support all and it will improve the code.

Proposal

Add a benchmark for the current implementation with BPC = 8. This will help to avoid any regression.
Develop the improved algorithm with an alternative name. Consider putting in a separate package if it is more than a few lines.
Create test cases for the new algorithm and ensure that it works as intended for different BPC values.
Try the benchmark on the new algorithm and compare results, should not be regressing.
Check memory use. Ensure that the new algorithm is not taking much more memory than the current implementation.

PDF compression testing and enhancements

The current implementation of PDF compression has some issues, in particular with image handling.

We need to test on some actual PDF files and check whether the output is as expected. Checking errors, file size, and comparing pages in a PDF viewer.

For any bug that comes up we need to set the PDF aside and create a ticket (unless we can fix the problem easily right away).

For implementation: Can use identity/passthrough benchmark as a basis (pdf_passthrough_bench.go)
and add compression flag:

https://github.com/unidoc/unidoc-examples/blob/v3/pdf/testing/pdf_passthrough_bench.go

Something like:

if params.optimize {
	optim := optimize.New(optimize.Options{
		CombineDuplicateDirectObjects: true,
		CombineIdenticalIndirectObjects: true,
		ImageUpperPPI: 100.0,
		UseObjectStreams: true,
		ImageQuality: 50,
		CombineDuplicateStreams: true,
	})
	writer.SetOptimizer(optim)
}

Clearly need to try changing the parameters and see if can find more bugs.

Creator: PDF form builder

A simple way to build forms should be present in the creator package. Together with tables etc it should provide a an easy way to make PDF forms using the creator. Advanced support for forms exists in package model although the use is pretty low level and will serve as the basis for this.

Should support:

Input text form (text fields 12.7.4.3 PDF32000_2008).
Checkboxes (12.7.4.2.3)
Radio group selection (12.7.4.2.4)
Buttons (12.7.4.2.4)
Dropdown menu - Choice fields (12.7.4.4)

Basic API ideas:

c := creator.New()
form := c.NewForm()  // Serves as the AcroForm for the document represented by `c`.

tf := form.NewTextField("field_name1")  // occupies available width with some default height
tf.SetDefaultText("Enter your name")
// can force certain width with tf.SetWidth(100)
// tf.SetRelHeight(1.2)  1.2 X fontheight
c.Draw(tf)

cf := form.NextCheckboxField("checkbox_male")
cf.SetText("Male")
cf.SetChecked(false)
c.Draw(tf)   // Draws at current position

cf = form.NextCheckboxField("checkbox_female")
cf.SetText("Female")
cf.SetChecked(false)
c.Draw(tf)   // Draws at current position

The drawing does not involve adding anything to the page contentstream, but rather creating Fields in the AcroForms (that refer the page) and the actual content goes into widget annotations. The annotations should also be added to the page dictionaries Annots array.

For arranging fields it would in many cases make sense to place inside a Table to arrange nicely over the page.

Render PDF pages (PDF to image)

Implement a renderer for PDF pages which can be used to render pages to images.

Based on the ContentStreamProcessor
Should use a 2D rendering library such as https://github.com/fogleman/gg

Can be implemented in a few steps/milestones:

Render images.
Render shapes
Render text

The final step will be the most challenging, however, we are already building a strong foundation for font and text support which makes it possible.

Prototype code for rendering images/shapes exists in: https://github.com/unidoc/unipdf-examples/blob/v3-render-support/pdf/render/pdf_render.go

The rendering should be implemented as a package (renderer) inside unipdf. The prototype code could used as a base and refactored into a package.

For rendering text it might make sense to start by using fonts that are available on the system or fixed local fonts. Typically PDF viewers rely on the system fonts, as well as loading embedded fonts.

Support for Type3 fonts

Currently text extraction fails on some text using this font type. Need to add support for it to properly work for extraction.

Optimize issue: Failed creating multi encoder: invalid filter in multi filter array

Problem file: TheCaseStudyMethod.pdf

Options:

unioptimize.Options{
    CombineDuplicateDirectObjects:   true,
    CombineIdenticalIndirectObjects: true,
    ImageUpperPPI: 100.0,
    CombineDuplicateStreams:         true,
    CompressStreams:                 true,
    UseObjectStreams:                true,
    ImageQuality:                    80,
}

When optimizing get:

[ERROR]  encoding.go:1855 Unsupported filter CCITTFaxDecode
[ERROR]  stream.go:42 Failed creating multi encoder: invalid filter in multi filter array

Seems like the CCITTFaxDecode filter is not ignored during compression when inside a multi filter encoder.

Package/type naming

Currently we have our core/... packages that are truly core to unipdf as it can be imported anywhere, should not rely on any other package (except internal utility packages).
It defines all the primitive types:

core.PdfObject
core.PdfIndirectObject
core.PdfObjectDictionary
core.PdfObjectArray

Would it be nicer to have

core.Object
core.IndirectObject
core.Dictionary or pdfcore.Dict
core.Array
core.String

? or is that not specific enough, maybe

pdfcore.Object
pdfcore.Dictionary or pdfcore.Dict
pdfcore.Array
pdfcore.String

etc. ?

Similarly for model package, there are some pretty lengthy names:

model.PdfPage
model.PdfPageResourcesColorspaces
model.PdfColorspaceDeviceNAttributes

Clearly the name space in PDF models is pretty huge, however it might be possible to improve here. What about

pdfmodel.Page
pdfmodel.ResourceColorspace
pdfmodel.ColorspaceDeviceNAttributes

pdf.Page
pdf.ResourceColorspace
pdf.ColorspaceDeviceNAttributes

Would be interesting to get some input on this. We are always looking on ways to improve the internals, although it can take time and would obviously not appear until in a future major version.

Vectorized PDF text and object extraction

This issue is a master issue/epic and can lead to subissues that will be referenced from here.

Proposal

The extractor package will have the capability to extract vectorized text and objects (with position and dimensions).

Goal: Extract a list of graphics objects from each PDF page.

There are three types of graphics objects:

text
path (a PDF path that has been stroked or filled)
image

Each of these objects has a

bounding box in device coordinates
color
rendering mode (fill, stroke, clip or some combination of these)
content (e.g. text)
optionally other properties
transparency?

This is not a rendering system but we hope to design it in a way that will allow it to be extended to become a renderer. Initial versions of the renderer could convert the lists of graphics objects to PDF or PostScript pages. This would provide closed-loop tests.

Definitions

text: Text objects and operators. The text operators specify the glyphs to be painted, represented by string objects whose values shall be interpreted as sequences of character codes. A text object encloses a sequence of text operators and associated parameters. (page 237)
Paragraph fragments are the largest substrings in text paragraphs that are rendered contiguously on a PDF page. If a paragraph is split between pages or columns then the parts of the paragraph that appear at the end of the first page / column and the start of the second page / column are paragraph fragments. When a paragraph fits entirely within a single column and page, the entire paragraph is a paragraph fragment.

There are at least three levels of text objects, all of which are composed of lower level (lower numbered in the following list) objects.

Text elements emitted by the renderer as a result of PDF text operators like Tj.
a. A text element’s properties include the text content, location and size in device coordinates, font etc
b Text elements can be used to recreate the text as it appears on the page
Paragraph fragments are created from the text elements on a page. Each paragraph fragment occupies a contiguous region on a single page.
a. Paragraph fragments include the start of a paragraph that is completed on the following page / column, captions, form field labels, footnotes, etc
b. The paragraph fragments in a page can be used to make inferences about the page.
Paragraphs are created from the paragraph fragments
a. Paragraphs can be used to create extract the text of a PDF in plain text format

path: A path is made up of one or more disconnected subpaths, each comprising a sequence of connected segments. (page 131)

Initially we will only concern ourselves with stroked and filled paths and ignore clipping paths

// Path can define shapes, trajectories and regions of all sorts. Used to draw lines and define shapes of filled areas.
type Path struct {
	segments []lineSegments
}

// Only export if deemed necessary for outside access.
// For connected subpaths (segments), the x1, y1 coordinate will start at x2, y2 coordinate of the previous segment.
type lineSegment struct {
	isCurved bool  // Bezier curve if true, otherwise line
	x1, y1 float64
	x2, y2 float64
	cx, cy float64 // Control point (if curved)

	isNoop bool	// Path ended without filling/stroking.
	isStroked bool
	strokeColor model.PdfColor
	isFilled bool
	fillColor model.PdfColor
	fillRule windingTypeRule
}

type windingNumberRule int
const (
	nonZeroWindingNumberRule windingNumberRule = iota
	evenOddWindingNumberRule 
)

image. A sampled image (or just image for short) is a rectangular array of sample values, each representing a colour. (page 203)

This should include inline images, XObject images, possibly some shadings etc. UniDoc already has a pretty good framework for this.

API ideas

Extractor

func (e *Extractor) GraphicsObjects() []GraphicsObject

type GraphicsObject interface {
// What do graphics objects have in common, or what common operations can be applied to them?
// Possibly make into a struct rather than an interface and convert to an interface if we think it makes sense.
}

Rendering Interface Ideas
Renderers may need access to the graphics context to render each graphics object.
Imagine a callback to emit graphics objects to a renderer (or other caller).

func render(o GraphicsObject, gs GraphicsState)

The rendering would be over all graphics objects on a page in the order they occur. This would be driven by a single processor.AddHandler() that could be configured to emit any combination of text, shape, and image objects.

func renderCore(doText, doShapes, doImages bool, render Renderer)

or rendering context/state rather than doX...

Use cases

Potential use cases that should be possible to base on this implementation:

Find text/shapes/images within a specified area.
Remove/redact text/shapes/images within a specified area.
Characterize headings, normal text.
Detect tables and inner contents
Detect mathematical formulas
PDF to markdown conversion: Requires basic heading detection, text style, tables
PDF to word/excel: Requires advanced detection of detailed features to reproduce in oxml.

Going from the primitive contentstream operands to a higher level representation, there is a need to have a connection from the higher level representation to the lower level. For example if removing content, may need to filter on a higher level basis but have a connection down to the primitive operands to actually filter those out.

There may be a cascade/sequence of processing operations, initially on the primitive operands, for example grouping.

It should be clear whether those processes are lossy or lossless, where lossless would mean that they could reproduce the exact same operands as originally and same look. Lossy would mean that some aspect was lost, for example if grouping text together, character spacing/kerning info could be lost.

Preferably all processing would have the capability to be lossless, but it remains to be seen whether that is practical.

Performance: Adding paragraphs

With the following code, I'm seeing about 40-45ms per "paragraph" addition:

        p := creator.NewParagraph(content)
        p.SetFont(fonts.NewFontCourier())
        p.SetFontSize(fontSize)
        p.SetPos(xPos, yPos)
        err = c.Draw(p)

For building a page from small elements (50-100 elements) for multi-page documents, this can get very time-consuming. It looks as thought the performance issue is the Draw() method:

Function	Execution Time
NewParagraph	2.341µs
SetFont/Pos	455ns
Draw	38.043205ms

Creator: NewBlockFromPage should flatten Page rotations

Description

Currently loading a page that has page.Rotate with an angle of 90,180,270 (not 0): The page contents are loaded into the block and the rotation not accounted for.
Thus drawing the block onto a page, things will normally appear as incorrect, i.e. rotated.

Current behavior

When loading a page NewBlockFromPage does not account for page's Rotate flag.

Expected behavior

NewBlockFromPage accounts for the page's rotate flag and rotates the contents accordingly. Page size is also adjusted based on rotation, for example 90 degrees, original width and height are swapped.
When the block is added to a new page, the contents appear in the right orientation (although the new page has Rotate flag not set corresponding to 0).

Proposed change

Change NewBlockFromPage to rotate the contents based on the rotate flag and take rotation into block size.

The changes are non-breaking from a programmatic endpoint, but can break code that is already doing this rotation manually, easy to account for that though although one should be aware of it.

Support ICC colorspace

Implement a package to interpret and utilize ICC based colorspaces.

See section 8.6.5.5 ICCBased Colour Spaces (PDF32000_2008):

ICCBased colour spaces shall be based on a cross-platform colour profile as defined by the
International Color Consortium (ICC)... an ICCBased colour space shall be characterized by
a sequence of bytes in a standard format. Details of the profile format can be found in the
ICC specification.

There are multiple versions of the ICC specification that are supported in PDF as
shown in Table 67 (PDF32000_2008). However, it also says that

A conforming reader shall support ICC.1:2004:10 as required by PDF 1.7, which will enable it to properly render all embedded ICC profiles regardless of PDF version.
Thus it should be enough to support that version or newer if it exists.

Notes

Should be designed such that it can be used within unidoc for image processing
The current implementation uses the alternative colorspaces that are provided, so this will be a good improvement for color handling.

unidoc / unipdf Goto Github PK

unipdf's Introduction

UniPDF - PDF for Go

Features

Installation

License key

How can I convince myself and my boss to buy unipdf rather using a free alternative?

Contributing

Go Version Compatibility

Support and consulting

License agreement

unipdf's People

Contributors

Stargazers

Watchers

Forkers

unipdf's Issues

Black in background

Other issues

Current state

Proposed change

Implementation

Notes

Implementation

Notes

Current state

Proposed changes

Expected results

References

Current state

Proposal

Proposal

Definitions

API ideas

Use cases

Description

Current behavior

Expected behavior

Proposed change

Implement a package to interpret and utilize ICC based colorspaces.

Notes

Recommend Projects

Recommend Topics

Recommend Org