danfickle / openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!

Home Page: https://danfickle.github.io/pdf-templates/index.html

License: Other

Batchfile 0.01% HTML 37.14% Java 60.13% Lex 0.11% CSS 2.17% FreeMarker 0.45%

java html css pdf pdfbox svg accessibility pdf-generation

openhtmltopdf's Introduction

OPEN HTML TO PDF

OVERVIEW

Open HTML to PDF is a pure-Java library for rendering a reasonable subset of well-formed XML/XHTML (and even some HTML5) using CSS 2.1 (and later standards) for layout and formatting, outputting to PDF or images.

Use this library to generate nice-looking PDF documents. But be aware that you cannot throw modern HTML5+ at this engine and expect a great result. You must specially craft the HTML document for this library and use its extended CSS features, like #31 or #32, to get good results. Avoid floats near page breaks and use table layouts.

GETTING STARTED

Integration guide - get maven artifacts and code to get started.
1.0.10 Online Sandbox - Now with logs!
Templates for Openhtmltopdf - MIT licensed templates that work with this project. Updated 2021-09-21.
Showcase Document - PDF
Documentation wiki
Template Author Guide - PDF - DEPRECATED - Prefer wiki - Moving info to wiki
Sample Project - Pretty Resume Generator

DIFFERENCES WITH FLYING SAUCER

Uses the well-maintained and open-source (LGPL compatible) PDFBOX as PDF library, rather than iText.
Proper support for generating accessible PDFs (Section 508, PDF/UA, WCAG 2.0).
Proper support for generating PDF/A standards compliant PDFs.
New, faster renderer means this project can be several times faster for very large documents.
Better support for CSS3 transforms.
Automatic visual regression testing of PDFs, with many end-to-end tests.
Ability to insert pages for cut-off content.
Built-in plugins for SVG and MathML.
Font fallback support.
Limited support for RTL and bi-directional documents.
On the negative side, no support for OpenType fonts.
Footnote support.
Much more. See changelog below.

LICENSE

Open HTML to PDF is distributed under the LGPL. Open HTML to PDF itself is licensed under the GNU Lesser General Public License, version 2.1 or later, available at http://www.gnu.org/copyleft/lesser.html. You can use Open HTML to PDF in any way and for any purpose you want as long as you respect the terms of the license. A copy of the LGPL license is included as license-lgpl-2.1.txt or license-lgpl-3.txt in our distributions and in our source tree.

An exception to this is the pdf-a testing module, which is licensed under the GPL. This module is not distributed to Maven Central and is for testing only.

Open HTML to PDF uses a couple of FOSS packages to get the job done. A list of these can be found in the dependency graph.

CREDITS

Open HTML to PDF is based on Flying-saucer. Credit goes to the contributors of that project. Code from neoFlyingSaucer will also be used.

FAQ

OPEN HTML TO PDF is tested with OpenJDK 8, 11 and 17 (early access). It requires at least Java 8 to run.
No, you can not use it on Android.
You should be able to use it on Google App Engine (Java 8 or greater environment). Let us know your experience.
~~Flowing columns are not implemented.~~ Implemented in RC12.
No, it's not a web browser. Specifically, it does not run JavaScript or implement many modern standards such as flex and grid layout.

TEST CASES

Test cases, failing or working, are welcome. Please place them in /openhtmltopdf-examples/src/main/resources/testcases/ and run them from /openhtmltopdf-examples/src/main/java/com/openhtmltopdf/testcases/TestcaseRunner.java.

CHANGELOG

head - 1.0.11-SNAPSHOT

See commit log.

1.0.10 (2021-September-13)

NOTE: After this release the old slow renderer will be deleted. Fast mode has been the default (since 1.0.5) so you only have to check your code if you are calling the useSlowMode method which will be removed.

#551 SECURITY Fix near-infinite loop for very deeply nested content with page-break-inside: avoid constraint. Thanks for persisting @swillis12 and debugging @syjer.
#729 SECURITY Upgrade xmlgraphics-commons (used in SVG rendering) to avoid CVE. Thanks @electrofLy.
#711 Footnote support (beta). See footnote documentation on wiki. Thanks for requesting @a-leithner and @slumki.
#761 CSS property to disable bevels on borders to prevent ugly anti-aliasing effects, especially on table cells. See -fs-border-rendering property on wiki. Thanks for providing sample @gandboy91.
#103 Output exception class name and message by default for log messages with an associated exception.
#711 (mixed) Better boxing for ::before and ::after content. Should now be able to define a border around pseudo content correctly.
#738 Support for additional elements in PDF/UA including art, part, sect, section, caption and blockquote. Thanks @AndreasJacobsen.
#736 New example of using a dom mutator to implement unsupported content such as font tag attributes. Thanks for requesting @mgabhishek06kodur.
#707 Fix regression where PDF/UA documents that weren't also PDF/A compliant were missing Dublin Core metadata. Thanks @mgm-rwagner, @syjer.
#732 Allow table element to be positioned. Thanks @fcorneli.
#727 Allow the use of an initial page number for page and pages counters. Thanks for PR @fanthos.

1.0.9 (2021-June-18)

SECURITY RELEASE: This release was brought forward due to security releases of the PDFBOX and Batik dependencies.

#722 Upgrade PDFBOX (to 2.0.24) - avoids CVEs in earlier versions and PDFBoxGraphics2D. Thanks a lot @rototor.
#678 Upgrade Batik Version to 1.14 (CVE-2020-11987) - Again it is strongly advised to avoid untrusted SVG and XML. Thanks @rototor.
#716 Replace rogue println calls with log calls. Thanks @syjer for PR, @tfo for reporting.
#708 Allow shape-rendering SVG CSS property. Thanks @syjer for PR, @RAlfoeldi for reporting.
#703 Remove calls to deprecated methods in JRE standard library. May change XML reader class. Implemented by @danfickle.
#702 Set timeouts for default HTTP/HTTPS handlers. Thanks for reporting @gengzi.
162228 Put links to raster images in SVGs through the URL resolver.
#694 Fix incorrect B3 paper size. Thanks @lfintalan for reporting with line number!
ab48fd Do not log a missing font more than once.

NOTE: PDFBOX CVEs relate to the loading of untrusted PDFs in PDFBOX and thus this project is not directly affected. However, it is not a good idea to have CVEs on your classpath.

1.0.8 (2021-March-22)

SECURITY RELEASE

#675 Update PDFBOX to 2.0.23 to avoid CVEs. Thanks for reporting @Samuel3.

NOTE: These CVEs relate to the loading of untrusted PDFs in PDFBOX and thus this project is not directly affected. However, it is not a good idea to have CVEs on your classpath.

1.0.7 (2021-March-19)

#650 Support for multiple background images on the one element. Thanks for requesting @baedorf.
#669 Support fallback fonts. Thanks for requesting @asu2 and assisting @draco1023.
#640 Implement file embeds via the download attribute on links. Thanks for original PR @syjer and for requesting @lindamarieb and @vader.
#666 API to get the bottom-most y position of rendered content to be able to position follow on content with other tools. Thanks for extensive reviewing of PR @stechio and for request by @DSW-AK.
#664 Improved support for PDF/A and PDF/UA standards. Thanks for PR @qligier.
#653 Fix for inline-block elements with a z-index or transform being output twice. Thanks for reporting @hannes123bsi.
#655 Correct layout of ordered lists in RTL direction. Thanks for PR @johnnyaug.
#658 Implement target-text function for content property. Thanks for PR @BenjaminVega.
#647 Fix race condition in setting up logger in multi-threaded environments. Thanks for PR @syjer.
#638 Ability to plug-in external resource control based on resource type and url. Thanks for original PR @syjer.
#628 Use enhanced image embedding methods from PDF-BOX. Thanks for PR @rototor and your work in PDF-BOX implementing this.
#627 Fix regression where a null font style was causing NPE. Thanks for PR @rototor.
#338 Implement read-only radio button group. Thanks for investigating, reporting and patience @ThoSchCon, @aleks-shbln, @dmitry-weirdo, @syjer and @paulito-bandito.

1.0.6 (2020-December-22)

IMPORTANT: #615 This is a bug fix release for an endless loop issue when using break-word with floating elements with a top/bottom margin.

#624 Update PDFBOX to 2.0.22 and pdfbox-graphics2d to 0.30. Thanks @rototor.
#467 Prevent possibility of CSS import loop.
#621 Allow spaces in data uris. Thanks @syjer.

1.0.5 (2020-November-30)

SECURITY: #609 Updates Apache Batik SVG renderer to latest version to avoid security issue. If you are using this project to render untrusted SVGs (advised against), you should update immediately. Thanks a lot @halvorbmundal.

IMPORTANT: The fast renderer is now the default in preparation of removing the old slow renderer. To temporarily use the slow renderer, you can call the deprecated method builder.useSlowMode() (PDF output only).

IMPORTANT: #543 This version stays on PDFBOX version 2.0.20 due to a bug with non-breaking spaces in version 2.0.21. Please make sure version 2.0.21 is not on your classpath. This bug has been fixed in the upcoming 2.0.22.

#544 Code to create a website for pre-canned PDF templates in thymeleaf and raw XHTML format. Check out the template website to preview templates.
#533 Barcode plugin. Very useful PR supplied by @syjer. Barcode plugin docs.
#521 Move Java2D image output to fast renderer and general improvements. Java2D image output docs.
9ffd0e #568 Filter out problematic characters that are visible in some fonts but should not be such as soft-hyphen. Thanks @StephanSchrader.
#587 Fix for white-space: nowrap cutting off instead of wrapping. Thanks @vipcxj for finally fixing via PR.
#577 Add foreground PDF drawer plugin (useful especially for watermarks). Thanks @rototor for PR and @sillen102 for persisting.
#566 Rename baseUri arg to baseDocumentUri and improve javadoc to avoid confusion. Thanks for reporting @NehalDamania.
801780 Update junit test dependency to 4.13.1 to avoid security scanner warnings (the specific security problem did not impact this library).
#553 Fix for ContentLimitContainer causing NPEs when negative margins are used. Thanks for reporting @adilxoxo.
#552 Optimize the log formatter for j.u.l logging. Thanks for the impressive PR @syjer.
#542 Improve list-decoration placement. Thanks for PR @syjer and reporting @mndzielski.
#458 Fix for list-decorations being output (clipped) in page margin area.
#525 Remove unused schema/DTDs. Significantly reduces size of jar. Thanks for PR @syjer.
#592 Allow unit (px, cm, em, etc) values in the width/height attributes of linked SVG images. Thanks @DanielWulfert.
#594 #458 Fix for more repeating content and PDF/UA crash. Thanks @ThomHurks, @fungc.
#599 Fix RuntimeException occurring on InlineText.setSubstring. Thanks @LAlves91.
#605 Fix to make justification work with surrogate pairs. Thanks @EmanuelCozariz.
#601 Move CI to Github actions. Thanks @syjer.
#597 Generalize data uri support. Thanks @syjer, @Leostat86.
#613 Allow adding fonts for SVG, MathML as files instead of input streams to avoid JDK bug. Thanks @syjer, @sureshkumar-ramalingam, @olayinkasf.

1.0.4 (2020-July-25)

b88538 Fix for endless loop when using word-wrap: break-word. Thanks for reporting, testing and investigating @swarl. Thanks for tests and debugging @rototor and @syjer.
#492 Lots of testing of the line-breaking algorithm to avoid future endless loops. By @danfickle.
#515 Pass document CSS styles applying to SVG element to SVG implementation. Thanks for requesting and contributing @amckain92.
#514 FIX: Correctly position boxes when justifying rtl lines. Thanks for reporting and testing @lzhy1101.
#512 #507 #502 Cleanup code including deleting unused code, generics, etc. Thanks for PRs @syjer.
#489 Extensive overhaul of logging including per run diagnostic consumer. Huge thanks @syjer, a lot of work in this PR. See logging page on wiki for more info.
#501 Upgrade PDFBOX to 2.0.20 and PDFBox-Graphics2D to 0.26. Thanks for PR @rototor.
#490 Fix for NPE when decoding image data url fails. Thanks for PR @syjer and reporting @AlexisCothenet.
#516 Add OSGI bundle metadata to MANIFEST.MFs. Thanks for requesting and investigating @zspitzer.

1.0.3 (2020-May-25)

IMPORTANT: This release contains fixes for two bugs that may result in endless loops/denial of service when using word-wrap: break-word. If you are using this feature, please upgrade promptly.
#483 Fix for endless loop bug with word-wrap: break-word and soft hyphens. Thanks @rototor for PR, @syjer for analysis and @swarl for reporting.
#466 Fix for endless loop bug with word-wrap: break-word and zero width boxes. Thanks @syjer for analysis and @AlexisCothenet for reporting.
#486 SVG plugin can now provide a list of allowed protocols for external resources and any configured uri resolver/stream handlers will be used. Thanks @syjer for PR and @ieugen for reporting.
#480 Fix for link shapes being returned from custom object drawers. Thanks @rototor for PR and @hbergmey for reporting.
#485 Implement support for SVG data uris. Thanks @syjer for PR and @adrianrodfer for reporting.
#470 Allow mailto: links or any other valid link. Thanks @syjer for PR and @mndzielski for reporting.
#464 Honor the direction CSS property. Thanks @AnanasPizza for reporting.
#460 Change thrown exception class to more specific IOException. Thanks for PR @leonorader.
#459 Implement the rem CSS unit. Thanks to @leonorader for reporting.
#211 Images can now be used in the CSS content property. Thanks for requesting @Kuhlware.
#445 Fix for not picking up attribute values in Jsoup converted documents. Thanks for reporting @testinfected.
#450 Java2D output only: Ability to add fonts via code. Also environment fonts will no longer be used by default. To use environment fonts: builder.useEnvironmentFonts(true).

1.0.2 (2020-February-25)

SECURITY Removed Log4J 1.x adaptor as it had CVE-2019-17571 with no updated version available.
#448 Implement linear-gradient support for background-image property. By @danfickle. Requested by @rja907.
#429 Major overhaul of word-wrap: break-word. Now a word will not be broken unless it is too big for a line by itself. By @danfickle. Thanks for reporting and testing @mndzielski.
#433 Do not justify lines ending with <br/> tag. Thanks for reporting @fcorneli.
#440 Remove trailing white space for right aligned text to avoid jagged appearance. Thanks for reporting @AnanasPizza.
#446 Look for lang attribute on ancestor elements when using lang() selector. Thanks for reporting and tracking down the bug @fungc.
#430 Use relative path to license in source jars instead of absolute path. Thanks for reporting @gabro and fixing via PR @syjer.
#417 Keep aspect ratio of images with width/height properties as well as min/max width/height properties. Thanks for reporting and basis for fix @swarl.
#423 Allow multiple font sources to be specified with format tags. Only use format(truetype). Thanks for requesting @MichaelZaleskovsky and basis for implementation @syjer.
#415 Avoid class cast exception if user tries to float table cell. Thanks for reporting @dmartineau99 and PR @syjer.
#421 Avoid NPE when justified text is mixed with unjustifiable content. Thanks for reporting @Megingjard and PR @syjer.
Updated PDFBOX 2.0.17 to 2.0.19.

1.0.1 (2019-November-18)

#413 Handle form problems such as no name on input element without throwing a NPE. Thanks @syjer for PR and @mmatecki for reporting.
#412 Add HTML block level elements such as section to default CSS. Thanks @syjer.
#339 Remove the JSoup to DOM converter module. Thanks @kewilson.
0cd098 Fix for letter-spacing support on last line of block with trailing space. Also performance improvements and refactoring. By @danfickle.
#410 Fix for wrong bold setting on list item counters. Thanks @syjer for PR fix (and test!) and @acieplinski for reporting.
Wiki Configurable text justification settings as part of a justification overhaul that also allows more space to be used inter-char when there are no spaces on the line. By @danfickle. Commits listed in #403.
#403 Soft hyphen support. Soft hyphens are now replaced with hard hyphens when used as line ending character. Thanks @sbrunecker.
#408 Fix for bookmarks not working with HTML5 parsers such as JSoup. Thanks @syjer for investigating and fixing and @Milchreis for reporting.
#404 Upgrade Batik to 1.12 and xmlgraphics-common to 2.4 (both used in SVG module) to avoid CVE in one or both. Thanks @avoiculet.
#396 Much faster rendering of boxes using border-radius properties. Thanks @mndzielski.
#400 Support for lang and title attributes and abbr tag for accessible PDFs. Thanks @Ignaciort91.
#394, #395 Upgrade PDFBOX to 2.0.17 and pdfbox-graphics2d to 0.25. Thanks @cristan, @rototor.
#384 Allow user to provide PDFont supplier. Thanks @DSW-PS.
#373 Fix regression where both max-width and max-height are provided for images with certain aspect ratios. Thanks @rototor.
#380 Much better support for flowing columns including explicit column breaks, floating content, block level nested content. By @danfickle.

1.0.0 (2019-July-23)

#372 Much improved sizing support for img, svg and math elements.
#344 Use PDFs in img tag: <img src="some.pdf" page="1" alt="Some alt text" />.

OLDER RELEASES

View CHANGELOG.md.

openhtmltopdf's Issues

Support for setting form field size

Please add support for allowing the width and height of form field elements to be set.

does this project support css3

I want to use this library, but I don't know whether it supports CSS3 or AngularJS.

Add PDF properties like owner, security...

Is it possible to add PDF properties ?
I mean properties like owner, title, subject, security...

CSS3 / Please add support for "text-overflow: ellipsis;"

Well it would be really nice if possible :)

Implement cleaner builder API for PDF rendering.

Issue Running on JBoss EAP 6.1

I have an issue running on JBoss EAP 6.1 which seems to be down to some kind of Xalan / XML parser conflict. I note that the line in question has a //FIXME comment so wonder if you can advise of any workaround. My code is running fine on Jetty.

EAP stacktrace:

JBWEB000236: Servlet.service() for servlet dispatcher threw exception: java.lang.ClassNotFoundException: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl from [Module "deployment.edental-web-pars.war:main" from Service Module Loader]
	at org.jboss.modules.ModuleClassLoader.findClass(ModuleClassLoader.java:213) [jboss-modules.jar:1.3.7.Final-redhat-1]
	at org.jboss.modules.ConcurrentClassLoader.performLoadClassUnchecked(ConcurrentClassLoader.java:459) [jboss-modules.jar:1.3.7.Final-redhat-1]
	at org.jboss.modules.ConcurrentClassLoader.performLoadClassChecked(ConcurrentClassLoader.java:408) [jboss-modules.jar:1.3.7.Final-redhat-1]
	at org.jboss.modules.ConcurrentClassLoader.performLoadClass(ConcurrentClassLoader.java:389) [jboss-modules.jar:1.3.7.Final-redhat-1]
	at org.jboss.modules.ConcurrentClassLoader.loadClass(ConcurrentClassLoader.java:134) [jboss-modules.jar:1.3.7.Final-redhat-1]
	at java.lang.Class.forName0(Native Method) [rt.jar:1.8.0_102]
	at java.lang.Class.forName(Class.java:348) [rt.jar:1.8.0_102]
	at javax.xml.transform.FactoryFinder.getProviderClass(FactoryFinder.java:116) [rt.jar:1.8.0_102]
	at javax.xml.transform.FactoryFinder.newInstance(FactoryFinder.java:169) [rt.jar:1.8.0_102]
	at javax.xml.transform.TransformerFactory.newInstance(TransformerFactory.java:152) [rt.jar:1.8.0_102]
	at com.openhtmltopdf.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:216) [openhtmltopdf-core-0.0.1-RC8.jar:]

which is triggered by the following code:

try {
	// FIXME:
	// Currently, we have to do this as the user may have an older version of xalan on their classpath which would be
	// used otherwise.
	xformFactory = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null);
} catch(Exception e) {
	XRLog.load(Level.SEVERE, "Could not load preferred XML transformer, using default which may not be secure.");
	xformFactory = TransformerFactory.newInstance();
}

As TransformerFactory.newInstance throws an Error (TransformerFactoryConfigurationError) rather than an Exception, it looks like your catch block does not execute.

    public static TransformerFactory newInstance(String factoryClassName, ClassLoader classLoader)
        throws TransformerFactoryConfigurationError{

        //do not fallback if given classloader can't find the class, throw exception
        return  FactoryFinder.newInstance(TransformerFactory.class,
                    factoryClassName, classLoader, false, false);
    }
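
The Error-versus-Exception point can be sketched with the JDK alone. This is only an illustration of the failure mode, not the project's actual fix; the class name `TransformerFactoryFallback` and the method `create` are made up for this example.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

// Hypothetical helper, not part of openhtmltopdf.
final class TransformerFactoryFallback {
    /**
     * Tries the preferred factory class first and falls back to the
     * platform default when that class cannot be loaded.
     */
    static TransformerFactory create(String preferredClassName) {
        try {
            return TransformerFactory.newInstance(preferredClassName, null);
        } catch (TransformerFactoryConfigurationError e) {
            // This is an Error, not an Exception, so catch (Exception e)
            // would let it propagate and the fallback would never run.
            return TransformerFactory.newInstance();
        }
    }
}
```

Calling `TransformerFactoryFallback.create("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl")` on a container that hides that class should then return whatever default factory the container provides instead of failing the request.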

I don't know much about these things, but this seems to be because JBoss includes specific versions of Xalan and Xerces as modules. Drilling down into the modules directory I can find a file named javax.xml.transform.TransformerFactory with the following content:

org.apache.xalan.processor.TransformerFactoryImpl

I have tried excluding these modules via the jboss-deployment-descriptor and bundling a javax.xml.transform.TransformerFactory file with the application but with no success.

Risks if generating PDF from user supplied HTML?

Hi Daniel,

I want to use openhtmltopdf to generate PDFs that are partially based on user supplied HTML on a corporate website. I'm wondering if you could comment about any security issues that might raise, as it has the security team raising their eyebrows.

For example it might be possible there is a bug in the image rendering code that allows code to be executed on the server if some special image is supplied to it. Personally I think the scenario is very unlikely but I'd like to hear your take on it. The image rendering code is part of the PDFBox library isn't it? If that is pure java then it's not really vulnerable to stack buffer overflow attack is it? To me the probability of such an attack is exceedingly small but I have to convince the security team and I don't have any real knowledge of how that part of the library is implemented.

Are you aware of any other risks and how they can be mitigated when rendering user supplied HTML?

Thanks in advance
Martin

Support font fallback

This is critical for multi script support where text may be in two or more languages but a font only supports one script.

@willamette has provided an implementation for Flying-saucer which we can base our implementation on.

Related issue #9 (RTL support).

Make slash non-breaking.

In the following example, Firefox and Chromium refuse to break within any of the "test/test" sequences.

openhtmltopdf puts a break between "test" and "/test" (having the / on the next line), as shown here.

<!DOCTYPE html>
<html>
    <head>
        <title></title>
    </head>
    <body>
        <p style="font-size: 5em;">
            test/test test/test test/test
        </p>
    </body>
</html>

I believe it would be best if slash ("/") were considered to be a non-breaking character as other browsers seem to do.

Rename sub-modules and packages with openhtmltopdf

Add support for fillable form elements (basic features)

This issue is limited to "basic features", as more advanced features and elements (like drop-downs or definition of text content types) are covered in a separate issue.

As in the current flying saucer version, input elements should not only be rendered, but also be created as AcroForm elements in the resulting PDF (note that the implementation in flying saucer has some limitations though - e.g. radioboxes are not yet implemented, and css styles are a bit quirky still). In a first implementation, the following elements should be considered:

HTML <input type="text" name="..." value="..."> : a text form field should be rendered, one line only. If a value is provided in the value attribute, it should be pre-filled with that value. The name attribute should be taken as value for the "name" property of the form field.
HTML <textarea name="..." value="..."> : a multi-line text form field should be rendered. The height of the form field should be created based on the css style. If no "height" style can be determined from the css, then it should be calculated based on the rows attribute value, multiplied by the line height and paddings / line spacings. If the <textarea> element has a child node, the form field should be pre-filled with the text content of that child node. The name attribute should be taken as value for the "name" property of the form field.
HTML <input type="radio" name="..." value="..." checked="true"> : a radio form field should be rendered. The value from the _id attribute should be taken as group-id for the radio boxes. The "name" property of the form field should be composed out of the name attribute, extended by a "." and the running number of the element within the group (e.g. if the name is "mysection.myoptions" then the form element name of the first radio button with this name will be "mysection.myoptions.1"). The "option" value of the form element is to be taken directly from the value attribute of the <input> element. The option should be selected by default, if the checked attribute in the <input> element is set respectively.
HTML <input type="checkbox" name="..." value="..."> : a checkbox form field should be created. The name attribute should be taken as value for the "name" property of the form field. The form field value should be taken over from the value property of the <input> element. The checkbox should be checked by default, if the checked attribute in the <input> element is set respectively.

For all the above elements, basic styling for borders (width, color), background-color, inner padding to content and font settings should be applied (for checkboxes and radio buttons, the dot/check sign is rendered in the font-color - so even there it does count)
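
The radio button naming rule proposed above (name attribute, a ".", then the running number within the group) could look roughly like this. It is a sketch of the proposal only; `RadioFieldNames` and `fieldNames` are hypothetical names and this is not how the library actually names its fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the proposed naming scheme.
final class RadioFieldNames {
    /**
     * Takes the name attributes of radio inputs in document order and
     * returns one field name per input: the HTML name, a ".", and a
     * 1-based running number within the inputs sharing that name.
     */
    static List<String> fieldNames(List<String> htmlNames) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String name : htmlNames) {
            // merge returns the updated count for this group.
            int index = seen.merge(name, 1, Integer::sum);
            result.add(name + "." + index);
        }
        return result;
    }
}
```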

Add HTML5 to DOM converter using Jsoup

Implementation code available from neoFlyingSaucer.

Remove ITEXT output devices.

With PDF-BOX 2, ITEXT is no longer required.

Spacing ignored when surrounded by <b> tag

The following causes the leading spaces to be dropped from the rendered output:

<b>    Text</b><br />

Whereas the following renders fine:

    <b>Text</b><br />

Implement text justification in PDF-BOX output device

As of PDF-BOX 2.0.0-RC3 there is no way to implement text justification. Need to open PDF-BOX JIRA issue asking them to implement the text spacing and space spacing operators.

Implement CSS3 transform

I've started implementing CSS3 transform. My primary goal is to get rotate() working. The rest should be relatively easy, but I primarily need rotate(90deg).

You can view my current development here https://github.com/rototor/openhtmltopdf/tree/css3-transform-implementation

CSS parsing should be mostly done, but it is not yet applied on the elements. I'll be away for a week on holidays, but will continue to implement this afterwards.

Note, related to #23 I implemented saveState() and restoreState() - also for Java2D.

Implement pluggable http stream factory in user agent.

Allows users to use an external http library without rewriting an entire user agent callback.

Locale independence

We don't want the situation where it works on the developer's machine and not on the server or vice versa because of different locales.
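
A small illustration of the kind of locale dependence meant here, assuming the usual culprits are case conversion and number formatting. The helper names are made up for the example and do not refer to code in the project.

```java
import java.util.Locale;

// Hypothetical helpers; the names are illustrative only.
final class LocaleSafeFormatting {
    /** Lower-cases a keyword (for example a CSS identifier) identically on every machine. */
    static String normalizeKeyword(String keyword) {
        // Without an explicit locale, "ITALIC" becomes "ıtalıc" under a Turkish default locale.
        return keyword.toLowerCase(Locale.ROOT);
    }

    /** Formats a number with '.' as decimal separator regardless of the default locale. */
    static String formatLength(double value) {
        // Without an explicit locale, a German default locale would produce "1,50".
        return String.format(Locale.ROOT, "%.2f", value);
    }
}
```

Passing `Locale.ROOT` explicitly is one common way to make output independent of the server's configuration; a fixed locale such as `Locale.US` would serve the same purpose.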

Form field names incorrect in resulting PDF

The PDField partial name is being set to an internally generated name like "OpenHTMLCtrl1" and only the mapping name is being set to the name in the HTML. This causes lookup from PDAcroForm.getField(String) to fail when looking up by the name in the HTML because the mapping name is not used in that lookup.

Adobe Reader prompts for saving

Trying to close generated PDF in Adobe Reader gives Do you want to save changes to 'some.pdf' before closing? dialog.
Tested with rc7 and rc8. Worked fine with rc4.

Found this post.

char != code point != glyph

There are several places in the code where they are treated as equal, such as the first letter breaker, title case transformer, etc. As far as possible we should use ICU4J when the rtl-support module is available.
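
To make the distinction concrete, here is a sketch using only the JDK (its `BreakIterator` rather than ICU4J, which may segment some sequences differently). Glyphs are a font-level concept and are not covered; `TextUnits` and its methods are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper; uses the JDK BreakIterator, not ICU4J.
final class TextUnits {
    /** UTF-16 code units, which is what String.length() reports. */
    static int chars(String s) {
        return s.length();
    }

    /** Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** User-perceived characters as segmented by the JDK. */
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    /** The first code point as a String, never half of a surrogate pair. */
    static String firstCodePoint(String s) {
        return s.isEmpty() ? "" : new String(Character.toChars(s.codePointAt(0)));
    }
}
```

A first-letter breaker that takes `s.charAt(0)` would split a supplementary character such as U+1F600 in half, whereas `firstCodePoint` keeps it whole; a base letter followed by a combining accent is still two code points, which is where grapheme segmentation comes in.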

Keep README and LICENSE up to date.

0.0.1-RC9 is not in maven repository

The 0.0.1-RC9 jar is not in the Maven repository. Can you upload it?

can't get simple font example working

running this HTML:

<html> <head> <STYLE type="text/css"> body { font-family: Sans-Serif; } </STYLE> </head> <body> <p>Sample text</p> </body> </html>

on OSX 10.10.5 with java version "1.8.0_40" openhtmltopdf 0.0.1-RC5-SNAPSHOT
using com.openhtmltopdf.testcases.TestcaseRunner
the output is not sans-serif

SECURITY ISSUE - Prevent XXE attacks

Big thanks to @lillesand for bringing to my attention that Flying Saucer and OpenHTMLtoPDF are vulnerable to XXE attacks. You may be vulnerable if you use either project and allow the user to supply arbitrary XHTML (which I don't recommend at this stage).

Links:
https://www.gracefulsecurity.com/xml-external-entity-injection-xxe-vulnerabilities/
https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Processing

I am about to commit a fix and then release RC6 so people can update promptly.
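
For readers who parse user-supplied XML themselves before handing it to a renderer, a common JDK-only hardening looks roughly like the sketch below. It is not the fix committed to this project; `HardenedXmlParser` is a hypothetical name, and the settings shown reject every document that carries a DOCTYPE, which may be too strict for XHTML that declares one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not the project's implementation.
final class HardenedXmlParser {
    /** Parses XML with DOCTYPE declarations (and therefore external entities) disallowed. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Refuse any DOCTYPE, which rules out entity declarations entirely.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            dbf.setXIncludeAware(false);
            dbf.setExpandEntityReferences(false);
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Rejected or unparsable XML", e);
        }
    }
}
```

With this configuration a payload that declares an external entity pointing at a local file fails to parse instead of being expanded into the document.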

Support max-height for img tag

I don't believe the library is supporting a CSS max-height property. I am new to the library but found the same problem in Flying Saucer when I was using that. I came up with this to fix it:


    private static final class MaxWidthHeightSupportingRenderer extends ITextReplacedElementFactory {
        MaxWidthHeightSupportingRenderer(ITextOutputDevice outputDevice) {
            super(outputDevice);
        }

        @Override
        public ReplacedElement createReplacedElement(LayoutContext c, BlockBox box, UserAgentCallback uac, int cssWidth, int cssHeight) {
            Element e = box.getElement();
            if (e != null && "img".equals(e.getNodeName())) {
                String srcAttr = e.getAttribute("src");

                if (srcAttr != null && srcAttr.length() > 0) {
                    FSImage fsImage = uac.getImageResource(srcAttr).getImage();
                    if (fsImage != null) {
                        if (cssWidth != -1 || cssHeight != -1) {
                            long maxWidth = box.getStyle().asLength(c, CSSName.MAX_WIDTH).value();
                            long maxHeight = box.getStyle().asLength(c, CSSName.MAX_HEIGHT).value();
                            if (cssHeight > maxHeight && cssHeight >= cssWidth) {
                                fsImage.scale(-1, (int) maxHeight);
                            } else if (cssWidth > maxWidth) {
                                fsImage.scale((int) maxWidth, -1);
                            } else {
                                fsImage.scale(cssWidth, cssHeight);
                            }
                        }
                        return new ITextImageElement(fsImage);
                    }
                }
            }
            return super.createReplacedElement(c, box, uac, cssWidth, cssHeight);
        }
    }

I don't know how relevant that is to this project, as I just started using the library and need to learn more about how it works. Excited to see this though!
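
The scaling decision itself can be separated from the renderer classes. The sketch below is a standalone version of that idea, assuming that a non-positive maximum means "no limit" and that the aspect ratio should be preserved; ImageFit and fitWithin are hypothetical names, and this is not a full implementation of the CSS 2.1 min/max sizing rules.

```java
// Hypothetical helper: clamps an image size to optional maximums,
// keeping the aspect ratio. Not part of any library.
final class ImageFit {

    // Returns {width, height} scaled down so neither exceeds its maximum.
    // A maximum <= 0 is treated as "none" (an assumption for this sketch).
    static int[] fitWithin(int width, int height, int maxWidth, int maxHeight) {
        double scale = 1.0;
        if (maxWidth > 0 && width > maxWidth) {
            scale = Math.min(scale, (double) maxWidth / width);
        }
        if (maxHeight > 0 && height > maxHeight) {
            scale = Math.min(scale, (double) maxHeight / height);
        }
        int scaledWidth = Math.max(1, (int) Math.round(width * scale));
        int scaledHeight = Math.max(1, (int) Math.round(height * scale));
        return new int[] { scaledWidth, scaledHeight };
    }
}
```

Taking the smaller of the two scale factors handles the case where both limits are exceeded, which the if/else chain above only covers in part.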

Use custom font

Is there any way to use a custom font? There is PdfBoxFontResolver, but I currently don't see how to add a custom font to it.

Support right-to-left (RTL) text and mixed direction text.

Support RTL text such as Arabic. I've divided this issue up into the following tasks. Note that Bidi and ArabicShaping classes come from ICU4J. The aim is to provide identical output to Chrome browser with dir="auto" set on html element. The implementation uses interfaces with a do nothing implementation by default to avoid a compulsory dependency on ICU4J.

Split document up into paragraphs. The owner of each text node is said to be the nearest parent block element.
Split paragraphs up into directional runs with Bidi::setPara, Bidi::countRuns and Bidi::getVisualRun.
Determine if a line is predominantly RTL and right align it if it is.
Also if a line is predominantly RTL, lay its children out from right to left instead of left to right.
Shape text with ArabicShaping::shape. This will turn isolated characters into begin, middle or end forms depending on position in word.
Reorder text from RTL to LTR so we can output it with the standard showText instead of character by character backwards. This uses Bidi::writeReverse.
Deshape characters if the shaped versions are missing in the font.
Use a fallback font in case a character still doesn't exist in the font. For example, Latin characters when using an Arabic font.
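
The run-splitting and direction steps above can be sketched with the JDK's own java.text.Bidi, used here purely as a stand-in for the ICU4J Bidi class named in the tasks (the ICU4J API differs in detail). BidiRunsSketch and its method names are hypothetical; shaping and reordering are not shown.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch using java.text.Bidi instead of ICU4J.
final class BidiRunsSketch {

    // Splits one paragraph into directional runs, in logical order.
    // Each entry is prefixed with "LTR:" or "RTL:" for illustration.
    static List<String> splitRuns(String paragraph) {
        Bidi bidi = new Bidi(paragraph, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String text = paragraph.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            boolean rtl = (bidi.getRunLevel(i) & 1) == 1; // odd level means right-to-left
            runs.add((rtl ? "RTL:" : "LTR:") + text);
        }
        return runs;
    }

    // True when the first strong character makes the paragraph right-to-left,
    // roughly what dir="auto" would decide.
    static boolean isRtlParagraph(String paragraph) {
        return !new Bidi(paragraph, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).baseIsLeftToRight();
    }
}
```

For example, splitRuns("abc \u0639\u0631\u0628\u064A") yields one LTR run for "abc " followed by one RTL run for the Arabic word.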

Replace XRLog home-made logging with slf4j

Substitute Non-Breaking-Space with Normal-Space for PDF font character lookup

I've investigated the "#" problem I described in #19 a bit further. The problem is that &nbsp; is rendered as '#'. The # comes from the default xhtmlrenderer.conf:

# When rendering text, not all fonts support all character glyphs. When set to true, this
# will replace any missing characters with the specified character to aid in the debugging
# of your PDF.  Currently only supported for PDF rendering.
xr.renderer.replace-missing-characters=false
xr.renderer.missing-character-replacement=#

The character is used as a replacement even if xr.renderer.replace-missing-characters=false. It seems no font has a &nbsp; character. This makes some sense, as it is visually the same as a normal space.

Just replacing &nbsp; (character 160) with ' ' would fix the problem, but it does not feel like a correct fix to me:

--- a/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
+++ b/openhtmltopdf-pdfbox/src/main/java/com/openhtmltopdf/pdfboxout/PdfBoxOutputDevice.java
@@ -381,6 +332,8 @@ public class PdfBoxOutputDevice extends AbstractOutputDevice implements OutputDe
         for (int i = 0; i < str.length(); ) {
             int unicode = str.codePointAt(i);
             i += Character.charCount(unicode);
+            if( unicode == 160 )
+                unicode = ' ';
             String ch = String.valueOf(Character.toChars(unicode));
             boolean gotChar = false;

Especially because there are more space characters than just space and non-breaking space. For examples, see https://www.cs.tut.fi/~jkorpela/chars/spaces.html
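
A somewhat more general variant might look like the sketch below: any code point in the Unicode space-separator category (Zs) falls back to a plain space, but only when the font has no glyph for it. SpaceFallback and the fontHasGlyph predicate are hypothetical stand-ins for the real font lookup, and the substituted space will not have the width of the original character.

```java
import java.util.function.IntPredicate;

// Hypothetical helper; fontHasGlyph stands in for the real font lookup.
final class SpaceFallback {

    // Replaces space-like code points the font cannot display with U+0020.
    static String substituteMissingSpaces(String text, IntPredicate fontHasGlyph) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            // Zs covers U+00A0, U+2003 and similar, but not zero-width characters.
            boolean spaceLike = Character.getType(codePoint) == Character.SPACE_SEPARATOR;
            if (spaceLike && !fontHasGlyph.test(codePoint)) {
                codePoint = ' ';
            }
            out.appendCodePoint(codePoint);
        }
        return out.toString();
    }
}
```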

Upgrade PDFBOX to 2.04

Unfortunately, this doesn't fix #52

Rework URI resolver so it can be used with any base uri.

Currently it only works with the document base URI. This means it cannot be used for import statements in CSS, which should be resolved relative to the CSS document. It is also a mess of code.

Once it is fixed, make sure the CSS parser uses it for import statements instead of bypassing the user agent resolver altogether.

Probably a breaking change.
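
A minimal sketch of a resolver that takes the base as a parameter, so the same code could serve both the document and an imported stylesheet. It relies only on java.net.URI; RelativeUriResolver is a hypothetical name, and returning null for an unparsable reference is a choice made for this sketch.

```java
import java.net.URI;

// Hypothetical resolver that works against any base URI.
final class RelativeUriResolver {

    // Resolves reference against baseUri; absolute references pass through.
    // Returns null when the reference is empty or not a valid URI.
    static String resolve(String baseUri, String reference) {
        if (reference == null || reference.isEmpty()) {
            return null;
        }
        try {
            URI ref = URI.create(reference);
            if (ref.isAbsolute() || baseUri == null) {
                return ref.toString();
            }
            return URI.create(baseUri).resolve(ref).normalize().toString();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

An import inside http://example.com/css/print/main.css that refers to ../fonts/a.ttf would then resolve against the stylesheet, giving http://example.com/css/fonts/a.ttf, rather than against the document.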

Remove SWT support

Make sure XML Document Builder doesn't resolve external DTDs

Avoid line break between html elements and text

Consider the following example:

<!DOCTYPE html>
<html>
<meta charset="utf8" />
<head>
</head>
<body>
some long text… (<span class="keyword">example</span>) …some long text… «&nbsp;<span class="keyword">example2</span>&nbsp;» some long text… <i>example3</i>.
</body>
</html>

Firefox, Safari, Chrome will never break between the texts in spans and the (immediate-)surrounding characters.

But, depending on the position of the text, openhtmltopdf may put a line break between:

"(" and "example" or "example" and ")"
same between "«"/"»" and "example2"
same between example3 and "."

This results in improperly formatted text.
The text in the spans should be treated as being on the same level as the surrounding text.

More generally, this should probably be the default behavior for any inline tags.
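
One way to picture the expected behaviour: compute break opportunities over the concatenated text of all inline fragments, instead of treating each element boundary as a possible break. The sketch below does this with the JDK's java.text.BreakIterator; InlineBreakOpportunities is a hypothetical name, and this only illustrates the idea, not how the layout code is structured.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: line-break opportunities across inline boundaries.
final class InlineBreakOpportunities {

    // Joins the inline fragments and returns the offsets (into the joined
    // text) where a line break is allowed, excluding the start and the end.
    static List<Integer> find(List<String> inlineFragments) {
        String text = String.join("", inlineFragments);
        BreakIterator lines = BreakIterator.getLineInstance(Locale.ROOT);
        lines.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = lines.first(); pos != BreakIterator.DONE; pos = lines.next()) {
            if (pos > 0 && pos < text.length()) {
                offsets.add(pos);
            }
        }
        return offsets;
    }
}
```

Given the fragments "see (", "example" and ") here", the offsets between "(" and "example" and between "example" and ")" are not reported, even though element boundaries fall there.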

format html with no ending tags

Hello,
Is it possible to format HTML with no ending tags?

Something like this HTML file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//CZ">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1250">
</HEAD>
<BODY TOPMARGIN="10" MARGINWIDTH="10" LEFTMARGIN="10" MARGINHEIGHT="10" BGCOLOR="White" TEXT="Black">
<FONT FACE="Courier New CE, Courier New, Courier CE, Courier, monospaced">
-------------------------------------------------------------------------<BR>
this is just the test
characters: +ľščťžýáíé <br>
</FONT>
</BODY>
</HTML>

thanks

Add Apache PDF-BOX 2 output device

It doesn't support other languages.

This is my HTML:

WHITE 白い(default)

But when I run TestcaseRunner, the PDF displays:
WHITE ##(default)

Release schedule of a stable version.

Hi Danfickle
Currently I am using Flying Saucer and iText 2.1.2 in my project, and I would like to switch to openhtmltopdf, including the PDFBox integration.

I'm interested in getting more information about your release schedule. Is there a plan for when this library will reach a stable release?

Does openhtmltopdf support transform: rotate?

Does openhtmltopdf support CSS3 like this?

.pic-box a.nth-of-type_1 {
    transform: rotate(-6deg);
    -ms-transform: rotate(-6deg);
    -moz-transform: rotate(-6deg);
    -webkit-transform: rotate(-6deg);
    -o-transform: rotate(-6deg);
    z-index: 2;
}

I removed the CSS pseudo-classes because I can get the same effect with CSS 2.1.

CSS3 Multi-column Layout

Guys, has anyone tried to do this, or thought about it? My current employer needs this feature desperately and is willing to offer compensation to whoever takes on this task.

'java.lang.OutOfMemoryError' when using a Base64 encoded, embedded JPEG image

Hello,

I recently experienced an OutOfMemory error while using the OpenHtmlToPDF framework. Our requirements are rather normal, that is, generating a PDF file out of a simple HTML file which contains only basic CSS 2.0 and XHTML - mainly tables, text and up to three images.

We ran a stress test because the framework should be integrated in our server component, which needs to convert HTML to PDF for our clients. I used a fairly simple for-loop to iterate over HTML files and for each HTML-content, we used OpenHtmlToPDF to generate a PDF file. After about 6000 iterations, the test stopped and a "java.lang.OutOfMemoryError" was shown.

I then took apart all the components, stripped away code step by step to reproduce the OutOfMemoryError with minimal test-code and the result was this simple test case:

@Test
public void test_stressPdfRendererBuilder() throws Exception
{
    int count = 10000;

    String html = FileUtils.readFileToString( new File( "html-with-embedded-jpg.html" ), Charsets.UTF_8 );

    for ( int i = 0; i < count; i++ )
    {
        System.err.println( "i: " + i );

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();

        PdfRendererBuilder builder = new PdfRendererBuilder();
        builder.withHtmlContent( html, null );
        builder.toStream( byteArrayOutputStream );

        builder.run();
    }
}

The file html-with-embedded-jpg.html is a simple HTML file with an img tag containing an embedded JPEG image (Base64 encoded). Any browser can display that HTML file with the image.

Running the above test, one can see in the Windows Task Manager how the occupied memory grows rapidly (interestingly, the Java heap space is doing "ok"). At iteration 5000, it was at ~1.6 GB.

After about iteration 6000, the "java.lang.OutOfMemoryError" occurs, with the following stack trace:

java.lang.OutOfMemoryError: Initializing Reader
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.initJPEGImageReader(Native Method)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.<init>(Unknown Source)
    at com.sun.imageio.plugins.jpeg.JPEGImageReaderSpi.createReaderInstance(Unknown Source)
    at javax.imageio.spi.ImageReaderSpi.createReaderInstance(Unknown Source)
    at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
    at javax.imageio.ImageIO$ImageReaderIterator.next(Unknown Source)
    at org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory.readJPEG(JPEGFactory.java:103)
    at org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory.createFromStream(JPEGFactory.java:78)
    at com.openhtmltopdf.pdfboxout.PdfBoxOutputDevice.realizeImage(PdfBoxOutputDevice.java:688)
    at com.openhtmltopdf.pdfboxout.PdfBoxUserAgent.getImageResource(PdfBoxUserAgent.java:81)
    at com.openhtmltopdf.pdfboxout.PdfBoxReplacedElementFactory.createReplacedElement(PdfBoxReplacedElementFactory.java:58)
    at com.openhtmltopdf.render.BlockBox.calcMinMaxWidth(BlockBox.java:1524)
    at com.openhtmltopdf.render.BlockBox.calcMinMaxWidthInlineChildren(BlockBox.java:1684)
    at com.openhtmltopdf.render.BlockBox.calcMinMaxWidth(BlockBox.java:1567)
    at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.recalcColumn(TableBox.java:1240)
    at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.fullRecalc(TableBox.java:1214)
    at com.openhtmltopdf.newtable.TableBox$AutoTableLayout.calcMinMaxWidth(TableBox.java:1509)
    at com.openhtmltopdf.newtable.TableBox.calcMinMaxWidth(TableBox.java:158)
    at com.openhtmltopdf.newtable.TableBox.layout(TableBox.java:221)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:990)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:870)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:799)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:990)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:870)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:799)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild0(BlockBoxing.java:321)
    at com.openhtmltopdf.layout.BlockBoxing.layoutBlockChild(BlockBoxing.java:299)
    at com.openhtmltopdf.layout.BlockBoxing.layoutContent(BlockBoxing.java:90)
    at com.openhtmltopdf.render.BlockBox.layoutChildren(BlockBox.java:990)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:870)
    at com.openhtmltopdf.render.BlockBox.layout(BlockBox.java:799)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.layout(PdfBoxRenderer.java:431)
    at com.openhtmltopdf.pdfboxout.PdfRendererBuilder.run(PdfRendererBuilder.java:54)
    ...
    ...

I dug into the code and ended up in PdfBoxOutputDevice.realizeImage(PdfBoxImage), where I found the following lines:

if (img.isJpeg()) {
    xobject = JPEGFactory.createFromStream(_writer,
            new ByteArrayInputStream(img.getBytes()));
} else {
    BufferedImage buffered = ImageIO.read(new ByteArrayInputStream(
            img.getBytes()));

    xobject = LosslessFactory.createFromImage(_writer, buffered);
}

So there is a condition where JPEGFactory.createFromStream is used if the image is a JPEG; otherwise ImageIO.read is used.

So I changed my test HTML file to embed a PNG instead of a JPEG image, and the OutOfMemoryError was gone. The Java heap is doing fine, and the Windows Task Manager shows only ~80 MB memory usage for the Java process no matter how many iterations I run.

Doing a simple search, I came across this:

http://stackoverflow.com/questions/11052091/pdfbox-out-of-memory-when-adding-image

Maybe there is a problem in PDFBox, or in the way the PDFBox API is used to integrate a JPEG image into a PDDocument. I'm not sure.

So the workaround for me is to not use the JPEG format when embedding an image into the HTML code, but to use PNG instead.
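
For templates that already contain JPEGs, that workaround could be applied ahead of time with something like the sketch below, which re-encodes image bytes as PNG and builds a Base64 data URI from them. PngDataUri and its methods are hypothetical names; only javax.imageio and java.util.Base64 are used. Note that decoding the JPEG here also goes through ImageIO, so this is meant as a one-off preprocessing step, not something to run on every render.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical preprocessing helper for the PNG workaround.
final class PngDataUri {

    // Encodes an image in the given ImageIO format name ("png", "jpg", ...).
    static byte[] encode(BufferedImage image, String format) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(image, format, out)) {
                throw new IOException("No ImageIO writer for format " + format);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes image bytes in any format ImageIO can read.
    static BufferedImage decode(byte[] bytes) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
            if (image == null) {
                throw new IOException("Unsupported image data");
            }
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Re-encodes the image as PNG and wraps it in a data URI for an img src.
    static String fromImageBytes(byte[] imageBytes) {
        byte[] png = encode(decode(imageBytes), "png");
        return "data:image/png;base64," + Base64.getEncoder().encodeToString(png);
    }
}
```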

I wanted to post this issue here first. I'm sure you know how to debug the code better than me, but I hope I could help a bit with the above information.

I hope to hear from you and that there is an easy fix for it. Or perhaps this turns out to be a bug in PDFBox.

Thanks a lot!

Clean up user agents and usages of user agents

This comprises the following tasks:

Implement pluggable URI resolver.
Implement pluggable cache.
Make sure all calls for external resources go through the user agent. Specifically: URI resolver -> cache -> http stream.
Allow vetoing of an external resource call, probably with the URI resolver.
Allow overriding of the whole user agent.
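
The intended call order (URI resolver -> cache -> http stream, with a veto in the resolver) could be sketched roughly as follows. ResourcePipeline and its constructor parameters are hypothetical and only illustrate the shape; a resolver that returns null vetoes the request.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the resolver -> cache -> fetch order.
final class ResourcePipeline {
    private final Function<String, String> uriResolver; // returns null to veto
    private final Map<String, byte[]> cache;             // pluggable cache
    private final Function<String, byte[]> fetcher;      // e.g. an HTTP download

    ResourcePipeline(Function<String, String> uriResolver,
                     Map<String, byte[]> cache,
                     Function<String, byte[]> fetcher) {
        this.uriResolver = uriResolver;
        this.cache = cache;
        this.fetcher = fetcher;
    }

    // Every external resource request goes through this single method.
    Optional<byte[]> load(String uri) {
        String resolved = uriResolver.apply(uri);
        if (resolved == null) {
            return Optional.empty(); // vetoed: the fetcher is never called
        }
        // The fetcher only runs on a cache miss.
        return Optional.ofNullable(cache.computeIfAbsent(resolved, fetcher));
    }
}
```

Passing all three collaborators in through the constructor is what would make each of them replaceable, including by a test double.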

Add support for font-face in Java2D font resolver.

It doesn't seem to support the font-face way of adding fonts in CSS currently.

PDF renderer doesn't work if the HTML has CDATA sections.

Whenever I try to convert an HTML with CDATA sections to PDF, I get a NullPointerException back. If one tries to execute the following snippet:

    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.toStream(outputStream);
    builder.withHtmlContent("<html><body><div><![CDATA[test]]></div></body></html>", "");
    builder.run();

The following error will occur

Exception in thread "main" java.lang.NullPointerException
    at com.openhtmltopdf.layout.BoxBuilder.doBidi(BoxBuilder.java:1250)
    at com.openhtmltopdf.layout.BoxBuilder.createChildren(BoxBuilder.java:1207)
    at com.openhtmltopdf.layout.BoxBuilder.createChildren(BoxBuilder.java:132)
    at com.openhtmltopdf.render.BlockBox.ensureChildren(BlockBox.java:970)
    at com.openhtmltopdf.layout.BoxBuilder.createChildren(BoxBuilder.java:1199)
    at com.openhtmltopdf.layout.BoxBuilder.createChildren(BoxBuilder.java:132)
    at com.openhtmltopdf.render.BlockBox.ensureChildren(BlockBox.java:970)

This happens because the paragraph splitter doesn't recognize CDATA sections as text. The pull request #15 fixes this issue.
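
To illustrate the kind of check involved (this is not the code from the pull request), a DOM walk that collects text has to accept both node types, since a CDATA section arrives as CDATA_SECTION_NODE rather than TEXT_NODE. TextCollector is a hypothetical name and uses only the JDK's DOM parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: treat CDATA sections as text while walking the DOM.
final class TextCollector {

    // Concatenates all text under the node, including CDATA content.
    static String collectText(Node node) {
        StringBuilder out = new StringBuilder();
        collect(node, out);
        return out.toString();
    }

    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Parses a string with the JDK parser; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```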

using withHtmlContent method result #### in Arabic

I'm using Thymeleaf to process an HTML page that is then passed to openhtmltopdf to generate a PDF with Arabic content.
The issue is that if I generate the PDF using
withHtmlContent(String html, String baseUri)
then:

The PDF has #### in place of any Arabic content.
What is baseUri and how should I set it?
If I just pass the HTML page location using withUri(String uri), Arabic is rendered. The problem with this is that I need to process the HTML dynamically (to generate a purchase order).

See the sample code below:

public void generateOpenPDF() throws Exception {
    String htmlContent = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/><style>@font-face{font-family: noto; src: url('NotoNaskhArabic-Regular.ttf');}body, body *{font-family: 'noto', sans-serif;}</style></head><body><div class=\"border\"><h1>عربي</h1></div></body></html>";

    ByteArrayOutputStream os = new ByteArrayOutputStream();
    PdfRendererBuilder builder = new PdfRendererBuilder();
    builder.useUnicodeBidiSplitter(new ICUBidiSplitter.ICUBidiSplitterFactory());
    builder.useUnicodeBidiReorderer(new ICUBidiReorderer());
    builder.defaultTextDirection(TextDirection.RTL); // or TextDirection.LTR

    builder.withHtmlContent(htmlContent, "/tmp");

    builder.toStream(os);
    builder.run();
    byte[] pdfAsBytes = os.toByteArray();
    os.close();
    Path file = Paths.get("/tmp/the-file-name.pdf");
    Files.write(file, pdfAsBytes);
}
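
One possible cause, offered here only as an assumption, is that the relative font URL in the @font-face rule does not resolve against a base of "/tmp", so the Arabic font is never found. The sketch below uses plain java.net.URI (not the library's own resolver, which may behave differently) to show how a bare path and a file URI with a trailing slash resolve the same relative name; BaseUriCheck is a hypothetical name.

```java
import java.net.URI;

// Hypothetical check of how a relative URL resolves against a base URI.
final class BaseUriCheck {

    // Standard RFC-style resolution as implemented by java.net.URI.
    static URI resolveAgainst(String baseUri, String relative) {
        return URI.create(baseUri).resolve(relative);
    }
}
```

With a base of "/tmp" the font name resolves to the path /NotoNaskhArabic-Regular.ttf with no scheme, because the last path segment is replaced. With "file:///tmp/" it resolves to a file URI whose path is /tmp/NotoNaskhArabic-Regular.ttf.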

how can i make pageInfo?

I want to set the page header, page footer and page number, and even choose a picture as the background. How can I do that, please?

Add support for SVG images

Currently, SVG images are not supported. PDFBox does not "draw" SVGs natively. The best practice is to use Apache Batik to transcode SVGs to PDFs and embed these as a PDFormXObject via PDFBox.

See also:

SVGs via Batik in PDFBox: https://www.mail-archive.com/[email protected]/msg06998.html
Transcoding SVGs with Batik: http://stackoverflow.com/questions/6875807/convert-svg-to-pdf

Don't download fonts that are not actually used

Currently fonts in font-face rules or fonts added programmatically are automatically downloaded even if they are never used. They don't make it into the PDF, but most browsers don't download unused fonts and we should aim to copy this behavior.

This will be a possibly breaking change, because currently if a font-weight or font-style isn't provided in the font-face rule, the font is processed and these values are taken from the font data. That will clearly be impossible if we do not download the font until we get a match.

Note that non-subset fonts will continue to be embedded even if they are not used.
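
A minimal sketch of the deferred loading idea: fonts are registered with a supplier, and the bytes are only fetched the first time a family is actually matched. LazyFontRegistry is a hypothetical name, and matching by family name alone is a simplification; as noted above, weight and style would have to come from the font-face rule, because the font data is not available before the download.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: fonts are downloaded on first use, not on registration.
final class LazyFontRegistry {
    private final Map<String, Supplier<byte[]>> sources = new HashMap<>();
    private final Map<String, byte[]> loaded = new HashMap<>();

    // Registers a font without touching its data.
    void register(String family, Supplier<byte[]> source) {
        sources.put(key(family), source);
    }

    // Returns the font data, loading it on the first match; null if unknown.
    byte[] resolve(String family) {
        String k = key(family);
        Supplier<byte[]> source = sources.get(k);
        if (source == null) {
            return null;
        }
        return loaded.computeIfAbsent(k, unused -> source.get());
    }

    boolean isLoaded(String family) {
        return loaded.containsKey(key(family));
    }

    // Font family names are matched case-insensitively.
    private static String key(String family) {
        return family.toLowerCase(Locale.ROOT);
    }
}
```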