radkovo / pdf2dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom may also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine in order to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.

Home Page: http://cssbox.sourceforge.net/pdf2dom/

License: GNU Lesser General Public License v3.0

Java 99.71% Shell 0.29%

pdf2dom's Issues

font size is changing

When converting from PDF to HTML, the font size in the style tag changes, which causes text to overlap. Please check this.
Thank you.

Release dependency issues

Opening an issue for discussion and status updates, since the pull request where we were talking about this is closed.

FontVerter and GfxAssert should be added to Maven Central. I opened tickets with Sonatype today to get them set up, and I think I've met all the Maven Central requirements. If Sonatype responds on weekends, it should be done by then.

Background colour of node in DOM

Could you please let me know how I can get the background color of an element/node in the DOM?

I am able to get the output below:
style="top:161.80327pt;left:29.21pt;line-height:7.4866333pt;font-family:Arial;font-size:7.0pt;width:48.82689pt;"

using the code below:

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//div[text()='A300-327-GE']"; // use your XPath expression here
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(dom, XPathConstants.NODESET);
System.out.println("nodeList"+nodeList.item(0).getAttributes().item(2));
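The style attribute printed above is a plain list of CSS declarations, so a single property can be looked up by splitting it. The following is a minimal sketch using only the JDK; the class name `InlineStyle` is hypothetical, and whether Pdf2Dom emits a `background-color` declaration for a given element is not confirmed by this thread, so the lookup simply returns null when the property is absent.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: splits an inline style attribute into property/value pairs. */
final class InlineStyle {

    static Map<String, String> parse(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        if (style == null) {
            return props;
        }
        // Naive split: does not handle ';' inside url(...) values such as data URIs.
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon > 0) {
                String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = declaration.substring(colon + 1).trim();
                props.put(name, value);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        String style = "top:161.80327pt;left:29.21pt;line-height:7.4866333pt;"
                + "font-family:Arial;font-size:7.0pt;width:48.82689pt;";
        Map<String, String> props = parse(style);
        System.out.println("font-size = " + props.get("font-size"));
        System.out.println("background-color = " + props.get("background-color"));
    }
}
```

With a DOM element in hand, the input would be `element.getAttribute("style")`.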

New maven release

Do you have any plans for when a new release will be available in the Maven Central Repository? Great work, thanks.

FontVerter 1.2.18 depends on maven-reflections which has missing dependency

From my local Maven build debug output:

[DEBUG] net.sf.cssbox:pdf2dom:jar:1.6:compile
[DEBUG] net.sf.cssbox:cssbox:jar:4.12:compile
[DEBUG] net.sourceforge.nekohtml:nekohtml:jar:1.9.22:compile
[DEBUG] xerces:xercesImpl:jar:2.11.0:compile
[DEBUG] xml-apis:xml-apis:jar:1.4.01:compile
[DEBUG] net.sf.cssbox:jstyleparser:jar:1.23:compile
[DEBUG] org.antlr:antlr-runtime:jar:3.5.2:compile
[DEBUG] org.unbescape:unbescape:jar:1.1.3.RELEASE:compile
[DEBUG] net.mabboud.fontverter:FontVerter:jar:1.2.18:compile
[DEBUG] org.reflections:reflections-maven:jar:0.9.9-RC2:compile
[DEBUG] org.reflections:reflections:jar:0.9.9-RC2:compile
[DEBUG] com.google.code.findbugs:annotations:jar:2.0.1:compile
[DEBUG] dom4j:dom4j:jar:1.6.1:compile
[DEBUG] org.jfrog.maven.annomojo:maven-plugin-anno:jar:1.4.1:compile
[DEBUG] org.jfrog.jade.plugins.common:jade-plugin-common:jar:1.3.8:compile

However, jade-plugin-common is suddenly missing from the central repo.

Failed to decode downloaded font in Chrome

Chrome/Firefox seem to have problems decoding some embedded fonts when opening the output file, although Pdf2Dom does not log any error or information while processing the input file.

Expected Behavior

Chrome/Firefox are able to decode the embedded font files.

Current Behavior

Chrome/Firefox can't decode some font files. The logs state:

Chrome

Failed to decode downloaded font: data:application/x-font-woff;base64,###base64### Minted_FontList.html:1 OTS parsing error: CFF : Failed to parse table

Firefox

downloadable font: CFF : Failed to parse table (font-family: "NYDYER Theodore" style:normal weight:400 stretch:100 src index:0) source: data:application/x-font-woff;base64,###base64###

Steps to Reproduce

Download https://cdn3.minted.com/files/content/community/Minted_FontList.pdf
Run "java -jar .\PDFToHTML.jar .\Minted_FontList.pdf"
Open in latest Chrome/Firefox.

xpath of dom element

Hi, could you please let me know how we can find an element in the DOM using XPath?
Please share an example or code snippet that gets the DOM tree and traverses it as needed.
I want to verify that the colour of an element is as expected.
pdf = PDDocument.load(pdfFile);
PDFDomTree parser = new PDFDomTree();
// parse the file and get the DOM Document
Document dom = parser.createDOM(pdf);
System.out.println("dom.getTextContent()"+dom.getTextContent());
System.out.println("dom.getDocumentElement()"+dom.getDocumentElement());
System.out.println(dom.getElementById("A300-327-GE"));
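Note that `getElementById` looks up an element by its ID attribute, not by its text, so the last call above is unlikely to find a div whose text is "A300-327-GE". An XPath test on `text()` is closer to what is asked. The sketch below builds a small stand-in document with the JDK parser instead of `PDFDomTree`, so it runs without Pdf2Dom; the markup, the colour value and the class name `XPathLookup` are made up for illustration, and it assumes the elements are not in an XML namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: finds a div by its text and returns its style attribute. */
final class XPathLookup {

    static String styleOfDivWithText(Document dom, String text) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            // Pass the text as an XPath variable so quotes in it need no escaping.
            xPath.setXPathVariableResolver(
                    name -> "t".equals(name.getLocalPart()) ? text : null);
            NodeList hits = (NodeList) xPath.evaluate(
                    "//div[text()=$t]", dom, XPathConstants.NODESET);
            if (hits.getLength() == 0) {
                return null;
            }
            return ((Element) hits.item(0)).getAttribute("style");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses a small XML string; stands in for the Document returned by a PDF parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse("<html><body>"
                + "<div style=\"font-size:7.0pt;color:#ff0000;\">A300-327-GE</div>"
                + "</body></html>");
        System.out.println(styleOfDivWithText(dom, "A300-327-GE"));
    }
}
```

The returned style string can then be inspected for the colour declaration to verify.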

Local resources are unavailable for Firefox and WebView

Hi,
Thanks for your wonderful code! I am embedding your work in my application, which shows PDF contents page by page in a GUI.
I ran into the problem that none of the images and fonts in the generated HTML can be displayed by Firefox and JavaFX WebView, although they work in Chrome.
I am sure something is wrong with the resource paths, because I tried the following paths: the first two do not work, while the last two work for Firefox and my app:
<img src="D:\tmp/JEEFC.book.png"/>
<img src="D:\\tmp\\JEEFC.book.png"/>
<img src="./JEEFC.book.png"/>
<img src="JEEFC.book.png" />

I found the following link:
https://stackoverflow.com/questions/11812111/font-face-url-pointing-to-local-file?r=SearchResults
The following lines reminded me that this is a domain issue:

Both IE 9 and Firefox require font files to be served from the same domain as the page they are loaded into

So I tried the following, and it works for Firefox and JavaFX WebView:
<img src="file:///D:\tmp/JEEFC.book.png"/>

To view the HTML generated by Pdf2Dom properly, I extended your "SaveResourceToDirHandler.java".
If its variables were "protected" instead of "private", my class could be as simple as this:

public class PDFResourceToDirHandler extends SaveResourceToDirHandler {

    @Override
    public String handleResource(HtmlResource resource) throws IOException {
        return "file:///" + super.handleResource(resource);
    }

}

As it is, my class has to copy all the lines of your "SaveResourceToDirHandler.java", like this:

public class PDFResourceToDirHandler extends SaveResourceToDirHandler {

    private final File directory;
    private final List<String> writtenFileNames = new LinkedList<>();

    public PDFResourceToDirHandler(File directory) {
        this.directory = directory;
    }
    
    @Override
    public String handleResource(HtmlResource resource) throws IOException {
        String dir = DEFAULT_RESOURCE_DIR;
        if (directory != null) {
            dir = directory.getPath() + "/";
        }

        String fileName = findNextUnusedFileName(resource.getName());
        String resourcePath = dir + fileName + "." + resource.getFileEnding();

        File file = new File(resourcePath);
        FileUtils.writeByteArrayToFile(file, resource.getData());

        writtenFileNames.add(fileName);

        return "file:///" + resourcePath;
    }

    private String findNextUnusedFileName(String fileName) {
        int i = 1;
        String usedName = fileName;
        while (writtenFileNames.contains(usedName)) {
            usedName = fileName + i;
            i++;
        }

        return usedName;
    }

}
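As an alternative to prepending "file:///" by hand, the JDK can build the URI from a path, which also percent-encodes characters such as spaces. This is a small sketch of that one step only, not a replacement for the handler above; the class name `ResourceUris` is hypothetical.

```java
import java.nio.file.Paths;

/** Hypothetical helper: turns a local file path into a file: URI string. */
final class ResourceUris {

    static String toFileUri(String path) {
        // Relative paths are resolved against the current working directory.
        return Paths.get(path).toUri().toString();
    }

    public static void main(String[] args) {
        System.out.println(toFileUri("JEEFC.book.png"));
    }
}
```

In a handler like the one above, the last line could then return `ResourceUris.toFileUri(resourcePath)` instead of concatenating strings.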

infinite loop issue in the font handling

Hi,
we think we found an infinite loop issue in the font handling of Pdf2Dom 1.7.
The stack trace is:
org.fit.pdfdom.FontTable.nextUsedName(FontTable.java:83)
org.fit.pdfdom.FontTable.addEntry(FontTable.java:45)
org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
...

Looking at the code, we see that the index variable i is never incremented, so an infinite loop occurs for the third font with the same name.

protected String nextUsedName(String fontName)
{
    int i = 1;
    String usedName = fontName;
    while (isNameUsed(usedName))
        usedName = fontName + i;

    return usedName;

}

We propose the following fix:

protected String nextUsedName(String fontName)
{
    int i = 1;
    String usedName = fontName;
    while (isNameUsed(usedName))
    {
       usedName = fontName + i;
       i++;
    }

    return usedName;

}

You can reproduce it for example with this PDF file:
http://regalwerk.de/fileadmin/user_upload/RW_Katalog_2018_2019_72DPI.pdf

On page 115 there are the following fonts:
VRXWUQ+Verdana
VRXWUQ+Verdana
VRXWUQ+Futura-Bold
VRXWUQ+Futura-Book
VRXWUQ+Verdana-Bold
VRXWUQ+Verdana

causing the algorithm to hang.

We will use a workaround which counts the present fonts and skips problematic pages until the issue is fixed in Pdf2Dom.

Have a nice day
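The proposed fix can be exercised in isolation with the font names listed for page 115. The sketch below is a stand-in for the name bookkeeping, not the library's `FontTable` class; `UniqueNames` and `register` are hypothetical names, and it assumes that every listed font is registered as a separate entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Stand-in for the unique-name logic; mirrors the fixed loop from the report. */
final class UniqueNames {
    private final Set<String> used = new HashSet<>();

    /** Returns fontName, or fontName plus the first free numeric suffix, and records it. */
    String register(String fontName) {
        int i = 1;
        String usedName = fontName;
        while (used.contains(usedName)) {
            usedName = fontName + i;
            i++; // without this increment the loop never ends once fontName + "1" is taken
        }
        used.add(usedName);
        return usedName;
    }

    static List<String> registerAll(List<String> fontNames) {
        UniqueNames names = new UniqueNames();
        List<String> result = new ArrayList<>();
        for (String fontName : fontNames) {
            result.add(names.register(fontName));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> page115 = Arrays.asList(
                "VRXWUQ+Verdana", "VRXWUQ+Verdana", "VRXWUQ+Futura-Bold",
                "VRXWUQ+Futura-Book", "VRXWUQ+Verdana-Bold", "VRXWUQ+Verdana");
        for (String name : registerAll(page115)) {
            System.out.println(name);
        }
    }
}
```

The third occurrence of VRXWUQ+Verdana is the one that hangs without the increment, because the candidate name stays at the already-taken "VRXWUQ+Verdana1".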

Infinite loop in PDFBoxTree.processFontResources()

Hello,

For one particular PDF file, I get an infinite recursive call in PDFBoxTree.processFontResources(), resulting in a StackOverflowError. I have several dozen PDF files that I want to convert, but I have this problem with only one of them. Unfortunately, I can't share the PDF that causes the problem, as it is confidential.

It happens with the latest release, 1.7.

The stack trace I get is the following:

java.lang.StackOverflowError
        at java.util.zip.Inflater.<init>(Inflater.java:102)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:74)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
        at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
        at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:236)
        at org.apache.pdfbox.pdmodel.common.PDStream.toByteArray(PDStream.java:505)
        at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:176)
        at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
        at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
        at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        ...

The code I use is the following:

        File pdfFile = new File("test.pdf");
        File outFile = new File("test.html");
        try (PDDocument pdf = PDDocument.load(pdfFile)) {
            PDFDomTree parser = new PDFDomTree();
            Document dom = parser.createDOM(pdf);
            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer trans = transFactory.newTransformer();
            Source src = new DOMSource(dom);
            Result dest = new StreamResult(outFile);
            trans.transform(src, dest);
        }
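Since the PDF cannot be shared, the cause is unknown. The repeated frame at PDFBoxTree.java:403 shows the method calling itself, and one way such recursion can fail to terminate is a nested resource that refers back to one of its ancestors. That is only an assumption about this file. Under that assumption, a general guard is to remember which objects were already visited. The sketch uses a hypothetical `Node` type in place of PDF resource dictionaries and only the JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Sketch of a recursive walk that tolerates cycles; not Pdf2Dom code. */
final class ResourceWalk {

    /** Hypothetical stand-in for a resource dictionary with nested resources. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    /** Visits every reachable node exactly once and returns the visit order. */
    static List<String> visitAll(Node root) {
        List<String> order = new ArrayList<>();
        // Identity-based set: two distinct nodes are never confused, even if they look equal.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, seen, order);
        return order;
    }

    private static void walk(Node node, Set<Node> seen, List<String> order) {
        if (!seen.add(node)) {
            return; // already processed: stop instead of recursing forever
        }
        order.add(node.name);
        for (Node child : node.children) {
            walk(child, seen, order);
        }
    }

    public static void main(String[] args) {
        Node page = new Node("page");
        Node form = new Node("form");
        page.children.add(form);
        form.children.add(page); // cycle back to the ancestor
        System.out.println(visitAll(page));
    }
}
```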

Maven not able to resolve dependency cssbox 4.14-SNAPSHOT

Resolution: Modify pom to install cssbox 4.14 instead.

NullPointerException in PDFBoxTree.java on line 391 - and fix

I have a PDF that results in a NullPointerException in PDFBoxTree.java on line 391 when trying to convert it. The reason is that the font variable is null.

The fix is to include a check for null before checking the font type.

for (COSName key : resources.getFontNames())
{
    PDFont font = resources.getFont(key);
    if (null != font) {
        if (font instanceof PDTrueTypeFont)
        {
            table.addEntry( font);
            log.debug("Font: " + font.getName() + " TTF");
        }
        else if (font instanceof PDType0Font)
        {
            PDCIDFont descendantFont = ((PDType0Font) font).getDescendantFont();
            if (descendantFont instanceof PDCIDFontType2)
                table.addEntry(font);
            else
                log.warn(fontNotSupportedMessage, font.getName(), font.getClass().getSimpleName());
        }
        else if (font instanceof PDType1CFont)
            table.addEntry(font);
        else
            log.warn(fontNotSupportedMessage, font.getName(), font.getClass().getSimpleName());
    }
}

Modifying HTML Styles

I wouldn't say this is an issue, but more of a question. I'm struggling to figure out how to modify the inline styling that is provided with the html output. Is there a way to customize this?

Getting error while creating dom object from pdf in liferay using Pdf2Dom

Hello @radkovo, @m-abboud ,

I want to convert a PDF to a DOM object.
I am trying to use the following code:

File file = new File("E:/IETM/ROV.pdf");
PDDocument pdf = PDDocument.load(file);
PDFDomTree parser = new PDFDomTree();
Document dom = parser.createDOM(pdf);
return dom.toString();

But I am getting the following error:
11:49:01,544 ERROR [http-nio-8080-exec-10][JSONWebServiceServiceAction:97] org/fit/pdfdom/PDFToHTML

Kindly let me know where I am going wrong and what should be done to resolve this.

Thanks,
Siddharth

Add module-info.java

So that it will be a dedicated module instead of an automatic one.

Too many console log prints when converting PDF to HTML

When using the writeText method of PDFDomTree, too much parsing information is printed to the console. Is there a way to control the output level?
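How to lower the log level depends on the logging backend in use, which this thread does not state. The sketch below covers one case only: it assumes the PDFBox and FontBox messages reach the console through java.util.logging, which is where Apache Commons Logging falls back when no other backend is on the classpath. Messages in the "[main] WARN ..." format seen elsewhere on this page appear to come through a different logging facade and would have to be configured in that facade's backend instead. The class name `QuietLogs` is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Raises the java.util.logging threshold for the PDFBox and FontBox packages. */
final class QuietLogs {
    // Strong references: java.util.logging holds loggers weakly, so a level set on an
    // otherwise unreferenced logger can be lost after garbage collection.
    private static final Logger PDFBOX = Logger.getLogger("org.apache.pdfbox");
    private static final Logger FONTBOX = Logger.getLogger("org.apache.fontbox");

    /** Messages below the given level are dropped for both package trees. */
    static void silenceBelow(Level level) {
        PDFBOX.setLevel(level);
        FONTBOX.setLevel(level);
    }

    public static void main(String[] args) {
        silenceBelow(Level.SEVERE);
        System.out.println(Logger.getLogger("org.apache.pdfbox").getLevel());
    }
}
```

Calling `QuietLogs.silenceBelow(Level.SEVERE)` once before the conversion would be enough under these assumptions.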

Create a Gitter room for this repository

Having a gitter.im community would help contributors and users to communicate and collaborate

PDFDomTree embeds erroneous base64 of font types

Full explanation of the problem can be found here:
https://stackoverflow.com/questions/62328569/why-does-pdfdomtree-embed-erroneous-base64-of-font-types

help me executing the file

Please help me in executing this code in eclipse

pdf context order problem

When I convert the PDF document below to an HTML document:
广德经济开发区以数字化转型推动 PCB 产业转型升级的若干政策.pdf
I found that the order of the text content in the converted HTML document is inconsistent with the order in the original PDF document. See the picture below.

Pdfbox dependency security vulnerabilities

I just remembered that I got an email from GitHub notifying me that FontVerter's PDFBox dependency has a vulnerability. Pdf2Dom has its own separate dependency entry for PDFBox as well, so that needs to be bumped, and the FontVerter version too once it has been bumped there.

I'll try to remember to make a PR when I get home from work, but I'm opening this issue in case I forget again.

HTML TAG

If the PDF file contains HTML code, the file cannot be displayed properly.

Font: Helvetica skipped because type 'PDType1Font' is not supported

printStackTrace() in FontTable

In class FontTable on line 200, there is an ex.printStackTrace();

It would be really nice to remove this and throw an exception instead, so that users of your library can decide whether they want to see the stack trace or not.

org.mabb.fontverter.io.DataTypeSerializerException

INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:47)
at org.mabb.fontverter.opentype.TtfGlyph.parse(TtfGlyph.java:80)
at org.mabb.fontverter.opentype.GlyphTable.readData(GlyphTable.java:74)
at org.mabb.fontverter.opentype.OpenTypeParser.readTableDataEntries(OpenTypeParser.java:75)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:47)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:35)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.getOtfFromDescendantFont(PsType0ToOpenTypeConverter.java:64)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.convert(PsType0ToOpenTypeConverter.java:43)
at org.mabb.fontverter.pdf.PdfFontExtractor.convertType0FontToOpenType(PdfFontExtractor.java:215)
at org.fit.pdfdom.FontTable$Entry.loadType0TtfDescendantFont(FontTable.java:192)
at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:145)
at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:385)
at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:54)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:31)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:22)
at org.fit.pdfdom.TestBook.testBook(TestBook.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:71)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:45)
... 46 more
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: int org.mabb.fontverter.opentype.TtfGlyph.instructionLength org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:65)
... 47 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.readSingleValue(DataTypeBindingDeserializer.java:105)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserializeProperty(DataTypeBindingDeserializer.java:92)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:63)
... 47 more
[main] WARN Error loading type 0 with ttf descendant font 'FDGONI+SimHei' Message: org.mabb.fontverter.io.DataTypeSerializerException: Error serializing property: java.lang.Long[] org.mabb.fontverter.opentype.GlyphLocationTable.longOffsets class java.io.IOException
org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:47)
at org.mabb.fontverter.opentype.TtfGlyph.parse(TtfGlyph.java:80)
at org.mabb.fontverter.opentype.GlyphTable.readData(GlyphTable.java:74)
at org.mabb.fontverter.opentype.OpenTypeParser.readTableDataEntries(OpenTypeParser.java:75)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:47)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:35)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.getOtfFromDescendantFont(PsType0ToOpenTypeConverter.java:64)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.convert(PsType0ToOpenTypeConverter.java:43)
at org.mabb.fontverter.pdf.PdfFontExtractor.convertType0FontToOpenType(PdfFontExtractor.java:215)
at org.fit.pdfdom.FontTable$Entry.loadType0TtfDescendantFont(FontTable.java:192)
at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:145)
at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:385)
at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:54)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:31)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:22)
at org.fit.pdfdom.TestBook.testBook(TestBook.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:71)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:45)
... 46 more
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: int org.mabb.fontverter.opentype.TtfGlyph.instructionLength org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:65)
... 47 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.readSingleValue(DataTypeBindingDeserializer.java:105)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserializeProperty(DataTypeBindingDeserializer.java:92)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:63)
... 47 more
[main] WARN Error loading type 0 with ttf descendant font 'FDGOPI+SimSun' Message: org.mabb.fontverter.io.DataTypeSerializerException: Error serializing property: java.lang.Long[] org.mabb.fontverter.opentype.GlyphLocationTable.longOffsets class java.io.IOException
May 24, 2018 9:12:15 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+0 (0) in font FDICLC+Dotum

Resulting HTML does not include PDF form fields

Code:

PDFDomTreeConfig settings = PDFDomTreeConfig.createDefaultConfig();
settings.setFontHandler(PDFDomTreeConfig.embedAsBase64());
settings.setImageHandler(PDFDomTreeConfig.embedAsBase64());

PDFDomTree pdfDomTree = new PDFDomTree(settings);
try (PDDocument pdf = PDDocument.load(inputStream)) {
    try (PrintWriter output = new PrintWriter(outputPath.toFile(), "utf-8")) {
        pdfDomTree.writeText(pdf, output);
    }
}

Input:
sample-form.pdf

Output: does not include any HTML form fields

Getting java.lang.UnsupportedOperationException at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.toRGB

I am using Pdf2Dom to parse PDF documents. When I try to convert a PDF file to HTML in my Java application, I get:

java.lang.UnsupportedOperationException
at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.toRGB(PDPattern.java:95)
at org.fit.pdfdom.PathDrawer.pdfColorToColor(PathDrawer.java:133)
at org.fit.pdfdom.PathDrawer.clearPathGraphics(PathDrawer.java:79)
at org.fit.pdfdom.PathDrawer.drawPath(PathDrawer.java:59)
at org.fit.pdfdom.PDFDomTree.createPathImage(PDFDomTree.java:403)
at org.fit.pdfdom.PDFDomTree.renderPath(PDFDomTree.java:251)
at org.fit.pdfdom.PDFBoxTree.processOperator(PDFBoxTree.java:499)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showForm(PDFStreamEngine.java:181)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:65)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.fit.pdfdom.PDFBoxTree.processOperator(PDFBoxTree.java:542)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:208)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
at com.demo.pdf.converter.PdfProcessor.convertToHtml(PdfProcessor.java:87)

error loading font

I am getting a warning like:
"WARN org.fit.pdfdom.FontTable - Error loading font 'FTBUET+Verdana-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException"
Is there any solution for this?
Thank you.

Error with FontWeight in updateStyle function

In the updateStyle function, the check "font.toLowerCase().lastIndexOf(pdFontType[i]) >= 0" gives the wrong result for the "TimesNewRoman,Bold" font: the font type is detected as "roman", so fontWeight becomes normal.
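One way to avoid matching "roman" inside the family name is to look for style keywords only in the part of the font name that follows the last ',' or '-'. The sketch below illustrates that idea; it is not the library's updateStyle code. The class name `FontWeightGuess` is hypothetical, and recognising only the keyword "bold" is a simplification.

```java
import java.util.Locale;

/** Sketch: derive a CSS font weight from the style suffix of a PDF font name. */
final class FontWeightGuess {

    /** Returns "bold" or "normal" for names such as "TimesNewRoman,Bold". */
    static String weightOf(String fontName) {
        String name = fontName.toLowerCase(Locale.ROOT);
        // Only the suffix after the last ',' or '-' is treated as style information;
        // names without a separator are examined as a whole.
        int sep = Math.max(name.lastIndexOf(','), name.lastIndexOf('-'));
        String style = sep >= 0 ? name.substring(sep + 1) : name;
        return style.contains("bold") ? "bold" : "normal";
    }

    public static void main(String[] args) {
        System.out.println(weightOf("TimesNewRoman,Bold"));
        System.out.println(weightOf("TimesNewRoman"));
    }
}
```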

embedfonts branch remaining work?

(More of a discussion item than an actual issue)

So what work remains for the embedfonts branch? From what I can see, it just needs conversion added for the various PDF font types that browsers do not support?

I added support for bare CFF (PDFBox calls it Type1C, I believe) in my embedded-fonts-2 branch and may start work on others. If I recall correctly, bare CFF is the most common, but that may be wishful thinking.

document:null error

Hi,
I'm trying to create a DOM as in the example on the SourceForge project page, but the resulting Document object doesn't seem to have any DOM.

        PDDocument pdDocument = PDDocument.load(new FileInputStream(new File(ClassLoader.getSystemClassLoader().getResource("tst.pdf").getFile())));
        PDFDomTree tree = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig());
        Document d = tree.createDOM(pdDocument);
        System.out.println(d);

The output is

[#document: null]

What am I doing wrong?

Thanks.
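The text "[#document: null]" is most likely just the default `toString()` of the DOM implementation, which prints the node name and the node value; the node value of a Document node is always null, so this output does not by itself mean the tree is empty. To see the content, the tree has to be serialized. The sketch below shows this with a small hand-built document and the JDK's identity transformer, so it runs without Pdf2Dom; `DomDump` and its methods are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch: serialize a DOM Document to a string instead of relying on toString(). */
final class DomDump {

    static String serialize(Document dom) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a tiny document that stands in for the parser's result. */
    static Document sample() {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = dom.createElement("html");
            Element body = dom.createElement("body");
            body.setTextContent("hello");
            html.appendChild(body);
            dom.appendChild(html);
            return dom;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document dom = sample();
        System.out.println(dom);            // implementation-specific short form
        System.out.println(serialize(dom)); // the actual markup
    }
}
```

Applied to the report above, `DomDump.serialize(d)` would show whether the tree actually has content.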

Vulnerable Dependency Guava 15

Hello,

First of all, good job! Thank you for this awesome lib!

I would like to use it, but you have a vulnerable dependency (Guava 15). Is it possible to update the dependencies at some point?

Best regards