radkovo / pdf2dom

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree can then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom can also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.

Home Page: http://cssbox.sourceforge.net/pdf2dom/

License: GNU Lesser General Public License v3.0

Java 99.71% Shell 0.29%

pdf2dom's People

Contributors

clint-journaltech, dependabot[bot], jartysiewicz, m-abboud, radkovo

pdf2dom's Issues

font size is changing

When converting from PDF to HTML, the font size in the style tag changes, which causes text to overlap. Please look into this.
Thank you.

Release dependency issues

Opening an issue for this for discussion/status update since pull request where we were talking about this is closed.

So FontVerter and GfxAssert should be added to Maven Central. I opened tickets with Sonatype today to get them set up, and I think I've met all the Maven Central requirements. So if Sonatype responds on weekends, it should be done by then.

Background colour of node in DOM

Could you please let me know how I can get the background color of an element/node in the DOM?

I am able to get the output below:
style="top:161.80327pt;left:29.21pt;line-height:7.4866333pt;font-family:Arial;font-size:7.0pt;width:48.82689pt;"

using the code below:

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//div[text()='A300-327-GE']"; // use your XPath expression here
NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(dom, XPathConstants.NODESET);
System.out.println("nodeList"+nodeList.item(0).getAttributes().item(2));
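Once the style attribute is available as a string, individual properties can be looked up by splitting it into declarations. The following is a minimal sketch using only the JDK; the class name `InlineStyle` is made up for illustration and is not part of Pdf2Dom. Whether Pdf2Dom writes a `background-color` property for a given element is not confirmed here, so the lookup may simply return null, as it does for the style string quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class InlineStyle {

    /** Splits an inline CSS declaration list into property/value pairs. */
    public static Map<String, String> parse(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        if (style == null) {
            return props;
        }
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed declarations
            }
            String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                props.put(name, value);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Style string copied from the output quoted in this issue.
        String style = "top:161.80327pt;left:29.21pt;line-height:7.4866333pt;"
                + "font-family:Arial;font-size:7.0pt;width:48.82689pt;";
        Map<String, String> props = parse(style);
        System.out.println("font-size = " + props.get("font-size"));
        // No background-color is present in this particular string, so this is null.
        System.out.println("background-color = " + props.get("background-color"));
    }
}
```

The split on `;` is deliberately naive: a value that itself contains a semicolon (for example a data URI) would need a real CSS parser.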

New maven release

Do you have any plans for when a new release will be available on the Maven Central Repository? Great work, thanks.

FontVerter 1.2.18 depends on maven-reflections which has missing dependency

from my local maven build debug output

[DEBUG] net.sf.cssbox:pdf2dom:jar:1.6:compile
[DEBUG] net.sf.cssbox:cssbox:jar:4.12:compile
[DEBUG] net.sourceforge.nekohtml:nekohtml:jar:1.9.22:compile
[DEBUG] xerces:xercesImpl:jar:2.11.0:compile
[DEBUG] xml-apis:xml-apis:jar:1.4.01:compile
[DEBUG] net.sf.cssbox:jstyleparser:jar:1.23:compile
[DEBUG] org.antlr:antlr-runtime:jar:3.5.2:compile
[DEBUG] org.unbescape:unbescape:jar:1.1.3.RELEASE:compile
[DEBUG] net.mabboud.fontverter:FontVerter:jar:1.2.18:compile
[DEBUG] org.reflections:reflections-maven:jar:0.9.9-RC2:compile
[DEBUG] org.reflections:reflections:jar:0.9.9-RC2:compile
[DEBUG] com.google.code.findbugs:annotations:jar:2.0.1:compile
[DEBUG] dom4j:dom4j:jar:1.6.1:compile
[DEBUG] org.jfrog.maven.annomojo:maven-plugin-anno:jar:1.4.1:compile
[DEBUG] org.jfrog.jade.plugins.common:jade-plugin-common:jar:1.3.8:compile

However, jade-plugin-common is suddenly missing from the central repo.

Failed to decode downloaded font in Chrome

Chrome/Firefox seem to have problems decoding some embedded fonts when opening the output file, although Pdf2Dom does not log any error or information while processing the input file.

Expected Behavior

Chrome/Firefox are able to decode the embedded font files.

Current Behavior

Chrome/Firefox can't decode some font files. The logs state:

Chrome

Failed to decode downloaded font: data:application/x-font-woff;base64,###base64### Minted_FontList.html:1 OTS parsing error: CFF : Failed to parse table

Firefox

downloadable font: CFF : Failed to parse table (font-family: "NYDYER Theodore" style:normal weight:400 stretch:100 src index:0) source: data:application/x-font-woff;base64,###base64###

Steps to Reproduce

  1. Download https://cdn3.minted.com/files/content/community/Minted_FontList.pdf
  2. Run "java -jar .\PDFToHTML.jar .\Minted_FontList.pdf"
  3. Open in latest Chrome/Firefox.

xpath of dom element

Hi, could you please let me know how to find an element in the DOM using XPath?
Please share an example or code snippet that gets the DOM tree and traverses it as needed.
I want to verify that the colour of an element is as expected.
PDDocument pdf = PDDocument.load(pdfFile);
PDFDomTree parser = new PDFDomTree();
// parse the file and get the DOM Document
Document dom = parser.createDOM(pdf);
System.out.println("dom.getTextContent()"+dom.getTextContent());
System.out.println("dom.getDocumentElement()"+dom.getDocumentElement());
System.out.println(dom.getElementById("A300-327-GE"));
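For reference, a lookup by text content with the standard `javax.xml.xpath` API might look like the sketch below. It runs against a small hand-built document rather than real Pdf2Dom output, so the markup, the `color` value and the class name `XPathLookup` are assumptions for illustration only. If the real DOM is namespace-aware, an expression such as `//div` may need a namespace context or `local-name()`; that is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathLookup {

    /** Returns the style attribute of the first element matched by the expression, or null. */
    public static String styleOf(Document dom, String expression) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(expression, dom, XPathConstants.NODESET);
            Node first = nodes.item(0); // null when nothing matched
            if (!(first instanceof Element)) {
                return null;
            }
            return ((Element) first).getAttribute("style");
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    /** Builds a stand-in document; real input would come from the PDF parser instead. */
    public static Document sampleDom() {
        String xml = "<html><body>"
                + "<div class=\"p\" style=\"color:#ff0000;font-size:7.0pt;\">A300-327-GE</div>"
                + "</body></html>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(styleOf(sampleDom(), "//div[text()='A300-327-GE']"));
    }
}
```

Note that `getElementById` only finds attributes declared as IDs, so matching on the text content with XPath is usually the more direct route.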

Local resources are unavailable for Firefox and WebView

Hi,
Thanks for your wonderful code! I am embedding your work in my application, which shows PDF contents page by page in a GUI.
I ran into a problem: none of the images and fonts in the generated HTML can be displayed by Firefox or JavaFX WebView, although they work in Chrome.
I am sure something is wrong with the resource paths, because I tried the following paths; the first two do not work, while the last two work in Firefox and my app:
<img src="D:\tmp/JEEFC.book.png"/>
<img src="D:\\tmp\\JEEFC.book.png"/>
<img src="./JEEFC.book.png"/>
<img src="JEEFC.book.png" />

I found the following link:
https://stackoverflow.com/questions/11812111/font-face-url-pointing-to-local-file?r=SearchResults
The following lines suggested to me that it is a domain issue:

Both IE 9 and Firefox require font files to be served from the same domain as the page they are loaded into

So I tried the following, and it works in Firefox and JavaFX WebView:
<img src="file:///D:\tmp/JEEFC.book.png"/>

To view the HTML generated by Pdf2Dom properly, I extended your "SaveResourceToDirHandler.java".
If its fields were "protected" instead of "private", my class would be as simple as this:

public class PDFResourceToDirHandler extends SaveResourceToDirHandler {

     @Override
    public String handleResource(HtmlResource resource) throws IOException {
        return "file:///" + super.handleResource(resource);
    }

}

As it stands, it has to copy all the lines of your "SaveResourceToDirHandler.java", like this:

public class PDFResourceToDirHandler extends SaveResourceToDirHandler {

    private final File directory;
    private final List<String> writtenFileNames = new LinkedList<>();

    public PDFResourceToDirHandler(File directory) {
        this.directory = directory;
    }
    
    @Override
    public String handleResource(HtmlResource resource) throws IOException {
        String dir = DEFAULT_RESOURCE_DIR;
        if (directory != null) {
            dir = directory.getPath() + "/";
        }

        String fileName = findNextUnusedFileName(resource.getName());
        String resourcePath = dir + fileName + "." + resource.getFileEnding();

        File file = new File(resourcePath);
        FileUtils.writeByteArrayToFile(file, resource.getData());

        writtenFileNames.add(fileName);

        return "file:///" + resourcePath;
    }

    private String findNextUnusedFileName(String fileName) {
        int i = 1;
        String usedName = fileName;
        while (writtenFileNames.contains(usedName)) {
            usedName = fileName + i;
            i++;
        }

        return usedName;
    }

}
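As a side note on the "file:///" prefix: instead of concatenating the scheme by hand, the JDK can build the URI from a path, which also takes care of making the path absolute and percent-encoding special characters. This is only a sketch of that idea; the class name `ResourceUris` is hypothetical and it has not been tried inside a Pdf2Dom resource handler.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResourceUris {

    /** Converts a local file path into a file: URI string usable in src/url() attributes. */
    public static String toFileUri(String resourcePath) {
        // Relative paths are resolved against the current working directory.
        Path absolute = Paths.get(resourcePath).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // The exact output depends on the platform and the working directory.
        System.out.println(toFileUri("tmp/JEEFC.book.png"));
    }
}
```

A handler like the one above could return `toFileUri(resourcePath)` in place of `"file:///" + resourcePath`.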

infinite loop issue in the font handling

Hi,
we think we found an infinite loop issue in the font handling of Pdf2Dom 1.7.
The stack trace is:
org.fit.pdfdom.FontTable.nextUsedName(FontTable.java:83)
org.fit.pdfdom.FontTable.addEntry(FontTable.java:45)
org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
...

When looking at the code, we see that the index variable i is never incremented, so the third font with the same name causes an infinite loop.

protected String nextUsedName(String fontName)
{
    int i = 1;
    String usedName = fontName;
    while (isNameUsed(usedName))
        usedName = fontName + i;

    return usedName;

}

We propose the following fix:

protected String nextUsedName(String fontName)
{
    int i = 1;
    String usedName = fontName;
    while (isNameUsed(usedName))
    {
       usedName = fontName + i;
       i++;
    }

    return usedName;

}

You can reproduce it for example with this PDF file:
http://regalwerk.de/fileadmin/user_upload/RW_Katalog_2018_2019_72DPI.pdf

on page 115 there are the following fonts:
VRXWUQ+Verdana
VRXWUQ+Verdana
VRXWUQ+Futura-Bold
VRXWUQ+Futura-Book
VRXWUQ+Verdana-Bold
VRXWUQ+Verdana

causing the algorithm to hang.
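To make the failure mode easy to check outside Pdf2Dom, here is a small standalone sketch of the same naming scheme with the increment in place, run against the font list from page 115. The class `UniqueNames` and its `Set`-based bookkeeping are stand-ins for illustration; the real `FontTable` keeps its own state via `isNameUsed`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueNames {

    private final Set<String> used = new HashSet<>();

    /** Returns baseName, or baseName plus the first free numeric suffix, and marks it used. */
    public String allocate(String baseName) {
        String candidate = baseName;
        int i = 1;
        while (used.contains(candidate)) {
            candidate = baseName + i;
            i++; // without this increment, the third duplicate retries the same name forever
        }
        used.add(candidate);
        return candidate;
    }

    /** Allocates a unique name for every entry, in order. */
    public static List<String> allocateAll(List<String> names) {
        UniqueNames table = new UniqueNames();
        List<String> result = new ArrayList<>();
        for (String name : names) {
            result.add(table.allocate(name));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fonts = Arrays.asList(
                "VRXWUQ+Verdana", "VRXWUQ+Verdana", "VRXWUQ+Futura-Bold",
                "VRXWUQ+Futura-Book", "VRXWUQ+Verdana-Bold", "VRXWUQ+Verdana");
        // The third "VRXWUQ+Verdana" ends up as "VRXWUQ+Verdana2".
        System.out.println(allocateAll(fonts));
    }
}
```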

We will use a workaround which counts the present fonts and skips problematic pages until the issue is fixed in Pdf2Dom.

Have a nice day

Infinite loop in PDFBoxTree.processFontResources()

Hello,

For one particular PDF file, I get an infinite recursive call in PDFBoxTree.processFontResources(), resulting in a StackOverflowError. I have several dozen PDF files that I want to convert, but only one of them has this problem. Unfortunately, I can't share the PDF that causes the problem, as it is confidential.

It's happening with the latest release, 1.7.

The stack trace I get is the following:

java.lang.StackOverflowError
        at java.util.zip.Inflater.<init>(Inflater.java:102)
        at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:74)
        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:50)
        at org.apache.pdfbox.filter.Filter.decode(Filter.java:87)
        at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:77)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
        at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163)
        at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:236)
        at org.apache.pdfbox.pdmodel.common.PDStream.toByteArray(PDStream.java:505)
        at org.fit.pdfdom.FontTable$Entry.loadTrueTypeFont(FontTable.java:176)
        at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:147)
        at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
        at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:378)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:403)
        ...

The code I use is the following :

        File pdfFile = new File("test.pdf");
        File outFile = new File("test.html");
        try (PDDocument pdf = PDDocument.load(pdfFile)) {
        	PDFDomTree parser = new PDFDomTree();
        	Document dom = parser.createDOM(pdf);
        	TransformerFactory transFactory = TransformerFactory.newInstance();
        	Transformer trans = transFactory.newTransformer();
        	Source src = new DOMSource(dom);
        	Result dest = new StreamResult(outFile);
        	trans.transform(src, dest);
        }
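The repeated frame at PDFBoxTree.java:403 suggests that the method recurses into nested resources. One possible cause, which is only a guess without the PDF, is a resource that refers back to one of its ancestors. The usual guard against that is to remember which objects have already been visited. The sketch below shows the pattern on a hypothetical `ResourceNode` type; it is not the PDFBox resource API and not Pdf2Dom's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class CycleSafeWalk {

    /** Hypothetical stand-in for a resource dictionary that can contain nested resources. */
    public static class ResourceNode {
        final String name;
        final List<ResourceNode> nested = new ArrayList<>();

        public ResourceNode(String name) {
            this.name = name;
        }

        public void add(ResourceNode child) {
            nested.add(child);
        }
    }

    /** Counts the distinct nodes reachable from root, even if the graph contains cycles. */
    public static int countReachable(ResourceNode root) {
        // Identity-based set: two nodes are the same only if they are the same object.
        Set<ResourceNode> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, visited);
        return visited.size();
    }

    private static void walk(ResourceNode node, Set<ResourceNode> visited) {
        if (!visited.add(node)) {
            return; // already processed, so do not recurse again
        }
        for (ResourceNode child : node.nested) {
            walk(child, visited);
        }
    }

    public static void main(String[] args) {
        ResourceNode page = new ResourceNode("page");
        ResourceNode form = new ResourceNode("form");
        page.add(form);
        form.add(page); // cycle: the form refers back to the page
        System.out.println(countReachable(page));
    }
}
```

Without the `visited` check, the walk over this two-node cycle would recurse until the stack overflows, which matches the shape of the trace above.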

NullPointerException in PDFBoxTree.java on line 391 - and fix

I have a PDF that results in a NullPointerException in PDFBoxTree.java on line 391 when trying to convert it. The reason is that the font variable is null.

The fix is to include a check for null before checking the font type.

for (COSName key : resources.getFontNames())
{
    PDFont font = resources.getFont(key);
    if (null != font) {
        if (font instanceof PDTrueTypeFont)
        {
            table.addEntry( font);
            log.debug("Font: " + font.getName() + " TTF");
        }
        else if (font instanceof PDType0Font)
        {
            PDCIDFont descendantFont = ((PDType0Font) font).getDescendantFont();
            if (descendantFont instanceof PDCIDFontType2)
                table.addEntry(font);
            else
                log.warn(fontNotSupportedMessage, font.getName(), font.getClass().getSimpleName());
        }
        else if (font instanceof PDType1CFont)
            table.addEntry(font);
        else
            log.warn(fontNotSupportedMessage, font.getName(), font.getClass().getSimpleName());
    }
}

Modifying HTML Styles

I wouldn't say this is an issue, but more of a question. I'm struggling to figure out how to modify the inline styling that is provided with the html output. Is there a way to customize this?

Getting error while creating dom object from pdf in liferay using Pdf2Dom

Hello @radkovo, @m-abboud ,

I want to convert a PDF to a DOM object.
I am trying to use the following code:

File file = new File("E:/IETM/ROV.pdf");
PDDocument pdf = PDDocument.load(file);
PDFDomTree parser = new PDFDomTree();
Document dom = parser.createDOM(pdf);
return dom.toString();

But I am getting the following error:
11:49:01,544 ERROR [http-nio-8080-exec-10][JSONWebServiceServiceAction:97] org/fit/pdfdom/PDFToHTML

Kindly let me know where I am going wrong and what should be done to resolve this.

Thanks,
Siddharth

Pdfbox dependency security vulnerabilities

Just remembered I got an email from GitHub notifying me that FontVerter's PDFBox dependency has a vulnerability. Pdf2Dom has its own separate dependency entry for PDFBox as well, so that needs to be bumped, along with the FontVerter version once it's bumped there.

I'll try to remember to make a PR when I get home from work, but I'm making this issue in case I forget again.

HTML TAG

If the PDF file contains HTML code, the file cannot be displayed properly.

printStackTrace() in FontTable

In class FontTable, on line 200, there is an ex.printStackTrace(); call.

It would be really nice to remove this and throw an exception instead, so that users of your library can decide whether they want to see the stack trace.
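A sketch of what that could look like: wrap the low-level failure in an exception that keeps the original cause, and let the caller decide how to report it. `FontLoadException`, `loadFont` and `convert` are made-up names for illustration and do not exist in Pdf2Dom.

```java
import java.io.IOException;

public class FontLoadExample {

    /** Hypothetical exception type carrying the font name and the original cause. */
    public static class FontLoadException extends IOException {
        public FontLoadException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Stand-in for the real font conversion; it always fails here to show the error path. */
    static byte[] convert(String fontName) throws IOException {
        throw new IOException("simulated conversion failure");
    }

    /** Rethrows with context instead of calling ex.printStackTrace(). */
    public static byte[] loadFont(String fontName) throws FontLoadException {
        try {
            return convert(fontName);
        } catch (IOException ex) {
            throw new FontLoadException("Error loading font '" + fontName + "'", ex);
        }
    }

    public static void main(String[] args) {
        try {
            loadFont("FTBUET+Verdana-Bold");
        } catch (FontLoadException ex) {
            // The caller chooses: log one line, print the full trace, or ignore.
            System.out.println(ex.getMessage() + " (cause: " + ex.getCause().getMessage() + ")");
        }
    }
}
```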

org.mabb.fontverter.io.DataTypeSerializerException

INFO: or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:47)
at org.mabb.fontverter.opentype.TtfGlyph.parse(TtfGlyph.java:80)
at org.mabb.fontverter.opentype.GlyphTable.readData(GlyphTable.java:74)
at org.mabb.fontverter.opentype.OpenTypeParser.readTableDataEntries(OpenTypeParser.java:75)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:47)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:35)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.getOtfFromDescendantFont(PsType0ToOpenTypeConverter.java:64)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.convert(PsType0ToOpenTypeConverter.java:43)
at org.mabb.fontverter.pdf.PdfFontExtractor.convertType0FontToOpenType(PdfFontExtractor.java:215)
at org.fit.pdfdom.FontTable$Entry.loadType0TtfDescendantFont(FontTable.java:192)
at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:145)
at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:385)
at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:54)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:31)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:22)
at org.fit.pdfdom.TestBook.testBook(TestBook.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:71)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:45)
... 46 more
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: int org.mabb.fontverter.opentype.TtfGlyph.instructionLength org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:65)
... 47 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.readSingleValue(DataTypeBindingDeserializer.java:105)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserializeProperty(DataTypeBindingDeserializer.java:92)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:63)
... 47 more
[main] WARN Error loading type 0 with ttf descendant font 'FDGONI+SimHei' Message: org.mabb.fontverter.io.DataTypeSerializerException: Error serializing property: java.lang.Long[] org.mabb.fontverter.opentype.GlyphLocationTable.longOffsets class java.io.IOException
org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:47)
at org.mabb.fontverter.opentype.TtfGlyph.parse(TtfGlyph.java:80)
at org.mabb.fontverter.opentype.GlyphTable.readData(GlyphTable.java:74)
at org.mabb.fontverter.opentype.OpenTypeParser.readTableDataEntries(OpenTypeParser.java:75)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:47)
at org.mabb.fontverter.opentype.OpenTypeParser.parse(OpenTypeParser.java:35)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.getOtfFromDescendantFont(PsType0ToOpenTypeConverter.java:64)
at org.mabb.fontverter.converter.PsType0ToOpenTypeConverter.convert(PsType0ToOpenTypeConverter.java:43)
at org.mabb.fontverter.pdf.PdfFontExtractor.convertType0FontToOpenType(PdfFontExtractor.java:215)
at org.fit.pdfdom.FontTable$Entry.loadType0TtfDescendantFont(FontTable.java:192)
at org.fit.pdfdom.FontTable$Entry.getData(FontTable.java:145)
at org.fit.pdfdom.FontTable$Entry.isEntryValid(FontTable.java:161)
at org.fit.pdfdom.FontTable.addEntry(FontTable.java:48)
at org.fit.pdfdom.PDFBoxTree.processFontResources(PDFBoxTree.java:385)
at org.fit.pdfdom.PDFBoxTree.updateFontTable(PDFBoxTree.java:361)
at org.fit.pdfdom.PDFDomTree.updateFontTable(PDFDomTree.java:544)
at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:206)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
at org.fit.pdfdom.PDFDomTree.writeText(PDFDomTree.java:194)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:54)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:31)
at org.fit.pdfdom.TestUtils.parseWithPdfDomTree(TestUtils.java:22)
at org.fit.pdfdom.TestBook.testBook(TestBook.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:71)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:45)
... 46 more
Caused by: org.mabb.fontverter.io.DataTypeSerializerException: int org.mabb.fontverter.opentype.TtfGlyph.instructionLength org.mabb.fontverter.opentype.TtfGlyph
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:65)
... 47 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.readSingleValue(DataTypeBindingDeserializer.java:105)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserializeProperty(DataTypeBindingDeserializer.java:92)
at org.mabb.fontverter.io.DataTypeBindingDeserializer.deserialize(DataTypeBindingDeserializer.java:63)
... 47 more
[main] WARN Error loading type 0 with ttf descendant font 'FDGOPI+SimSun' Message: org.mabb.fontverter.io.DataTypeSerializerException: Error serializing property: java.lang.Long[] org.mabb.fontverter.opentype.GlyphLocationTable.longOffsets class java.io.IOException
May 24, 2018 9:12:15 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+0 (0) in font FDICLC+Dotum

Resulting HTML does not include PDF form fields

Code:

PDFDomTreeConfig settings = PDFDomTreeConfig.createDefaultConfig();
settings.setFontHandler(PDFDomTreeConfig.embedAsBase64());
settings.setImageHandler(PDFDomTreeConfig.embedAsBase64());

PDFDomTree pdfDomTree = new PDFDomTree(settings);
try (PDDocument pdf = PDDocument.load(inputStream);
     PrintWriter output = new PrintWriter(outputPath.toFile(), "utf-8")) {
    pdfDomTree.writeText(pdf, output);
}

Input:
sample-form.pdf

Output: does not include any HTML form fields

Getting java.lang.UnsupportedOperationException at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.toRGB

I am using Pdf2Dom to parse a PDF document. When my Java application tries to convert a PDF file to HTML, I get:

java.lang.UnsupportedOperationException
at org.apache.pdfbox.pdmodel.graphics.color.PDPattern.toRGB(PDPattern.java:95)
at org.fit.pdfdom.PathDrawer.pdfColorToColor(PathDrawer.java:133)
at org.fit.pdfdom.PathDrawer.clearPathGraphics(PathDrawer.java:79)
at org.fit.pdfdom.PathDrawer.drawPath(PathDrawer.java:59)
at org.fit.pdfdom.PDFDomTree.createPathImage(PDFDomTree.java:403)
at org.fit.pdfdom.PDFDomTree.renderPath(PDFDomTree.java:251)
at org.fit.pdfdom.PDFBoxTree.processOperator(PDFBoxTree.java:499)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.showForm(PDFStreamEngine.java:181)
at org.apache.pdfbox.contentstream.operator.DrawObject.process(DrawObject.java:65)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
at org.fit.pdfdom.PDFBoxTree.processOperator(PDFBoxTree.java:542)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.fit.pdfdom.PDFBoxTree.processPage(PDFBoxTree.java:208)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.fit.pdfdom.PDFDomTree.createDOM(PDFDomTree.java:218)
at com.demo.pdf.converter.PdfProcessor.convertToHtml(PdfProcessor.java:87)

error loading font

I am getting a warning like
"WARN org.fit.pdfdom.FontTable - Error loading font 'FTBUET+Verdana-Bold' Message: FontVerter could not detect the input font's type. class java.io.IOException"
Is there any solution for this?
Thank you

Error with FontWeight in updateStyle function

In the updateStyle function, the check "font.toLowerCase().lastIndexOf(pdFontType[i]) >= 0" gives the wrong result for the "TimesNewRoman,Bold" font. The font type is detected as "roman", so fontWeight becomes normal.
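One way around this, sketched below, is to compare where each keyword occurs and let the one closest to the end of the name win, so the "Bold" suffix takes precedence over the "Roman" inside the family name. The keyword list and the class `FontStyleGuess` are assumptions for illustration; the real `pdFontType` array in Pdf2Dom may contain different entries.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class FontStyleGuess {

    // Assumed keyword table; not copied from Pdf2Dom.
    private static final Map<String, String> KEYWORD_TO_WEIGHT = new LinkedHashMap<>();
    static {
        KEYWORD_TO_WEIGHT.put("bold", "bold");
        KEYWORD_TO_WEIGHT.put("roman", "normal");
        KEYWORD_TO_WEIGHT.put("regular", "normal");
    }

    /** Picks the weight of the keyword that occurs last in the font name. */
    public static String guessWeight(String fontName) {
        String lower = fontName.toLowerCase(Locale.ROOT);
        String weight = "normal"; // default when no keyword matches
        int bestPosition = -1;
        for (Map.Entry<String, String> entry : KEYWORD_TO_WEIGHT.entrySet()) {
            int position = lower.lastIndexOf(entry.getKey());
            if (position > bestPosition) {
                bestPosition = position;
                weight = entry.getValue();
            }
        }
        return weight;
    }

    public static void main(String[] args) {
        System.out.println(guessWeight("TimesNewRoman,Bold"));
        System.out.println(guessWeight("Times-Roman"));
    }
}
```

This is still a heuristic on the name alone; reading the weight from the font descriptor, where available, would be more reliable.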

embedfonts branch remaining work?

(More of a discussion item than an actual issue)

So what work remains for the embedfonts branch? From what I can see, it just needs conversion added for the various PDF font types that browsers don't support?

I added support for bare CFF (PDFBox calls it Type1C, I believe) in my embedded-fonts-2 branch and may start work on others. If I recall correctly, bare CFF is the most common, but that may be wishful thinking.

document:null error

Hi,
I'm trying to create a DOM as in the example on the SourceForge project page, but the resulting Document object doesn't seem to contain any DOM.

        PDDocument pdDocument = PDDocument.load(new FileInputStream(new File(ClassLoader.getSystemClassLoader().getResource("tst.pdf").getFile())));
        PDFDomTree tree = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig());
        Document d = tree.createDOM(pdDocument);
        System.out.println(d);

The output is

[#document: null]

What am I doing wrong?

Thanks.
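A note on the output above: `[#document: null]` looks like the node's name and value rather than its content. In the W3C DOM, a Document node is named `#document` and its node value is null, so printing the object this way says nothing about whether the tree is empty. To see the markup, the document has to be serialized, for example with a `Transformer`. The sketch below uses a small hand-built document in place of the Pdf2Dom result; `DomToString` is a made-up class name.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {

    /** Serializes the whole DOM tree to a string. */
    public static String serialize(Document dom) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    /** Builds a tiny stand-in document; real input would come from createDOM(). */
    public static Document sampleDom() {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = dom.createElement("html");
            Element paragraph = dom.createElement("p");
            paragraph.setTextContent("hi");
            root.appendChild(paragraph);
            dom.appendChild(root);
            return dom;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        Document dom = sampleDom();
        System.out.println(dom);            // node name and value only, not the content
        System.out.println(serialize(dom)); // the actual markup
    }
}
```

Checking `dom.getDocumentElement()` for null is another quick way to tell whether the tree has any content.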

Vulnerable Dependency Guava 15

Hello,

First of all, good job! Thank you for this awesome lib!

I would like to use it, but you have a vulnerable dependency (Guava 15). Is it possible to update the dependencies at some point?

Best regards
