
pdf2html's People

Contributors

aashari, abowcut, bluelovers, comvidnet, dependabot[bot], jonmadison-amzn, shebinleo, svtd

pdf2html's Issues

Downloading dependency too slow

Downloading the dependencies "pdfbox-app-2.0.16.jar" and "tika-app-1.22.jar" is too slow.
Even though I copied these .jar files into "node_modules\pdf2html\vendor", they are always downloaded again automatically.
I don't know how to solve this problem.

Error: Failed downloading dependency tika-app-2.6.0.jar

Hi,
I'm getting this error:
.../[email protected]/node_modules/pdf2html postinstall: throw new Error(Failed downloading dependency ${filename}.);
.../[email protected]/node_modules/pdf2html postinstall: ^
.../[email protected]/node_modules/pdf2html postinstall: Error: Failed downloading dependency tika-app-2.6.0.jar.
.../[email protected]/node_modules/pdf2html postinstall: at ClientRequest.<anonymous> (/builds/infrastructure/applications_slack/patchs_data/millenium/node_modules/.pnpm/[email protected]/node_modules/pdf2html/postinstall.js:27:23)
.../[email protected]/node_modules/pdf2html postinstall: at ClientRequest.emit (node:events:518:28)
.../[email protected]/node_modules/pdf2html postinstall: at TLSSocket.socketErrorListener (node:_http_client:495:9)
.../[email protected]/node_modules/pdf2html postinstall: at TLSSocket.emit (node:events:518:28)
.../[email protected]/node_modules/pdf2html postinstall: at emitErrorNT (node:internal/streams/destroy:169:8)
.../[email protected]/node_modules/pdf2html postinstall: at emitErrorCloseNT (node:internal/streams/destroy:128:3)
.../[email protected]/node_modules/pdf2html postinstall: at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
.../[email protected]/node_modules/pdf2html postinstall: Node.js v20.11.0
.../[email protected]/node_modules/pdf2html postinstall: Failed
Is anyone else seeing the same error?
Thank you

this library doesn't work on frontend next.js

This library uses the fs library, which doesn't exist on next.js frontends. Why would this library need the file system in the first place? Getting/setting from the file system might not work in serverless environments.

Might need to refactor to not use the fs library or any other library not available to the client or in a serverless environment.
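
Other reports on this page show the library running `java -jar .../vendor/tika-app-*.jar`, which suggests it depends on Node's fs and child_process modules and is meant to be called from server code. Until a refactor happens, a guard along these lines could keep it out of client code paths. This is only a sketch: `isServerSide` and `loadPdf2html` are hypothetical helpers, not part of pdf2html, and a bundler may still need its own configuration to exclude the module from the client bundle.

```javascript
// Hypothetical helper (not part of pdf2html): true only when running under
// Node.js, where fs and child_process are available.
function isServerSide() {
  return typeof window === 'undefined' &&
    typeof process !== 'undefined' &&
    Boolean(process.versions && process.versions.node);
}

// Load the converter lazily so that browser code never evaluates it.
function loadPdf2html() {
  if (!isServerSide()) {
    throw new Error('pdf2html needs Node.js (fs, child_process); call it from server code');
  }
  return require('pdf2html'); // assumes pdf2html is installed
}
```

In a Next.js project this would mean calling `loadPdf2html()` only inside API routes or other server-only code.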

Cannot convert pdf to html

I've started a Node.js project and installed pdf2html. However, whenever I try to run pdf2html.html(), I get the following error message:

java -jar /Users/<path to project>/node_modules/pdf2html/vendor/tika-app-1.22.jar --html ./mock/example.pdf
Oct 08, 2020 10:27:51 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Oct 08, 2020 10:27:51 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Exception in thread "main" java.net.MalformedURLException: no protocol: ./mock/example.pdf
        at java.base/java.net.URL.<init>(URL.java:668)
        at java.base/java.net.URL.<init>(URL.java:564)
        at java.base/java.net.URL.<init>(URL.java:511)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:488)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)

How do I solve this?
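
One possible cause, as far as I can tell: the Tika CLI seems to treat its argument as a URL when it does not name an existing file, so "no protocol: ./mock/example.pdf" may simply mean the relative path did not resolve from the directory the process was started in. A minimal sketch for checking that before converting, assuming the file is supposed to exist locally (`resolvePdfPath` is a hypothetical helper, not part of pdf2html):

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: turn a possibly relative PDF path into an absolute one
// and fail early with a clear message if no such file exists.
function resolvePdfPath(inputPath, baseDir = process.cwd()) {
  const absolute = path.resolve(baseDir, inputPath);
  if (!fs.existsSync(absolute) || !fs.statSync(absolute).isFile()) {
    throw new Error(`PDF not found: ${absolute}`);
  }
  return absolute;
}
```

It could then be used as `pdf2html.html(resolvePdfPath('./mock/example.pdf', __dirname), callback)`, which resolves the path relative to the calling script rather than the current working directory.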

Postinstall script is not working in a HTTP proxy environment

Hi,

When you have to use an HTTP proxy to make web requests, the postinstall script fails. This is because it uses http.get, which does not honor the HTTP_PROXY environment variable automatically.

One option to fix this would be to use http-proxy-agent, cf. https://github.com/ambanum/CGUs/blob/master/src/fetcher/index.js#L9-L14.
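
As a rough illustration of the first step, the script would need to pick a proxy from the environment before making the request. The sketch below only does that lookup; `proxyFor` is a hypothetical helper, and the variable names are the conventional ones rather than anything pdf2html reads today.

```javascript
// Hypothetical helper: choose the proxy URL for a download target from the
// conventional environment variables. Returns null when no proxy is set.
function proxyFor(targetUrl, env = process.env) {
  const { protocol } = new URL(targetUrl);
  const candidates = protocol === 'https:'
    ? [env.HTTPS_PROXY, env.https_proxy]
    : [env.HTTP_PROXY, env.http_proxy];
  return candidates.find(Boolean) || null;
}
```

The result could then be used to build a proxy agent (for example with http-proxy-agent) and passed through the `agent` option of http.get.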

Another option, without modifying the code, would be to document this particular setup in the README and explain how to get pdf2html working with it. This basically boils down to:

cd node_modules/pdf2html/vendor
# These URLs come from https://github.com/shebinleo/pdf2html/blob/master/postinstall.js#L6-L7
wget http://archive.apache.org/dist/pdfbox/2.0.16/pdfbox-app-2.0.16.jar
wget http://archive.apache.org/dist/tika/tika-app-1.22.jar

Thanks

Can't convert PDF to HTML in other languages than English

When I tried to convert a PDF that contained Malayalam text, the output was not as expected.

Input PDF Content

എം എൽ എ യുടെ അധ്യക്ഷതയിൽ യോഗം

Output

���ി�ിൽ എം എൽ എ �െട ആധ��തയിൽ േയാഗം

Tika app JAR always breaking the packages

We have been seeing a lot of issues related to the Tika app JAR (404). Can we somehow commit the JAR into the repository instead of downloading it directly from apache.org?

Any thoughts?

Allow to specify binary download URL

Hello, I appreciate your great work on this.

I'm using this package, but the postinstall script always fails for me, due to the slow network and firewall limits.

And the https module this package is using doesn't support the HTTP_PROXY environment variable. This is a headache for me...

To resolve the above issue, there are two possible solutions:

  1. Make download requests honor the HTTP_PROXY and HTTPS_PROXY environment variables. We can use another request library that supports them.

  2. Allow users to specify download URLs, so that they can point to a binary served by a mirror located close to them.

For the second solution, we can borrow the way node-sass supports this for users worldwide:

https://github.com/sass/node-sass/blob/24741b351cb046c4548e77886647cd4c89b48c66/lib/extensions.js#L192-L199
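
A minimal sketch of what option 2 might look like in the postinstall script. `PDF2HTML_MIRROR` is an invented variable name used only for illustration, `downloadUrl` is a hypothetical helper, and the default host is the archive.apache.org location quoted in another report on this page.

```javascript
// Default host, taken from the download URLs quoted elsewhere on this page.
const DEFAULT_BASE = 'http://archive.apache.org/dist';

// Hypothetical: PDF2HTML_MIRROR is an invented variable name, not one that
// pdf2html reads today. When set, it replaces the default download host.
function downloadUrl(relativePath, env = process.env) {
  const base = (env.PDF2HTML_MIRROR || DEFAULT_BASE).replace(/\/+$/, '');
  return `${base}/${relativePath.replace(/^\/+/, '')}`;
}
```

A user behind a firewall could then install with something like `PDF2HTML_MIRROR=https://mirror.example.com/apache npm install pdf2html`, where the mirror host is a placeholder.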

Read file from binary (buffer) is missing.

I would like to pass a file's buffer instead of a path or URL.

E.g.:

// Desired usage: pass a Buffer instead of a file path.
const pdfBuffer = Buffer.from(pdfToSave.buffer);

await pdf2html
  .text(pdfBuffer, (err, html) => {
    if (err) {
      console.error("Conversion error: " + err);
    } else {
      console.log(html);
    }
  })
  .promise();
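
Until buffers are supported directly, one possible workaround is to write the buffer to a temporary file and pass that path to the existing path-based API. The sketch below assumes Node 14.14 or later; `convertBuffer` is a hypothetical helper, and `convert` is any function you supply that takes a file path and returns a promise.

```javascript
const fs = require('fs/promises');
const os = require('os');
const path = require('path');

// Hypothetical workaround: write the buffer to a temporary file, hand the
// path to a path-based converter, then remove the temporary directory.
async function convertBuffer(buffer, convert) {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'pdf2html-'));
  const file = path.join(dir, 'input.pdf');
  try {
    await fs.writeFile(file, buffer);
    return await convert(file);
  } finally {
    // Clean up whether the conversion succeeded or failed.
    await fs.rm(dir, { recursive: true, force: true });
  }
}
```

Here `convert` could wrap the callback style shown above, for example `(file) => new Promise((resolve, reject) => pdf2html.text(file, (err, text) => (err ? reject(err) : resolve(text))))`.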

Error: Failed downloading dependency

While attempting to npm install pdf2html, I'm getting:

Error: Failed downloading dependency tika-app-2.1.0.jar.

and/or

Error: Failed downloading dependency pdfbox-app-2.0.24.jar.

depending on the project. I'm grasping at straws here and wondering if it has to do with the recent log4j issues, but figured I'd at least leave the report before diving deeper.

Update: indeed, pdfbox-app-2.0.24.jar is no longer available on the site, but 2.0.25.jar is.

broken in next.js apis

This is a fresh project with nothing in it other than Next.js and this library. I tried out the example code in the README and got an extremely cryptic error.

Code:

import { NextApiRequest, NextApiResponse } from "next";
import pdf2html from "pdf2html";

export default async function pdfToHtml (req: NextApiRequest, res: NextApiResponse) {
    
  const html = await pdf2html.html("/pdfs/smallPdfExample.pdf")
  console.log("html: ", html)

  res.status(200).json({ name: 'John Doe' })
}

Error:
[image: screenshot of the error]

exec is not a function

When I use the following code in a project created with create-react-app, it throws the error "exec is not a function":

const pdf2html = require('pdf2html')
 
pdf2html.html('sample.pdf', (err, html) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(html)
    }
})

How to keep the PDF text formatting?

Hello, how do I make sure that text formatting such as bold, italic, and larger font sizes is preserved when I convert the file to HTML?

Which version of the java SDK

error: Command failed: java -jar D:\vscodeProject\node\myserve\node_modules\pdf2html\vendor\tika-app-2.4.0.jar --html ../docs/9.5-9.17.pdf
Error: Registry key 'Software\JavaSoft\Java Runtime Environment'\CurrentVersion'
has value '1.8', but '1.7' is required.
Error: could not find java.dll
Error: Could not find Java SE Runtime Environment.
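
The message reads as if two Java installations are mixed up: the registry reports 1.8 while the launcher being run expects 1.7. To see which Java the `java` command on PATH actually resolves to, a small check like the one below might help. It is a sketch: both helper names are hypothetical, and it makes no claim about which version the bundled Tika jar requires.

```javascript
const { spawnSync } = require('child_process');

// Extract the major Java version from `java -version` output,
// e.g. 'java version "1.8.0_292"' gives 8, 'openjdk version "17.0.1"' gives 17.
function parseJavaMajor(versionOutput) {
  const match = /version "(\d+)(?:\.(\d+))?/.exec(versionOutput);
  if (!match) return null;
  const first = Number(match[1]);
  // Releases before Java 9 report themselves as 1.x.
  return first === 1 && match[2] !== undefined ? Number(match[2]) : first;
}

// Ask whichever `java` is first on PATH for its version.
// Note that `java -version` writes to stderr, not stdout.
function javaMajorOnPath() {
  const result = spawnSync('java', ['-version'], { encoding: 'utf8' });
  if (result.error) return null; // java not found on PATH
  return parseJavaMajor(result.stderr || '');
}
```

Calling `javaMajorOnPath()` from the same shell that runs the Node process shows the version that the `java -jar` command above would use.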

Use pagesHtml but input is loss

The code looks like this:

const options = { text: true }

pdf2html.pages('m.pdf',options,(err, htmlPages) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(htmlPages)
    }
})

The PDF looks like this:

Name: __ZZH_____

but the extracted text is Name: __________ (P.S. it is an input box).
Can anyone help with this case?
