I've converted a PDF to HTML using this library, which works perfectly, but the HTML it co


support for epub3 html about pdf2htmlex HOT 16 OPEN

pdf2htmlex commented on June 16, 2024 1

support for epub3 html

from pdf2htmlex.

Comments (16)

dodeeric commented on June 16, 2024 2

I have written a Bash script (pdf2epubEX.sh) to convert a PDF file (myfile.pdf) to a fixed layout ePub file (myfile.epub). The layout is perfectly retained and all the fonts are embedded. The script is also available as a Docker image, so it is ready to use immediately.

This Bash script uses the pdf2htmlEX tool from the present project (latest Debian/Ubuntu package: pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-focal-x86_64.deb).

See here: https://github.com/dodeeric/pdf2epubEX.

Below you can find the two fixed layout ePub files made from the two PDF files you provided above:

Conversion parameters: JPG, 150 DPI.
Metadata: all the information below is included in the metadata of the ePub file.

Mystical Poems of Rumi, Jalal al-Din Rumi.; Arberry, A. J.; Javadi, Hasan.; Lewis, Franklin; Yarshater, Ehsan, The University of Chicago Press, 2008, 440 pages, ISBN: 9780226731629, literature.

PDF (1.3 MB) | ePub (2.9 MB)

Java 8 in Action: Lambdas, Streams, and functional-style programming, Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft, Manning Publications, 2015, ISBN: 9781617291999, 497 pages, computer science.

PDF (9.6 MB) | ePub (9.8 MB)


dodeeric commented on June 16, 2024 2

Sorry, I did not understand that you wanted a reflowable text ePub. In fact, pdf2htmlEX is THE tool to maintain the original layout. Hence, I do not think it is the best tool to extract the text and the images. Regarding the images, pdf2htmlEX makes one background image per page, which can include more than one image from the PDF. Regarding the text, even after extracting it from a PDF, it will have to be edited, manually or automatically, for example to remove the hyphenation, remove the CR/LF at the end of each line, remove the page numbers, and move the footnotes. This is even more difficult for PDFs with a sophisticated layout, because you will have to move some paragraphs into the correct reading order.

Converting a PDF to a fixed layout HTML/ePub is not an easy job (pdf2htmlEX really does an exceptional job), and converting a PDF to a reflowable text HTML/ePub is not easy either. Here is an article which explains the difficulties of that task: 4 Reasons why Converting PDF to Responsive EPUB is Impossible (responsive ePub = reflowable text ePub).

If you are using Linux, you can install the poppler-utils package. Then you can extract the text and the images with the two following tools:

Extract the text: pdftotext Mystical-Poems-of-Rumi.pdf ==> Result: Mystical-Poems-of-Rumi.txt
Extract the images: pdfimages -all Mystical-Poems-of-Rumi.pdf ./Mystical-Poems-of-Rumi ==> Result: Mystical-Poems-of-Rumi-000.jpg, Mystical-Poems-of-Rumi-001.jpg, Mystical-Poems-of-Rumi-002.png, Mystical-Poems-of-Rumi-003.jpg.

As you can see, the text needs heavy manual editing before you can use a tool (Sigil, Calibre, Kotobee, etc.) to convert it to a reflowable text ePub.
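Some of the clean-up steps listed above (removing hyphenation, line-end CR/LF and page numbers) can be partly automated. The following is only a rough sketch, not something taken from pdf2epubEX or poppler-utils: the function name clean_extracted_text is made up for illustration, and it assumes that blank lines separate paragraphs, that a page number sits alone on its line, and that a trailing "-" always marks a word split across two lines.

```shell
# Sketch only: tidy text from a PDF extractor before building a reflowable book.
# Assumptions (not from the thread): blank lines separate paragraphs, a page
# number sits alone on its line, and a trailing "-" marks a word split in two.
clean_extracted_text() {
  awk '
    # Print the paragraph collected so far, then a blank separator line.
    function emit() {
      if (para != "") { print para; print ""; para = "" }
    }
    {
      gsub(/\f/, "")          # form feeds that mark page breaks
      sub(/[ \t\r]+$/, "")    # trailing blanks and the CR of a CR/LF pair
    }
    /^[ \t]*[0-9]+$/ { next }           # a line holding only a page number
    /^$/             { emit(); next }   # blank line: paragraph boundary
    {
      if (para == "")       para = $0
      else if (para ~ /-$/) para = substr(para, 1, length(para) - 1) $0
      else                  para = para " " $0
    }
    END { emit() }
  '
}
```

It could be fed from the extractor, for example: pdftotext Mystical-Poems-of-Rumi.pdf - | clean_extracted_text > cleaned.txt (the output name cleaned.txt is arbitrary). Genuinely hyphenated compounds that happen to break at a line end get merged as well, and footnotes and reading order are not handled at all, so manual review is still needed.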


mrifni commented on June 16, 2024 1

I have seen one of your comments in the old repo as well. Good work.
Is there any possibility of having a reflowable layout instead of a fixed layout?


Rockstar04 commented on June 16, 2024 1

@mrifni Due to the way that pdf2htmlEX generates its output, I'm not sure that would be possible.


rrtucci commented on June 16, 2024 1

Thanks dodeeric. Excellent advice. I had never heard of Kindle replicas.

I downloaded Kindle Create, created a replica of my PDF, and submitted it to Kindle Direct Publishing. It all went without a snag. My book should appear on the Amazon website in 2-3 days.


dodeeric commented on June 16, 2024 1

Good news for you: your book is already available online on Amazon (Kindle version). I checked the sample, and as expected, the display is perfect.


Rockstar04 commented on June 16, 2024

Just to confirm, you are trying to convert a PDF to an EPUB v3?


mrifni commented on June 16, 2024

Yes, PDF to html and then html to epub3.

There are tools that convert from PDF to epub3, but the results aren't perfect.


stephengaito commented on June 16, 2024

@mrifni,

Since I don't use epub v3 at all, it would be helpful if you provided a list of the tools (and their versions) that you are using, as well as the command lines (including all arguments) you are using to drive these conversions.

There are a lot of potential command line arguments you could be using for pdf2htmlEX, so it would be useful to see what you are using.

Equally, what tool(s) are you using to convert from html to epub? What command line arguments are you using? And what error messages are you getting?

Regards,
Stephen Gaito


mrifni commented on June 16, 2024

@stephengaito

I used the CLI command
pdf2htmlEX --zoom 1.3 Mystical\ Poems\ of\ Rumi\ -\ words\ cascade.pdf
to convert a sample PDF to HTML, and then pandoc
pandoc -f html -t epub3 -o output.epub Mystical\ Poems\ of\ Rumi\ -\ words\ cascade.html
to convert it to EPUB.
I also tried Calibre, but neither the pandoc nor the Calibre EPUB output based on the pdf2htmlEX HTML works as I wanted.
Look at the sample file I have uploaded here and try the commands yourself (I can't upload the .html and .epub files here).

Are there any options in pdf2htmlEX that would help me get the HTML and images, so that I can use another API to build an EPUB 3 from them?

Mystical Poems of Rumi - words cascade.pdf
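One possible starting point for the question above, offered as a hedged sketch rather than a confirmed answer from the maintainers: pdf2htmlEX has --embed-css, --embed-font, --embed-image, --embed-javascript, --split-pages and --dest-dir options that make it write separate files instead of one self-contained HTML file. The function name pdf_to_split_html, the default directory html-out and the DRY_RUN switch are made up for illustration; --zoom 1.3 is simply copied from the command above.

```shell
# Sketch only: have pdf2htmlEX write separate page, CSS, font and image files
# instead of one self-contained HTML file. The function name and the default
# output directory are illustrative. Set DRY_RUN=1 to print the command
# rather than run it.
pdf_to_split_html() {
  pdf=$1
  dest=${2:-html-out}
  # --embed-* 0    : keep CSS, fonts, images and JavaScript out of the HTML
  # --split-pages 1: write each page to its own file
  # --dest-dir     : directory that receives all generated files
  set -- pdf2htmlEX --zoom 1.3 \
    --embed-css 0 --embed-font 0 --embed-image 0 --embed-javascript 0 \
    --split-pages 1 --dest-dir "$dest" "$pdf"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

For example, DRY_RUN=1 pdf_to_split_html main.pdf out prints the command without running it. To my knowledge the per-page files are page fragments rather than complete documents, so whatever builds the EPUB would still have to wrap each one in its own XHTML file and write the package metadata.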


rrtucci commented on June 16, 2024

Hi to all,
I just happen to be interested in publishing my 433-page book as an epub. The book was written in LaTeX, and the PDF was generated via pdflatex on Ubuntu 18.04 with Anaconda installed. Here is the PDF:

https://github.com/rrtucci/Bayesuvius/raw/master/main.pdf

I installed pdf2htmlEX and was able to generate a PERFECT html version:

https://www.ar-tiste.com/bayesuvius.html

I was able to install dodeeric's pdf2epubEX Docker image successfully by following his instructions.
However, the resulting epub is very badly misaligned. Can you fix it?


rrtucci commented on June 16, 2024

https://qbnets.wordpress.com/2021/08/18/self-publishing-a-scientific-or-technical-book-on-amazon-and-the-2-decade-old-quest-to-convert-latex-to-html-and-epub/


dodeeric commented on June 16, 2024

Robert,

The problem is not pdf2htmlEX nor pdf2epubEX, but the ePub reader(s)... The ePub files with fixed layout are often not properly supported.

Below are screenshots of the ePub version of your book (license: cc-by-nc-nd 3.0) in Google Play Books (Android). As you can see, the display is correct in that reader:

Screenshots 1 to 6 (images not included).
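When a reader misaligns a book like this, it can help to first confirm that the file really is declared as fixed layout. In EPUB 3 that declaration is a rendition:layout meta element with the value pre-paginated in the OPF package document. The check below is a minimal sketch: the function name opf_layout is made up, it reads the OPF text on standard input, and it is a plain text match rather than a real XML parse.

```shell
# Sketch only: report whether an OPF package document declares a fixed layout.
# Reads the OPF XML on standard input and looks for the EPUB 3 form
#   <meta property="rendition:layout">pre-paginated</meta>
# This is a quick text match, not an XML parser.
opf_layout() {
  if awk '
      { buf = buf $0 }    # join all lines so the element may span lines
      END { exit !(buf ~ /property="rendition:layout"[^>]*>[ \t\r]*pre-paginated/) }
    '; then
    echo "fixed"
  else
    echo "reflowable"
  fi
}
```

Assuming unzip is installed, something like unzip -p book.epub '*.opf' | opf_layout would run the check on a packaged file (book.epub is a placeholder name). A result of "fixed" combined with a broken display points at the reader application rather than at the file.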


rrtucci commented on June 16, 2024

Thanks! That explains it. I was using Calibre on my Ubuntu PC to look at it.

One last comment. I noticed that if I run pdf2htmlEX main.pdf --process-type3 1, I get a beautiful side panel with the table of contents, but without --process-type3 1, I don't get a side panel. I also noticed that the current epub doesn't have a side panel. Is there a similar way to get a side panel in the epub?



rrtucci commented on June 16, 2024

I tried half a dozen free ebook readers on a Samsung tablet. They all mishandled fixed layout epub except one called "PocketBook": http://reader.pocketbook.digital/eng

"Google Play Books" does not misalign the pages, but it takes about a minute to load each page, whereas PocketBook loads each new page in less than a second.


dodeeric commented on June 16, 2024

On my device, most of the pages display instantaneously or in less than 5 seconds.

For your information, Amazon has a format called "print replica" which can be read by the Amazon Kindle ebook reader. It is a format which mainly "wraps" the PDF file in another file containing metadata. Only Amazon offers the possibility to publish an ebook in the PDF format... but without saying so.

How to prepare a print replica file from a PDF file with the Kindle Create tool

Regarding the "side panel" available in the "browser reader": that side panel is supposed to become the table of contents of the ePub book. The pdf2epubEX tool does not (yet) support that.

