Comments (16)

dodeeric avatar dodeeric commented on June 16, 2024 2

I have written a Bash script (pdf2epubEX.sh) to convert a PDF file (myfile.pdf) to a fixed layout ePub file (myfile.epub). The layout is retained and all the fonts are embedded. The script is also available as a Docker image, so it is ready to use immediately.

This Bash script uses the pdf2htmlEX tool from the present project (last Debian/Ubuntu package: pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-focal-x86_64.deb).

See here: https://github.com/dodeeric/pdf2epubEX.

Below you can find the two fixed layout ePub files made from the two PDF files you posted above:

Conversion parameters: JPG, 150 DPI.
Metadata: all the information below is included in the metadata of the ePub file.

Mystical Poems of Rumi, Jalal al-Din Rumi.; Arberry, A. J.; Javadi, Hasan.; Lewis, Franklin; Yarshater, Ehsan, The University of Chicago Press, 2008, 440 pages, ISBN: 9780226731629, literature.

PDF (1.3 MB) | ePub (2.9 MB)

Java 8 in Action: Lambdas, Streams, and functional-style programming, Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft, Manning Publications, 2015, ISBN: 9781617291999, 497 pages, computer science.

PDF (9.6 MB) | ePub (9.8 MB)

from pdf2htmlex.

dodeeric avatar dodeeric commented on June 16, 2024 2

Sorry, I did not understand that you wanted a reflowable text ePub. In fact, pdf2htmlEX is THE tool to maintain the original layout. Hence, I do not think it is the best tool to extract the text and the images. Regarding the images, pdf2htmlEX makes one background image per page, which can include more than one image from the PDF. Regarding the text, even after it has been extracted from a PDF, it will have to be edited manually or automatically, for example to remove the hyphenation, remove the CR/LF at the end of each line, remove the page numbers, and move the footnotes. This is even more difficult for PDFs with a sophisticated layout, because you will have to move some paragraphs into the correct reading order.

Converting a PDF to a fixed layout HTML/ePub is not an easy job (pdf2htmlEX really does an exceptional job), and converting a PDF to a reflowable text HTML/ePub is not easy either. Here is an article that explains the difficulties of that task: 4 Reasons why Converting PDF to Responsive EPUB is Impossible (responsive ePub = reflowable text ePub).

If you are using Linux, you can install the poppler-utils package. Then you can extract the text and the images with two of its tools: pdftotext for the text and pdfimages for the images.
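A minimal sketch of that extraction, assuming poppler-utils is installed. pdftotext and pdfimages are the poppler-utils tools for text and image extraction; the function name extract_pdf, the file name myfile.pdf and the output names are only illustrative.

```shell
# Sketch: extract the text and the images of a PDF with poppler-utils.
# extract_pdf and the output names are illustrative, not part of any tool.
extract_pdf() {
    pdf="$1"
    base="${pdf%.pdf}"

    # Stop early if poppler-utils is not installed.
    command -v pdftotext >/dev/null 2>&1 || { echo "pdftotext not found" >&2; return 127; }
    command -v pdfimages >/dev/null 2>&1 || { echo "pdfimages not found" >&2; return 127; }

    # -layout keeps the physical layout of each page as far as possible.
    pdftotext -layout "$pdf" "${base}.txt"

    # -all writes every embedded image in its native format
    # (files are named img-000.jpg, img-001.png, and so on).
    mkdir -p "${base}-images"
    pdfimages -all "$pdf" "${base}-images/img"
}

# Usage: extract_pdf myfile.pdf
```

This writes myfile.txt next to the PDF and puts the images in a myfile-images directory.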

As you can see, the text needs heavy manual editing before it can be passed to a tool that converts it to a reflowable text ePub (Sigil, Calibre, Kotobee, etc.).
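To give an idea of the part of that editing that can be automated, here is a rough sketch that re-joins words hyphenated at line ends, drops lines containing only a page number, and merges the lines of each paragraph. The function name clean_extracted_text is hypothetical, and the sketch assumes that paragraphs are separated by blank lines. Footnotes and reading order still have to be handled by hand.

```shell
# Sketch: basic clean-up of text extracted from a PDF.
# Reads plain text on stdin and writes the cleaned text on stdout.
# Limitation: a genuine hyphen at the end of a line is removed as well.
clean_extracted_text() {
    awk '
        # Drop form feeds (page breaks) and carriage returns.
        { gsub(/[\f\r]/, "") }

        # Skip lines that contain only a page number.
        /^[ \t]*[0-9]+[ \t]*$/ { next }

        # A blank line ends the current paragraph.
        /^[ \t]*$/ {
            if (para != "") {
                print para
                print ""
                para = ""
            }
            next
        }

        {
            sub(/^[ \t]+/, "")
            sub(/[ \t]+$/, "")
            if (para == "") {
                para = $0
            } else if (para ~ /-$/) {
                # Re-join a word that was hyphenated at the line end.
                sub(/-$/, "", para)
                para = para $0
            } else {
                para = para " " $0
            }
        }

        END { if (para != "") print para }
    '
}

# Usage: clean_extracted_text < myfile.txt > myfile-clean.txt
```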

mrifni avatar mrifni commented on June 16, 2024 1

I have seen one of your comments in the old repo as well. Good work.
Is there any possibility of getting a reflowable layout instead of a fixed layout?

Rockstar04 avatar Rockstar04 commented on June 16, 2024 1

@mrifni Due to the way that pdf2htmlEX generates its output, I'm not sure that would be possible.

rrtucci avatar rrtucci commented on June 16, 2024 1

Thanks dodeeric. Excellent advice. I had never heard of Kindle replicas.

I downloaded Kindle Create, created a replica of my PDF, and submitted it to Kindle Direct Publishing; everything went without a snag. My book should appear on the Amazon website in 2-3 days.

dodeeric avatar dodeeric commented on June 16, 2024 1

Good news for you: your book is already available online on Amazon (Kindle version). I checked the sample, and as expected, the display is perfect.

Rockstar04 avatar Rockstar04 commented on June 16, 2024

Just to confirm: you are trying to convert a PDF to EPUB 3?

mrifni avatar mrifni commented on June 16, 2024

Yes, PDF to HTML and then HTML to EPUB 3.

There are tools that convert from PDF to EPUB 3 directly, but the result isn't perfect.

stephengaito avatar stephengaito commented on June 16, 2024

@mrifni,

Since I don't use epub v3 at all... It would be helpful if you provided a list of the tools (and their versions) that you are using, as well as the command lines (including all arguments) you are using to drive these conversions...

There are a lot of potential command line arguments you could be using for pdf2htmlEX... so it would be useful to see what you are using.

Equally, what tool(s) are you using to convert from HTML to ePub? What command line arguments are you using, and what error messages are you getting?

Regards,
Stephen Gaito

mrifni avatar mrifni commented on June 16, 2024

@stephengaito

I used the CLI
pdf2htmlEX --zoom 1.3 Mystical\ Poems\ of\ Rumi\ -\ words\ cascade.pdf
to convert a sample PDF to HTML,
and then pandoc
pandoc -f html -t epub3 -o output.epub Mystical\ Poems\ of\ Rumi\ -\ words\ cascade.html
to convert it to ePub.
I also tried Calibre, but neither the pandoc nor the Calibre ePub output built from the pdf2htmlEX HTML works as I wanted.
Look at the sample file I have uploaded here and try the commands yourself (I can't upload the .html and .epub files here).

  1. Are there any pdf2htmlEX options that would help me get the HTML and images, so that I can use another API to build an EPUB 3 file from them?

Mystical Poems of Rumi - words cascade.pdf

rrtucci avatar rrtucci commented on June 16, 2024

Hi all,
I happen to be interested in publishing my 433-page book as an ePub. The book was written in LaTeX, and the PDF was generated with pdflatex on Ubuntu 18.04 with Anaconda installed. Here is the PDF:

https://github.com/rrtucci/Bayesuvius/raw/master/main.pdf

I installed pdf2htmlEX and was able to generate a PERFECT html version:

https://www.ar-tiste.com/bayesuvius.html

I was able to install dodeeric's pdf2epubEX Docker image successfully by following his instructions.
However, the resulting ePub is badly misaligned. Can you fix it?

rrtucci avatar rrtucci commented on June 16, 2024

https://qbnets.wordpress.com/2021/08/18/self-publishing-a-scientific-or-technical-book-on-amazon-and-the-2-decade-old-quest-to-convert-latex-to-html-and-epub/

dodeeric avatar dodeeric commented on June 16, 2024

Robert,

The problem is neither pdf2htmlEX nor pdf2epubEX, but the ePub reader(s). Fixed layout ePub files are often not properly supported.
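One way to check that the file itself is marked correctly: in EPUB 3, a fixed layout book declares the rendition:layout property with the value pre-paginated in its package document (the .opf file). The sketch below tests for that declaration. The function name opf_is_fixed_layout is made up, it assumes the property and its value sit on the same line, and whether pdf2epubEX writes the tag in exactly this form is an assumption.

```shell
# Sketch: succeed if the OPF package document read on stdin
# declares a fixed (pre-paginated) layout.
# opf_is_fixed_layout is a hypothetical helper name.
opf_is_fixed_layout() {
    grep -Eq 'property="rendition:layout"[^>]*>[[:space:]]*pre-paginated'
}

# Usage with an ePub file (an ePub is a zip archive):
#   unzip -p myfile.epub '*.opf' | opf_is_fixed_layout && echo "fixed layout"
```

If the check succeeds and the pages are still misaligned, the reader is most likely ignoring the fixed layout declaration.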

Below are screenshots of the ePub version of your book (license: CC BY-NC-ND 3.0) in Google Play Books (Android). As you can see, the display is correct in that reader:

Screenshot 1
Screenshot 2
Screenshot 3
Screenshot 4
Screenshot 5
Screenshot 6

rrtucci avatar rrtucci commented on June 16, 2024

I tried half a dozen free ebook readers on a Samsung tablet. They all mishandled fixed layout ePub files except one, called "PocketBook": http://reader.pocketbook.digital/eng

"Google Play Books" does not misalign the pages, but it takes about a minute to load each page, whereas PocketBook loads each new page in less than a second.

dodeeric avatar dodeeric commented on June 16, 2024

On my device, most of the pages display instantaneously or in less than 5 seconds.

For your information, Amazon has a format called "print replica" which can be read by the Amazon Kindle ebook reader. It is a format which mainly "wraps" the PDF file in another file containing metadata. Only Amazon makes it possible to publish an ebook in the PDF format, although it does not say so.

How to prepare a print replica file from a PDF file with the Kindle Create tool

Regarding the "side panel" available in the "browser reader": that side panel is supposed to become the table of contents of the ePub book. The pdf2epubEX tool does not (yet) support that.
