Comments (16)
I have written a Bash script (pdf2epubEX.sh) to convert a PDF file (myfile.pdf) to a fixed layout ePub file (myfile.epub). The layout is perfectly retained and all the fonts are embedded. The script is also available in a Docker image making it immediately ready to be used.
This Bash script uses the pdf2htmlEX tool from the present project (last Debian/Ubuntu package: pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-focal-x86_64.deb).
See here: https://github.com/dodeeric/pdf2epubEX.
Below you can find the two fixed layout ePub files made from the two PDF files you posted above:
Conversion parameters: JPG, 150 DPI.
Metadata: all the information below is included in the metadata of the ePub file.
Mystical Poems of Rumi, Jalal al-Din Rumi.; Arberry, A. J.; Javadi, Hasan.; Lewis, Franklin; Yarshater, Ehsan, The University of Chicago Press, 2008, 440 pages, ISBN: 9780226731629, literature.
Java 8 in Action: Lambdas, Streams, and functional-style programming, Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft, Manning Publications, 2015, ISBN: 9781617291999, 497 pages, computer science.
from pdf2htmlex.
Sorry, I did not understand that you wanted a reflowable text ePub. pdf2htmlEX is THE tool for preserving the original layout, so I do not think it is the best tool for extracting the text and the images. Regarding the images, pdf2htmlEX makes one background image per page, which can combine more than one image from the PDF. Regarding the text, even after it has been extracted from a PDF, it still has to be edited, manually or automatically: for example to remove the hyphenation, remove the CR/LF at the end of each line, remove the page numbers, and move the footnotes. This is even more difficult for PDFs with a sophisticated layout, because some paragraphs have to be moved into the correct reading order.
Converting a PDF to a fixed layout HTML/ePub is not an easy job (pdf2htmlEX does an exceptional job of it), and converting a PDF to a reflowable text HTML/ePub is not easy either. Here is an article which explains the difficulties of that task: 4 Reasons why Converting PDF to Responsive EPUB is Impossible (responsive ePub = reflowable text ePub).
If you are using Linux, you can install the poppler-utils package. Then you can extract the text and the images with the two following tools:
- Extract the text:
pdftotext Mystical-Poems-of-Rumi.pdf
==> Result: Mystical-Poems-of-Rumi.txt
- Extract the images:
pdfimages -all Mystical-Poems-of-Rumi.pdf ./Mystical-Poems-of-Rumi
==> Result: Mystical-Poems-of-Rumi-000.jpg, Mystical-Poems-of-Rumi-001.jpg, Mystical-Poems-of-Rumi-002.png, Mystical-Poems-of-Rumi-003.jpg.
As you can see, the text needs heavy manual editing before it can be passed to a tool that converts it to a reflowable text ePub (Sigil, Calibre, Kotobee, etc.).
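Part of the cleanup listed earlier (hyphenation, line ends, page numbers) can be automated. The following is only a minimal sketch: `clean_text` is a hypothetical helper name, it assumes paragraphs in the pdftotext output are separated by blank lines, and it treats any line holding nothing but digits as a page number. It will also wrongly merge genuine compounds that happen to break at a hyphen, and it does nothing about footnotes or reading order.

```shell
#!/usr/bin/env bash
# clean_text: hypothetical helper, not part of poppler-utils.
# Reads pdftotext output on stdin, writes one line per paragraph on stdout.
clean_text() {
  # pdftotext marks page breaks with form feeds; drop them and any CR.
  tr -d '\f\r' | awk '
    # Drop lines that hold only a number (assumed to be page numbers).
    /^[[:space:]]*[0-9]+[[:space:]]*$/ { next }
    # A blank line closes the current paragraph.
    /^[[:space:]]*$/ {
      if (para != "") { print para; para = "" }
      next
    }
    {
      if (para == "")
        para = $0
      else if (para ~ /-$/)
        # Previous line ended with a hyphen: rejoin the split word.
        para = substr(para, 1, length(para) - 1) $0
      else
        para = para " " $0
    }
    END { if (para != "") print para }
  '
}
```

Possible usage, relying on pdftotext writing to standard output when the output file is given as `-`: `pdftotext Mystical-Poems-of-Rumi.pdf - | clean_text > cleaned.txt`. The result still needs proofreading before it goes into Sigil or Calibre.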
I have seen one of your comments in the old repo as well. Good work.
Is there any possibility of getting a reflowable layout instead of a fixed layout?
@mrifni Due to the way that pdf2htmlEX generates its output, I'm not sure that would be possible.
Thanks dodeeric. Excellent advice. I had never heard of Kindle replicas.
I downloaded Kindle Create, created a replica of my PDF, and submitted it to Kindle Direct Publishing; all went without a snag. My book should appear on the Amazon website in 2-3 days.
Good news for you: your book is already available online on Amazon (Kindle version). I checked the sample, and as expected, the display is perfect.
Just to confirm: are you trying to convert a PDF to an ePub v3?
Yes, PDF to HTML and then HTML to epub3.
There are tools which convert from PDF to epub3, but the result isn't perfect.
Since I don't use epub v3 at all... It would be helpful if you provided a list of the tools (and their versions) that you are using, as well as the command lines (including all arguments) you are using to drive these conversions...
There are a lot of potential command line arguments you could be using for pdf2htmlEX... so it would be useful to see what you are using.
Equally, what tool(s) are you using to convert from html->epub? What command line arguments are you using? And what error messages are you getting?
Regards,
Stephen Gaito
I used the CLI
pdf2htmlEX --zoom 1.3 Mystical\ Poems\ of\ Rumi\ -\ words\ cascade.pdf
to convert a sample PDF to html
and then pandoc
pandoc -f html -t epub3 -o output.epub Mystical\ Poems\ of\ Rumi\ -\ words\ cascade.html
to convert to epub
I also tried Calibre, but neither the pandoc nor the Calibre ePub output built from the pdf2htmlEX HTML works the way I wanted.
Look at the sample file I have uploaded here and try the commands yourself (I can't upload the .html and .epub files here).
- Are there any pdf2htmlEX options that would help me get the HTML and the images separately, so that I can use another API to build an epub3 from them?
Mystical Poems of Rumi - words cascade.pdf
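One possible starting point for that question, offered as an untested sketch rather than a recipe: pdf2htmlEX has `--embed-*` switches that control whether CSS, fonts, images and JavaScript are inlined into the HTML or written as separate files, plus `--split-pages` and `--dest-dir`. The helper name `build_split_cmd` and the directory name used in the example are made up for illustration, and the function only prints the command (a dry run) so the arguments can be reviewed first. Note that the images written this way are the per-page background images, not the individual pictures from the PDF.

```shell
#!/usr/bin/env bash
# build_split_cmd: hypothetical helper that prints, one argument per line,
# a pdf2htmlEX command writing CSS, fonts, images and JavaScript as
# separate files and each page as its own file under the given directory.
build_split_cmd() {
  local pdf="$1" dest="$2"
  printf '%s\n' pdf2htmlEX \
    --embed-css 0 --embed-font 0 --embed-image 0 --embed-javascript 0 \
    --split-pages 1 \
    --dest-dir "$dest" \
    "$pdf"
}
```

For example, `build_split_cmd "Mystical Poems of Rumi - words cascade.pdf" out` lists the arguments; once they look right, the same arguments can be passed to pdf2htmlEX directly. Whether the separated files are a good basis for a reflowable epub3 is a different matter, since the HTML is still absolutely positioned.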
Hi to all,
I just happen to be interested in publishing my 433-page book as an ePub. The book was written in LaTeX, and the PDF was generated via pdflatex on Ubuntu 18.04 with Anaconda installed. Here is the PDF:
https://github.com/rrtucci/Bayesuvius/raw/master/main.pdf
I installed pdf2htmlEX and was able to generate a PERFECT html version:
https://www.ar-tiste.com/bayesuvius.html
I was able to install dodeeric's pdf2epubEX Docker image successfully by following his instructions.
However, the resulting ePub is very badly misaligned. Can you fix it?
Robert,
The problem is neither pdf2htmlEX nor pdf2epubEX, but the ePub reader(s)... Fixed layout ePub files are often not properly supported.
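To rule out the file itself, one can check that the ePub really declares a fixed layout. In EPUB 3 this is done in the OPF package document with a `rendition:layout` meta element set to `pre-paginated`. The sketch below uses a hypothetical helper, `is_fixed_layout`, that reads an OPF document on standard input; whether pdf2epubEX writes exactly this element is an assumption here, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# is_fixed_layout: hypothetical helper. Reads an OPF package document on
# stdin and succeeds (exit status 0) if it declares
# <meta property="rendition:layout">pre-paginated</meta>.
is_fixed_layout() {
  # Join lines first so an element split across lines still matches.
  tr -d '\n' |
    grep -Eq 'property="rendition:layout"[^>]*>[[:space:]]*pre-paginated'
}
```

Possible usage, assuming Info-ZIP `unzip` and a single `.opf` file inside the archive: `unzip -p myfile.epub '*.opf' | is_fixed_layout && echo "declared as fixed layout"`. If the declaration is present and the pages are still misaligned, the reader is the likely culprit.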
Below are screenshots of the ePub version of your book (license: cc-by-nc-nd 3.0) in Google Play Books (Android). As you can see, the display is correct in that reader:
I tried half a dozen free ebook readers on a Samsung tablet. I found they all mishandled fixed layout ePub except one called "PocketBook": http://reader.pocketbook.digital/eng
"Google Play Books" does not misalign the pages, but it takes about a minute to load each page, whereas PocketBook loads each new page in less than a second.
On my device, most of the pages display instantaneously or in less than 5 seconds.
For your information, Amazon has a format called "print replica" which can be read by the Amazon Kindle ebook reader. It is a format which mainly "wraps" the PDF file in another file containing metadata. Only Amazon makes it possible to publish an ebook in PDF format... but without saying so.
How to prepare a print replica file from a PDF file with the Kindle Create tool
Regarding the "side panel" available in the "browser reader": that side panel is supposed to become the table of contents of the ePub book. The pdf2epubEX tool does not (yet) support that.