bfabiszewski / libmobi

C library for handling Kindle (MOBI) formats of ebook documents

License: GNU Lesser General Public License v3.0

Shell 0.54% C 96.15% Makefile 0.46% M4 1.31% Roff 0.79% CMake 0.74%

kindle ebook c library

libmobi's Introduction

Libmobi

C library for handling Mobipocket/Kindle (MOBI) ebook format documents.

The library comes with several command-line tools for working with MOBI ebooks. The tools' source code also serves as an example of how to use the library.

Features:

reading and parsing:
- some older text Palmdoc formats (pdb),
- Mobipocket files (prc, mobi),
- newer MOBI files including KF8 format (azw, azw3),
- Replica Print files (azw4)
recreating source files using indices
reconstructing references (links and embedded) in html files
reconstructing source structure that can be fed back to kindlegen
reconstructing dictionary markup (orth, infl tags)
writing back loaded documents
metadata editing
handling encrypted documents
encrypting documents for use on eInk Kindles

Todo:

improve writing
serialize rawml into raw records
process RESC records

Doxygen documentation:

Source:

on github

Packages:

Installation:

[for git] $ ./autogen.sh
$ ./configure
$ make
[optionally] $ make test
$ sudo make install

On macOS, you can install via Homebrew with brew install libmobi.

Alternative build systems

The supported way of building the project is with autotools.
Optionally, the project provides basic support for the CMake, Xcode and MSVC++ build systems. However, these alternative configurations do not cover all options of the autotools project. They are also not tested and not updated regularly.

Usage

single include file: #include <mobi.h>
linker flag: -lmobi
basic usage:

#include <mobi.h>

/* Initialize main MOBIData structure */
/* Must be deallocated with mobi_free() when not needed */
MOBIData *m = mobi_init();
if (m == NULL) { 
  return ERROR; 
}

/* Open file for reading */
FILE *file = fopen(fullpath, "rb");
if (file == NULL) {
  mobi_free(m);
  return ERROR;
}

/* Load file into MOBIData structure */
/* This structure will hold raw data/metadata from mobi document */
MOBI_RET mobi_ret = mobi_load_file(m, file);
fclose(file);
if (mobi_ret != MOBI_SUCCESS) { 
  mobi_free(m);
  return ERROR;
}

/* Initialize MOBIRawml structure */
/* Must be deallocated with mobi_free_rawml() when not needed */
/* In the next step this structure will be filled with parsed data */
MOBIRawml *rawml = mobi_init_rawml(m);
if (rawml == NULL) {
  mobi_free(m);
  return ERROR;
}
/* Raw data from MOBIData will be converted to html, css, fonts, media resources */
/* Parsed data will be available in MOBIRawml structure */
mobi_ret = mobi_parse_rawml(rawml, m);
if (mobi_ret != MOBI_SUCCESS) {
  mobi_free(m);
  mobi_free_rawml(rawml);
  return ERROR;
}

/* Do something useful here */
/* ... */
/* For examples how to access data in MOBIRawml structure see mobitool.c */

/* Free MOBIRawml structure */
mobi_free_rawml(rawml);

/* Free MOBIData structure */
mobi_free(m);

return SUCCESS;

for examples of usage, see tools

Requirements

compiler supporting C99
zlib (optional, configure --with-zlib=no to use included miniz.c instead)
libxml2 (optional, configure --with-libxml2=no to use internal xmlwriter)
tested with gcc (>=4.2.4), clang (llvm >=3.4), sun c (>=5.13), MSVC++ (2015)
builds on Linux, MacOS, Windows (MSVC++, MinGW), Android, Solaris
tested architectures: x86, x86-64, arm, ppc
works cross-compiled on Kindle :)

Tests

Projects using libmobi

KyBook 2 Reader
@Voice Aloud Reader
QLMobi quicklook plugin
Librera Reader
... (let me know to include your project)

License:

LGPL, either version 3, or any later

Credits:

The Huffman decompression and KF8 parsing algorithms were learned by studying the Python source code of KindleUnpack.
Thanks to all contributors to the MobileRead MOBI wiki.

libmobi's Issues

AZW3 file generates table of contents that does not work

I have an AZW3 file that I cannot post publicly, but could send you by email for testing. When converted to EPUB, it generates a non-functional Table of Contents (TOC): the chapter names are correct, but the links do not work. The TOC entries look like this:

  <navPoint id="toc-2" playOrder="2">
   <navLabel>
    <text>CAP&amp;Iacute;TULO II: Otra mudanza ca&amp;oacute;tica</text>
   </navLabel>
   <content src="part00000.html#"/>
  </navPoint>

Note that the src attribute in <content src="part00000.html#"/> is missing a target after the # character. The same happens with internal links in the ebook text that point to the chapter titles. The same file converts fine to EPUB with other tools, e.g. Calibre.

BTW, tried to email you privately about this first, but the email does not go through and sits in the retry queue. Your own mail server at your .net domain says that your email address is graylisted...

Greg

an issue with the function implementation of mobi_buffer_get_varlen_internal in src/buffer.c

When you create a MOBIBuffer object:

    typedef struct {
    size_t offset; /**< Current offset in respect to buffer start */
    size_t maxlen; /**< Length of the buffer data */
    unsigned char *data; /**< Pointer to buffer data */
    MOBI_RET error; /**< MOBI_SUCCESS = 0 if operation on buffer is successful, non-zero value on failure */
} MOBIBuffer;

the initial value of buf->offset is 0:

MOBIBuffer * mobi_buffer_init_null(unsigned char *data, const size_t len) {
    MOBIBuffer *buf = malloc(sizeof(MOBIBuffer));
    if (buf == NULL) {
        debug_print("%s", "Buffer allocation failed\n");
        return NULL;
    }
    buf->data = data;
    buf->offset = 0;
    buf->maxlen = len;
    buf->error = MOBI_SUCCESS;
    return buf;
}

I think there is a problem when mobi_buffer_get_varlen_internal is called with direction -1 (reading the buffer backwards) and buf->offset equal to 3.
According to its documentation, the function reads a maximum of 4 bytes from the buffer and stops when a byte has bit 7 set,
so with buf->offset at 3 it should read byte 3, byte 2, byte 1, and then byte 0.
But when it comes to reading byte 0, it hits the following check at line 267:
if (buf->offset < 1) {
Zero is less than 1, so an error is printed and only the 3 bytes already read are returned instead of 4
(even though according to the pull request it should return 0).

If byte 0 needs to be read, the function should read it and then return without decrementing buf->offset, because decrementing an offset of 0 causes an integer underflow that leaves the maximum size_t value in buf->offset. I suggest checking whether buf->offset is 0 after the byte has been read and val has been updated. If it is 0,
we should check byte_count and, based on that, decide whether to execute

                debug_print("%s", "End of buffer\n");
                buf->error = MOBI_BUFFER_END;
                return 0;

or to set byte to stop_flag so that reading stops and val is returned, while keeping buf->offset at 0.
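
The suggestion above could be sketched roughly as follows. This is a standalone illustration on a raw byte array, not libmobi's code and not the actual fix: the names varlen_read_backward, varlen_status, VARLEN_OK and VARLEN_END are hypothetical, and the convention of leaving the offset at 0 after byte 0 is consumed is an assumption taken from the proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch; libmobi itself uses MOBI_RET. */
typedef enum { VARLEN_OK = 0, VARLEN_END = 1 } varlen_status;

/*
 * Read a backward variable-length integer: start at data[*offset] and move
 * toward data[0], taking 7 bits per byte, until a byte has bit 7 set or
 * 4 bytes have been read.
 * On success, *value holds the result, *nread the number of bytes consumed,
 * and *offset points at the next unread byte (it stays 0 if byte 0 was
 * consumed). On failure, the outputs and *offset are left untouched.
 */
varlen_status varlen_read_backward(const uint8_t *data, size_t maxlen,
                                   size_t *offset, uint32_t *value,
                                   size_t *nread) {
    const uint8_t stop_flag = 0x80;
    const uint8_t mask = 0x7f;
    uint32_t val = 0;
    uint32_t shift = 0;
    size_t count = 0;
    size_t pos = *offset;

    if (pos >= maxlen) {
        return VARLEN_END; /* start position is outside the buffer */
    }
    for (;;) {
        const uint8_t byte = data[pos];
        val |= (uint32_t)(byte & mask) << shift;
        shift += 7;
        count++;
        if ((byte & stop_flag) || count == 4) {
            break; /* value is complete, byte 0 may have been used */
        }
        if (pos == 0) {
            return VARLEN_END; /* more bytes expected, but none are left */
        }
        pos--;
    }
    *value = val;
    *nread = count;
    *offset = (pos > 0) ? pos - 1 : 0; /* never decrement below zero */
    return VARLEN_OK;
}
```

Because the offset alone cannot distinguish "byte 0 consumed" from "byte 0 still unread", the sketch also reports the number of bytes read, so a caller can track how much of the buffer remains.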

Can't convert pdb files to epub

Please check: I can't convert this pdb to epub.
"Error while loading document (Unsupported document format)"
PDB.zip

Can't get image from mobi

Hello @bfabiszewski
I am using your other library, QLMobi, combined with libmobi to parse HTML and images from MOBI books.
Most books work great, but for some books I cannot get the media images.
I have tried to fix it but cannot find the cause. I hope you can help. This is the last problem for me, I think.
Both QLMobi and libmobi are great, nearly perfect libraries.
Thank you very much for your great job.
World of Warcraft - Dawn of the Aspects Part I.mobi.zip

Also, I am the developer of Alook Browser - 2x Speed (https://itunes.apple.com/us/app/alook-web-browser-2x-speed/id1261944766?mt=8). If you are using iOS, here is a promotional code: JWYTH3FE4JJK
Forgive my poor English.
Best Regards.

Out of bounds write, crash

diff --git a/src/util.c b/src/util.c
index be08b26..8887afd 100644
--- a/src/util.c
+++ b/src/util.c
@@ -1601,7 +1601,7 @@ static MOBI_RET mobi_decompress_content(const MOBIData *m, char *text, FILE *fil
         if (dump) {
             fwrite(decompressed, 1, decompressed_size, file);
         } else {
-            if (text_length > *len) {
+            if (text_length + decompressed_size > *len) {
                 debug_print("%s", "Text buffer too small\n");
                 /* free huff/cdic tables */
                 mobi_free_huffcdic(huffcdic);
-- 
2.7.4
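
As an illustration of the check this patch adds, here is a minimal standalone sketch of appending a chunk to a fixed-size buffer only when it fits. The function append_chunk and its parameters are hypothetical and are not part of libmobi. The comparison is written as a subtraction from the capacity, which is a variation on the patch's `text_length + decompressed_size > *len` that also avoids wraparound when the two sizes are added.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Append chunk_size bytes from chunk to dst, which currently holds *used
 * of capacity bytes. Returns false and writes nothing if the chunk does
 * not fit in the remaining space.
 */
bool append_chunk(char *dst, size_t capacity, size_t *used,
                  const char *chunk, size_t chunk_size) {
    /* Compare against the remaining space instead of adding the sizes,
     * so the check itself cannot overflow. */
    if (*used > capacity || chunk_size > capacity - *used) {
        return false;
    }
    memcpy(dst + *used, chunk, chunk_size);
    *used += chunk_size;
    return true;
}
```

Checking only the bytes already written, as the pre-patch code did, accepts a chunk that starts inside the buffer but ends past it.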

Bug: Integer overflow parsing record offsets

There is an error parsing the record offsets in mobi_load_rec. If the next record's offset is lower than the current one, the subtraction yields a negative size that wraps around in the unsigned integer, so the malloc in mobi_load_recdata can be enormous.

        if (curr->next != NULL) {
            next = curr->next;
            size = next->offset - curr->offset; // <- integer overflow here
        } else {
           ....stripped
        }

        curr->size = size;
        ret = mobi_load_recdata(curr, file); // -> malloc(curr->size); -> enormous malloc

Here is a sample that shows this behaviour:
sample.zip
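
One way to guard against the wraparound described in this report is to validate the offsets before subtracting. The sketch below is standalone and hypothetical: record_size_checked is not a libmobi function, and the 32-bit offset type is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the size of a record from its own offset and the offset of the
 * following record. Returns false, leaving *size untouched, when the
 * offsets are not in ascending order, so the unsigned subtraction can
 * never wrap around.
 */
bool record_size_checked(uint32_t curr_offset, uint32_t next_offset,
                         size_t *size) {
    if (next_offset < curr_offset) {
        return false; /* malformed file: offsets go backwards */
    }
    *size = (size_t)(next_offset - curr_offset);
    return true;
}
```

A caller would treat a false result as a corrupt file and stop before any allocation based on the size.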

Out-of-bound read vulnerability caused by incomplete check inside `mobi_decompress_huffman_internal`

The developer can access the bug details here.

Amazon azw4 format?

Hi Bartek!
One of the users of my app (@voice Aloud Reader in Google Play) sent me the first azw4 ebook. Do you think you could include this format in your library? Would you need any help with this? I was able to convert the file to epub using the latest calibre, but the format is weird: short lines about 80 characters long formatted as <p>...</p>. It could be a problem with the original file or with calibre's conversion process; I don't know at this time.

OK, just managed to update my old Kindle HDX 3rd generation, and it opened the azw4 file fine, no problem with formatting there. Apparently Calibre's conversion is not perfect yet. Please let me know if you have any plans regarding AZW4. Thanks!

Grzesiek

How to build it as dll?

How can I build this as a dll so that I can consume in a C# project?

How to get the HTML content with the specified sequence number as fast as possible?

When I parse a big file, the method MOBI_RET mobi_parse_rawml(MOBIRawml *rawml, const MOBIData *m) is too slow.

How to get the HTML content with the specified sequence number as fast as possible?

Mobi file can't parse.

Thank you, and sorry to trouble you again.
Best Regards.

pg1342-images.mobi.zip

I am confused about a function.

I am puzzled by the function _buffer_get_varlen. Why does it read 7 bits at a time? I am also confused about the condition "Stops when byte has bit 7 set". Shouldn't it read 8 bits at each step?
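
For context, a variable-length integer of this kind stores 7 bits of the value in each byte and uses the top bit (bit 7) as an "this is the last byte" marker, which is why a decoder takes 7 bits per step rather than 8. The following is a minimal standalone sketch of the forward case under that assumption. It is not libmobi's implementation, and the name varlen_decode_forward is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a forward variable-length integer from data[0..len).
 * Each byte contributes its low 7 bits, most significant group first.
 * Decoding stops at the first byte with bit 7 set, or after 4 bytes.
 * Returns the number of bytes consumed, or 0 if the data ends before
 * the value is complete.
 */
size_t varlen_decode_forward(const uint8_t *data, size_t len, uint32_t *out) {
    uint32_t val = 0;
    size_t i = 0;
    while (i < len && i < 4) {
        const uint8_t byte = data[i++];
        val = (val << 7) | (uint32_t)(byte & 0x7f); /* keep the 7 payload bits */
        if (byte & 0x80) {                          /* bit 7 marks the end */
            *out = val;
            return i;
        }
    }
    if (i == 4) {
        *out = val; /* maximum length reached without a stop bit */
        return i;
    }
    return 0; /* truncated input */
}
```

With this scheme the value 0x11111 is stored as the three bytes 0x04 0x22 0x91: the groups 0x04, 0x22 and 0x11 hold the payload, and the final byte has bit 7 added (0x11 | 0x80 = 0x91).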

Trying to get in touch regarding a security issue

Hey there!

I'd like to report a security issue but cannot find contact instructions on your repository.

If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

Thank you for your consideration, and I look forward to hearing from you!

(cc @huntr-helper)

enabling MOBI_DEBUG on Windows

I am getting a CMake error if I enable MOBI_DEBUG on Windows (VS 2022):

cl : command line error D8021: invalid numeric argument '/Wextra'

convert mobi ebook to epub error

The mobi file converts to epub format successfully, but the resulting epub file is malformed: it can't be opened by iBooks or by many Android epub readers. I checked the epub file with calibre-edit and got the errors below:

ERROR: Parsing failed: xmlParseEntityRef: no name, line 1, column 807    [OEBPS/part00000.html]
INFO: File too large    [OEBPS/part00000.html]

123_test.epub.zip

Another Out-of-bound read vulnerability caused by incomplete check inside `mobi_decompress_huffman_internal`

The detailed bug report is here. The developer can access it by logging in.

mobitool -d <file> -o <output directory> ignores -o argument

$ mobitool.exe -d data/googled.mobi -o .
Title: Googled
Author: Ken Auletta
. . . <output snipped> . . .

Dumping rawml...
Saving rawml to data/googled.rawml

Please add .kfx support

Please add support new Kindle format KFX
(sample attached)
sample.zip

When I make and install on Mac OS X 10.12.4, it tells me it can't find mobi.h

main.c:10:11: fatal error: 'mobi.h' file not found
How can I fix this error?

tarball is missing for latest releases (e.g. 0.6)

Github-provided tarballs are pretty bad, because the size is about 45 megabytes

AddressSanitizer: heap-buffer-overflow at buffer.c:212

Using our fuzzer, we found several heap-buffer-overflow errors when libmobi is compiled with AddressSanitizer and run with the command mobitool -i7m $file. Someone else also found a few others here.

We will list them separately in the following issue threads. This is the first one.

POC (proof-of-crash) files:
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c%3A212_1.mobi
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c%3A212_2.mobi

gdb output:
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c:212_1.mobi.gdb.txt
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c:212_2.mobi.gdb.txt

README question: can libmobi also create new documents from scratch?

The README lists a lot of features, but they're all apparently centered around reading or modifying an existing file.

Can libmobi also create new ebooks from scratch? (For use in an EPUB->MOBI conversion software) If yes, maybe another bullet point in the README clarifying that would be useful 🙂

Thanks for creating this cool library!

AddressSanitizer: heap-buffer-overflow at buffer.c:230

POC files:
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c%3A230_1.mobi
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c%3A230_2.mobi

gdb output:
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c%3A230_1.mobi.gdb.txt
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_buffer.c%3A230_2.mobi.gdb.txt

Mobi file can't parse

Parsing this MOBI file failed, but it can be opened in the Kindle app.
The file is attached.
Thank you for your great work~
Best Regards.
World of Warcraft - Dawn of the Aspects Part I.mobi.zip

Homebrew formula

Homebrew is an awesome package manager for macOS. If you add a brew formula, i.e. libmobi.rb, it will become very convenient to install libmobi on macOS.

convert azw3 ebook to epub error

printf("Could not initialize zip archive\n");
Here is the link to the file I tested.
https://1drv.ms/u/s!AkaVccfysLmAhI5Odqj2pZ1QCMci6g?e=U9lkC3

MOBI_ATTRNAME_MAXSIZE 100 is not enough for some books

Please increase MOBI_ATTRNAME_MAXSIZE and MOBI_ATTRVALUE_MAXSIZE to 150

#define MOBI_ATTRNAME_MAXSIZE 150 /**< Maximum length of tag attribute name, like "href" */
#define MOBI_ATTRVALUE_MAXSIZE 150 /**< Maximum length of tag attribute value */

thanks

how can i use this lib to get the mobi cover image?

AddressSanitizer: heap-buffer-overflow at common.c:376

POC file:
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_common.c%3A376_1.mobi

gdb output:
https://github.com/ntu-sec/pocs/blob/master/libmobi/hbo_common.c%3A376_1.mobi.gdb.txt

toc.ncx is sometimes created with wrong navigation labels

The issue is with "World of Warcraft - Dawn of the Aspects Part I.mobi" ebook file, submitted with "Mobi file can't parse #10" by @LiuDeng:

The toc.ncx that libmobi generates from this file has wrong links. For example for "Part I" we have in toc.ncx:

However, there is no element with id "0000006908" in part00000.html at all. Instead, "Part I" header is preceded with:

Could you maybe tell me where and how the toc.ncx is constructed, maybe then I could find a fix on my own...