Comments (4)

piovac commented on July 3, 2024

In practice, the problem is easy to fix with a setting in the Calibre conversion tool. In the "Search & replace" settings, add \ width=\"(.*?)\" height=\"(.*?)\" to the Search field and leave the Replace field empty.
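To see what that search expression does, here is a minimal Python sketch that applies the same pattern to a made-up img tag. The sample markup and the helper name strip_img_size are assumptions for illustration only; Calibre applies the expression itself during conversion.

```python
import re

# Same expression as in the Calibre "Search" field. The leading space is part
# of the pattern, so the space before "width" is removed as well.
PATTERN = re.compile(r' width="(.*?)" height="(.*?)"')


def strip_img_size(markup: str) -> str:
    """Remove a width attribute that is immediately followed by a height attribute."""
    return PATTERN.sub("", markup)


# Hypothetical sample tag, not taken from a real book.
sample = '<img src="cover.jpg" width="600" height="800" alt="Cover"/>'
print(strip_img_size(sample))  # <img src="cover.jpg" alt="Cover"/>
```

Note that the expression only matches when width comes directly before height. A tag that lists the attributes in the other order, or has only one of them, is left unchanged.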

Another trick for better converted files is to reduce the font size of the code listings so that lines do not wrap. To do that, edit the Style01.css file, search for .courprogramlisting, and change font-size:80%! to font-size:40%!
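If you would rather script the stylesheet change than edit it by hand, something along these lines might work. It is only a sketch: the sample CSS is invented, the helper name shrink_listing_font is hypothetical, and the real rule in Style01.css may be laid out differently.

```python
import re

# Matches the font-size percentage inside the .courprogramlisting rule only.
LISTING_FONT_SIZE = re.compile(
    r"(\.courprogramlisting\b[^{]*\{[^}]*?font-size\s*:\s*)\d+%"
)


def shrink_listing_font(css: str, new_size: str = "40%") -> str:
    """Replace the percentage font size of the code-listing rule with new_size."""
    return LISTING_FONT_SIZE.sub(lambda m: m.group(1) + new_size, css)


# Hypothetical stylesheet fragment, not copied from a real book.
sample_css = (
    ".courprogramlisting { font-family: monospace; font-size:80%; }\n"
    "p { font-size:100%; }"
)
print(shrink_listing_font(sample_css))
```

Only the percentage in the .courprogramlisting rule is rewritten; other rules, such as the p rule in the sample, keep their sizes.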

Maybe you could add some of these suggestions in an .md file.

from safaribooks.

lorenzodifuccia commented on July 3, 2024

I'll work on that.
Thanks!

klezm commented on July 3, 2024

Script to fix epubs

I wrote a script that applies the suggestion by @piovac to existing EPUB books:

import pathlib
import shutil
import io
import xml.dom.minidom
import xml.etree.ElementTree as ET
import zipfile
import re
import difflib
from pprint import pprint

# https://medium.com/dev-bits/ultimate-guide-for-working-with-i-o-streams-and-zip-archives-in-python-3-6f3cf96dca50

# #############################  Change this  #############################

paths = r"path/to/your/calibre/library"

# #########################################################################

paths = list(pathlib.Path(paths).rglob("*.epub"))
pprint(paths)

# path_backup = pathlib.Path(r"")
# shutil.copyfile(path_backup, path)

for path in paths:
    print("\n"*0 + f'---------------    Processing book: {path.name}    ---------------')

    zip_updated = io.BytesIO()

    with zipfile.ZipFile(path, "r") as zip:  # read-only: the original archive is only read here
        with zipfile.ZipFile(zip_updated, "w", compression = zipfile.ZIP_DEFLATED) as zip_u:
            files = zip.infolist()

            f_types = ['.xml', '.html', '.xhtml']

            for f in files:
                # print("." * 50, f'  {f.filename}  ', "." * 50)

                # copy not modified files
                if pathlib.Path(f.filename).suffix not in f_types:
                    zip_u.writestr(f, zip.read(f.filename))

                # modify data in archive
                else:
                    data = zip.read(f.filename).decode()
                    data_temp = data

                    # use xml to modify all img tags
                    if False:
                        # data = ET.canonicalize(data, rewrite_prefixes = True)
                        # data_xml = ET.fromstring(data)
                        data_xml = ET.ElementTree(ET.fromstring(data)).getroot()
                        # namespace = data_xml.tag[1:data_xml.tag.index("}")]

                        imgs = data_xml.findall(".//{*}img")  # use XPATH and use wildcard for any namespace
                        for img in imgs:
                            img.attrib.pop("width", None)  # pop with a default: no KeyError if the attribute is missing
                            img.attrib.pop("height", None)
                            # img.tag = img.tag[img.tag.index("}") + 1:]  # remove namespace from tag

                        ET.register_namespace("", xml.dom.XHTML_NAMESPACE)  # prevent namespace before tags
                        # ET.register_namespace("", namespace)
                        data = ET.tostring(data_xml, encoding = "unicode")

                    # use simple regex to remove the attributes
                    else:
                        data = re.sub(r" width=\"(.*?)\" height=\"(.*?)\"", "", data)

                    zip_u.writestr(f, data.encode("utf-8"))

                    if False:  # for debugging purposes
                        TMP_data = xml.dom.minidom.parseString(data).toprettyxml()
                        TMP_data_temp = xml.dom.minidom.parseString(data_temp).toprettyxml()
                        print(*[x for x in difflib.Differ().compare(TMP_data_temp.splitlines(), TMP_data.splitlines())
                                if x.find("<img ") >= 0 and not x.startswith("-")], sep = "\n")
                        print(*list(difflib.context_diff(TMP_data_temp.splitlines(keepends = True),
                                                         TMP_data.splitlines(keepends = True), n = 0)), sep = "")

    # overwrite the original epub with the updated archive
    with open(path, "wb") as out_file:
        out_file.write(zip_updated.getvalue())

    zip_updated.close()

Fix for safaribooks?

Inserting the following code at this line might solve the problem (not tested):

# remove width / height attributes from img tags
imgs = book_content.findall(".//{*}img")  # use XPATH and use wildcard for any namespace
for img in imgs:
    # pop with a default so images without these attributes do not raise KeyError
    img.attrib.pop("width", None)
    img.attrib.pop("height", None)
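The same idea can be tried in isolation with the standard library. This is a sketch under assumptions: it uses xml.etree.ElementTree (Python 3.8+ for the {*} wildcard) rather than whatever parser safaribooks uses, and the sample page and the helper name drop_img_dimensions are made up.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def drop_img_dimensions(xhtml: str) -> str:
    """Return the document with width/height removed from every img element."""
    # Serialize XHTML elements without a namespace prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    for img in root.findall(".//{*}img"):  # {*} matches any namespace
        img.attrib.pop("width", None)   # no error if the attribute is absent
        img.attrib.pop("height", None)
    return ET.tostring(root, encoding="unicode")


# Hypothetical page: one image with dimensions, one without.
page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<img src="a.png" width="10" height="20"/><img src="b.png"/>'
    "</body></html>"
)
result = drop_img_dimensions(page)
print(result)
```

Unlike the regex approach, this does not depend on the order of the attributes, but it does require the page to be well-formed XML.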

SylvainMartel commented on July 3, 2024

Using the replace with \ width=\"(.*?)\" height=\"(.*?)\" solved my problem, thanks! Will a fix make it into the main Python code?
