Comments (14)

Tvangeste commented on June 16, 2024

@soshial, I'm not sure that putting the CRC32/MD5 checksums inside the XDXF file is a good idea. Every time a user modifies such an XDXF file, they must somehow recalculate the checksum, which is annoying and inconvenient and requires external tools.

Personally, I don't think we need any checksumming at all for plain old text files. There are special tools to calculate and verify checksums, so there is no need to put them directly into the dictionary file.
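
To illustrate what "external tools" could look like here, the following is a minimal sketch that computes CRC32 and MD5 for a dictionary file without storing anything inside it. The function name `file_checksums` and the chunk size are assumptions made for this example, not part of any XDXF tool.

```python
import hashlib
import zlib
from pathlib import Path


def file_checksums(path: Path, chunk_size: int = 1 << 16) -> tuple[str, str]:
    """Return (crc32_hex, md5_hex) for a file, reading it in chunks."""
    crc = 0
    md5 = hashlib.md5()
    with path.open("rb") as fh:
        # Read in fixed-size chunks so large dictionaries need little memory.
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
    return f"{crc:08x}", md5.hexdigest()
```

Because the checksums live outside the file, editing the dictionary never invalidates anything stored in the dictionary itself; the values are simply recomputed when needed.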

Mandatory compression is also not as flexible as I'd like. Many users do modify their dictionaries from time to time, correcting typos, adding new entries, etc. Extracting the dictionary, modifying it, and re-compressing it is just too much extra work for such use cases.

Consider GoldenDict. Even while it is running, the user can open an XDXF file, modify it, and then press Ctrl+F5 to rescan the dictionaries, which propagates the changes immediately. There is no need to compress or decompress anything and no need to calculate checksums, which makes this a very fast and convenient workflow.

In fact, users could even provide an editor command line and start editing the files right from the context menu, in a simple text editor.

from xdxf_makedict.

nikita-moor commented on June 16, 2024

imprint all media to the XDXF xml with base64 encoding

Real dictionary observation:

I have a dictionary made of page scans and keys referencing these pages. One page normally contains 3-5 articles, so one image is referenced by several articles.

The Slob format stores images directly in the dictionary file and lets articles reference them as if they were external files: <img src="image.png">. The file size is 62.8 MB.

The same dictionary encoded in StarDict format, with images embedded (base64) into the articles as <img src="data:image/…" />, is 230.2 MB.

Embedding images directly makes the file 3.7 times bigger, because the same images are repeated several times. Comparing different formats is perhaps not entirely fair, but in fact they store data in similar ways. So, think about centralized storage of images and inter-XML links (like abbreviations?).
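
The size difference can be reproduced on made-up data. The sketch below is only an illustration: the page count, page size, number of articles per page, and the reference string are all assumptions, not measurements from the dictionary described above. It compares embedding each page as base64 in every article against storing each page once and referencing it.

```python
import base64
import os

# Hypothetical data: 100 page scans of 10 KB each, 4 articles per page.
pages = [os.urandom(10_000) for _ in range(100)]
articles_per_page = 4

# Option A: every article embeds its page as a base64 data URI.
# base64 alone inflates the payload by roughly one third.
embedded = sum(len(base64.b64encode(p)) * articles_per_page for p in pages)

# Option B: each page is stored once; articles carry only a short reference.
ref = '<img src="page-000.png">'
shared = sum(len(p) for p in pages) + len(ref) * articles_per_page * len(pages)

print(f"embedded: {embedded} bytes, shared: {shared} bytes")
print(f"ratio: {embedded / shared:.1f}")
```

Two effects multiply here: the base64 overhead and the repetition factor (articles per page). With 3-5 articles per page, a ratio in the range reported above (230.2 / 62.8 ≈ 3.7) is plausible once compression and text content are taken into account.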

P.S. I personally support the idea of several files (dictionary, media, css, js) compressed into a zip (or similar) archive. It would make replacing or editing images easier, with no programming needed.

P.P.S. Having two files, one for the dictionary and another for media, seems convenient. But from my experience supporting MDict (two files: mdx + mdd), I can say that users repeatedly ask me "why are there two files" and "which one should I download". So, "one file to rule them all" is better.

Tvangeste commented on June 16, 2024

I'm not sure that this is a good idea. Many dictionary formats explicitly separate the main content and the (media) resources (stardict, dsl+zip in GoldenDict, MDX/MDD in MDict). One of the main reasons is that resources might be huge, taking gigabytes of space. On mobile devices, where size is critical, users are free not to copy the media resources, which still gives them a working dictionary, albeit without the media. Also, some users might decide not to download the huge media files at all and take just the main content.

Personally, I'm also inclined to keep the text content and the binary data in separate files; that would give better flexibility for everybody. Just imagine editing an XML file which is 4 GB in size! :)

soshial commented on June 16, 2024

This is definitely very reasonable: I was inclined to the same opinion. But what do you think about storing the icon and the cover image of the dictionary in the main file? They cannot be as big as the media you describe, so this wouldn't take up much space.

Also, do you think that optionally storing meta_data in a separate file might be a good idea?

Tvangeste commented on June 16, 2024

Hmm, actually I think that storing both the metadata and the icon inside the main content file is the proper approach. There is no need to keep many different files, as this confuses users. Two files (the main content file and the additional media resources file) seem to be the most appropriate approach.

When all the main data is in a single file, it is easier to parse and transfer dictionaries.

As for external files with metadata, this can be done outside of the specification. For example, in GoldenDict we are considering introducing such format-independent files so that users can adjust the dictionary name, provide custom icons, etc., without modifying the original file.

soshial commented on June 16, 2024

I am very grateful for your feedback and opinion, and I agree with you. But do you think it is possible to involve other GoldenDict community members in deciding these important details, so that the best solution is reached?

Speaking further on the topic, I was wondering if packing both dictionary and media files in a simple *.zip archive would be good practice. Some dictionary files (uncompressed articles) may take up dozens of megabytes, and that is with no media files involved. So reading the descriptions from a series of dictionaries would imply unpacking gigabytes of data. Do you think we need to store dicts in archives or not?

UPD. This unpacking might also be important, since when GoldenDict indexes the dictionary files it remembers the exact file offsets for each word-article, and I'm not sure that is possible with compressed files.

Tvangeste commented on June 16, 2024

do you think it is possible to involve other GoldenDict community members in deciding these important details, so that the best solution is reached?

You could always summon them via @goldendict/developers, if that works from an external repository (probably not).

I was wondering if packing both dictionary and media files in a simple *.zip archive would be good practice.

Nope, it won't work. We need offset-based access to the main content, and zip doesn't provide that. For that we use dictzip, which does.

In short, the main content (the dictionary itself) can/should be compressed with dictzip, and the media resources (images, audio, video) can/should be compressed with regular zip (but one needs to be careful about file name encoding in such a zip file).

This unpacking might be also important

With dictzip this is not a problem. GoldenDict already handles, e.g., dsl.dz (dictzip compressed DSL files) with no issues.
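
The idea behind offset-based access to compressed data can be sketched as follows. This is not the dictzip file format and not GoldenDict's code; it only illustrates the principle of compressing fixed-size chunks independently and keeping a table of where each one starts. The names `compress_chunked` and `read_range` and the 64 KiB chunk size are assumptions made for this example.

```python
import zlib

CHUNK = 64 * 1024  # uncompressed chunk size; an illustrative choice


def compress_chunked(data: bytes) -> tuple[bytes, list[tuple[int, int]]]:
    """Compress data chunk by chunk; return the blob and (offset, length) per chunk."""
    blob = bytearray()
    index = []
    for start in range(0, len(data), CHUNK):
        packed = zlib.compress(data[start:start + CHUNK])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob: bytes, index: list[tuple[int, int]], offset: int, size: int) -> bytes:
    """Read `size` bytes at uncompressed `offset`, inflating only the chunks needed."""
    if size <= 0:
        return b""
    first = offset // CHUNK
    last = (offset + size - 1) // CHUNK
    out = bytearray()
    for pos, length in index[first:last + 1]:
        out += zlib.decompress(blob[pos:pos + length])
    skip = offset - first * CHUNK
    return bytes(out[skip:skip + size])
```

An indexer can keep storing offsets into the uncompressed text, as it would for a plain file; looking up one article then costs the decompression of one or two chunks rather than the whole dictionary.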

Tvangeste commented on June 16, 2024

I'd say that the ability to compress the XDXF dictionary (via dictzip) is a matter external to the XDXF specification. Some tools might decide to handle such compressed dictionaries, while others might prefer other means or only handle uncompressed data.

soshial commented on June 16, 2024

Wow, thank you so much for telling me about this dictzip software: I was wondering why they have the *.dz extension.
But is it possible to put multiple files in it (I don't mean media files, but the meta info, for example)?

PS. Please let's also discuss tables/grammar issue #5

soshial commented on June 16, 2024

I was also thinking: since some people would want to use xdxf files without *.gz, we would need to have our file CRC32 checksummed, but we cannot checksum the file if the checksum must already be in the meta_info section before we start computing it. Haha =)

Maybe *.gz should be made obligatory?
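
The circularity described above can be made concrete. One common way around it, which is not proposed in this thread and is shown here only as a sketch, is to compute the checksum over the document with the checksum field blanked out. The element name `file_crc32` is hypothetical and is not taken from the XDXF specification.

```python
import re
import zlib

# Hypothetical element name; not part of the XDXF specification.
FIELD = re.compile(rb"<file_crc32>([0-9a-f]{8})</file_crc32>")
BLANK = b"<file_crc32>00000000</file_crc32>"


def crc_with_blank_field(doc: bytes) -> str:
    """CRC32 of the document with the checksum field replaced by zeros."""
    return f"{zlib.crc32(FIELD.sub(BLANK, doc)):08x}"


def stamp(doc: bytes) -> bytes:
    """Write the checksum into the (already present) checksum field."""
    value = crc_with_blank_field(doc).encode("ascii")
    return FIELD.sub(b"<file_crc32>" + value + b"</file_crc32>", doc)


def verify(doc: bytes) -> bool:
    """Check that the stored value matches the blank-field checksum."""
    match = FIELD.search(doc)
    return match is not None and match.group(1).decode("ascii") == crc_with_blank_field(doc)
```

This removes the chicken-and-egg problem, but not the objection raised earlier in the thread: every manual edit still requires re-stamping the file with some tool.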

soshial commented on June 16, 2024

Very good point, thank you.

ceefour commented on June 16, 2024

I like treating XDXF artifacts just like Open/LibreOffice documents or Java JAR/WAR files, i.e. conceptually they're "a directory tree containing at least an .xdxf file, with one or more media files".

Whether these are:

  1. expanded (as actual directory tree on the filesystem)
  2. accessed as URIs (so it's possible to load an XDXF remotely and then load any referenced media file on-demand, in this case XDXF acts like HTML with img src's)
  3. compressed using ZIP (which is a good format due to its ubiquity & accessible content listing, less compression than gzip/bzip2/xz but I guess it's OK)

is a "deployment detail"; tools should be able to access any of these uniformly (just like in Java, where the program doesn't care where or how you put a dependency class, as long as it's available on the classpath).
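
A minimal sketch of such uniform access, covering only the directory and ZIP cases from the list above (remote URIs are left out to keep it self-contained). All class and function names here are assumptions made for this example.

```python
import zipfile
from pathlib import Path
from typing import Protocol


class Container(Protocol):
    """Anything that can hand out a named member of a dictionary package."""

    def read(self, name: str) -> bytes: ...


class DirContainer:
    """Dictionary expanded as a directory tree on the filesystem."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def read(self, name: str) -> bytes:
        return (self.root / name).read_bytes()


class ZipContainer:
    """Dictionary packed into a ZIP archive."""

    def __init__(self, archive: Path) -> None:
        self.archive = archive

    def read(self, name: str) -> bytes:
        with zipfile.ZipFile(self.archive) as zf:
            return zf.read(name)


def open_container(location: Path) -> Container:
    """Pick the container type from what is found at `location`."""
    return DirContainer(location) if location.is_dir() else ZipContainer(location)
```

Code that renders an article would call `open_container(path).read("graphic/a.png")` and never needs to know which deployment form it was given.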

soshial commented on June 16, 2024

@Tvangeste isn't it reasonable to put media into dictzip as well, since most images and sounds can fit into one or several 64kb blocks? This way media files would be randomly accessible too.

bmix commented on June 16, 2024

The Open Container Format (used in ePub) is a mature (v3.2) W3C standard, which is very similar to the OpenDocument container format, the Java Archive (JAR) format and many others.

It describes a ZIP archive in which the very first file contains the media type (for example application/epub+zip) in plain text (ASCII) and stays uncompressed.

There must be a META-INF directory, which contains the metadata that the file format needs. The rest is specified for the requirements of ePub documents.

Other familiar containers are the various Java archive formats (JAR, WAR, etc.) and the Open Document Format.

I would base XDXF files strictly on this format (the spec has already been written and tested, so there is no need to brew a new one, and it makes things easy for users), but adapt the file and directory names where needed.

eng-ita.xdz
|__ mimetype
|__ META-INF
|  |__(an XML file laying out the physical structure, like, where to find which kind of asset)
|__ dictionary.xdxf
|__ graphic/   (optional)
|__ audio/     (optional)
|__ transform/ (optional)
|  |__ html.xsl
|  |__ pdf.xsl
|  |__ dict.xsl
|__ whatever is needed/  (optional)
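
A package following the layout above could be produced with nothing more than a ZIP library. The sketch below is only an illustration: the media type string is hypothetical (no such type is registered for XDXF), the function names are assumptions, and the META-INF content is left out because the thread does not define it.

```python
import zipfile

# Hypothetical media type, used only for illustration.
MEDIA_TYPE = "application/x-xdxf+zip"


def build_package(path: str, dictionary_xml: bytes, assets: dict[str, bytes]) -> None:
    """Write an OCF-style ZIP: mimetype first and stored, everything else deflated."""
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry goes first and is stored without compression,
        # so a reader can identify the package from the start of the file.
        zf.writestr(zipfile.ZipInfo("mimetype"), MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("dictionary.xdxf", dictionary_xml, compress_type=zipfile.ZIP_DEFLATED)
        for name, data in assets.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


def read_media_type(path: str) -> str:
    """Return the media type, rejecting archives that do not follow the convention."""
    with zipfile.ZipFile(path) as zf:
        first = zf.infolist()[0]
        if first.filename != "mimetype" or first.compress_type != zipfile.ZIP_STORED:
            raise ValueError("not an OCF-style package")
        return zf.read(first).decode("ascii")
```

Since the result is an ordinary ZIP file, users can still open it with any archive manager to replace an image or a stylesheet.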

In addition, I would specify a flat-file XDXF for dictionaries that do not need any assets and can be transported safely as pure XML.

Assets would always be linked. And I would adopt XLink for any linking in XDXF.
