
Comments (8)

ceztko avatar ceztko commented on June 12, 2024

Hello, there may be multiple problems, but I think the most relevant one is the missing support for predefined CMap encodings, for which I just added an entry in the TODO (it was already a known missing feature). Embedding all the known predefined CMaps, which I believe should be in this repository, is quite a big task. My main idea would be to parse all the CMaps into PdfCharCodeMap (the code that does the CMap parsing is here, but it may need refactoring so it is callable from elsewhere) and define some kind of binary serialization so the maps can be efficiently embedded in PoDoFo. Then implement the predefined CMap resolution algorithm as described in the specification, and I believe this problem should be addressed. I would be very glad to see contributions on this matter, as I definitely don't have time to put into the task, I'm sorry.

from podofo.

tayei1997 avatar tayei1997 commented on June 12, 2024

Thanks for your answer.

So the problem now is that podofo can extract the binary-encoded data for the text in the image below but, lacking the corresponding CMap, cannot decode it correctly.
[Image: WeCom screenshot 17107287983628]
If I want to decode it myself, I first need the operand data that precedes the TJ operator. Can I use podofo to get that data?

ceztko avatar ceztko commented on June 12, 2024

You can have a look at how PdfContentStreamReader is used here. But this project would benefit if you tried to implement the system I suggested within the PoDoFo source (or at least a prototype of it in a fork). I recently received some very good contributions from a couple of Chinese users who had issues trying to draw text: I appreciated their level of competence, and their PRs have already been merged.

tayei1997 avatar tayei1997 commented on June 12, 2024

I have the will to carry out the development, so I would like to ask a few questions:

  1. if all the CMap mappings are embedded in podofo, will it cause the memory usage of podofo to become higher?
  2. I lack the CMap related development experience you mentioned, so it is difficult to estimate the time needed for this work, how many man-days do you think it will take to complete this work?

ceztko avatar ceztko commented on June 12, 2024

Hello. Sorry for the delay in answering; it took me some more time to do further analysis. First, let me confirm that the issue here really is the missing embedding of the predefined CMap encodings. I'll try to answer your questions below.

  1. if all the CMap mappings are embedded in podofo, will it cause the memory usage of podofo to become higher?

Yes, but there are options to reduce the memory consumption by embedding pre-parsed maps. See below.
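One possible way to keep embedded maps cheap, sketched here under assumptions: store each map as a constant, sorted table of code ranges and look codes up with a binary search, so nothing has to be parsed or heap-allocated when the library loads. This is a different layout from the CodeUnitMap-based singletons proposed in this thread, not PoDoFo's actual design. CidRange, kDemoRanges and LookupCid are hypothetical names, and the table values are made up for illustration rather than taken from a real CMap.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical compact record: one contiguous run of character codes
// mapped to consecutive CIDs (the shape of a CMap "cidrange" entry).
struct CidRange {
    std::uint32_t firstCode;
    std::uint32_t lastCode;
    std::uint32_t firstCid;
};

// Illustrative values only, NOT taken from a real CMap.
// The table must be sorted by firstCode and ranges must not overlap.
constexpr std::array<CidRange, 3> kDemoRanges{{
    {0x0020, 0x007E, 1},
    {0x00A0, 0x00FF, 200},
    {0x4E00, 0x4E0F, 4000},
}};

// Returns the CID for `code`, or std::nullopt when no range covers it.
template <std::size_t N>
std::optional<std::uint32_t> LookupCid(const std::array<CidRange, N>& ranges,
                                       std::uint32_t code)
{
    // Debug-only sanity check (linear cost): the binary search below
    // is only valid on a table sorted by firstCode.
    assert(std::is_sorted(ranges.begin(), ranges.end(),
        [](const CidRange& a, const CidRange& b) { return a.firstCode < b.firstCode; }));

    // Find the first range starting after `code`, then step back one range.
    auto it = std::upper_bound(ranges.begin(), ranges.end(), code,
        [](std::uint32_t c, const CidRange& r) { return c < r.firstCode; });
    if (it == ranges.begin())
        return std::nullopt; // code is below every range
    --it;
    if (code > it->lastCode)
        return std::nullopt; // code falls in a gap between ranges
    return it->firstCid + (code - it->firstCode);
}
```

A constant table like this typically ends up in the read-only data section of the binary, so the cost is mostly file size rather than per-process heap usage. Whether that trade-off suits PoDoFo is a design question for the maintainer.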

  2. I lack the CMap related development experience you mentioned, so it is difficult to estimate the time needed for this work, how many man-days do you think it will take to complete this work?

It's hard, but let's try to create some tasks and (possibly over-)estimate them:

  1. [4 Hours] Refactor the CMap parsing code so it can be used to build a tool that bulk-parses many CMaps;

  2. [4 Hours] Make PdfCharCodeMap initializable from a CodeUnitMap. This may remove the need to define a binary serialization of the map, as I was suggesting before. Basically, you can add a constructor to PdfCharCodeMap like the following:

PdfCharCodeMap(CodeUnitMap&& codeUnitMap);

Which you can use to define many singletons like the following:

static const PdfCharCodeMap& GetInstance_UniGB_UCS2_H()
{
    static PdfCharCodeMap UniGB_UCS2_H(CodeUnitMap({
        { PdfCharCode(32, 2), { 1 } },
        { PdfCharCode(33, 2), { 2 } },
        { PdfCharCode(34, 2), { 3 } }
        // ..
        }));

    return UniGB_UCS2_H;
}
  3. [8 Hours] Make a tool that parses the CMaps and generates the singletons above in many .cpp files.
  4. [8 Hours] Create a script that runs the tool above on the existing CMaps from the cmap-resources and mapping-resources-pdf repositories (both should be needed, in 2 steps).
  5. [32 Hours] Implement the algorithm described in "9.10.2 Mapping character codes to Unicode values" below:

If the font is a composite font that uses one of the predefined CMaps listed in "Table 116 - Predefined CJK CMap names" (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection:
a. Map the character code to a character identifier (CID) according to the font’s CMap.
b. Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c. Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe–Japan1–UCS2).
d. Obtain the CMap with the name constructed in step (c) (available from a variety of online sources, e.g. https://github.com/adobe-type-tools/mapping-resources-pdf).
e. Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the PDF processor.
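Going back to the generator tool in the task list above, a minimal sketch of its output stage could look like the following. It assumes the CMap has already been parsed into a flat list of entries; ParsedMapping, ToIdentifier and GenerateSingleton are hypothetical names, and the emitted text simply mirrors the singleton shape shown earlier, including the PdfCharCodeMap(CodeUnitMap&&) constructor that is only proposed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result of the parsing step: one code -> value entry.
struct ParsedMapping {
    std::uint32_t code;     // character code
    unsigned codeSize;      // code length in bytes
    std::uint32_t value;    // mapped value (a CID or a code point)
};

// CMap names such as "UniGB-UCS2-H" contain '-', which is not valid in
// a C++ identifier, so every non-alphanumeric character becomes '_'.
std::string ToIdentifier(const std::string& cmapName)
{
    std::string id = cmapName;
    for (char& ch : id) {
        const bool alnum = (ch >= '0' && ch <= '9')
            || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
        if (!alnum)
            ch = '_';
    }
    return id;
}

// Emits the source text of one singleton accessor for the given map.
std::string GenerateSingleton(const std::string& cmapName,
                              const std::vector<ParsedMapping>& mappings)
{
    assert(!cmapName.empty());
    const std::string id = ToIdentifier(cmapName);
    std::ostringstream out;
    out << "static const PdfCharCodeMap& GetInstance_" << id << "()\n{\n";
    out << "    static PdfCharCodeMap " << id << "(CodeUnitMap({\n";
    for (std::size_t i = 0; i < mappings.size(); i++) {
        const ParsedMapping& m = mappings[i];
        out << "        { PdfCharCode(" << m.code << ", " << m.codeSize
            << "), { " << m.value << " } }"
            << (i + 1 < mappings.size() ? "," : "") << "\n";
    }
    out << "        }));\n\n    return " << id << ";\n}\n";
    return out.str();
}
```

The script in the following task would then call something like GenerateSingleton once per CMap file and write each result to its own .cpp file.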

Translated into PoDoFo's architecture, I believe a PdfEncoding instance has to be constructed from the embedded maps once the /Encoding entry is recognized as one of the predefined names. I believe the code-to-CID CMap needed in step a. is to be found in cmap-resources, while the "toUnicode" CMap needed in step d. is to be found in mapping-resources-pdf. You then construct an instance like PdfEncoding(cidMap, toUnicode) (the name detection and instance construction should probably be inserted at this location in the source) and text extraction should start to work.
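The five steps quoted from the specification can be sketched with plain standard-library containers, as below. This is only an illustration of the lookup chain, not PoDoFo code: CodeToCidMap, CidToUnicodeMap, CidSystemInfo, MakeUcs2CMapName and MapCodeToUnicode are hypothetical names, std::map stands in for the embedded maps, and each CID is assumed to map to a single code point, which is a simplification.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for a predefined code-to-CID CMap (the cmap-resources side).
using CodeToCidMap = std::map<std::uint32_t, std::uint32_t>;
// Stand-in for a "<registry>-<ordering>-UCS2" CMap (mapping-resources-pdf side).
// Simplification: one code point per CID.
using CidToUnicodeMap = std::map<std::uint32_t, char32_t>;

// The two CIDSystemInfo fields used in step b.
struct CidSystemInfo {
    std::string registry;
    std::string ordering;
};

// Step c: build the name of the second CMap, e.g. "Adobe-Japan1-UCS2".
std::string MakeUcs2CMapName(const CidSystemInfo& info)
{
    assert(!info.registry.empty() && !info.ordering.empty());
    return info.registry + "-" + info.ordering + "-UCS2";
}

// Steps a, d and e; returns std::nullopt when any lookup fails.
std::optional<char32_t> MapCodeToUnicode(
    std::uint32_t code,
    const CodeToCidMap& codeToCid,
    const CidSystemInfo& info,
    const std::map<std::string, CidToUnicodeMap>& ucs2Maps)
{
    // Step a: character code -> CID through the font's CMap.
    const auto cidIt = codeToCid.find(code);
    if (cidIt == codeToCid.end())
        return std::nullopt;

    // Steps b to d: find the UCS2 CMap named after registry and ordering.
    const auto mapIt = ucs2Maps.find(MakeUcs2CMapName(info));
    if (mapIt == ucs2Maps.end())
        return std::nullopt;

    // Step e: CID -> Unicode value through the UCS2 CMap.
    const auto uniIt = mapIt->second.find(cidIt->second);
    if (uniIt == mapIt->second.end())
        return std::nullopt;
    return uniIt->second;
}
```

In the real library the two maps would come from the embedded singletons, and the result would feed the PdfEncoding(cidMap, toUnicode) construction mentioned above.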

Summarizing, I believe 7-8 man-days is a decent estimate of the work needed to accomplish the task. Following the above approach would make me more willing to fast-track the review and merge of a prototype solving the problem. The more the approach differs, the less comfortable I may be reviewing your work.

ceztko avatar ceztko commented on June 12, 2024

Have you considered whether you are willing to take on the above activities? 7-8 days may be an overestimate, and if you are quick enough it could take less (but remember I would also like to see a few unit tests for this work).

tayei1997 avatar tayei1997 commented on June 12, 2024

Hi, I have the intention of completing the above functionality, but I must state that as I can only develop the relevant code outside of my official working hours, and due to my lack of experience in this development, I cannot offer a guarantee as to the time of completion.

ceztko avatar ceztko commented on June 12, 2024

OK. I'm sorry for the unsolicited advice: I don't know what your job is, but if a company is paying you to work on PDF-related topics, I still recommend that you not work outside official hours when the work ultimately benefits them. In this way, companies using open source software "for free" become more responsible, and the software itself improves in a more professional way.
