Hello, I am using podofo library provides pdf text extraction function, encountered a

Extract PDF file results in a garbled code

tayei1997 commented on June 12, 2024

Extract PDF file results in a garbled code

Comments (8)

ceztko commented on June 12, 2024

Hello, there may be multiple problems, but I think the most relevant one is the missing support for predefined CMap encodings, for which I just added an entry in the TODO (it was already a known missing feature). Embedding all the known predefined CMaps, which I believe should be in this repository, is quite a big task. My main idea would be to parse all the CMaps into PdfCharCodeMap instances (the code that does the CMap parsing is here, but may be refactored so it's callable from elsewhere) and define some kind of binary serialization so they can be embedded efficiently in PoDoFo. Then, implement the predefined CMap resolution algorithm as described in the specification, and I believe this problem should be addressed. I would be very glad to see contributions on this matter, as I definitely don't have time to put into the task, I'm sorry.

tayei1997 commented on June 12, 2024

Thanks for your answer.

So the problem is that podofo can extract the binary-encoded data for the text in the image below but, lacking the corresponding CMap, cannot decode it correctly.

If I want to decode it myself, I first need to get the data that precedes the TJ operator. Can I use podofo to get that data?

ceztko commented on June 12, 2024

You can have a look at the use of PdfContentStreamReader here. However, this project would benefit if you tried to implement the system I suggested within the PoDoFo source (or at least a prototype of it in a fork). I recently received some very good contributions from a couple of Chinese users who had issues trying to draw text: I appreciated their level of competence, and their PRs have already been merged.
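Once the string operands that precede TJ have been read from the content stream, decoding them by hand starts with splitting the raw bytes into character codes. The following is a minimal sketch of that step only. SplitCharCodes is a hypothetical helper and not part of PoDoFo, and it assumes every code has the same fixed width, which holds for 2-byte CMaps such as UniGB-UCS2-H but not for mixed-width encodings. Reading the operands through PdfContentStreamReader is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not a PoDoFo API: split the raw bytes of one string
// operand into fixed-width, big-endian character codes.
// Assumption: every code is exactly `codeSize` bytes long.
std::vector<std::uint32_t> SplitCharCodes(const std::string& raw, std::size_t codeSize)
{
    if (codeSize == 0 || codeSize > 4)
        throw std::invalid_argument("codeSize must be between 1 and 4");
    if (raw.size() % codeSize != 0)
        throw std::invalid_argument("operand length is not a multiple of codeSize");

    std::vector<std::uint32_t> codes;
    codes.reserve(raw.size() / codeSize);
    for (std::size_t i = 0; i < raw.size(); i += codeSize)
    {
        // Accumulate the bytes of one code, most significant byte first
        std::uint32_t code = 0;
        for (std::size_t j = 0; j < codeSize; j++)
            code = (code << 8) | static_cast<unsigned char>(raw[i + j]);
        codes.push_back(code);
    }
    return codes;
}
```

Each resulting code still has to be mapped through the appropriate CMap before it yields readable text.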

tayei1997 commented on June 12, 2024

I am willing to carry out the development, so I would like to ask a few questions:

If all the CMap mappings are embedded in podofo, will podofo's memory usage increase?
I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?

ceztko commented on June 12, 2024

Hello. Sorry for the delay in answering; it took me some time to do further analysis. First, let me confirm that the issue here really is the missing embedding of the predefined CMap encodings. I will try to answer your questions below.

If all the CMap mappings are embedded in podofo, will podofo's memory usage increase?

Yes, but there are options to reduce the memory consumption by embedding pre-parsed maps. See below.

I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?

It's hard, but let's try to create some tasks and (possibly over-)estimate them:

[4 Hours] Factor out the CMap parsing code so it can be used to build a tool that bulk-parses many CMaps;
[4 Hours] Make PdfCharCodeMap constructible from a CodeUnitMap. This may remove the need to define a binary serialization of the map, as I was suggesting before. Basically, you can add a constructor to PdfCharCodeMap like the following:

PdfCharCodeMap(CodeUnitMap&& codeUnitMap);

Which you can use to define many singletons like the following:

static const PdfCharCodeMap& GetInstance_UniGB_UCS2_H()
{
    static PdfCharCodeMap UniGB_UCS2_H(CodeUnitMap({
        { PdfCharCode(32, 2), { 1 } },
        { PdfCharCode(33, 2), { 2 } },
        { PdfCharCode(34, 2), { 3 } }
        // ..
        }));

    return UniGB_UCS2_H;
}

[8 Hours] Make a tool that parses the CMaps and generates the singletons above in many .cpp files.
[8 Hours] Create a script that runs the tool above on the existing CMaps from the cmap-resources and mapping-resources-pdf repositories (both should be needed, in 2 steps).
[32 Hours] Implement the algorithm described in "9.10.2 Mapping character codes to Unicode values" below:

If the font is a composite font that uses one of the predefined CMaps listed in "Table 116 - Predefined CJK CMap names" (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection:
a. Map the character code to a character identifier (CID) according to the font’s CMap.
b. Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c. Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe–Japan1–UCS2).
d. Obtain the CMap with the name constructed in step (c) (available from a variety of online sources, e.g. https://github.com/adobe-type-tools/mapping-resources-pdf).
e. Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the PDF processor.
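Steps a. to e. above can be sketched roughly as follows. This is only an illustration using plain standard-library containers: CodeToCidMap, CidToUnicodeMap, CidSystemInfo, MakeUcs2CMapName and MapCodeToUnicode are hypothetical names that do not exist in PoDoFo, and the sketch assumes that the constructed CMap name uses ASCII hyphens, as the resource files in mapping-resources-pdf do.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in types for illustration only; these are not PoDoFo classes.
using CodeToCidMap = std::map<std::uint32_t, std::uint32_t>;  // character code -> CID
using CidToUnicodeMap = std::map<std::uint32_t, char32_t>;    // CID -> Unicode code point

struct CidSystemInfo
{
    std::string Registry;  // e.g. "Adobe"
    std::string Ordering;  // e.g. "Japan1"
};

// Steps (b) and (c): build the second CMap name, "registry-ordering-UCS2".
std::string MakeUcs2CMapName(const CidSystemInfo& info)
{
    return info.Registry + "-" + info.Ordering + "-UCS2";
}

// Steps (a), (d) and (e): code -> CID through the font's CMap, then
// CID -> Unicode through the UCS2 CMap found by name.
// Returns std::nullopt when any of the lookups fails.
std::optional<char32_t> MapCodeToUnicode(
    std::uint32_t code,
    const CodeToCidMap& fontCMap,
    const CidSystemInfo& info,
    const std::map<std::string, CidToUnicodeMap>& ucs2CMaps)
{
    auto cid = fontCMap.find(code);                      // step (a)
    if (cid == fontCMap.end())
        return std::nullopt;

    auto ucs2 = ucs2CMaps.find(MakeUcs2CMapName(info));  // steps (b) to (d)
    if (ucs2 == ucs2CMaps.end())
        return std::nullopt;

    auto unicode = ucs2->second.find(cid->second);       // step (e)
    if (unicode == ucs2->second.end())
        return std::nullopt;
    return unicode->second;
}
```

A real implementation would also have to handle CMap ranges, multi-code-point mappings and the supplement number, none of which this sketch covers.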

Translated into the PoDoFo architecture, I believe a PdfEncoding instance has to be constructed from the embedded maps when the /Encoding entry is recognized as one of the predefined names. I believe the code-to-CID CMap needed in step a. is to be found in cmap-resources, while the "toUnicode" CMap needed in step d. is to be found in mapping-resources-pdf. You then construct an instance like PdfEncoding(cidMap, toUnicode) (the name detection and instance construction should probably be inserted at this location in the source) and text extraction should start to work.
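The name detection could look something like the sketch below. NeedsPredefinedCMapLookup and kPredefinedCMapNames are hypothetical names, not existing PoDoFo code, and the list is only a small subset of "Table 116 - Predefined CJK CMap names" chosen for illustration. Identity-H and Identity-V are rejected explicitly because the specification text quoted above excludes them.

```cpp
#include <algorithm>
#include <array>
#include <string_view>

// Partial list for illustration only; the complete set is given in
// "Table 116 - Predefined CJK CMap names" of the specification.
constexpr std::array<std::string_view, 6> kPredefinedCMapNames = {
    "GBK-EUC-H", "UniGB-UCS2-H", "UniCNS-UCS2-H",
    "90ms-RKSJ-H", "UniJIS-UCS2-H", "UniKS-UCS2-H",
};

// Returns true when the /Encoding name is a predefined CMap (from the
// partial list above) that requires the embedded maps to be looked up.
bool NeedsPredefinedCMapLookup(std::string_view encodingName)
{
    // Identity-H and Identity-V are excluded from the procedure
    if (encodingName == "Identity-H" || encodingName == "Identity-V")
        return false;

    return std::find(kPredefinedCMapNames.begin(), kPredefinedCMapNames.end(), encodingName)
        != kPredefinedCMapNames.end();
}
```

When this returns true, the matching embedded maps would be fetched and passed to the PdfEncoding constructor described above.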

In summary, I believe 7-8 man-days is a decent estimate of the work needed to accomplish the task. Following the above approach would make me more willing to fast-track the review and merge of a prototype solving the problem. The more the approach differs, the less comfortable I may be reviewing your work.
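As a rough illustration of the generator tool from the task list, the sketch below emits the source text of one singleton in the style of the GetInstance_UniGB_UCS2_H example. ParsedEntry, ToIdentifier and GenerateSingleton are hypothetical names, and the sketch assumes the CMap has already been parsed into a flat list of entries, each mapping a code to a single value.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One parsed CMap entry: a character code, its size in bytes and the value
// it maps to. Hypothetical struct for this sketch, not a PoDoFo type.
struct ParsedEntry
{
    std::uint32_t Code;
    unsigned CodeSize;
    std::uint32_t Value;
};

// Turn a CMap name such as "UniGB-UCS2-H" into a valid C++ identifier.
std::string ToIdentifier(const std::string& cmapName)
{
    std::string id = cmapName;
    for (char& ch : id)
    {
        if (ch == '-')
            ch = '_';
    }
    return id;
}

// Emit the source of one singleton accessor for the given entries.
std::string GenerateSingleton(const std::string& cmapName, const std::vector<ParsedEntry>& entries)
{
    const std::string id = ToIdentifier(cmapName);
    std::ostringstream out;
    out << "static const PdfCharCodeMap& GetInstance_" << id << "()\n{\n";
    out << "    static PdfCharCodeMap " << id << "(CodeUnitMap({\n";
    for (std::size_t i = 0; i < entries.size(); i++)
    {
        const ParsedEntry& e = entries[i];
        out << "        { PdfCharCode(" << e.Code << ", " << e.CodeSize << "), { " << e.Value << " } }";
        // Every entry except the last one is followed by a comma
        out << (i + 1 < entries.size() ? ",\n" : "\n");
    }
    out << "    }));\n\n    return " << id << ";\n}\n";
    return out.str();
}
```

Writing the returned string into one .cpp file per CMap would give the "many .cpp files" layout mentioned in the task list.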

ceztko commented on June 12, 2024

Have you considered whether you are willing to implement the above activities? 7-8 days may be an overestimate, and if you are quick it could take less (but remember I would also like to see a few unit tests for this work).

tayei1997 commented on June 12, 2024

Hi, I intend to complete the above functionality, but I must state that, as I can only develop the relevant code outside of my official working hours and lack experience in this kind of development, I cannot guarantee a completion date.

ceztko commented on June 12, 2024

Ok. I'm sorry for the unsolicited advice: I don't know what your job is, but if a company is paying you to work on PDF-related topics, I still recommend that you not work outside official hours if the work ultimately benefits them. That way, companies using open source software "for free" become more responsible, and the software itself improves in a more professional way.
