Comments (8)
Hello, there may be multiple problems, but I think the most relevant one is the missing support for predefined CMap encodings, for which I just added an entry in the TODO (it was already a known missing feature). Embedding all the known predefined CMaps, which I believe should live in this repository, is quite a big task. My main idea would be to parse all the CMaps into PdfCharCodeMap (the code that does the CMap parsing is here, but may need to be refactored so it's callable from elsewhere) and define some kind of binary serialization so they can be efficiently embedded in PoDoFo. Then implement the predefined CMap resolution algorithm as described in the specification, and I believe this problem should be addressed. I would be very glad to see contributions on this matter, as I definitely don't have time to put into the task, I'm sorry.
from podofo.
Thanks for your answer.
So the problem now is that podofo can extract the binary-encoded data for the text in the image below but, lacking the corresponding CMap, cannot decode it correctly.
If I want to decode it myself, I first need the raw data that precedes the TJ operator. Can I use podofo to get that pre-TJ data?
from podofo.
You can have a look at how PdfContentStreamReader is used here. But this project would benefit if you tried to implement the system I suggested within the PoDoFo source (at least a prototype in a fork). I recently received some very good contributions from a couple of Chinese users who had issues drawing text: I appreciated their level of competence, and their PRs have already been merged.
from podofo.
I am willing to carry out the development, so I would like to ask a few questions:
- If all the CMap mappings are embedded in podofo, will podofo's memory usage become higher?
- I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?
from podofo.
Hello. Sorry for the delay in answering; it took me some more time to do further analysis. First, let me confirm that the issue here really is the missing embedding of predefined CMap encodings. I'll try to answer your questions below.
- If all the CMap mappings are embedded in podofo, will podofo's memory usage become higher?
Yes, but memory consumption can be reduced by embedding pre-parsed maps. See below.
- I lack the CMap-related development experience you mentioned, so it is difficult for me to estimate the time needed. How many man-days do you think this work will take?
It's hard to say, but let's try to create some tasks and (possibly over-)estimate them:
- [4 Hours] Factor out the CMap parsing code so it can be used to build a tool that bulk-parses many CMaps;
- [4 Hours] Make PdfCharCodeMap initializable from a CodeUnitMap. This may remove the need to define a binary serialization of the map, as I was suggesting before. Basically, you can add a constructor to PdfCharCodeMap like the following:
    PdfCharCodeMap(CodeUnitMap&& codeUnitMap);
  which you can use to define many singletons like the following:
    static const PdfCharCodeMap& GetInstance_UniGB_UCS2_H()
    {
        static PdfCharCodeMap UniGB_UCS2_H(CodeUnitMap({
            { PdfCharCode(32, 2), { 1 } },
            { PdfCharCode(33, 2), { 2 } },
            { PdfCharCode(34, 2), { 3 } }
            // ..
        }));
        return UniGB_UCS2_H;
    }
- [8 Hours] Make a tool that parses the CMaps and generates the singletons above in many .cpp files;
- [8 Hours] Create a script to run the tool above on the existing CMaps from the cmap-resources and mapping-resources-pdf repositories (both should be needed, in 2 steps);
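As a rough illustration of the generator tool in the tasks above, here is a minimal sketch that turns already-parsed code ranges into the text of one singleton accessor. `CidRange` and `EmitSingleton` are hypothetical names invented for this example, not PoDoFo API; the emitted text simply follows the `PdfCharCode(code, size), { value }` entry format shown in the thread, and the parsing of the CMap file itself is assumed to have happened elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified view of one parsed CMap range:
// codes lo..hi map to consecutive values starting at firstValue.
struct CidRange
{
    uint32_t lo;
    uint32_t hi;
    uint32_t firstValue;
    unsigned codeSize; // bytes per character code, e.g. 2
};

// Emit the source text of one singleton accessor for the given ranges.
// "identifier" must be a valid C++ identifier, e.g. "UniGB_UCS2_H".
std::string EmitSingleton(const std::string& identifier,
                          const std::vector<CidRange>& ranges)
{
    std::ostringstream out;
    out << "static const PdfCharCodeMap& GetInstance_" << identifier << "()\n{\n";
    out << "    static PdfCharCodeMap " << identifier << "(CodeUnitMap({\n";
    for (const auto& range : ranges)
    {
        assert(range.lo <= range.hi);
        // Loop with an explicit break so hi == UINT32_MAX cannot overflow
        for (uint32_t code = range.lo; ; code++)
        {
            out << "        { PdfCharCode(" << code << ", " << range.codeSize
                << "), { " << (range.firstValue + (code - range.lo)) << " } },\n";
            if (code == range.hi)
                break;
        }
    }
    out << "    }));\n";
    out << "    return " << identifier << ";\n}\n";
    return out.str();
}
```

Calling `EmitSingleton("UniGB_UCS2_H", ranges)` with a single range covering codes 32 to 34 and first value 1 yields the three entries from the example above. A real tool would probably split very large maps across several .cpp files to keep compile times manageable.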
- [32 Hours] Implement the algorithm described in "9.10.2 Mapping character codes to Unicode values" below:
If the font is a composite font that uses one of the predefined CMaps listed in "Table 116 - Predefined CJK CMap names" (except Identity-H and Identity-V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection:
a. Map the character code to a character identifier (CID) according to the font’s CMap.
b. Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.
c. Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry-ordering-UCS2 (for example, Adobe-Japan1-UCS2).
d. Obtain the CMap with the name constructed in step (c) (available from a variety of online sources, e.g. https://github.com/adobe-type-tools/mapping-resources-pdf).
e. Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.
Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, Adobe-Korea1 (deprecated in PDF 2.0 (2020)) or Adobe-KR (added in PDF 2.0 (2020)) character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the PDF processor.
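Steps (a) to (e) quoted above can be sketched with plain standard-library containers. This is only an illustration under simplifying assumptions: `CodeToCidMap`, `CidToUnicodeMap`, `CidSystemInfo`, `MakeUcs2CMapName` and `TryMapCodeToUnicode` are hypothetical names, not PoDoFo types, and a real implementation would work with variable-length codes and the embedded PdfCharCodeMap instances instead of `std::map`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for the two kinds of embedded maps
using CodeToCidMap = std::map<uint32_t, uint32_t>;    // font's CMap, step (a)
using CidToUnicodeMap = std::map<uint32_t, char32_t>; // "...-UCS2" CMap, step (e)

// Registry and ordering as read from the CIDSystemInfo dictionary, step (b)
struct CidSystemInfo
{
    std::string registry; // e.g. "Adobe"
    std::string ordering; // e.g. "Japan1"
};

// Step (c): build the second CMap name, e.g. "Adobe-Japan1-UCS2"
std::string MakeUcs2CMapName(const CidSystemInfo& info)
{
    assert(!info.registry.empty() && !info.ordering.empty());
    return info.registry + "-" + info.ordering + "-UCS2";
}

// Steps (a), (d), (e): returns false if any lookup fails
bool TryMapCodeToUnicode(uint32_t code,
                         const CodeToCidMap& fontCMap,
                         const CidSystemInfo& info,
                         const std::map<std::string, CidToUnicodeMap>& embeddedUcs2Maps,
                         char32_t& unicode)
{
    // (a) character code -> CID
    auto cidIt = fontCMap.find(code);
    if (cidIt == fontCMap.end())
        return false;

    // (d) find the CMap named in step (c) among the embedded ones
    auto mapIt = embeddedUcs2Maps.find(MakeUcs2CMapName(info));
    if (mapIt == embeddedUcs2Maps.end())
        return false;

    // (e) CID -> Unicode value
    auto uniIt = mapIt->second.find(cidIt->second);
    if (uniIt == mapIt->second.end())
        return false;

    unicode = uniIt->second;
    return true;
}
```

Keeping the two lookups separate mirrors the split between the code-to-CID maps and the CID-to-Unicode maps, so each embedded map can be generated and tested on its own.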
Translated into PoDoFo's architecture, I believe a PdfEncoding instance has to be constructed from the embedded maps once the /Encoding entry is recognized as one of the predefined names. I believe the code-to-CID CMap needed in step a. is to be found in cmap-resources, while the "toUnicode" CMap needed in step d. is to be found in mapping-resources-pdf. You then construct an instance like PdfEncoding(cidMap, toUnicode) (the name detection and instance construction should probably be inserted at this location in the source) and text extraction should start to work.
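The name detection mentioned above could be as simple as a lookup table from predefined CMap names to the generated singleton accessors. The sketch below assumes a placeholder `EmbeddedCMap` type standing in for the real map class; `GetPredefinedCMapRegistry` and `TryGetPredefinedCMap` are hypothetical helper names, and only one registry entry is shown.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical placeholder for the embedded map type
struct EmbeddedCMap
{
    std::string name;
};

using CMapAccessor = const EmbeddedCMap& (*)();

// One accessor in the style of the generated singletons
const EmbeddedCMap& GetInstance_UniGB_UCS2_H()
{
    static const EmbeddedCMap instance{ "UniGB-UCS2-H" };
    return instance;
}

// Maps the value of the /Encoding entry to the matching accessor
const std::map<std::string, CMapAccessor>& GetPredefinedCMapRegistry()
{
    static const std::map<std::string, CMapAccessor> registry{
        { "UniGB-UCS2-H", &GetInstance_UniGB_UCS2_H },
        // ... one entry per generated singleton
    };
    return registry;
}

// Returns nullptr when the name is not an embedded predefined CMap
const EmbeddedCMap* TryGetPredefinedCMap(const std::string& encodingName)
{
    // Identity-H and Identity-V are explicitly excluded by the algorithm
    if (encodingName == "Identity-H" || encodingName == "Identity-V")
        return nullptr;

    const auto& registry = GetPredefinedCMapRegistry();
    auto it = registry.find(encodingName);
    if (it == registry.end())
        return nullptr;

    assert(it->second != nullptr);
    return &it->second();
}
```

Because the accessors use function-local statics, a map is only constructed the first time a document actually requests it, which keeps the memory cost of embedding many maps proportional to the ones in use.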
Summarizing, I believe 7-8 man-days is a decent estimate of the work needed to accomplish the task. Following the above approach would make me more willing to fast-track the review and merge of a prototype solving the problem. The more the approach differs, the less comfortable I may be reviewing your work.
from podofo.
Have you considered whether you are willing to take on the above activities? 7-8 days may be an overestimate, and if you are quick it could take less (but remember I would like to see a few unit tests as well for this work).
from podofo.
Hi, I do intend to complete the above functionality, but I must state that, since I can only develop the relevant code outside my official working hours and I lack experience in this area, I cannot guarantee a completion time.
from podofo.
OK. I'm sorry for the unsolicited advice: I don't know what your job is, but if a company is paying you to work on PDF-related topics, I still recommend that you not work outside official hours if the work ultimately benefits them. This way, companies using open source software "for free" become more responsible, and the software itself improves in a more professional way.
from podofo.