Comments (4)

pdfa-mattk commented on June 10, 2024

Hello @caspervanpomeren,

For characters that are outside the bounds of the Widths array as given by the FirstChar and LastChar entries, the value of the MissingWidth key is used for the character's width. In this example, because the space character is outside the bounds, it uses the '0' value from MissingWidth - that's why there's no visual spacing between the two words "Hello" and "World" when viewing the PDF file. To give the space character an actual width, you would change the FirstChar value to '32', and add the desired width (likely '278' based on the metrics for the font) as the first entry of the Widths array.
Per Table 109, the FirstChar and LastChar entries give the expected size of the Widths array and therefore need to be consistent with its actual size. '-1' is not a valid value here. '251' could be a valid value if the appropriate entries were added to the Widths array to make it the correct size.
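
The lookup rule described above can be sketched in JavaScript. This is only an illustration: the `font` objects are hypothetical plain objects (not part of any PDF library), and the two-entry Widths arrays are shortened stand-ins that assume the original example started at FirstChar 33.

```javascript
// Sketch of the width lookup rule: codes inside [FirstChar, LastChar]
// index into Widths; every other code falls back to MissingWidth.
// `font` is a hypothetical plain object, not a real library type.
function glyphWidth(font, code) {
  const { firstChar, lastChar, widths, missingWidth = 0 } = font;
  // Widths is expected to hold exactly LastChar - FirstChar + 1 entries.
  if (widths.length !== lastChar - firstChar + 1) {
    throw new RangeError("Widths must have LastChar - FirstChar + 1 entries");
  }
  if (code < firstChar || code > lastChar) {
    return missingWidth;
  }
  return widths[code - firstChar];
}

// Assumed starting point: FirstChar 33, so the space (code 32) is out of range.
const withoutSpace = { firstChar: 33, lastChar: 34, widths: [278, 355], missingWidth: 0 };
// Suggested fix: FirstChar 32 with 278 added as the first Widths entry.
const withSpace = { firstChar: 32, lastChar: 34, widths: [278, 278, 355], missingWidth: 0 };

console.log(glyphWidth(withoutSpace, 32)); // 0
console.log(glyphWidth(withSpace, 32));    // 278
```

The length check mirrors the point about FirstChar and LastChar: a value such as '-1' or an array of the wrong size fails before any width is looked up.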

The encoding of this font, as a Type 1 font with no specified encoding and with the 6th bit set in the font's Flags value (with a value of '32', only the 6th bit is set in this binary bit flag), is the StandardEncoding detailed in Annex D, for which the character encoding is given in Table D.2. A Type 1 font is a single-byte font and therefore cannot encode more than 255 characters [note: do not encode character 0 in a font, it will confuse many processors]. Character encodings are how a writer describes which content stream data corresponds to which font characters. To use characters in Helvetica that are not contained in StandardEncoding (or are not in one of the other pre-defined encodings, which you could use by supplying the name for the Encoding value in the font dictionary), you need to make an encoding dictionary as described in section 9.6.5 and add the characters you'd like to encode. You'd also want to adjust the Widths array suitably.

Note: these examples use unembedded Standard 14 fonts primarily for the sake of compactness. In most cases, I highly recommend embedding (and subsetting if desired) fonts that are used in PDF files.

Hope this helps!

from pdf20examples.

caspervanpomeren commented on June 10, 2024

Hi @pdfa-mattk ,

Thanks for the quick and detailed response. It took a while to digest all the information and respond (had lots of reading/testing to do and was a bit sick), but here it is.

For characters that are outside the bounds of the Widths array as given by the FirstChar and LastChar entries, the value of the MissingWidth key is used for the character's width. In this example, because the space character is outside the bounds, it uses the '0' value from MissingWidth - that's why there's no visual spacing between the two words "Hello" and "World" when viewing the PDF file. To give the space character an actual width, you would change the FirstChar value to '32', and add the desired width (likely '278' based on the metrics for the font) as the first entry of the Widths array.
Per Table 109, the FirstChar and LastChar entries give the expected size of the Widths array and therefore need to be consistent with its actual size. '-1' is not a valid value here. '251' could be a valid value if the appropriate entries were added to the Widths array to make it the correct size.

I understand, MissingWidth is basically the fallback if you didn't specify a width for a character code.

(with a value of '32', only the 6th bit is set in this binary bit flag)

I finally understand the logic behind this! See the following JavaScript code:

(32).toString(2)
--> Returns: "100000"
From the right, the 6th bit is set (to 1)

And the other example I gave that I didn't understand (for future readers):

/Flags 262178  %Bits 2, 6, and 19
(262178).toString(2)
--> Returns: "1000000000000100010"
From the right, the 2nd, the 6th and the 19th bits are set (to 1)

And if I want to set certain flags, for example bits 2, 6 and 19, I can use this logic to get the correct value:

parseInt("1000000000000100010", 2)
--> Returns: 262178

It took me a while to understand the whole bits concept, but it finally clicked.
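
For future readers, the same idea can be written without going through binary strings. The following is a minimal sketch; the function names `flagsFromBits`, `isBitSet`, and `bitsFromFlags` are made up for illustration, and bit positions are counted from 1 at the low-order end, as in the examples above.

```javascript
// Build a Flags value from 1-based bit positions (bit 1 = lowest-order bit).
function flagsFromBits(bitPositions) {
  let flags = 0;
  for (const position of bitPositions) {
    if (!Number.isInteger(position) || position < 1 || position > 32) {
      throw new RangeError("bit positions must be integers from 1 to 32");
    }
    // OR the bit in; ">>> 0" keeps the result an unsigned 32-bit integer.
    flags = (flags | (1 << (position - 1))) >>> 0;
  }
  return flags;
}

// Test whether a single 1-based bit position is set.
function isBitSet(flags, position) {
  return ((flags >>> (position - 1)) & 1) === 1;
}

// List every 1-based bit position that is set in a Flags value.
function bitsFromFlags(flags) {
  const positions = [];
  for (let position = 1; position <= 32; position++) {
    if (isBitSet(flags, position)) positions.push(position);
  }
  return positions;
}

console.log(flagsFromBits([6]));        // 32
console.log(flagsFromBits([2, 6, 19])); // 262178
console.log(bitsFromFlags(262178));     // [ 2, 6, 19 ]
```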

A Type 1 font is a single-byte font and therefore cannot encode more than 255 characters

So if I understand correctly, the Helvetica font has 315 characters but I can only encode 255 characters when using a Type 1 font? This means I will not be able to encode all the 315 characters of the Helvetica font?

To use characters in Helvetica that are not contained in StandardEncoding (or are not in one of the other pre-defined encodings, which you could use by supplying the name for the Encoding value in the font dictionary), you need to make an encoding dictionary as described in section 9.6.5 and add the characters you'd like to encode. You'd also want to adjust the Widths array suitably.

So I tried playing around with this and basically tried four scenarios:

  1. Used StandardEncoding and tried using all the characters it contains;
  2. Used StandardEncoding + Differences array to get to different characters not usually in StandardEncoding;
  3. Used WinAnsiEncoding and tried using all the characters it contains;
  4. Tried using PDFDocEncoding, but later read that this is used for: "Encoding for text strings in a PDF document outside the document's content streams". Well, unfortunately I started with this one and made a complete Widths array for all its characters... At least I now know how the octal number system works.

Here are my findings/questions based on these scenarios:

  1. Scenario: Used StandardEncoding and the characters it contains.
    Here is my (relevant) code:
4 0 obj
  <</Length 154
  >>
stream
BT
  /F1 24 Tf
  72 696 Td
  (Hel¡lo ) Tj
  /F1 24 Tf
  (32000-2) Tj
  /F1 24 Tf
  156.1 0 Td
  (wor) Tj
  (ld) Tj
ET

133.3 694.2 m
221.4 694.2 l
1.2 w
S
endstream
endobj

5 0 obj
  <</Type /Font
    /Subtype /Type1
    /BaseFont /Helvetica
    /FirstChar 32
    /LastChar 251
    /Widths 6 0 R
    /FontDescriptor 7 0 R
  >>
endobj

6 0 obj
[ 278 278 355 556 556 889 667 222 333 333
  389 584 278 333 278 278 556 556 556 556
  556 556 556 556 556 556 278 278 584 584
  584 556 1015 667 667 722 722 667 611 778
  722 278 500 667 556 833 722 778 667 778
  722 667 611 722 667 944 667 667 611 278
  278 278 469 556 222 556 556 500 556 556
  278 556 556 222 222 500 222 833 556 556
  556 556 333 500 278 556 500 722 500 500
  500 334 260 334 584 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 333
  556 556 167 556 556 556 556 191 333 556
  333 333 500 500 0 556 556 556 278 0
  537 350 222 333 333 556 1000 1000 0
  611 0 333 333 333 333 333 333 333 333
  0 333 333 0 333 333 333 1000 0 0
  0 0 0 0 0 0 0 0 0 0
  0 0 0 0 1000 0 370 0 0 0
  0 556 778 1000 365 0 0 0 0 0
  889 0 0 0 278 0 0 222 611 944
  611 ]
endobj

7 0 obj
  <</Type /FontDescriptor
    /FontName /Helvetica
    /Flags 32
    /FontBBox [ -166 -225 1000 931 ]
    /ItalicAngle 0
    /Ascent 718
    /Descent -207
    /CapHeight 718
    /StemV 88
    /MissingWidth 0  
  >>
endobj

The problem here is, the code returns the following text: "Hel´¡lo 32000-2world". The text I was expecting was: "Hel¡lo 32000-2world".

What causes this extra ´?

This happens with all kinds of characters, for example if I try to insert: "HelØlo 32000-2world", I get "Helˆ�lo 32000-2world"

What am I doing wrong? I think I correctly filled in the widths array and these characters are included in the StandardEncoding if I look at Annex D.

The only thing I could think of was that I filled the widths array with widths of zero to compensate for the holes between the available character codes that were available in the encoding, but this was the only logical thing that actually made the widths array correctly work for me.

  2. Scenario: Used StandardEncoding + Differences array.
    Here is my (relevant) code:
4 0 obj
  <</Length 154
  >>
stream
BT
  /F1 24 Tf
  72 696 Td
  (Hel$lo ) Tj
  /F1 24 Tf
  (32000-2) Tj
  /F1 24 Tf
  156.1 0 Td
  (wor) Tj
  (ld) Tj
ET

133.3 694.2 m
221.4 694.2 l
1.2 w
S
endstream
endobj

5 0 obj
  <</Type /Font
    /Subtype /Type1
    /BaseFont /Helvetica
    /Encoding 6 0 R
    /FirstChar 33
    /LastChar 126
    /Widths 7 0 R
    /FontDescriptor 8 0 R
  >>
endobj

6 0 obj
  <</Type /Encoding
    /Differences [36 /Euro]
  >>
endobj

7 0 obj
[ 278 355 556 556 889 667 222 333 333 389 584 278 333 278 278 556
  556 556 556 556 556 556 556 556 556 278 278 584 584 584 556 1015
  667 667 722 722 667 611 778 722 278 500 667 556 833 722 778 667
  778 722 667 611 722 667 944 667 667 611 278 278 278 469 556 222
  556 556 500 556 556 278 556 556 222 222 500 222 833 556 556 556
  556 333 500 278 556 500 722 500 500 500 334 260 334 584 556 ]
endobj

8 0 obj
  <</Type /FontDescriptor
    /FontName /Helvetica
    /Flags 32
    /FontBBox [ -166 -225 1000 931 ]
    /ItalicAngle 0
    /Ascent 718
    /Descent -207
    /CapHeight 718
    /StemV 88
    /MissingWidth 0  
  >>
endobj

This returns: "Hel€lo 32000-2world" as expected. The only thing I noticed was I can't use character names like sterling in the differences array since it already exists as a character code in the StandardEncoding. Is it not allowed to have two character codes for the same thing or am I missing something?

  3. Scenario: Used WinAnsiEncoding and the characters it contains.
    Here is my (relevant) code:
4 0 obj
  <</Length 154
  >>
stream
BT
  /F1 24 Tf
  72 696 Td
  (Hel¡lo ) Tj
  /F1 24 Tf
  (32000-2) Tj
  /F1 24 Tf
  156.1 0 Td
  (wor) Tj
  (ld) Tj
ET

133.3 694.2 m
221.4 694.2 l
1.2 w
S
endstream
endobj

5 0 obj
  <</Type /Font
    /Subtype /Type1
    /BaseFont /Helvetica
    /Encoding /WinAnsiEncoding
    /FirstChar 32
    /LastChar 255
    /Widths 6 0 R
    /FontDescriptor 7 0 R
  >>
endobj

6 0 obj
[ 278 278 355 556 556 889 667 191 333 333 
  389 584 278 333 278 278 556 556 556 556 
  556 556 556 556 556 556 278 278 584 584 
  584 556 1015 667 667 722 722 667 611 778 
  722 278 500 667 556 833 722 778 667 778 
  722 667 611 722 667 944 667 667 611 278
  278 278 469 556 333 556 556 500 556 556 
  278 556 556 222 222 500 222 833 556 556 
  556 556 333 500 278 556 500 722 500 500 
  500 334 260 334 584 0 556 0 222 556 
  333 1000 556 556 333 1000 667 333 1000 0 
  611 0 0 222 222 333 333 350 556 1000
  333 1000 500 333 944 0 500 667 0 333 
  556 556 556 556 260 556 333 737 370 556
  584 0 737 333 400 584 333 333 333 556 
  537 278 333 333 365 556 834 834 834 611
  667 667 667 667 667 667 1000 722 667 667 
  667 667 278 278 278 278 722 722 778 778 
  778 778 778 584 778 722 722 722 722 667 
  667 611 556 556 556 556 556 556 889 500 
  556 556 556 556 278 278 278 278 556 556 
  556 556 556 556 556 584 611 556 556 556 
  556 500 556 500 ]
endobj

7 0 obj
  <</Type /FontDescriptor
    /FontName /Helvetica
    /Flags 32
    /FontBBox [ -166 -225 1000 931 ]
    /ItalicAngle 0
    /Ascent 718
    /Descent -207
    /CapHeight 718
    /StemV 88
    /MissingWidth 0  
  >>
endobj

The problem here is, the code returns the following text: "HelÂ¡lo 32000-2world". The text I was expecting was: "Hel¡lo 32000-2world".

What causes this extra Â?

This is basically the same problem as in scenario 1.

Another problem, when I try using this text: "Hel€lo 32000-2world", I get this: "Hel€lo 32000w-2orld". Even though WinAnsiEncoding does include the character Euro.

Note: these examples use unembedded Standard 14 fonts primarily for the sake of compactness. In most cases, I highly recommend embedding (and subsetting if desired) fonts that are used in PDF files.

I understand, but I am currently trying to understand the entire spec and make examples that explain everything from the spec. In these examples I also want to benefit from the compactness of unembedded standard fonts. I plan on making a pull request with all my examples and you can decide if you want them added.

When I put my knowledge of the spec to use, I will certainly embed and subset fonts. Do you know if there are any simple examples of embedding/subsetting fonts? This is basically the next step I am going to work on.

To conclude, a more generic question: How do people generally learn to write pdf by hand? Do they just read the spec and go from there? Or are there certain resources that are recommended? Or are there certain communities on IRC/Discord etc? Because while I have found some tutorials/knowledge online, it isn't a huge amount. It's especially hard since lots of information isn't based on the latest spec and the term "pdf" is used so much in relation to other things that search engines don't really return what I am looking for.

Some extra background information: my end goal is to create a JavaScript library that can automatically write PDF that complies with the latest PDF spec and supports almost every aspect of it. So I am starting by doing everything by hand and understanding how everything works, and then I am going to translate that knowledge to JavaScript code.

Thanks again for the help. I also understand that this is quite some text, so please take your time and even if you can only answer one thing I would really appreciate it.

If I need to clarify anything please let me know.

Casper

pdfa-mattk commented on June 10, 2024

Hi @caspervanpomeren, let me see if I can answer some of your questions here:

So if I understand correctly, the Helvetica font has 315 characters but I can only encode 255 characters when using a Type 1 font? This means I will not be able to encode all the 315 characters of the Helvetica font?

As a Type1 font, this is correct. This is a general characteristic of Type1 fonts as defined in PDF. Because they use single byte values in the content stream to reference characters in their encodings, and a single byte can only hold up to 256 different values, this sets the limit on the number of different characters that can be encoded in a given instance of a Type1 font.
You could make two different instances of Helvetica, with two different font dictionaries that use two different Encodings that have different Differences arrays, and between the two encode all 315 characters. Then, you just need to use the correct instance to set the appropriate text strings. They can even both be named "Helvetica". The PDF will have two different fonts, but these will both be instances of Helvetica.
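
As a rough sketch of how glyphs might be split across several font instances, the function below chunks a list of glyph names into groups that fit one simple font's code range and emits one Differences array per group. The name `buildDifferences`, the default code range of 32 to 255, and the placeholder glyph names `g0`, `g1`, ... are assumptions for illustration; real input would be the font's actual glyph names.

```javascript
// Split glyph names into groups that each fit one simple font's code range
// and emit one Differences array (as PDF syntax) per group. In a Differences
// array, a code is followed by names that take consecutive codes from there.
function buildDifferences(glyphNames, firstCode = 32, lastCode = 255) {
  if (firstCode < 1 || lastCode > 255 || firstCode > lastCode) {
    throw new RangeError("codes must satisfy 1 <= firstCode <= lastCode <= 255");
  }
  const perFont = lastCode - firstCode + 1;
  const arrays = [];
  for (let i = 0; i < glyphNames.length; i += perFont) {
    const names = glyphNames.slice(i, i + perFont).map((name) => "/" + name);
    arrays.push(`[ ${firstCode} ${names.join(" ")} ]`);
  }
  return arrays;
}

// 315 placeholder names stand in for Helvetica's glyph names.
const placeholderNames = Array.from({ length: 315 }, (_, i) => "g" + i);
console.log(buildDifferences(placeholderNames).length); // 2

// Tiny range to make the chunking visible.
console.log(buildDifferences(["Euro", "sterling", "yen"], 36, 37));
// [ '[ 36 /Euro /sterling ]', '[ 36 /yen ]' ]
```

Each returned array would go into the Encoding dictionary of its own font dictionary, and each font instance would also need a Widths array matching its own code range.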

The problem here is, the code returns the following text: "Hel´¡lo 32000-2world". The text I was expecting was: "Hel¡lo 32000-2world".
What causes this extra ´?

For Type1 fonts, strings in content streams are read as individual bytes. I suspect that the text you're putting into the content stream might be encoded in UTF-8 - this would cause the ¡ (inverted exclamation mark, U+00A1) to be expressed as two bytes: 0xC2 0xA1. The "extra" character is likely that extra 0xC2 that I suspect you're putting into the content stream. The same concept looks like it explains the other odd and extra characters you're seeing when trying to put other characters in.
The bytes of the content stream are interpreted as individual bytes, and those single bytes are used to look up characters in the font's Encoding. The content stream bytes are not in UTF-8, they're presumed to be expressed specifically in the font's Encoding.
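
The byte-level difference can be checked in JavaScript. The sketch below first shows the two UTF-8 bytes for U+00A1, then writes a string as a PDF literal string with one byte per character, using octal escapes for codes outside printable ASCII. `toLiteralString` and `demoCharToCode` are hypothetical helpers, and the demo mapping is a simplification: it assumes printable ASCII maps to itself and places exclamdown at 0xA1, whereas a real encoding table differs for a few codes (StandardEncoding, for example, maps 0x27 to quoteright).

```javascript
// UTF-8 expresses U+00A1 as two bytes, so a simple font sees two codes.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00A1"));
const utf8Hex = utf8Bytes.map((b) => "0x" + b.toString(16).toUpperCase()).join(" ");
console.log(utf8Hex); // 0xC2 0xA1

// Simplified demo mapping (an assumption, not a full encoding table).
function demoCharToCode(ch) {
  if (ch === "\u00A1") return 0xa1; // exclamdown
  const codePoint = ch.codePointAt(0);
  return codePoint >= 0x20 && codePoint <= 0x7e ? codePoint : undefined;
}

// Write text as a PDF literal string, one byte per character.
function toLiteralString(text, charToCode) {
  let out = "(";
  for (const ch of text) {
    const code = charToCode(ch);
    if (code === undefined) {
      throw new Error(`No code for ${JSON.stringify(ch)} in this encoding`);
    }
    if (code === 0x28 || code === 0x29 || code === 0x5c) {
      out += "\\" + String.fromCharCode(code); // escape ( ) and backslash
    } else if (code < 0x20 || code > 0x7e) {
      out += "\\" + code.toString(8).padStart(3, "0"); // three-digit octal escape
    } else {
      out += String.fromCharCode(code);
    }
  }
  return out + ")";
}

console.log(toLiteralString("Hel\u00A1lo ", demoCharToCode)); // (Hel\241lo )
```

Writing the non-ASCII byte as an octal escape keeps the content stream plain ASCII, so the text editor's file encoding can no longer insert an extra byte.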

The only thing I noticed was I can't use character names like sterling in the differences array since it already exists as a character code in the StandardEncoding. Is it not allowed to have two character codes for the same thing or am I missing something?

I don't know of any restriction on doing this. You should be able to use any name in any position in the Differences array, and as long as the font used for display has that character it should work. Could you tell me what error or behavior you were seeing when you tried this?

a more generic question: How do people generally learn to write pdf by hand? Do they just read the spec and go from there?

Most everything in PDF 2.0 is shared with earlier versions of PDF, so your best resources are mostly going to be written for earlier versions of PDF. Many people learn by reading the spec, examining PDFs that are generated from other libraries or programs, and experimenting.
Two books I can recommend are Leonard Rosenthal's "Developing with PDF: Dive Into the Portable Document Format" and John Whitington's "PDF Explained: The ISO Standard for Document Exchange". https://brendanzagaeski.appspot.com/0004.html has some good introduction and starting points as well.
Most everything you learn about making PDF files from versions before PDF 2.0 will apply to PDF 2.0 as well, so don't worry about looking for PDF 2.0 - specific tutorials.

web-apply commented on June 10, 2024

https://raw.githubusercontent.com/pdf-association/pdf20examples/master/Simple%20PDF%202.0%20file.pdf
