Comments (6)

brad4d avatar brad4d commented on May 28, 2024

I cannot tell from the example a.js file in the description whether the á character is correctly encoded as UTF-8 in the file you're actually using when you see this error.

Can you confirm that the input file, a.js, is actually valid UTF-8?

from closure-compiler.

brad4d avatar brad4d commented on May 28, 2024

Actually, could you just attach 2 files to this issue?

  1. The actual a.js file.
  2. The exact output from closure-compiler itself. (i.e. the input that python is seeing)


juj avatar juj commented on May 28, 2024

Here are the input files: a.zip


C3 A1 is 11000011 10100001, which is of the form 110xxxxx 10yyyyyy, i.e. a leading byte followed by a continuation byte. See e.g. Wikipedia on UTF-8 Encoding. The Unicode code point in this case is xxxxxyyyyyy = 00011 100001 = 0xE1 = https://www.compart.com/en/unicode/U+00E1.
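As a quick cross-check of the bit arithmetic above, here is a minimal Python sketch that decodes the two bytes by hand and compares the result with Python's own UTF-8 decoder. The variable names are illustrative only and are not part of the test case.

```python
# Decode the two-byte UTF-8 sequence C3 A1 by hand.
raw = bytes([0xC3, 0xA1])
lead, cont = raw

# 110xxxxx marks the lead byte of a 2-byte sequence, 10yyyyyy a continuation byte.
assert lead & 0b11100000 == 0b11000000
assert cont & 0b11000000 == 0b10000000

# Concatenate the 5 payload bits of the lead byte with the 6 bits of the continuation byte.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

print(hex(code_point))                          # 0xe1
print(raw.decode('utf-8') == chr(code_point))   # True
```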

The exact output from closure-compiler itself. (i.e. the input that python is seeing)

The test case does not produce any JavaScript output from closure-compiler. Python attempts to capture the stderr error message from the Closure process, but Python croaks internally because it cannot decode the stderr bytes that Closure outputs, so nothing is returned to the calling a.py file.

Executing the following python file instead

import subprocess
ret = subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='iso-8859-1', stderr=subprocess.PIPE, shell=True)
print(ret.stderr)

does not throw an exception, and instead causes Python to print the stderr as expected:

a.js:1:4: WARNING - [JSC_SUSPICIOUS_NAN] Comparison against NaN is always false. Did you mean isNaN()?
  1| if (4 == NaN) console.log('á');
         ^^^^^^^^
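To see exactly which bytes arrive on stderr without any decoding getting in the way, one option is to capture stderr as raw bytes and hex-dump them. The sketch below is an illustration, not part of the original test case: the helper name capture_stderr_bytes is hypothetical, and the child process is a stand-in that simply writes the UTF-8 bytes of 'á' to stderr.

```python
import subprocess
import sys

def capture_stderr_bytes(cmd):
    """Run cmd and return its stderr as undecoded bytes."""
    # No encoding= or text= argument is passed, so Python does not try to decode anything.
    result = subprocess.run(cmd, stderr=subprocess.PIPE)
    return result.stderr

# Stand-in child process (an assumption for illustration):
# it writes the two bytes C3 A1 straight to stderr.
child = [sys.executable, '-c', "import sys; sys.stderr.buffer.write(b'\\xc3\\xa1')"]

raw = capture_stderr_bytes(child)
print(raw.hex(' '))  # hex dump of the captured bytes (needs Python 3.8+ for the separator)
```

Swapping the stand-in command for the Closure command line used above should show whether the bytes Python receives are still C3 A1 or something else.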


brad4d avatar brad4d commented on May 28, 2024

What I want to know is this:

Is closure-compiler actually generating an invalid character sequence to stderr, or is something else going on?

One thing that could be happening is that the stderr output from closure-compiler is getting mixed with output from either its own stdout or some other process that happens to share the same output stream. Due to buffering, the 2-byte sequence for 'á' that closure-compiler sends to stderr could be interrupted by output from somewhere else.

Thanks for providing the a.js file and your command line. We can use that to find out what the actual stderr output from the latest closure-compiler build is for this case.

If this problem is in some way actually tied to Windows, we're unlikely to fix it ourselves as none of the core team uses Windows when working on closure-compiler.


brad4d avatar brad4d commented on May 28, 2024

Thank you for supplying the a.js file.

  1. I downloaded it
  2. I checked out and built the latest version of closure-compiler as a Java jar file.
  3. I stored the path to that jar file in $ccjar
  4. I ran the following commands to check the behavior.

First confirm that my terminal / OS is using UTF-8

$ echo $LANG
en_US.UTF-8
$ echo á |xxd
00000000: c3a1 0a  

Yep. c3a1 is the correct byte pair for this UTF-8 character as stated in a previous comment.

Now confirm that the character is correct in a.js

$ xxd a.js
00000000: 6966 2028 3420 3d3d 204e 614e 2920 636f  if (4 == NaN) co
00000010: 6e73 6f6c 652e 6c6f 6728 27c3 a127 293b  nsole.log('..');
00000020: 0d0a                                     ..

Yep.

Now run the compiler with the options as described in earlier comments and save its stderr output into err.out and use xxd to check the contents of that file.

$ java -jar $ccjar --charset=UTF8 --js a.js --js_output_file  o.js 2> err.out
$ xxd err.out
00000000: 612e 6a73 3a31 3a34 3a20 5741 524e 494e  a.js:1:4: WARNIN
00000010: 4720 2d20 5b4a 5343 5f53 5553 5049 4349  G - [JSC_SUSPICI
00000020: 4f55 535f 4e41 4e5d 2043 6f6d 7061 7269  OUS_NAN] Compari
00000030: 736f 6e20 6167 6169 6e73 7420 4e61 4e20  son against NaN 
00000040: 6973 2061 6c77 6179 7320 6661 6c73 652e  is always false.
00000050: 2044 6964 2079 6f75 206d 6561 6e20 6973   Did you mean is
00000060: 4e61 4e28 293f 0a20 2031 7c20 6966 2028  NaN()?.  1| if (
00000070: 3420 3d3d 204e 614e 2920 636f 6e73 6f6c  4 == NaN) consol
00000080: 652e 6c6f 6728 27c3 a127 293b 0d0a 2020  e.log('..');..  
00000090: 2020 2020 2020 205e 5e5e 5e5e 5e5e 5e0a         ^^^^^^^^.
000000a0: 0a30 2065 7272 6f72 2873 292c 2031 2077  .0 error(s), 1 w
000000b0: 6172 6e69 6e67 2873 290a                 arning(s).

Yep. We again see "c3" and "a1" used as the 2-byte encoding in bytes at positions 0x87 and 0x88.

The Java jar executing in Linux is definitely generating stderr using UTF-8 encoding.

Probably the closure-compiler you're running has been converted from a jar file to a native Windows binary using Graal, because I think that's what the google/closure-compiler-npm code that generates the NPM release tries to make the default.

I'm not sure whether the different behavior you see is the result of Windows behavior, the behavior of Java on Windows (as emulated by Graal), or something else.


juj avatar juj commented on May 28, 2024

One simplification/note to the bug test case is that the original a.py was

import subprocess
subprocess.run(['npx', 'google-closure-compiler','--charset=UTF8','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)

although this bug is unrelated to the --charset=UTF8 parameter, and it also occurs with the shorter line

import subprocess
subprocess.run(['npx', 'google-closure-compiler','--js','a.js','--js_output_file','o.js'], encoding='utf-8', stderr=subprocess.PIPE, shell=True)

It is expected that the issue does not occur on Linux or macOS, since those OSes default to UTF-8 almost everywhere.

In my Windows shell I have changed my active codepage to UTF-8, i.e.

C:\emsdk\emscripten\main>chcp
Active code page: 65001

See chcp 65001.

However, this change does not affect the bug, so this is not a Windows terminal/console issue. The cause lies somewhere in the libraries involved, either in Closure or elsewhere, as observed above.

We successfully worked around this in Emscripten code by specifying a directive encoding='iso-8859-1' if WINDOWS else 'utf-8' when invoking Closure.
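A minimal sketch of that kind of workaround, for illustration only: the function name closure_stderr_encoding is hypothetical, and deriving WINDOWS from sys.platform is an assumption, since the actual definition is not shown in this thread. The sketch also shows why ISO-8859-1 never raises: every byte value maps to a code point, at the cost of rendering each UTF-8 multi-byte character as several characters.

```python
import sys

# Assumption: WINDOWS is derived from sys.platform; the real definition is not shown in this thread.
WINDOWS = sys.platform.startswith('win')

def closure_stderr_encoding(windows=WINDOWS):
    """Pick the encoding used to decode Closure's stderr."""
    return 'iso-8859-1' if windows else 'utf-8'

# ISO-8859-1 assigns a character to every byte value, so decoding cannot fail.
raw = b"console.log('\xc3\xa1');"
as_latin1 = raw.decode('iso-8859-1')
as_utf8 = raw.decode('utf-8')

# The Latin-1 result is one character longer: C3 and A1 become two separate characters.
print(len(as_latin1), len(as_utf8))
```

The returned value would then be passed as the encoding= argument of subprocess.run. Note that this avoids the exception rather than recovering the original text: non-ASCII characters in messages decoded this way may be displayed incorrectly.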
