Giter VIP home page Giter VIP logo

Comments (18)

uschindler avatar uschindler commented on June 9, 2024 2

To give a bit of context: I would also add ZWNJ (and if I do, 3 tests trigger the validation violation). But Robert does not like it because he wants to write tests without escapes. I don't agree with that, but I don't want to argue.

To me the more important one is ZWSP (U+200B), because when you copy-paste code from javadocs generated by the javadoc tool, there are U+200B between method names and parameters, so it happens easily that you get them into code. This happened to me in the ExtractJdkApis.java file used by Gradle to generate API signature files.

So my PR #12937 just catches this one for now. In addition I added validation of the UTF-8 encoding, so files saved accidentally as ISO-8859-x are detected by ./gradlew validateSourcePatterns.

from lucene.

msokolov avatar msokolov commented on June 9, 2024 1

What about ZWNJ? TBH I don't understand the difference between ZWS and ZWNJ

from lucene.

uschindler avatar uschindler commented on June 9, 2024 1

What about ZWNJ? TBH I don't understand the difference between ZWS and ZWNJ

I could add it to the regex, but then Robert will shot me with his guns...

from lucene.

rmuir avatar rmuir commented on June 9, 2024

But the problem there wasn't the ZWSP but wrong character encoding? It also would have happened with a smiley emoji or arabic letter?

from lucene.

rmuir avatar rmuir commented on June 9, 2024

Also lots of tests have strings that have "non-visible" unicode characters. I think this issue is too broad and will never work.

Instead I think we should limit variable, method, class names, etc to simple ascii? But lets not police what is inside test strings please.

from lucene.

dweiss avatar dweiss commented on June 9, 2024

Wouldn't it be better if they were explicitly (\uxxxx) encoded though? Especially those invisible ones, which can be really annoying since you can't see them...

from lucene.

uschindler avatar uschindler commented on June 9, 2024

But the problem there wasn't the ZWSP but wrong character encoding? It also would have happened with a smiley emoji or arabic letter?

Yes. The problem was still that if you do not see it, you can't fix it. Not even the text editor showed them. It took me long time to fix this issue. Yes it was a default charset problem on Mike's device.

I generally agree with Mike on some of his points: if you need to write a test that requires some strange invisible character, please add it as Unicode escape in java code.

So I would like to add a check/pattern to or source patterns task that complains about hidden code. If there's a test that gets complaints: oh man just add an escape and all are happy.

from lucene.

rmuir avatar rmuir commented on June 9, 2024

Wouldn't it be better if they were explicitly (\uxxxx) encoded though? Especially those invisible ones, which can be really annoying since you can't see them...

But how in the world are you supposed to read such text?

from lucene.

rmuir avatar rmuir commented on June 9, 2024

The problem is specific to java identifiers, so any check must also use Character.isIdentifierIgnorable, not anyones definition of "invisible" which is arbitrary. These are the only ones that can impact the java source code in such a way like what Mike saw. And technically only identifiers need to be checked for this.

even then, failing or requiring escaping still doesn't prevent the nonsense. it is totally legal to do this:

int x = 1;
x\u200b = 1; // yes, it works escaped too, and changes value of 'x'

This is why i think it isn't useful to punish the unicode. Instead i'd prefer if some static analysis checker just rejected on any identifier that contains "identifier ignorable"... regardless of whether it was escaped or not: it is simply confusing to do this. we could also limit them to simple ascii. maybe ecj or error-prone has such a check.

we always come back around to this bikeshed/hammer of people wanting to remove all unicode e.g. from test strings, i'm against that!

from lucene.

rmuir avatar rmuir commented on June 9, 2024

I'm looking into this one: https://errorprone.info/bugpattern/UnicodeInCode

This wouldn't cause any problems for comments or literals which is my only concern: but would prevent crazy stuff in code itself. If it works without zillions of failures I'll make a PR.

from lucene.

rmuir avatar rmuir commented on June 9, 2024

OK see PR, this check just wasn't enabled because there was stuff to fix. I fixed the spatial3d code so now we can prevent this going forwards.

from lucene.

uschindler avatar uschindler commented on June 9, 2024

Wouldn't it be better if they were explicitly (\uxxxx) encoded though? Especially those invisible ones, which can be really annoying since you can't see them...

But how in the world are you supposed to read such text?

Your PR only fixed the identifiers. I am solely talking about Java files. If you need a text string to pass to an analyzer that splits on whitespace whatever, why can't we forbid invisible code points. Just use a escaped text string.

I am not talking about resource files. Just java code. And there I would like to enforce all string declarations with such characters to use Unicode escapes.

from lucene.

uschindler avatar uschindler commented on June 9, 2024

...just the hidden ones.

I agree with your PR for java code, as this is the security issue for patches coming in. 👍

P.S. the PR wouldn't have prevented the original issue by Mike, because this was Gradle code not passing through javac, but directly interpreted.

from lucene.

dweiss avatar dweiss commented on June 9, 2024

If you need a text string to pass to an analyzer that splits on whitespace whatever, why can't we forbid invisible code points. Just use a escaped text string.

I'm not against emojis in identifiers :). But I had to debug what Uwe mentioned more than a few times in my life - not just in source but also in data (jsons, etc.) - takes a while to figure out what's wrong when you can't see it.

from lucene.

uschindler avatar uschindler commented on June 9, 2024

Hi,
here is my simplest one to just prevent 2 unicode characters: #12937

In validateSourcePatterns we have more of those. It detected also the violation I introduced yesterday. The problem with u200B is mainly: Javadocs generated by the javadoc tool often separate method names from their parameters using that codepoint. If you copy/paste a method declaration, you insert it into code. This would possibly also be detected by Robert's errorprone, but this one applies to all source files.

from lucene.

rmuir avatar rmuir commented on June 9, 2024

i'm ok with the ZWSP and BOM marker. just not other characters that one might consider invisible such as ZWJ. i want the tests to use human readable strings. Sorry, I'll be sticking to my guns on that.

from lucene.

uschindler avatar uschindler commented on June 9, 2024

I will merge the other PR #12937 soon, will just make reporting better. It also detects invalid UTF-8 in all checked files (not only Java source files).

from lucene.

mikemccand avatar mikemccand commented on June 9, 2024

Thanks @uschindler and @rmuir!

from lucene.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.