Comments (18)
To give a bit of context: I would also add ZWNJ (and if I do, 3 tests trigger the validation violation). But Robert does not like it because he wants to write tests without escapes. I don't agree with that, but I don't want to argue.
To me the more important one is ZWSP (U+200B), because when you copy-paste code from javadocs generated by the javadoc tool, there are U+200B between method names and parameters, so it happens easily that you get them into code. This happened to me in the ExtractJdkApis.java
file used by Gradle to generate API signature files.
So my PR #12937 just catches this one for now. In addition I added validation of the UTF-8 encoding, so files saved accidentally as ISO-8859-x are detected by ./gradlew validateSourcePatterns
.
from lucene.
What about ZWNJ? TBH I don't understand the difference between ZWS and ZWNJ
from lucene.
What about ZWNJ? TBH I don't understand the difference between ZWS and ZWNJ
I could add it to the regex, but then Robert will shot me with his guns...
from lucene.
But the problem there wasn't the ZWSP but wrong character encoding? It also would have happened with a smiley emoji or arabic letter?
from lucene.
Also lots of tests have strings that have "non-visible" unicode characters. I think this issue is too broad and will never work.
Instead I think we should limit variable, method, class names, etc to simple ascii? But lets not police what is inside test strings please.
from lucene.
Wouldn't it be better if they were explicitly (\uxxxx) encoded though? Especially those invisible ones, which can be really annoying since you can't see them...
from lucene.
But the problem there wasn't the ZWSP but wrong character encoding? It also would have happened with a smiley emoji or arabic letter?
Yes. The problem was still that if you do not see it, you can't fix it. Not even the text editor showed them. It took me long time to fix this issue. Yes it was a default charset problem on Mike's device.
I generally agree with Mike on some of his points: if you need to write a test that requires some strange invisible character, please add it as Unicode escape in java code.
So I would like to add a check/pattern to or source patterns task that complains about hidden code. If there's a test that gets complaints: oh man just add an escape and all are happy.
from lucene.
Wouldn't it be better if they were explicitly (\uxxxx) encoded though? Especially those invisible ones, which can be really annoying since you can't see them...
But how in the world are you supposed to read such text?
from lucene.
The problem is specific to java identifiers, so any check must also use Character.isIdentifierIgnorable
, not anyones definition of "invisible" which is arbitrary. These are the only ones that can impact the java source code in such a way like what Mike saw. And technically only identifiers need to be checked for this.
even then, failing or requiring escaping still doesn't prevent the nonsense. it is totally legal to do this:
int x = 1;
x\u200b = 1; // yes, it works escaped too, and changes value of 'x'
This is why i think it isn't useful to punish the unicode. Instead i'd prefer if some static analysis checker just rejected on any identifier that contains "identifier ignorable"... regardless of whether it was escaped or not: it is simply confusing to do this. we could also limit them to simple ascii. maybe ecj or error-prone has such a check.
we always come back around to this bikeshed/hammer of people wanting to remove all unicode e.g. from test strings, i'm against that!
from lucene.
I'm looking into this one: https://errorprone.info/bugpattern/UnicodeInCode
This wouldn't cause any problems for comments or literals which is my only concern: but would prevent crazy stuff in code itself. If it works without zillions of failures I'll make a PR.
from lucene.
OK see PR, this check just wasn't enabled because there was stuff to fix. I fixed the spatial3d code so now we can prevent this going forwards.
from lucene.
Wouldn't it be better if they were explicitly (\uxxxx) encoded though? Especially those invisible ones, which can be really annoying since you can't see them...
But how in the world are you supposed to read such text?
Your PR only fixed the identifiers. I am solely talking about Java files. If you need a text string to pass to an analyzer that splits on whitespace whatever, why can't we forbid invisible code points. Just use a escaped text string.
I am not talking about resource files. Just java code. And there I would like to enforce all string declarations with such characters to use Unicode escapes.
from lucene.
...just the hidden ones.
I agree with your PR for java code, as this is the security issue for patches coming in. 👍
P.S. the PR wouldn't have prevented the original issue by Mike, because this was Gradle code not passing through javac, but directly interpreted.
from lucene.
If you need a text string to pass to an analyzer that splits on whitespace whatever, why can't we forbid invisible code points. Just use a escaped text string.
I'm not against emojis in identifiers :). But I had to debug what Uwe mentioned more than a few times in my life - not just in source but also in data (jsons, etc.) - takes a while to figure out what's wrong when you can't see it.
from lucene.
Hi,
here is my simplest one to just prevent 2 unicode characters: #12937
In validateSourcePatterns
we have more of those. It detected also the violation I introduced yesterday. The problem with u200B is mainly: Javadocs generated by the javadoc tool often separate method names from their parameters using that codepoint. If you copy/paste a method declaration, you insert it into code. This would possibly also be detected by Robert's errorprone, but this one applies to all source files.
from lucene.
i'm ok with the ZWSP and BOM marker. just not other characters that one might consider invisible such as ZWJ. i want the tests to use human readable strings. Sorry, I'll be sticking to my guns on that.
from lucene.
I will merge the other PR #12937 soon, will just make reporting better. It also detects invalid UTF-8 in all checked files (not only Java source files).
from lucene.
Thanks @uschindler and @rmuir!
from lucene.
Related Issues (20)
- [Ideation] Improve readInts24 & readDelta16 performance for DocIdsWriter
- Write a HOWTO migrate Codec format version HOT 1
- Add Facets#getBulkSpecificValues
- Reproducible failure in TestFloatVectorSimilarityQuery.testApproximate HOT 1
- [Minor] Regeneration investigation HOT 6
- Let's run our Monster tests, at least once? HOT 3
- Should we clean up the few remaining references to `Lucene/Solr`? HOT 1
- Test failure in TestKnnGraph.testMultiThreadedSearch HOT 6
- Find and replace uses of unbounded array growth with `growInRange` HOT 1
- Test failure in TestHnswFloatVectorGraph HOT 5
- Can we ban `Thread.sleep`? HOT 6
- Occasional OOMEs when running the test suite HOT 1
- org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity fails intermittently HOT 3
- Improve backward-compatibility testing
- Reproducible test failure with Terms#intersect on the default codec HOT 8
- `gradlew check` fails checksum validation in a fresh clone on Windows HOT 11
- `TestStressLockFactories` fails on Windows in a freshly cloned repository HOT 29
- FSDirectory stuck at open(Path path) method when ran from .jar file HOT 6
- Use Reverse RetentionQuerySupplier to speed up SoftDeletesRetentionMergePolicy#numDeletesToMerge HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lucene.