Comments (9)
I got somewhere. Take it with a grain of salt though, as I might have easily missed something :)
The problem appears when multi-character unicode emojis get broken up into separate diffs inside a patch:
'💛'.length // 2
'💛'.charAt(0) // \ud83d
'💛'.charAt(1) // \udc9b
encodeURI("\ud83d") // malformed uri error
encodeURI("\udc9b") // malformed uri error
encodeURI("\uD83D\udc9b") // '%F0%9F%92%9B'
This mostly happens with internally generated prefixes/suffixes, for example inside patch_addContext_
- dmp "allocates" a chunk which starts/ends in the middle of an emoji. But it can occur inside an actual diff, too.
It seems to me that the problem can be solved by replacing encodeURI
/decodeURI
with escape
/unescape
(or other prefered escape method) inside patch_fromText and toString, have you tried this? It seems to work fine for my use-case - I hope it doesn't break something else. encodeURI
feels quite out of place anyway, since the code deals with abstract text, and not URIs.
Before I tried this, I also tinkered with the source code for quite a while, and I did seemingly manage to prevent such emoji splitting - in patch_addContext_
, by testing if encodeURI
throws, and increasing the padding and shifting the preffix end/suffix start until it didn't. But this really is more of a desperate hack than anything else. Such approach may be able to fix emojis breaking up, but dmp would still throw as soon as an invalid character would appear for any other reason (other characters can throw too).
from diff-match-patch.
@laurent22 I'm running into the same problem, did you manage to get around this?
from diff-match-patch.
We simply don't use patch_toText anymore because it's pretty much broken. Instead we use patch_make and serialise to JSON, which for our purpose is good enough (because we don't need the patch in proper diff format).
https://github.com/laurent22/joplin/blob/dev/packages/lib/models/Revision.ts
from diff-match-patch.
@laurent22 Thank you so much for such quick reply! What do you mean by it's pretty much broken? Do you mean just this issue, or other issues, too?
Unfortunately we don't use diff-match-patch
directly - one of our critical deps does, so I pretty much have to solve this and patch either dmp, or the dependency that uses it. I will post the results :)
from diff-match-patch.
What do you mean by it's pretty much broken?
I mean that in our particular case this function is unusable because a lot of random strings are thrown at it, so many users sooner or later will hit this escapeUri bug.
But patch_make as a replacement is fine, it doesn't crash.
If you figure out a fix please let us know! I looked at the code at some point but couldn't fix it.
from diff-match-patch.
@laurent22 Ok, I've come a long way since my last post. I found out much work was devoted to this problem in the official Google's repo (this repo is just an independently maintained npm package for it, which I didn't know). Particulary these three issues: #59 #69 #80.
It seems to me that the problem can be solved by replacing encodeURI/decodeURI with escape/unescape (or other prefered escape method) inside patch_fromText and toString
This does break some other use-cases after all, as noticed in this issue for a related library which tried this exact fix (not sure if this can happen outside of Chineese or similar languages, though).
I also tinkered with the source code for quite a while, and I did seemingly manage to prevent such emoji splitting
My previous approach was way off. But I adapted dmsnell's approach from issue #80 to the toText
method, and it seems to be working just fine - it can be seen here:)
from diff-match-patch.
Thanks for looking into it @michal-kurz. It seems unlikely that any change will be merged to the official repository (the PR is from 2019) - do you know if there's a good fork being maintained somewhere where that kind of fix could be applied?
from diff-match-patch.
@michal-kurz I've recently added my fix to patch_toText
because I found another trigger when running this test case.
dmp.patch_toText( dmp.patch_make( '🅰 not a ', '🅰 not a s' ) );
the problem here is when we're generating the context for the patch objects, we might go ahead and split surrogates again.
I would be very interested in simply using my branch instead of master
for diff-match-patch
since @NeilFraser is seemingly not available to help push that fix into the main branch.
What do you think of this? by the way I'm planning on cleaning up my branch now too, particularly with the tests, and merge it into my fork's master
so potentially soon we could just point the code to dmsnell/diff-match-patch instead of Google/diff-match-patch and the fixes would simply be there without any further hacking around.
either way, if we could fix this here, it would be helpful to jsondiffpatch
, which currently exposes this bug in more cases than most people using diff-match-patch
directly would.
from diff-match-patch.
cc: @JackuB ^^^
from diff-match-patch.
Related Issues (13)
- Wrong license on npm HOT 2
- Update to newest Google version HOT 1
- API messed up 1.0.0 -> 1.0.3 HOT 3
- Can't instantiate DiffMatchPatch HOT 2
- Files with more than 65535 lines produce incorrect diffs HOT 1
- How to compare multi-line diff
- Boilerplate in License HOT 1
- Uncaught ReferenceError: module is not defined HOT 2
- Possible diff miss-match between JavaScript and C++ HOT 2
- Documentation/Examples HOT 2
- The link to the API doesn't exist anymore HOT 2
- patch_toText, unified diff like output
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from diff-match-patch.