I'm opening this issue to create a conversation around supporting unicode and other il

Unicode Character Expansion about path-to-regexp HOT 14 CLOSED

pillarjs commented on August 27, 2024 1

from path-to-regexp.

Comments (14)

blakeembrey commented on August 27, 2024 3

Here's a framework level function that should work (but could use some improvements):

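function encoding (ignore) {
  // One percent-encoded UTF-8 character: 4-byte, 3-byte, 2-byte, then single-byte forms.
  // A %F0-%F7 lead byte needs three continuation bytes, not two.
  const ENCODED_CHAR = /%f[0-7](?:%[89ab][0-9a-f]){3}|%e[0-9a-f](?:%[89ab][0-9a-f]){2}|%[cd][0-9a-f]%[89ab][0-9a-f]|%[0-9a-f]{2}/ig

  return {
    encode: function (value) {
      return encodeURIComponent(value)
    },
    decode: function (value) {
      return value.replace(ENCODED_CHAR, function (m) {
        let char

        try {
          char = decodeURIComponent(m)
        } catch (e) {
          return m // Leave malformed sequences (e.g. a lone %80) untouched.
        }

        // Keep characters listed in `ignore` percent-encoded.
        if (ignore.indexOf(char) > -1) return m

        return char
      })
    }
  }
}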

Note: Technically you cannot just use decodeURI(), because it leaves an escaped : (%3A) undecoded, even though a : can arrive encoded.
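
To make the note above concrete, here is a minimal sketch using only built-in functions. The variable names are illustrative and not from the thread.

```javascript
// decodeURI leaves escapes of reserved characters (such as ":") untouched,
// while decodeURIComponent decodes every valid escape sequence.
const escapedColon = '%3A'

const viaDecodeURI = decodeURI(escapedColon)
const viaDecodeURIComponent = decodeURIComponent(escapedColon)

console.log(viaDecodeURI)          // '%3A'
console.log(viaDecodeURIComponent) // ':'

// A client that builds URLs with encodeURIComponent will send the escaped form.
const sentByClient = encodeURIComponent(':')
console.log(sentByClient)          // '%3A'
```

So a route containing a literal : would only match such a request if the decoder handles %3A as well.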

dougwilson commented on August 27, 2024

I'm fine with this, as long as it expands to the correct Unicode normalization form, whatever the specs say IRIs are expanded to.

blakeembrey commented on August 27, 2024

@dougwilson Awesome, I'll start an investigation on that 👍

dougwilson commented on August 27, 2024

Looks like it's RFC 3987 section 3.1 :)

Also, if it helps, Mojolicious already does this stuff, so it'd be interesting to know how they are doing their transformation, since it has worked well for a long time.

blakeembrey commented on August 27, 2024

Very nice, thank you 😄

blakeembrey commented on August 27, 2024

@dougwilson Just looking through the implementation, it looks like they decode it at the framework level before passing it to route matches? Did I understand that correctly? http://mojolicio.us/perldoc/Mojo/Path#to_route

Is that something that could be considered for Express 5.0?

dougwilson commented on August 27, 2024

I'll have to look more at their source code (which can be browsed at https://metacpan.org/source/SRI/Mojolicious-6.0/lib), but I'm open to whatever we want to do for Express 5.0. Currently Express 5.0 just passes the raw req.url down to this module, so theoretically that makes it the most flexible of all, but if it's not flexible enough, we can always make it more flexible :)

But really, what I wanted to know is how do they do the UTF-8 -> URI transformation. The characters in JavaScript source code are UCS-2 and so we need to have some kind of way to transform source code strings to URIs reliably. For example, if I type ú in source code, there are actually several different byte sequences that result in that character and even multiple different Unicode code points (U+00FA or U+0075 U+0301). Which ones do we support, or do we treat them the same?

Example:

decodeURIComponent('%C3%BA') // -> ú
decodeURIComponent('u%CC%81') // -> ú

Should a user have to understand that the ú they type in their editor may not match the ú that comes in the URI?
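
A small sketch of why the two inputs above differ even though they render identically. The codePoints helper is hypothetical and exists only for this illustration.

```javascript
const precomposed = decodeURIComponent('%C3%BA') // single code point
const decomposed = decodeURIComponent('u%CC%81') // "u" plus a combining acute accent

// Hypothetical helper: list the code points of a string in U+XXXX notation.
const codePoints = (s) =>
  Array.from(s, (c) => 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))

console.log(codePoints(precomposed))    // [ 'U+00FA' ]
console.log(codePoints(decomposed))     // [ 'U+0075', 'U+0301' ]
console.log(precomposed === decomposed) // false
```

A route written with one form will not match a request carrying the other under a plain string or RegExp comparison.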

bajtos commented on August 27, 2024

decode it at the framework level before passing it to route matches

This is a great idea, especially if there is a way to get it working reliably for all edge cases. It should solve the problem where there are multiple ways to encode a single character, be it ú as mentioned by @dougwilson, or characters like ~ that some clients encode and some don't (IIRC).

It may be possible to implement it in a backward-compatible way if we url-decode the path inputs passed by API users too.

blakeembrey commented on August 27, 2024

@dougwilson So ES6 has a built in method for this: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

decodeURIComponent('u%CC%81').normalize() === decodeURIComponent('%C3%BA')

Edit: String#normalize is only available from Node 0.12, but we can use a third-party module like https://github.com/walling/unorm for older Node versions. I think putting this at the framework level actually makes more sense here, because otherwise we'll be decoding it a bunch of times and the flow will be more confusing. If the path accepts unicode, I think we should also be able to compare manually, e.g. req.url === 'ú'.
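
As a rough sketch of what decoding once at the framework level could look like: decodeAndNormalize is a hypothetical helper, not an Express or path-to-regexp API, and the choice of NFC is an assumption made for the example.

```javascript
// Hypothetical helper: decode a raw path once, then normalize to NFC so that
// different encodings of the same character compare equal.
function decodeAndNormalize (rawPath) {
  try {
    // Caveat: this also turns %2F into "/", which changes segment boundaries.
    return decodeURIComponent(rawPath).normalize('NFC')
  } catch (err) {
    return null // malformed percent-encoding, e.g. "/%E0%A4%A"
  }
}

const fromPrecomposed = decodeAndNormalize('/caf%C3%A9')
const fromDecomposed = decodeAndNormalize('/cafe%CC%81')

console.log(fromPrecomposed === fromDecomposed) // true
console.log(decodeAndNormalize('/%E0%A4%A'))    // null
```

The route definitions written in source code would need the same normalization applied for the comparison to be reliable.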

nicooprat commented on August 27, 2024

Any news on this?

blakeembrey commented on August 27, 2024

@nicooprat This issue seems like the opposite of what you're looking for (this was about unicode inputs to path matching). From kadirahq/flow-router#599, it seems like what you'd like is the ability to specify how to encode. Currently the only option is pretty, but that won't be enough to handle unicode characters outside of ASCII.

blakeembrey commented on August 27, 2024

By the way, I took another stab at doing this for the path-to-regexp 2.0. I don't think it's reasonably possible since every single character could be encoded differently. This has reinforced my view that it needs to be encoded at the router level. Would love additional input though.

For a simple example, type /+ into the browser. The + is not encoded. Now do encodeURIComponent (which you should be using programmatically to safely encode user input). Now you've got an escaped plus. The route will only match one or the other. This gets even more complex when you have multiple representations of the same character input. I'd recommend decoding at the framework level (it's more performant than using path-to-regexp and having this library decode the same string hundreds of times).
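
The /+ example can be sketched with a hand-written RegExp standing in for a compiled route. This is an illustration only, not what path-to-regexp generates.

```javascript
const typedInBrowser = '/+'
const encodedProgrammatically = '/' + encodeURIComponent('+') // '/%2B'

// Stand-in for a route compiled from the literal path "/+".
const route = /^\/\+$/

console.log(route.test(typedInBrowser))          // true
console.log(route.test(encodedProgrammatically)) // false

// Decoding once before matching makes both requests hit the same route.
const decodedOnce = decodeURIComponent(encodedProgrammatically)
console.log(route.test(decodedOnce))             // true
```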

blakeembrey commented on August 27, 2024

I have added a basic normalizePathname function to this library in 2c3baf1. It uses a simpler implementation than above so you need to make sure of two things if you use this function:

- You don't double decode a string (e.g. normalizePathname already calls decodeURIComponent, you shouldn't call it again in application code)
- You don't mind that %2F will become /, which will change the match (e.g. /test/route and /test%2Froute were previously different)
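
Both caveats can be shown with decodeURIComponent alone. This sketch does not call normalizePathname itself, and the example paths are made up.

```javascript
// Caveat 1: decoding twice changes the data.
const raw = '/files/100%2525'
const decodedOnce = decodeURIComponent(raw)          // '/files/100%25'
const decodedTwice = decodeURIComponent(decodedOnce) // '/files/100%'

console.log(decodedOnce)
console.log(decodedTwice)

// Caveat 2: an escaped slash becomes a real path separator.
const withEscapedSlash = decodeURIComponent('/test%2Froute')
console.log(withEscapedSlash === '/test/route') // true
```

After decoding, /test%2Froute is indistinguishable from /test/route, so routes that relied on the difference will behave differently.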

blakeembrey commented on August 27, 2024

Unfortunately that was a quick 9 hours. I've released 5.x, which removes normalizePathname. I looked at some routing libraries online and played with URL, and the behavior wasn't consistent. I also noted that String.prototype.normalize is not supported in IE. Instead, I've opted to document the solution in the README.

Note for implementors: if you're using URL, I added an encode option to pathToRegexp so you can encode paths using encodeURI. This will ensure path names from URL match against the RegEx as you cannot use the solution in the README (URL will always encode the pathname).
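
A short sketch of why encodeURI lines up with URL. It uses only built-ins and a hand-built RegExp as a stand-in for the library's output, and it does not call pathToRegexp. The host name is a placeholder.

```javascript
// URL always percent-encodes non-ASCII characters in the pathname.
const url = new URL('https://example.com/caf\u00e9')
console.log(url.pathname) // '/caf%C3%A9'

// encodeURI produces the same escaped form for the route path.
const encodedRoute = encodeURI('/caf\u00e9')
console.log(encodedRoute) // '/caf%C3%A9'

// Stand-in for a RegExp compiled from the encoded route path.
const routeRegExp = new RegExp('^' + encodedRoute + '$')
console.log(routeRegExp.test(url.pathname)) // true
```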
