Hello! Should UPath._make_child rep

Use `urllib.parse.urljoin` when joining paths (universal_pathlib issue, closed)

fsspec commented on August 16, 2024

Use `urllib.parse.urljoin` when joining paths

from universal_pathlib.

Comments (5)

normanrz commented on August 16, 2024

I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the urljoin behaviour a try.

normanrz commented on August 16, 2024

> Personally I would expect it to behave like urljoin.

I would agree. Is there actually a use case for double slashes in the middle of a url path?

joouha commented on August 16, 2024

> Is there actually a use case for double slashes in the middle of a url path?

Most web servers treat a double slash the same as a single slash, but a server is free to return different responses for each. For example, these two URIs point to different pages:

https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python

ap-- commented on August 16, 2024

I've been thinking about this for a bit, and I wonder what's the best way to address this.

For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"

If we assume some filesystem that supports "double slashes", I think an intuitive "pathlib-style" way to produce a double slash would be:

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")

Thinking this through might be a little more involved though, since a lot of users might expect paths to behave similarly across different file systems. For example, on POSIX and Windows, because a directory can't have the same name as a file, users (or at least I 😅) usually expect:

UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")

which is why stdlib pathlib currently normalizes those paths to the same path. So I guess to support empty parts we would actually need to implement behavior like:

>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")

>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")

# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")

# --> so we should not normalize trailing slashes on those filesystems, I guess
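
For comparison, here is a minimal sketch of what stdlib pathlib does today with trailing slashes and empty parts. It uses `PurePosixPath` rather than `UPath`, so it only illustrates the normalization being discussed, not universal_pathlib's own behavior:

```python
from pathlib import PurePosixPath

base = PurePosixPath("/somepath")

# Trailing and repeated trailing slashes are dropped when the path is parsed,
# so all three spellings compare equal.
same = base == PurePosixPath("/somepath/") == PurePosixPath("/somepath//")
print(same)  # True

# Joining an empty part is a no-op, so no double slash is produced.
joined = base / "" / "abc"
print(joined)  # /somepath/abc
```

Supporting empty parts would therefore mean deviating from this normalization for the file systems that need it.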

And regarding the switch to urljoin: I usually find the urljoin behavior unintuitive. For example, check the behavior below:

from urllib.parse import urljoin

roots = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/c",
    "http://example.com/c/",
]

paths = [
    "",
    "a/b",
    "/a/b",
    "//a/b",
    "///a/b",
    "////a/b",
    "/////a/b",
]

for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))


# output of the above script
urljoin('http://example.com', '')            == 'http://example.com'
urljoin('http://example.com', 'a/b')         == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b')        == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b')       == 'http://a/b'
urljoin('http://example.com', '///a/b')      == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b')     == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b')    == 'http://example.com///a/b'
urljoin('http://example.com/', '')           == 'http://example.com/'
urljoin('http://example.com/', 'a/b')        == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b')      == 'http://a/b'
urljoin('http://example.com/', '///a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b')    == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b')   == 'http://example.com///a/b'
urljoin('http://example.com/c', '')          == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b')      == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b')     == 'http://a/b'
urljoin('http://example.com/c', '///a/b')    == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b')   == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b')  == 'http://example.com///a/b'
urljoin('http://example.com/c/', '')         == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b')      == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b')    == 'http://a/b'
urljoin('http://example.com/c/', '///a/b')   == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b')  == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'
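
One way to read the table above is to put `urljoin` next to the pathlib `/` operator. The sketch below uses `PurePosixPath` as a stand-in for pathlib-style joining (an assumption made for illustration, it is not how `UPath` joins internally) and highlights where the two differ:

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin

# urljoin: a relative reference replaces the last path segment of the base
# unless the base ends with a slash.
url_no_slash = urljoin("http://example.com/c", "a/b")
url_with_slash = urljoin("http://example.com/c/", "a/b")

# pathlib: "/" always appends, whether or not the base has a trailing slash.
path_no_slash = PurePosixPath("/c") / "a/b"
path_with_slash = PurePosixPath("/c/") / "a/b"

# Both treat a single leading slash as "start again from the root".
url_abs = urljoin("http://example.com/c/", "/a/b")
path_abs = PurePosixPath("/c") / "/a/b"

print(url_no_slash)     # http://example.com/a/b
print(url_with_slash)   # http://example.com/c/a/b
print(path_no_slash)    # /c/a/b
print(path_with_slash)  # /c/a/b
print(url_abs)          # http://example.com/a/b
print(path_abs)         # /a/b
```

The main difference is the first case: with `urljoin`, whether the base ends in a slash changes the result, while pathlib-style joining ignores it.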

I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check how fsspec handles this for http filesystems, to make sure that all of this is supported upstream before introducing special functionality in universal_pathlib. @joouha, where did this issue pop up initially?

joouha commented on August 16, 2024

Hi,

For a bit of background, I encountered this issue when trying to load resources from web pages. I wanted a universal interface for loading resources over a range of protocols, so universal_pathlib seemed like a good option.

Say I load the page http://www.example.com/a/b/index.html with the following content:

<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">

I would expect to be able to join the page's URL with any resource link using the / operator, and end up at the same resources that a browser would load (which is also urljoin's behaviour):

>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/a/b/image.png?version=1")

>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")
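
As a reference point, this is what `urllib.parse.urljoin` returns for the five links from the page above. It is only a sketch of the resolution a browser would perform; the `HTTPPath` results shown above are the desired behaviour, not what `UPath` currently does:

```python
from urllib.parse import urljoin

base = "http://www.example.com/a/b/index.html"

links = [
    "image.png?version=1",        # relative to the page's directory
    "../image.png",               # one directory up
    "/image.png",                 # relative to the site root
    "ftp://other.com/image.png",  # absolute URL with its own scheme
    "//other.com/image.png",      # network-path reference, keeps the scheme
]

# Resolve every link against the page URL.
resolved = {link: urljoin(base, link) for link in links}

for link, url in resolved.items():
    print(f"{link!r:32} -> {url}")
```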

Since upath works with URIs, I would expect its behaviour to follow the standard for URIs defined in RFC3986.

I would expect UPath normalization and joining rules to differ from pathlib, since pathlib works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.

So as a user, I would expect the following posix paths to be equivalent:

PosixPath("/somepath") == PosixPath("/somepath/") == PosixPath("/somepath//")

but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:

HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")

(which they actually do).
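
To make the difference concrete, the sketch below splits those four URLs with `urllib.parse.urlsplit` and lists their path segments. The RFC3986 path grammar allows empty segments, and `urlsplit` keeps them, so each URL ends up with a distinct path. The `segments` mapping is only an illustration, not part of any library API:

```python
from urllib.parse import urlsplit

urls = [
    "https://en.wikipedia.org/wiki/Film",
    "https://en.wikipedia.org/wiki/Film/",
    "https://en.wikipedia.org/wiki//Film",
    "https://en.wikipedia.org/wiki//Film/",
]

# urlsplit does not normalize the path, so double and trailing slashes survive.
paths = [urlsplit(url).path for url in urls]

# Split each path into segments, dropping the empty string before the root slash.
segments = {path: path.split("/")[1:] for path in paths}

for path, parts in segments.items():
    print(f"{path!r:16} -> {parts}")
```

An empty string in the middle of a list is the "empty path part" discussed earlier; an empty string at the end corresponds to a trailing slash.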

RFC3986 defines how many of the methods in universal_pathlib should be implemented when dealing with URIs, such as joining URI paths, normalizing URIs, and URI equivalence.

Also, I like this as a way of constructing URI paths with double slashes - very elegant!

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
