
Comments (5)

normanrz commented on August 16, 2024

I guess double slashes would then need to be constructed explicitly. Happy to review a PR, if you want to give the urljoin behaviour a try.

from universal_pathlib.

normanrz commented on August 16, 2024

> Personally I would expect it to behave like urljoin.

I would agree. Is there actually a use case for double slashes in the middle of a url path?

joouha commented on August 16, 2024

> Is there actually a use case for double slashes in the middle of a url path?

Most web servers treat a double slash the same as a single slash, but a server is free to return different responses for each. For example, these two URIs point to different pages:

https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python
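
As a minimal sketch using only the standard library (the helper name `path_segments` is made up for illustration), splitting the two URIs above shows that the second one carries an extra, empty path segment:

```python
from urllib.parse import urlsplit


def path_segments(url):
    """Return the path segments of a URL without any normalization."""
    path = urlsplit(url).path
    # Drop only the single leading slash so that empty segments survive.
    if path.startswith("/"):
        path = path[1:]
    return path.split("/")


single = path_segments("https://en.wikipedia.org/wiki/Python")
double = path_segments("https://en.wikipedia.org/wiki//Python")

print(single)  # ['wiki', 'Python']
print(double)  # ['wiki', '', 'Python']
```

Whether a server treats the empty segment as significant is up to the server; the sketch only shows that the two paths are textually distinct.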

ap-- commented on August 16, 2024

I've been thinking about this for a bit, and I wonder what's the best way to address this.

For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"

If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
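
To make the idea concrete without touching universal_pathlib itself, here is a toy model. The `SlashPath` class is hypothetical and not part of any library; it only illustrates a `/` operator that keeps empty parts instead of dropping them:

```python
class SlashPath:
    """Toy path type that keeps empty parts instead of dropping them."""

    def __init__(self, root, parts=()):
        self.root = root
        self.parts = tuple(parts)

    def __truediv__(self, part):
        # Append the part as-is; an empty string is kept and renders as "//".
        return SlashPath(self.root, self.parts + (part,))

    def __str__(self):
        return "/".join((self.root,) + self.parts)


p = SlashPath("protocol://somepath") / "" / "abc"
print(p)  # protocol://somepath//abc
```

In this toy model a trailing empty part renders as a single trailing slash, which is just one of several possible conventions.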

Thinking this through might be a little more involved though, since a lot of users might expect paths to behave similarly across different file systems. For example, on POSIX and Windows a directory can't have the same name as a file, so users (or at least I 😅) usually expect:

UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")

which is why stdlib pathlib currently normalizes those paths to the same value. So I guess for supporting empty parts we would actually need to implement behavior like:

>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")

>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")

# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")

# --> so we should not normalize trailing slashes on those filesystems, I guess
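
For comparison, the stdlib normalization mentioned above can be checked directly with `PurePosixPath`. This is only a small sketch of current pathlib behavior for relative paths, not a statement about what UPath does:

```python
from pathlib import PurePosixPath

a = PurePosixPath("somepath")
b = PurePosixPath("somepath/")
c = PurePosixPath("somepath//")

# Trailing slashes are dropped, so all three compare equal.
print(a == b == c)  # True

# Repeated slashes in the middle collapse to one.
print(PurePosixPath("somepath//abc"))  # somepath/abc

# Joining with an empty string is a no-op in pathlib.
print(PurePosixPath("somepath") / "" / "abc")  # somepath/abc
```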

And regarding the switch to urljoin: I usually find the urljoin behavior unintuitive. For example just check the behavior below:

from urllib.parse import urljoin

roots = [
    "http://example.com",
    "http://example.com/",
    "http://example.com/c",
    "http://example.com/c/",
]

paths = [
    "",
    "a/b",
    "/a/b",
    "//a/b",
    "///a/b",
    "////a/b",
    "/////a/b",
]

for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))


# output of the above script
urljoin('http://example.com', '')            == 'http://example.com'
urljoin('http://example.com', 'a/b')         == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b')        == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b')       == 'http://a/b'
urljoin('http://example.com', '///a/b')      == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b')     == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b')    == 'http://example.com///a/b'
urljoin('http://example.com/', '')           == 'http://example.com/'
urljoin('http://example.com/', 'a/b')        == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b')      == 'http://a/b'
urljoin('http://example.com/', '///a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b')    == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b')   == 'http://example.com///a/b'
urljoin('http://example.com/c', '')          == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b')       == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b')      == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b')     == 'http://a/b'
urljoin('http://example.com/c', '///a/b')    == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b')   == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b')  == 'http://example.com///a/b'
urljoin('http://example.com/c/', '')         == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b')      == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b')     == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b')    == 'http://a/b'
urljoin('http://example.com/c/', '///a/b')   == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b')  == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'
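
One way to read the table above is by the kind of reference being joined. The sketch below is a rough classification based on what Python's `urlsplit` reports, not a strict RFC 3986 parser; the function name `reference_kind` and its labels are made up for illustration:

```python
from urllib.parse import urlsplit


def reference_kind(ref):
    """Roughly classify a URI reference using urlsplit's view of it."""
    parts = urlsplit(ref)
    if parts.scheme:
        return "absolute URI"
    if parts.netloc:
        # e.g. "//a/b": the host is replaced, the scheme is inherited.
        return "network-path reference"
    if parts.path.startswith("/"):
        # e.g. "/a/b": replaces the whole path of the base.
        return "absolute-path reference"
    if ref == "":
        return "empty reference"
    # e.g. "a/b": resolved against the base path's last "/".
    return "relative-path reference"


for ref in ["", "a/b", "/a/b", "//a/b", "///a/b"]:
    print(repr(ref).ljust(10), reference_kind(ref))
```

Because `urlsplit` reports an empty netloc for `'///a/b'`, this sketch groups it with absolute paths, which matches how it behaves in the output above.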

I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check how fsspec handles this for http filesystems, to make sure that all of this is supported upstream before introducing special functionality in universal_pathlib. @joouha where did this issue pop up initially?

joouha commented on August 16, 2024

Hi,

For a bit of background, I encountered this issue when trying to load resources from web pages. I wanted a universal interface for loading resources over a range of protocols, so universal_pathlib seemed like a good option.

Say I load the page http://www.example.com/a/b/index.html with the following content:

<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">

I would expect to be able to join the page's URL with any resource link using the / operator, and end up at the same resources that a browser would load (which is also urljoin's behaviour):

>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/a/b/image.png?version=1")

>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")

>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")
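
For reference, this is what the standard library's `urljoin` returns for the five links in the example page. It is a plain stdlib sketch and does not involve UPath:

```python
from urllib.parse import urljoin

page = "http://www.example.com/a/b/index.html"
links = [
    "image.png",
    "../image.png",
    "/image.png",
    "ftp://other.com/image.png",
    "//other.com/image.png",
]

# Resolve every link against the page URL, as a browser would.
resolved = {link: urljoin(page, link) for link in links}

for link, target in resolved.items():
    print(link.ljust(28), "->", target)

# image.png                    -> http://www.example.com/a/b/image.png
# ../image.png                 -> http://www.example.com/a/image.png
# /image.png                   -> http://www.example.com/image.png
# ftp://other.com/image.png    -> ftp://other.com/image.png
# //other.com/image.png        -> http://other.com/image.png
```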

Since upath works with URIs, I would expect its behaviour to follow the standards for URIs defined in RFC 3986.

I would expect UPath normalization and joining rules to differ from pathlib, since pathlib works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.

So as a user, I would expect the following posix paths to be equivalent:

PosixPath("/somepath") == PosixPath("/somepath/") == PosixPath("/somepath//")

but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:

HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")

(which they actually do).
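
A comparison that respects this could normalize only what is case-insensitive and leave the path alone. The sketch below assumes a hypothetical `uri_key` helper (not an existing universal_pathlib or fsspec function) that lowercases the scheme and host and keeps the path verbatim:

```python
from urllib.parse import urlsplit


def uri_key(url):
    """Hypothetical comparison key: lowercase scheme and host, keep the path as-is."""
    parts = urlsplit(url)
    # Assumption: the netloc holds only a host here (no userinfo to preserve).
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query)


urls = [
    "https://en.wikipedia.org/wiki/Film",
    "https://en.wikipedia.org/wiki/Film/",
    "https://en.wikipedia.org/wiki//Film",
    "https://en.wikipedia.org/wiki//Film/",
]

# All four keys stay distinct because slashes are never collapsed.
print(len({uri_key(u) for u in urls}))  # 4

# Differences in scheme or host case alone do not matter.
print(uri_key("HTTPS://EN.Wikipedia.org/wiki/Film") == uri_key(urls[0]))  # True
```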

RFC 3986 defines how many of the methods in universal_pathlib should be implemented when dealing with URIs, such as joining URI paths, normalizing URIs, and URI equivalence.


Also, I like this as a way of constructing URI paths with double slashes - very elegant!

>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
