Comments (5)
I guess double slashes would then need to be constructed explicitly. Happy to review a PR if you want to give the urljoin behaviour a try.
from universal_pathlib.
Personally I would expect it to behave like urljoin.
I would agree. Is there actually a use case for double slashes in the middle of a url path?
from universal_pathlib.
Is there actually a use case for double slashes in the middle of a url path?
Most web servers will treat a double slash the same as a single slash, but a web server is free to return a different response for each, e.g. these two URIs point to different pages:
https://en.wikipedia.org/wiki/Python
https://en.wikipedia.org/wiki//Python
from universal_pathlib.
I've been thinking about this for a bit, and I wonder what's the best way to address this.
For me it is easier to think about this in "pathlib-terms" if I rephrase this to: "Should specific file systems support empty path parts?"
If we assume some filesystem that supports "double slashes" I think an intuitive "pathlib-style" way to produce a double slash would be:
>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
Thinking this through might be a little more involved though, since a lot of users might expect paths to behave similarly across different file systems. For example, on posix and windows, because a directory can't have the same name as a file, users (or at least I) would expect:
UPath("protocol://somepath") == UPath("protocol://somepath/") == UPath("protocol://somepath//")
which is why stdlib pathlib currently normalizes those paths to the same path. So I guess for supporting empty parts we would actually need to implement behavior like:
>>> UPath("protocol://somepath") / ""
UPath("protocol://somepath//")
>>> assert UPath("protocol://somepath") == UPath("protocol://somepath/")
>>> assert UPath("protocol://somepath") != UPath("protocol://somepath//")
# but on a webserver
>>> UPath("protocol://somepath/a/b") != UPath("protocol://somepath/a/b/")
# --> so we should not normalize trailing slashes on those filesystems, I guess
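For reference, the stdlib normalization mentioned above can be checked without any remote filesystem. This is only a minimal sketch using `pathlib.PurePosixPath`; it says nothing about how UPath behaves or should behave.

```python
from pathlib import PurePosixPath

base = PurePosixPath("/somepath")

# Trailing and repeated trailing slashes are collapsed, so these compare equal.
trailing_equal = base == PurePosixPath("/somepath/") == PurePosixPath("/somepath//")

# Joining an empty part is a no-op in stdlib pathlib.
empty_join = base / ""

# Empty parts in the middle of a path are dropped as well.
middle = PurePosixPath("/somepath//abc")

print(trailing_equal, empty_join, middle.parts)
```

So stdlib pathlib has no way to represent an empty path part, which is why supporting one would need the extra rules sketched in the REPL lines above.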
And regarding the switch to urljoin: I usually find the urljoin behavior unintuitive. For example, just check the behavior below:
from urllib.parse import urljoin
roots = [
"http://example.com",
"http://example.com/",
"http://example.com/c",
"http://example.com/c/",
]
paths = [
"",
"a/b",
"/a/b",
"//a/b",
"///a/b",
"////a/b",
"/////a/b",
]
for root in roots:
    for path in paths:
        print(f"urljoin({root!r}, {path!r})".ljust(44), "==", repr(urljoin(root, path)))
# output of the above script
urljoin('http://example.com', '') == 'http://example.com'
urljoin('http://example.com', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '//a/b') == 'http://a/b'
urljoin('http://example.com', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/', '') == 'http://example.com/'
urljoin('http://example.com/', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '//a/b') == 'http://a/b'
urljoin('http://example.com/', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/c', '') == 'http://example.com/c'
urljoin('http://example.com/c', 'a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '//a/b') == 'http://a/b'
urljoin('http://example.com/c', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/c', '/////a/b') == 'http://example.com///a/b'
urljoin('http://example.com/c/', '') == 'http://example.com/c/'
urljoin('http://example.com/c/', 'a/b') == 'http://example.com/c/a/b'
urljoin('http://example.com/c/', '/a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c/', '//a/b') == 'http://a/b'
urljoin('http://example.com/c/', '///a/b') == 'http://example.com/a/b'
urljoin('http://example.com/c/', '////a/b') == 'http://example.com//a/b'
urljoin('http://example.com/c/', '/////a/b') == 'http://example.com///a/b'
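One way to make sense of the rows with leading slashes is to look at how the second argument is parsed on its own. The sketch below uses only `urllib.parse.urlsplit` and one `urljoin` call; it deliberately inspects the parsed components rather than re-deriving every joined string.

```python
from urllib.parse import urljoin, urlsplit

# How each reference is split before any joining happens.
for ref in ["/a/b", "//a/b", "///a/b", "////a/b"]:
    parts = urlsplit(ref)
    print(f"{ref!r}: netloc={parts.netloc!r} path={parts.path!r}")

# A reference starting with exactly two slashes carries its own authority ("a"),
# so only the scheme is inherited from the base.
network_path = urljoin("http://example.com/c/", "//a/b")
print(network_path)
```

With two leading slashes the first segment is read as a host, while with three or more the authority is empty and the remaining slashes stay in the path.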
I think we should go through all of this using a concrete example and define the behavior beforehand. I would also check how fsspec handles this for http filesystems to make sure that all of this is supported upstream before introducing special functionality in universal_pathlib. @joouha where did this issue pop up initially?
from universal_pathlib.
Hi,
For a bit of background, I encountered this issue when trying to load resources from web-pages. I wanted a universal interface to be able to load resources from a range of protocols, so universal pathlib seemed like a good option.
Say I load the page http://www.example.com/a/b/index.html with the following content:
<img src="image.png">
<img src="../image.png">
<img src="/image.png">
<img src="ftp://other.com/image.png">
<img src="//other.com/image.png">
I would expect to be able to join the page's URL with any resource link using the / operator, and end up at the same resources which a browser would load (which is also urljoin's behaviour):
>>> UPath("http://www.example.com/a/b/index.html") / "image.png?version=1"
HTTPPath("http://www.example.com/a/b/image.png?version=1")
>>> UPath("http://www.example.com/a/b/index.html") / "../image.png"
HTTPPath("http://www.example.com/a/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "/image.png"
HTTPPath("http://www.example.com/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "ftp://other.com/image.png"
UPath("ftp://other.com/image.png")
>>> UPath("http://www.example.com/a/b/index.html") / "//other.com/image.png"
HTTPPath("http://other.com/image.png")
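The browser-style resolution described here can be reproduced with the standard library alone. A minimal sketch, using `urllib.parse.urljoin` on plain strings rather than any UPath API:

```python
from urllib.parse import urljoin

page = "http://www.example.com/a/b/index.html"
links = [
    "image.png?version=1",
    "../image.png",
    "/image.png",
    "ftp://other.com/image.png",
    "//other.com/image.png",
]

# Resolve every link against the URL of the page that contains it.
resolved = {link: urljoin(page, link) for link in links}
for link, target in resolved.items():
    print(f"{link!r:32} -> {target}")
```

Relative links replace the last segment of the page URL, a leading slash resets the path, a full URL replaces everything, and a leading double slash keeps only the scheme.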
Since upath works with URIs, I would expect its behaviour to follow the standards for URIs defined in RFC3986.
I would expect UPath normalization and joining rules to differ from pathlib, since pathlib works with POSIX and Windows paths. These are not URIs - they follow their own behaviour patterns defined elsewhere.
So as a user, I would expect the following posix paths to be equivalent:
PosixPath("/somepath") == PosixPath("/somepath/") == PosixPath("/somepath//")
but I would not expect the following URIs to be equivalent, because RFC3986 states that they might point to different resources:
HTTPPath("https://en.wikipedia.org/wiki/Film") != HTTPPath("https://en.wikipedia.org/wiki/Film/") != HTTPPath("https://en.wikipedia.org/wiki//Film") != HTTPPath("https://en.wikipedia.org/wiki//Film/")
(which they actually do).
RFC3986 defines how many of the methods in universal pathlib should be implemented when dealing with URIs, such as joining URI paths, normalizing URIs, and URI equivalence.
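To illustrate the equivalence point, here is a rough sketch of a comparison key that only applies case normalization to the scheme and host (which RFC3986 treats as case-insensitive) and leaves the path untouched. The name `conservative_key` is made up for this example and is not part of universal_pathlib; it also assumes the authority contains no userinfo.

```python
from urllib.parse import urlsplit, urlunsplit


def conservative_key(uri: str) -> str:
    """Hypothetical comparison key: lowercase scheme and host, keep the path as-is."""
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    # Assumption: no "user:pass@" in the authority, since userinfo is case-sensitive.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


uris = [
    "https://en.wikipedia.org/wiki/Film",
    "https://en.wikipedia.org/wiki/Film/",
    "https://en.wikipedia.org/wiki//Film",
    "https://en.wikipedia.org/wiki//Film/",
]
keys = {conservative_key(u) for u in uris}
print(len(keys))
```

Under this key the four URIs stay distinct, while differences in the case of the scheme or host do not matter.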
Also, I like this as a way of constructing URI paths with double slashes - very elegant!
>>> UPath("protocol://somepath") / "" / "abc"
UPath("protocol://somepath//abc")
from universal_pathlib.