Comments (15)
Transparent (de)compression is only supported for local files now.
It should be possible to do it transparently for S3 files too, using Python's zlib
for compressed stream processing. Let me know if you want to tackle this -- a low-hanging, extremely useful feature!
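
As a rough illustration of the zlib idea above, here is a minimal sketch, not smart_open's actual implementation. The function name `iter_gunzip` is hypothetical, and the list of byte chunks stands in for data arriving from S3; it assumes a single-member gzip stream.

```python
import gzip
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Flush whatever is still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


# Simulate a remote object arriving in small pieces (assumption: 64-byte chunks).
payload = b"hello from a compressed stream\n" * 1000
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(iter_gunzip(pieces))
```

Because the decompressor keeps its own state between calls, the whole object never has to be held in memory at once.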
from smart_open.
+1 for this issue, I have some example code here: https://gist.github.com/brianmingus/a47f26760d244ba7e9d1
from smart_open.
It's interesting how smart_open is evolving into something similar for the Python world to what Apache Commons VFS is to the JVM world. Maybe the abstractions they used could serve as inspiration for building something more general.
from smart_open.
This is useful for me as well. As soon as #38 gets merged, I could write a PR for this one too.
from smart_open.
Thanks for the link @asieira -- I didn't know about VFS.
I'm all for learning from other people's mistakes -- what abstractions and designs in particular do you think would be useful?
from smart_open.
The first thing they did was to create the abstraction for a file system, not for opening a single file. Going the single-file route as smart_open has gone so far is great for formats like xz, gzip and bzip... but if you want to handle archives that themselves contain internal structure (like zip, tar, etc.), that won't work as well. So in VFS, opening a ZIP file is akin to traversing a virtual folder to access the content inside.
Plus, they worked in an arbitrary number of layers. You can build a URI like gz://zip://ftp://ftp.example.com/file/blah.zip!/zipfolder1/file.gz and access a GZIP file, inside a ZIP file, read from an FTP server.
Plus, it's an architecture that can be extended. They defined a set of abstract classes that you can implement to define a new type of filesystem in addition to the built-in ones.
All great ideas, but maybe too complicated for smart_open and worthy of a separate independent project that emulates those ideas in Python.
At the very least, those ideas would inspire me to implement support for gzip and bzip compression / decompression in a way that is general for all file types supported by smart_open.
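
To make the layering idea concrete, here is a small sketch in Python using only the standard library. It is not VFS's or smart_open's API: the names `Layer`, `GzipLayer`, `ZipLayer` and `open_layered` are hypothetical, and an in-memory buffer stands in for the FTP download.

```python
import gzip
import io
import zipfile
from abc import ABC, abstractmethod


class Layer(ABC):
    """Hypothetical base class: turns one binary stream into another."""

    @abstractmethod
    def open(self, stream, path=None):
        """Return a readable binary stream derived from `stream`."""


class GzipLayer(Layer):
    def open(self, stream, path=None):
        return gzip.GzipFile(fileobj=stream, mode="rb")


class ZipLayer(Layer):
    def open(self, stream, path=None):
        # A ZIP archive has internal structure, so a member path is required.
        return zipfile.ZipFile(stream).open(path)


def open_layered(stream, layers):
    """Apply (layer, path) pairs from the outermost container inward."""
    for layer, path in layers:
        stream = layer.open(stream, path)
    return stream


# Build a ZIP archive holding a gzip-compressed member, all in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zipfolder1/file.gz", gzip.compress(b"nested content"))
archive.seek(0)

reader = open_layered(
    archive,
    [(ZipLayer(), "zipfolder1/file.gz"), (GzipLayer(), None)],
)
content = reader.read()
```

A new container or compression format would then only need another `Layer` subclass, which is roughly the extension point described above.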
from smart_open.
+1 on this, I might be able to implement it soon since we'll probably need it. Thanks!
from smart_open.
Sounds great @AndreaCrotti ... that would be really useful!
from smart_open.
This seems to work pretty well as a work-around: https://github.com/commoncrawl/gzipstream
from smart_open.
Sounds good, thanks for the link @mpenkov ! Can you implement this in a PR?
Depending on how tricky gzipstream is to install and how well supported it is, we could either add it to requirements (if easy), or make it optional (if difficult), or even bundle it inside smart_open directly (license permitting).
@tmylk great intro task?
from smart_open.
gzipstream doesn't really have any external requirements (just io and zlib), so it shouldn't be hard to install.
@piskvorky I could in theory, but I'm working on something else right now. If this doesn't get assigned to anyone by the time I'm free, I'll have a look at it.
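
For a sense of how a reader built on just io and zlib can look, here is a minimal sketch. It is not gzipstream's actual code: the class name `GunzipReader` and its `chunk_size` parameter are hypothetical, and it assumes a single-member gzip stream.

```python
import gzip
import io
import zlib


class GunzipReader(io.RawIOBase):
    """Read-only raw stream that decompresses a gzip stream on the fly."""

    def __init__(self, fileobj, chunk_size=16 * 1024):
        super().__init__()
        self._fileobj = fileobj
        self._chunk_size = chunk_size
        self._decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._pending = b""
        self._eof = False

    def readable(self):
        return True

    def readinto(self, buffer):
        # Keep pulling compressed data until we have output or hit the end.
        while not self._pending and not self._eof:
            chunk = self._fileobj.read(self._chunk_size)
            if chunk:
                self._pending = self._decompressor.decompress(chunk)
            else:
                self._pending = self._decompressor.flush()
                self._eof = True
        count = min(len(buffer), len(self._pending))
        buffer[:count] = self._pending[:count]
        self._pending = self._pending[count:]
        return count  # 0 signals end of stream


# io.BytesIO stands in for any non-seekable source, such as an S3 body.
source = io.BytesIO(gzip.compress(b"line one\nline two\n"))
lines = list(io.BufferedReader(GunzipReader(source)))
```

Wrapping the raw reader in `io.BufferedReader` supplies `readline` and line iteration without any extra code.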
On Fri, Jun 3, 2016 at 2:14 PM Radim Řehůřek wrote:
> Sounds good, thanks for the link @mpenkov! Can you implement this in a PR?
> Depending on how tricky gzipstream is to install and how well supported it is, we could either add it to requirements (if easy), or make it optional (if difficult).
> @tmylk great intro task?
from smart_open.
@piskvorky OK, I'm looking into this now.
from smart_open.
@piskvorky @tmylk I think we can close this now. 78c461e resolved this.
from smart_open.
Thanks @mpenkov ! Closing now
from smart_open.
It seems that smart_open can read a gzip file from S3 using a URL, but not using a key. Is that the case?
from smart_open.