Giter VIP home page Giter VIP logo

smart_open's Introduction

smart_open -- utils for streaming large files

Travis_ Downloads_ License_

What?

smart_open is a Python 2 & Python 3 library for efficient streaming of very large files from/to S3, HDFS or local (compressed) files. It is well tested (using moto), well documented and sports a simple, Pythonic API:

>>> # stream lines from an S3 object
>>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
...    print line

>>> # can use context managers too:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     fin.seek(0)  # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes

>>> # stream from HDFS
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...    print line

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...          fout.write(line + '\n')

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...    print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...    fout.write("some content\n")

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra method smart_open.s3_iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

>>> # get all JSON files under "mybucket/foo/"
>>> bucket = boto.connect_s3().get_bucket('mybucket')
>>> for key, content in s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
...     print key, len(content)

For more info (S3 credentials in URI, minimum S3 part size...) and full method signatures, check out the API docs:

>>> import smart_open
>>> help(smart_open.smart_open_lib)

Why?

Working with large S3 files using Amazon's default Python library, boto, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). There are nasty hidden gotchas when using boto's multipart upload functionality, and a lot of boilerplate.

smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.

Installation

The module has no dependencies beyond Python >= 2.6 (or Python >= 3.3) and boto:

pip install smart_open

Or, if you prefer to install from the source tar.gz:

python setup.py test  # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install mock and moto (pip install mock moto). The tests are also run automatically with Travis CI on every commit push & pull request.

Todo

smart_open is an ongoing effort. Suggestions, pull request and improvements welcome!

On the roadmap:

  • improve smart_open support for HDFS (streaming from/to Hadoop File System)
  • better documentation for the default file:// scheme

Comments, bug reports

smart_open lives on github. You can file issues or pull requests there.


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.