Metadata-Version: 2.1
Name: smart-open
Version: 1.6.0
Summary: Utils for streaming large files (S3, HDFS, gzip, bz2...)
Home-page: https://github.com/piskvorky/smart_open
Author: Radim Rehurek
Author-email: me@radimrehurek.com
Maintainer: Radim Rehurek
Maintainer-email: me@radimrehurek.com
License: MIT
Download-URL: http://pypi.python.org/pypi/smart_open
Keywords: file streaming,s3,hdfs
Platform: any
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: System :: Distributed Computing
Classifier: Topic :: Database :: Front-Ends
Requires-Dist: boto (>=2.32)
Requires-Dist: bz2file
Requires-Dist: requests
Requires-Dist: boto3
Provides-Extra: test
Requires-Dist: mock; extra == 'test'
Requires-Dist: moto (==0.4.31); extra == 'test'
Requires-Dist: pathlib2; extra == 'test'
Requires-Dist: responses; extra == 'test'

=============================================
smart_open -- utils for streaming large files
=============================================

|License|_ |Travis|_

.. |License| image:: https://img.shields.io/pypi/l/smart_open.svg
.. |Travis| image:: https://travis-ci.org/RaRe-Technologies/smart_open.svg?branch=master
.. _Travis: https://travis-ci.org/RaRe-Technologies/smart_open
.. _License: https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE

What?
=====

``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local (compressed) files. It is well tested (using `moto <https://github.com/spulec/moto>`_), well documented and sports a simple, Pythonic API:

.. code-block:: python

  >>> # stream lines from an S3 object
  >>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
  ...     print line

  >>> # using a completely custom s3 server, like s3proxy:
  >>> for line in smart_open.smart_open('s3u://user:secret@host:port@mybucket/mykey.txt'):
  ...     print line

  >>> # you can also use a boto.s3.key.Key instance directly:
  >>> key = boto.connect_s3().get_bucket("my_bucket").get_key("my_key")
  >>> with smart_open.smart_open(key) as fin:
  ...     for line in fin:
  ...         print line

  >>> # can use context managers too:
  >>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
  ...     for line in fin:
  ...         print line
  ...     fin.seek(0)  # seek to the beginning
  ...     print fin.read(1000)  # read 1000 bytes

  >>> # stream from HDFS
  >>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
  ...     print line

  >>> # stream from HTTP
  >>> for line in smart_open.smart_open('http://example.com/index.html'):
  ...     print line

  >>> # stream from WebHDFS
  >>> for line in smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
  ...     print line

  >>> # stream content *into* S3 (write mode):
  >>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
  ...     for line in ['first line', 'second line', 'third line']:
  ...         fout.write(line + '\n')

  >>> # stream content *into* HDFS (write mode):
  >>> with smart_open.smart_open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
  ...     for line in ['first line', 'second line', 'third line']:
  ...         fout.write(line + '\n')

  >>> # stream content *into* WebHDFS (write mode):
  >>> with smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
  ...     for line in ['first line', 'second line', 'third line']:
  ...         fout.write(line + '\n')

  >>> # stream from/to local compressed files:
  >>> for line in smart_open.smart_open('./foo.txt.gz'):
  ...     print line

  >>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
  ...     fout.write("some content\n")

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra method ``smart_open.s3_iter_bucket()`` that does this efficiently, **processing the bucket keys in parallel** (using multiprocessing):

.. code-block:: python

  >>> # get all JSON files under "mybucket/foo/"
  >>> bucket = boto.connect_s3().get_bucket('mybucket')
  >>> for key, content in smart_open.s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
  ...     print key, len(content)

For more info (S3 credentials in URI, minimum S3 part size...) and full method signatures, check out the API docs:

.. code-block:: python

  >>> import smart_open
  >>> help(smart_open.smart_open_lib)

S3-Specific Options
-------------------

There are a few optional keyword arguments that are useful only for S3 access.

The **host** and **profile** arguments are both passed to ``boto.connect_s3()`` as keyword arguments:

.. code-block:: python

  >>> smart_open.smart_open('s3://', host='s3.amazonaws.com')
  >>> smart_open.smart_open('s3://', profile_name='my-profile')

The **s3_session** argument allows you to provide a custom ``boto3.Session`` instance for connecting to S3:

.. code-block:: python

  >>> smart_open.smart_open('s3://', s3_session=boto3.Session())

The **s3_upload** argument accepts a dict of any parameters accepted by boto3's ``initiate_multipart_upload``:

.. code-block:: python

  >>> smart_open.smart_open('s3://', s3_upload={ 'ServerSideEncryption': 'AES256' })

The S3 reader supports gzipped content transparently, as long as the key is obviously a gzipped file (e.g. its name ends with ``.gz``).
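For example, streaming such a key decompresses it on the fly; a minimal sketch, with a hypothetical bucket and key name:

.. code-block:: python

  >>> # the ".gz" suffix of the (hypothetical) key name triggers transparent
  >>> # gzip decompression while streaming; no temporary files are created
  >>> for line in smart_open.smart_open('s3://mybucket/mylog.txt.gz'):
  ...     print line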
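S3 credentials can also be embedded directly in the URI (see the API docs above for details); a minimal sketch with placeholder credentials, mirroring the ``user:secret@`` form of the ``s3u`` example earlier:

.. code-block:: python

  >>> # ACCESS_KEY_ID and SECRET_ACCESS_KEY are placeholders for illustration;
  >>> # never hard-code real credentials
  >>> for line in smart_open.smart_open('s3://ACCESS_KEY_ID:SECRET_ACCESS_KEY@mybucket/mykey.txt'):
  ...     print line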
Why?
----

Working with large S3 files using Amazon's default Python library, `boto <https://github.com/boto/boto>`_, is a pain. Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they load the entire content into RAM with no streaming. There are nasty hidden gotchas when using ``boto``'s multipart upload functionality, and a lot of boilerplate.

``smart_open`` shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.

Installation
------------
::

    pip install smart_open

Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::

    python setup.py test  # run unit tests
    python setup.py install

To run the unit tests (optional), you'll also need to install ``mock``, `moto <https://github.com/spulec/moto>`_ and ``responses`` (``pip install mock moto responses``). The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.

Todo
----

``smart_open`` is an ongoing effort. Suggestions, pull requests and improvements are welcome!

On the roadmap:

* better documentation for the default ``file://`` scheme

Comments, bug reports
---------------------

``smart_open`` lives on `GitHub <https://github.com/RaRe-Technologies/smart_open>`_. You can file issues or pull requests there.

----------------

``smart_open`` is open source software released under the `MIT license <https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE>`_.
Copyright (c) 2015-now `Radim Řehůřek <https://radimrehurek.com>`_.