```
Metadata-Version: 2.1
Name: hickle
Version: 3.3.2
Summary: Hickle - a HDF5 based version of pickle
Home-page: http://github.com/telegraphic/hickle
Author: Danny Price
Author-email: dan@thetelegraphic.com
License: UNKNOWN
Download-URL: https://github.com/telegraphic/hickle/archive/3.3.2.tar.gz
Keywords: pickle,hdf5,data storage,data export
Platform: Cross platform (Linux
Platform: Mac OSX
Platform: Windows)
Requires-Python: >=2.7
Requires-Dist: numpy
Requires-Dist: h5py
Provides-Extra: astropy
Requires-Dist: astropy ; extra == 'astropy'
Provides-Extra: pandas
Requires-Dist: pandas ; extra == 'pandas'
Provides-Extra: scipy
Requires-Dist: scipy ; extra == 'scipy'
```
[![Build Status](https://travis-ci.org/telegraphic/hickle.svg?branch=master)](https://travis-ci.org/telegraphic/hickle)
[![JOSS Status](http://joss.theoj.org/papers/0c6638f84a1a574913ed7c6dd1051847/status.svg)](http://joss.theoj.org/papers/0c6638f84a1a574913ed7c6dd1051847)

Hickle
======

Hickle is an [HDF5](https://www.hdfgroup.org/solutions/hdf5/)-based clone of `pickle`, with a twist: instead of serializing to a pickle file, Hickle dumps to an HDF5 file (Hierarchical Data Format). It is designed to be a "drop-in" replacement for pickle (for common data objects), but is really an amalgam of `h5py` and `dill`/`pickle` with extended functionality. That is: `hickle` is a neat little way of dumping python variables to HDF5 files that can be read in most programming languages, not just Python. Hickle is fast, and allows for transparent compression of your data (LZF / GZIP).
Why use Hickle?
---------------

While `hickle` is designed to be a drop-in replacement for `pickle` (or something like `json`), it works very differently. Instead of serializing / json-izing, it stores the data using the excellent [h5py](https://www.h5py.org/) module.

The main reasons to use hickle are:

1. It's faster than pickle and cPickle.
2. It stores data in HDF5.
3. You can easily compress your data.

The main reasons not to use hickle are:

1. You don't want to store your data in HDF5. While hickle can serialize arbitrary python objects, this functionality is provided only for convenience, and you're probably better off just using the pickle module.
2. You want to convert your data to human-readable JSON/YAML, in which case you should do that instead.

So, if you want your data in HDF5, or if your pickling is taking too long, give hickle a try. Hickle is particularly good at storing large numpy arrays, thanks to `h5py` running under the hood.
Recent changes
--------------

* November 2018: Submitted to the Journal of Open-Source Software (JOSS).
* June 2018: Major refactor and support for Python 3.
* Aug 2016: Added support for scipy sparse matrices `bsr_matrix`, `csr_matrix` and `csc_matrix`.
Performance comparison
----------------------

Hickle runs a lot faster than pickle with its default settings, and a little faster than pickle with `protocol=2` set:

```Python
In [1]: import numpy as np

In [2]: x = np.random.random((2000, 2000))

In [3]: import pickle

In [4]: f = open('foo.pkl', 'w')

In [5]: %time pickle.dump(x, f) # slow by default
CPU times: user 2 s, sys: 274 ms, total: 2.27 s
Wall time: 2.74 s

In [6]: f = open('foo.pkl', 'w')

In [7]: %time pickle.dump(x, f, protocol=2) # actually very fast
CPU times: user 18.8 ms, sys: 36 ms, total: 54.8 ms
Wall time: 55.6 ms

In [8]: import hickle

In [9]: f = open('foo.hkl', 'w')

In [10]: %time hickle.dump(x, f) # a bit faster
dumping <type 'numpy.ndarray'> to file <HDF5 file "foo.hkl" (mode r+)>
CPU times: user 764 µs, sys: 35.6 ms, total: 36.4 ms
Wall time: 36.2 ms
```

So if you do continue to use pickle, add the `protocol=2` keyword (thanks @mrocklin for pointing this out).
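The protocol effect is easy to reproduce with a stdlib-only sketch (the payload and timings here are illustrative, not the benchmark from the transcript above):

```python
import pickle
import time

# A moderately large pure-python payload
data = {"values": list(range(200000))}

for proto in (0, 2, pickle.HIGHEST_PROTOCOL):
    t0 = time.perf_counter()
    blob = pickle.dumps(data, protocol=proto)
    elapsed_ms = (time.perf_counter() - t0) * 1e3
    print("protocol=%d: %d bytes in %.1f ms" % (proto, len(blob), elapsed_ms))
```

On most machines the binary protocols (2 and up) are both faster and more compact than the old ASCII protocol 0.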
For storing python dictionaries of lists, hickle beats the python json encoder, but is slower than uJson. For a dictionary with 64 entries, each containing a 4096-length list of random numbers, the times are:

* json took 2633.263 ms
* uJson took 138.482 ms
* hickle took 232.181 ms

It should be noted that these comparisons are of course not fair: storing in HDF5 will not help you convert something into JSON, nor will it help you serialize a string. But for quick storage of the contents of a python variable, it's a pretty good option.
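The stdlib `json` side of that workload can be reproduced as follows (absolute times will differ by machine; uJson and hickle are omitted here since they may not be installed):

```python
import json
import random
import time

# 64 entries, each a 4096-length list of random numbers
data = {str(i): [random.random() for _ in range(4096)] for i in range(64)}

t0 = time.perf_counter()
encoded = json.dumps(data)
print("json took %.3f ms" % ((time.perf_counter() - t0) * 1e3))

# json round-trips this structure losslessly
assert json.loads(encoded) == data
```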
Installation guidelines (for Linux and Mac OS)
----------------------------------------------

### Easy method

Install with `pip` by running `pip install hickle` from the command line.

### Manual install

1. You should have Python 2.7 or above installed.
2. Install h5py (official page: http://docs.h5py.org/en/latest/build.html).
3. Install hdf5 (official page: http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/release_docs/INSTALL).
4. Download `hickle`:
   * via terminal: `git clone https://github.com/telegraphic/hickle.git`
   * via manual download: go to https://github.com/telegraphic/hickle and on the right-hand side you will find the `Download ZIP` option.
5. `cd` to your downloaded `hickle` directory.
6. Then run the following command in the `hickle` directory: `python setup.py install`
Usage example
-------------

Hickle is nice and easy to use, and should look very familiar to those of you who have pickled before:

```python
import os
import hickle as hkl
import numpy as np

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Dump data, with compression
hkl.dump(array_obj, 'test_gzip.hkl', mode='w', compression='gzip')

# Compare filesizes
print('uncompressed: %i bytes' % os.path.getsize('test.hkl'))
print('compressed:   %i bytes' % os.path.getsize('test_gzip.hkl'))

# Load data
array_hkl = hkl.load('test_gzip.hkl')

# Check the loaded data matches the original
assert array_hkl.dtype == array_obj.dtype
assert np.all(array_hkl == array_obj)
```
In short, `hickle` provides two methods: a `hickle.load` method, for loading hickle files, and a `hickle.dump` method, for dumping data into HDF5.
#### Dumping to file

```
Signature: hkl.dump(py_obj, file_obj, mode='w', track_times=True, path='/', **kwargs)
Docstring:
Write a pickled representation of obj to the open file object file.

Args:
    py_obj (object): python object to store in a hickle file.
    file_obj: file object, filename string, or h5py.File object in which to
        store the object. An h5py.File or a filename is also acceptable.
    mode (str): optional argument, 'r' (read only), 'w' (write) or 'a' (append).
        Ignored if file is a file object.
    compression (str): optional argument. Applies compression to dataset.
        Options: None, gzip, lzf (+ szip, if installed).
    track_times (bool): optional argument. If set to False, repeated hickling
        will produce identical files.
    path (str): path within hdf5 file to save data to. Defaults to root /.
```
#### Loading from file

```
Signature: hkl.load(fileobj, path='/', safe=True)
Docstring:
Load a hickle file and reconstruct a python object.

Args:
    fileobj: file object, h5py.File, or filename string.
    safe (bool): Disable automatic depickling of arbitrary python objects.
        DO NOT set this to False unless the file is from a trusted source.
        (see http://www.cs.jhu.edu/~s/musings/pickle.html for an explanation)
    path (str): path within hdf5 file to load data from. Defaults to root /.
```
#### HDF5 compression options

A major benefit of `hickle` over `pickle` is that it allows fancy HDF5 features to be applied, by passing keyword arguments on to `h5py`. So, you can do things like:

```python
hkl.dump(array_obj, 'test_lzf.hkl', mode='w', compression='lzf', scaleoffset=0,
         chunks=(100, 100), shuffle=True, fletcher32=True)
```

A detailed explanation of these keywords is given at http://docs.h5py.org/en/latest/high/dataset.html, but we give a quick rundown below.

In HDF5, chunked datasets are indexed with B-trees, a tree data structure that has speed benefits over contiguous blocks of data. Data are split into [chunks](http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage), which is leveraged to allow [dataset resizing](http://docs.h5py.org/en/latest/high/dataset.html#resizable-datasets) and compression via [filter pipelines](http://docs.h5py.org/en/latest/high/dataset.html#filter-pipeline). Filters such as `shuffle` and `scaleoffset` move your data around to improve compression ratios, and `fletcher32` computes a checksum. These file-level options are abstracted away from the data model.
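Since these keywords are forwarded straight to `h5py`, you can explore the same options with `h5py` directly. A sketch assuming `h5py` and `numpy` are installed (the filename `direct.h5` is arbitrary; `scaleoffset` is omitted here to keep the round trip lossless):

```python
import numpy as np
import h5py

data = np.arange(10000, dtype='int32').reshape(100, 100)

with h5py.File('direct.h5', 'w') as f:
    # Chunked storage with the LZF filter, byte shuffling, and checksums
    f.create_dataset('data', data=data, compression='lzf',
                     chunks=(50, 50), shuffle=True, fletcher32=True)

with h5py.File('direct.h5', 'r') as f:
    # These filters are transparent on read: the data comes back unchanged
    assert np.array_equal(f['data'][:], data)
```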
## Bugs & contributing

Contributions and bugfixes are very welcome. Please check out our [contribution guidelines](https://github.com/telegraphic/hickle/blob/master/CONTRIBUTING.md) for more details on how to contribute to development.

## Referencing hickle

If you use `hickle` in academic research, we would be grateful if you could reference our paper ([Markdown](https://github.com/telegraphic/hickle/blob/master/paper.md) | PDF), which is currently under review in the [Journal of Open-Source Software (JOSS)](http://joss.theoj.org/about).