Metadata-Version: 2.1
Name: parsel
Version: 1.5.1
Summary: Parsel is a library to extract data from HTML and XML using XPath and CSS selectors
Home-page: https://github.com/scrapy/parsel
Author: Scrapy project
Author-email: info@scrapy.org
License: BSD
Keywords: parsel
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Dist: w3lib (>=1.19.0)
Requires-Dist: lxml (>=2.3)
Requires-Dist: six (>=1.5.2)
Requires-Dist: cssselect (>=0.9)
Requires-Dist: functools32; python_version<'3.0'

===============================
Parsel
===============================

.. image:: https://img.shields.io/travis/scrapy/parsel/master.svg
   :target: https://travis-ci.org/scrapy/parsel
   :alt: Build Status

.. image:: https://img.shields.io/pypi/v/parsel.svg
   :target: https://pypi.python.org/pypi/parsel
   :alt: PyPI Version

.. image:: https://img.shields.io/codecov/c/github/scrapy/parsel/master.svg
   :target: http://codecov.io/github/scrapy/parsel?branch=master
   :alt: Coverage report

Parsel is a library to extract data from HTML and XML using XPath and CSS selectors

* Free software: BSD license
* Documentation: https://parsel.readthedocs.org.

Features
--------

* Extract text using CSS or XPath selectors
* Regular expression helper methods

Example::

    >>> from parsel import Selector
    >>> sel = Selector(text=u"""<html>
    ...     <body>
    ...         <h1>Hello, Parsel!</h1>
    ...         <ul>
    ...             <li><a href="http://example.com">Link 1</a></li>
    ...             <li><a href="http://scrapy.org">Link 2</a></li>
    ...         </ul>
    ...     </body>
    ...     </html>""")
    >>>
    >>> sel.css('h1::text').get()
    'Hello, Parsel!'
    >>>
    >>> sel.css('h1::text').re(r'\w+')
    ['Hello', 'Parsel']
    >>>
    >>> for e in sel.css('ul > li'):
    ...     print(e.xpath('.//a/@href').get())
    http://example.com
    http://scrapy.org

History
-------

1.5.1 (2018-10-25)
~~~~~~~~~~~~~~~~~~

* ``has-class`` XPath function handles newlines and other separators
  in class names properly;
* fixed parsing of HTML documents with null bytes;
* documentation improvements;
* Python 3.7 tests are run on CI; other test improvements.

1.5.0 (2018-07-04)
~~~~~~~~~~~~~~~~~~

* New ``Selector.attrib`` and ``SelectorList.attrib`` properties which make
  it easier to get attributes of HTML elements.
* CSS selectors became faster: compilation results are cached
  (LRU cache is used for ``css2xpath``), so there is
  less overhead when the same CSS expression is used several times.
* ``.get()`` and ``.getall()`` selector methods are documented and recommended
  over ``.extract_first()`` and ``.extract()``.
* Various documentation tweaks and improvements.

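As a rough illustration of the ``attrib`` properties and the ``.get()`` /
``.getall()`` methods listed above, a minimal sketch could look like the
following. The markup and variable names are made up for this example::

    from parsel import Selector

    # Hypothetical sample markup, used only for illustration.
    sel = Selector(text=u'<ul>'
                        u'<li><a href="http://example.com">Link 1</a></li>'
                        u'<li><a href="http://scrapy.org">Link 2</a></li>'
                        u'</ul>')

    links = sel.css('a')

    # SelectorList.attrib exposes the attributes of the first matched element.
    first_href = links.attrib['href']

    # Selector.attrib works per element, so it can be used inside a loop.
    all_hrefs = [a.attrib['href'] for a in links]

    # .get() returns the first match (or a default), .getall() returns all of them.
    first_text = sel.css('a::text').get()
    all_texts = sel.css('a::text').getall()
    missing = sel.css('h1::text').get(default='no heading')

Here ``first_href`` should be the ``href`` of the first link only, while
``all_hrefs`` collects one value per matched element.
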
One more change is that the ``.extract()`` and ``.extract_first()`` methods
are now implemented using ``.getall()`` and ``.get()``, not the other
way around, and instead of calling ``Selector.extract`` all other methods
now call ``Selector.get`` internally. This can be **backwards incompatible**
for custom Selector subclasses which override ``Selector.extract``
without doing the same for ``Selector.get``. If you have such a Selector
subclass, make sure the ``get`` method is also overridden. For example, this::

    class MySelector(parsel.Selector):
        def extract(self):
            return super().extract() + " foo"

should be changed to this::

    class MySelector(parsel.Selector):
        def get(self):
            return super().get() + " foo"
        extract = get

1.4.0 (2018-02-08)
~~~~~~~~~~~~~~~~~~

* ``Selector`` and ``SelectorList`` can't be pickled because
  pickling/unpickling doesn't work for ``lxml.html.HtmlElement``;
  parsel now raises TypeError explicitly instead of allowing pickle to
  silently produce wrong output. This is technically backwards-incompatible
  if you're using Python < 3.6.

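The pickling behaviour described above can be observed with a short sketch
such as this one. Storing the extracted markup and re-parsing it later is
only one possible workaround, shown here as an assumption rather than an
official recommendation::

    import pickle

    from parsel import Selector

    sel = Selector(text=u'<p>Hello</p>')

    # Pickling a Selector is expected to fail loudly with TypeError.
    try:
        pickle.dumps(sel)
        pickling_failed = False
    except TypeError:
        pickling_failed = True

    # Possible workaround: pickle the serialized markup (a plain string)
    # and build a new Selector from it after unpickling.
    markup = sel.css('p').get()
    restored = Selector(text=pickle.loads(pickle.dumps(markup)))
    restored_text = restored.css('p::text').get()
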
1.3.1 (2017-12-28)
~~~~~~~~~~~~~~~~~~

* Fix artifact uploads to pypi.

1.3.0 (2017-12-28)
~~~~~~~~~~~~~~~~~~

* ``has-class`` XPath extension function;
* ``parsel.xpathfuncs.set_xpathfunc`` is a simplified way to register
  XPath extensions;
* ``Selector.remove_namespaces`` now removes namespace declarations;
* Python 3.3 support is dropped;
* ``make htmlview`` command for easier Parsel docs development.
* CI: PyPy installation is fixed; parsel now runs tests for PyPy3 as well.

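To give an idea of how the ``has-class`` extension function from the 1.3.0
notes can be used, here is a minimal sketch. The sample markup is
hypothetical and only serves the illustration::

    from parsel import Selector

    # Hypothetical sample markup.
    sel = Selector(text=u'<p class="foo bar-baz">First</p>'
                        u'<p class="foo">Second</p>'
                        u'<p class="foobar">Third</p>')

    # Matches elements whose class attribute contains the class "foo";
    # "foobar" is a different class and is not matched.
    with_foo = sel.xpath('//p[has-class("foo")]/text()').getall()

    # With several arguments, all of the given classes must be present.
    with_both = sel.xpath('//p[has-class("foo", "bar-baz")]/text()').getall()

Compared with a plain ``contains(@class, "foo")`` test, this avoids matching
class names that merely share a prefix.
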
1.2.0 (2017-05-17)
~~~~~~~~~~~~~~~~~~

* Add ``SelectorList.get`` and ``SelectorList.getall``
  methods as aliases for ``SelectorList.extract_first``
  and ``SelectorList.extract`` respectively
* Add default value parameter to ``SelectorList.re_first`` method
* Add ``Selector.re_first`` method
* Add ``replace_entities`` argument on ``.re()`` and ``.re_first()``
  to turn off replacing of character entity references
* Bug fix: detect ``None`` result from lxml parsing and fall back to an empty document
* Rearrange XML/HTML examples in the selectors usage docs
* Travis CI:

  * Test against Python 3.6
  * Test against PyPy using "Portable PyPy for Linux" distribution

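The ``re_first`` additions from this release might be used roughly as in the
sketch below. The markup and the ``'n/a'`` default are assumptions chosen
for the example::

    from parsel import Selector

    # Hypothetical sample markup.
    sel = Selector(text=u'<ul><li>Price: 42 EUR</li><li>Price: unknown</li></ul>')
    items = sel.css('li::text')

    # SelectorList.re_first returns the first match across all selected nodes.
    first_price = items.re_first(r'\d+')

    # Selector.re_first works on a single node and returns None without a match.
    per_item = [item.re_first(r'\d+') for item in items]

    # The default value is returned when the pattern does not match at all.
    with_default = sel.xpath('//li[2]/text()').re_first(r'\d+', default='n/a')
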
1.1.0 (2016-11-22)
~~~~~~~~~~~~~~~~~~

* Change default HTML parser to `lxml.html.HTMLParser <http://lxml.de/api/lxml.html.HTMLParser-class.html>`_,
  which makes it easier to use some HTML-specific features
* Add ``css2xpath`` function to translate CSS to XPath
* Add support for ad-hoc namespace declarations
* Add support for XPath variables
* Documentation improvements and updates

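The ``css2xpath`` function, XPath variables and ad-hoc namespaces mentioned
above could be combined as in this small sketch. The documents, the ``n``
prefix and the ``http://example.com/ns`` namespace URI are invented for the
example::

    from parsel import Selector, css2xpath

    # Hypothetical sample markup.
    sel = Selector(text=u'<ul>'
                        u'<li><a href="http://example.com">Link 1</a></li>'
                        u'<li><a href="http://scrapy.org">Link 2</a></li>'
                        u'</ul>')

    # css2xpath translates a CSS expression into an XPath expression,
    # which can then be passed to .xpath() directly.
    xpath_query = css2xpath('ul > li a')
    via_xpath = sel.xpath(xpath_query).getall()
    via_css = sel.css('ul > li a').getall()

    # XPath variables are passed as keyword arguments and referenced as $name.
    link_text = sel.xpath('//a[@href=$url]/text()', url='http://scrapy.org').get()

    # Ad-hoc namespaces: the prefix is declared for this single call only.
    xml = Selector(
        text=u'<root xmlns:a="http://example.com/ns"><a:item>1</a:item></root>',
        type='xml',
    )
    item = xml.xpath('//n:item/text()',
                     namespaces={'n': 'http://example.com/ns'}).get()

Passing values as XPath variables avoids building query strings by hand,
which is convenient when the value contains quotes.
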
1.0.3 (2016-07-29)
~~~~~~~~~~~~~~~~~~

* Add BSD-3-Clause license file
* Re-enable PyPy tests
* Integrate py.test runs with setuptools (needed for Debian packaging)
* Changelog is now called ``NEWS``

1.0.2 (2016-04-26)
~~~~~~~~~~~~~~~~~~

* Fix bug in exception handling causing original traceback to be lost
* Added docstrings and other doc fixes

1.0.1 (2015-08-24)
~~~~~~~~~~~~~~~~~~

* Updated PyPI classifiers
* Added docstrings for csstranslator module and other doc fixes

1.0.0 (2015-08-22)
~~~~~~~~~~~~~~~~~~

* Documentation fixes

0.9.6 (2015-08-14)
~~~~~~~~~~~~~~~~~~

* Updated documentation
* Extended test coverage

0.9.5 (2015-08-11)
~~~~~~~~~~~~~~~~~~

* Support for extending SelectorList

0.9.4 (2015-08-10)
~~~~~~~~~~~~~~~~~~

* Try workaround for travis-ci/dpl#253

0.9.3 (2015-08-07)
~~~~~~~~~~~~~~~~~~

* Add base_url argument

0.9.2 (2015-08-07)
~~~~~~~~~~~~~~~~~~

* Rename module unified -> selector and promote root attribute
* Add create_root_node function

0.9.1 (2015-08-04)
~~~~~~~~~~~~~~~~~~

* Set up Sphinx build and docs structure
* Build universal wheels
* Rename some leftovers from package extraction

0.9.0 (2015-07-30)
~~~~~~~~~~~~~~~~~~

* First release on PyPI.