alpcentaur 9ceaa28a82 Merge remote-tracking branch 'refs/remotes/origin/master' 1 year ago
spiders added trafilatura exception 1 year ago
.gitignore first function works, actuall xml parser has still problems with certain xml types 1 year ago
README.md update README.md 1 year ago
main.py added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p 1 year ago
requirements.txt specifying the links, new exception clause if soupparser does not work 1 year ago

README.md

  __     _ _                     _     _
 / _| __| | |__        ___ _ __ (_) __| | ___ _ __
| |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ | '__|
|  _| (_| | |_) |_____\__ | |_) | | (_| |  __| |
|_|  \__,_|_.__/      |___| .__/|_|\__,_|\___|_|
                          |_|

Configure fdb-spider in a YAML file. Spider multi-page databases of links. Filter and serialize the content to JSON.

Filter either with XPath expressions or with the help of artificial neural networks (work in progress).
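To illustrate the XPath-based filtering and JSON serialization described above, here is a minimal sketch. It is not fdb-spider's own code: the sample page, the function name `extract_entries`, and the output field names are all hypothetical. It uses only Python's standard library, whose `xml.etree.ElementTree` supports just a subset of XPath and needs well-formed markup, so a real spider would more likely rely on a tolerant HTML parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; a real spider would download it.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="entry">
      <a href="/item/1">First item</a>
      <p>Description of the first item.</p>
    </div>
    <div class="entry">
      <a href="/item/2">Second item</a>
      <p>Description of the second item.</p>
    </div>
    <div class="footer">
      <a href="/imprint">Imprint</a>
    </div>
  </body>
</html>
"""


def extract_entries(page, entry_xpath, link_xpath, text_xpath):
    """Select entry nodes by XPath and collect name, link and text for each."""
    root = ET.fromstring(page)
    entries = []
    for node in root.findall(entry_xpath):
        # The child expressions are evaluated relative to each entry node.
        link = node.find(link_xpath)
        text = node.find(text_xpath)
        entries.append({
            "name": link.text.strip() if link is not None and link.text else None,
            "link": link.get("href") if link is not None else None,
            "text": text.text.strip() if text is not None and text.text else None,
        })
    return entries


# Only <div class="entry"> blocks match; the footer link is filtered out.
entries = extract_entries(SAMPLE_PAGE, ".//div[@class='entry']", "./a", "./p")

# Serialize the filtered content to JSON.
serialized = json.dumps(entries, indent=2)
print(serialized)
```

Keeping the XPath expressions as plain string parameters means they could be read from a configuration file rather than hard-coded, which matches the idea of configuring the spider in YAML.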