
README.md


fdb-spider

Configure fdb-spider in a YAML file. Spider multi-page databases of links. Filter the content and serialize it to JSON.

Filter either with XPath expressions or with the help of artificial neural networks.
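
To illustrate the XPath route, here is a minimal sketch of filtering entries out of a page and serializing them to JSON. It is not fdb-spider's own code: the sample page, the `extract_entries` helper and the `FIELDS` mapping are hypothetical names made up for this example, and it relies on the limited XPath subset of Python's standard library `xml.etree.ElementTree`, which only handles well-formed markup.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real pages would be downloaded
# and usually need a more lenient HTML parser than ElementTree.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="entry">
      <a href="/item/1">First item</a>
      <p>Deadline: 2024-01-31</p>
    </div>
    <div class="entry">
      <a href="/item/2">Second item</a>
      <p>Deadline: 2024-02-15</p>
    </div>
    <div class="footer"><a href="/imprint">Imprint</a></div>
  </body>
</html>
"""

# Hypothetical field mapping: output key -> (XPath relative to the entry,
# attribute name to read, or None to read the element text).
FIELDS = {
    "title": ("./a", None),
    "link": ("./a", "href"),
    "info": ("./p", None),
}


def extract_entries(page, entry_xpath, fields):
    """Return one dict per element matched by entry_xpath."""
    root = ET.fromstring(page.strip())
    entries = []
    for node in root.findall(entry_xpath):
        entry = {}
        for name, (xpath, attribute) in fields.items():
            match = node.find(xpath)
            if match is None:
                # Field not present in this entry.
                entry[name] = None
            elif attribute is None:
                entry[name] = (match.text or "").strip()
            else:
                entry[name] = match.get(attribute)
        entries.append(entry)
    return entries


# Select only the entry containers, skipping the footer link.
entries = extract_entries(SAMPLE_PAGE, ".//div[@class='entry']", FIELDS)

# Serialize the filtered content to JSON.
serialized = json.dumps(entries, indent=2)
print(serialized)
```

Keeping the XPath expressions in a plain mapping, separate from the extraction logic, mirrors the idea of driving the spider from a configuration file: only the mapping changes from one database to the next.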