fdb-spider


alpcentaur 89dcca2031 added further handling for javascript links not being urls, made config for giz work		2023-11-28 15:27:39 +00:00
spiders	added further handling for javascript links not being urls, made config for giz work	2023-11-28 15:27:39 +00:00
.gitignore	first function works, actuall xml parser has still problems with certain xml types	2023-11-06 19:19:31 +00:00
main.py	added further database in config.yaml, added new exception for downloading js generated html pages	2023-11-27 15:10:11 +00:00
README.md	update README.md	2023-11-20 16:38:18 +01:00
requirements.txt	added trafilatura to requirements	2023-11-22 00:07:59 +00:00

README.md

  __     _ _                     _     _
 / _| __| | |__        ___ _ __ (_) __| | ___ _ __
| |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ | '__|
|  _| (_| | |_) |_____\__ | |_) | | (_| |  __| |
|_|  \__,_|_.__/      |___| .__/|_|\__,_|\___|_|
                          |_|

Configure fdb-spider in a YAML file. Spider multi-page databases of links. Filter the content and serialize it to JSON.

Filter either with XPath expressions or with the help of artificial neural networks (work in progress).
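
The XPath route can be illustrated with a minimal sketch. It is not fdb-spider's own code: the sample page, the function name `extract_entries`, the expression in `ENTRY_XPATH`, and the output keys `title` and `link` are all assumptions made for illustration. It uses only the Python standard library, whose `xml.etree.ElementTree` supports a limited XPath subset and needs well-formed markup; real-world HTML usually calls for a more tolerant parser.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of one result page from a link database.
SAMPLE_PAGE = """
<html>
  <body>
    <ul class="results">
      <li class="entry"><a href="/item/1">First entry</a></li>
      <li class="entry"><a href="/item/2">Second entry</a></li>
      <li class="ad"><a href="/sponsored">Sponsored</a></li>
    </ul>
  </body>
</html>
"""

# Assumed XPath expression: keep links inside "entry" items, skip everything else.
ENTRY_XPATH = ".//li[@class='entry']/a"


def extract_entries(page, entry_xpath):
    """Return one dict per link element matched by entry_xpath."""
    root = ET.fromstring(page.strip())
    entries = []
    for link in root.findall(entry_xpath):
        entries.append({
            "title": (link.text or "").strip(),
            "link": link.get("href"),
        })
    return entries


# Filter the page, then serialize the surviving entries to JSON.
entries = extract_entries(SAMPLE_PAGE, ENTRY_XPATH)
serialized = json.dumps(entries, indent=2)
print(serialized)
```

In a setup like this, the XPath expression would live in the YAML configuration next to the database URL, so that a new source can be added without touching the code.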