fdb-spider

alpcentaur df4a8289b8 added pdf parser if entry link is direct pdf		2023-11-22 17:03:15 +00:00
spiders	added pdf parser if entry link is direct pdf	2023-11-22 17:03:15 +00:00
.gitignore	first function works, actuall xml parser has still problems with certain xml types	2023-11-06 19:19:31 +00:00
main.py	added pdf parser if entry link is direct pdf	2023-11-22 17:03:15 +00:00
README.md	update README.md	2023-11-20 16:38:18 +01:00
requirements.txt	added trafilatura to requirements	2023-11-22 00:07:59 +00:00

README.md

  __     _ _                     _     _
 / _| __| | |__        ___ _ __ (_) __| | ___ _ __
| |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ | '__|
|  _| (_| | |_) |_____\__ | |_) | | (_| |  __| |
|_|  \__,_|_.__/      |___| .__/|_|\__,_|\___|_|
                          |_|

Configure fdb-spider in a YAML file. Spider multi-page databases of links. Filter and serialize the content to JSON.

Filter either with XPath expressions or with the help of artificial neural networks (work in progress).
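
To illustrate the XPath-filter-then-serialize idea, here is a minimal sketch. It is not fdb-spider's actual code: the `CONFIG` keys, the `extract_entries` helper and the sample page are hypothetical, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` on well-formed markup, whereas the project's real parser and YAML schema may look quite different.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical settings; in fdb-spider these would come from the YAML file,
# whose real schema is not shown in this README.
CONFIG = {
    "entry_xpath": ".//li[@class='entry']",  # selects one node per database entry
    "link_xpath": "./a",                     # link element inside each entry
}

# Made-up, well-formed sample page standing in for one page of a link database.
SAMPLE_PAGE = """
<html><body><ul>
  <li class="entry"><a href="https://example.org/a.pdf">Call A</a></li>
  <li class="entry"><a href="https://example.org/b.html">Call B</a></li>
  <li class="nav"><a href="/page/2">next</a></li>
</ul></body></html>
"""


def extract_entries(page, config):
    """Return a list of {title, link} dicts for nodes matching the entry XPath."""
    root = ET.fromstring(page.strip())
    entries = []
    for node in root.findall(config["entry_xpath"]):
        link = node.find(config["link_xpath"])
        if link is None:
            continue  # skip entries without a link element
        entries.append({
            "title": (link.text or "").strip(),
            "link": link.get("href"),
        })
    return entries


entries = extract_entries(SAMPLE_PAGE, CONFIG)
serialized = json.dumps(entries, indent=2)  # serialize the filtered content to JSON
print(serialized)
```

The navigation item is dropped because it does not match the entry XPath, so only the two entry links end up in the JSON output.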