|
|
|
|
|
|_| |
|
|
|
|
|
|
|
``` |
|
|
|
Configure fdb-spider in a yaml file. |
|
|
|
Spider multi-page databases of links. |
|
|
|
Filter and serialize content to json. |
|
|
|
|
|
|
|
Filter either by xpath syntax |
|
|
|
or with the help of artificial neural networks (work in progress). |
|
|
|
1. [Introduction](#introduction) |
|
|
|
2. [Installation](#installation) |
|
|
|
3. [Usage](#usage) |
|
|
|
* [Configuration File Syntax](#configuration-file-syntax) |
|
|
|
* [Efficient Xpath Copying](#efficient-xpath-copying) |
|
|
|
|
|
|
|
To run this, create a python3 virtualenv and pip install -r requirements.txt, |
|
|
|
as described under [Installation](#installation). |
|
|
|
# Introduction |
|
|
|
|
|
|
|
The fdb-spider was made to gather data from websites in an automated way. |
|
|
|
The website to be spidered has to be a list of links, |
|
|
|
which makes the fdb-spider a web spider for most platforms. |
|
|
|
The fdb-spider is to be configured in a .yaml file to make things easy. |
|
|
|
The output of the fdb-spider is in json format to make it easy to input |
|
|
|
the json to other programs. |
|
|
|
|
|
|
|
At its core, the spider outputs tag-search-based entries. |
|
|
|
|
|
|
|
It works together with the fdb-spider-interface. |
|
|
|
|
|
|
|
In the future, the spider will be extended by the model Sauerkraut, |
|
|
|
an open source artificial neural network. |
|
|
|
|
|
|
|
# Installation |
|
|
|
Create a python3 virtualenv on your favourite UNIX distribution |
|
|
|
with the commands |
|
|
|
|
|
|
|
``` |
|
|
|
git clone https://code.basabuuka.org/alpcentaur/fdb-spider |
|
|
|
cd fdb-spider |
|
|
|
virtualenv venv |
|
|
|
source venv/bin/activate |
|
|
|
pip install -r requirements.txt |
|
|
|
``` |
|
|
|
|
|
|
|
Then install the system-wide requirements with your package manager: |
|
|
|
``` |
|
|
|
# apt based unixoids |
|
|
|
apt install xvfb |
|
|
|
apt install chromium |
|
|
|
apt install chromium-driver  # Debian name; on Ubuntu the package is chromium-chromedriver |
|
|
|
|
|
|
|
# pacman based unixoids |
|
|
|
pacman -S xorg-server-xvfb |
|
|
|
pacman -S chromium |
|
|
|
|
|
|
|
``` |
|
|
|
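After installing the system packages, a quick check that the executables are reachable can save a failed first run. The following is only a sketch: the executable names in `REQUIRED_BINARIES` are assumptions (distributions name them differently, for example `chromium` versus `chromium-browser`), and `find_missing` is a hypothetical helper, not part of the fdb-spider.

```python
import shutil

# Assumed executable names; adjust them to what your distribution installs.
REQUIRED_BINARIES = ["Xvfb", "chromium", "chromedriver"]


def find_missing(binaries):
    """Return the names from `binaries` that cannot be found on PATH."""
    return [name for name in binaries if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_BINARIES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All system requirements found.")
```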
|
|
|
|
# Usage |
|
|
|
|
|
|
|
## Configuration File Syntax |
|
|
|
|
|
|
|
A working syntax template for the configuration file is located at |
|
|
|
``` |
|
|
|
/spiders/config.yaml |
|
|
|
``` |
|
|
|
|
|
|
|
Here you can configure new websites to spider, referred to as "databases". |
|
|
|
|
|
|
|
link1 and link2 are the links to be iterated. |
|
|
|
The assumption is that every list of links has a loopable structure. |
|
|
|
If the links are javascript links, specify js[domain,link[1,2],iteration-var-list], |
|
|
|
which reads as jsdomain, jslink1, jslink2 and jsiteration-var-list. |
|
|
|
Otherwise leave them out, but specify jsdomain as 'None'. |
|
|
|
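To make the loopable structure more concrete, here is a minimal sketch of how the list page URLs could be put together. It assumes that each URL is built as link1, then one iteration variable, then link2. The dict stands in for one "database" from the yaml file, and the URL, the values and the helper `build_list_urls` are placeholders for illustration, not taken from the real template.

```python
# Hypothetical excerpt of one "database", written as a Python dict.
entry_list = {
    "link1": "https://example.org/list?page=",
    "link2": "#results",
    "iteration-var-list": [1, 2, 3],
    "jsdomain": "None",  # no javascript links in this example
}


def build_list_urls(config):
    """Assumed scheme: link1 + iteration variable + link2."""
    return [
        f"{config['link1']}{var}{config['link2']}"
        for var in config["iteration-var-list"]
    ]


for url in build_list_urls(entry_list):
    print(url)
```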
|
|
|
|
In the template you will find the parent and children of the entry list pages. |
|
|
|
Here you have to fill in the xpath of the entries to be parsed. |
|
|
|
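The parent and child idea can be sketched as follows: the parent xpath selects one element per entry, and each child xpath is evaluated relative to that element. The sketch uses only the limited XPath subset of Python's standard library on a small, well-formed page, which is an assumption made for the sake of a self-contained example. The constants `PARENT`, `CHILD_NAME` and `CHILD_PERIOD` are hypothetical values, not the ones from the template.

```python
import xml.etree.ElementTree as ET

# A small, well-formed stand-in for an entry list page.
PAGE = """
<html><body><table>
  <tr><td><a href="/entry/1">First programme</a></td><td>2024</td></tr>
  <tr><td><a href="/entry/2">Second programme</a></td><td>2025</td></tr>
</table></body></html>
"""

# Hypothetical xpath values; in the spider they come from the config file.
PARENT = ".//table//tr"   # one match per entry
CHILD_NAME = "td/a"       # evaluated relative to each parent
CHILD_PERIOD = "td[2]"    # second cell of the row


def parse_entries(page):
    """Collect name, link and period for every parent element."""
    root = ET.fromstring(page)
    entries = []
    for parent in root.findall(PARENT):
        anchor = parent.find(CHILD_NAME)
        entries.append({
            "name": anchor.text,
            "link": anchor.get("href"),
            "period": parent.find(CHILD_PERIOD).text,
        })
    return entries


for entry in parse_entries(PAGE):
    print(entry)
```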
|
|
|
|
In the entry directive, you have to set uniform to either TRUE or FALSE. |
|
|
|
Set it to TRUE if all the entry pages have the same template and you |
|
|
|
are able to specify xpath again to get the text or whatever variable you |
|
|
|
would like to extract. |
|
|
|
In the entry_unitrue directive, you can specify new dimensions and |
|
|
|
the json will adapt to your wishes. |
|
|
|
Under the entry-list directive this feature has yet to be implemented, |
|
|
|
so use name, link, javascript-link, info, period and sponsor by commenting |
|
|
|
them in or out. |
|
|
|
If javascript-link is set (which means the entry is javascript clickable), |
|
|
|
link will be ignored. |
|
|
|
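For the uniform case, the relation between configured dimensions and the resulting json can be pictured like this. It is a sketch under assumptions: the dimension names in `DIMENSIONS`, the sample page and the helper `extract_dimensions` are made up for illustration, and the standard library's limited XPath subset replaces whatever the spider uses internally.

```python
import json
import xml.etree.ElementTree as ET

# Stand-in for one entry page; all entry pages are assumed to share this template.
ENTRY_PAGE = """
<html><body>
  <h1>First programme</h1>
  <div class="content"><p>Funding for small projects.</p></div>
  <div class="deadline">2024-12-31</div>
</body></html>
"""

# Hypothetical dimensions, each mapped to an xpath expression.
DIMENSIONS = {
    "title": ".//h1",
    "text": ".//div[@class='content']/p",
    "deadline": ".//div[@class='deadline']",
}


def extract_dimensions(page, dimensions):
    """Return one json-ready dict with a key per configured dimension."""
    root = ET.fromstring(page)
    result = {}
    for name, xpath in dimensions.items():
        element = root.find(xpath)
        # A dimension whose xpath matches nothing is kept as None.
        result[name] = element.text if element is not None else None
    return result


print(json.dumps(extract_dimensions(ENTRY_PAGE, DIMENSIONS), indent=2))
```

Adding a key to `DIMENSIONS` adds a key to the output, which is the sense in which the json adapts to the configuration.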
|
|
|
|
Set it to FALSE if you have diverse pages behind the entries |
|
|
|
and want to get the main text of all the different links in a generic way. |
|
|
|
For your information, the library trafilatura is used to gather the |
|
|
|
text generally for further processing. |
|
|
|
|
|
|
|
## Efficient Xpath Copying |
|
|
|
|
|
|
|
When copying the XPath, most modern web browsers are of help. |
|
|
|
In Firefox (or browsers built on it, like the Tor Browser) you can use |
|
|
|
``` |
|
|
|
ctrl-shift-c |
|
|
|
``` |
|
|
|
to open the "Inspector" in "Pick an element" mode. |
|
|
|
When you click on the desired entry on the page, |
|
|
|
it opens the actual code of the clicked element in the html search box. |
|
|
|
Now right-click on the code in the html search box, go to "Copy", |
|
|
|
and choose "XPath". |
|
|
|
Now you have the XPath of the element in your clipboard. |
|
|
|
When pasting it into the config, try to replace some slashes with double |
|
|
|
slashes. That will make the spider more stable in case the website's |
|
|
|
html/xml gets changed for maintenance or other reasons. |
|
|
|
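The effect of the double slashes can be seen in a small experiment. The two pages below are invented: the second one is the first after a redesign wrapped the table in an extra div. Because the standard library evaluates paths relative to the root element, the strict path is written as `./body/...` here instead of the `/html/body/...` form a browser copies; that difference is an artifact of this sketch.

```python
import xml.etree.ElementTree as ET

ORIGINAL = (
    "<html><body><table><tr><td><a>Entry</a></td></tr></table></body></html>"
)
# The same page after a redesign wrapped the table in an extra <div>.
CHANGED = (
    "<html><body><div><table><tr><td><a>Entry</a></td></tr></table></div>"
    "</body></html>"
)

STRICT = "./body/table/tr/td/a"  # every step spelled out, as copied
LOOSE = ".//table//td/a"         # some slashes replaced by double slashes


def count_matches(page, xpath):
    """Number of elements the xpath selects on the given page."""
    return len(ET.fromstring(page).findall(xpath))


print(count_matches(ORIGINAL, STRICT), count_matches(CHANGED, STRICT))  # 1 0
print(count_matches(ORIGINAL, LOOSE), count_matches(CHANGED, LOOSE))    # 1 1
```

The looser path survives the change, but it can also match more elements than intended, so it is worth keeping the distinctive steps (such as the table or a class attribute) in place.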
|
|
|
|
|