added further documentation to README.md
This commit is contained in:
parent
0500f5853d
commit
d7d157bf42
1 changed file with 93 additions and 7 deletions
README.md
@@ -9,19 +9,105 @@
Configure fdb-spider in a yaml file.
Spider multi-page databases of links.
Filter and serialize content to json.

Filter either by xpath syntax,
or with the help of Artificial Neural Networks (work in progress).

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
    * [Configuration File Syntax](#configuration-file-syntax)
    * [Efficient Xpath Copying](#efficient-xpath-copying)

# Introduction

The fdb-spider was made to gather data from websites in an automated way.
The website to be spidered has to be a list of links,
which makes the fdb-spider a web spider for most platforms.
The fdb-spider is configured in a .yaml file to make things easy.
The output of the fdb-spider is in json format, so the results can easily
be fed into other programs.

At its core, the spider outputs tag-search-based entries.

It works together with the fdb-spider-interface.

In the future, the spider will be extended by the model Sauerkraut,
an *open source* Artificial Neural Network.

# Installation

Create a python3 virtualenv on your favourite UNIX distribution,
then clone the repository and install the Python requirements:
```
# clone the repository and enter it
git clone https://code.basabuuka.org/alpcentaur/fdb-spider
cd fdb-spider

# create and activate a virtualenv, then install the python requirements
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

Then install the system-wide requirements with your package manager:
```
# apt based unixoids
apt install xvfb
apt install chromium
apt install chromium-webdriver

# pacman based unixoids
pacman -S xorg-server-xvfb
pacman -S chromium
```

# Usage

## Configuration File Syntax

The configuration file with a working syntax template is
```
/spiders/config.yaml
```

Here you can configure new websites to spider, referred to as "databases".

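As an assumption for the sketches in this section, each database is shown as a
top-level block in that file; the block and key names used here are made up for
illustration, and the shipped template in /spiders/config.yaml is authoritative.

```
# hypothetical outline: one top-level block per website ("database") to spider
example-database-1:
  # directives described below go here
example-database-2:
  # ...
```
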
link1 and link2 are the links to be iterated.
The assumption is that every list of links will have a loopable structure.
If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
Otherwise leave them out, but specify jsdomain as 'None'.

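A rough sketch of such a link iteration for a non-javascript list might look as
follows; the URL and the exact key names are assumptions, not the real template.

```
# hypothetical sketch only, not the shipped template (see /spiders/config.yaml)
example-database-1:
  link1: 'https://www.example.org/calls?page='   # made-up URL part before the iteration variable
  link2: '&order=desc'                           # URL part appended after the iteration variable
  jsdomain: 'None'                               # 'None' because these are not javascript links
  iteration-var-list: '[1,2,3,4,5]'              # values inserted between link1 and link2
```
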
You will also find parent and children entries for the entry list pages.
Here you have to fill in the xpath of the entries to be parsed.

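Again purely as an illustration, with assumed key names and made-up xpaths:

```
# hypothetical sketch: xpaths locating the entries on the list pages
example-database-1:
  parent: "//div[@class='entry-list']/ul"   # assumed container holding the list of entries
  children: "//li/a"                        # assumed path to each entry link within the parent
```
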
In the entry directive, you have to set uniform to either TRUE or FALSE.
Set it to TRUE if all the entry pages have the same template, and you
are able to specify xpath again to get the text or whatever variable you
like to extract.
In the entry_unitrue directive, you can specify new dimensions and
the json will adapt to your wishes.
Under the entry-list directive this feature still has to be implemented,
so use name, link, javascript-link, info, period and sponsor by commenting
them in or out.
If javascript-link is set (which means the entry is javascript clickable),
link will be ignored.

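The shape of these directives might look roughly like the sketch below; the
nesting, key names and xpaths are assumptions made for illustration only, so
check /spiders/config.yaml for the real structure.

```
# hypothetical sketch of the entry related directives
example-database-1:
  entry:
    uniform: TRUE                        # all entry pages share the same template
  entry_unitrue:
    title: "//h1/text()"                 # made-up extra dimension, added to the json output
    deadline: "//span[@class='deadline']/text()"
  entry-list:
    name: "//h2/a/text()"                # comment these fixed fields in or out as needed
    link: "//h2/a/@href"
    #javascript-link: "//h2/a"           # if set, link is ignored
    #info: ...
    #period: ...
    #sponsor: ...
```
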
Set uniform to FALSE if you have diverse pages behind the entries
and want to generally get the main text of all the different links.
For your information, the library trafilatura is used to gather
the text for further processing.

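In that case the sketch above would reduce to something like the following
(again hypothetical key names):

```
# hypothetical counterpart: diverse entry pages, main text gathered via trafilatura
example-database-1:
  entry:
    uniform: FALSE
```
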
## Efficient Xpath Copying

When copying the Xpath, most modern web browsers are of help.
In Firefox (or browsers built on it like the TOR Browser) you can use
```
ctrl-shift-c
```
to open the "Inspector" in "Pick an element" mode.
When you click on the desired entry on the page,
it opens the actual code of the clicked element in the html search box.
Now right-click the code in the html search box, go to "Copy",
and then to "XPath".
Now you have the XPath of the element in your clipboard.
When pasting it into the config, try to replace some slashes with double
slashes, as in the example below. That will make the spider more stable
in case the website's html/xml gets changed for maintenance or other reasons.

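For instance, with a made-up xpath and the assumed children key from the
sketches above: a raw copied path is tied to the whole page layout, while a
double-slash version only relies on the local structure.

```
# hypothetical example of making a copied xpath more robust
children: "/html/body/div[3]/div[2]/ul/li/a"   # raw copy, breaks when the page layout shifts
#children: "//ul/li/a"                         # more stable, anchored only on the list itself
```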