added further documentation to README.md

10 months ago · d7d157bf42
--- a/README.md
+++ b/README.md
@@ -9,19 +9,105 @@
                          |_|

 ```
 Configure fdb-spider in a yaml file.
 Spider multi-page databases of links.
 Filter and serialize content to JSON.

 Filter either by XPath syntax
 or with the help of artificial neural networks (work in progress).
 1. [Introduction](#introduction)
 2. [Installation](#installation)
 3. [Usage](#usage)
  *  [Configuration File Syntax](#configuration-file-syntax)
  *  [Efficient Xpath Copying](#efficient-xpath-copying)

 To run this, create a python3 virtualenv, pip install -r requirements,
 and 
 # Introduction

 The fdb-spider was made to gather data from websites in an automated way.
 The website to be spidered has to be a list of links,
 which makes the fdb-spider a web spider for most platforms.
 The fdb-spider is configured in a .yaml file to keep things easy.
 Its output is in JSON format, so that it can easily be fed
 into other programs.

 At its core, the spider outputs entries based on tag searches.

 It works together with the fdb-spider-interface.

 In the future, the spider will be extended by the model Sauerkraut,
 an open source artificial neural network.

 # Installation
 Create a python3 virtualenv on your favourite UNIX distribution
 with the commands

 ```
 git clone https://code.basabuuka.org/alpcentaur/fdb-spider
 cd fdb-spider
 virtualenv venv
 source venv/bin/activate 
 pip install -r requirements.txt
 ```

 Then install the system-wide requirements with your package manager:
 ```
 # apt based unixoids
 apt install xvfb
 apt install chromium
 apt install chromium-webdriver

 # pacman based unixoids
 pacman -S xorg-server-xvfb
 pacman -S chromium

 ```

 # Usage

 ## Configuration File Syntax

 The configuration file, which contains a working syntax template, is
 ```
 /spiders/config.yaml
 ```

 Here you can configure new websites to spider, referred to as "databases".

 link1 and link2 are the links to be iterated.
 The assumption is that every list of links has a loopable structure.
 If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
 Otherwise leave them out, but set jsdomain to 'None'.
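
The loopable structure described above can be sketched in Python. This is only an illustration under assumptions: the helper `build_list_urls` and the example URL are hypothetical and not taken from the fdb-spider code, and it assumes that each list page URL is formed as link1, followed by one iteration value, followed by link2.

```python
def build_list_urls(link1, link2, iteration_vars):
    """Join link1 + iteration value + link2 for every iteration value.

    Hypothetical helper: it only illustrates the assumed loopable structure.
    """
    return [f"{link1}{var}{link2}" for var in iteration_vars]


# Hypothetical paginated list: https://example.org/funding?page=<n>&lang=en
urls = build_list_urls("https://example.org/funding?page=", "&lang=en", range(1, 4))
for url in urls:
    print(url)
```

If a website fits this pattern, splitting the URL of one list page around its changing part gives candidate values for link1 and link2.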

 You will find parent and child fields for the entry list pages.
 Fill in the XPaths of the entries to be parsed there.

 In the entry directive, you have to set uniform to either TRUE or FALSE.
 Set it to TRUE if all the entry pages share the same template, so that you
 can again specify XPaths to get the text or whatever variable you
 want to extract.
 In the entry_unitrue directive, you can specify new dimensions, and
 the JSON output will adapt to them.
 Under the entry-list directive, this feature still has to be implemented,
 so use name, link, javascript-link, info, period and sponsor by commenting
 them in or out.
 If javascript-link is set (which means the link is clickable via javascript),
 link will be ignored.
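
The precedence of javascript-link over link can be pictured with a small Python sketch. The function `effective_link_field` and the dictionary layout are hypothetical and only restate the rule above; they do not show how fdb-spider reads its config.

```python
def effective_link_field(entry_fields):
    """Return the name of the field that is followed for an entry.

    Assumption: entry_fields maps configured field names to their XPaths,
    and a field that is commented out is simply missing from the mapping.
    """
    if entry_fields.get("javascript-link"):
        return "javascript-link"
    return "link"


# Both fields configured: javascript-link wins, link is ignored.
both = {"link": "//td[1]/a/@href", "javascript-link": "//td[1]/a"}
# javascript-link commented out: link is used.
plain = {"link": "//td[1]/a/@href"}

print(effective_link_field(both))
print(effective_link_field(plain))
```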

 Set it to FALSE if the pages behind the entries are diverse
 and you want to get the main text of all the different links in general.
 For your information, the library trafilatura is used to gather the
 text generically for further processing.

 ## Efficient Xpath Copying

 When copying the XPath, most modern web browsers are of help.
 In Firefox (or browsers built on it, like the Tor Browser) you can use
 ```
 ctrl-shift-c
 ```
 to open the "Inspector" in "Pick an element" mode.
 When you click on the desired entry on the page,
 it shows the actual code of the clicked element in the HTML pane.
 Now right-click on the code in the HTML pane, go to "Copy",
 and choose "XPath".
 Now you have the XPath of the element in your clipboard.
 When pasting it into the config, try to replace some slashes with double
 slashes. That will make the spider more stable in case the website's
 HTML/XML gets changed for maintenance or other reasons.
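
Why double slashes help can be shown with a minimal Python sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and is not necessarily what fdb-spider uses internally. The two pages and both expressions are made-up examples.

```python
import xml.etree.ElementTree as ET

OLD_PAGE = "<html><body><div><ul><li><a href='/a'>A</a></li></ul></div></body></html>"
# Same page after a redesign wrapped the list in an extra <section>.
NEW_PAGE = (
    "<html><body><div><section><ul><li><a href='/a'>A</a></li></ul>"
    "</section></div></body></html>"
)

# ElementTree paths are relative to the root element (<html>), hence the leading dot.
STRICT = "./body/div/ul/li/a"  # every step spelled out, as copied from a browser
RELAXED = ".//ul/li/a"         # double slash: match <ul> at any depth


def hrefs(page, xpath):
    """Return the href attribute of every element matched by xpath."""
    root = ET.fromstring(page)
    return [a.get("href") for a in root.findall(xpath)]


strict_old = hrefs(OLD_PAGE, STRICT)
strict_new = hrefs(NEW_PAGE, STRICT)
relaxed_old = hrefs(OLD_PAGE, RELAXED)
relaxed_new = hrefs(NEW_PAGE, RELAXED)
print(strict_old, strict_new, relaxed_old, relaxed_new)
```

The fully spelled-out path stops matching once the extra wrapper appears, while the version with a double slash still finds the link on both pages.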