added further documentation to README.md

alpcentaur 2024-01-21 14:07:38 +00:00
parent 0500f5853d
commit d7d157bf42

README.md

@@ -9,19 +9,105 @@
|_|
```
Configure fdb-spider in a yaml file.
Spider multi-page databases of links.
Filter and serialize the content to json.
Filter either with xpath syntax,
or with the help of Artificial Neural Networks (work in progress).
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
* [Configuration File Syntax](#configuration-file-syntax)
* [Efficient Xpath Copying](#efficient-xpath-copying)
# Introduction
The fdb-spider was made to gather data from websites in an automated way.
The website to be spidered has to be a list of links,
which makes the fdb-spider a web spider for most platforms.
The fdb-spider is configured in a .yaml file to make things easy.
The output of the fdb-spider is in json format, so the json can easily
be fed into other programs.
At its core, the spider outputs tag search based entries.
It works together with the fdb-spider-interface.
In the future, the spider will be extended by the model Sauerkraut,
an open source(!) Artificial Neural Network.
# Installation
Create a python3 virtualenv on your favourite UNIX distribution,
activate it, and install the requirements:
```
git clone https://code.basabuuka.org/alpcentaur/fdb-spider
cd fdb-spider
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
Then install the systemwide requirements with your package manager
```
# apt based unixoids
apt install xvfb
apt install chromium
apt install chromium-webdriver
# pacman based unixoids
pacman -S xorg-server-xvfb
pacman -S chromium
```
# Usage
## Configuration File Syntax
The configuration file with a working syntax template is
```
/spiders/config.yaml
```
Here you can configure new websites to spider, referred to as "databases".
link1 and link2 are the links to be iterated.
The assumption is that every list of links has a loopable structure.
If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
Otherwise leave these out, but set jsdomain to 'None'.
You will find parents and children of the entry list pages.
Here you have to fill in the xpath of the entries to be parsed.
In the entry directive, you have to set uniform to either TRUE or FALSE.
Set it to TRUE if all the entry pages have the same template, and you
are able to specify xpaths again to get the text or whatever variable
you like to extract.
In the entry_unitrue directive you can specify new dimensions, and
the json will adapt to your wishes.
Under the entry-list directive this feature still has to be implemented,
so use name, link, javascript-link, info, period and sponsor by commenting
them in or out.
If javascript-link is set (which means the entry is javascript clickable),
link will be ignored.
Set uniform to FALSE if you have diverse pages behind the entries
and just want to get the main text of all the different links.
For your information, the library trafilatura is used to extract
that text for further processing.
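To make these directives more concrete, here is a rough sketch of what
a database entry in the config could look like. It is only an illustration:
the key names, nesting and example values below are inferred from the
description above, not copied from the shipped template, so compare them
against /spiders/config.yaml before relying on them.
```
# hypothetical database entry, names and nesting inferred from the text above
example-funding-db:
  domain: 'https://www.example.org'
  entry-list:
    # link1 and link2 frame the list pages to be iterated
    link1: 'https://www.example.org/funding?page='
    link2: '&order=asc'
    # values inserted between link1 and link2 while looping over the list pages
    iteration-var-list: '[1,2,3,4]'
    # set to 'None' for plain html links, otherwise configure the js variant
    jsdomain: 'None'
    # xpath of the entry rows (parent) and of the link inside each row (children)
    parent: "//div[@class='results']//article"
    children: "//a/@href"
    # comment the variables in or out that you want in the json output
    name: "//h2/a/text()"
    link: "//h2/a/@href"
    #javascript-link: "//button[@class='details']"
    info: "//p[@class='teaser']/text()"
    period: "//span[@class='deadline']/text()"
    #sponsor: "//span[@class='sponsor']/text()"
  entry:
    # TRUE if all entry pages share one template, FALSE to just grab the main text
    uniform: 'TRUE'
    entry_unitrue:
      # define your own dimensions here, the json adapts to them
      title: "//h1/text()"
      text: "//div[@class='content']//text()"
```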
## Efficient Xpath Copying
When copying the XPath, most modern web browsers are of help.
In Firefox (or browsers built on it, like the Tor Browser) you can press
```
ctrl-shift-c
```
to open the "Inspector" in "Pick an element" mode.
When you click on the desired entry on the page,
the code of the clicked element opens in the html inspector pane.
Now right click on that code in the inspector pane, go to "Copy",
and choose "XPath".
Now you have the XPath of the element in your clipboard.
When pasting it into the config, try to replace some slashes with double
slashes. That makes the spider more robust in case the website's
html/xml is changed for maintenance or other reasons.
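For example (the paths below are made up, just to show the idea), a copied
XPath like
```
/html/body/div[2]/div[1]/table/tbody/tr[3]/td[2]/a
```
can be relaxed to
```
//table//tr[3]/td[2]/a
```
so the entry is still found even if the wrapper divs around the table change.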