diff --git a/README.md b/README.md
index cc0d696..4c6dcad 100644
--- a/README.md
+++ b/README.md
@@ -9,19 +9,105 @@
 |_|
 ```
-Configure fdb-spider in a yaml file.
-Spider Multi page databases of links.
-Filter and serialize content to json.
-Filter either by xpath syntax.
-Or Filter with the help of Artificial Neural Networks (work in progress).
+1. [Introduction](#introduction)
+2. [Installation](#installation)
+3. [Usage](#usage)
+   * [Configuration File Syntax](#configuration-file-syntax)
+   * [Efficient Xpath Copying](#efficient-xpath-copying)
-To run this, create a python3 virtualenv, pip install -r requirements,
-and
+
+# Introduction
+
+The fdb-spider was made to gather data from websites in an automated way.
+The website to be spidered has to be a list of links, which makes the
+fdb-spider a web spider for most platforms.
+The fdb-spider is configured in a .yaml file to keep things simple.
+The output of the fdb-spider is in json format, so it can easily be fed
+into other programs.
+
+At its core, the spider outputs entries based on tag searches.
+
+It works together with the fdb-spider-interface.
+
+In the future, the spider will be extended by the model Sauerkraut,
+an open source Artificial Neural Network.
+
+# Installation
+
+Create a python3 virtualenv on your favourite UNIX distribution
+with the following commands
+
+```
+git clone https://code.basabuuka.org/alpcentaur/fdb-spider
+cd fdb-spider
+virtualenv venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
+Then install the system-wide requirements with your package manager
+
+```
+# apt based unixoids
+apt install xvfb
+apt install chromium
+apt install chromium-webdriver
+
+# pacman based unixoids
+pacman -S xorg-server-xvfb
+pacman -S chromium
+```
+
+# Usage
+
+## Configuration File Syntax
+
+The configuration file with a working syntax template is
+
+```
+/spiders/config.yaml
+```
+
+Here you can configure new websites to spider, referred to as "databases".
+
+link1 and link2 are the links to be iterated over.
+The assumption is that every list of links has a loopable structure.
+If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
+Otherwise leave these out, but set jsdomain to 'None'.
+
+You will find parent and child directives for the entry list pages.
+Here you have to fill in the xpath of the entries to be parsed.
+
+In the entry directive, you have to set uniform to either TRUE or FALSE.
+
+Set it to TRUE if all the entry pages share the same template; you can
+then specify xpath expressions again to get the text or whatever variable
+you like. In the entry_unitrue directive you can specify new dimensions,
+and the json output will adapt to your wishes. Under the entry-list
+directive this feature is still to be implemented, so there you use
+name, link, javascript-link, info, period and sponsor by commenting them
+in or out. If javascript-link is set (which means the entry is clickable
+via javascript), link will be ignored.
+
+Set it to FALSE if you have diverse pages behind the entries and just
+want to get the main text of all the different links. In that case the
+library trafilatura is used to gather the text for further processing.
+A rough sketch of such a configuration is shown below.
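+
+To make the directives above more concrete, here is a rough sketch of what
+one such "database" block could look like. This is only an illustration:
+the key names, nesting and values are assumptions derived from the
+description above, not the shipped schema, so always compare with the
+working template in /spiders/config.yaml.
+
+```yaml
+# Illustrative sketch only -- key names, nesting and values are assumptions,
+# not the shipped schema; use /spiders/config.yaml as the reference.
+example-database:
+  link1: 'https://example.org/calls?page='  # link part before the loop variable
+  link2: '&sort=date'                       # link part after the loop variable
+  iteration-var-list: '[1,2,3,4]'           # values iterated between link1 and link2
+  jsdomain: 'None'                          # only set a real domain for javascript link lists
+  parent: "//div[@id='results']"            # xpath of the entry list on the list pages
+  child: "//article/a/@href"                # xpath of the single entries inside the parent
+  entry:
+    uniform: 'TRUE'                         # TRUE: all entry pages share one template
+    entry_unitrue:
+      name: "//h1//text()"                  # custom dimensions; the json output adapts to them
+      info: "//div[@class='summary']//text()"
+```
+
+With uniform set to 'FALSE', the xpath dimensions would be left out and only
+the main text of each entry page is gathered via trafilatura.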
+
+## Efficient Xpath Copying
+
+When copying an xpath, most modern web browsers are of help.
+In Firefox (or browsers built on it, like the Tor Browser) you can press
+`Ctrl-Shift-C` to open the "Inspector" in "Pick an element" mode.
+When you click on the desired entry on the page, the actual code of the
+clicked element is shown in the html search box of the Inspector.
+Now right click on the code in the html search box, go to "Copy",
+and then to "XPath".
+Now you have the xpath of the element in your clipboard.
+When pasting it into the config, try to replace some slashes with double
+slashes. That will make the spider more stable in case the website's
+html/xml is changed for maintenance or other reasons.
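+
+As a purely illustrative example (the element names and the config key are
+made up), an xpath copied verbatim from the Inspector and a more robust
+variant with doubled slashes could look like this:
+
+```yaml
+# Hypothetical key name -- check /spiders/config.yaml for the real directive.
+# As copied from the browser: breaks as soon as any wrapper element changes.
+parent: "/html/body/div[2]/div[1]/div[3]/table/tbody/tr/td[1]/a"
+# With doubled slashes: anchors only on the stable parts of the page,
+# so added or removed intermediate elements no longer break the match.
+# parent: "//table//tr/td[1]/a"
+```
+
+A double slash in xpath means "any descendant", which is why such an
+expression keeps matching after small maintenance changes to the page.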