```
  __     _ _                     _     _
 / _| __| | |__        ___ _ __ (_) __| | ___ _ __
| |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ \ '__|
|  _| (_| | |_) |_____\__ \ |_) | | (_| |  __/ |
|_|  \__,_|_.__/      |___/ .__/|_|\__,_|\___|_|
                          |_|
```

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
    * [Configuration File Syntax](#configuration-file-syntax)
    * [Efficient Xpath Copying](#efficient-xpath-copying)

# Introduction

The fdb-spider was made to gather data from websites in an automated way.
The website to be spidered has to be a list of links, which makes the
fdb-spider a web spider for most platforms.
The fdb-spider is configured in a .yaml file to make things easy.
The output of the fdb-spider is in JSON format, to make it easy to feed
the JSON into other programs.

At its core, the spider outputs entries based on tag searches.

It works together with the fdb-spider-interface.

In the future, the spider will be extended by the model Sauerkraut,
an open source artificial neural network.

# Installation

Clone the repository and create a Python 3 virtualenv on your favourite
UNIX distribution:

```
git clone https://code.basabuuka.org/alpcentaur/fdb-spider
cd fdb-spider
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

Then install the system-wide requirements with your package manager:
```
# apt based unixoids
apt install xvfb
apt install chromium
apt install chromium-webdriver

# pacman based unixoids
pacman -S xorg-server-xvfb
pacman -S chromium
```
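
Xvfb provides a virtual display for the Chromium that renders javascript
pages. As a rough sketch, a run could be wrapped with xvfb-run; the entry
script name `main.py` here is a placeholder, not necessarily the real one:

```
# hypothetical invocation -- substitute the spider's actual entry script
xvfb-run -a python main.py
```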

# Usage

## Configuration File Syntax

The configuration file with a working syntax template is
```
/spiders/config.yaml
```

Here you can configure new websites to spider, referred to as "databases".

link1 and link2 are the links to be iterated over.
The assumption is that every list of links has a loopable structure.
If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
Otherwise leave these out, but set jsdomain to 'None'.

You will find parents and children of the entry list pages.
Here you have to fill in the xpath expressions of the entries to be
parsed, as illustrated in the sketch below.
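
As an illustration, a database block might look like the following sketch.
The key names and values here are assumptions modeled on the description
above, not the authoritative schema; consult the template in
/spiders/config.yaml for the real key names:

```
# hypothetical sketch -- key names are assumptions, see /spiders/config.yaml
example-database:
  entry-list:
    link1: 'https://example.org/list?page='  # part of the url before the iteration variable
    link2: '&sort=date'                      # part of the url after the iteration variable
    jsdomain: 'None'                         # or js[domain,link[1,2],iteration-var-list] for javascript links
    iteration-var-list: '[1,2,3,4]'          # values inserted between link1 and link2
    parent: "//div[@class='results']//ul"    # xpath of the list containing the entries
    children: "//li//a/@href"                # xpath of the entries relative to the parent
```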

In the entry directive, you have to set uniform to either TRUE or FALSE.

Set it to TRUE if all the entry pages share the same template, and you
are able to specify xpath expressions again to get the text or whatever
variable you would like to extract.
In the entry_unitrue directive, you can specify new dimensions, and
the json output will adapt to your wishes.
Under the entry-list directive this feature still has to be implemented,
so use name, link, javascript-link, info, period and sponsor by commenting
them in or out.
If javascript-link is set (which means the entry is javascript clickable),
link will be ignored.
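
A sketch of what such an entry directive could look like; again, the exact
key names are assumptions based on the description above, so check the
template in /spiders/config.yaml before relying on them:

```
# hypothetical sketch -- names mirror the directives described above
entry:
  uniform: 'TRUE'                 # all entry pages share the same template
  entry_unitrue:
    title: "//h1/text()"          # any new dimension you add appears in the json output
    deadline: "//span[@class='deadline']/text()"
  entry-list:
    name: "//h2//a/text()"
    link: "//h2//a/@href"
    #javascript-link: "//h2//a"   # if commented in, link will be ignored
    info: "//p[@class='teaser']/text()"
    period: "//span[@class='period']/text()"
    sponsor: "//span[@class='sponsor']/text()"
```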

Set it to FALSE if you have diverse pages behind the entries and want to
generally get the main text of all the different links.
For your information, the library trafilatura is used to extract the
main text for further processing.
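
To get an idea of what that main-text extraction does, here is a minimal
standalone example of trafilatura itself (the url is a placeholder; this
is not the spider's own code):

```
# minimal trafilatura example -- the url is a placeholder
import trafilatura

downloaded = trafilatura.fetch_url('https://example.org/some-entry-page')
main_text = trafilatura.extract(downloaded)  # returns the main text as a string, or None
print(main_text)
```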

## Efficient Xpath Copying

When copying the Xpath, most modern web browsers are of help.
In Firefox (or browsers built on it, like the Tor Browser) you can press
```
Ctrl+Shift+C
```
to open the Inspector in "Pick an element" mode.
When you click on the desired entry on the page, the actual code of the
clicked element opens in the html search box.
Now right-click on the code in the html search box, go to "Copy",
and choose "XPath".
Now you have the XPath of the element in your clipboard.
When pasting it into the config, try to replace some slashes with double
slashes. That will make the spider more stable in case the website's
html/xml gets changed for maintenance or other reasons.
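
For example, a copied XPath and a more robust variant with double slashes
might look like this (the concrete path is made up for illustration):

```
# as copied from the browser (made-up example)
/html/body/div[2]/main/ul/li[3]/a
# more stable variant with double slashes
//main//ul/li[3]/a
```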