- ```
- __ _ _ _ _
- / _| __| | |__ ___ _ __ (_) __| | ___ _ __
- | |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ | '__|
- | _| (_| | |_) |_____\__ | |_) | | (_| | __| |
- |_| \__,_|_.__/ |___| .__/|_|\__,_|\___|_|
- |_|
- ```
- 1. [Introduction](#introduction)
- 2. [Installation](#installation)
- 3. [Usage](#usage)
- * [Configuration File Syntax](#configuration-file-syntax)
- * [Efficient Xpath Copying](#efficient-xpath-copying)
- * [Step By Step Guide](#step-by-step-guide)
- # Introduction
- The fdb-spider was made to gather data from websites in an automated way.
- The website to be spidered has to be a list of links,
- which makes the fdb-spider a web spider for most platforms.
- The fdb-spider is configured in a .yaml file to make things easy.
- The output of the fdb-spider is in json format, to make it easy to feed
- the json into other programs.
- At its core, the spider outputs entries based on tag search.
- It works together with the fdb-spider-interface.
- In the future, the spider will be extended by the model Sauerkraut,
- an open source artificial neural network.
- # Installation
- Create a python3 virtualenv on your favourite UNIX distribution
- with the commands
- ```
- git clone https://code.basabuuka.org/alpcentaur/fdb-spider
- cd fdb-spider
- virtualenv venv
- source venv/bin/activate
- pip install -r requirements.txt
- ```
- Then install the system-wide requirements with your package manager
- ```
- # apt based unixoids
- apt install xvfb
- apt install chromium
- apt install chromium-webdriver
- # pacman based unixoids
- pacman -S xorg-server-xvfb
- pacman -S chromium
- ```
- # Usage
- ## Configuration File Syntax
- The configuration file, which also serves as a working syntax template, is
- ```
- /spiders/config.yaml
- ```
- Here you can configure new websites to spider, referred to as "databases".
- link1 and link2 are the links to be iterated.
- The assumption is that every list of links has a loopable structure.
- If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
- Otherwise leave them out, but set jsdomain to 'None'.
- Below that you will find the parents and children of the entry list pages.
- Here you have to fill in the xpaths of the entries to be parsed.
- In the entry directive, you have to set uniform to either TRUE or FALSE.
- Set it to TRUE if all the entry pages share the same template and you
- are able to specify xpaths again to get the text or whatever variable you
- would like to extract.
- In the entry_unitrue directive, you can specify new dimensions and
- the json will adapt to your wishes.
- Under the entry-list directive this feature is not implemented yet,
- so use name, link, javascript-link, info, period and sponsor by commenting
- them in or out.
- If javascript-link is set (which means the entry is clickable via javascript),
- link will be ignored.
- Set uniform to FALSE if you have diverse pages behind the entries
- and want to get the main text of all the different links in general.
- For your information, the library trafilatura is used to gather the
- text generically for further processing.
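The rules of this section can be summarised in a small Python sketch. The key names follow the ones mentioned above, but the nesting of the dictionary, the xpath values and the three helper functions are assumptions made for illustration; they are not taken from the fdb-spider code or from `/spiders/config.yaml`.

```python
# Hypothetical excerpt of one "database". Key names follow this section;
# the nesting and the xpath values are assumptions for illustration only.
example_database = {
    "jsdomain": "None",
    "entry-list": {
        "name": "//table//tr//td[1]//text()",
        "link": "//table//tr//td[1]//a/@href",
        "javascript-link": "",
    },
    "entry": {"uniform": "FALSE"},
}


def uses_javascript(database):
    """Links count as javascript links unless jsdomain is 'None'."""
    return database.get("jsdomain", "None") != "None"


def pick_link_source(entry_list):
    """If javascript-link is set, link is ignored."""
    if entry_list.get("javascript-link"):
        return "javascript-link"
    return "link" if entry_list.get("link") else None


def extraction_mode(entry):
    """uniform TRUE: per-field xpaths; uniform FALSE: generic main text."""
    is_uniform = str(entry.get("uniform")).upper() == "TRUE"
    return "xpath" if is_uniform else "main-text"


print(uses_javascript(example_database))
print(pick_link_source(example_database["entry-list"]))
print(extraction_mode(example_database["entry"]))
```

Read this only as a checklist of the decisions the configuration encodes; the real file may name or group the fields differently.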
- ## Efficient Xpath Copying
- When copying the XPath, most modern web browsers are of help.
- In Firefox (or browsers built on it, like the Tor Browser) you can use
- ```
- ctrl-shift-c
- ```
- to open the "Inspector" in "Pick an element" mode.
- When you click on the desired entry on the page,
- it opens the actual code of the clicked element in the html search box.
- Now right-click on the code in the html search box, go to "Copy",
- and choose "XPath".
- Now you have the XPath of the element in your clipboard.
- When pasting it into the config, try to replace some slashes with double
- slashes. That will make the spider more stable in case the website's
- html/xml gets changed for maintenance or other reasons.
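Why double slashes help can be seen in a minimal sketch. It uses Python's standard `xml.etree.ElementTree` as a stand-in (the spider itself may use a different XPath engine), and both pages and both paths are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page: in the second one the
# maintainers wrapped the table in an extra <section> element.
OLD_PAGE = (
    "<html><body><div><table><tr><td>"
    "<a href='/job/1'>Job 1</a>"
    "</td></tr></table></div></body></html>"
)
NEW_PAGE = (
    "<html><body><div><section><table><tr><td>"
    "<a href='/job/1'>Job 1</a>"
    "</td></tr></table></section></div></body></html>"
)

# Path as copied from the browser (every level spelled out) ...
STRICT = "./body/div/table/tr/td/a"
# ... and the same path with some slashes replaced by double slashes.
LOOSE = ".//table//a"


def hrefs(page, xpath):
    """Return the href of every element matched by xpath."""
    root = ET.fromstring(page)
    return [a.get("href") for a in root.findall(xpath)]


for label, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(label, "strict:", hrefs(page, STRICT), "loose:", hrefs(page, LOOSE))
```

The strict path stops matching as soon as one level is inserted, while the path with double slashes still finds the link.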
- ## Step By Step Guide
- Start with an old configuration that is similar to what you need.
- There are three types of configurations:
- The first type is purely path based. An example is greenjobs.de.
- The second type is a mixture of path and javascript functions. giz is an example of this type.
- The third type is purely javascript based. An example is ted.europa.eu.
- Type 1:
- Start by collecting every variable,
- from top to bottom.
- ### var domain
- domain is the variable for the root of the website.
- In case links have to be glued together, they are glued onto this root.
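What gluing a link onto the root can look like is sketched below with `urllib.parse.urljoin` from the Python standard library. This is an assumption for illustration: the spider may simply concatenate strings, and `glue` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Root of the website, as it would be set in the domain variable.
domain = "https://www.greenjobs.de/"


def glue(domain, link):
    """Join a link found on a page onto the root of the website.

    Relative links are resolved against the root; links that are already
    absolute are returned unchanged.
    """
    return urljoin(domain, link)


print(glue(domain, "/angebote/index.html"))
# A hypothetical absolute link, which needs no gluing:
print(glue(domain, "https://example.org/job/1"))
```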
- ### var entry-list
- Now come all the variables regarding the entry list pages.
- #### var link1, link2 and iteration-var-list
- In pseudo code, what happens with these three variables is
- ```
- for n in iteration_var_list:
-     get(link1 + str(n) + link2)
- ```
- So if you are on the no-javascript side of reality, you are lucky: that is all that is needed to get the collection of links.
- An example to understand better:
- Let's say we go to greenjobs.de.
- We run a search without a search query to get the biggest possible output, in the best case a table of everything the site has listed.
- https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=
- is the resulting url.
- So now we navigate through the pages.
- In this case everything is displayed and scrollable on exactly this url, which means we leave link2 and iteration-var-list empty and put the resulting url into link1.
- Another example:
- This time we go to giz. There we have https://ausschreibungen.giz.de/Satellite/company/welcome.do as our url for a general search. If I click the "nextpage" button of the displayed table, a new url pattern appears on the next page:
- https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
- Going to the next page again, we get the url:
- https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3
- So now we already see the pattern that no machine generated output can hide.
- We set RESULT=1 at the end and put it in the url bar of the browser
- https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1
- and get to the first page.
- This leads to the following variables, considering that there were 6 pages:
- * link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
- * link2 = ""
- * iteration-var-list = "[1,2,3,4,5,6]"
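With these three values, the loop from the pseudo code can be written out as a runnable Python sketch. `build_page_urls` is a hypothetical helper, and parsing the list string with `json.loads` is an assumption; the spider's own handling may differ.

```python
import json


def build_page_urls(link1, link2, iteration_var_list):
    """Return one url per iteration value: link1 + n + link2."""
    # Assumption: the list is stored as a string such as "[1,2,3]" and can
    # be parsed as JSON. The spider's own parsing may differ.
    values = json.loads(iteration_var_list)
    return [f"{link1}{n}{link2}" for n in values]


link1 = (
    "https://ausschreibungen.giz.de/Satellite/company/welcome.do"
    "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
    "&tableSortAttributePROJECT_RESULT=publicationDate"
    "&selectedTablePagePROJECT_RESULT="
)
link2 = ""
iteration_var_list = "[1,2,3,4,5,6]"

urls = build_page_urls(link1, link2, iteration_var_list)
for url in urls:
    print(url)
```

Each printed url is one page of the giz result table, from page 1 to page 6.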
- Having configured the links, we can move on to
- #### var parent
- The parent means