```
  __     _ _                     _     _
 / _| __| | |__        ___ _ __ (_) __| | ___ _ __
| |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ | '__|
|  _| (_| | |_) |_____\__ | |_) | | (_| |  __| |
|_|  \__,_|_.__/      |___| .__/|_|\__,_|\___|_|
                          |_|
```

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
    * [Configuration File Syntax](#configuration-file-syntax)
    * [Efficient Xpath Copying](#efficient-xpath-copying)
    * [Step By Step Guide](#step-by-step-guide)

# Introduction

The fdb-spider was made to gather data from websites in an automated way.
The website to be spidered has to be a list of links,
which makes the fdb-spider a web spider for most platforms.
The fdb-spider is configured in a .yaml file to make things easy.
The output of the fdb-spider is in json format, so that it is easy to feed
the json into other programs.

At its core, the spider outputs entries based on tag searches.

It works together with the fdb-spider-interface.

In the future, the spider will be extended by the model Sauerkraut,
an open source Artificial Neural Network.

# Installation

Create a python3 virtualenv on your favourite UNIX distribution
with the commands

```
git clone https://code.basabuuka.org/alpcentaur/fdb-spider
cd fdb-spider
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

Then install the systemwide requirements with your package manager:

```
# apt based unixoids
apt install xvfb
apt install chromium
apt install chromium-webdriver

# pacman based unixoids
pacman -S xorg-server-xvfb
pacman -S chromium
```

# Usage

## Configuration File Syntax

The configuration file, which doubles as a working syntax template, is

```
/spiders/config.yaml
```

Here you can configure new websites to spider, referred to as "databases".

link1 and link2 are the links to be iterated.
The assumption is that every list of links has a loopable structure.
If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
Otherwise leave them out, but set jsdomain to 'None'.

Next you will find the parents and children of the entry list pages.
Here you have to fill in the xpath of the entries to be parsed.

In the entry directive, you have to set uniform to either TRUE or FALSE.

Set it to TRUE if all the entry pages have the same template and you
are able to specify xpaths again to get the text or whatever variable you
would like to extract.
In the entry_unitrue directive, you can specify new dimensions and
the json will adapt to your wishes.
Under the entry-list directive this feature still has to be implemented,
so use name, link, javascript-link, info, period and sponsor by commenting
them in or out.
If javascript-link is set (which means the entry is javascript clickable),
link will be ignored.

Set it to FALSE if you have diverse pages behind the entries
and want to get the main text of all the different links in general.
For your information, the library trafilatura is used to gather the
text generally for further processing.
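
The two switches described above can be pictured with a small Python sketch. It is only an illustration of the rules as stated in this section: the function names `link_source` and `extraction_mode` and the returned labels are hypothetical and are not taken from the fdb-spider source code.

```python
# Hypothetical helpers; the names below are illustrative and are not
# taken from the fdb-spider source code.

def link_source(entry_fields):
    """Pick the field to follow for one entry of the entry list.

    Mirrors the rule in the text: if javascript-link is set, link is ignored.
    """
    if entry_fields.get("javascript-link"):
        return "javascript-link"
    return "link"


def extraction_mode(uniform):
    """Map the uniform switch of the entry directive to an extraction strategy."""
    value = str(uniform).strip().upper()
    if value == "TRUE":
        return "xpath"        # same template everywhere: read fields by xpath
    if value == "FALSE":
        return "main-text"    # diverse pages: take the main text generically
    raise ValueError(f"uniform must be TRUE or FALSE, got {uniform!r}")


if __name__ == "__main__":
    print(link_source({"link": "//a/@href", "javascript-link": "//a"}))
    print(link_source({"link": "//a/@href"}))
    print(extraction_mode("TRUE"))
    print(extraction_mode("FALSE"))
```

The xpath strings in the usage lines are placeholders. In a real configuration they would be the xpaths copied from the website as described in the next section.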

## Efficient Xpath Copying

When copying the XPath, most modern web browsers are of help.
In Firefox (or browsers built on it, like the Tor Browser) you can use

```
ctrl-shift-c
```

to open the "Inspector" in "Pick an element" mode.
When you click on the desired entry on the page,
the Inspector shows the actual code of the clicked element in its HTML pane.
Now right-click on the code in the HTML pane, go to "Copy",
and choose "XPath".
Now you have the XPath of the element in your clipboard.
When pasting it into the config, try to replace some slashes with double
slashes. That will make the spider more stable in case the website's
html/xml gets changed for maintenance or other reasons.
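
Why double slashes help can be seen in a minimal sketch with Python's standard library. It assumes two made-up page snippets (`ORIGINAL` and `REDESIGNED`) and uses `xml.etree.ElementTree`, which only supports a subset of XPath and evaluates paths relative to the root element. The XPath engine used by the spider itself may behave differently in details.

```python
import xml.etree.ElementTree as ET

# Made-up page snippet, only for illustration.
ORIGINAL = """<html><body>
  <div><ul><li><a href="/job/1">Job 1</a></li></ul></div>
</body></html>"""

# The same content after a redesign wrapped the list in an extra <section>.
REDESIGNED = """<html><body>
  <section><div><ul><li><a href="/job/1">Job 1</a></li></ul></div></section>
</body></html>"""

RIGID = "./body/div/ul/li/a"   # every level spelled out with single slashes
TOLERANT = ".//ul//a"          # double slashes skip the levels in between


def count_matches(page_source, path):
    """Return how many elements the path selects in the given page."""
    root = ET.fromstring(page_source)
    return len(root.findall(path))


if __name__ == "__main__":
    for name, page in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
        print(name, "rigid:", count_matches(page, RIGID),
              "tolerant:", count_matches(page, TOLERANT))
```

The rigid path stops matching as soon as one extra wrapper element appears, while the path with double slashes still finds the link.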

## Step By Step Guide

Start with an existing configuration that is similar to what you need.
There are three types of configurations:

The first type is purely path based. An example is greenjobs.de.
The second type is a mixture of path and javascript functions. giz is an example of this type.
The third type is purely javascript based. An example is ted.europa.eu.

Type 1:

Start with collecting every variable,
from top to bottom.

### var domain

domain is the variable for the root of the website.
In case links have to be glued together, they are glued based on this root.
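
As a rough picture of what gluing a link onto the root means, here is a minimal sketch using `urllib.parse.urljoin` from the Python standard library. The helper name `glue` is hypothetical, and the spider may glue links differently, for example by plain string concatenation.

```python
from urllib.parse import urljoin


def glue(domain, link):
    """Resolve a link found on a page against the root of the website."""
    # Relative links are attached to the root; absolute links stay unchanged.
    return urljoin(domain, link)


if __name__ == "__main__":
    print(glue("https://www.greenjobs.de", "/angebote/index.html"))
    print(glue("https://www.greenjobs.de", "https://ausschreibungen.giz.de/Satellite/company/welcome.do"))
```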

### var entry-list

Now come all the variables regarding the entry list pages.

#### var link1, link2 and iteration-var-list

In pseudo code, what happens with these three variables is

```
for n in iteration-var-list:
    get(link1 + n + link2)
```

So if you are on the no-javascript side of reality, you are lucky. That is all that is needed to get the collection of links.

An example to understand it better:
Let's say we go to greenjobs.de.
We run a search without a search query to get the biggest possible output, in the best case a table of everything the site has listed.

https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=

is the resulting url.
Now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this url. This means we leave link2 and iteration-var-list empty and put the resulting url into link1.

Another example:
This time we go to giz. There we have https://ausschreibungen.giz.de/Satellite/company/welcome.do as our url for a general search. If I click on the "nextpage" button of the displayed table, a new url pattern appears on the next page:

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2

Going to the next page again, we get the url:

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3

So now we already see the pattern that no machine generated output can hide.
selectedTablePagePROJECT_RESULT=1 is the obvious guess for the first page, so we put it in the url bar of the browser

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1

and get to the first page.
This leads to the following variables, considering that there were 6 pages:

* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
* link2 = ""
* iteration-var-list = "[1,2,3,4,5,6]"
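
With these three values, the pseudo code from above can be tried out in plain Python. This is a minimal sketch and not the spider's own code: the function name `build_urls` is hypothetical, parsing iteration-var-list as a JSON list is an assumption, and so is the fallback that an empty list means link1 is fetched once, as in the greenjobs.de example.

```python
import json

LINK1 = ("https://ausschreibungen.giz.de/Satellite/company/welcome.do"
         "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
         "&tableSortAttributePROJECT_RESULT=publicationDate"
         "&selectedTablePagePROJECT_RESULT=")
LINK2 = ""
ITERATION_VAR_LIST = "[1,2,3,4,5,6]"


def build_urls(link1, link2, iteration_var_list):
    """Return the list page urls as link1 + n + link2 for every n."""
    # Assumption: the list is written like a JSON array in the config.
    values = json.loads(iteration_var_list) if iteration_var_list else []
    if not values:
        # Assumption: with an empty list, the single url in link1 is used.
        return [link1 + link2]
    return [f"{link1}{n}{link2}" for n in values]


if __name__ == "__main__":
    for url in build_urls(LINK1, LINK2, ITERATION_VAR_LIST):
        print(url)
```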

Having done this configuration, we can move on to

#### var parent

The parent means