```
  __     _ _                     _     _
 / _| __| | |__        ___ _ __ (_) __| | ___ _ __
| |_ / _` | '_ \ _____/ __| '_ \| |/ _` |/ _ \ '__|
|  _| (_| | |_) |_____\__ \ |_) | | (_| |  __/ |
|_|  \__,_|_.__/      |___/ .__/|_|\__,_|\___|_|
                          |_|
```
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
    * [Configuration File Syntax](#configuration-file-syntax)
    * [Efficient Xpath Copying](#efficient-xpath-copying)
    * [Step By Step Guide](#step-by-step-guide)
        - [var domain](#var-domain)
        - [var entry list](#var-entry-list)
        - [var link and iteration](#var-link-and-iteration)
        - [example1 link](#example1-link)
        - [example2 link](#example2-link)
        - [javascript](#javascript)
        - [var jsdomain](#var-jsdomain)
        - [vars jslink and jsiteration](#vars-jslink-and-jsiteration)
        - [example1 jslink and jsiteration](#example1-jslink-and-jsiteration)
        - [display](#display)
        - [var parent](#var-parent)
        - [example1 parent](#example1-parent)
        - [vars children](#vars-children)
        - [var javascript link](#var-javascript-link)
        - [var slow downloading](#var-slow-downloading)
# Introduction

The fdb-spider was made to gather data from websites in an automated way.
The website to be spidered has to be a list of links, which makes the
fdb-spider a web spider for most platforms.
The fdb-spider is configured in a .yaml file to keep things easy.
The output of the fdb-spider is in JSON format, to make it easy to feed
into other programs.
At its core, the spider outputs entries based on tag searches.
It works together with the fdb-spider-interface.
In the future, the spider will be extended by the model Sauerkraut,
an *open source* artificial neural network.
# Installation

Create a python3 virtualenv on your favourite UNIX distribution
with the commands

```
git clone https://code.basabuuka.org/alpcentaur/fdb-spider
cd fdb-spider
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```

Then install the system-wide requirements with your package manager:

```
# apt based unixoids
apt install xvfb
apt install chromium
apt install chromium-webdriver

# pacman based unixoids
pacman -S xorg-server-xvfb
pacman -S chromium
```
# Usage

## Configuration File Syntax

The configuration file with a working syntax template is

```
/spiders/config.yaml
```

Here you can configure new websites to spider, referred to as "databases".

link1 and link2 are the links to be iterated.
The assumption is that every list of links has a loopable structure.
If the links are javascript links, specify js[domain,link[1,2],iteration-var-list].
Otherwise leave them out, but specify jsdomain as 'None'.

You will find parents and children of the entry list pages.
Here you have to fill in the xpath of the entries to be parsed.

In the entry directive, you have to set uniform to either TRUE or FALSE.
Set it to TRUE if all the entry pages have the same template and you
are able to specify xpath again to get the text or whatever variable
you would like to extract.
In the entry_unitrue directive, you can specify new dimensions, and
the json will adapt to your wishes.
Under the entry-list directive this feature still has to be implemented,
so use name, link, javascript-link, info, period and sponsor by commenting
them in or out.
If javascript-link is set (which means the entry is javascript clickable),
link will be ignored.
Set uniform to FALSE if you have diverse pages behind the entries
and generally want to get the main text of all the different links.
For your information, the library trafilatura is used to extract the
text for further processing.
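To illustrate the shape of such a database entry, here is a rough sketch of a purely path based site. Every value is a made-up placeholder, and the exact key nesting in your version is defined by the template in /spiders/config.yaml, not by this sketch:

```yaml
# Hypothetical entry; key names mirror the ones discussed in this README,
# all values are placeholders.
example-site:
  domain: 'https://www.example.org/'
  entry-list:
    link1: 'https://www.example.org/list?page='
    link2: '&lang=de'
    iteration-var-list: "[1, 2, 3]"
    jsdomain: 'None'
    parent: "//div[contains(@class, 'results')]//tr"
    child-name: "//td[1]/a/text()"
    child-link: "//td[1]/a/@href"
  entry:
    uniform: 'FALSE'
```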
## Efficient Xpath Copying

When copying the xpath, most modern web browsers are of help.
In Firefox (or browsers built on it, like the Tor Browser) you can press

```
ctrl-shift-c
```

to open the "Inspector" in "Pick an element" mode.
When you click on the desired entry on the page,
the inspector opens the actual code of the clicked element in the html search box.
Now right click on the code in the html search box, go to "Copy",
and choose "XPath".
Now you have the XPath of the element in your clipboard.
When pasting it into the config, try to replace some slashes with double
slashes. That makes the spider more stable in case the website's
html/xml gets changed for maintenance or other reasons.
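Why the double slashes help can be demonstrated with lxml (which requirements.txt pulls in anyway; the html snippet below is made up): a fully absolute path breaks as soon as the site wraps the table in one more div, while a path with `//` keeps matching.

```python
from lxml import etree

# Made-up page: the table got wrapped in an extra div during maintenance.
html = etree.HTML("""
<html><body>
  <div class="wrapper">
    <table><tr><td><a href="/job/1">Entry</a></td></tr></table>
  </div>
</body></html>
""")

# Fully absolute xpath, as copied from the inspector before the change:
absolute = html.xpath("/html/body/table/tr/td/a")
# The same route with some slashes doubled:
robust = html.xpath("//table//a")

print(len(absolute), len(robust))
```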
## Step By Step Guide

Start with an old configuration that is similar to what you need.
There are three types of configurations:
The first type is purely path based. An example is greenjobs.de.
The second type is a mixture of path and javascript functions; giz is an example of this type.
The third type is purely javascript based. An example is ted.europa.eu.

Type 1:
Start with collecting every variable, from top to bottom.
### var domain

domain is the variable for the root of the website.
In case links are glued, they will be glued based on this root.

### var entry list

Now come all the variables regarding the entry list pages.
#### var link and iteration

In pseudo code, what happens with these three variables is

```
for n in iteration-var-list:
    get(link1 glued to n glued to link2)
```

So if you are on the no-javascript side of reality, you are lucky. That is all that is needed to get the collection of links.
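Written out in python, the pseudo code above amounts to nothing more than string gluing. link1, link2 and iteration-var-list here are made-up placeholders, not values from a real config:

```python
# Hypothetical stand-ins for the config variables link1, link2
# and iteration-var-list.
link1 = "https://www.example.org/list?page="
link2 = "&lang=de"
iteration_var_list = [1, 2, 3]

# "link1 glued to n glued to link2" for every n in the list:
urls = [link1 + str(n) + link2 for n in iteration_var_list]
for url in urls:
    print(url)
```

Each resulting url is then downloaded; with an empty link2 and an empty iteration-var-list, only link1 itself is fetched.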
#### example1 link

Let's say we go on greenjobs.de.
We go on search without a search query, to get the biggest displayed output, in the best case a table of everything the site has listed.

https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=

is the resulting url.
So now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this url, which means we leave link2 and iteration-var-list empty and put the resulting url into link1.
#### example2 link

This time we go on giz. There we have https://ausschreibungen.giz.de/Satellite/company/welcome.do as our url for a general search. If we go on the "next page" button of the displayed table, a new url pattern appears, being on the next page:

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2

Going on the next page again, we get the url:

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3

So now we already see the pattern that any and every machine generated output can't hide.
We change the last parameter to ...RESULT=1, put it in the url bar of the browser

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1

and get to the first page.
This leads to the following variables, considering that there were 6 pages:

* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
* link2 = ""
* iteration-var-list = "[1,2,3,4,5,6]"
#### javascript

It happens that it is not possible to see a pattern in the urls, probably because the website hoster is not smart or just a thief in a bad sense. In this case you only get html gibberish. To get the desired info with the help of this program, you have the possibility to give config.yaml the paths of clickable items. The spider will open an actual browser and click through the pages that start to exist.
#### var jsdomain

If jsdomain is "None" (and here it is important to use None and not NONE), the spider will try to get the domains and download the elements with plain gets, using a variety of libraries.
But if you have a javascript situation, where the first html is already javascript generated chaos without xml to parse, then you need to put an url here. With the url set, the spider will open that website in a virtual graphical browser (using selenium), wait for the javascript to load, and, by clicking on jslink, go through the pages.
In pseudo code, if you fill out the variable jsdomain in config.yaml, the spider will do

```
for i in jsiteration-var-list:
    click on the jslink1 string glued to i glued to jslink2
```
#### vars jslink and jsiteration

In jslink1 and jslink2 you have to put the xpath of the button that clicks to the next site of entry links to download.
Sometimes the xpath changes after the new js content has loaded. That is where the jsiteration-var-list comes in: with it, you define the button that gets clicked on every site. Sometimes it stays the same; then you just need an occurrence of the same number exactly as many times as in iteration-var-list, which defines how many pages will be downloaded in general. The variable iteration-var-list also defines the folder structure of the json output files.
This means we emulate a whole virtual "user", using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by xpath too, so put these accordingly into the jslink and jsiteration variables.
You can run the spider with display=1 instead of display=0 in the python line of the virtual display the chromium driver is running on. I will put that in the initialization of the spider. How to do this, in general if you use any of the js related variables instead of setting jsdomain to "None", is described in the paragraph display. It is very useful for debugging js related configs.
While running the spider and watching the automated mouse moves and clicks, you will be able to find the right xpath for every step and element.
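The gluing of jslink1, i and jslink2 works exactly like the link gluing for plain urls. A small python sketch with hypothetical values (the selenium call is only indicated in a comment, since it needs a running driver session):

```python
# Hypothetical config values in the shape of jslink1/jslink2 and
# jsiteration-var-list; not taken from a real config.
jslink1 = "/html/body/div/main/table/thead/tr/td/div/span["
jslink2 = "]"
# The repeated 5 means: from that point on, the same button is clicked.
jsiteration_var_list = [1, 2, 3, 4, 5, 5, 5]

xpaths = [jslink1 + str(i) + jslink2 for i in jsiteration_var_list]
for xpath in xpaths:
    # With a selenium driver this step would be:
    #   driver.find_element(By.XPATH, xpath).click()
    print(xpath)
```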
#### example1 jslink and jsiteration

So let us consider evergabe-online as an example.

```
evergabe-online:
  domain: 'https://www.evergabe-online.de/'
  entry-list:
    link1: 'https://www.evergabe-online.de/search.html?101-1.-searchPanel>
    link2: '-pageLink'
    jsdomain: 'https://www.evergabe-online.de/search.html'
    jslink1: '/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span['
    jslink2: ']'
    jsiteration-var-list: "[1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"
    iteration-var-list: "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]"
```

Go on jsdomain https://www.evergabe-online.de/search.html.
You will see the table we want to spider.
Open the inspector and have a look at the button that gets you to the next site.
Its xpath is '/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span[1]'.
Now we click on it. On page two, the button that clicks us to page three has the xpath
'/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span[2]'.
From page 5 on, the button to get to the next pages stays
'/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span[6]'
until the end.
#### display

When you run the spider with js spidering enabled, a display will get created in fdb_spider.py. If you open the file in nano and press ctrl+w, you can type "display" and press enter. This brings you to the lines of code generating the display.
Watch out for the line

```
display = Display(visible=0, size=(1200, 800))
```

If you change visible=0 to visible=1 here, the spider will actually open a viewable browser on the workspace while running.
This line is present twice in the code: once for downloading the pages with the links, and once for downloading the pages behind the links.
After finding and jumping to "display" with ctrl+w, go down some lines and issue ctrl+w followed by enter again to jump to the next occurrence.
#### var parent

The parent stands for the last xml element which contains the entry links. Go with the inspector on the entry, respectively one of the links, click on it and then click into the code view. Now use the arrow up key to get to the last child before it comes to the parent. You can see the selected element highlighted in blue on the rendered html.
From the last child, when it is selected in the code view of the inspector, do a right click on it.
Select Copy --> Full XPath.
Now paste the xpath into the parent variable, and put a double slash in front of the last element in the route.

#### example1 parent

For the list of the Gesellschaft für internationale Zusammenarbeit, the last child before the parent is

```
//html//body//div//div//table[contains(@class, 'csx-new-table')]//tbody//tr
```

a tr, which stands for a row in html. This works because GIZ is cooperative and ordered. You need the //tr because that says "search for any tr", which means we are referring to a list of elements.
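As a check, the GIZ parent xpath from above can be tried against a made-up miniature of such a page with lxml (assumed available from requirements.txt); each matched tr is one entry row:

```python
from lxml import etree

# Made-up miniature of a GIZ-style entry list page.
html = etree.HTML("""
<html><body><div><div>
  <table class="csx-new-table">
    <tbody>
      <tr><td><a href="/tender/1">Tender 1</a></td></tr>
      <tr><td><a href="/tender/2">Tender 2</a></td></tr>
    </tbody>
  </table>
</div></div></body></html>
""")

parent = "//html//body//div//div//table[contains(@class, 'csx-new-table')]//tbody//tr"
rows = html.xpath(parent)
print(len(rows))  # one element per entry row
```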
#### vars children

For the children, it is about defining in xpath syntax where the text lies that we want to be parsed automatically. child-name, child-link, child-period, child-info and child-sponsor are defined until now.
In the future it will be possible to define any variables anywhere and get them fed into the json output.
#### var javascript link

In case the whole website to spider is javascript generated gibberish, there is another possibility for you. To find out whether the website is generated gibberish not containing your payload, just search the outputted pages for the child name etc. If you do not find them, or directly see that the html pages contain no real xml, try to download your sites with jsdomain, and go up again to the paragraphs before. For the actual link child, the spider can download the htmls behind the links with javascript style clicking, already while downloading the entry list htmls with javascript.
For that to happen, you can define the javascript link that needs to be clicked in xpath syntax. Whether it opens a pop up whose source code needs to be processed, an actual page whose source code will be parsed, or a pdf behind the clickable link, the spider will handle all situations and output the resulting text as json under spiders/output.
#### var slow downloading

Slow downloading comes into play when a website uses lazy loading and/or is used by too many users, resulting in very slow loading. In this case the selenium part, like any normal user, runs into timing problems, which leads to a much longer processing time when harvesting the lists. Depending on the configuration of the firewall of the server you are harvesting, there may even be a rate limit by time. And again, we can just act as a lot of different instances, where everyone gets a part to download. In this sense, even for future totalitarian systems that may come, the freeing of information from big platforms is always and will always be possible in peer to peer systems.