Merge remote-tracking branch 'refs/remotes/origin/master'

Merging local changes to the code with changes to the README.md on gitea instance
2024-02-29 13:16:48 +00:00
commit c33dbc37e6
parent a07d2e93f6 5fd6b7f781
1 changed files with 75 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@
 3. [Usage](#usage)
  *  [Configuration File Syntax](#configuration-file-syntax)
  *  [Efficient Xpath Copying](#efficient-xpath-copying)
+  *  [Step By Step Guide](#step-by-step-guide)

 # Introduction

@@ -111,3 +112,77 @@ slashes. That will make the spider more stable, in case the websites
 html/xml gets changed for maintenance or other reasons.


+## Step By Step Guide
+
+Start from an existing configuration that is similar to what you need.
+
+There are three types of configuration:
+
+* Type 1 is purely path based. An example is greenjobs.de.
+* Type 2 is a mixture of paths and JavaScript functions. An example is giz.
+* Type 3 is purely JavaScript based. An example is ted.europe.eu.
+
+Type 1:
+
+Start by collecting every variable, working from top to bottom.
+
+### var domain
+
+`domain` is the root URL of the website.
+Whenever links have to be glued together into full URLs, they are glued onto this root.
+
+### var entry-list
+
+Next come all the variables for the entry list pages.
+
+#### var link1, link2 and iteration-var-list
+
+In pseudocode, this is what happens with these three variables:
+
+```
+for n in iteration_var_list:
+    get(link1 + n + link2)
+```
+
+If the site works without JavaScript, you are in luck: that is all you need to collect the links.
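
The pseudocode above can be sketched in Python as follows. This is only an illustration of the concatenation, not the spider's actual code; the function name `build_entry_list_urls` and the example.org URL are made up for this example.

```python
def build_entry_list_urls(link1, link2, iteration_var_list):
    """Return one entry list URL per iteration value: link1 + n + link2."""
    # str(n) is implied by the f-string, so the list may hold ints or strings.
    return [f"{link1}{n}{link2}" for n in iteration_var_list]


# Hypothetical site whose page number sits in the middle of the URL.
urls = build_entry_list_urls(
    "https://example.org/jobs?page=",
    "&sort=date",
    [1, 2, 3],
)

for url in urls:
    print(url)
```

Each resulting URL is one entry list page that would then be fetched.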
+
+
+An example:
+Let's say we go to greenjobs.de.
+We run a search without a search query to get the largest possible output, ideally a table of everything the site has listed.
+
+https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=
+is the resulting URL.
+
+Now we navigate through the pages.
+In this case everything is displayed and scrollable on exactly this URL, so we leave link2 and iteration-var-list empty and put the resulting URL into link1.
+
+Another example:
+This time we go to giz. The URL for a general search is https://ausschreibungen.giz.de/Satellite/company/welcome.do. Clicking the "nextpage" button of the displayed table takes us to the next page, where a new URL pattern appears:
+
+https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
+
+Going to the next page again, we get the URL:
+
+https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3
+
+Now the pattern is visible, as it is in any machine-generated output: only the number after the last `=` changes from page to page.
+
+To check, we set `selectedTablePagePROJECT_RESULT=1` and put the URL into the address bar of the browser
+
+https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1
+
+and land on the first page.
+Assuming there are 6 pages, this leads to the following variables:
+
+* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
+* link2 = ""
+* iteration-var-list = "[1,2,3,4,5,6]"
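
One way to read link1 and link2 off two consecutive page URLs is to take what they share before and after the changing number. A minimal sketch, assuming the two URLs differ only in the page number; `split_paged_url` is a hypothetical helper for illustration and not part of the spider.

```python
import os


def split_paged_url(url_a, url_b):
    """Split two URLs that differ only in a page number into (link1, link2)."""
    digits = "0123456789"
    # Shared start of both URLs; trailing digits are dropped so that a
    # multi-digit page number is never cut in half.
    link1 = os.path.commonprefix([url_a, url_b]).rstrip(digits)
    # Shared end of both URLs, found by comparing the reversed strings.
    link2 = os.path.commonprefix([url_a[::-1], url_b[::-1]])[::-1].lstrip(digits)
    return link1, link2


base = (
    "https://ausschreibungen.giz.de/Satellite/company/welcome.do"
    "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
    "&tableSortAttributePROJECT_RESULT=publicationDate"
    "&selectedTablePagePROJECT_RESULT="
)
page2 = base + "2"
page3 = base + "3"

link1, link2 = split_paged_url(page2, page3)

# Rebuild all page URLs, assuming 6 pages as in the text above.
pages = [f"{link1}{n}{link2}" for n in range(1, 7)]
print(link1)
print(repr(link2))
print(pages[0])
```

Comparing the URLs by eye gives the same result here; a helper like this mainly pays off when the page number sits in the middle of a long URL.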
+
+
+With the entry list URLs configured, we can move on to the next variable.
+
+#### var parent
+
+The parent means