diff --git a/README.md b/README.md
index 4c6dcad..c4c19fb 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@
 3. [Usage](#usage)
     * [Configuration File Syntax](#configuration-file-syntax)
     * [Efficient Xpath Copying](#efficient-xpath-copying)
+    * [Step By Step Guide](#step-by-step-guide)
 
 # Introduction
 
@@ -111,3 +112,77 @@
 slashes. That will make the spider more stable, in case the websites html/xml
 gets changed for maintenance or other reasons.
 
+## Step By Step Guide
+
+Start with an existing configuration that is similar to what you need.
+
+There are three types of configurations:
+
+* Type 1 is purely path based. An example is greenjobs.de.
+* Type 2 mixes paths and JavaScript functions. An example is giz.
+* Type 3 is purely JavaScript based. An example is ted.europa.eu.
+
+Type 1:
+
+Start by collecting every variable,
+working from top to bottom.
+
+### var domain
+
+domain is the variable for the root of the website.
+If links have to be joined into full URLs, they are joined onto this root.
+
+### var entry-list
+
+Next come all the variables for the entry list pages.
+
+#### var link1, link2 and iteration-var-list
+
+In pseudocode, these three variables are used as follows:
+
+```
+for n in iteration-var-list:
+    get(link1 + n + link2)
+```
+
+If the site works without JavaScript, you are lucky: that is all that is needed to collect the links.
+
+
+An example to make this clearer:
+Say we go to greenjobs.de
+and run a search without a search query, to get the largest possible output, in the best case a table of everything the site has listed.
+
+https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=
+is the resulting URL.
+
+Now we navigate through the pages.
+In this case everything is displayed and scrollable on exactly this URL, so we leave link2 and iteration-var-list empty and put the resulting URL into link1.
+
+Another example: this time we go to giz.
+There we have https://ausschreibungen.giz.de/Satellite/company/welcome.do as the URL for a general search. Clicking the "next page" button of the displayed table takes us to the next page, where a new URL pattern appears:
+
+https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
+
+Going to the next page again, we get the URL:
+
+https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3
+
+Now we can already see the pattern, which no machine-generated output can hide: only the trailing page number changes.
+
+So we set that value to 1, as in selectedTablePagePROJECT_RESULT=1, and put the URL into the browser's address bar:
+
+https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1
+
+and get to the first page.
+Assuming there are 6 pages, this leads to the following variables:
+
+* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
+* link2 = ""
+* iteration-var-list = "[1,2,3,4,5,6]"
+
+
+With these variables configured, we can move on to the next one.
+
+#### var parent
+
+The parent means
\ No newline at end of file