Merge remote-tracking branch 'refs/remotes/origin/master'

Merging local changes to the code with changes to the README.md on gitea instance
alpcentaur 2024-02-29 13:16:48 +00:00
commit c33dbc37e6


@ -15,6 +15,7 @@
3. [Usage](#usage)
* [Configuration File Syntax](#configuration-file-syntax)
* [Efficient Xpath Copying](#efficient-xpath-copying)
* [Step By Step Guide](#step-by-step-guide)
# Introduction
@ -111,3 +112,77 @@ slashes. That will make the spider more stable, in case the websites
html/xml gets changed for maintenance or other reasons.
## Step By Step Guide
Start from an existing configuration that is similar to what you need.
There are three types of configuration:
The first type is purely path based. An example is greenjobs.de.
The second type is a mixture of paths and JavaScript functions. giz is an example of this type.
The third type is purely JavaScript based. An example is ted.europe.eu.
Type 1:
Start by collecting every variable,
working from top to bottom.
### var domain
`domain` is the variable for the root of the website.
Whenever links need to be glued together, they are glued onto this root.
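A minimal sketch of what gluing a link onto the root could look like. This assumes the spider behaves like Python's standard `urljoin`; the README does not say how the gluing is actually implemented, and `glue_link` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urljoin

# Root of the website, i.e. the value you would put into `domain`.
domain = "https://www.greenjobs.de/"


def glue_link(domain, href):
    """Hypothetical helper: glue a link found on a page onto the root."""
    # urljoin resolves relative links against the root
    # and returns absolute links unchanged.
    return urljoin(domain, href)


print(glue_link(domain, "/angebote/index.html"))
print(glue_link(domain, "https://example.org/job/1"))
```

If the real implementation uses plain string concatenation instead, a trailing slash on `domain` and a leading slash on the link would matter, so it is worth checking one glued link by hand.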
### var entry-list
Next come all the variables for the entry list pages.
#### var link1, link2 and iteration-var-list
In pseudocode, what happens with these three variables is:
```
for n in iteration-var-list:
    get(link1 + n + link2)
```
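The same loop as a small runnable Python sketch that only builds the URLs and does not fetch them. The function name `build_entry_list_urls` and the example values are assumptions for illustration, not part of the spider.

```python
def build_entry_list_urls(link1, link2, iteration_var_list):
    """Return one URL per iteration value: link1 + n + link2."""
    # The f-string turns n into text, so the list may hold
    # integers as well as strings.
    return [f"{link1}{n}{link2}" for n in iteration_var_list]


# Hypothetical example values, not taken from a real configuration.
urls = build_entry_list_urls(
    "https://example.org/jobs?page=", "&sort=date", [1, 2, 3]
)
for url in urls:
    print(url)
```

Printing the URLs first and opening one or two of them in the browser is a cheap way to check the three variables before running the spider.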
If you are on the no-JavaScript side of reality, you are lucky: that is all that is needed to get the collection of links.
An example to make this clearer:
Let's say we go to greenjobs.de.
We run a search without a search query, to get the largest possible output, in the best case a table of everything the site has listed.
https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=
is the resulting URL.
Now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this URL, which means we leave link2 and iteration-var-list empty and put the resulting URL into link1.
Another example:
This time we go to giz. There, https://ausschreibungen.giz.de/Satellite/company/welcome.do is our URL for a general search. Clicking the "nextpage" button of the displayed table takes us to the next page, where a new URL pattern appears:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
Going to the next page again, we get the URL:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3
Now we can already see the pattern, which no machine-generated output can hide.
We change the ending to RESULT=1, put it in the URL bar of the browser:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1
and get to the first page.
This leads to the following variables, given that there were 6 pages:
* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
* link2 = ""
* iteration-var-list = "[1,2,3,4,5,6]"
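Spotting the pattern by eye works, but it can also be sketched in a few lines of Python: compare the URLs of two neighbouring pages and split them around the part that differs. `split_on_difference` is a hypothetical helper, not part of the spider, and it assumes the two URLs differ in exactly one place.

```python
import os


def split_on_difference(url_a, url_b):
    """Split two paginated URLs into (link1, link2) around the differing part."""
    # Everything both URLs share at the start becomes link1.
    prefix = os.path.commonprefix([url_a, url_b])
    rest_a, rest_b = url_a[len(prefix):], url_b[len(prefix):]
    # Everything both remainders share at the end becomes link2
    # (found by comparing the reversed strings).
    suffix = os.path.commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    return prefix, suffix


base = (
    "https://ausschreibungen.giz.de/Satellite/company/welcome.do"
    "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
    "&tableSortAttributePROJECT_RESULT=publicationDate"
    "&selectedTablePagePROJECT_RESULT="
)
link1, link2 = split_on_difference(base + "2", base + "3")
print("link1 =", link1)
print("link2 =", repr(link2))

# Rebuild the six entry list URLs from the variables.
for n in [1, 2, 3, 4, 5, 6]:
    print(link1 + str(n) + link2)
```

One limitation of this sketch: comparing pages such as 12 and 13 would leave the shared digit `1` at the end of link1, so it is safest to compare two single-digit pages.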
With the entry list pages configured, we can move on to
#### var parent
The parent means