Merge remote-tracking branch 'refs/remotes/origin/master'
Merging local changes to the code with changes to the README.md on gitea instance
commit
c33dbc37e6
1 changed files with 75 additions and 0 deletions
75
README.md
@ -15,6 +15,7 @@
3. [Usage](#usage)
* [Configuration File Syntax](#configuration-file-syntax)
* [Efficient Xpath Copying](#efficient-xpath-copying)
* [Step By Step Guide](#step-by-step-guide)

# Introduction

@ -111,3 +112,77 @@ slashes. That will make the spider more stable, in case the websites
html/xml gets changed for maintenance or other reasons.

## Step By Step Guide

Start with an existing configuration that is similar to what you need.

There are three types of configurations:

* The first type is purely path based. An example is greenjobs.de.
* The second type is a mixture of paths and JavaScript functions. giz is an example of this type.
* The third type is purely JavaScript based. An example is ted.europe.eu.

Type 1:

Start by collecting every variable, working from top to bottom.

### var domain

`domain` is the variable for the root of the website.
Whenever links have to be joined into full URLs, they are joined onto this root.

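As a rough illustration of what joining a link onto the root means, the sketch below uses Python's standard `urllib.parse.urljoin`. Whether the spider itself uses this function is an assumption, and the relative path `angebote/stelle-123.html` is made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical values: the domain would come from the configuration,
# the relative link is what a scraped href might look like.
domain = "https://www.greenjobs.de/"
relative_link = "angebote/stelle-123.html"

# urljoin resolves the relative link against the root of the website.
full_link = urljoin(domain, relative_link)
print(full_link)
```

Links that are already absolute pass through `urljoin` unchanged, so the same call can be applied to every collected link.
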
### var entry-list

Next come all the variables for the entry list pages.

#### var link1, link2 and iteration-var-list

In pseudocode, this is what happens with these three variables:

```
for n in iteration-var-list:
    get(link1 + n + link2)
```

If the site works without JavaScript, you are in luck: that is all that is needed to collect the links.

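The pseudocode loop above could look like this in Python. This is a minimal sketch, not the spider's actual code: the function name `build_entry_list_urls` and the `example.org` values are hypothetical, and the URLs are only printed where the spider would fetch them.

```python
def build_entry_list_urls(link1, link2, iteration_var_list):
    """Return one URL per iteration value, built as link1 + n + link2."""
    return [f"{link1}{n}{link2}" for n in iteration_var_list]


# Hypothetical placeholder values, only to show the shape of the result.
urls = build_entry_list_urls(
    "https://example.org/list?page=", "&sort=date", [1, 2, 3]
)
for url in urls:
    # A real spider would request the page here instead of printing.
    print(url)
```

Building the list of URLs first and fetching afterwards keeps the URL pattern easy to check by eye before any request is sent.
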
An example to make this clearer:

Let's say we go to greenjobs.de.
We run a search without a search query to get the largest possible output, ideally a table of everything the site has listed.

https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=

is the resulting URL.

Now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this URL, which means we leave link2 and iteration-var-list empty and put the resulting URL into link1.

Another example:

This time we go to giz. There, https://ausschreibungen.giz.de/Satellite/company/welcome.do is the URL for a general search. Clicking the "next page" button of the displayed table leads to the second page, where a new URL pattern appears:

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2

Going to the next page again, we get the URL:

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3

Now we can already see the pattern, which no machine-generated output can hide: only the number at the end changes.

So we try RESULT=1 and put it into the URL bar of the browser

https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1

and get to the first page.
This leads to the following variables, assuming there are 6 pages:

* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
* link2 = ""
* iteration-var-list = "[1,2,3,4,5,6]"

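Spotting the pattern can also be done mechanically. The sketch below compares the two observed giz URLs and splits them at the part that differs, which gives candidates for link1 and link2. The helper `split_on_difference` is a hypothetical name and not part of the spider. The approach assumes exactly one changing part, so always check the result by eye (comparing pages 12 and 13, for example, would pull the shared digit into link1).

```python
import os

BASE = (
    "https://ausschreibungen.giz.de/Satellite/company/welcome.do"
    "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
    "&tableSortAttributePROJECT_RESULT=publicationDate"
    "&selectedTablePagePROJECT_RESULT="
)
page2 = BASE + "2"  # URL observed on the second page
page3 = BASE + "3"  # URL observed on the third page


def split_on_difference(url_a, url_b):
    """Return (common prefix, common suffix) around the differing part."""
    prefix = os.path.commonprefix([url_a, url_b])
    rest_a, rest_b = url_a[len(prefix):], url_b[len(prefix):]
    # Common suffix: reverse both remainders, take the common prefix, reverse back.
    suffix = os.path.commonprefix([rest_a[::-1], rest_b[::-1]])[::-1]
    return prefix, suffix


link1, link2 = split_on_difference(page2, page3)
iteration_var_list = [1, 2, 3, 4, 5, 6]  # 6 pages, as in the example above
urls = [f"{link1}{n}{link2}" for n in iteration_var_list]
print(link1)
print(repr(link2))
print(urls[0])
```
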
With this part of the configuration done, we can move on to

#### var parent
The parent means