fixed the links for the clickable content summary

master
alpcentaur 6 months ago
parent
commit
ff0fe5193d
1 changed files with 8 additions and 7 deletions
  1. README.md (+8 -7)

README.md

@ -24,6 +24,7 @@
- [var jslink and jsiteration](#var-jslink-and-jsiteration)
- [var parent](#var-parent)
- [example1 parent](#example1-parent)
- [vars children](#vars-children)
- [var slow downloading](#var-slow-downloading)
# Introduction
@ -141,11 +142,11 @@ From up to down.
domain is the variable for the root of the website.
In case links are glued, they will be glued based on the root.
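A minimal sketch of what "gluing" a link onto the root could look like, assuming it means resolving a relative link against `domain`. The function name `glue_link` and the `example.org` URLs are hypothetical and not taken from the spider's code.

```python
from urllib.parse import urljoin


def glue_link(domain, link):
    """Resolve a possibly relative link against the root of the website.

    Hypothetical helper: absolute links are returned unchanged,
    relative ones are attached to the domain root.
    """
    return urljoin(domain, link)


# Placeholder URLs, only for illustration.
relative = glue_link("https://example.org", "/jobs/123")
absolute = glue_link("https://example.org", "https://other.example/x")
print(relative)
print(absolute)
```

`urljoin` from the standard library is used here because it already leaves absolute links alone; the spider itself may glue links differently.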
-### var entry-list
+### var entry list
Now come all the variables regarding the entry list pages.
-#### var link1, link2 and iteration-var-list
+#### var link and iteration
In pseudocode, what happens with these three variables is
@ -157,7 +158,7 @@ for n in iteration var list:
So if you are on the no-JavaScript side of reality, you are lucky. That's all that is needed to get the collection of links.
-#### example 1 var link1
+#### example1 link
Let's say we go to greenjobs.de.
We run the search without a search query, to get the largest possible output, in the best case a table of everything the site has listed.
@ -167,7 +168,7 @@ is the resulting url.
So now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this URL, which means we leave link2 and iteration var list empty and put the resulting URL into link1.
-#### example 2 var link1
+#### example2 link
This time we go to GIZ. There, https://ausschreibungen.giz.de/Satellite/company/welcome.do is our URL for a general search. Clicking the "nextpage" button of the displayed table reveals a new URL pattern on the next page:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
@ -190,7 +191,7 @@ Which leads to the following variables, considering that there were 6 pages:
* iteration-var-list = "[1,2,3,4,5,6]"
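The loop over the iteration list could be sketched as follows. This assumes each page URL is built as link1 + n + link2, which is a guess, since the body of the pseudocode loop is not visible here. The function name `build_entry_list_urls` is hypothetical, and splitting the GIZ URL right after `selectedTablePagePROJECT_RESULT=` is likewise an assumption for illustration.

```python
def build_entry_list_urls(link1, link2="", iteration_var_list=()):
    """Build one entry-list URL per iteration value.

    Assumption: a page URL is link1 + n + link2. With an empty
    iteration list, link1 alone is the only page (as in example 1).
    """
    if not iteration_var_list:
        return [link1]
    return [f"{link1}{n}{link2}" for n in iteration_var_list]


# Hypothetical split of the GIZ URL: everything before the page number.
giz_link1 = (
    "https://ausschreibungen.giz.de/Satellite/company/welcome.do"
    "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
    "&tableSortAttributePROJECT_RESULT=publicationDate"
    "&selectedTablePagePROJECT_RESULT="
)

urls = build_entry_list_urls(giz_link1, "", [1, 2, 3, 4, 5, 6])
for url in urls:
    print(url)
```

Under these assumptions the second generated URL is exactly the "nextpage" URL quoted above.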
-#### var jslink1, jslink2, jsvar-iteration-list
+#### var jslink and jsiteration
Sometimes it is not possible to see a pattern in the URLs, probably because the website hoster is not smart or just a thief in a bad sense. In this case you will be forced to use the Selenium part of the spider.
@ -210,7 +211,7 @@ From the last child, when it is selected in the code view of the Inspector, do a
Select copy --> full xpath.
Now copy and paste the XPath into the parent variable, and put a double slash in front of the last element in the route.
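The double-slash step can be sketched as a small string operation. This reads "put a double slash in front of the last element" as replacing the single slash before the last path element with `//`, which is an interpretation rather than something the spider's code confirms. The function name `to_parent_xpath` and the sample XPath are hypothetical.

```python
def to_parent_xpath(full_xpath):
    """Turn a copied full XPath into a parent expression.

    Assumption: the single slash before the last element of the
    route is replaced by a double slash.
    """
    head, sep, last = full_xpath.rstrip("/").rpartition("/")
    if not sep or not last:
        raise ValueError(f"not a usable full XPath: {full_xpath!r}")
    return f"{head}//{last}"


# Made-up XPath, as it might come out of the browser Inspector.
copied = "/html/body/div[2]/table/tbody/tr"
print(to_parent_xpath(copied))
```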
-#### example 1 var parent
+#### example1 parent
For the List of the Gesellschaft für internationale Zusammenarbeit the last child before the parent is
@ -224,6 +225,6 @@ for the children it is about to define in xpath syntax, where the text lies that
In the future it will be possible to define any variables anywhere and have them fed into the JSON output.
-#### var slow-downlading
+#### var slow downlading
slow downloading comes into play when a website uses lazy loading and/or is used by so many users that it loads very slowly. In this case the Selenium part, like any normal user, runs into timing problems, which leads to a much longer processing time when harvesting the lists. Depending on the firewall configuration of the server you are harvesting, there may even be a time limit. And again, we can just act as a lot of different instances, where each one only gets a part to download. In this sense, even for a future totalitarian system that may come, freeing information from big platforms is and always will be possible in peer-to-peer systems.
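One common way to cope with such timing problems is to poll until the content has shown up, with an upper time limit. The sketch below is plain Python and only illustrates the idea; it is not how the spider's slow downloading variable is known to work, and `wait_until` is a hypothetical helper name.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns True or the timeout expires.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the lazily loaded list has appeared": true on the third check.
checks = {"count": 0}


def list_has_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(list_has_loaded, timeout=2.0, interval=0.01))
```

With Selenium, the condition would typically be a check that the expected elements are present on the page.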