Step by Step Guide continuation of describing the variables

alpcentaur 2024-03-01 00:09:38 +01:00
parent 483eaec26e
commit eca77f9b63


@@ -148,7 +148,7 @@ for n in iteration var list:
So if you are on the no-javascript side of reality, you are lucky. That's all that is needed to get the collection of links.
#### example 1 var link1
Let's say we go on greenjobs.de.
We go to the search page without a search query, to get the biggest displayed output, in the best case a table of everything the site has listed.
@@ -158,7 +158,7 @@ is the resulting url.
So now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this url. Which means we leave link2 and the iteration-var-list empty, and put the resulting url into link1.
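To make the mechanics concrete, here is a minimal sketch (an assumption about how the three variables combine, not the spider's actual code) of why an empty link2 and iteration-var-list collapse everything to just link1:

```python
# Sketch only: hypothetical illustration of how link1, link2 and
# iteration-var-list combine into the list of pages to download.
link1 = "<resulting url of the empty search>"  # placeholder, see example 1
link2 = ""               # empty: nothing is appended after the iteration value
iteration_var_list = []  # empty: everything lives on a single page

if iteration_var_list:
    urls = [f"{link1}{n}{link2}" for n in iteration_var_list]
else:
    urls = [link1]

print(urls)  # -> ['<resulting url of the empty search>']
```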
#### example 2 var link1
This time we go on the GIZ. There we have https://ausschreibungen.giz.de/Satellite/company/welcome.do as our url for a general search. If I click on the "next page" button of the displayed table, a new url pattern appears for the next page:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
@@ -181,8 +181,40 @@ Which leads to the following variables, considering that there were 6 pages:
* iteration-var-list = "[1,2,3,4,5,6]"
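As an illustration (assuming link1 holds everything up to the page number and link2 stays empty, which is an assumption and not confirmed here), the url construction for the six GIZ pages could be sketched like this:

```python
# Sketch only: assumed split of the GIZ url into link1 + iteration value.
link1 = ("https://ausschreibungen.giz.de/Satellite/company/welcome.do"
         "?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2"
         "&tableSortAttributePROJECT_RESULT=publicationDate"
         "&selectedTablePagePROJECT_RESULT=")
link2 = ""
iteration_var_list = [1, 2, 3, 4, 5, 6]

urls = [f"{link1}{n}{link2}" for n in iteration_var_list]
# urls[1] is the page-2 url shown above, urls[5] is the last page.
```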
#### var jslink1, jslink2, jsvar-iteration-list
It can happen that it is not possible to see a pattern in the urls. Probably because the website host is not smart, or is just a thief in a bad sense. In this case you will be forced to use the selenium part of the spider.
This means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by xpath too (with some double slashes).
You can run the spider with display=1 instead of display=0 in the python line that sets up the virtual display the chromium driver is running on. I will put that in the initialization of the spider.
By running the spider while watching the automated mouse moves and clicks, you will be able to find the right xpath for every step and element.
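For orientation, a minimal sketch of such a selenium step (the url, the xpath and the driver setup are placeholders for illustration, not the spider's actual code):

```python
# Sketch only: click a "next page" button by xpath in a virtual display.
# The xpath is a placeholder; find the real one by watching the visible
# virtual browser and inspecting the clicked element.
from selenium import webdriver
from selenium.webdriver.common.by import By
from pyvirtualdisplay import Display

display = Display(visible=1, size=(1200, 800))  # visible=0 hides the screen
display.start()

driver = webdriver.Chrome()
driver.get("https://example.org/search")  # placeholder url
next_button = driver.find_element(By.XPATH, "//a[contains(@class, 'next')]")
next_button.click()

driver.quit()
display.stop()
```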
#### var parent
The parent stands for the last xml element which contains the entry links. Go with the Inspector to an entry, respectively to one of the links, click on it and then click into the code view. Now use the arrow up key to get to the last child before you reach the parent. You can see it highlighted in blue in the rendered html.
From the last child, when it is selected in the code view of the Inspector, open the context menu on it.
Select Copy --> Full XPath.
Now paste the xpath into the parent variable, and put a double slash in front of the last element in the path.
#### example 1 var parent
For the list of the Gesellschaft für Internationale Zusammenarbeit, the last child before the parent is
//html//body//div//div//table[contains(@class, 'csx-new-table')]//tbody//tr
a tr, which stands for a row in html. This, because GIZ is cooperative and ordered. You need the //tr because that says "search for any tr", which means we are referring to a list of elements.
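To see why the trailing //tr matters, here is a small sketch using lxml as an illustration (the html snippet is made up; the spider's internals may differ):

```python
# Sketch only: a parent xpath ending in //tr selects every table row,
# so it yields a list of entry elements instead of a single node.
from lxml import html

snippet = """
<table class="csx-new-table"><tbody>
  <tr><td><a href="/entry1">Entry 1</a></td></tr>
  <tr><td><a href="/entry2">Entry 2</a></td></tr>
</tbody></table>"""

tree = html.fromstring(snippet)
rows = tree.xpath("//table[contains(@class, 'csx-new-table')]//tbody//tr")
print(len(rows))  # -> 2, one element per entry row
```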
#### vars children
For the children it is about defining, in xpath syntax, where the text lies that we want to be parsed automatically. child-name, child-link, child-period, child-info and child-sponsor are defined so far.
In the future it will be possible to define any variables anywhere and get them fed into the json output.
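As an illustration of the idea (the relative xpaths and the html are made up; only the child variable names come from the guide), the children are evaluated relative to every parent row and collected into the json output:

```python
# Sketch only: apply child xpaths relative to each parent row and collect
# the results. The relative xpaths below are illustrative examples.
from lxml import html
import json

snippet = """
<table class="csx-new-table"><tbody>
  <tr><td><a href="/entry1">Solar fund</a></td><td>2024-06-30</td></tr>
  <tr><td><a href="/entry2">Wind fund</a></td><td>2024-12-31</td></tr>
</tbody></table>"""

tree = html.fromstring(snippet)
rows = tree.xpath("//table[contains(@class, 'csx-new-table')]//tbody//tr")

entries = []
for row in rows:
    entries.append({
        "child-name": row.xpath(".//a/text()")[0],
        "child-link": row.xpath(".//a/@href")[0],
        "child-period": row.xpath(".//td[2]/text()")[0],
    })

print(json.dumps(entries, indent=2))
```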
#### var slow-downloading
Slow downloading comes into play when a website uses lazy loading and/or is used by so many users that it loads very slowly. In this case the selenium part, like any normal user, runs into timing problems, which leads to a much longer processing time when harvesting the lists. Depending on the configuration of the firewall of the server you are harvesting, there may even be a time-based limit. And again, we can just act as a lot of different instances, and everyone gets a part of it to download. In this sense, even under future totalitarian systems that may come, the freeing of information from big platforms is and will always be possible in peer-to-peer systems.
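A minimal sketch of what slowing the selenium part down can look like (the flag name, the wait time and the url are illustrative, not the spider's actual configuration):

```python
# Sketch only: give lazily loaded pages extra time before reading them.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

SLOW_DOWNLOADING = True  # illustrative flag

driver = webdriver.Chrome()
driver.get("https://example.org/slow-list")  # placeholder url

if SLOW_DOWNLOADING:
    # wait until the table rows are actually present, up to 60 seconds
    WebDriverWait(driver, 60).until(
        EC.presence_of_element_located((By.XPATH, "//table//tr"))
    )
    time.sleep(5)  # extra grace period for lazy loading

page_source = driver.page_source
driver.quit()
```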