clarifications for javascript link and js link plus js iteration

2024-03-03 18:24:07 +01:00 · 2024-03-03 18:24:07 +01:00 · 0908ccf6e5
commit 0908ccf6e5
parent ff0fe5193d
1 changed files with 7 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -193,9 +193,11 @@ Which leads to the following variables, considering that there were 6 pages:

 #### var jslink and jsiteration

-Sometimes there is no recognizable pattern in the URLs, probably because the website host is not smart or is just a thief in the bad sense. In this case you will be forced to use the Selenium part of the spider.
+Sometimes there is no recognizable pattern in the URLs, probably because the website host is not smart or is just a thief in the bad sense. In this case you only get HTML gibberish. To get the desired information with this program, you can give the config.yaml the paths of clickable items. The spider will then open an actual browser and click through the pages as they come into existence.

-This means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by XPath too (with some double slashes).
+Sometimes the XPath changes after the new JS content has loaded. That is where the jsiteration list comes in: with it you define the button that gets clicked on each page. Sometimes the XPath stays the same; then you just need to repeat the same number exactly as many times as there are entries in the iteration variable, which defines how many pages are downloaded overall. The jsiteration list also defines the folder structure of the JSON output files.
+
+This means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by XPath too, so put them accordingly in the jslink and jsiteration variables.

 You can run the spider with display=1 instead of display=0 in the Python line that sets up the virtual display the Chromium driver runs on. I will put that into the initialization of the spider.
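
The click-through idea described in this hunk can be sketched roughly as follows. This is only one possible reading of the text, not the spider's actual code: it assumes `jslink` is a single XPath that may contain a `{}` placeholder and that `jsiteration` is a list of numbers, one per page. The function name `click_through` and the two callbacks `click` and `page_source` are hypothetical stand-ins for the browser.

```python
from typing import Callable, List, Sequence


def click_through(
    jslink: str,
    jsiteration: Sequence[int],
    click: Callable[[str], None],
    page_source: Callable[[], str],
) -> List[str]:
    """Click one XPath per page and collect the HTML present after each click."""
    pages: List[str] = []
    for number in jsiteration:
        # Assumption: "{}" in jslink marks where the per-page number goes.
        # Without a placeholder, format() leaves the XPath unchanged, which
        # covers the case where the same button is clicked on every page.
        xpath = jslink.format(number)
        click(xpath)
        pages.append(page_source())
    return pages
```

With Selenium, `click` could wrap `driver.find_element(By.XPATH, xpath).click()` and `page_source` could return `driver.page_source`. Passing them in as callbacks keeps the sketch runnable without a browser.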

@@ -224,6 +226,9 @@ a tr which kind of stands for row in html. This, because GIZ is cooperative and
 for the children, the task is to define in XPath syntax where the text lies that we want parsed automatically. child-name, child-link, child-period, child-info and child-sponsor are defined so far.
 In the future it will be possible to define any variable anywhere and have it fed into the JSON output.

+#### var javascript link
+
+If the whole website to be spidered is JavaScript-generated gibberish, the HTML behind the links has to be downloaded already while the entry list HTML is downloaded with JavaScript. To do so, define in XPath syntax the JavaScript link that needs to be clicked. Clicking it either opens a pop-up whose source code is processed, leads to an actual page whose source code is parsed, or yields a PDF that gets downloaded and parsed to text.

 #### var slow downloading