clarifications for javascript link and js link plus js iteration

master
alpcentaur 6 months ago
parent
commit
0908ccf6e5
1 changed file with 7 additions and 2 deletions
README.md (+7, -2)

@@ -193,9 +193,11 @@ Which leads to the following variables, considering that there were 6 pages:
#### var jslink and jsiteration
-It happens, that it is not possible to see a pattern in the urls. Probably because the website hoster is not smart or just a thief in a bad sense. In this case you will be forced to use the selenium part of the spider.
+It happens that it is not possible to see a pattern in the urls, probably because the website host is not smart or is just a thief in the bad sense. In this case you only get html gibberish. To get the desired info with the help of this program, you can put the paths of clickable items into the config.yaml. The spider will then open an actual browser and click through the pages as they come into existence.
-Which means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by xpath too ( with some double slashes).
+Sometimes the xpath changes after the new js content has loaded. That is where the jsiteration list comes in: with it you define the button that gets clicked on every page. Sometimes the xpath stays the same; then you just need the same number repeated exactly as many times as there are entries in the var iteration, which defines how many pages will be downloaded overall. The jsiteration list also defines the folder structure of the json output files.
+This means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by xpath too, so put these accordingly in the jslink and jsiteration variables.
You can run the spider with display=1 instead of display=0 in the python line that sets up the virtual display on which the chromium driver runs. I will put that option into the initialization of the spider.
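The jslink / jsiteration description in this hunk could be pictured roughly as follows. This is only a sketch under assumptions, not the spider's actual code: the function `click_through_pages` is hypothetical, and so is the idea that each jsiteration entry is appended to jslink to form the xpath for one page. The only thing taken from the README text is that jsiteration needs exactly as many entries as the var iteration.

```python
from typing import Callable, List


def click_through_pages(
    jslink: str,
    jsiteration: List[str],
    iteration: List[int],
    click: Callable[[str], None],
) -> List[str]:
    """Click one element per page and return the xpaths that were clicked.

    Hypothetical helper: the real spider may combine jslink and
    jsiteration differently.
    """
    # The README asks for one jsiteration entry per entry in iteration.
    if len(jsiteration) != len(iteration):
        raise ValueError(
            f"jsiteration has {len(jsiteration)} entries, "
            f"but iteration has {len(iteration)}"
        )
    clicked = []
    for suffix in jsiteration:
        # Assumption: the per-page part is appended to the base xpath.
        xpath = jslink + suffix
        click(xpath)
        clicked.append(xpath)
    return clicked


if __name__ == "__main__":
    # Stand-in for a browser click; it only prints the xpath.
    click_through_pages("//a[@class='next']", ["[1]", "[1]", "[2]"], [1, 2, 3], print)
```

With Selenium, the `click` argument could be something like `lambda xp: driver.find_element(By.XPATH, xp).click()`, where `driver` is an already started WebDriver and `By` comes from `selenium.webdriver.common.by`.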
@@ -224,6 +226,9 @@ a tr which kind of stands for row in html. This, because GIZ is cooperative and
for the children, the task is to define in xpath syntax where the text lies that we want parsed automatically. So far child-name, child-link, child-period, child-info and child-sponsor are defined.
In the future it will be possible to define arbitrary variables and have them fed into the json output.
+#### var javascript link
+In case the whole website to spider is javascript-generated gibberish, you need to download the htmls behind the links already while downloading the entry list htmls with javascript. To do so, define the javascript link that needs to be clicked in xpath syntax. Clicking it opens either a pop-up whose source code will be processed, an actual page whose source code will be parsed, or a pdf that gets downloaded and parsed to text.
#### var slow downloading
