
Update README.md with better explanation how to js spider

master
alpcentaur 6 months ago
parent
commit
5000dca314
1 changed files with 22 additions and 4 deletions
README.md

@@ -21,7 +21,9 @@
- [var link and iteration](#var-link-and-iteration)
- [example1 link](#example1-link)
- [example2 link](#example2-link)
- [var jslink and jsiteration](#var-jslink-and-jsiteration)
- [javascript](#javascript)
- [var jsdomain](#var-jsdomain)
- [vars jslink and jsiteration](#vars-jslink-and-jsiteration)
- [var parent](#var-parent)
- [example1 parent](#example1-parent)
- [vars children](#vars-children)
@@ -152,7 +154,7 @@ In Pseudo Code, what's happening with these three variables is
```
for n in iteration var list:
get(link1 + n + link2)
get(link1 glued to n glued to link2)
```
So if you are on the no-javascript side of reality, you are lucky. That's all that is needed to get the collection of links.
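
The loop above can be sketched in Python roughly as follows. This is only an illustration, not the spider's actual code: the function name `build_urls`, the example URL parts, and the choice to parse the quoted list with `ast.literal_eval` are all assumptions.

```python
import ast


def build_urls(link1, link2, iteration_var_list):
    """Glue link1, each iteration value, and link2 into one URL per page."""
    # The README writes the list as a quoted string such as "[1,2,3,4,5,6]".
    # Parsing it with ast.literal_eval is an assumption made for this sketch.
    values = ast.literal_eval(iteration_var_list)
    return [f"{link1}{n}{link2}" for n in values]


# Hypothetical values, chosen only to show the gluing.
urls = build_urls("https://example.org/list?page=", "&sort=asc", "[1,2,3]")
for url in urls:
    print(url)
```

Each resulting URL would then be fetched with a plain GET request.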
@@ -191,11 +193,27 @@ Which leads to the following variables, considering that there were 6 pages:
* iteration-var-list = "[1,2,3,4,5,6]"
#### var jslink and jsiteration
#### javascript
Sometimes it is not possible to see a pattern in the URLs, probably because the website hoster is not smart or is just a thief in a bad sense. In this case you only get HTML gibberish. To get the desired info with the help of this program, you can give the config.yaml the paths of clickable items. The spider will open an actual browser and click through the pages as they come into existence.
Sometimes the XPath changes after the new js content has loaded. That is where the jsiteration-var-list comes in: with it, you define the button that gets clicked on every page. Sometimes the XPath stays the same; then you just need to repeat the same number exactly as many times as there are entries in the iteration-var-list, which defines how many pages will be downloaded overall. The var js-iteration-list defines the folder structure of the json output files.
#### var jsdomain
If jsdomain is "None" (and here it is important to write None and not NONE), the spider will try to get the domains and download elements based on plain GET requests, using a variety of libraries.
But if you have a javascript situation where even the first HTML is already javascript-generated chaos without XML to parse, then you need to put a URL here. With the URL set, the spider will open that website in a virtual graphical browser (using selenium), wait for the javascript to load, and go through the pages by clicking on jslink.
In pseudo code that means: if you fill out the variable jsdomain in the config.yaml, the spider will do
```
for i in jsiteration-var-list:
    click on the jslink string glued to i glued to jslink2
```
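
As a rough Python sketch of that click loop, assuming the XPath of each button is simply jslink1 glued to the iteration value glued to jslink2: the helper names, the example XPath, and the stand-in driver classes are hypothetical. A real Selenium WebDriver could be passed in place of `FakeDriver`, since `find_element("xpath", ...)` uses the same locator string as Selenium's `By.XPATH`. Waiting for the javascript to load is left out here.

```python
import ast


def build_click_xpaths(jslink1, jslink2, jsiteration_var_list):
    """Glue jslink1, each jsiteration value, and jslink2 into one XPath per click."""
    # Parsing the quoted list with ast.literal_eval is an assumption.
    values = ast.literal_eval(jsiteration_var_list)
    return [f"{jslink1}{i}{jslink2}" for i in values]


def click_through(driver, xpaths):
    """Click each XPath in order and return the XPaths that were clicked."""
    clicked = []
    for xpath in xpaths:
        # "xpath" is the locator strategy string behind Selenium's By.XPATH.
        driver.find_element("xpath", xpath).click()
        clicked.append(xpath)
    return clicked


class FakeElement:
    """Stand-in for a browser element, so the sketch runs without a browser."""

    def click(self):
        pass


class FakeDriver:
    """Stand-in for a Selenium WebDriver that only records what was looked up."""

    def __init__(self):
        self.requested = []

    def find_element(self, by, value):
        self.requested.append((by, value))
        return FakeElement()


# Hypothetical XPath parts for a pager whose "next" button moves each page.
xpaths = build_click_xpaths("//ul[@class='pager']/li[", "]/a", "[2,3,4]")
driver = FakeDriver()
clicked = click_through(driver, xpaths)
print(clicked)
```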
#### vars jslink and jsiteration
In jslink1 and jslink2 you have to put the XPath of the button that leads to the next page of entry links to download.
Sometimes the XPath changes after the new js content has loaded. That is where the jsiteration-var-list comes in: with it, you define the button that gets clicked on every page. Sometimes the XPath stays the same; then you just need to repeat the same number exactly as many times as there are entries in the iteration-var-list, which defines how many pages will be downloaded overall. The var iteration-var-list defines the folder structure of the json output files.
This means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by XPath too, so put these accordingly in the jslink and jsiteration variables.
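
For the case where the button's XPath stays the same, the repeated list can be derived from the iteration-var-list instead of being typed by hand. The following is a small sketch under that assumption. Both helper names are hypothetical and not part of the spider.

```python
import ast


def constant_jsiteration(iteration_var_list, button_index):
    """Repeat one button index once per entry of the iteration-var-list."""
    pages = ast.literal_eval(iteration_var_list)
    # Strip spaces so the result has the same shape as "[1,2,3,4,5,6]".
    return str([button_index] * len(pages)).replace(" ", "")


def lengths_match(iteration_var_list, jsiteration_var_list):
    """Check that both lists describe the same number of pages."""
    pages = ast.literal_eval(iteration_var_list)
    clicks = ast.literal_eval(jsiteration_var_list)
    return len(pages) == len(clicks)


js_list = constant_jsiteration("[1,2,3,4,5,6]", 3)
print(js_list)
print(lengths_match("[1,2,3,4,5,6]", js_list))
```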
