@@ -15,6 +15,7 @@
3. [Usage](#usage)
* [Configuration File Syntax](#configuration-file-syntax)
* [Efficient Xpath Copying](#efficient-xpath-copying)
* [Step By Step Guide](#step-by-step-guide)

# Introduction

@@ -111,3 +112,43 @@ slashes. That will make the spider more stable, in case the websites
html/xml gets changed for maintenance or other reasons.

## Step By Step Guide

Start with an existing configuration that is similar to what you need.

There are three types of configurations:

1. Purely path based. An example is greenjobs.de.
2. A mixture of paths and JavaScript functions. An example is giz.
3. Purely JavaScript based. An example is ted.europe.eu.

Type 1:

Start by collecting every variable, from top to bottom.

### var domain

`domain` is the variable for the root of the website.
When links have to be glued together into a full URL, they are glued onto this root.
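
As a rough illustration of what gluing a link onto the root can look like, here is a minimal Python sketch. The helper name `glue_link` and the `example.com` URLs are made up for this example, and the spider itself may glue links differently (for instance by plain string concatenation). The sketch uses `urljoin` from the standard library.

```python
from urllib.parse import urljoin


def glue_link(domain: str, link: str) -> str:
    """Hypothetical helper: glue a link found on a page onto the site root."""
    # urljoin resolves relative links against the root and leaves
    # links that are already absolute untouched.
    return urljoin(domain, link)


# Placeholder values, not taken from a real configuration.
print(glue_link("https://www.example.com", "/jobs/123"))
```

With a clean root in `domain`, relative links from the entry list pages resolve to full URLs, while links that are already absolute pass through unchanged.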

### var entry-list

Next come all the variables for the entry list pages.

#### var link1, link2 and iteration-var-list

In pseudocode, this is what happens with these three variables:

```
for n in iteration-var-list:
    get(link1 + n + link2)
```

If the site works without JavaScript, you are in luck: that is all that is needed to get the collection of links.
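
The loop above can be sketched in runnable Python roughly as follows. This is only an illustration under assumptions: the function name `build_entry_list_urls`, the `example.com` URL parts and the page range are placeholders, and the sketch only builds the URLs instead of requesting them.

```python
def build_entry_list_urls(link1, link2, iteration_var_list):
    """Hypothetical helper: build one entry list URL per iteration value."""
    # Each value n is placed between the fixed prefix (link1)
    # and the fixed suffix (link2).
    return [f"{link1}{n}{link2}" for n in iteration_var_list]


# Placeholder values, not taken from a real configuration.
urls = build_entry_list_urls(
    "https://www.example.com/jobs?page=",
    "&sort=date",
    range(1, 4),
)
for url in urls:
    # The spider would request each of these pages in turn.
    print(url)
```

Here `link1` is everything in the URL before the changing part, `link2` is everything after it, and the iteration list supplies the changing part, such as page numbers.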

We can move straight on to:

#### var parent

Oi