From e4fa13d29d4b804ab6f58eb2b92b9fa6b304b983 Mon Sep 17 00:00:00 2001
From: alpcentaur
Date: Wed, 28 Feb 2024 17:17:27 +0100
Subject: [PATCH] Start of Step by Step Guide Oi

---
 README.md | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/README.md b/README.md
index 4c6dcad..970103e 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@
 3. [Usage](#usage)
   * [Configuration File Syntax](#configuration-file-syntax)
   * [Efficient Xpath Copying](#efficient-xpath-copying)
+  * [Step By Step Guide](#step-by-step-guide)
 
 # Introduction
 
@@ -111,3 +112,43 @@ slashes.
 
 That will make the spider more stable, in case the websites html/xml
 gets changed for maintenance or other reasons.
+
+## Step By Step Guide
+
+Start with an old configuration that is similar to what you need.
+
+There are three types of configurations:
+
+The first type is purely path based. An example is greenjobs.de.
+The second type is a mixture of path and JavaScript functions; giz is an example of this type.
+The third type is purely JavaScript based. An example is ted.europe.eu.
+
+Type 1:
+
+Start by collecting every variable, from top to bottom.
+
+### var domain
+
+domain is the variable for the root of the website.
+In case relative links have to be glued together, they are glued based on this root.
+
+### var entry-list
+
+Now come all the variables regarding the entry list pages.
+
+#### var link1, link2 and iteration-var-list
+
+In pseudocode, what happens with these three variables is
+
+```
+for n in iteration-var-list:
+    get(link1 + n + link2)
+```
+
+So if you are on the no-JavaScript side of reality, you are lucky: that is all that is needed to get the collection of links.
+
+We can move straight on to
+
+#### var parent
+
+Oi
\ No newline at end of file
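
For the purely path based case (Type 1), the pseudocode in the guide above can be pictured as a small plain-Python loop. The sketch below is only an illustration of that idea, not the spider's actual code: the example values for domain, link1, link2 and iteration-var-list are made up, and it assumes the `requests` library for the GET calls.

```
# Illustrative sketch of the Type 1 entry-list loop described in the guide.
# Not the spider's actual implementation: the config values below
# (domain, link1, link2, iteration_var_list) are hypothetical examples.
from urllib.parse import urljoin

import requests

# Hypothetical config values for a purely path based (Type 1) site.
domain = "https://www.example.org"            # var domain: root of the website
link1 = "https://www.example.org/jobs?page="  # var link1: URL part before the iterator
link2 = "&sort=date"                          # var link2: URL part after the iterator
iteration_var_list = [1, 2, 3]                # var iteration-var-list

entry_list_pages = []
for n in iteration_var_list:
    # get(link1 + n + link2) from the pseudocode in the guide
    url = f"{link1}{n}{link2}"
    response = requests.get(url, timeout=30)
    entry_list_pages.append(response.text)

# If an extracted link is relative, it gets glued onto the root (var domain).
relative_link = "/entry/123"                  # hypothetical extracted href
absolute_link = urljoin(domain, relative_link)
print(absolute_link)                          # -> https://www.example.org/entry/123
```

The last lines use urljoin only to show what "gluing" a relative link onto the root of the website means for var domain; the loop itself is exactly what the pseudocode abbreviates.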