Compare commits

...

40 commits

Author SHA1 Message Date
alpcentaur
2aa1134b48 updated gitignore 2024-03-05 14:56:03 +00:00
alpcentaur
a9c2346c04 first change to also click Accept Button in English if may come for js spidering functionality 2024-03-05 14:50:34 +00:00
alpcentaur
0808e5a42d main.py and config.yaml are left out from updates, only examples are provided. Change in Readme too 2024-03-05 14:42:30 +00:00
alpcentaur
4ec9f76080 added xorg-server-xephyr as dep to install 2024-03-05 14:52:02 +01:00
alpcentaur
10cdab6f60 updated README with new and working install order 2024-03-05 14:43:43 +01:00
alpcentaur
ccfe20044f added another tip to README.md, header for display and another tip added too 2024-03-05 12:34:32 +01:00
alpcentaur
0fa420d74c added explanation of display variable in the spiders code 2024-03-05 12:30:29 +01:00
alpcentaur
0d7728240e update var javascriptlink in README.md 2024-03-04 17:13:33 +01:00
alpcentaur
c52ea0cf0a added example1 for js configuration in README.md 2024-03-04 16:46:57 +01:00
alpcentaur
5000dca314 Update README.md with better explanation how to js spider 2024-03-04 16:30:31 +01:00
alpcentaur
0908ccf6e5 clarifications for javascript link and js link plus js iteration 2024-03-03 18:24:07 +01:00
alpcentaur
ff0fe5193d fixed the links for the clickable content summary 2024-03-03 17:57:48 +01:00
alpcentaur
49d5c2ffa9 third try ordering 2024-03-03 17:53:05 +01:00
alpcentaur
f489106ea0 second try ordering 2024-03-03 17:50:12 +01:00
alpcentaur
32fceffd01 searchable headers for step by step guide started 2024-03-03 17:48:52 +01:00
alpcentaur
eca77f9b63 Step by Step Guide continuation of describing the variables 2024-03-01 00:09:38 +01:00
alpcentaur
483eaec26e changed domain for new configuration dtvp 2024-02-29 14:19:45 +01:00
alpcentaur
c33dbc37e6 Merge remote-tracking branch 'refs/remotes/origin/master'
Merging local changes to the code with changes to the README.md on gitea instance
2024-02-29 13:16:48 +00:00
alpcentaur
a07d2e93f6 changes for new database dtvp, new exceptions trying to click away cookie pop ups 2024-02-29 13:15:34 +00:00
alpcentaur
d284fef015 changes for new database dtvp, new exceptions trying to click away cookie pop ups 2024-02-29 13:15:01 +00:00
alpcentaur
5fd6b7f781 Part 2 of Step by Step Guide 2024-02-28 17:34:57 +01:00
alpcentaur
e4fa13d29d Start of Step by Step Guide
Oi
2024-02-28 17:17:27 +01:00
alpcentaur
7ba196b0c2 changed size of virtual window, added some scrolling and shortened the time for js lazy loading enforced slow downloading 2024-02-11 17:08:33 +00:00
alpcentaur
a56569712e another small change to config.yaml before pushing 2024-02-11 16:43:44 +00:00
alpcentaur
a0dd469f25 added new database ted.europe.eu, created new case of slow downloading, intergrated scrolling into entrylistpagesdownload 2024-02-09 18:38:49 +00:00
alpcentaur
094f092291 deleted fdb entry that was a ghost for syntax reasons, but same syntax should be in other fdb anyway 2024-01-23 17:17:40 +01:00
alpcentaur
d7d157bf42 added further dokumentation to README.md 2024-01-21 14:07:38 +00:00
alpcentaur
0500f5853d full working example from localhost 2024-01-15 21:08:23 +00:00
alpcentaur
0411d74936 deleted config.yaml.save 2024-01-15 19:12:04 +00:00
alpcentaur
cf3bb52684 corrected link glueing for pdf links for loop 2024-01-15 19:09:28 +00:00
alpcentaur
af8374f715 added other exception for unitrue var text not being found, before saving index 0 to variable produced error to whole execution 2024-01-10 15:28:41 +00:00
alpcentaur
20db0028e1 added first changes to fix js related bug for giz db 2024-01-10 15:18:36 +01:00
alpcentaur
dec60f9bf5 added changed logic for link addition regarding entry links 2023-12-18 21:26:53 +00:00
alpcentaur
5d17f4e421 corrected error which arised in logic of wget backup get 2023-12-15 14:36:08 +01:00
alpcentaur
92c238a2ed added instruction for downloading chromium driver for python selenium to README.md 2023-12-15 14:13:41 +01:00
alpcentaur
ece5cf1301 added better logic for getting the right link of entry 2023-12-15 13:34:23 +01:00
alpcentaur
0e58756600 added last resort exception for entry page downloading with wget, also implemented some further logic regarding getting the right links 2023-12-15 11:33:50 +00:00
alpcentaur
16199256e3 javascript on highest level done better 2023-12-14 23:37:10 +00:00
alpcentaur
5627c80177 merged onlinkgen with master, and added more universal chrome driver initialization to the beginning of the javascript entries gothrough function in download_entry_list_pages_of_funding_databases() 2023-12-14 12:38:14 +00:00
alpcentaur
14b8db7941 started adding javascript handling on highest spider level 2023-12-14 12:07:14 +00:00
19 changed files with 38240 additions and 17563 deletions

7
.gitignore vendored
View file

@ -1,3 +1,10 @@
spiders/config.yaml
main.py
/venv
/spiders/pages/**
/spiders/output/**
/spiders/config.yaml
/spiders/output_old
/spiders/pages_old
spiders/__pycache__

319
README.md
View file

@ -9,12 +9,321 @@
|_|
```
Configure fdb-spider in a yaml file.
Spider Multi page databases of links.
Filter and serialize content to json.
Filter either by xpath syntax.
Or filter with the help of Artificial Neural Networks (work in progress).

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Usage](#usage)
* [Configuration File Syntax](#configuration-file-syntax)
* [Efficient Xpath Copying](#efficient-xpath-copying)
* [Step By Step Guide](#step-by-step-guide)
- [var domain](#var-domain)
- [var entry list](#var-entry-list)
- [var link and iteration](#var-link-and-iteration)
- [example1 link](#example1-link)
- [example2 link](#example2-link)
- [javascript](#javascript)
- [var jsdomain](#var-jsdomain)
- [vars jslink and jsiteration](#vars-jslink-and-jsiteration)
- [example1 jslink and jsiteration](#example1-jslink-and-jsiteration)
- [display](#display)
- [another tip](#another-tip)
- [var parent](#var-parent)
- [example1 parent](#example1-parent)
- [vars children](#vars-children)
- [var slow downloading](#var-slow-downloading)
# Introduction
The fdb-spider was made to gather data from websites in an automated way.
The website to be spidered has to be a list of links,
which makes the fdb-spider a web spider for most platforms.
The fdb-spider is configured in a .yaml file to make things easy.
The output of the fdb-spider is in json format, so it is easy to feed
the json into other programs.
At its core, the spider outputs tag-search-based entries.
It works together with the fdb-spider-interface.
In the future, the spider will be extended by the model Sauerkraut,
an !open source! Artificial Neural Network.
# Installation
FIRST install systemwide requirements with your package manager
```
# apt based unixoids
apt install xvfb
apt install chromium
apt install chromium-webdriver
# pacman based unixoids
pacman -S xorg-server-xvfb
pacman -S xorg-server-xephyr
pacman -S chromium
```
THEN clone the repository and create a python3 virtualenv on your favourite
UNIX distribution with the commands
```
git clone https://code.basabuuka.org/alpcentaur/fdb-spider
cd fdb-spider
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
# Usage
Use it step by step. First take care of the htmls of the lists of links.
Then take care of getting the first json output from the first layer of html
pages.
Copy the two example files to the file names under which the spider
expects them
```
cp main.py_example main.py
cp spiders/config.yaml_example spiders/config.yaml
```
## Configuration File Syntax
The configuration file with working syntax template is
```
/spiders/config.yaml
```
Here you can configure new websites to spider, referred to as "databases".
link1 and link2 are the two parts of the links to be iterated; the iteration variable gets glued in between them.
The assumption is that every list of links has a loopable structure.
If links are javascript links, specify js[domain,link[1,2],iteration-var-list].
Otherwise leave them out, but specify jsdomain as 'None'.
You will find parent and children variables for the entry list pages.
Here you have to fill in the xpaths of the entries to be parsed.
In the entry directive, you have to set uniform to either TRUE or FALSE.
Set it to TRUE if all the entry pages have the same template, and you
are able to specify xpaths again to get the text or whatever variable you
would like to extract.
Set it to FALSE if you have diverse pages behind the entries
and just want to get the main text of all the different links.
For your information, the library trafilatura is used to gather
that text for further processing.
In the entry_unitrue directive, you can specify new dimensions and
the json will adapt to your wishes.
Under the entry-list directive this feature still has to be implemented,
so use name, link, javascript-link, info, period and sponsor by commenting
them in or out.
If javascript-link is set (which means the entry is javascript clickable),
link will be ignored.
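Putting these directives together, a single database entry in spiders/config.yaml has roughly the following shape. This is only a sketch: the key names mirror the variables described in this guide and the evergabe-online example further below, while the database name and all values are placeholders.
```
example-fdb:
  domain: 'https://www.example.org'
  entry-list:
    link1: 'https://www.example.org/list?page='
    link2: '&sort=date'
    iteration-var-list: "[1, 2, 3]"
    jsdomain: 'None'
    parent: "//div[@class='results']//ul//li"
    child-name: "//a/text()"
    child-link: "//a/@href"
```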
## Efficient Xpath Copying
When copying the XPath, most modern web browsers are of help.
In Firefox (or browsers built on it like the Tor Browser) you can use
```
ctrl-shift-c
```
to open the "Inspector" in "Pick an element" mode.
When you click on the desired entry on the page,
it opens the actual code of the clicked element in the html search box.
Now make a right click on the code in the html search box, go on "Copy",
and go on XPath.
Now you have the XPath of the element in your clipboard.
When pasting it into the config, try to replace some slashes with double
slashes. That will make the spider more stable, in case the website's
html/xml gets changed for maintenance or other reasons.
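For example, an absolute XPath copied from the Inspector like the first line below can be relaxed into the second one, which still matches the same element but survives small layout changes (both paths are purely illustrative):
```
/html/body/div[2]/main/div/table/tbody/tr[3]/td[1]/a
//main//table//tr[3]//td[1]//a
```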
## Step By Step Guide
Start with an old configuration that is similar to what you need.
There are three types of configurations:
The first type is purely path based. An example is greenjobs.de.
The second type is a mixture of path and javascript functions; giz is an example of this type.
The third type is purely javascript based. An example is ted.europa.eu.
Type 1:
Start by collecting every variable,
from top to bottom.
### var domain
domain is the variable for the root of the website.
In case relative links have to be glued together, they will be glued based on this root.
### var entry list
Now come all the variables regarding the entry list pages.
#### var link and iteration
In pseudo code, what happens with these three variables is
```
for n in iteration var list:
get(link1 glued to n glued to link2)
```
So if you are on the no-javascript side of reality, you are lucky. That is all that is needed to get the collection of links.
#### example1 link
Let's say we go to greenjobs.de.
We search without a search query, to get the biggest possible displayed output, in the best case a table of everything the site has listed.
https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat=
is the resulting url.
So now we navigate through the pages.
In this case everything is displayed and scrollable on exactly this url, which means we leave link2 and iteration-var-list empty and put the resulting url into link1.
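Written as a config sketch (the key layout follows the evergabe-online example further below; the domain value and the way the "empty" iteration-var-list is expressed are assumptions):
```
greenjobs:
  domain: 'https://www.greenjobs.de'
  entry-list:
    link1: 'https://www.greenjobs.de/angebote/index.html?s=&loc=&countrycode=de&dist=10&lng=&lat='
    link2: ''
    # a single empty string keeps the download loop running exactly once
    iteration-var-list: "['']"
    jsdomain: 'None'
```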
#### example2 link
This time we go to giz. There we have https://ausschreibungen.giz.de/Satellite/company/welcome.do as our url for a general search. If we click on the "nextpage" button of the displayed table, a new url pattern appears for the next page:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=2
Going to the next page again, we get the url:
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=3
So now we already see the pattern that any machine generated output cannot hide.
We change the last parameter to selectedTablePagePROJECT_RESULT=1, put it in the url bar of the browser
https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT=1
and get to the first page.
Which leads to the following variables, considering that there were 6 pages:
* link1 = "https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT="
* link2 = ""
* iteration-var-list = "[1,2,3,4,5,6]"
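In the config these three values would land under entry-list roughly like this (a sketch; the domain value is an assumption and the other keys of the giz entry are omitted):
```
giz:
  domain: 'https://ausschreibungen.giz.de'
  entry-list:
    link1: 'https://ausschreibungen.giz.de/Satellite/company/welcome.do?method=showTable&fromSearch=1&tableSortPROJECT_RESULT=2&tableSortAttributePROJECT_RESULT=publicationDate&selectedTablePagePROJECT_RESULT='
    link2: ''
    iteration-var-list: "[1,2,3,4,5,6]"
```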
#### javascript
It happens that it is not possible to see a pattern in the urls, probably because the website hoster is not smart or just a thief in a bad sense. In this case you only get html gibberish. To get the desired info with the help of this program, you have the possibility to give the config.yaml the paths of clickable items. The spider will then open an actual browser and click through the pages as they appear.
#### var jsdomain
If jsdomain is "None" (And here it is important to use None and not NONE), it will try to get the domains and download elements based on the gets with a variety of libraries.
But if you have a javascript situation, where the first html are already javascript generated kaos without xml to parse.. then you need to put an url here. By putting the url, the spider will open that website with a virtual graphical browser (using selenium), wait for the javascript to load, and by clicking on jslink, go through the pages.
In pseudo code that means, if you fill out the variable jsdomain in the config.yaml, the spider will do
```
for i in jsiteration-var-list:
click on the jslink string glued to i glued to jslink2
```
#### vars jslink and jsiteration
In jslink1 and jslink2 you have to put the xpath of the button that leads to the next page of entry links to download.
Sometimes the xpath changes after the new js content has loaded. That is where the jsiteration-var-list comes in: with it, you define the button that gets clicked on every page. Sometimes it stays the same, then you just need to repeat the same number exactly as many times as there are entries in iteration-var-list, which defines how many pages will be downloaded in general. The var iteration-var-list also defines the folder structure of the json output files.
Which means we emulate a whole virtual "user" using a virtual "browser" on his or her virtual "screen". In the end the clickable elements are defined by xpath too, so put these accordingly in the jslink and jsiteration variables.
You can run the spider with visible=1 instead of visible=0 in the python line that creates the virtual display the chromium driver runs on. I will put that in the initialization of the spider. How to do this, in general whenever you use any of the js related variables instead of setting "NONE", is described in the paragraph display. It is very useful for debugging js related configs.
By running the spider while watching the automated mouse moves and clicks, you will be able to find the right xpath for every step and element.
#### example1 jslink and jsiteration
So let us consider evergabe-online as an example.
```
evergabe-online:
  domain: 'https://www.evergabe-online.de/'
  entry-list:
    link1: 'https://www.evergabe-online.de/search.html?101-1.-searchPanel>
    link2: '-pageLink'
    jsdomain: 'https://www.evergabe-online.de/search.html'
    jslink1: '/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span['
    jslink2: ']'
    jsiteration-var-list: "[1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]"
    iteration-var-list: "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]"
```
Go to jsdomain https://www.evergabe-online.de/search.html.
You will see the table we want to spider.
Open the inspector and have a look at the button that gets you to the next page.
Its xpath is '/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span[1]'
Now we click on it. On page two, the button to click us to page three has the xpath:
'/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span[2]'
From page 5 on, the button to get to the next pages stays
'/html/body/div[8]/main/div[4]/div/div/div[2]/table/thead/tr[1]/td/div[2]/div/span[6]'
until the end.
#### display
When you run the spider with js spidering enabled, a display gets created in fdb_spider.py. If you open the file in nano and press ctrl+w, you can type display and press enter. This will bring you to the lines of code generating the display.
Watch out for the line
```
display = Display(visible=0, size=(1200, 800))
```
If you change visible=0 to visible=1 here, the spider will actually open a viewable browser on the workspace while it runs.
This line is present twice in the code: once for downloading the pages with the links, and once for downloading the pages behind the links.
After finding and jumping to "display" with ctrl+w, press alt+w to get to the next occurrence of the display variable definition.
#### another tip
In main.py, where the spider's code gets loaded, you also have a function available called
```
spider.find_config_parameter(list_of_fdbs)
```
This function helps you to find the right config parameters, because it shows you what you get. That can be a bit tricky, because if you get nothing, it does not really help you. But once you have the first ones running, it shows you exactly that: that it is always possible and no magic is needed.
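A minimal main.py that only runs this helper could look like the sketch below (the database name is taken from the commented examples in main.py_example):
```
from spiders.fdb_spider import *

config = "spiders/config.yaml"
list_of_fdbs = ["giz"]

spider = fdb_spider(config)
# show what the configured parent and child parameters currently yield
spider.find_config_parameter(list_of_fdbs)
```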
#### var parent
The parent stands for the last xml element which contains the entry links. Go with the Inspector to the entry, respectively one of the links, click on it and then click in the code view. Now use the arrow up key to get to the last child before the parent begins. You can see the selected element highlighted in blue in the rendered html.
From the last child, when it is selected in the code view of the Inspector, do a right click on it.
Select Copy --> Full XPath.
Now copy paste the xpath to the parent variable, and put a double slash in front of the last element in the route.
#### example1 parent
For the List of the Gesellschaft für internationale Zusammenarbeit the last child before the parent is
//html//body//div//div//table[contains(@class, 'csx-new-table')]//tbody//tr
a tr, which stands for a row in html. This is because GIZ is cooperative and well ordered. You need the //tr because that says "search for any tr", which means we are referring to a list of elements.
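In the config this ends up in the parent variable. A sketch, assuming the key is called parent and sits under entry-list, as the variable names in fdb_spider.py suggest:
```
giz:
  entry-list:
    parent: "//html//body//div//div//table[contains(@class, 'csx-new-table')]//tbody//tr"
```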
#### vars children
For the children, the task is to define in xpath syntax where the text lies that we want to be parsed automatically. child-name, child-link, child-period, child-info and child-sponsor are defined so far.
In the future it will be possible to define any variable anywhere and get it fed into the json output.
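As a sketch, the children sit next to parent in the config; the key names are the ones listed above, while the xpath values and the fact that they are read relative to each parent element are assumptions for illustration:
```
giz:
  entry-list:
    parent: "//table[contains(@class, 'csx-new-table')]//tbody//tr"
    # illustrative xpaths - the real ones depend on the page structure
    child-name: "//td[1]//text()"
    child-link: "//td[1]//a/@href"
    child-info: "//td[2]//text()"
    child-period: "//td[3]//text()"
    child-sponsor: "//td[4]//text()"
```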
#### var javascript link
In case the whole website to spider is javascript generated gibberish, there is another possibility for you. To find out if the website is generated gibberish that does not contain your payload, just search the outputted pages for the child name etc. If you do not find them, or you directly see that the html pages contain no real xml, try to download your sites with jsdomain and go back up to the paragraphs before. For the actual link child, the spider can download the htmls behind the links by javascript-style clicking, already while downloading the entry list htmls with javascript.
For that to happen, you define the javascript link that needs to be clicked in xpath syntax. Whether it opens a pop up whose source code needs to be processed, an actual page whose source code will be parsed, or a pdf, the spider will handle all of these situations and output the resulting text as json under spiders/output.
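A sketch of such an entry, assuming the key is called javascript-link as in fdb_spider.py; the xpath value is only an illustration:
```
giz:
  entry-list:
    # clicked for each entry instead of following the plain link
    javascript-link: "//td[1]//a"
```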
#### var slow downloading
Slow downloading comes into play when a website uses lazy loading and/or is used by so many users that it loads very slowly. In this case the selenium part, like any normal user, runs into timing problems, which leads to a much longer processing time when harvesting the lists. Depending on the configuration of the firewall of the server you are harvesting, there may even be a time-based limit. And again, we can just act as a lot of different instances where everyone gets just a part to download. In this sense, even for future totalitarian systems that may come, the freeing of information from big platforms is and will always be possible in peer to peer systems.
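Based on how fdb_spider.py reads this flag (it compares against the strings 'TRUE' and 'FALSE'), the config entry would look roughly like this sketch:
```
ted.europa.eu:
  entry-list:
    # 'TRUE': load each entry page in the selenium browser and wait a few seconds
    slow-downloading: 'TRUE'
```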

12
main.py
View file

@ -4,21 +4,25 @@ from spiders.fdb_spider import *
import sys
config = "spiders/config.yaml"
-list_of_fdbs = eval(sys.argv[1])
+#list_of_fdbs = eval(sys.argv[1])
+#list_of_fdbs = ["giz","evergabe-online","foerderinfo.bund.de-bekanntmachungen"]
+#list_of_fdbs = ["giz","evergabe-online"]
#list_of_fdbs = ["foerderinfo.bund.de-bekanntmachungen"]
+list_of_fdbs = ["ted.europa.eu"]
+#list_of_fdbs = ["dtvp"]
# doing the crawling of government websites
spider = fdb_spider(config)
-#spider.download_entry_list_pages_of_funding_databases(list_of_fdbs)
+spider.download_entry_list_pages_of_funding_databases(list_of_fdbs)
#spider.find_config_parameter(list_of_fdbs)
spider.parse_entry_list_data2dictionary(list_of_fdbs)
-spider.download_entry_data_htmls(list_of_fdbs)
+#spider.download_entry_data_htmls(list_of_fdbs)
-spider.parse_entry_data2dictionary(list_of_fdbs)
+#spider.parse_entry_data2dictionary(list_of_fdbs)

28
main.py_example Normal file
View file

@ -0,0 +1,28 @@
from spiders.fdb_spider import *
import sys
config = "spiders/config.yaml"
#list_of_fdbs = eval(sys.argv[1])
#list_of_fdbs = ["giz","evergabe-online","foerderinfo.bund.de-bekanntmachungen"]
#list_of_fdbs = ["giz","evergabe-online"]
#list_of_fdbs = ["foerderinfo.bund.de-bekanntmachungen"]
list_of_fdbs = ["ted.europa.eu"]
#list_of_fdbs = ["dtvp"]
# doing the crawling of government websites
spider = fdb_spider(config)
spider.download_entry_list_pages_of_funding_databases(list_of_fdbs)
#spider.find_config_parameter(list_of_fdbs)
spider.parse_entry_list_data2dictionary(list_of_fdbs)
#spider.download_entry_data_htmls(list_of_fdbs)
#spider.parse_entry_data2dictionary(list_of_fdbs)

37308
spider.log Normal file

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

158
spiders/config.yaml_example Normal file

File diff suppressed because one or more lines are too long

spiders/fdb_spider.py
View file

@ -17,6 +17,11 @@ from trafilatura import extract
from pdfminer.high_level import extract_pages from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer from pdfminer.layout import LTTextContainer
import time
import subprocess
class fdb_spider(object): class fdb_spider(object):
def __init__(self, config_file): def __init__(self, config_file):
with open(config_file, "r") as stream: with open(config_file, "r") as stream:
@ -56,6 +61,23 @@ class fdb_spider(object):
e, e,
) )
try:
entry_list_jslink1 = entry_list.get("jslink1")
except Exception as e:
print(
"No jslink1 defined in config.yaml - the original error message is:",
e,
)
entry_list_jslink1 = 'NONE'
try:
entry_list_jslink2 = entry_list.get("jslink2")
except Exception as e:
print(
"No jslink2 defined in config.yaml - the original error message is:",
e,
)
entry_list_jslink2 = 'NONE'
try: try:
entry_iteration_var_list = eval(entry_list.get("iteration-var-list")) entry_iteration_var_list = eval(entry_list.get("iteration-var-list"))
except Exception as e: except Exception as e:
@ -63,9 +85,29 @@ class fdb_spider(object):
"No iteration-var-list defined in config.yaml - the original error message is:", "No iteration-var-list defined in config.yaml - the original error message is:",
e, e,
) )
try:
entry_jsiteration_var_list = eval(entry_list.get("jsiteration-var-list"))
except Exception as e:
print(
"No jsiteration-var-list defined in config.yaml - the original error message is:",
e,
)
try:
entry_jsdomain = entry_list.get("jsdomain")
except Exception as e:
print(
"No jsdomain defined in config.yaml - the original error message is:",
e,
)
entry_jsdomain = 'NONE'
if entry_jsdomain == 'NONE' or entry_jsdomain == 'None':
for i in entry_iteration_var_list: for i in entry_iteration_var_list:
# download the html page of the List of entrys # download the html page of the List of entrys
response = urllib.request.urlopen(entry_list_link1 + str(i) + entry_list_link2) response = urllib.request.urlopen(entry_list_link1 + str(i) + entry_list_link2)
@ -101,6 +143,80 @@ class fdb_spider(object):
f = open("spiders/pages/" + key + str(i) + "entryList.html", "w+") f = open("spiders/pages/" + key + str(i) + "entryList.html", "w+")
f.write(web_content) f.write(web_content)
f.close f.close
else:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
#from selenium.webdriver.common.action_chains import ActionChains
from pyvirtualdisplay import Display
# changed display to 1200, because element was not found in "mobile version" with 800 width
display = Display(visible=0, size=(1200, 800))
display.start()
##outputdir = '.'
##service_log_path = "{}/chromedriver.log".format(outputdir)
##service_args = ['--verbose']
##driver = webdriver.Chrome('/usr/bin/chromium')
options = webdriver.ChromeOptions()
#options.add_argument('headless')
options.add_argument("--remote-debugging-port=9222")
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
service = Service(executable_path='/usr/bin/chromedriver')
driver = webdriver.Chrome(options=options, service=service)
# driver = webdriver.Chrome()
driver.implicitly_wait(5)
driver.get(entry_jsdomain)
try:
accept_button = driver.find_element("xpath","//button[contains(text(), 'akzeptieren')]")
accept_button.click()
except Exception as e:
print(e, 'no cookies to accept..')
pass
try:
accept_button = driver.find_element("xpath","//button[contains(text(), 'Accept')]")
accept_button.click()
except Exception as e:
print(e, 'no cookies to accept..')
pass
for i in range(len(entry_jsiteration_var_list)):
time.sleep(1)
print('trying to get element')
try:
# scroll down, to get the javascript view loading to get the elements
driver.execute_script("scroll(0, 600)")
element = driver.find_element(
"xpath",
entry_list_jslink1
+ str(entry_jsiteration_var_list[i])
+ entry_list_jslink2
)
print(entry_iteration_var_list[i])
time.sleep(1)
print('scrolling..')
# scroll into view, because otherwise with javascript generated elements
# it can be that clicking returns an error
driver.execute_script("arguments[0].scrollIntoView();", element)
print('clicking..')
time.sleep(1)
element.click()
time.sleep(1)
#window_after = driver.window_handles[1]
print('length of the window handles', len(driver.window_handles))
#driver.switch_to.window(window_after)
web_content = driver.page_source
f = open("spiders/pages/" + key + str(entry_iteration_var_list[i]) + "entryList.html", "w+")
f.write(web_content)
f.close
except Exception as e:
print('the iteration var element for clicking the pages was not found.. the original message is:',e )
def find_config_parameter(self, list_of_fdbs): def find_config_parameter(self, list_of_fdbs):
for fdb in list_of_fdbs: for fdb in list_of_fdbs:
@ -146,11 +262,11 @@ class fdb_spider(object):
print('this is the n looped elements of the parent specified in config.yaml:') print('this is the n looped elements of the parent specified in config.yaml:')
#print('entrylistparent', fdb_conf_entry_list_parent) print('entrylistparent', fdb_conf_entry_list_parent)
#print(tree.xpath("//html//body//div//main//div//div[@class='row']//section[@class='l-search-result-list']")) print(tree.xpath("//html//body//div"))
#print(etree.tostring(tree.xpath(fdb_conf_entry_list_parent)).decode()) print(etree.tostring(tree.xpath(fdb_conf_entry_list_parent)[0]).decode())
for n in range(len(tree.xpath(fdb_conf_entry_list_parent))): for n in range(len(tree.xpath(fdb_conf_entry_list_parent))):
print('-----------------------------------------------------------------------------------------------------------------------------------------') print('-----------------------------------------------------------------------------------------------------------------------------------------')
@ -317,10 +433,12 @@ class fdb_spider(object):
dictionary_entry_list[n]["name"] = name dictionary_entry_list[n]["name"] = name
dictionary_entry_list[n]["info"] = info dictionary_entry_list[n]["info"] = info
dictionary_entry_list[n]["period"] = period dictionary_entry_list[n]["period"] = period
print('linklink', link, fdb_domain)
if fdb_domain in link: if fdb_domain in link:
print('oi')
dictionary_entry_list[n]["link"] = link dictionary_entry_list[n]["link"] = link
if fdb_domain not in link and 'http:' in link: if fdb_domain not in link and 'http:' in link:
print('oiA')
dictionary_entry_list[n]["link"] = link dictionary_entry_list[n]["link"] = link
if fdb_domain not in link and 'www.' in link: if fdb_domain not in link and 'www.' in link:
dictionary_entry_list[n]["link"] = link dictionary_entry_list[n]["link"] = link
@ -328,11 +446,28 @@ class fdb_spider(object):
dictionary_entry_list[n]["link"] = link dictionary_entry_list[n]["link"] = link
if 'javascript:' in link: if 'javascript:' in link:
dictionary_entry_list[n]["link"] = link dictionary_entry_list[n]["link"] = link
if fdb_domain not in link and ('http' or 'https' or 'www.') not in link: if fdb_domain not in link:
if link[-1] == '/': if 'http' not in link:
if 'www' not in link:
print('oiB')
if link[0] == '/':
if fdb_domain[-1] != '/':
dictionary_entry_list[n]["link"] = fdb_domain + link dictionary_entry_list[n]["link"] = fdb_domain + link
else: #print('got into D', dictionary_entry_list[n]["link"])
if fdb_domain[-1] == '/':
dictionary_entry_list[n]["link"] = fdb_domain + link[1:]
#print('got into C', dictionary_entry_list[n]["link"])
if link[0] == '.' and link[1] == '/':
if fdb_domain[-1] != '/':
dictionary_entry_list[n]["link"] = fdb_domain + link[1:]
print('got into B', dictionary_entry_list[n]["link"])
if fdb_domain[-1] == '/':
dictionary_entry_list[n]["link"] = fdb_domain + link[2:]
print('got into A', dictionary_entry_list[n]["link"])
if link[0] != '/' and link[0] != '.':
dictionary_entry_list[n]["link"] = fdb_domain + '/' + link dictionary_entry_list[n]["link"] = fdb_domain + '/' + link
#print('got into last else', dictionary_entry_list[n]["link"])
@ -361,15 +496,16 @@ class fdb_spider(object):
#service_args = ['--verbose'] #service_args = ['--verbose']
#driver = webdriver.Chrome('/usr/bin/chromium') #driver = webdriver.Chrome('/usr/bin/chromium')
options = webdriver.ChromeOptions() options = webdriver.ChromeOptions()
options.add_argument('headless') #options.add_argument('headless')
options.add_argument("--remote-debugging-port=9222") options.add_argument("--remote-debugging-port=9222")
options.add_argument('--no-sandbox') options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage') options.add_argument('--disable-dev-shm-usage')
service = Service(executable_path='/usr/bin/chromedriver') service = Service(executable_path='/usr/bin/chromedriver')
driver = webdriver.Chrome(options=options, service=service) driver = webdriver.Chrome(options=options, service=service)
driver.implicitly_wait(10)
#driver = webdriver.Chrome() #driver = webdriver.Chrome()
for fdb in list_of_fdbs: for fdb in list_of_fdbs:
print('spidering ' + fdb + ' ..')
try: try:
iteration_var_list = eval(self.config.get(fdb).get("entry-list").get("iteration-var-list")) iteration_var_list = eval(self.config.get(fdb).get("entry-list").get("iteration-var-list"))
except Exception as e: except Exception as e:
@ -394,20 +530,42 @@ class fdb_spider(object):
try: try:
fdb_conf_entry_list_javascript_link = fdb_conf_entry_list.get("javascript-link") fdb_conf_entry_list_javascript_link = fdb_conf_entry_list.get("javascript-link")
except Exception as e: except Exception as e:
fdb_conf_entry_list_javascript_link = 'NONE'
print('the javascript link in the config is missing, original error message is:', e) print('the javascript link in the config is missing, original error message is:', e)
try:
fdb_conf_entry_list_slow_downloading = fdb_conf_entry_list.get("slow-downloading")
except Exception as e:
print('the slow-downloading parameter is not set, original error message is:', e)
fdb_conf_entry_list_link1 = fdb_conf_entry_list.get("link1") fdb_conf_entry_list_link1 = fdb_conf_entry_list.get("link1")
fdb_conf_entry_list_link2 = fdb_conf_entry_list.get("link2") fdb_conf_entry_list_link2 = fdb_conf_entry_list.get("link2")
if fdb_conf_entry_list_slow_downloading == 'FALSE':
driver.get(fdb_conf_entry_list_link1 + str(i) + fdb_conf_entry_list_link2) driver.get(fdb_conf_entry_list_link1 + str(i) + fdb_conf_entry_list_link2)
else:
pass
for entry_id in dictionary_entry_list: for entry_id in dictionary_entry_list:
print(entry_id)
entry_link = dictionary_entry_list[entry_id]["link"] entry_link = dictionary_entry_list[entry_id]["link"]
web_content = 'NONE' web_content = 'NONE'
# download the html page of the entry # download the html page of the entry
print(entry_link)
if 'javascript' in entry_link or fdb_conf_entry_list_javascript_link != 'NONE':
if 'javascript' in entry_link: try:
accept_button = driver.find_element("xpath","//button[contains(text(), 'akzeptieren')]")
accept_button.click()
except Exception as e:
print(e, 'no cookies to accept..')
pass
driver.execute_script("scroll(0, 600)")
print('oioioi',fdb_conf_entry_list_parent, entry_id, fdb_conf_entry_list_javascript_link)
element = driver.find_element( element = driver.find_element(
"xpath", "xpath",
fdb_conf_entry_list_parent fdb_conf_entry_list_parent
@ -418,8 +576,8 @@ class fdb_spider(object):
) )
# to time.sleep was suggested for errors # to time.sleep was suggested for errors
#import time import time
#time.sleep(1) time.sleep(1)
element.click() element.click()
window_after = driver.window_handles[1] window_after = driver.window_handles[1]
@ -427,6 +585,9 @@ class fdb_spider(object):
#element = driver.find_element("xpath", "//html") #element = driver.find_element("xpath", "//html")
#web_content = element.text #web_content = element.text
#entry_domain = driver.getCurrentUrl() #entry_domain = driver.getCurrentUrl()
entry_domain = driver.current_url entry_domain = driver.current_url
@ -446,14 +607,33 @@ class fdb_spider(object):
driver.switch_to.window(window_before) driver.switch_to.window(window_before)
if ('http' or 'www') in entry_link and 'javascript' not in entry_link and '.pdf' not in entry_link: if 'javascript' not in entry_link and '.pdf' not in entry_link and fdb_conf_entry_list_javascript_link == 'NONE':
print('blabuuuuuba')
#print('oi')
if fdb_conf_entry_list_slow_downloading == 'TRUE':
try:
print("trying to get slowly entry link " , entry_link)
driver.get(entry_link)
time.sleep(3)
web_content = driver.page_source
except Exception as e:
print("getting the html behind the entry link did not work, ori message is:", e)
else:
try: try:
# defining cookie to not end up in endless loop because of cookie banners pointing to redirects # defining cookie to not end up in endless loop because of cookie banners pointing to redirects
url = entry_link url = entry_link
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'}) req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=oioioioi'})
response = urllib.request.urlopen(req) response = urllib.request.urlopen(req)
print('response from first one', response)
except Exception as e: except Exception as e:
print('cookie giving then downloading did not work, original error is:', e)
try: try:
response = urllib.request.urlopen(entry_link.encode('ascii', errors='xmlcharrefreplace').decode('ascii')) response = urllib.request.urlopen(entry_link.encode('ascii', errors='xmlcharrefreplace').decode('ascii'))
print( print(
@ -478,7 +658,7 @@ class fdb_spider(object):
# save interim results to files # save interim results to files
if '.pdf' in entry_link: if '.pdf' in entry_link and fdb_conf_entry_list_javascript_link == 'NONE':
file_name = "spiders/pages/" + fdb + str(i) + "/" + str(entry_id) + ".html" file_name = "spiders/pages/" + fdb + str(i) + "/" + str(entry_id) + ".html"
response = requests.get(entry_link) response = requests.get(entry_link)
@ -487,9 +667,15 @@ class fdb_spider(object):
f.write(response.content) f.write(response.content)
f.close f.close
else:
file_name = "spiders/pages/" + fdb + str(i) + "/" + str(entry_id) + ".html" file_name = "spiders/pages/" + fdb + str(i) + "/" + str(entry_id) + ".html"
wget_wrote = False
if web_content == 'NONE': if web_content == 'NONE':
print('other downloading approaches did not work, trying requests') print('other downloading approaches did not work, trying requests')
@ -504,9 +690,17 @@ class fdb_spider(object):
web_content = r.text web_content = r.text
except Exception as e: except Exception as e:
print('requests_html HTMLSession did not work') print('requests_html HTMLSession did not work trying wget, ori error is:', e)
try:
os.makedirs(os.path.dirname(file_name), exist_ok=True)
oi = subprocess.run(["wget", entry_link, '--output-document=' + file_name])
wget_wrote = True
except subprocess.CalledProcessError:
print('wget downloading did not work.. saving NONE to file now')
if wget_wrote == False:
os.makedirs(os.path.dirname(file_name), exist_ok=True) os.makedirs(os.path.dirname(file_name), exist_ok=True)
f = open(file_name, "w+") f = open(file_name, "w+")
f.write(web_content) f.write(web_content)
@ -576,20 +770,33 @@ class fdb_spider(object):
fdb_conf_entry_unitrue_child = fdb_conf_entry_unitrue.get(key) fdb_conf_entry_unitrue_child = fdb_conf_entry_unitrue.get(key)
print('unitrue_child',fdb_conf_entry_unitrue_child)
try:
child = tree.xpath( child = tree.xpath(
fdb_conf_entry_unitrue_child fdb_conf_entry_unitrue_child
)[0] )[0]
print('oi', child)
except:
print('getting unitruechild did not work')
child = 'NONE'
print("oi", child) print("oi", child)
if '.pdf' in child: if '.pdf' in child:
print('child in entry data is pdf, downloading it..') print('child in entry data is pdf, downloading it..')
file_name = "spiders/pages/" + fdb + str(i) + "/" + str(entry_id) + ".pdf" file_name = "spiders/pages/" + fdb + str(i) + "/" + str(entry_id) + ".pdf"
entry_link = dictionary_entry_list[entry_id]["link"] entry_link = dictionary_entry_list[entry_id]["link"]
print('that is the child: ' + child)
if 'http' in child:
try:
response = requests.get(child)
except Exception as e:
print(child + ' does not appear to be valid pdf link to download, original message is ' + e)
if 'http' not in child: if 'http' not in child:
if 'javascript' or 'js' not in entry_link and 'http' in entry_link: if 'javascript' or 'js' not in entry_link and 'http' in entry_link:
try: try:
@ -603,15 +810,21 @@ class fdb_spider(object):
if entry_domain[-1] == '/': if entry_domain[-1] == '/':
pdf_link = entry_domain[:-1] + child[1:] pdf_link = entry_domain[:-1] + child[1:]
if entry_domain[-1] != '/': if entry_domain[-1] != '/':
#print('it got into OIOIOIOOIOI')
#print('before loop ', entry_domain)
cut_value = 0
for n in range(len(entry_domain)): for n in range(len(entry_domain)):
if entry_domain[-1] != '/': if entry_domain[-n] != '/':
entry_domain = entry_domain[:-1] cut_value += 1
else: else:
break break
entry_domain = entry_domain[:-cut_value]
#print('after loop ', entry_domain)
pdf_link = entry_domain + child[1:] pdf_link = entry_domain + child[1:]
#print('the pdf link after recursive until slash: ', pdf_link)
if child[0] == '/': if child[0] == '/':
if entry_domain[-1] == '/': if entry_domain[-1] == '/':

View file

@ -1 +0,0 @@
{0: {'name': 'Newsletter', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/news/newsletter/newsletter.html'}, 1: {'name': 'Wettbewerbe, Preise', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/news/wettbewerbe-preise/wettbewerbe-preise.html'}, 2: {'name': 'Veranstaltungen', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/news/veranstaltungen/veranstaltungen.html'}, 3: {'name': 'Projektträger in der Forschungsförderung', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/projekttraeger/projekttraeger-in-der-forschungsfoerderung.html'}, 4: {'name': 'Leichte Sprache', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/services/leichtesprache/leichte-sprache.html'}, 5: {'name': 'Ausführliche Informationen', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/services/leichtesprache/ausfuehrliche-informationen.html'}, 6: {'name': 'Erklärung zur Barrierefreiheit', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/services/leichtesprache/erklaerung-zur-barrierefreiheit.html'}, 7: {'name': 'Darum geht es auf dieser Seite', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/services/leichtesprache/darum-geht-es-auf-dieser-seite.html'}, 8: {'name': 'FAQ', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/faq/faq.html'}, 9: {'name': 'Forschungs- und Innovationsförderung', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/forschungs-und-innovationsfoerderung/forschungs-und-innovationsfoerderung.html'}, 10: {'name': 'Glossar', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/glossar/glossar.html'}, 11: {'name': 'Bei uns sind Sie bestens beraten!', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/erstberatung/bei-uns-sind-sie-bestens-beraten_.html'}, 12: {'name': 'Unser Service', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/unser-service/unser-service.html'}, 13: {'name': 'Was wir tun', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/beratung/was-wir-tun/was-wir-tun.html'}, 14: {'name': '„Ich hab da mal eine Idee“ Die Förderberatung des Bundes im Gespräch', 'link': 'http://foerderinfo.bund.de/foerderinfo/de/_documents/ich-hab-da-mal-eine-idee.html'}}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long