main.py and config.yaml are left out of updates; only example files are provided. Changed in the README too.
parent 4ec9f76080
commit 0808e5a42d

4 changed files with 203 additions and 1 deletion
.gitignore (vendored, 2 changes)
@@ -1,3 +1,5 @@
+spiders/config.yaml
+main.py
 /venv
 /spiders/pages/**
 /spiders/output/**
README.md (16 changes)
@@ -80,6 +80,20 @@ pip install -r requirements.txt

# Usage

Use it step by step. First take care of the HTML pages that hold the lists of
links. Then take care of getting the first JSON output from that first layer of
HTML pages.

Copy the two example files to the filenames under which the spider expects
them as input:

```
cp main.py_example main.py

cp spiders/config.yaml_example spiders/config.yaml
```

## Configuration File Syntax

The configuration file with working syntax template is
@@ -312,4 +326,4 @@ For that to happen, you can define the javascript link that needs to be clicked

#### var slow downloading

Slow downloading comes into play when a website uses lazy loading and/or is used by so many users that it loads very slowly. In that case the Selenium part, like any normal user, runs into timing problems, which leads to much longer processing times when harvesting the lists. Depending on the firewall configuration of the server you are harvesting, there may even be a time-based limit. And again, we can simply act as many different instances, each of which downloads only a part. In this sense, even under future totalitarian systems that may come, freeing information from big platforms is and will always be possible in peer-to-peer systems.
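The timing issue described under "var slow downloading" is ordinary Selenium behaviour rather than anything specific to this repository. As a hedged illustration only (the Firefox driver, the CSS selector and the 60-second timeout are assumptions, not values taken from fdb_spider or config.yaml), a slow, lazy-loading list page can usually be handled with an explicit wait plus a scroll that forces the remaining entries to render:

```
# Sketch only: waiting out a slow, lazy-loading list page with Selenium.
# The selector, timeout and driver choice are placeholders, not fdb_spider internals.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://ted.europa.eu")  # one of the sources listed in main.py_example

# generous explicit wait instead of a fixed sleep, to survive slow servers
WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
)

# scroll to the bottom so lazily loaded entries are actually rendered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

html = driver.page_source  # now contains the fully loaded list
driver.quit()
```

Splitting `list_of_fdbs` across several such instances, as the paragraph suggests, keeps the per-instance load small and reduces the chance of running into a server-side time limit.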
main.py_example (new file, 28 lines)
@@ -0,0 +1,28 @@
from spiders.fdb_spider import *

import sys

config = "spiders/config.yaml"
#list_of_fdbs = eval(sys.argv[1])
#list_of_fdbs = ["giz","evergabe-online","foerderinfo.bund.de-bekanntmachungen"]
#list_of_fdbs = ["giz","evergabe-online"]
#list_of_fdbs = ["foerderinfo.bund.de-bekanntmachungen"]
list_of_fdbs = ["ted.europa.eu"]
#list_of_fdbs = ["dtvp"]


# doing the crawling of government websites

spider = fdb_spider(config)

spider.download_entry_list_pages_of_funding_databases(list_of_fdbs)

#spider.find_config_parameter(list_of_fdbs)

spider.parse_entry_list_data2dictionary(list_of_fdbs)

#spider.download_entry_data_htmls(list_of_fdbs)

#spider.parse_entry_data2dictionary(list_of_fdbs)
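Judging by the method names and the Usage section above, `spider.download_entry_list_pages_of_funding_databases(...)` covers the first step (fetching the HTML pages that hold the link lists) and `spider.parse_entry_list_data2dictionary(...)` the second (producing the first JSON/dictionary output from that layer); the calls for the deeper entry pages remain commented out. After `cp main.py_example main.py`, the script runs as a plain `python main.py`. If the commented-out `eval(sys.argv[1])` line were enabled instead of the hard-coded list, the selection could presumably be passed as a Python-list literal on the command line, e.g. `python main.py '["giz","evergabe-online"]'`; the exact quoting is shell-dependent, and this variant is only a sketch since the example ships with it commented out.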
spiders/config.yaml_example (new file, 158 lines)
File diff suppressed because one or more lines are too long.