Commit graph

34 commits

Author SHA1 Message Date
alpcentaur
5627c80177 merged onlinkgen with master, and added more universal chrome driver initialization to the beginning of the javascript entries gothrough function in download_entry_list_pages_of_funding_databases() 2023-12-14 12:38:14 +00:00
alpcentaur
14b8db7941 started adding javascript handling on highest spider level 2023-12-14 12:07:14 +00:00
alpcentaur
fbee5d6229 last commit in detached head 2023-12-13 16:20:27 +01:00
alpcentaur
953f85ee5b added new lines to chromedriver, to make it work on other systems 2023-12-13 16:05:26 +01:00
alpcentaur
d2324d265a added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data 2023-12-06 13:46:54 +00:00
alpcentaur
885c210971 added selenium for pop up entry links 2023-12-05 22:19:00 +00:00
alpcentaur
ec180bed0a added flow for selenium grabbing popup instead of links for entries 2023-12-05 22:16:07 +00:00
alpcentaur
b4fd385c5d did some changes to main.py for using sys.argv 2023-12-05 18:23:57 +01:00
alpcentaur
99c74dcbad updated requirements.txt 2023-12-05 16:59:13 +00:00
alpcentaur
54daad8dfa started sys arguments for main.py, to be able to control spider from interface 2023-12-05 17:51:16 +01:00
alpcentaur
89dcca2031 added further handling for javascript links not being urls, made config for giz work 2023-11-28 15:27:39 +00:00
alpcentaur
a0075e429d added further database in config.yaml, added new exception for downloading js generated html pages 2023-11-27 15:10:11 +00:00
alpcentaur
df4a8289b8 added pdf parser if entry link is direct pdf 2023-11-22 17:03:15 +00:00
alpcentaur
677e54c0c2 added trafilatura to requirements 2023-11-22 00:07:59 +00:00
alpcentaur
9ceaa28a82 Merge remote-tracking branch 'refs/remotes/origin/master': merging with a small README.md change 2023-11-22 00:04:16 +00:00
alpcentaur
d3335f203b added trafilatura exception 2023-11-22 00:02:29 +00:00
alpcentaur
61f9ba67fb update README.md 2023-11-20 16:38:18 +01:00
alpcentaur
69c517292b Update 'README.md' 2023-11-20 16:37:27 +01:00
alpcentaur
14ece9bceb added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p 2023-11-20 15:28:04 +00:00
alpcentaur
b2cf4b67ce added first config parameters for search on not uniform entries 2023-11-15 17:27:54 +00:00
alpcentaur
42841ee650 added some exceptions for bad encoding and get errors 2023-11-14 14:38:45 +00:00
alpcentaur
317ef99720 changed code in entrylist data2dictionary to handle empty or missing xml elements 2023-11-14 10:22:26 +00:00
alpcentaur
ff23c22e3c added working bund.de-bekanntmachungen config with new example of xpath contains 2023-11-13 16:44:11 +00:00
alpcentaur
06fa81e549 added function find config parameter and changed core spider 2023-11-10 01:12:49 +00:00
alpcentaur
a846ce04cc specifying the links, new exception clause if soupparser does not work 2023-11-07 14:55:05 +00:00
alpcentaur
a99881796a first function works, actual xml parser still has problems with certain xml types 2023-11-06 19:19:31 +00:00
alpcentaur
c078ee4b1b first function works, actual xml parser still has problems with certain xml types 2023-11-06 19:17:45 +00:00
alpcentaur
8b20bc178f added multi pages configuration and code 2023-11-06 18:17:32 +00:00
alpcentaur
7aa903883b update to config.yaml 2023-11-03 12:23:04 +00:00
alpcentaur
59838bb8e1 added main.py importing and using the spider functions 2023-11-02 10:54:16 +00:00
alpcentaur
5ac07d151a added first config.yaml template and started creating folder structure 2023-10-31 17:41:44 +00:00
alpcentaur
b3011efc73 small change of naming in error message added 2023-10-30 16:43:32 +00:00
alpcentaur
687d40f156 first change of naming, first commit for the actual spider based on importPEP 2023-10-30 16:41:14 +00:00
alpcentaur
8783251133 first commit 2023-10-30 14:35:32 +00:00