Commit graph

56 commits

Author SHA1 Message Date
alpcentaur
483eaec26e changed domain for new configuration dtvp 2024-02-29 14:19:45 +01:00
alpcentaur
c33dbc37e6 Merge remote-tracking branch 'refs/remotes/origin/master'
Merging local changes to the code with changes to the README.md on gitea instance
2024-02-29 13:16:48 +00:00
alpcentaur
a07d2e93f6 changes for new database dtvp, new exceptions trying to click away cookie pop ups 2024-02-29 13:15:34 +00:00
alpcentaur
d284fef015 changes for new database dtvp, new exceptions trying to click away cookie pop ups 2024-02-29 13:15:01 +00:00
alpcentaur
5fd6b7f781 Part 2 of Step by Step Guide 2024-02-28 17:34:57 +01:00
alpcentaur
e4fa13d29d Start of Step by Step Guide
Oi
2024-02-28 17:17:27 +01:00
alpcentaur
7ba196b0c2 changed size of virtual window, added some scrolling and shortened the time for js lazy loading enforced slow downloading 2024-02-11 17:08:33 +00:00
alpcentaur
a56569712e another small change to config.yaml before pushing 2024-02-11 16:43:44 +00:00
alpcentaur
a0dd469f25 added new database ted.europe.eu, created new case of slow downloading, integrated scrolling into entrylistpagesdownload 2024-02-09 18:38:49 +00:00
alpcentaur
094f092291 deleted fdb entry that was a ghost for syntax reasons, but same syntax should be in other fdb anyway 2024-01-23 17:17:40 +01:00
alpcentaur
d7d157bf42 added further documentation to README.md 2024-01-21 14:07:38 +00:00
alpcentaur
0500f5853d full working example from localhost 2024-01-15 21:08:23 +00:00
alpcentaur
0411d74936 deleted config.yaml.save 2024-01-15 19:12:04 +00:00
alpcentaur
cf3bb52684 corrected link glueing for pdf links for loop 2024-01-15 19:09:28 +00:00
alpcentaur
af8374f715 added other exception for unitrue var text not being found, before saving index 0 to variable produced error to whole execution 2024-01-10 15:28:41 +00:00
alpcentaur
20db0028e1 added first changes to fix js related bug for giz db 2024-01-10 15:18:36 +01:00
alpcentaur
dec60f9bf5 added changed logic for link addition regarding entry links 2023-12-18 21:26:53 +00:00
alpcentaur
5d17f4e421 corrected error which arose in logic of wget backup get 2023-12-15 14:36:08 +01:00
alpcentaur
92c238a2ed added instruction for downloading chromium driver for python selenium to README.md 2023-12-15 14:13:41 +01:00
alpcentaur
ece5cf1301 added better logic for getting the right link of entry 2023-12-15 13:34:23 +01:00
alpcentaur
0e58756600 added last resort exception for entry page downloading with wget, also implemented some further logic regarding getting the right links 2023-12-15 11:33:50 +00:00
alpcentaur
16199256e3 javascript on highest level done better 2023-12-14 23:37:10 +00:00
alpcentaur
5627c80177 merged onlinkgen with master, and added more universal chrome driver initialization to the beginning of the javascript entries gothrough function in download_entry_list_pages_of_funding_databases() 2023-12-14 12:38:14 +00:00
alpcentaur
14b8db7941 started adding javascript handling on highest spider level 2023-12-14 12:07:14 +00:00
alpcentaur
fbee5d6229 last commit in detached head 2023-12-13 16:20:27 +01:00
alpcentaur
953f85ee5b added new lines to chromedriver, to make it work on other systems 2023-12-13 16:05:26 +01:00
alpcentaur
d2324d265a added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data 2023-12-06 13:46:54 +00:00
alpcentaur
885c210971 added selenium for pop up entry links 2023-12-05 22:19:00 +00:00
alpcentaur
ec180bed0a added flow for selenium grabbing popup instead of links for entries 2023-12-05 22:16:07 +00:00
alpcentaur
b4fd385c5d did some changes to main.py for using sys.argv 2023-12-05 18:23:57 +01:00
alpcentaur
99c74dcbad updated requirements.txt 2023-12-05 16:59:13 +00:00
alpcentaur
54daad8dfa started sys arguments for main.py, to be able to control spider from interface 2023-12-05 17:51:16 +01:00
alpcentaur
89dcca2031 added further handling for javascript links not being urls, made config for giz work 2023-11-28 15:27:39 +00:00
alpcentaur
a0075e429d added further database in config.yaml, added new exception for downloading js generated html pages 2023-11-27 15:10:11 +00:00
alpcentaur
df4a8289b8 added pdf parser if entry link is direct pdf 2023-11-22 17:03:15 +00:00
alpcentaur
677e54c0c2 added trafilatura to requirements 2023-11-22 00:07:59 +00:00
alpcentaur
9ceaa28a82 Merge remote-tracking branch 'refs/remotes/origin/master'
merging with a small README.md change
2023-11-22 00:04:16 +00:00
alpcentaur
d3335f203b added trafilatura exception 2023-11-22 00:02:29 +00:00
alpcentaur
61f9ba67fb update README.md 2023-11-20 16:38:18 +01:00
alpcentaur
69c517292b Update 'README.md' 2023-11-20 16:37:27 +01:00
alpcentaur
14ece9bceb added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p 2023-11-20 15:28:04 +00:00
alpcentaur
b2cf4b67ce added first config parameters for search on not uniform entries 2023-11-15 17:27:54 +00:00
alpcentaur
42841ee650 added some exceptions for bad encoding and get errors 2023-11-14 14:38:45 +00:00
alpcentaur
317ef99720 changed code in entrylist data2dictionary to handle empty or missing xml elements 2023-11-14 10:22:26 +00:00
alpcentaur
ff23c22e3c added working bund.de-bekanntmachungen config with new example of xpath contains 2023-11-13 16:44:11 +00:00
alpcentaur
06fa81e549 added function find config parameter and changed core spider 2023-11-10 01:12:49 +00:00
alpcentaur
a846ce04cc specifying the links, new exception clause if soupparser does not work 2023-11-07 14:55:05 +00:00
alpcentaur
a99881796a first function works, actual xml parser still has problems with certain xml types 2023-11-06 19:19:31 +00:00
alpcentaur
c078ee4b1b first function works, actual xml parser still has problems with certain xml types 2023-11-06 19:17:45 +00:00
alpcentaur
8b20bc178f added multi pages configuration and code 2023-11-06 18:17:32 +00:00