Commit graph

16 commits

Author SHA1 Message Date
alpcentaur 14b8db7941 started adding javascript handling on highest spider level 2023-12-14 12:07:14 +00:00
alpcentaur d2324d265a added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data 2023-12-06 13:46:54 +00:00
alpcentaur 885c210971 added selenium for pop up entry links 2023-12-05 22:19:00 +00:00
alpcentaur ec180bed0a added flow for selenium grabbing popup instead of links for entries 2023-12-05 22:16:07 +00:00
alpcentaur b4fd385c5d did some changes to main.py for using sys.argv 2023-12-05 18:23:57 +01:00
alpcentaur 89dcca2031 added further handling for javascript links not being urls, made config for giz work 2023-11-28 15:27:39 +00:00
alpcentaur a0075e429d added further database in config.yaml, added new exception for downloading js generated html pages 2023-11-27 15:10:11 +00:00
alpcentaur df4a8289b8 added pdf parser if entry link is direct pdf 2023-11-22 17:03:15 +00:00
alpcentaur d3335f203b added trafilatura exception 2023-11-22 00:02:29 +00:00
alpcentaur 14ece9bceb added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p 2023-11-20 15:28:04 +00:00
alpcentaur 42841ee650 added some exceptions for bad encoding and get errors 2023-11-14 14:38:45 +00:00
alpcentaur 317ef99720 changed code in entrylist data2dictionary to handle empty or missing xml elements 2023-11-14 10:22:26 +00:00
alpcentaur ff23c22e3c added working bund.de-bekanntmachungen config with new example of xpath contains 2023-11-13 16:44:11 +00:00
alpcentaur 06fa81e549 added function find config parameter and changed core spider 2023-11-10 01:12:49 +00:00
alpcentaur a846ce04cc specifying the links, new exception clause if soupparser does not work 2023-11-07 14:55:05 +00:00
alpcentaur c078ee4b1b first function works, actual xml parser still has problems with certain xml types 2023-11-06 19:17:45 +00:00