17 Commits (ec180bed0a5026969973800b1c656c8d8ee54030)

Author SHA1 Message Date
  alpcentaur ec180bed0a added flow for selenium grabbing popup instead of links for entries 11 months ago
  alpcentaur 89dcca2031 added further handling for javascript links not being urls, made config for giz work 1 year ago
  alpcentaur a0075e429d added further database in config.yaml, added new exception for downloading js generated html pages 1 year ago
  alpcentaur df4a8289b8 added pdf parser if entry link is direct pdf 1 year ago
  alpcentaur d3335f203b added trafilatura exception 1 year ago
  alpcentaur 14ece9bceb added functions for uniform and non-uniform entry endpoints; non-uniform endpoints are generally parsed as text from any paragraph xml element p 1 year ago
  alpcentaur b2cf4b67ce added first config parameters for search on not uniform entries 1 year ago
  alpcentaur 42841ee650 added some exceptions for bad encoding and get errors 1 year ago
  alpcentaur 317ef99720 changed code in entrylist data2dictionary to handle empty or missing xml elements 1 year ago
  alpcentaur ff23c22e3c added working bund.de-bekanntmachungen config with new example of xpath contains 1 year ago
  alpcentaur 06fa81e549 added function find config parameter and changed core spider 1 year ago
  alpcentaur a846ce04cc specifying the links, new exception clause if soupparser does not work 1 year ago
  alpcentaur c078ee4b1b first function works, actual xml parser still has problems with certain xml types 1 year ago
  alpcentaur 8b20bc178f added multi pages configuration and code 1 year ago
  alpcentaur 7aa903883b update to config.yaml 1 year ago
  alpcentaur 59838bb8e1 added main.py importing and using the spider functions 1 year ago
  alpcentaur 5ac07d151a added first config.yaml template and started creating folder structure 1 year ago