41 Commits (master)

Author SHA1 Message Date
  alpcentaur a9c2346c04 first change to also click the English Accept button if it appears, for js spidering functionality 6 months ago
  alpcentaur 0808e5a42d main.py and config.yaml are left out from updates, only examples are provided. Change in Readme too 6 months ago
  alpcentaur 483eaec26e changed domain for new configuration dtvp 6 months ago
  alpcentaur d284fef015 changes for new database dtvp, new exceptions trying to click away cookie pop ups 6 months ago
  alpcentaur 7ba196b0c2 changed size of virtual window, added some scrolling and shortened the time for js lazy loading enforced slow downloading 7 months ago
  alpcentaur a56569712e another small change to config.yaml before pushing 7 months ago
  alpcentaur a0dd469f25 added new database ted.europe.eu, created new case of slow downloading, integrated scrolling into entrylistpagesdownload 7 months ago
  alpcentaur 094f092291 deleted fdb entry that was a ghost for syntax reasons, but same syntax should be in other fdb anyway 7 months ago
  alpcentaur 0500f5853d full working example from localhost 8 months ago
  alpcentaur 0411d74936 deleted config.yaml.save 8 months ago
  alpcentaur cf3bb52684 corrected link glueing for pdf links for loop 8 months ago
  alpcentaur af8374f715 added other exception for unitrue var text not being found, before saving index 0 to variable produced error to whole execution 8 months ago
  alpcentaur 20db0028e1 added first changes to fix js related bug for giz db 8 months ago
  alpcentaur dec60f9bf5 added changed logic for link addition regarding entry links 9 months ago
  alpcentaur 5d17f4e421 corrected error which arose in logic of wget backup get 9 months ago
  alpcentaur ece5cf1301 added better logic for getting the right link of entry 9 months ago
  alpcentaur 0e58756600 added last resort exception for entry page downloading with wget, also implemented some further logic regarding getting the right links 9 months ago
  alpcentaur 16199256e3 javascript on highest level done better 9 months ago
  alpcentaur 14b8db7941 started adding javascript handling on highest spider level 9 months ago
  alpcentaur 953f85ee5b added new lines to chromedriver, to make it work on other systems 9 months ago
  alpcentaur d2324d265a added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data 9 months ago
  alpcentaur ec180bed0a added flow for selenium grabbing popup instead of links for entries 9 months ago
  alpcentaur b4fd385c5d did some changes to main.py for using sys.argv 9 months ago
  alpcentaur 89dcca2031 added further handling for javascript links not being urls, made config for giz work 9 months ago
  alpcentaur a0075e429d added further database in config.yaml, added new exception for downloading js generated html pages 9 months ago
  alpcentaur df4a8289b8 added pdf parser if entry link is direct pdf 9 months ago
  alpcentaur d3335f203b added trafilatura exception 9 months ago
  alpcentaur 14ece9bceb added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p 10 months ago
  alpcentaur b2cf4b67ce added first config parameters for search on not uniform entries 10 months ago
  alpcentaur 42841ee650 added some exceptions for bad encoding and get errors 10 months ago
  alpcentaur 317ef99720 changed code in entrylist data2dictionary to handle empty or missing xml elements 10 months ago
  alpcentaur ff23c22e3c added working bund.de-bekanntmachungen config with new example of xpath contains 10 months ago
  alpcentaur 06fa81e549 added function find config parameter and changed core spider 10 months ago
  alpcentaur a846ce04cc specifying the links, new exception clause if soupparser does not work 10 months ago
  alpcentaur c078ee4b1b first function works, actual xml parser still has problems with certain xml types 10 months ago
  alpcentaur 8b20bc178f added multi pages configuration and code 10 months ago
  alpcentaur 7aa903883b update to config.yaml 10 months ago
  alpcentaur 59838bb8e1 added main.py importing and using the spider functions 10 months ago
  alpcentaur 5ac07d151a added first config.yaml template and started creating folder structure 10 months ago