Commit Graph

  • 2aa1134 (HEAD -> master) updated gitignore by alpcentaur 2024-03-05 14:56:03 +0000
  • a9c2346 first change to also click Accept Button in English if may come for js spidering functionality by alpcentaur 2024-03-05 14:50:34 +0000
  • 0808e5a main.py and config.yaml are left out from updates, only examples are provided. Change in Readme too by alpcentaur 2024-03-05 14:42:30 +0000
  • 4ec9f76 added xorg-server-xephyr as dep to install by alpcentaur 2024-03-05 14:52:02 +0100
  • 10cdab6 updated README with new and working install order by alpcentaur 2024-03-05 14:43:43 +0100
  • ccfe200 added another tip to README.md, header for display and another tip added too by alpcentaur 2024-03-05 12:34:32 +0100
  • 0fa420d added explanation of display variable in the spiders code by alpcentaur 2024-03-05 12:30:29 +0100
  • 0d77282 update var javascriptlink in README.md by alpcentaur 2024-03-04 17:13:33 +0100
  • c52ea0c added example1 for js configuration in README.md by alpcentaur 2024-03-04 16:46:57 +0100
  • 5000dca Update README.md with better explanation how to js spider by alpcentaur 2024-03-04 16:30:31 +0100
  • 0908ccf clarifications for javascript link and js link plus js iteration by alpcentaur 2024-03-03 18:24:07 +0100
  • ff0fe51 fixed the links for the clickable content summary by alpcentaur 2024-03-03 17:57:48 +0100
  • 49d5c2f third try ordering by alpcentaur 2024-03-03 17:53:05 +0100
  • f489106 second try ordering by alpcentaur 2024-03-03 17:50:12 +0100
  • 32fceff searchable headers for step by step guide started by alpcentaur 2024-03-03 17:48:52 +0100
  • eca77f9 Step by Step Guide continuation of describing the variables by alpcentaur 2024-03-01 00:09:38 +0100
  • 483eaec changed domain for new configuration dtvp by alpcentaur 2024-02-29 14:19:45 +0100
  • c33dbc3 Merge remote-tracking branch 'refs/remotes/origin/master' Merging local changes to the code with changes to the README.md on gitea instance by alpcentaur 2024-02-29 13:16:48 +0000
  • a07d2e9 changes for new database dtvp, new exceptions trying to click away cookie pop ups by alpcentaur 2024-02-29 13:15:34 +0000
  • d284fef changes for new database dtvp, new exceptions trying to click away cookie pop ups by alpcentaur 2024-02-29 13:15:01 +0000
  • 5fd6b7f Part 2 of Step by Step Guide by alpcentaur 2024-02-28 17:34:57 +0100
  • e4fa13d Start of Step by Step Guide by alpcentaur 2024-02-28 17:17:27 +0100
  • 7ba196b changed size of virtual window, added some scrolling and shortened the time for js lazy loading enforced slow downloading by alpcentaur 2024-02-11 17:08:33 +0000
  • a565697 another small change to config.yaml before pushing by alpcentaur 2024-02-11 16:43:44 +0000
  • a0dd469 added new database ted.europe.eu, created new case of slow downloading, integrated scrolling into entrylistpagesdownload by alpcentaur 2024-02-09 18:38:49 +0000
  • 094f092 deleted fdb entry that was a ghost for syntax reasons, but same syntax should be in other fdb anyway by alpcentaur 2024-01-23 17:17:40 +0100
  • d7d157b added further documentation to README.md by alpcentaur 2024-01-21 14:07:38 +0000
  • 0500f58 full working example from localhost by alpcentaur 2024-01-15 21:08:23 +0000
  • 0411d74 deleted config.yaml.save by alpcentaur 2024-01-15 19:12:04 +0000
  • cf3bb52 corrected link glueing for pdf links for loop by alpcentaur 2024-01-15 19:09:28 +0000
  • af8374f added other exception for unitrue var text not being found, before saving index 0 to variable produced error to whole execution by alpcentaur 2024-01-10 15:28:41 +0000
  • 20db002 added first changes to fix js related bug for giz db by alpcentaur 2024-01-10 15:18:36 +0100
  • dec60f9 added changed logic for link addition regarding entry links by alpcentaur 2023-12-18 21:26:53 +0000
  • 5d17f4e corrected error which arose in logic of wget backup get by alpcentaur 2023-12-15 14:36:08 +0100
  • 92c238a added instruction for downloading chromium driver for python selenium to README.md by alpcentaur 2023-12-15 14:13:41 +0100
  • ece5cf1 added better logic for getting the right link of entry by alpcentaur 2023-12-15 13:34:23 +0100
  • 0e58756 added last resort exception for entry page downloading with wget, also implemented some further logic regarding getting the right links by alpcentaur 2023-12-15 11:33:50 +0000
  • 1619925 javascript on highest level done better by alpcentaur 2023-12-14 23:37:10 +0000
  • 5627c80 merged onlinkgen with master, and added more universal chrome driver initialization to the beginning of the javascript entries gothrough function in download_entry_list_pages_of_funding_databases() by alpcentaur 2023-12-14 12:38:14 +0000
  • 14b8db7 started adding javascript handling on highest spider level by alpcentaur 2023-12-14 12:07:14 +0000
  • fbee5d6 (onlinkgen) last commit in detached head by alpcentaur 2023-12-13 16:20:27 +0100
  • 953f85e added new lines to chromedriver, to make it work on other systems by alpcentaur 2023-12-13 16:05:26 +0100
  • d2324d2 added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data by alpcentaur 2023-12-06 13:46:54 +0000
  • 885c210 added selenium for pop up entry links by alpcentaur 2023-12-05 22:19:00 +0000
  • ec180be added flow for selenium grabbing popup instead of links for entries by alpcentaur 2023-12-05 22:16:07 +0000
  • b4fd385 did some changes to main.py for using sys.argv by alpcentaur 2023-12-05 18:23:57 +0100
  • 99c74dc updated requirements.txt by alpcentaur 2023-12-05 16:59:13 +0000
  • 54daad8 started sys arguments for main.py, to be able to control spider from interface by alpcentaur 2023-12-05 17:51:16 +0100
  • 89dcca2 (tag: v2, tag: v1) added further handling for javascript links not being urls, made config for giz work by alpcentaur 2023-11-28 15:27:39 +0000
  • a0075e4 added further database in config.yaml, added new exception for downloading js generated html pages by alpcentaur 2023-11-27 15:10:11 +0000
  • df4a828 added pdf parser if entry link is direct pdf by alpcentaur 2023-11-22 17:03:15 +0000
  • 677e54c added trafilatura to requirements by alpcentaur 2023-11-22 00:07:59 +0000
  • 9ceaa28 Merge remote-tracking branch 'refs/remotes/origin/master' merging with a small README.md change by alpcentaur 2023-11-22 00:04:16 +0000
  • d3335f2 added trafilatura exception by alpcentaur 2023-11-22 00:02:29 +0000
  • 61f9ba6 update README.md by alpcentaur 2023-11-20 16:38:18 +0100
  • 69c5172 Update 'README.md' by alpcentaur 2023-11-20 16:37:27 +0100
  • 14ece9b added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p by alpcentaur 2023-11-20 15:28:04 +0000
  • b2cf4b6 added first config parameters for search on not uniform entries by alpcentaur 2023-11-15 17:27:54 +0000
  • 42841ee added some exceptions for bad encoding and get errors by alpcentaur 2023-11-14 14:38:45 +0000
  • 317ef99 changed code in entrylist data2dictionary to handle empty or missing xml elements by alpcentaur 2023-11-14 10:22:26 +0000
  • ff23c22 added working bund.de-bekanntmachungen config with new example of xpath contains by alpcentaur 2023-11-13 16:44:11 +0000
  • 06fa81e added function find config parameter and changed core spider by alpcentaur 2023-11-10 01:12:49 +0000
  • a846ce0 specifying the links, new exception clause if soupparser does not work by alpcentaur 2023-11-07 14:55:05 +0000
  • a998817 first function works, actual xml parser still has problems with certain xml types by alpcentaur 2023-11-06 19:19:31 +0000
  • c078ee4 first function works, actual xml parser still has problems with certain xml types by alpcentaur 2023-11-06 19:17:45 +0000
  • 8b20bc1 added multi pages configuration and code by alpcentaur 2023-11-06 18:17:32 +0000
  • 7aa9038 update to config.yaml by alpcentaur 2023-11-03 12:23:04 +0000
  • 59838bb added main.py importing and using the spider functions by alpcentaur 2023-11-02 10:54:16 +0000
  • 5ac07d1 added first config.yaml template and started creating folder structure by alpcentaur 2023-10-31 17:41:44 +0000
  • b3011ef small change of naming in error message added by alpcentaur 2023-10-30 16:43:32 +0000
  • 687d40f first change of naming, first commit for the actual spider based on importPEP by alpcentaur 2023-10-30 16:41:14 +0000
  • 8783251 first commit by alpcentaur 2023-10-30 14:35:32 +0000