2aa1134
(HEAD -> master)
updated gitignore by
2024-03-05 14:56:03 +0000
a9c2346
first change to also click the English Accept button if it appears, for js spidering functionality by
2024-03-05 14:50:34 +0000
0808e5a
main.py and config.yaml are left out from updates, only examples are provided. Change in Readme too by
2024-03-05 14:42:30 +0000
4ec9f76
added xorg-server-xephyr as dep to install by
2024-03-05 14:52:02 +0100
10cdab6
updated README with new and working install order by
2024-03-05 14:43:43 +0100
ccfe200
added another tip to README.md, header for display and another tip added too by
2024-03-05 12:34:32 +0100
0fa420d
added explanation of display variable in the spiders code by
2024-03-05 12:30:29 +0100
0d77282
update var javascriptlink in README.md by
2024-03-04 17:13:33 +0100
c52ea0c
added example1 for js configuration in README.md by
2024-03-04 16:46:57 +0100
5000dca
Update README.md with better explanation how to js spider by
2024-03-04 16:30:31 +0100
0908ccf
clarifications for javascript link and js link plus js iteration by
2024-03-03 18:24:07 +0100
ff0fe51
fixed the links for the clickable content summary by
2024-03-03 17:57:48 +0100
49d5c2f
third try ordering by
2024-03-03 17:53:05 +0100
f489106
second try ordering by
2024-03-03 17:50:12 +0100
32fceff
searchable headers for step by step guide started by
2024-03-03 17:48:52 +0100
eca77f9
Step by Step Guide continuation of describing the variables by
2024-03-01 00:09:38 +0100
483eaec
changed domain for new configuration dtvp by
2024-02-29 14:19:45 +0100
c33dbc3
Merge remote-tracking branch 'refs/remotes/origin/master' Merging local changes to the code with changes to the README.md on gitea instance by
2024-02-29 13:16:48 +0000
a07d2e9
changes for new database dtvp, new exceptions trying to click away cookie pop ups by
2024-02-29 13:15:34 +0000
d284fef
changes for new database dtvp, new exceptions trying to click away cookie pop ups by
2024-02-29 13:15:01 +0000
5fd6b7f
Part 2 of Step by Step Guide by
2024-02-28 17:34:57 +0100
e4fa13d
Start of Step by Step Guide by
2024-02-28 17:17:27 +0100
7ba196b
changed size of virtual window, added some scrolling and shortened the time for js lazy loading enforced slow downloading by
2024-02-11 17:08:33 +0000
a565697
another small change to config.yaml before pushing by
2024-02-11 16:43:44 +0000
a0dd469
added new database ted.europe.eu, created new case of slow downloading, integrated scrolling into entrylistpagesdownload by
2024-02-09 18:38:49 +0000
094f092
deleted fdb entry that was a ghost for syntax reasons, but same syntax should be in other fdb anyway by
2024-01-23 17:17:40 +0100
d7d157b
added further documentation to README.md by
2024-01-21 14:07:38 +0000
0500f58
full working example from localhost by
2024-01-15 21:08:23 +0000
0411d74
deleted config.yaml.save by
2024-01-15 19:12:04 +0000
cf3bb52
corrected link glueing for pdf links for loop by
2024-01-15 19:09:28 +0000
af8374f
added another exception for unitrue var text not being found; previously, saving index 0 to the variable raised an error that stopped the whole execution by
2024-01-10 15:28:41 +0000
20db002
added first changes to fix js related bug for giz db by
2024-01-10 15:18:36 +0100
dec60f9
added changed logic for link addition regarding entry links by
2023-12-18 21:26:53 +0000
5d17f4e
corrected error which arose in logic of wget backup get by
2023-12-15 14:36:08 +0100
92c238a
added instruction for downloading chromium driver for python selenium to README.md by
2023-12-15 14:13:41 +0100
ece5cf1
added better logic for getting the right link of entry by
2023-12-15 13:34:23 +0100
0e58756
added last resort exception for entry page downloading with wget, also implemented some further logic regarding getting the right links by
2023-12-15 11:33:50 +0000
1619925
javascript on highest level done better by
2023-12-14 23:37:10 +0000
5627c80
merged onlinkgen with master, and added more universal chrome driver initialization to the beginning of the javascript entries gothrough function in download_entry_list_pages_of_funding_databases() by
2023-12-14 12:38:14 +0000
14b8db7
started adding javascript handling on highest spider level by
2023-12-14 12:07:14 +0000
fbee5d6
(onlinkgen)
last commit in detached head by
2023-12-13 16:20:27 +0100
953f85e
added new lines to chromedriver, to make it work on other systems by
2023-12-13 16:05:26 +0100
d2324d2
added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data by
2023-12-06 13:46:54 +0000
885c210
added selenium for pop up entry links by
2023-12-05 22:19:00 +0000
ec180be
added flow for selenium grabbing popup instead of links for entries by
2023-12-05 22:16:07 +0000
b4fd385
did some changes to main.py for using sys.argv by
2023-12-05 18:23:57 +0100
99c74dc
updated requirements.txt by
2023-12-05 16:59:13 +0000
54daad8
started sys arguments for main.py, to be able to control spider from interface by
2023-12-05 17:51:16 +0100
89dcca2
(tag: v2, tag: v1)
added further handling for javascript links not being urls, made config for giz work by
2023-11-28 15:27:39 +0000
a0075e4
added further database in config.yaml, added new exception for downloading js generated html pages by
2023-11-27 15:10:11 +0000
df4a828
added pdf parser if entry link is direct pdf by
2023-11-22 17:03:15 +0000
677e54c
added trafilatura to requirements by
2023-11-22 00:07:59 +0000
9ceaa28
Merge remote-tracking branch 'refs/remotes/origin/master' merging with a small README.md change by
2023-11-22 00:04:16 +0000
d3335f2
added trafilatura exception by
2023-11-22 00:02:29 +0000
61f9ba6
update README.md by
2023-11-20 16:38:18 +0100
69c5172
Update 'README.md' by
2023-11-20 16:37:27 +0100
14ece9b
added functions for uniform and not uniform entry end points - non uniform endpoints are generally parsed as text from any paragraph xml element p by
2023-11-20 15:28:04 +0000
b2cf4b6
added first config parameters for search on not uniform entries by
2023-11-15 17:27:54 +0000
42841ee
added some exceptions for bad encoding and get errors by
2023-11-14 14:38:45 +0000
317ef99
changed code in entrylist data2dictionary to handle empty or missing xml elements by
2023-11-14 10:22:26 +0000
ff23c22
added working bund.de-bekanntmachungen config with new example of xpath contains by
2023-11-13 16:44:11 +0000
06fa81e
added function find config parameter and changed core spider by
2023-11-10 01:12:49 +0000
a846ce0
specifying the links, new exception clause if soupparser does not work by
2023-11-07 14:55:05 +0000
a998817
first function works, actual xml parser still has problems with certain xml types by
2023-11-06 19:19:31 +0000
c078ee4
first function works, actual xml parser still has problems with certain xml types by
2023-11-06 19:17:45 +0000
8b20bc1
added multi pages configuration and code by
2023-11-06 18:17:32 +0000
7aa9038
update to config.yaml by
2023-11-03 12:23:04 +0000
59838bb
added main.py importing and using the spider functions by
2023-11-02 10:54:16 +0000
5ac07d1
added first config.yaml template and started creating folder structure by
2023-10-31 17:41:44 +0000
b3011ef
small change of naming in error message added by
2023-10-30 16:43:32 +0000
687d40f
first change of naming, first commit for the actual spider based on importPEP by
2023-10-30 16:41:14 +0000
8783251
first commit by
2023-10-30 14:35:32 +0000