alpcentaur
|
0500f5853d
|
full working example from localhost
|
2024-01-15 21:08:23 +00:00 |
|
alpcentaur
|
0411d74936
|
deleted config.yaml.save
|
2024-01-15 19:12:04 +00:00 |
|
alpcentaur
|
cf3bb52684
|
corrected link gluing in the for loop for pdf links
|
2024-01-15 19:09:28 +00:00 |
|
alpcentaur
|
af8374f715
|
added another exception for the unitrue var text not being found; previously, saving index 0 to the variable caused an error in the whole execution
|
2024-01-10 15:28:41 +00:00 |
|
alpcentaur
|
20db0028e1
|
added first changes to fix js related bug for giz db
|
2024-01-10 15:18:36 +01:00 |
|
alpcentaur
|
dec60f9bf5
|
added changed logic for link addition regarding entry links
|
2023-12-18 21:26:53 +00:00 |
|
alpcentaur
|
5d17f4e421
|
corrected error which arose in the logic of the wget backup get
|
2023-12-15 14:36:08 +01:00 |
|
alpcentaur
|
92c238a2ed
|
added instruction for downloading chromium driver for python selenium to README.md
|
2023-12-15 14:13:41 +01:00 |
|
alpcentaur
|
ece5cf1301
|
added better logic for getting the right link of entry
|
2023-12-15 13:34:23 +01:00 |
|
alpcentaur
|
0e58756600
|
added last resort exception for entry page downloading with wget, also implemented some further logic regarding getting the right links
|
2023-12-15 11:33:50 +00:00 |
|
alpcentaur
|
16199256e3
|
javascript on highest level done better
|
2023-12-14 23:37:10 +00:00 |
|
alpcentaur
|
5627c80177
|
merged onlinkgen with master, and added more universal chrome driver initialization to the beginning of the javascript entries gothrough function in download_entry_list_pages_of_funding_databases()
|
2023-12-14 12:38:14 +00:00 |
|
alpcentaur
|
14b8db7941
|
started adding javascript handling on highest spider level
|
2023-12-14 12:07:14 +00:00 |
|
alpcentaur
|
fbee5d6229
|
last commit in detached head
|
2023-12-13 16:20:27 +01:00 |
|
alpcentaur
|
953f85ee5b
|
added new lines to chromedriver, to make it work on other systems
|
2023-12-13 16:05:26 +01:00 |
|
alpcentaur
|
d2324d265a
|
added pdf child text downloading and parse to json exceptions/cases for javascript entry data and normal data
|
2023-12-06 13:46:54 +00:00 |
|
alpcentaur
|
885c210971
|
added selenium for pop up entry links
|
2023-12-05 22:19:00 +00:00 |
|
alpcentaur
|
ec180bed0a
|
added flow for selenium grabbing popup instead of links for entries
|
2023-12-05 22:16:07 +00:00 |
|
alpcentaur
|
b4fd385c5d
|
did some changes to main.py for using sys.argv
|
2023-12-05 18:23:57 +01:00 |
|
alpcentaur
|
99c74dcbad
|
updated requirements.txt
|
2023-12-05 16:59:13 +00:00 |
|
alpcentaur
|
54daad8dfa
|
started sys arguments for main.py, to be able to control spider from interface
|
2023-12-05 17:51:16 +01:00 |
|
alpcentaur
|
89dcca2031
|
added further handling for javascript links not being urls, made config for giz work
|
2023-11-28 15:27:39 +00:00 |
|
alpcentaur
|
a0075e429d
|
added further database in config.yaml, added new exception for downloading js generated html pages
|
2023-11-27 15:10:11 +00:00 |
|
alpcentaur
|
df4a8289b8
|
added pdf parser if entry link is direct pdf
|
2023-11-22 17:03:15 +00:00 |
|
alpcentaur
|
677e54c0c2
|
added trafilatura to requirements
|
2023-11-22 00:07:59 +00:00 |
|
alpcentaur
|
9ceaa28a82
|
Merge remote-tracking branch 'refs/remotes/origin/master'
merging with a small README.md change
|
2023-11-22 00:04:16 +00:00 |
|
alpcentaur
|
d3335f203b
|
added trafilatura exception
|
2023-11-22 00:02:29 +00:00 |
|
alpcentaur
|
61f9ba67fb
|
update README.md
|
2023-11-20 16:38:18 +01:00 |
|
alpcentaur
|
69c517292b
|
Update 'README.md'
|
2023-11-20 16:37:27 +01:00 |
|
alpcentaur
|
14ece9bceb
|
added functions for uniform and non-uniform entry endpoints; non-uniform endpoints are generally parsed as text from any paragraph xml element p
|
2023-11-20 15:28:04 +00:00 |
|
alpcentaur
|
b2cf4b67ce
|
added first config parameters for search on not uniform entries
|
2023-11-15 17:27:54 +00:00 |
|
alpcentaur
|
42841ee650
|
added some exceptions for bad encoding and get errors
|
2023-11-14 14:38:45 +00:00 |
|
alpcentaur
|
317ef99720
|
changed code in entrylist data2dictionary to handle empty or missing xml elements
|
2023-11-14 10:22:26 +00:00 |
|
alpcentaur
|
ff23c22e3c
|
added working bund.de-bekanntmachungen config with new example of xpath contains
|
2023-11-13 16:44:11 +00:00 |
|
alpcentaur
|
06fa81e549
|
added function find config parameter and changed core spider
|
2023-11-10 01:12:49 +00:00 |
|
alpcentaur
|
a846ce04cc
|
specifying the links, new exception clause if soupparser does not work
|
2023-11-07 14:55:05 +00:00 |
|
alpcentaur
|
a99881796a
|
first function works, actual xml parser still has problems with certain xml types
|
2023-11-06 19:19:31 +00:00 |
|
alpcentaur
|
c078ee4b1b
|
first function works, actual xml parser still has problems with certain xml types
|
2023-11-06 19:17:45 +00:00 |
|
alpcentaur
|
8b20bc178f
|
added multi pages configuration and code
|
2023-11-06 18:17:32 +00:00 |
|
alpcentaur
|
7aa903883b
|
update to config.yaml
|
2023-11-03 12:23:04 +00:00 |
|
alpcentaur
|
59838bb8e1
|
added main.py importing and using the spider functions
|
2023-11-02 10:54:16 +00:00 |
|
alpcentaur
|
5ac07d151a
|
added first config.yaml template and started creating folder structure
|
2023-10-31 17:41:44 +00:00 |
|
alpcentaur
|
b3011efc73
|
small change of naming in error message added
|
2023-10-30 16:43:32 +00:00 |
|
alpcentaur
|
687d40f156
|
first change of naming, first commit for the actual spider based on importPEP
|
2023-10-30 16:41:14 +00:00 |
|
alpcentaur
|
8783251133
|
first commit
|
2023-10-30 14:35:32 +00:00 |
|