Description
Important: I will pause this issue and advance on #75. Fetching negative-labelled articles might take at least 1 week.
Problem
The NGO's tagged data only contains positive labels (e.g., this tweet IS a public service report). At this point, we haven't included negative labels (e.g., this tweet is NOT a public service report).
Proposed Solution
Add negative labels from the 2020 tagged data.
Tasks
- Use the data annotated last year in C4V for negative labels (and positive labels, if quick)
- Compare last year's annotated data schema with positive label dataset to assess if it's possible to include it.
- If possible, add the negative labels to the development dataset.
The labels are not compatible with Add Positive labels [PSCDD] #56. At this point, unifying the schemas would take more effort than web scraping from scratch.
- If it's not possible to use last year's negative labels:
- [ ] Web scrape el pitazo articles whose URLs are not in the Add Positive labels [PSCDD] - elpitazo #48 dataset.
- This will give us articles that are not about public service problems.
- [ ] Concatenate these articles with the Add Positive labels [PSCDD] - elpitazo #48 dataset.
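The concatenation step could look roughly like the sketch below. It assumes each article is a dict with a `url` key; the `label` key and the function name `build_dataset` are hypothetical and not taken from the `PSCDD` code.

```python
def build_dataset(positive_rows, scraped_rows):
    """Combine positive rows with scraped rows whose URL is not already positive.

    Both inputs are lists of dicts with at least a "url" key (assumed schema).
    Positive rows get label 1, the remaining scraped rows get label 0.
    """
    dataset = [{**row, "label": 1} for row in positive_rows]
    seen = {row["url"] for row in positive_rows}
    for row in scraped_rows:
        if row["url"] in seen:
            continue  # already a positive example, or a duplicate scrape
        seen.add(row["url"])
        dataset.append({**row, "label": 0})
    return dataset
```

Filtering on the URL before labelling keeps an article from appearing with both labels.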
Negative labels web scraping strategy
- Loop over `elpitazo.net/category/<LOCATION>/page/<N>` to collect all article links for each `<LOCATION>` in the `PSCDD` positive labels dataset.
- Select the links that aren't in the `PSCDD` positive labels dataset.
- Web scrape these links with the `PSCDD` elpitazo web scraper.
- Create elpitazo page discovery web scraper
- Extract links
- Extract news articles' date. This will take a bit more time than I thought.
- Fetch el pitazo links for `occidente` and store them in a list.
- Find links that don't match the `PSCDD` positive labels links.
- Web scrape the links that don't match the positive label links.
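A minimal sketch of the discovery and matching steps, using only the standard library. The helper names (`category_page_url`, `extract_links`, `unmatched_links`) are hypothetical, the `https` scheme is an assumption, and fetching the page HTML is left out. Note that this collects every `<a href>` on a page, so a real scraper would probably need to restrict it to the article listing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://elpitazo.net"  # scheme is an assumption


def category_page_url(location, page):
    """Build the listing URL for one location and page number."""
    return f"{BASE_URL}/category/{location}/page/{page}"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, page_url):
    """Return absolute URLs for all links found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.links]


def normalize(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def unmatched_links(candidates, positive_links):
    """Keep candidate links that are not in the positive labels links."""
    known = {normalize(url) for url in positive_links}
    seen, result = set(), []
    for url in candidates:
        key = normalize(url)
        if key not in known and key not in seen:
            seen.add(key)
            result.append(url)
    return result
```

Comparing normalized URLs rather than raw strings avoids treating `http`/`https` or trailing-slash variants of the same article as different links.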
Notes
- The dates from the Add Positive labels [PSCDD] #56 dataset range from 2019-05-02 to 2020-10-30. The most frequent date is 2020-06-10, with 27 articles.
count 2401
unique 397
top 2020-06-10 00:00:00
freq 27
first 2019-05-02 00:00:00
last 2020-10-30 00:00:00
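The summary above looks like the output of a pandas `describe` call on a datetime column, although that is an assumption. The same fields can be computed with the standard library, as in this sketch (`summarize_dates` is a hypothetical helper):

```python
from collections import Counter
from datetime import date


def summarize_dates(dates):
    """Compute count, unique, top, freq, first and last for a list of dates."""
    counts = Counter(dates)
    top, freq = counts.most_common(1)[0]  # most frequent date and its count
    return {
        "count": len(dates),
        "unique": len(counts),
        "top": top,
        "freq": freq,
        "first": min(dates),
        "last": max(dates),
    }
```

Here `top` is the date that occurs most often, not the end of the range; the range is given by `first` and `last`.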
News articles per location
| location | count |
|---|---|
| occidente | 519 |
| gran-caracas | 403 |
| oriente | 396 |
| los-andes | 287 |
| los-llanos | 284 |
| centro | 196 |
| guayana | 93 |
| pitazo-en-la-calle | 88 |
| regiones | 64 |
| economia | 21 |
| infociudadanos | 16 |
| tecnologia | 10 |
| vista_2 | 8 |
| reportajes | 4 |
| radio | 3 |
| alianzas | 2 |
| sucesos | 2 |
| salud | 2 |
| sin-categoria | 1 |
| fotogalerias | 1 |
| cronicas | 1 |
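A breakdown like the one above could be derived from the article URLs, under the assumption (not confirmed here) that the location is the first path segment of each URL. `location_of` and `count_by_location` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlsplit


def location_of(url):
    """Return the first path segment of the URL (assumed to be the location)."""
    segments = [part for part in urlsplit(url).path.split("/") if part]
    return segments[0] if segments else "unknown"


def count_by_location(urls):
    """Count articles per location, most frequent first."""
    return Counter(location_of(url) for url in urls).most_common()
```

For example, `count_by_location` applied to the positive labels links would return `(location, count)` pairs sorted by descending count.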