
Add Negative Labels [PSCDD] #57

@dieko95

Important: I will pause this issue and work on #75 first. Fetching negative-labelled articles might take at least one week.

Problem

The NGO's tagged data only contains positive labels (e.g., this tweet IS a public service report). At this point, we haven't included negative labels (e.g., this tweet is NOT a public service report).

Proposed Solution

Add negative labels from the 2020 tagged data.

Tasks

  • Use the data annotated last year in C4V for Negative Labels (and positive if quick)

    • Compare last year's annotated data schema with positive label dataset to assess if it's possible to include it.
    • If possible, add the negative labels to the development dataset.
      The labels are not compatible with Add Positive labels [PSCDD] #56. At this point, unifying the schemas would take more effort than web scraping from scratch.
  • If it's not possible to use last year's negative labels
    - [ ] Web scrape El Pitazo articles whose URLs are not in the Add Positive labels [PSCDD] - elpitazo #48 dataset.
      This will give us articles that are not about public service problems.
    - [ ] Concatenate these articles with the Add Positive labels [PSCDD] - elpitazo #48 dataset.
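
The concatenation step could look roughly like the sketch below. It assumes each article is a dict with a `url` key; the `url` and `label` field names and the `label_and_concatenate` helper are hypothetical and not taken from the PSCDD code base.

```python
from typing import Dict, List


def label_and_concatenate(positives: List[Dict], negatives: List[Dict]) -> List[Dict]:
    """Tag each record with a binary label and return one combined list.

    Assumption: every record has a "url" key. Positives are processed first,
    so a URL that shows up in both inputs keeps its positive label.
    """
    combined = []
    seen_urls = set()
    for label, records in ((1, positives), (0, negatives)):
        for record in records:
            url = record["url"]
            if url in seen_urls:
                continue  # already added, skip the duplicate
            seen_urls.add(url)
            combined.append({**record, "label": label})
    return combined
```

Dropping duplicates by URL here is a safeguard in case a positive article slips into the scraped negatives.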

Negative labels web scraping strategy

  1. Loop over elpitazo.net/category/<LOCATION>/page/<N> to get all the article links for the locations in the PSCDD positive labels dataset.
  2. Select the links that aren't in the PSCDD positive labels dataset.
  3. Web scrape these links with the PSCDD elpitazo web scraper.
  • Create elpitazo page discovery web scraper
    • Extract links
    • Extract news articles' date. This will take a bit more time than I thought.
  • Fetch El Pitazo links for occidente and store them in a list.
  • Find the links that don't match any PSCDD positive labels link.
  • Web scrape the links that did not match a positive label link.
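
A minimal sketch of the discovery and filtering steps above, using only the standard library and leaving out the actual HTTP requests. The `https` scheme, the helper names, and the URL normalization rule are assumptions. The extractor also collects every anchor on a page, so a real scraper would still need to restrict it to article links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumption: the issue only gives the host and path, not the scheme.
BASE_URL = "https://elpitazo.net"


def category_page_url(location: str, page: int) -> str:
    """Build the listing URL for one location and page number."""
    return f"{BASE_URL}/category/{location}/page/{page}"


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str = BASE_URL) -> list:
    """Return all links found in an already downloaded listing page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def normalize(url: str) -> str:
    """Ignore scheme, query, host casing, and trailing slash when comparing."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def select_negative_candidates(discovered: list, positive_links: list) -> list:
    """Keep discovered links that are not in the positive labels dataset."""
    positive_keys = {normalize(url) for url in positive_links}
    seen = set()
    candidates = []
    for url in discovered:
        key = normalize(url)
        if key in positive_keys or key in seen:
            continue
        seen.add(key)
        candidates.append(url)
    return candidates
```

Comparing normalized URLs rather than raw strings avoids treating the same article as a negative candidate only because of a trailing slash or an `http` versus `https` difference.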

Notes

count                    2401
unique                    397
top       2020-06-10 00:00:00
freq                       27
first     2019-05-02 00:00:00
last      2020-10-30 00:00:00

News articles per location

location             count
occidente              519
gran-caracas           403
oriente                396
los-andes              287
los-llanos             284
centro                 196
guayana                 93
pitazo-en-la-calle      88
regiones                64
economia                21
infociudadanos          16
tecnologia              10
vista_2                  8
reportajes               4
radio                    3
alianzas                 2
sucesos                  2
salud                    2
sin-categoria            1
fotogalerias             1
cronicas                 1
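
For reference, a per-location tally like the one above could be reproduced with a few lines of standard-library Python. This is only a sketch: the `location` field name and the `articles_per_location` helper are hypothetical, and how the location is derived for each article is not covered here.

```python
from collections import Counter


def articles_per_location(records):
    """Count articles per location, most frequent first.

    Assumption: each record is a dict with a "location" key.
    """
    counts = Counter(record["location"] for record in records)
    return counts.most_common()
```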
