176 changes: 176 additions & 0 deletions notebooks/webscraper/2.negative_labels_pscdd.ipynb
@@ -0,0 +1,176 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "broken-thought",
"metadata": {},
"source": [
"# Negative Labels\n",
"\n",
"`elpitazo` news articles are categorized by geographic region, e.g., Gran Caracas, Occidente, Centro, Oriente, Los Llanos, Los Andes, and Guayana.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "neural-perception",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"from bs4 import BeautifulSoup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "creative-string",
"metadata": {},
"outputs": [],
"source": [
"df_positivelabels_original = pd.read_csv(\"../../data/processed/webscraping/elpitazo_positivelabels_devdataset.csv\")"
]
},
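{
"cell_type": "markdown",
"id": "civilian-horizon",
"metadata": {},
"source": [
"# Descriptive Analysis\n",
"\n",
"In this section I want to understand what kind of news articles I should scrape. The table counts the positive-label articles by the category segment of their URL, and the block after it summarizes their publication dates.\n",
"\n",
"| category | count |\n",
"|:-------------------|--------:|\n",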
"| occidente | 519 |\n",
"| gran-caracas | 403 |\n",
"| oriente | 396 |\n",
"| los-andes | 287 |\n",
"| los-llanos | 284 |\n",
"| centro | 196 |\n",
"| guayana | 93 |\n",
"| pitazo-en-la-calle | 88 |\n",
"| regiones | 64 |\n",
"| economia | 21 |\n",
"| infociudadanos | 16 |\n",
"| tecnologia | 10 |\n",
"| vista_2 | 8 |\n",
"| reportajes | 4 |\n",
"| radio | 3 |\n",
"| alianzas | 2 |\n",
"| sucesos | 2 |\n",
"| salud | 2 |\n",
"| sin-categoria | 1 |\n",
"| fotogalerias | 1 |\n",
"| cronicas | 1 |\n",
"\n",
"\n",
"```\n",
"count 2401\n",
"unique 397\n",
"top 2020-06-10 00:00:00\n",
"freq 27\n",
"first 2019-05-02 00:00:00\n",
"last 2020-10-30 00:00:00\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dominican-rover",
"metadata": {},
"outputs": [],
"source": [
"## Time\n",
"# pd.to_datetime(df_positivelabels_original.fecha, infer_datetime_format=True).describe()\n",
"\n",
"## Location\n",
"# Item 3 of the '/' split is the first path segment of each link, which holds the category\n",
"vcounts_location = df_positivelabels_original.link_de_la_noticia.str.split(\"/\", expand=True)[3].value_counts()\n",
"vcounts_location.name = \"count\"\n",
"# print(vcounts_location.to_markdown())"
]
},
{
"cell_type": "markdown",
"id": "fixed-binding",
"metadata": {},
"source": [
"# Web Scraper\n",
"\n",
"## Strategy\n",
"\n",
"1. Loop over `elpitazo.net/category/<LOCATION>/page/<N>` to collect every article link for each location that appears in the `PSCDD` positive labels dataset.\n",
"2. Select the links that aren't in the `PSCDD` positive labels dataset.\n",
"3. Scrape these links with the `PSCDD` elpitazo web scraper.\n",
"\n",
"\n",
"- [ ] Create elpitazo page discovery web scraper\n",
"    - [x] Extract links\n",
"    - [ ] Extract news articles' dates. _This will take a bit more time than I thought_.\n",
"\n",
"- [ ] Fetch elpitazo links for `occidente` and store them in a list.\n",
"- [ ] Find links that don't match any `PSCDD` positive-label link.\n",
"- [ ] Scrape the links that don't match any positive label."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "amazing-angola",
"metadata": {},
"outputs": [],
"source": [
"def elpitazo_page_discovery(url: str) -> list:\n",
"    \"\"\"Return the article links listed on one elpitazo category page.\"\"\"\n",
"    headers = {\n",
"        \"User-Agent\": \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36\"\n",
"    }\n",
"\n",
"    page = requests.get(url, headers=headers, timeout=20)\n",
"\n",
"    soup = BeautifulSoup(page.content, \"html.parser\")\n",
"\n",
"    ## TODO: Include date to find the articles that match the date of the positive labels PSCDD\n",
"    # _date = soup.find_all(\"div\", {\"class\": \"td-editor-date\"})\n",
"\n",
"    # Each headline is an <h3> whose first <a> holds the article URL\n",
"    _links = soup.find_all(\"h3\", {\"class\": \"entry-title td-module-title\"})\n",
"\n",
"    ls_links = []\n",
"    for headline in _links:\n",
"        anchor = headline.find(\"a\")\n",
"        # Skip headlines without a usable link instead of raising IndexError\n",
"        if anchor is not None and anchor.get(\"href\"):\n",
"            ls_links.append(anchor.get(\"href\"))\n",
"\n",
"    return ls_links\n",
"\n",
"# Page URL follows the `category/<LOCATION>/page/<N>` pattern from the strategy above\n",
"test = elpitazo_page_discovery(\"https://elpitazo.net/category/occidente/page/90\")\n",
"test"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
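
Review note on the notebook's Strategy section: steps 1 and 2 could be sketched roughly as below. This is a minimal sketch, not the notebook's implementation. The helper names `build_page_urls`, `normalize`, and `select_unlabeled_links` are hypothetical, the URL pattern is taken from the strategy text, and treating links that differ only by a trailing slash as equal is an assumption.

```python
from typing import Iterable, List


def build_page_urls(location: str, first_page: int, last_page: int) -> List[str]:
    """Build listing URLs with the category/<LOCATION>/page/<N> pattern."""
    base = "https://elpitazo.net/category"
    return [f"{base}/{location}/page/{n}" for n in range(first_page, last_page + 1)]


def normalize(link: str) -> str:
    """Assumed normalization: trim whitespace and one or more trailing slashes."""
    return link.strip().rstrip("/")


def select_unlabeled_links(discovered: Iterable[str], positive: Iterable[str]) -> List[str]:
    """Keep discovered links that are absent from the positive labels, in order."""
    seen = {normalize(p) for p in positive}
    result = []
    for link in discovered:
        key = normalize(link)
        if key not in seen:
            seen.add(key)  # also drops duplicates among the discovered links
            result.append(link)
    return result


if __name__ == "__main__":
    # Made-up links, used only to illustrate the filtering step
    discovered = [
        "https://elpitazo.net/occidente/a/",
        "https://elpitazo.net/occidente/b/",
    ]
    positive = ["https://elpitazo.net/occidente/a"]
    print(build_page_urls("occidente", 1, 2))
    print(select_unlabeled_links(discovered, positive))
```

In the notebook, `positive` would be `df_positivelabels_original.link_de_la_noticia` and `discovered` would be the concatenated output of `elpitazo_page_discovery` over the URLs from `build_page_urls`. Using a set for the lookup keeps the filter linear in the number of links.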