A comprehensive news parsing application that extracts articles from various Russian news sources, classifies events, extracts location information, and stores data in MongoDB. The application includes a web-based visualization component to display news events on an interactive map.
- Multi-source news scraping with Selenium and BeautifulSoup
- Support for both JavaScript-rendered and static pages
- Automatic location extraction using OpenAI GPT
- Geocoding with Yandex Maps API
- News event classification using OpenAI GPT
- Data storage in MongoDB
- Parallel processing for efficient scraping
- Interactive map visualization of news events
news-parser/
├── .env # Environment variables
├── .gitkeep # Git placeholder
├── .gitignore # Git ignore file
├── README.md # Project documentation
├── Dockerfile # Main application Docker configuration
├── Dockerfile.frontend # Frontend Docker configuration
├── docker-compose.yml # Docker Compose configuration
├── requirements.txt # Python dependencies for main app
├── requirements.frontend.txt # Python dependencies for frontend
├── backend_parsing_template.ipynb # Jupyter notebook template for parsing
├── data/
│ ├── .gitkeep # Git placeholder
│ ├── processed_articles.csv # Processed news articles
│ └── raw_articles.csv # Raw scraped articles
├── frontend/
│ ├── index.html # Main web interface
│ ├── make_sample.py # Sample data generator for frontend
│ └── server.py # Frontend web server
├── reports_and_documents/
│ ├── documents_backend_stage/
│ │ └── Report.md # Backend development reports
│ └── documents_planning/
│ ├── data_flow.svg # Data flow diagram
│ ├── detailed.svg # Detailed architecture diagram
│ ├── high-level.svg # High-level architecture diagram
│ ├── initial_work.ipynb # Initial project planning notebook
│ ├── mongodb_arch.svg # MongoDB architecture diagram
│ └── Regional News Sources.pdf # Documentation of news sources
├── src/
│ ├── __init__.py # Package initializer
│ ├── main.py # Main application entry point
│ ├── config/
│ │ ├── __init__.py # Config package initializer
│ │ └── sources.py # News sources configuration
│ ├── parsers/
│ │ ├── __init__.py # Parsers package initializer
│ │ ├── base_parser.py # Base parser class
│ │ └── news_parser.py # News parser implementation
│ └── utils/
│ └── __init__.py # Utilities package initializer
└── tests/ # Test directory
- Docker and Docker Compose
- OpenAI API key
- Yandex Maps API key (for geocoding)
- Mapbox API key (for frontend map visualization)
-
Clone the repository:
git clone https://github.com/yourusername/news-parser.git cd news-parser -
Create environment variables file: Create a
.envfile in the root directory with the following variables:OPENAI_API_KEY=your_openai_api_key YANDEX_API_KEY=your_yandex_api_key MONGO_URI=mongodb://mongodb:27017/ MONGO_DB=news_classification MONGO_COLLECTION=events MAX_WORKERS=4 -
Set up Mapbox API key: You need to obtain a Mapbox API key from Mapbox to enable the map visualization.
Open the file
frontend/index.htmland locate the following line:mapboxgl.accessToken = 'pk...';
Replace the token value with your own Mapbox access token. The line should look like:
mapboxgl.accessToken = 'your_mapbox_access_token';
-
Build and start the application using Docker Compose:
docker-compose build docker-compose up -d
The backend parser will run automatically when you start the Docker containers. It:
- Scrapes news from configured sources
- Processes and extracts relevant information
- Classifies news events
- Extracts and geocodes locations
- Stores the data in MongoDB
To check logs:
docker-compose logs -f news-parserAccess the news visualization map in your browser:
http://localhost:5000
The map provides:
- Visual representation of news events on a world map
- Filtering by news category
- Search functionality
- Detailed news information on click
The interactive map uses Mapbox GL JS to display news events with the following capabilities:
- Zoom in to see more detailed geographical features (towns, streets, buildings)
- Hover over markers to view brief information about the news event
- Click on markers to open detailed news articles
- Filter news events by category or search text
- Color-coded markers based on event categories
To add more news sources, modify the src/config/sources.py file:
NEWS_SOURCES = [
{
"name": "Source Name",
"url": "https://source-url.com",
"selector": "article selector",
"requires_js": True/False
},
# Add more sources here
]- Create a new file in the
src/parsersdirectory - Extend the
BaseParserclass frombase_parser.py - Implement the required methods
- Register your parser in the main application
Example:
from src.parsers.base_parser import BaseParser
class CustomParser(BaseParser):
def __init__(self, config):
super().__init__(config)
def parse(self, url):
# Implement your parsing logic
passRun tests using:
docker-compose exec news-parser python -m pytest- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add some amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.