- JSON Endpoint Scraper - Fast scraping using Reddit's `.json` endpoints (no authentication required)
- Advanced Requests Scraper - Custom pagination and bulk scraping capabilities
- Proxy Rotation - Automatic proxy switching with health monitoring
- Captcha Solving - Automated captcha handling using Capsolver API
- User Agent Rotation - Realistic browser simulation
- Rate Limiting - Respectful request throttling
- Rich CLI Interface - Beautiful command-line interface with progress bars
- Multiple Export Formats - JSON and CSV output with full comment thread data
Install with uv:

```bash
git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper
uv venv
source .venv/bin/activate
uv pip install -e .
```
Or with pip and a standard virtual environment:

```bash
git clone https://github.com/proxidize/reddit-scraper.git
cd reddit-scraper
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
To include development dependencies:

```bash
pip install -e .[dev]
# or, with uv
uv pip install -e .[dev]
```
```bash
python tests/run_tests.py
pytest tests/ -v --cov=reddit_scraper
pytest tests/unit/ -v -m unit
pytest tests/integration/ -v -m integration
pytest tests/ -v -m "not slow"
pytest tests/ --cov=reddit_scraper --cov-report=html
```
Test markers:

- `unit` - Fast unit tests
- `integration` - Integration tests that may hit external APIs
- `slow` - Slow tests that should be skipped in CI
```bash
docker build -t reddit-scraper .
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper interactive --config config.json
docker run -v $(pwd)/config.json:/app/config.json reddit-scraper json subreddit python --limit 10 --config config.json
docker run -v $(pwd)/config.json:/app/config.json -v $(pwd)/output:/app/output reddit-scraper json subreddit python --limit 10 --output output/posts.json --config config.json
```
```bash
python3 -m reddit_scraper.cli interactive
python3 -m reddit_scraper.cli interactive --config config.json
python3 -m reddit_scraper.cli json subreddit python --limit 10
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50
```
Note: If you've installed the package with `pip install -e .`, you can use `reddit-scraper` directly instead of `python3 -m reddit_scraper.cli`.
The scraper uses a JSON configuration file to manage all settings including proxies, captcha solvers, and scraping preferences.
Copy `config.example.json` to `config.json` and edit:
```json
{
  "proxies": [
    {
      "host": "proxy1.example.com",
      "port": 8080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "http"
    },
    {
      "host": "proxy2.example.com",
      "port": 1080,
      "username": "your_proxy_username",
      "password": "your_proxy_password",
      "proxy_type": "socks5"
    }
  ],
  "captcha_solvers": [
    {
      "api_key": "CAP-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
      "provider": "capsolver",
      "site_keys": {
        "reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-",
        "www.reddit.com": "6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-"
      }
    }
  ],
  "scraping": {
    "default_delay": 1.0,
    "max_retries": 3,
    "requests_per_minute": 60,
    "user_agent": "RedditScraper/1.0.0",
    "rotate_user_agents": true
  }
}
```
- Multiple Proxies: Add multiple HTTP and SOCKS5 proxies for automatic rotation
- Captcha Solving: Integrate with Capsolver for automated captcha handling with custom site keys
- Input Validation: Automatic validation of subreddit names, usernames, and other inputs
- Flexible Configuration: Easy JSON-based configuration management with validation
- Health Monitoring: Built-in proxy health checking and performance monitoring
```bash
cp config.example.json config.json
nano config.json
python3 -m reddit_scraper.cli status --config config.json
```
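Once `config.json` exists, you can also load and inspect it from Python (a minimal sketch, assuming `get_config_manager` from the Python API section below and that the scraping settings expose the JSON keys as attributes):

```python
from reddit_scraper import get_config_manager

# Load and validate config.json (use whatever path you chose)
config_manager = get_config_manager("config.json")

scraping = config_manager.get_scraping_config()
# Attribute names are assumed to mirror the JSON keys shown above
print(f"Delay between requests: {scraping.default_delay}s")
print(f"Requests per minute: {scraping.requests_per_minute}")
```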
The scraper includes robust input validation and data processing capabilities:
- Subreddit Names: Validates format, length (1-21 chars), and checks for reserved names
- Usernames: Validates Reddit username format (3-20 chars, alphanumeric plus underscore/hyphen)
- Post IDs: Ensures proper Reddit post ID format
- URLs: Validates and normalizes Reddit URLs
- Comment Threading: Maintains proper parent-child relationships in comment trees
- Data Cleaning: Removes unnecessary metadata while preserving essential information
- Field Standardization: Consistent field names and data types across all scraped content
```python
from reddit_scraper import JSONScraper, ValidationError

scraper = JSONScraper()

try:
    # Invalid subreddit names are rejected before any request is made
    posts = scraper.scrape_subreddit("invalid-name!", "hot", 10)
except ValidationError as e:
    print(f"Validation error: {e}")
```
```bash
python3 -m reddit_scraper.cli interactive [--config CONFIG_FILE]
python3 -m reddit_scraper.cli json subreddit SUBREDDIT_NAME [--config CONFIG_FILE] [options]
python3 -m reddit_scraper.cli json user USERNAME [options]
python3 -m reddit_scraper.cli json comments SUBREDDIT POST_ID [options]
python3 -m reddit_scraper.cli json subreddit-with-comments SUBREDDIT_NAME [options]
```
Extract rich comment data with full thread structure:
```bash
python3 -m reddit_scraper.cli json subreddit-with-comments python --limit 10 --include-comments --comment-limit 20 --output posts_with_comments.json
python3 -m reddit_scraper.cli json comments python POST_ID --sort best --output single_post_comments.json
python3 -m reddit_scraper.cli json user username --limit 25 --sort top --output user_posts.json
```
Comment Data Includes:
- Author information and scores
- Full comment text and timestamps
- Nested reply structure
- Thread hierarchy and relationships
- Community engagement metrics
Real Example (Actual Scraped Data):
```json
{
  "title": "A simple home server to wirelessly stream any video file",
  "author": "Enzo10091",
  "score": 8,
  "num_comments": 1,
  "comment_count_scraped": 1,
  "comments": [
    {
      "id": "lwg8h3x",
      "author": "ismail_the_whale",
      "body": "nice, but you really have to clean this up. i guess you're not a python dev.\n\n- use snake_case\n- use a pyproject.toml file",
      "score": 2,
      "created_utc": 1755262448.0,
      "parent_id": "t3_1mqw7zr",
      "replies": []
    }
  ]
}
```
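Because replies are nested, a scraped thread can be walked recursively. A minimal sketch that flattens the comment tree of each post (assuming the output file holds a list of post objects shaped like the example above):

```python
import json

def walk_comments(comments, depth=0):
    """Yield (depth, author, body) for every comment and nested reply."""
    for comment in comments:
        yield depth, comment["author"], comment["body"]
        # Recurse into replies to preserve the thread hierarchy
        yield from walk_comments(comment.get("replies", []), depth + 1)

with open("posts_with_comments.json") as f:
    posts = json.load(f)

for post in posts:
    print(post["title"])
    for depth, author, body in walk_comments(post.get("comments", [])):
        print("  " * (depth + 1) + f"{author}: {body[:60]}")
```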
```bash
# Paginated bulk scraping with the requests-based scraper
python3 -m reddit_scraper.cli requests paginated SUBREDDIT_NAME [options]

# Check configuration and proxy health
python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli test-proxies --config config.json --test-urls 3
```
```bash
# Set up configuration
cp config.example.json config.json
nano config.json
python3 -m reddit_scraper.cli status --config config.json

# Search within a subreddit
python3 -m reddit_scraper.cli search "python tips" --subreddit python
python3 -m reddit_scraper.cli search "neural networks" --subreddit MachineLearning
```
Reddit has some protection against automated scraping:
- Some subreddits may trigger captcha challenges (r/webscraping, etc.)
- Large bulk requests may hit rate limits
- Search endpoints work but may be slower than direct scraping
Recommended approach:
- Use interactive mode for best success rate
- Start with popular, stable subreddits like `python` and `technology`
- Use proxies and captcha solving for reliable large-scale scraping
- Search functionality works well for targeted queries
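For larger runs, a conservative delay plus a simple retry loop goes a long way. A minimal sketch using the Python API described below (the generic exception handling is illustrative, not the scraper's built-in retry logic):

```python
import time
from reddit_scraper import JSONScraper

# A generous delay keeps the request rate well below typical limits
scraper = JSONScraper(delay=2.0)

subreddits = ["python", "technology", "programming"]
results = {}

for name in subreddits:
    for attempt in range(3):
        try:
            results[name] = scraper.scrape_subreddit(name, "hot", 50)
            break
        except Exception as exc:  # broad catch for illustration; narrow it in real use
            wait = 30 * (attempt + 1)
            print(f"{name}: attempt {attempt + 1} failed ({exc}), retrying in {wait}s")
            time.sleep(wait)
```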
```bash
python3 -m reddit_scraper.cli interactive --config config.json
python3 -m reddit_scraper.cli json subreddit python --limit 10
python3 -m reddit_scraper.cli json subreddit technology --config config.json --limit 50
python3 -m reddit_scraper.cli search "python tips" --subreddit python
python3 -m reddit_scraper.cli requests paginated python --max-posts 100
python3 -m reddit_scraper.cli status --config config.json
python3 -m reddit_scraper.cli test-proxies --config config.json
```
Subreddits that work well:

- `python`, `programming`, `technology`
- `news`, `todayilearned`
- `entrepreneur`, `startups`
- `--config`, `-c` - Path to configuration file
- `--output`, `-o` - Output file path
- `--format` - Output format (json, csv)
- `--limit` - Number of items to fetch
- `--sort` - Sort method (hot, new, top, rising, etc.)
- `--delay` - Delay between requests (seconds)
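These options map directly onto the Python API shown below; for example, `--sort`, `--limit`, and `--delay` correspond to the arguments in this sketch, and a `--format csv`-style export can be reproduced with the standard `csv` module (the field names and dict-shaped posts are assumptions based on the sample output above):

```python
import csv
from reddit_scraper import JSONScraper

# Roughly equivalent to: json subreddit python --sort top --limit 25 --delay 2
scraper = JSONScraper(delay=2.0)
posts = scraper.scrape_subreddit("python", "top", 25)

# Roughly equivalent to: --format csv --output posts.csv
fields = ["title", "author", "score", "num_comments"]
with open("posts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)
```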
```python
from reddit_scraper import JSONScraper, get_config_manager
# setup_advanced_features is assumed to be exported from the same package
from reddit_scraper import setup_advanced_features

# Basic scraping with default settings
scraper = JSONScraper()
posts = scraper.scrape_subreddit("python", "hot", 50)

# Advanced scraping driven by config.json (proxies, captcha solving, delays)
config_manager = get_config_manager("config.json")
proxy_manager, captcha_solver = setup_advanced_features(config_manager)

advanced_scraper = JSONScraper(
    proxy_manager=proxy_manager,
    captcha_solver=captcha_solver,
    delay=config_manager.get_scraping_config().default_delay
)
posts = advanced_scraper.scrape_subreddit("MachineLearning", "top", 1000)
```
```python
from reddit_scraper import ProxyManager

proxy_manager = ProxyManager()
proxy_manager.add_proxy("proxy.example.com", 8080, "user", "pass", "http")

# Check proxy health and print a summary
proxy_manager.health_check_all()
stats = proxy_manager.get_proxy_stats()
print(f"Healthy proxies: {stats['healthy_proxies']}/{stats['total_proxies']}")
```
```python
from reddit_scraper import CaptchaSolverManager

solver = CaptchaSolverManager("YOUR_CAPSOLVER_API_KEY")

# Solve a reCAPTCHA v2 challenge, checking account balance first
solution = solver.check_balance_and_solve(
    solver.solver.solve_recaptcha_v2,
    "https://reddit.com",
    "site_key_here"
)

if solution.success:
    print(f"Captcha solved: {solution.solution}")
```
- Always respect Reddit's Terms of Service
- Don't overload Reddit's servers
- Consider using the official API for commercial use
- Default: 1 second delay between requests
- Use appropriate delays between requests
- Increase delay for large-scale operations
- Monitor proxy health to avoid IP bans
- Store scraped data responsibly
- Respect user privacy
- Don't republish personal information
Check proxy health and overall configuration status:

```bash
reddit-scraper test-proxies
reddit-scraper status
```

If you run into rate limits:

- Increase the `--delay` parameter
- Use a configuration file with multiple proxies
- Reduce `--limit` per request
This project integrates with Capsolver for automated captcha solving, supporting:
- reCAPTCHA v2/v3
- hCaptcha
- FunCaptcha
- Image-to-text captchas
Compatible with Reddit's public JSON endpoints for free, unauthenticated data access.
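For reference, these are the same endpoints you can query directly; a minimal sketch with `requests`, independent of this package and subject to the same rate limits:

```python
import requests

# Most Reddit listing pages return JSON if you append .json to the URL
url = "https://www.reddit.com/r/python/hot.json"
headers = {"User-Agent": "RedditScraper/1.0.0"}  # identify yourself; default UAs are often blocked
resp = requests.get(url, params={"limit": 5}, headers=headers, timeout=10)
resp.raise_for_status()

for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])
```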
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is for educational and research purposes. Please respect Reddit's Terms of Service and robots.txt.
For issues, questions, or feature requests, please open an issue on GitHub.
Note: This tool is designed for ethical data collection and research purposes. Always comply with Reddit's Terms of Service and respect rate limits.