Release v0.7.7 - The Self-Hosting & Monitoring Update #1613
Release v0.7.7 - The Self-Hosting & Monitoring Update #1613
Merged
Conversation
Signed-off-by: Emmanuel Ferdman <[email protected]>
Implements comprehensive hooks functionality, allowing users to provide custom Python functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  * on_browser_created: Initialize browser settings
  * on_page_context_created: Configure page context
  * before_goto: Pre-navigation setup
  * after_goto: Post-navigation processing
  * on_user_agent_updated: User agent modification handling
  * on_execution_started: Crawl execution initialization
  * before_retrieve_html: Pre-extraction processing
  * before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash the main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with an optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping the crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
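A minimal sketch of what a hooked request could look like, assuming the `hooks` field maps hook-point names to objects with the `code` and `timeout` fields of the HookConfig schema described above; the hook function signature and payload shape are illustrative assumptions, not the documented contract.

```python
# Hypothetical request against a local Docker deployment; adjust to the
# documented CrawlRequest schema before relying on it.
import requests

before_goto_hook = """
async def hook(page, context, url, **kwargs):
    # Assumed signature: set a custom header before navigation
    await page.set_extra_http_headers({"X-Example": "crawl4ai"})
    return page
"""

payload = {
    "urls": ["https://httpbin.org/headers"],
    "hooks": {
        "before_goto": {"code": before_goto_hook, "timeout": 30},
    },
}

resp = requests.post("http://localhost:11235/crawl", json=payload)
print(resp.json())
```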
…ncation. ref #1253 Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
- Wrap all AsyncUrlSeeder usage with async context managers - Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error
Update URL seeding examples to use proper async context managers
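A sketch of the corrected pattern from the example: AsyncUrlSeeder used as an async context manager with the "sitemap+cc" source. The SeedingConfig name and the urls() call are assumptions about the seeder API, not verbatim from the example.

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    # Context manager ensures the seeder's resources are cleaned up
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(source="sitemap+cc")  # the example also passes stream=True
        urls = await seeder.urls("example.com", config)
        print(f"Discovered {len(urls)} candidate URLs")

asyncio.run(main())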
Fix examples in README.md
fix(docker-api): migrate to modern datetime library API
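The diff is not shown here, but the modernization presumably replaces the deprecated naive-UTC call with a timezone-aware one:

```python
from datetime import datetime, timezone

# Deprecated since Python 3.12 and returns a naive datetime:
# created_at = datetime.utcnow()

# Modern, timezone-aware replacement:
created_at = datetime.now(timezone.utc)
```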
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers due to a hardcoded api_key_env fallback in config.yml. This caused authentication errors when using non-OpenAI providers like Gemini. Changes: - Remove api_key_env from config.yml to let litellm handle provider-specific env vars - Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys - Update validate_llm_provider() to trust litellm's built-in key detection - Update documentation to reflect the new automatic key handling The fix leverages litellm's existing capability to automatically find the correct environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.) without manual configuration. ref #1291
fix(docker): Fix LLM API key handling for multi-provider support
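In practice this means exporting only the key litellm expects for your provider; a sketch, using the variable names mentioned in the commit above:

```python
import os

# OpenAI-backed extraction: only OPENAI_API_KEY is needed
os.environ["OPENAI_API_KEY"] = "sk-..."

# Gemini-backed extraction: set the Gemini key instead of OPENAI_API_KEY
# (variable name taken from the commit message above)
os.environ["GEMINI_API_TOKEN"] = "..."
```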
…ted examples (#1330)
- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove "Custom Link Scoring" and "Caching Strategy" sections no longer aligned with the current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of the top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED
…eserve '+' signs. ref #1332
This commit adds a complete web scraping API example that demonstrates how to get structured data from any website and use it like an API, using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain-English queries
- Dual scraping approaches: schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with the crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies: pure HTML, CSS, and JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
Fixes bug reported in issue #1405 [Bug]: Excluded selector (excluded_selector) doesn't work. This commit reintroduces the cssselect library, which was removed in PR #1368 (merged via 437395e). Integration tested against the 0.7.4 Docker container: reintroducing the cssselect package eliminated the errors seen in logs and restored the excluded_selector functionality. Refs: #1405
… deep crawl strategy (ref #1419)
- Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
- Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
- Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
- Ensure filter chains work correctly with the Docker client and REST API

The issue occurred because:
1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

Changes:
- async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
- filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store them as public attributes
- models.py: Convert property descriptors to strings in model_dump() instead of including them directly
…ration. ref #1035

Implement hierarchical configuration for LLM parameters with support for:
- Temperature control (0.0-2.0) to adjust response creativity
- Custom base_url for proxy servers and alternative endpoints
- 4-tier priority: request params > provider env > global env > defaults

Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.
feat(docker): Add temperature and base_url parameters for LLM configuration
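An illustrative sketch of the 4-tier priority described above (a hypothetical helper, not the actual server code): request parameters win over provider-specific environment variables, which win over the global variable, which wins over the default.

```python
import os

def resolve_temperature(request_temp=None, provider="openai", default=0.7):
    if request_temp is not None:                                 # 1. explicit request parameter
        return request_temp
    provider_env = os.getenv(f"{provider.upper()}_TEMPERATURE")  # 2. e.g. OPENAI_TEMPERATURE
    if provider_env is not None:
        return float(provider_env)
    global_env = os.getenv("LLM_TEMPERATURE")                    # 3. global environment variable
    if global_env is not None:
        return float(global_env)
    return default                                               # 4. built-in default
```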
…ptive-strategies-docs Update Quickstart and Adaptive Strategies documentation
- Return comprehensive error messages along with status codes for API internal errors
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite
fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy
…and enhance proxy string parsing
- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
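A sketch of the updated usage, assuming ProxyConfig is importable from the configs module (the exact import path is an assumption):

```python
from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig  # import path assumed

# from_string now accepts full URLs with embedded credentials
proxy = ProxyConfig.from_string("http://user:pass@proxy.example.com:8080")

# Prefer proxy_config over the deprecated `proxy` string parameter
browser_cfg = BrowserConfig(proxy_config=proxy)
```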
…ate .gitignore to include test_scripts directory.
…ring crawling. Ref #1410 Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
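A sketch of enabling the flag, assuming it is exposed on CrawlerRunConfig (the commit does not name the config object that carries it):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(preserve_https_for_internal_links=True)
    async with AsyncWebCrawler() as crawler:
        # Same-domain links keep their original https:// scheme even if the
        # server redirected the page itself to http://
        result = await crawler.arun("https://example.com", config=config)
        print(result.links)

asyncio.run(main())
```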
fix(docker): Remove environment variable overrides in docker-compose.yml
Fix remove_overlay_elements functionality by calling injected JS function.
…Many This commit resolves issue #1055 where LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel.

Changes:
- Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
- Implemented arun() method in ExtractionStrategy base class with thread pool fallback
- Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
- Updated AsyncWebCrawler.arun() to detect and use arun() when available
- Added comprehensive test suite to verify parallel execution

Impact:
- LLM extraction now runs truly in parallel across multiple URLs
- Significant performance improvement for multi-URL crawls with LLM strategies
- Backward compatible: existing extraction strategies continue to work
- No breaking changes to the public API

Technical details:
- Uses litellm.acompletion for non-blocking LLM calls
- Leverages asyncio.gather for concurrent chunk processing
- Maintains backward compatibility via an asyncio.to_thread fallback
- Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
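A sketch of the concurrency pattern described above (not the library's actual internals): chunks are awaited together with asyncio.gather so the LLM calls overlap instead of serializing.

```python
import asyncio

async def extract_chunk(chunk: str) -> dict:
    # Stand-in for an async LLM call such as litellm.acompletion(...)
    await asyncio.sleep(0.1)
    return {"chunk": chunk[:20]}

async def aextract(chunks: list[str]) -> list[dict]:
    # All chunk extractions run concurrently rather than one at a time
    return await asyncio.gather(*(extract_chunk(c) for c in chunks))

results = asyncio.run(aextract(["first chunk", "second chunk", "third chunk"]))
print(results)
```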
feat(ManagedBrowser): add viewport size configuration for browser launch
…ve monitoring documentation

Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote the introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities
- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track
- Enhanced the summary section:
  * What-users-learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with the monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable. Users will understand they have complete operational visibility at http://localhost:11235/monitor with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level monitoring capabilities, not just a Docker deployment.
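A sketch of consuming the monitoring stream from Python; the WebSocket endpoint path shown here is an assumption, so check the Self-Hosting Guide for the documented URL.

```python
import asyncio
import json
import websockets  # third-party: pip install websockets

async def watch_monitor():
    # Endpoint path is assumed; the guide documents the real one
    async with websockets.connect("ws://localhost:11235/monitor/ws") as ws:
        while True:
            event = json.loads(await ws.recv())
            print(event)

asyncio.run(watch_monitor())
```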
…ategy with seen URL tracking
#1551: Fix casing and variable name consistency for LLMConfig in doc…
#1559: Add tests for sitemap parsing and URL normalization in AsyncUr…
Add CDP endpoint verification with exponential backoff for managed browsers
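An illustrative sketch of the exponential-backoff idea (not the library's code): poll the browser's CDP HTTP endpoint until it answers, doubling the delay between attempts. The local port is an assumption.

```python
import time
import requests

def wait_for_cdp(cdp_url="http://localhost:9222", retries=5, base_delay=0.5):
    delay = base_delay
    for attempt in range(retries):
        try:
            resp = requests.get(f"{cdp_url}/json/version", timeout=2)
            if resp.ok:
                return resp.json()   # browser is reachable
        except requests.RequestException:
            pass                     # not up yet, retry after backoff
        time.sleep(delay)
        delay *= 2                   # exponential backoff
    raise RuntimeError(f"CDP endpoint {cdp_url} not reachable after {retries} attempts")
```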
feat: Add Nstproxy Proxies
#1591: Enhance proxy configuration with security, SSL analysis, and rotation examples
Fix/dfs deep crawling
…ync_configs.py implementation

Updated browser-crawler-config.md and parameters.md to ensure complete accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.

Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)
- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth
- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental
- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity, Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
Update browser and crawler run config documentation to match async_configs.py implementation
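A short example using a handful of the documented parameters; the values (and the dict shape of viewport) are illustrative assumptions, not defaults.

```python
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(
    enable_stealth=True,                      # newly documented BrowserConfig parameter
    viewport={"width": 1280, "height": 800},  # shape assumed for illustration
    accept_downloads=False,
)

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,             # default is CacheMode.BYPASS
    capture_console_messages=True,
    exclude_internal_links=False,
)
```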
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
Summary
This PR merges the v0.7.7 release into the main branch, introducing a complete self-hosting platform with enterprise-grade real-time monitoring. It transforms the Crawl4AI Docker deployment from a simple containerized crawler into a production-ready platform with full operational transparency and control.
🚀 What's New
Major Feature: Real-time Monitoring & Self-Hosting Platform
Docker deployment now includes:
- Real-time monitoring dashboard at /monitor (system health, requests, browser pool, errors)
- Monitor REST API endpoints and WebSocket streaming for programmatic control
- Log streaming endpoints under /monitor/logs/*
🐛 Critical Bug Fixes
Performance & Stability
Configuration & Features
viewport can't work when use_persistent_context = True (#1490)
Docker & Infrastructure
Security
Documentation
📝 Documentation Updates
New Documentation
Updated Documentation
🔄 Breaking Changes
None! This release is fully backward compatible.
📦 Files Changed
New Files
Modified Files
Documentation Structure