@ntohidi ntohidi commented Nov 14, 2025

Summary

This PR merges the v0.7.7 release into the main branch, introducing a complete self-hosting platform with enterprise-grade real-time monitoring. This release transforms Crawl4AI Docker from a simple containerized crawler into a production-ready platform with full operational transparency and control.

🚀 What's New

Major Feature: Real-time Monitoring & Self-Hosting Platform

Docker deployment now includes:

  • 📊 Interactive Monitoring Dashboard (/dashboard)
    • Live system metrics (CPU, memory, network, uptime)
    • Real-time request tracking (active & completed)
    • Browser pool visibility (permanent/hot/cold tiers)
    • Janitor cleanup event logs
    • Error monitoring with full context
  • 🔌 Comprehensive Monitor API
    • Complete REST API for programmatic access
    • Endpoints: /monitor/health, /monitor/requests, /monitor/browsers, /monitor/endpoints/stats, /monitor/timeline,
      /monitor/logs/*
    • Control actions: cleanup, kill browser, restart browser, reset stats
  • ⚡ WebSocket Streaming
    • Real-time updates every 2 seconds
    • Build custom dashboards with live data
    • Connection: ws://localhost:11235/monitor/ws (see the client sketch after this list)
  • 🔥 Smart Browser Pool (3-tier architecture)
    • Permanent Browser: Always-on for default configs
    • Hot Pool: Frequently-used configs (promoted after 3+ uses)
    • Cold Pool: On-demand browsers for variant configs
    • Automatic promotion and cleanup
    • 10x memory efficiency improvement
  • 🧹 Janitor System
    • Automatic resource management
    • Event logging for cleanup activities
    • Configurable idle timeouts
  • 📈 Production-Ready
    • Prometheus integration examples
    • 6 critical metrics for operational excellence
    • Alerting patterns and thresholds
    • Log aggregation support
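
For example, a minimal Python client can poll the Monitor API and subscribe to the WebSocket feed. This is a rough sketch only: it assumes the default port 11235 and the endpoint paths listed above, uses the third-party httpx and websockets packages, and simply prints the top-level keys of each WebSocket frame since the frame schema is not reproduced here.

```python
import asyncio
import json

import httpx        # pip install httpx
import websockets   # pip install websockets

BASE = "http://localhost:11235"

async def main():
    # One-off snapshot via the REST Monitor API.
    async with httpx.AsyncClient() as client:
        health = await client.get(f"{BASE}/monitor/health")
        print("health:", health.json())

        browsers = await client.get(f"{BASE}/monitor/browsers")
        print("browser pool:", browsers.json())

    # Live feed: the server pushes an update roughly every 2 seconds.
    async with websockets.connect("ws://localhost:11235/monitor/ws") as ws:
        for _ in range(5):                      # read a few frames, then exit
            frame = json.loads(await ws.recv())
            print("update keys:", list(frame))

if __name__ == "__main__":
    asyncio.run(main())
```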

🐛 Critical Bug Fixes

Performance & Stability

Configuration & Features

Docker & Infrastructure

Security

  • Updated pyOpenSSL from >=24.3.0 to >=25.3.0 (security vulnerability fix)
  • Added verification tests for security updates

Documentation

📝 Documentation Updates

New Documentation

  • ✅ docs/blog/release-v0.7.7.md - Comprehensive release notes (~600 lines)
  • ✅ docs/releases_review/demo_v0.7.7.py - Complete demo showcasing all monitoring features
  • ✅ docs/md_v2/blog/index.md - Updated blog index with v0.7.7 as latest

Updated Documentation

  • ✅ README.md - Updated version references, Docker commands, and release highlights
  • ✅ deploy/docker/README.md - Updated all version references from 0.7.6 to 0.7.7
  • ✅ Dockerfile - Updated C4AI_VER argument to 0.7.7
  • ✅ Documentation links updated from "Docker Deployment" to "Self-Hosting Guide"

🔄 Breaking Changes

None! This release is fully backward compatible.

  • All existing Docker configurations continue to work
  • No API changes to existing endpoints
  • Monitoring is an additive feature
  • No migration required

📦 Files Changed

New Files

  • docs/blog/release-v0.7.7.md - Release notes
  • docs/releases_review/demo_v0.7.7.py - Demo script

Modified Files

  • README.md - Version updates and feature highlights
  • Dockerfile - Version bump to 0.7.7
  • deploy/docker/README.md - Docker version references
  • docs/md_v2/blog/index.md - Blog index with latest release

Documentation Structure

  • Docker Deployment → Self-Hosting (rebranding for better positioning)
  • Added monitoring dashboard URL references throughout
  • Updated all Docker commands to use 0.7.7

emmanuel-ferdman and others added 30 commits May 13, 2025 00:04
Implements comprehensive hooks functionality allowing users to provide custom Python
functions as strings that execute at specific points in the crawling pipeline.

Key Features:
- Support for all 8 crawl4ai hook points:
  • on_browser_created: Initialize browser settings
  • on_page_context_created: Configure page context
  • before_goto: Pre-navigation setup
  • after_goto: Post-navigation processing
  • on_user_agent_updated: User agent modification handling
  • on_execution_started: Crawl execution initialization
  • before_retrieve_html: Pre-extraction processing
  • before_return_html: Final HTML processing

Implementation Details:
- Created UserHookManager for validation, compilation, and safe execution
- Added IsolatedHookWrapper for error isolation and timeout protection
- AST-based validation ensures code structure correctness
- Sandboxed execution with restricted builtins for security
- Configurable timeout (1-120 seconds) prevents infinite loops
- Comprehensive error handling ensures hooks don't crash main process
- Execution tracking with detailed statistics and logging

API Changes:
- Added HookConfig schema with code and timeout fields
- Extended CrawlRequest with optional hooks parameter
- Added /hooks/info endpoint for hook discovery
- Updated /crawl and /crawl/stream endpoints to support hooks

Safety Features:
- Malformed hooks return clear validation errors
- Hook errors are isolated and reported without stopping crawl
- Execution statistics track success/failure/timeout rates
- All hook results are JSON-serializable

Testing:
- Comprehensive test suite covering all 8 hooks
- Error handling and timeout scenarios validated
- Authentication, performance, and content extraction examples
- 100% success rate in production testing

Documentation:
- Added extensive hooks section to docker-deployment.md
- Security warnings about user-provided code risks
- Real-world examples using httpbin.org, GitHub, BBC
- Best practices and troubleshooting guide

ref #1377
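
As a rough illustration of the request shape described above, a hook could be attached to a /crawl request roughly as follows. The exact field names should be checked against /hooks/info; this payload, and the hook function signature inside it, are assumptions based on the HookConfig (code + timeout) schema and hook points listed in this commit.

```python
import requests  # pip install requests

# Hypothetical payload shape: a "hooks" mapping of hook-point name -> {code, timeout}.
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {
        "before_goto": {
            # The hook is shipped as a Python function source string;
            # the exact signature is assumed here, not confirmed by this PR text.
            "code": (
                "async def before_goto(page, context, **kwargs):\n"
                "    await page.set_extra_http_headers({'X-Demo': 'crawl4ai'})\n"
            ),
            "timeout": 30,  # seconds, within the documented 1-120 range
        }
    },
}

resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
print(resp.status_code, resp.json().get("success"))
```
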
…ncation. ref #1253

  Use negative scores in the priority queue so high-score URLs are visited first, and drop the link cap prior to scoring; add a test for ordering.
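
The trick of negating scores so a min-heap priority queue pops the highest-scoring URL first can be shown with Python's heapq (a generic sketch, not the library's internal code):

```python
import heapq

frontier = []
for url, score in [("https://a.example", 0.9),
                   ("https://b.example", 0.2),
                   ("https://c.example", 0.7)]:
    # heapq is a min-heap, so push the negated score to pop high scores first.
    heapq.heappush(frontier, (-score, url))

while frontier:
    neg_score, url = heapq.heappop(frontier)
    print(f"visit {url} (score={-neg_score})")
# -> a.example (0.9), c.example (0.7), b.example (0.2)
```
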
- Wrap all AsyncUrlSeeder usage with async context managers
- Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error
Update URL seeding examples to use proper async context managers
Fix examples in README.md
fix(docker-api): migrate to modern datetime library API
Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers
due to a hardcoded api_key_env fallback in config.yml. This caused authentication
errors when using non-OpenAI providers like Gemini.

Changes:
- Remove api_key_env from config.yml to let litellm handle provider-specific env vars
- Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys
- Update validate_llm_provider() to trust litellm's built-in key detection
- Update documentation to reflect the new automatic key handling

The fix leverages litellm's existing capability to automatically find the correct
environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.)
without manual configuration.

ref #1291
fix(docker): Fix LLM API key handling for multi-provider support
…ted examples (#1330)

- Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy)
- Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library
- Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs
- Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED
This commit adds a complete web scraping API example that demonstrates how to extract structured data from any website and serve it like an API, using the crawl4ai library with a minimalist frontend interface.

Core Functionality
- AI-powered web scraping with plain English queries
- Dual scraping approaches: Schema-based (faster) and LLM-based (flexible)
- Intelligent schema caching for improved performance
- Custom LLM model support with API key management
- Automatic duplicate request prevention

Modern Frontend Interface
- Minimalist black-and-white design inspired by modern web apps
- Responsive layout with smooth animations and transitions
- Three main pages: Scrape Data, Models Management, API Request History
- Real-time results display with JSON formatting
- Copy-to-clipboard functionality for extracted data
- Toast notifications for user feedback
- Auto-scroll to results when scraping starts

Model Management System
- Web-based model configuration interface
- Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.)
- Simplified configuration requiring only provider and API token
- Add, list, and delete model configurations
- Secure storage of API keys in local JSON files

API Request History
- Automatic saving of all API requests and responses
- Display of request history with URL, query, and cURL commands
- Duplicate prevention (same URL + query combinations)
- Request deletion functionality
- Clean, simplified display focusing on essential information

Technical Implementation

Backend (FastAPI)
- RESTful API with comprehensive endpoints
- Pydantic models for request/response validation
- Async web scraping with crawl4ai library
- Error handling with detailed error messages
- File-based storage for models and request history

Frontend (Vanilla JS/CSS/HTML)
- No framework dependencies - pure HTML, CSS, JavaScript
- Modern CSS Grid and Flexbox layouts
- Custom dropdown styling with SVG arrows
- Responsive design for mobile and desktop
- Smooth scrolling and animations

Core Library Integration
- WebScraperAgent class for orchestration
- ModelConfig class for LLM configuration management
- Schema generation and caching system
- LLM extraction strategy support
- Browser configuration with headless mode
Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work

This commit reintroduces the cssselect library, which was removed by PR #1368 (merged in 437395e).

Integration tested against the 0.7.4 Docker container. Reintroducing the cssselect package eliminated the errors seen in the logs and restored the excluded_selector functionality.

Refs: #1405
… deep crawl strategy (ref #1419)

  - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params
  - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization
  - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors
  - Ensure filter chains work correctly with Docker client and REST API

  The issue occurred because:
  1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization
  2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors

  Changes:
  - async_configs.py: Comment out __slots__ serialization logic (lines 100-109)
  - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes
  - models.py: Convert property descriptors to strings in model_dump() instead of including them directly
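
A typical setup the fix targets looks roughly like the sketch below. It assumes the URLPatternFilter/FilterChain/BFSDeepCrawlStrategy names from crawl4ai's deep-crawl module and the dump()/load() round-trip used by the Docker client; treat it as an illustration of the serialization path, not the exact reproduction case.

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Build a deep-crawl config with a URL pattern filter.
strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    filter_chain=FilterChain([URLPatternFilter(patterns=["*/blog/*"])]),
)
config = CrawlerRunConfig(deep_crawl_strategy=strategy)

# The Docker client serializes the config to a JSON-safe dict; before this fix,
# private __slots__ fields and property descriptors leaked into the payload and
# broke deserialization on the server side.
payload = config.dump()
restored = CrawlerRunConfig.load(payload)
print(type(restored.deep_crawl_strategy).__name__)
```
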
…ration. ref #1035

  Implement hierarchical configuration for LLM parameters with support for:
  - Temperature control (0.0-2.0) to adjust response creativity
  - Custom base_url for proxy servers and alternative endpoints
  - 4-tier priority: request params > provider env > global env > defaults

  Add helper functions in utils.py, update API schemas and handlers,
  support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.),
  and provide comprehensive documentation with examples.
feat(docker): Add temperature and base_url parameters for LLM configuration
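
The 4-tier resolution order described above can be sketched as a small helper. This is illustrative only; the function name and the default value are assumptions, and only the env var names mentioned in the commit (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.) are taken from it.

```python
import os

def resolve_temperature(request_value=None, provider="openai", default=0.7):
    """Illustrative resolution: request params > provider env > global env > default."""
    if request_value is not None:                      # 1. explicit request parameter
        return float(request_value)
    provider_env = os.getenv(f"{provider.upper()}_TEMPERATURE")
    if provider_env is not None:                       # 2. provider-specific env var
        return float(provider_env)
    global_env = os.getenv("LLM_TEMPERATURE")
    if global_env is not None:                         # 3. global env var
        return float(global_env)
    return default                                     # 4. built-in default

print(resolve_temperature(request_value=0.2))   # request parameter wins -> 0.2
print(resolve_temperature(provider="openai"))   # falls through env vars to the default
```
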
…ptive-strategies-docs

Update Quickstart and Adaptive Strategies documentation
- Return comprehensive error messages along with status codes for API internal errors.
- Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints
- Add sanitization to ensure fit_html is always JSON-serializable (string or None)
- Add comprehensive error handling test suite.
fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy
…and enhance proxy string parsing

- Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials.
- Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility.
- Added warnings for deprecated usage and clarified behavior when both parameters are provided.
- Updated documentation and tests to reflect changes in proxy configuration handling.
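
For instance, the new parsing and the proxy_config parameter might be used as below. This is a sketch: the import path for ProxyConfig and the exact set of accepted string formats should be checked against the library; only "URL with credentials" support is stated by this commit.

```python
from crawl4ai import BrowserConfig, ProxyConfig  # import path assumed

# from_string now accepts multiple formats, e.g. a full URL with credentials.
proxy = ProxyConfig.from_string("http://user:pass@proxy.example.com:8080")

# The 'proxy' parameter on BrowserConfig is deprecated; pass the structured
# proxy_config instead.
browser_cfg = BrowserConfig(headless=True, proxy_config=proxy)
print(browser_cfg.proxy_config.server)
```
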
…ate .gitignore to include test_scripts directory.
…ring crawling. Ref #1410

Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.
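
Assuming the flag lives on CrawlerRunConfig (the commit does not say which config object carries it, so treat this as a sketch), usage would look like:

```python
from crawl4ai import CrawlerRunConfig

# Keep same-domain links on their original https:// scheme even when the
# server redirects the crawled page itself to http://.
config = CrawlerRunConfig(preserve_https_for_internal_links=True)
```
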
ntohidi and others added 28 commits November 6, 2025 00:07
fix(docker): Remove environment variable overrides in docker-compose.yml
Fix remove_overlay_elements functionality by calling injected JS function.
…Many

  This commit resolves issue #1055 where LLM extraction was blocking async
  execution, causing URLs to be processed sequentially instead of in parallel.

  Changes:
  - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls
  - Implemented arun() method in ExtractionStrategy base class with thread pool fallback
  - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather
  - Updated AsyncWebCrawler.arun() to detect and use arun() when available
  - Added comprehensive test suite to verify parallel execution

  Impact:
  - LLM extraction now runs truly in parallel across multiple URLs
  - Significant performance improvement for multi-URL crawls with LLM strategies
  - Backward compatible - existing extraction strategies continue to work
  - No breaking changes to public API

  Technical details:
  - Uses litellm.acompletion for non-blocking LLM calls
  - Leverages asyncio.gather for concurrent chunk processing
  - Maintains backward compatibility via asyncio.to_thread fallback
  - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers
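
The concurrency pattern described here — firing one non-blocking LLM call per chunk and gathering the results — looks roughly like the following. This is a generic sketch with litellm and a placeholder model name, not the library's internal arun()/aextract() code.

```python
import asyncio
from litellm import acompletion  # pip install litellm

async def extract_chunk(chunk: str, instruction: str) -> str:
    # Non-blocking LLM call; the event loop can run other chunks meanwhile.
    resp = await acompletion(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": f"{instruction}\n\n{chunk}"}],
    )
    return resp.choices[0].message.content

async def extract_all(chunks: list[str], instruction: str) -> list[str]:
    # asyncio.gather runs every chunk's LLM call concurrently rather than one by one.
    return await asyncio.gather(*(extract_chunk(c, instruction) for c in chunks))

# asyncio.run(extract_all(["chunk one ...", "chunk two ..."], "Summarize this text."))
```
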
feat(ManagedBrowser): add viewport size configuration for browser launch
…ve monitoring documentation

Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.

Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities

- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track

- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL

The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.

This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
#1551: Fix casing and variable name consistency for LLMConfig in doc…
#1559: Add tests for sitemap parsing and URL normalization in AsyncUr…
Add CDP endpoint verification with exponential backoff for managed browsers
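
An exponential-backoff check against a CDP endpoint typically looks like the sketch below, using the standard /json/version DevTools route on an assumed local debugging port; it illustrates the pattern, not the managed-browser code itself.

```python
import time
import requests  # pip install requests

def wait_for_cdp(endpoint="http://localhost:9222", retries=6, base_delay=0.25):
    """Poll the CDP /json/version route, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(f"{endpoint}/json/version", timeout=2)
            if resp.ok:
                return resp.json()               # browser is up and CDP is reachable
        except requests.RequestException:
            pass
        time.sleep(delay)
        delay *= 2                               # exponential backoff
    raise TimeoutError(f"CDP endpoint {endpoint} not reachable after {retries} attempts")
```
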
#1591: Enhance proxy configuration with security, SSL analysis, and rotation examples
…ync_configs.py implementation

Updated browser-crawler-config.md and parameters.md to ensure complete
accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.

Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)

- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth

- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental

- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity,
  Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain
  Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
Update browser and crawler run config documentation to match async_configs.py implementation
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
@ntohidi ntohidi requested a review from unclecode November 14, 2025 09:33
@unclecode unclecode merged commit cb637fb into main Nov 16, 2025
3 of 4 checks passed