Release v0.7.7 - The Self-Hosting & Monitoring Update #1613

ntohidi · 2025-11-14T09:33:22Z

Summary

This PR merges the v0.7.7 release into the main branch, introducing a complete self-hosting platform with enterprise-grade real-time monitoring. This release transforms Crawl4AI Docker from a simple containerized crawler into a production-ready platform with full operational transparency and control.

🚀 What's New

Major Feature: Real-time Monitoring & Self-Hosting Platform

Docker deployment now includes:

📊 Interactive Monitoring Dashboard (/dashboard)
- Live system metrics (CPU, memory, network, uptime)
- Real-time request tracking (active & completed)
- Browser pool visibility (permanent/hot/cold tiers)
- Janitor cleanup event logs
- Error monitoring with full context
🔌 Comprehensive Monitor API
- Complete REST API for programmatic access
- Endpoints: /monitor/health, /monitor/requests, /monitor/browsers, /monitor/endpoints/stats, /monitor/timeline,
  /monitor/logs/*
- Control actions: cleanup, kill browser, restart browser, reset stats
⚡ WebSocket Streaming
- Real-time updates every 2 seconds
- Build custom dashboards with live data
- Connection: ws://localhost:11235/monitor/ws
🔥 Smart Browser Pool (3-tier architecture)
- Permanent Browser: Always-on for default configs
- Hot Pool: Frequently-used configs (promoted after 3+ uses)
- Cold Pool: On-demand browsers for variant configs
- Automatic promotion and cleanup
- 10x memory efficiency improvement
🧹 Janitor System
- Automatic resource management
- Event logging for cleanup activities
- Configurable idle timeouts
📈 Production-Ready
- Prometheus integration examples
- 6 critical metrics for operational excellence
- Alerting patterns and thresholds
- Log aggregation support

🐛 Critical Bug Fixes

Performance & Stability

This commit resolves issue #1055 where LLM extraction was blocking async #1590: Fixed async LLM extraction blocking issue ([Bug]: Concurrent processing using arun_many not really concurrent using LLM strategy #1055) - now supports true parallel processing
Add CDP endpoint verification with exponential backoff for managed browsers #1528: Fixed CDP endpoint verification with exponential backoff ([Bug]: Identity based browsing fails #1445)
Fix: run_urls() returns None, crashing arun_many() #1530: Fixed arun_many to always return a list, even on exception

Configuration & Features

Update browser and crawler run config documentation to match async_configs.py implementation #1609: Updated browser and crawler config documentation to match implementation
Fix/dfs deep crawling #1607: Enhanced DFS deep crawl strategy with seen URL tracking
#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr… #1598: Fixed sitemap parsing and URL normalization in AsyncUrlSeeder ([Bug]: URL Seeding using Sitemap doesn't work in some cases #1559)
feat(ManagedBrowser): add viewport size configuration for browser launch #1495: Fixed viewport configuration in managed browsers ([Bug]: viewport_ can't work when use_persistent_context = True #1490)
Fix remove_overlay_elements functionality by calling injected JS function. #1529: Fixed remove_overlay_elements functionality ([Bug]: remove_overlay_elements not working (function never called) #1396)

Docker & Infrastructure

fix(docker): Remove environment variable overrides in docker-compose.yml #1537: Fixed LLM API key handling for multi-provider support
Standardized Docker port to 11235 across all configs
Improved error handling with comprehensive status codes
Fixed fit_html serialization in /crawl and /crawl/stream endpoints

Security

Updated pyOpenSSL from >=24.3.0 to >=25.3.0 (security vulnerability fix)
Added verification tests for security updates

Documentation

[Bug]: Outdated Doc For LLM-Strategy #1551: Fixed LLMConfig casing and variable name consistency
Fixed webhook serialization for Pydantic HttpUrl

📝 Documentation Updates

New Documentation

✅ docs/blog/release-v0.7.7.md - Comprehensive release notes (~600 lines)
✅ docs/releases_review/demo_v0.7.7.py - Complete demo showcasing all monitoring features
✅ docs/md_v2/blog/index.md - Updated blog index with v0.7.7 as latest

Updated Documentation

✅ README.md - Updated version references, Docker commands, and release highlights
✅ deploy/docker/README.md - Updated all version references from 0.7.6 to 0.7.7
✅ Dockerfile - Updated C4AI_VER argument to 0.7.7
✅ Documentation links updated from "Docker Deployment" to "Self-Hosting Guide"

🔄 Breaking Changes

None! This release is fully backward compatible.

All existing Docker configurations continue to work
No API changes to existing endpoints
Monitoring is an additive functionality
No migration required

📦 Files Changed

New Files

docs/blog/release-v0.7.7.md - Release notes
docs/releases_review/demo_v0.7.7.py - Demo script

Modified Files

README.md - Version updates and feature highlights
Dockerfile - Version bump to 0.7.7
deploy/docker/README.md - Docker version references
docs/md_v2/blog/index.md - Blog index with latest release

Documentation Structure

Docker Deployment → Self-Hosting (rebranding for better positioning)
Added monitoring dashboard URL references throughout
Updated all Docker commands to use 0.7.7

Signed-off-by: Emmanuel Ferdman <[email protected]>

Implements comprehensive hooks functionality allowing users to provide custom Python functions as strings that execute at specific points in the crawling pipeline. Key Features: - Support for all 8 crawl4ai hook points: • on_browser_created: Initialize browser settings • on_page_context_created: Configure page context • before_goto: Pre-navigation setup • after_goto: Post-navigation processing • on_user_agent_updated: User agent modification handling • on_execution_started: Crawl execution initialization • before_retrieve_html: Pre-extraction processing • before_return_html: Final HTML processing Implementation Details: - Created UserHookManager for validation, compilation, and safe execution - Added IsolatedHookWrapper for error isolation and timeout protection - AST-based validation ensures code structure correctness - Sandboxed execution with restricted builtins for security - Configurable timeout (1-120 seconds) prevents infinite loops - Comprehensive error handling ensures hooks don't crash main process - Execution tracking with detailed statistics and logging API Changes: - Added HookConfig schema with code and timeout fields - Extended CrawlRequest with optional hooks parameter - Added /hooks/info endpoint for hook discovery - Updated /crawl and /crawl/stream endpoints to support hooks Safety Features: - Malformed hooks return clear validation errors - Hook errors are isolated and reported without stopping crawl - Execution statistics track success/failure/timeout rates - All hook results are JSON-serializable Testing: - Comprehensive test suite covering all 8 hooks - Error handling and timeout scenarios validated - Authentication, performance, and content extraction examples - 100% success rate in production testing Documentation: - Added extensive hooks section to docker-deployment.md - Security warnings about user-provided code risks - Real-world examples using httpbin.org, GitHub, BBC - Best practices and troubleshooting guide ref #1377

…ncation. ref #1253 Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

- Wrap all AsyncUrlSeeder usage with async context managers - Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

…ble #1310

Update URL seeding examples to use proper async context managers

Fix examples in README.md

fix(docker-api): migrate to modern datetime library API

…develop

Previously, the system incorrectly used OPENAI_API_KEY for all LLM providers due to a hardcoded api_key_env fallback in config.yml. This caused authentication errors when using non-OpenAI providers like Gemini. Changes: - Remove api_key_env from config.yml to let litellm handle provider-specific env vars - Simplify get_llm_api_key() to return None, allowing litellm to auto-detect keys - Update validate_llm_provider() to trust litellm's built-in key detection - Update documentation to reflect the new automatic key handling The fix leverages litellm's existing capability to automatically find the correct environment variable for each provider (OPENAI_API_KEY, GEMINI_API_TOKEN, etc.) without manual configuration. ref #1291

fix(docker): Fix LLM API key handling for multi-provider support

…develop

…ted examples (#1330) - Replace BaseStrategy with CrawlStrategy in custom strategy examples (DomainSpecificStrategy, HybridStrategy) - Remove “Custom Link Scoring” and “Caching Strategy” sections no longer aligned with current library - Revise memory pruning example to use adaptive.get_relevant_content and index-based retention of top 500 docs - Correct Quickstart note: default cache mode is CacheMode.BYPASS; instruct enabling with CacheMode.ENABLED

…eserve '+' signs. ref #1332

This commit adds a complete, web scraping API example that demonstrates how to get structured data from any website and use it like an API using the crawl4ai library with a minimalist frontend interface. Core Functionality - AI-powered web scraping with plain English queries - Dual scraping approaches: Schema-based (faster) and LLM-based (flexible) - Intelligent schema caching for improved performance - Custom LLM model support with API key management - Automatic duplicate request prevention Modern Frontend Interface - Minimalist black-and-white design inspired by modern web apps - Responsive layout with smooth animations and transitions - Three main pages: Scrape Data, Models Management, API Request History - Real-time results display with JSON formatting - Copy-to-clipboard functionality for extracted data - Toast notifications for user feedback - Auto-scroll to results when scraping starts Model Management System - Web-based model configuration interface - Support for any LLM provider (OpenAI, Gemini, Anthropic, etc.) - Simplified configuration requiring only provider and API token - Add, list, and delete model configurations - Secure storage of API keys in local JSON files API Request History - Automatic saving of all API requests and responses - Display of request history with URL, query, and cURL commands - Duplicate prevention (same URL + query combinations) - Request deletion functionality - Clean, simplified display focusing on essential information Technical Implementation Backend (FastAPI) - RESTful API with comprehensive endpoints - Pydantic models for request/response validation - Async web scraping with crawl4ai library - Error handling with detailed error messages - File-based storage for models and request history Frontend (Vanilla JS/CSS/HTML) - No framework dependencies - pure HTML, CSS, JavaScript - Modern CSS Grid and Flexbox layouts - Custom dropdown styling with SVG arrows - Responsive design for mobile and desktop - Smooth scrolling and animations Core Library Integration - WebScraperAgent class for orchestration - ModelConfig class for LLM configuration management - Schema generation and caching system - LLM extraction strategy support - Browser configuration with headless mode

Fixes bug reported in issue #1405 [Bug]: Excluded selector (excluded_selector) doesn't work This commit reintroduces the cssselect library which was removed by PR (#1368) and merged via (437395e). Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored. Refs: #1405

… deep crawl strategy (ref #1419) - Fix URLPatternFilter serialization by preventing private __slots__ from being serialized as constructor params - Add public attributes to URLPatternFilter to store original constructor parameters for proper serialization - Handle property descriptors in CrawlResult.model_dump() to prevent JSON serialization errors - Ensure filter chains work correctly with Docker client and REST API The issue occurred because: 1. Private implementation details (_simple_suffixes, etc.) were being serialized and passed as constructor arguments during deserialization 2. Property descriptors were being included in the serialized output, causing "Object of type property is not JSON serializable" errors Changes: - async_configs.py: Comment out __slots__ serialization logic (lines 100-109) - filters.py: Add patterns, use_glob, reverse to URLPatternFilter __slots__ and store as public attributes - models.py: Convert property descriptors to strings in model_dump() instead of including them directly

…s. ref #1437

…ration. ref #1035 Implement hierarchical configuration for LLM parameters with support for: - Temperature control (0.0-2.0) to adjust response creativity - Custom base_url for proxy servers and alternative endpoints - 4-tier priority: request params > provider env > global env > defaults Add helper functions in utils.py, update API schemas and handlers, support environment variables (LLM_TEMPERATURE, OPENAI_TEMPERATURE, etc.), and provide comprehensive documentation with examples.

feat(docker): Add temperature and base_url parameters for LLM configuration

…ptive-strategies-docs Update Quickstart and Adaptive Strategies documentation

- Return comprehensive error messages along with status codes for api internal errors. - Fix fit_html property serialization issue in both /crawl and /crawl/stream endpoints - Add sanitization to ensure fit_html is always JSON-serializable (string or None) - Add comprehensive error handling test suite.

fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy

…and enhance proxy string parsing - Updated ProxyConfig.from_string to support multiple proxy formats, including URLs with credentials. - Deprecated the 'proxy' parameter in BrowserConfig, replacing it with 'proxy_config' for better flexibility. - Added warnings for deprecated usage and clarified behavior when both parameters are provided. - Updated documentation and tests to reflect changes in proxy configuration handling.

…ate .gitignore to include test_scripts directory.

…ring crawling. Ref #1410 Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

…1410

fix(docker): Remove environment variable overrides in docker-compose.yml

Fix remove_overlay_elements functionality by calling injected JS function.

execution, causing URLs to be processed sequentially instead of in parallel. Changes: - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls - Implemented arun() method in ExtractionStrategy base class with thread pool fallback - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather - Updated AsyncWebCrawler.arun() to detect and use arun() when available - Added comprehensive test suite to verify parallel execution Impact: - LLM extraction now runs truly in parallel across multiple URLs - Significant performance improvement for multi-URL crawls with LLM strategies - Backward compatible - existing extraction strategies continue to work - No breaking changes to public API Technical details: - Uses litellm.acompletion for non-blocking LLM calls - Leverages asyncio.gather for concurrent chunk processing - Maintains backward compatibility via asyncio.to_thread fallback - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers

…Many This commit resolves issue #1055 where LLM extraction was blocking async

feat(ManagedBrowser): add viewport size configuration for browser launch

…ve monitoring documentation Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system. Changes: - Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition - Updated mkdocs.yml navigation to "Self-Hosting Guide" - Completely rewrote introduction emphasizing self-hosting benefits: * Data privacy and ownership * Cost control and transparency * Performance and security advantages * Full customization capabilities - Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with: * Monitoring Dashboard section documenting the /monitor UI * Complete feature breakdown (system health, requests, browsers, janitor, errors) * Monitor API Endpoints with all REST endpoints and examples * WebSocket Streaming integration guide with Python examples * Control Actions for manual browser management * Production Integration patterns (Prometheus, custom dashboards, alerting) * Key production metrics to track - Enhanced summary section: * What users learned checklist * Why self-hosting matters * Clear next steps * Key resources with monitoring dashboard URL The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable. Users will understand they have complete operational visibility at http://localhost:11235/monitor with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs. This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level monitoring capabilities, not just a Docker deployment.

: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples

…lSeeder

…umentation

…ategy with seen URL tracking

#1551 : Fix casing and variable name consistency for LLMConfig in doc…

#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…

Add CDP endpoint verification with exponential backoff for managed browsers

feat: Add Nstproxy Proxies

#1591 enhance proxy configuration with security, SSL analysis, and rotation examples

Fix/dfs deep crawling

…ync_configs.py implementation Updated browser-crawler-config.md and parameters.md to ensure complete accuracy with the actual BrowserConfig and CrawlerRunConfig implementations. Changes: - Removed non-existent parameters from documentation: * enable_rate_limiting, rate_limit_config (never implemented) * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher) * display_mode (doesn't exist) - Added missing BrowserConfig parameters (14 total): * browser_mode, use_managed_browser, cdp_url, debugging_port, host * viewport, chrome_channel, channel * accept_downloads, downloads_path, storage_state, sleep_on_close * user_agent_mode, user_agent_generator_config, enable_stealth - Added missing CrawlerRunConfig parameters (29 total): * chunking_strategy, keep_attrs, parser_type, scraping_strategy * proxy_config, proxy_rotation_strategy * locale, timezone_id, geolocation, fetch_ssl_certificate * shared_data, wait_for_timeout * c4a_script, max_scroll_steps * exclude_all_images, table_score_threshold, table_extraction * exclude_internal_links, score_links * capture_network_requests, capture_console_messages * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental - Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write) - Reorganized parameters into logical sections (Content Processing, Browser Location & Identity, Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features) - Ensured all parameter descriptions match source code docstrings - Added proper default values from __init__ signatures

Update browser and crawler run config documentation to match async_configs.py implementation

- Updated version to 0.7.7 - Added comprehensive demo and release notes - Updated all documentation

emmanuel-ferdman and others added 30 commits May 13, 2025 00:04

fix(docker-api): migrate to modern datetime library API

1e1c887

Signed-off-by: Emmanuel Ferdman <[email protected]>

Merge branch 'main' into main

8e3c411

Fix examples in README.md

7a8190e

fix(deep-crawl): BestFirst priority inversion; remove pre-scoring tru…

88a9fbb

…ncation. ref #1253 Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.

docs: Update URL seeding examples to use proper async context managers

ecbe5ff

- Wrap all AsyncUrlSeeder usage with async context managers - Update URL seeding adventure example to use "sitemap+cc" source, focus on course posts, and add stream=True parameter to fix runtime error

fix(crawler): Removed the incorrect reference in browser_config varia…

f4a4328

…ble #1310

Merge pull request #1398 from unclecode/fix/update-url-seeding-docs

dad7c51

Update URL seeding examples to use proper async context managers

docs: update Docker instructions to use the latest release tag

9447054

Merge pull request #1369 from NezarAli/main

f4206d6

Fix examples in README.md

Merge pull request #1104 from emmanuel-ferdman/main

ef174a4

fix(docker-api): migrate to modern datetime library API

Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …

69961cf

…develop

Merge pull request #1422 from unclecode/fix/docker-llmEnvFile

8bb0e68

fix(docker): Fix LLM API key handling for multi-provider support

Merge branch 'develop' of https://github.com/unclecode/crawl4ai into …

90af453

…develop

fix(utils): Improve URL normalization by avoiding quote/unquote to pr…

40ab287

…eserve '+' signs. ref #1332

fix(logger): ensure logger is a Logger instance in crawling strategie…

38f3ea4

…s. ref #1437

Merge pull request #1440 from unclecode/feature/docker-llm-parameters

4fe2d01

feat(docker): Add temperature and base_url parameters for LLM configuration

Merge pull request #1426 from unclecode/fix/update-quickstart-and-ada…

cce3390

…ptive-strategies-docs Update Quickstart and Adaptive Strategies documentation

Merge pull request #1436 from unclecode/fix/docker-filter

4e1c4bd

fix(docker): resolve filter serialization and JSON encoding errors in deep crawl strategy

Remove deprecated test for 'proxy' parameter in BrowserConfig and upd…

4ed33fc

…ate .gitignore to include test_scripts directory.

feat: add preserve_https_for_internal_links flag to maintain HTTPS du…

f566c5a

…ring crawling. Ref #1410 Added a new `preserve_https_for_internal_links` configuration flag that preserves the original HTTPS scheme for same-domain links even when the server redirects to HTTP.

feat: update documentation for preserve_https_for_internal_links. ref #…

bdacf61

…1410

ntohidi and others added 28 commits November 6, 2025 00:07

Merge pull request #1537 from unclecode/fix/docker-compose-llm-env

854694e

fix(docker): Remove environment variable overrides in docker-compose.yml

Merge pull request #1529 from unclecode/fix/remove_overlay_elements

2c91815

Fix remove_overlay_elements functionality by calling injected JS function.

Merge pull request #1590 from unclecode/fix/async-llm-extraction-arun…

66175e1

…Many This commit resolves issue #1055 where LLM extraction was blocking async

Merge pull request #1495 from unclecode/fix/viewport_in_managed_browser

d56b0eb

feat(ManagedBrowser): add viewport size configuration for browser launch

Update gitignore

81b5312

#1591

263ac89

: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples

#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…

80745bc

…lSeeder

#1551 : Fix casing and variable name consistency for LLMConfig in doc…

2e8f8c9

…umentation

feat: Add Nstproxy Proxies

8045216

#1510 : Add DFS deep crawler demonstration script and enhance DFS str…

1bd3de6

…ategy with seen URL tracking

Merge pull request #1599 from unclecode/docs-llm-strategies-update

124ac58

#1551 : Fix casing and variable name consistency for LLMConfig in doc…

Merge pull request #1598 from unclecode/fix/sitemap_seeder

be00fc3

#1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…

Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing

b207ae2

Add CDP endpoint verification with exponential backoff for managed browsers

Merge pull request #1605 from Nstproxy/feat/nstproxy

cdcb883

feat: Add Nstproxy Proxies

Merge branch 'fix/docker' into develop

89cc29f

Refactor proxy configuration documentation for clarity and consistency

fe353c4

Merge pull request #1596 from unclecode/docs-proxy-security

8116b15

#1591 enhance proxy configuration with security, SSL analysis, and rotation examples

Update proxy-security documentation

d0fb535

Rename folder name for NSTProxy integration examples for crawl4ai

998c809

Enhance DFSDeepCrawlStrategy documentation for clarity and detail

ceade85

Merge pull request #1607 from unclecode/fix/dfs_deep_crawling

466be69

Fix/dfs deep crawling

Bump version to 0.7.7 for stable release

b585795

Merge pull request #1609 from unclecode/fix/update-config-documentation

f3146de

Update browser and crawler run config documentation to match async_configs.py implementation

Merge branch 'develop' into release/v0.7.7

2c973b1

Release v0.7.7

6244f56

- Updated version to 0.7.7 - Added comprehensive demo and release notes - Updated all documentation

ntohidi requested a review from unclecode November 14, 2025 09:33

unclecode merged commit cb637fb into main Nov 16, 2025
3 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Release v0.7.7 - The Self-Hosting & Monitoring Update #1613

Release v0.7.7 - The Self-Hosting & Monitoring Update #1613

Uh oh!

ntohidi commented Nov 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

Uh oh!

Release v0.7.7 - The Self-Hosting & Monitoring Update #1613

Release v0.7.7 - The Self-Hosting & Monitoring Update #1613

Uh oh!

Conversation

ntohidi commented Nov 14, 2025

Summary

🚀 What's New

📝 Documentation Updates

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants