@Ahmed-Tawfik94
Collaborator

#1563 Fix memory leaks and race conditions in CDP managed browser crawling

Fix memory leaks and race conditions when using arun_many() with managed CDP browsers. Each crawl now gets proper page isolation with automatic cleanup while maintaining shared browser context.

Key fixes:

  • Close non-session pages after crawling to prevent tab accumulation
  • Add thread-safe page creation with locks to avoid concurrent access
  • Improve page lifecycle management for managed vs non-managed browsers
  • Keep session pages alive for authentication persistence
  • Prevent TOCTOU (time-of-check to time-of-use) race conditions

This ensures stable parallel crawling without memory growth or browser instability.

Summary

Fixes #1563

This PR resolves critical memory leaks and race conditions that occurred when using arun_many() with managed CDP browsers. The main issues were:

  1. Memory Leaks: Pages (tabs) were not being closed after crawling, so tabs accumulated without bound in the managed browser
  2. Race Conditions: Concurrent page creation led to TOCTOU (time-of-check to time-of-use) issues, where several coroutines could pass the same check and then each attempt to create a page
  3. Page Lifecycle Confusion: The code didn't properly distinguish between managed browser pages (should stay open) and non-managed pages (should be closed)
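The race described in item 2 can be illustrated with a minimal asyncio sketch. This is not the code from `crawl4ai/browser_manager.py`; `FakeBrowser`, `get_page_racy`, and `get_page_locked` are hypothetical names, and the example assumes the lock in question behaves like an `asyncio.Lock`.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for a shared browser context (not crawl4ai's API)."""

    def __init__(self) -> None:
        self.pages_created = 0
        self.default_page = None
        self.lock = asyncio.Lock()

    async def _new_page(self) -> str:
        # Yield to the event loop, as real page creation would while waiting on the browser.
        await asyncio.sleep(0)
        self.pages_created += 1
        return f"page-{self.pages_created}"

    async def get_page_racy(self) -> str:
        # Time of check: every coroutine sees None here...
        if self.default_page is None:
            # ...time of use: the await lets the others pass the same check first.
            self.default_page = await self._new_page()
        return self.default_page

    async def get_page_locked(self) -> str:
        # Holding the lock makes check-and-create one indivisible step.
        async with self.lock:
            if self.default_page is None:
                self.default_page = await self._new_page()
            return self.default_page


async def count_creations(method_name: str, workers: int = 5) -> int:
    browser = FakeBrowser()
    method = getattr(browser, method_name)
    await asyncio.gather(*(method() for _ in range(workers)))
    return browser.pages_created


racy_count = asyncio.run(count_creations("get_page_racy"))
locked_count = asyncio.run(count_creations("get_page_locked"))
print(f"pages created without lock: {racy_count}, with lock: {locked_count}")
```

Without the lock, every worker creates its own page even though only one was wanted; with the lock, the check and the creation cannot be interleaved.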

The fix ensures that:

  • Each crawl operation in arun_many() gets its own isolated page/tab
  • Pages are properly cleaned up after crawling (except session pages)
  • Page creation is guarded by locks, so concurrent crawl tasks cannot race while creating pages
  • Managed browsers maintain stable performance during parallel crawling operations
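The lifecycle rules above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `FakePage`, `FakePageManager`, and the `session_id` parameter are hypothetical stand-ins that only track which pages are opened and closed.

```python
import asyncio
from typing import Dict, List, Optional


class FakePage:
    """Hypothetical page object; only tracks whether it has been closed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class FakePageManager:
    """Hypothetical manager: one page per crawl, session pages are reused."""

    def __init__(self) -> None:
        self.sessions: Dict[str, FakePage] = {}
        self.opened: List[FakePage] = []
        self._lock = asyncio.Lock()

    async def get_page(self, session_id: Optional[str] = None) -> FakePage:
        # Lookup and creation happen under one lock to avoid check-then-create races.
        async with self._lock:
            if session_id is not None and session_id in self.sessions:
                return self.sessions[session_id]
            page = FakePage(f"page-{len(self.opened) + 1}")
            self.opened.append(page)
            if session_id is not None:
                self.sessions[session_id] = page
            return page

    async def crawl(self, url: str, session_id: Optional[str] = None) -> str:
        page = await self.get_page(session_id)
        try:
            await asyncio.sleep(0)  # placeholder for navigation and extraction
            return f"{url} via {page.name}"
        finally:
            # Close one-off pages even if the crawl raised; keep session pages open.
            if session_id is None:
                await page.close()


async def demo():
    manager = FakePageManager()
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    results = await asyncio.gather(*(manager.crawl(u) for u in urls))
    await manager.crawl("https://example.com/login", session_id="auth")
    open_pages = [p.name for p in manager.opened if not p.closed]
    return results, open_pages


results, open_pages = asyncio.run(demo())
print(results, open_pages)
```

Putting the close in a `finally` block means a failed crawl still releases its tab, which is the property that keeps the tab count from growing across repeated runs.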

List of files changed and why

  1. crawl4ai/async_crawler_strategy.py - Updated page cleanup logic to properly close pages after crawling when using non-managed browsers, while preserving session pages for authentication persistence

  2. crawl4ai/browser_manager.py - Added thread-safe page creation with locks to prevent race conditions, and improved page lifecycle management to distinguish between managed and non-managed browser contexts

  3. docs/md_v2/advanced/cdp-browser-crawling.md - Added comprehensive documentation for CDP browser crawling, including setup instructions, usage examples, and best practices for managed browser workflows

  4. tests/test_arun_many_cdp.py - Created new test suite with both parallel and sequential test cases to verify proper page isolation and cleanup in arun_many() operations with managed CDP browsers

How Has This Been Tested?

The changes have been tested with:

  1. Unit Tests: Created tests/test_arun_many_cdp.py with two test scenarios:

    • test_arun_many_with_cdp(): Tests parallel crawling of 3 URLs to verify proper page isolation
    • test_arun_many_with_cdp_sequential(): Tests sequential crawling to isolate potential issues
  2. Manual Testing:

    • Verified with a managed CDP browser running on localhost:9222
    • Monitored browser tab count during arun_many() operations to confirm tabs are created and cleaned up properly
    • Tested with multiple concurrent crawl operations to verify thread safety
    • Confirmed no memory growth during repeated crawling operations
  3. Test Requirements: Tests require a running CDP browser instance (can be started with crwl cdp -d 9222)
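The tab-count check from the manual testing can be approximated with a short script. This sketch assumes the browser exposes the standard DevTools HTTP endpoint `/json/list` on the debugging port; the helper names `count_page_targets` and `fetch_targets` are made up for this example, and the sample payload is invented so the counting logic can be exercised without a running browser.

```python
import json
import urllib.request
from typing import Any, Dict, List


def count_page_targets(targets: List[Dict[str, Any]]) -> int:
    """Count DevTools targets whose type is "page" (that is, tabs)."""
    return sum(1 for target in targets if target.get("type") == "page")


def fetch_targets(host: str = "localhost", port: int = 9222) -> List[Dict[str, Any]]:
    """Fetch the target list from a running browser's DevTools HTTP endpoint."""
    with urllib.request.urlopen(f"http://{host}:{port}/json/list", timeout=5) as resp:
        return json.load(resp)


# Offline sample shaped like a /json/list response (values are made up).
sample_targets = [
    {"id": "A1", "type": "page", "url": "https://example.com/"},
    {"id": "B2", "type": "page", "url": "about:blank"},
    {"id": "C3", "type": "service_worker", "url": "https://example.com/sw.js"},
]
sample_count = count_page_targets(sample_targets)
print(f"open tabs in sample: {sample_count}")
```

Against a live browser, calling `count_page_targets(fetch_targets())` before and after an `arun_many()` run and comparing the two numbers gives a rough signal of whether tabs are being cleaned up.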

All tests pass, which is consistent with the memory leaks and race conditions being resolved.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

