fix #1563 (cdp): resolve page leaks and race conditions in concurrent… #1592

Ahmed-Tawfik94 · 2025-11-07T09:27:16Z

#1563 Fix memory leaks and race conditions in CDP managed browser crawling

Fix memory leaks and race conditions when using arun_many() with managed CDP browsers. Each crawl now gets proper page isolation with automatic cleanup while maintaining shared browser context.

Key fixes:

Close non-session pages after crawling to prevent tab accumulation
Add thread-safe page creation with locks to avoid concurrent access
Improve page lifecycle management for managed vs non-managed browsers
Keep session pages alive for authentication persistence
Prevent TOCTOU (time-of-check-time-of-use) race conditions

This ensures stable parallel crawling without memory growth or browser instability.

Summary

Fixes #1563

This PR resolves critical memory leaks and race conditions that occurred when using arun_many() with managed CDP browsers. The main issues were:

Memory Leaks: Pages (tabs) were not being closed after crawling, causing unbounded tab accumulation in the managed browser
Race Conditions: Concurrent page creation led to TOCTOU (time-of-check-time-of-use) issues where multiple coroutines could attempt to create pages simultaneously
Page Lifecycle Confusion: The code didn't properly distinguish between managed browser pages (should stay open) and non-managed pages (should be closed)
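
The TOCTOU issue described above can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: `FakeContext`, `get_page_unsafe`, `get_page_locked`, and `count_pages` are hypothetical names, and an `asyncio.Lock` is assumed as the locking primitive. It shows how a check followed by an `await` lets several coroutines all pass the check before any of them acts, and how holding a lock across both steps removes that window.

```python
import asyncio


class FakeContext:
    """Stand-in for a browser context (hypothetical, not a Playwright or crawl4ai class)."""

    def __init__(self):
        self.pages = []

    async def new_page(self):
        # Yield to the event loop once, as a real page creation call would.
        await asyncio.sleep(0)
        page = object()
        self.pages.append(page)
        return page


async def get_page_unsafe(ctx):
    # Time of check: no page exists yet.
    if not ctx.pages:
        # Time of use: other coroutines can run the same check during this await.
        return await ctx.new_page()
    return ctx.pages[0]


async def get_page_locked(ctx, lock):
    # The lock covers both the check and the creation, so they act as one step.
    async with lock:
        if not ctx.pages:
            return await ctx.new_page()
        return ctx.pages[0]


async def count_pages(use_lock, workers=5):
    """Run `workers` concurrent requests and report how many pages were created."""
    ctx = FakeContext()
    lock = asyncio.Lock()
    if use_lock:
        jobs = [get_page_locked(ctx, lock) for _ in range(workers)]
    else:
        jobs = [get_page_unsafe(ctx) for _ in range(workers)]
    await asyncio.gather(*jobs)
    return len(ctx.pages)


if __name__ == "__main__":
    print("without lock:", asyncio.run(count_pages(use_lock=False)))
    print("with lock:", asyncio.run(count_pages(use_lock=True)))
```

Without the lock, every worker passes the emptiness check before the first page is appended, so more than one page is created; with the lock, exactly one page is created and the remaining workers see it.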

The fix ensures that:

Each crawl operation in arun_many() gets its own isolated page/tab
Pages are properly cleaned up after crawling (except session pages)
Page creation is thread-safe with proper locking mechanisms
Managed browsers maintain stable performance during parallel crawling operations
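
The lifecycle rules listed above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the implementation in crawl4ai/browser_manager.py: `FakePage`, `FakeBrowser`, `crawl`, and `demo` are hypothetical names, and the real code may structure cleanup differently. Each crawl gets its own page, page creation is guarded by a lock, and the page is closed in a `finally` block unless it belongs to a session.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser tab."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class FakeBrowser:
    """Hypothetical manager that tracks open pages and session pages."""

    def __init__(self):
        self.open_pages = []
        self.sessions = {}
        self._page_lock = asyncio.Lock()

    async def get_page(self, session_id=None):
        # Serialize lookup and creation so two crawls never share a fresh page.
        async with self._page_lock:
            if session_id and session_id in self.sessions:
                return self.sessions[session_id]
            page = FakePage()
            self.open_pages.append(page)
            if session_id:
                self.sessions[session_id] = page
            return page

    async def release(self, page, session_id=None):
        # Session pages stay open so authentication state persists.
        if session_id:
            return
        await page.close()
        self.open_pages.remove(page)


async def crawl(browser, url, session_id=None):
    page = await browser.get_page(session_id)
    try:
        await asyncio.sleep(0)  # simulated navigation and extraction
        return f"crawled {url}"
    finally:
        # Cleanup runs even if the crawl raises, so tabs do not accumulate.
        await browser.release(page, session_id)


async def demo():
    browser = FakeBrowser()
    urls = [f"https://example.com/{i}" for i in range(3)]
    await asyncio.gather(*(crawl(browser, u) for u in urls))
    await crawl(browser, "https://example.com/login", session_id="auth")
    return len(browser.open_pages)


if __name__ == "__main__":
    print("pages still open:", asyncio.run(demo()))
```

After three non-session crawls and one session crawl, only the session page remains open.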

List of files changed and why

crawl4ai/async_crawler_strategy.py - Updated page cleanup logic to properly close pages after crawling when using non-managed browsers, while preserving session pages for authentication persistence
crawl4ai/browser_manager.py - Added thread-safe page creation with locks to prevent race conditions, and improved page lifecycle management to distinguish between managed and non-managed browser contexts
docs/md_v2/advanced/cdp-browser-crawling.md - Added comprehensive documentation for CDP browser crawling, including setup instructions, usage examples, and best practices for managed browser workflows
tests/test_arun_many_cdp.py - Created new test suite with both parallel and sequential test cases to verify proper page isolation and cleanup in arun_many() operations with managed CDP browsers

How Has This Been Tested?

The changes have been tested with:

Unit Tests: Created tests/test_arun_many_cdp.py with two test scenarios:
- test_arun_many_with_cdp(): Tests parallel crawling of 3 URLs to verify proper page isolation
- test_arun_many_with_cdp_sequential(): Tests sequential crawling to isolate potential issues
Manual Testing:
- Verified with a managed CDP browser running on localhost:9222
- Monitored browser tab count during arun_many() operations to confirm tabs are created and cleaned up properly
- Tested with multiple concurrent crawl operations to verify thread safety
- Confirmed no memory growth during repeated crawling operations
Test Requirements: Tests require a running CDP browser instance (can be started with crwl cdp -d 9222)
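
The tab-count monitoring mentioned under Manual Testing can be scripted. The sketch below is one possible approach and was not part of the testing described here: it assumes the browser exposes the standard Chrome DevTools HTTP endpoint `/json/list` on localhost:9222, and `count_page_targets` and `fetch_tab_count` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def count_page_targets(targets):
    """Count DevTools targets of type "page" (tabs), ignoring workers and others."""
    return sum(1 for target in targets if target.get("type") == "page")


def fetch_tab_count(host="localhost", port=9222, timeout=5.0):
    """Query the DevTools HTTP endpoint and return the number of open tabs.

    Assumes a browser started with remote debugging enabled on `port`.
    """
    with urlopen(f"http://{host}:{port}/json/list", timeout=timeout) as resp:
        return count_page_targets(json.load(resp))


if __name__ == "__main__":
    # Requires a running CDP browser; call before and after arun_many() to compare.
    print("open tabs:", fetch_tab_count())
```

Calling `fetch_tab_count()` before and after a batch of crawls gives a simple check that the tab count returns to its starting value.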

All tests pass successfully, confirming that memory leaks and race conditions are resolved.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added/updated unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Ahmed-Tawfik94 assigned ntohidi Nov 7, 2025
