Skip to content

PlaywrightCrawler error_handler cannot access Page object #1482

@janbuchar

Description

@janbuchar

In the Javascript version, the error handler is able to access the Page object via PlaywrightCrawlingContext.page. I discovered that the Python version doesn't implement this when porting the ContextPipeline to Javascript.

Test case

async def test_error_handler_can_access_page(server_url: URL) -> None:
    crawler = PlaywrightCrawler(max_request_retries=2)

    request_handler = mock.AsyncMock(side_effect=RuntimeError('Intentional crash'))
    crawler.router.default_handler(request_handler)

    error_handler_calls: list[str | None] = []

    @crawler.error_handler
    async def error_handler(context: BasicCrawlingContext | PlaywrightCrawlingContext, _error: Exception) -> None:
        error_handler_calls.append(
            await context.page.content() if isinstance(context, PlaywrightCrawlingContext) else None
        )

    await crawler.run([str(server_url / 'hello-world')])

    assert error_handler_calls == [HELLO_WORLD, HELLO_WORLD, HELLO_WORLD]

Possible solutions

  1. Run the error handlers before the cleanup step of the context pipeline
  2. Add some "deferred cleanup" step to the context pipeline and call that after error handlers are done
    • it's unclear how this would fit in the current async generator based middleware model
    • considerable refactoring of the _run_request_handler and __run_task_function would still be necessary - error handlers are called by the latter and context pipeline is only handled in the former

Metadata

Metadata

Assignees

Labels

t-toolingIssues with this label are in the ownership of the tooling team.

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions