A Scrapy Download Handler which performs requests using Nodriver. It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc.).
What makes this package different from packages like scrapy-playwright is its focus on staying undetected by most anti-bot solutions. Communicating over the Chrome DevTools Protocol (CDP) provides even better resistance against web application firewalls (WAFs), while also giving performance a massive boost.
After the release of version 2.0, which includes support for coroutine syntax and asyncio, Scrapy allows integrating asyncio-based projects such as Nodriver.
Note: Chrome must be installed on the system.
- Python >= 3.8
- Scrapy >= 2.0 (!= 2.4.0)
scrapy-nodriver is available on PyPI and can be installed with pip:

```
pip install scrapy-nodriver
```

nodriver is defined as a dependency, so it gets installed automatically.
Replace the default http and/or https Download Handlers through
DOWNLOAD_HANDLERS:

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
    "https": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
}
```

Note that the ScrapyNodriverDownloadHandler class inherits from the default
http/https handler. Unless explicitly marked (see Basic usage),
requests will be processed by the regular Scrapy download handler.
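In other words, the routing decision boils down to whether the request's meta marks it for the browser; a minimal sketch of that truthiness check (illustrative only, not the handler's actual code):

```python
def should_use_nodriver(meta: dict) -> bool:
    # Route to the browser only when the "nodriver" meta key is truthy;
    # everything else falls through to the regular Scrapy handler.
    return bool(meta.get("nodriver"))

print(should_use_nodriver({"nodriver": True}))  # True
print(should_use_nodriver({}))                  # False
print(should_use_nodriver({"nodriver": 0}))     # False
```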
Install the asyncio-based Twisted reactor:

```python
# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

This is the default in new projects since Scrapy 2.7.
Set the nodriver Request.meta
key to download a request using Nodriver:

```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"nodriver": True})

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
```

Type Optional[int], defaults to the value of Scrapy's CONCURRENT_REQUESTS setting
Maximum amount of allowed concurrent Nodriver pages.
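Conceptually, the cap behaves like a semaphore around page creation: a request marked for Nodriver waits for a free page slot before a new page is opened. A simplified asyncio sketch of that behavior (not the handler's actual implementation):

```python
import asyncio

MAX_CONCURRENT_PAGES = 2  # stands in for NODRIVER_MAX_CONCURRENT_PAGES

async def fetch(sem: asyncio.Semaphore, active: list, peaks: list) -> None:
    async with sem:                # wait for a free page "slot"
        active[0] += 1
        peaks.append(active[0])
        await asyncio.sleep(0.01)  # simulate the page doing work
        active[0] -= 1

async def main() -> int:
    sem = asyncio.Semaphore(MAX_CONCURRENT_PAGES)
    active, peaks = [0], []
    # 8 pending requests, but at most MAX_CONCURRENT_PAGES run at once
    await asyncio.gather(*(fetch(sem, active, peaks) for _ in range(8)))
    return max(peaks)  # highest number of simultaneously "open" pages

peak = asyncio.run(main())
print(peak)  # 2: never exceeds the cap
```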
```python
NODRIVER_MAX_CONCURRENT_PAGES = 8
```

Type Optional[List], default None
Block resources on the page.
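The URL patterns used for blocking are glob-style wildcards matched against the request URL. As a sketch of how such matching behaves, using Python's fnmatch as an assumed approximation of the browser-side matching:

```python
from fnmatch import fnmatch

BLOCKED = ["*/*.jpg", "*/*.png", "*/*.svg"]

def is_blocked(url: str) -> bool:
    # True if the URL matches any of the glob-style patterns
    return any(fnmatch(url, pattern) for pattern in BLOCKED)

print(is_blocked("https://example.org/static/logo.png"))  # True
print(is_blocked("https://example.org/index.html"))       # False
```

The setting itself takes a list of such patterns: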
```python
NODRIVER_BLOCKED_URLS = [
    "*/*.jpg",
    "*/*.png",
    "*/*.gif",
    "*/*.webp",
    "*/*.svg",
    "*/*.ico",
]
```

Type Optional[bool], default True
Whether to run the browser in headless mode.

```python
NODRIVER_HEADLESS = True
```

Supported Request.meta keys
Type bool, default False
If set to a value that evaluates to True the request will be processed by Nodriver.
```python
return scrapy.Request("https://example.org", meta={"nodriver": True})
```

Type bool, default False
If True, the Nodriver page
that was used to download the request will be available in the callback at
response.meta['nodriver_page']. If False (or unset) the page will be
closed immediately after processing the request.
Important!
This meta key is entirely optional, it's NOT necessary for the page to load or for any
asynchronous operation to be performed (specifically, it's NOT necessary for PageMethod
objects to be applied). Use it only if you need access to the Page object in the callback
that handles the response.
For more information and important notes see Receiving Page objects in callbacks.
```python
return scrapy.Request(
    url="https://example.org",
    meta={"nodriver": True, "nodriver_include_page": True},
)
```

Type Iterable[PageMethod], default ()
An iterable of scrapy_nodriver.page.PageMethod
objects to indicate actions to be performed on the page before returning the
final response. See Executing actions on pages.
Type Optional[nodriver.Tab], default None
A Nodriver page to be used to
download the request. If unspecified, a new page is created for each request.
This key could be used in conjunction with nodriver_include_page to make a chain of
requests using the same page. For instance:
```python
from nodriver import Tab

def start_requests(self):
    yield scrapy.Request(
        url="https://httpbin.org/get",
        meta={"nodriver": True, "nodriver_include_page": True},
    )

def parse(self, response, **kwargs):
    page: Tab = response.meta["nodriver_page"]
    yield scrapy.Request(
        url="https://httpbin.org/headers",
        callback=self.parse_headers,
        meta={"nodriver": True, "nodriver_page": page},
    )
```

A full spider reusing the same page across requests, with an errback to make sure the page is closed even if a request fails:

```python
from nodriver import Tab
import scrapy


class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            callback=self.parse_first,
            meta={"nodriver": True, "nodriver_include_page": True},
            errback=self.errback_close_page,
        )

    def parse_first(self, response):
        page: Tab = response.meta["nodriver_page"]
        return scrapy.Request(
            url="https://example.com",
            callback=self.parse_second,
            meta={"nodriver": True, "nodriver_include_page": True, "nodriver_page": page},
            errback=self.errback_close_page,
        )

    async def parse_second(self, response):
        page: Tab = response.meta["nodriver_page"]
        title = await page.title()  # "Example Domain"
        await page.close()
        return {"title": title}

    async def errback_close_page(self, failure):
        page: Tab = failure.request.meta["nodriver_page"]
        await page.close()
```

Notes:
- When passing nodriver_include_page=True, make sure pages are always closed when they are no longer used. It's recommended to set a Request errback to make sure pages are closed even if a request fails (if nodriver_include_page=False, pages are automatically closed upon encountering an exception). This is important, as open pages count towards the limit set by NODRIVER_MAX_CONCURRENT_PAGES and crawls could freeze if the limit is reached and pages remain open indefinitely.
- Defining callbacks as async def is only necessary if you need to await things; it's NOT necessary if you just need to pass over the Page object from one callback to another (see the example above).
- Any network operations resulting from awaiting a coroutine on a Page object (get, etc.) will be executed directly by Nodriver, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc.).
A sorted iterable (e.g. list, tuple) of PageMethod objects can be passed in the
nodriver_page_methods Request.meta key to request methods to be invoked on the
Page object before returning the final Response to the callback.
This is useful when you need to perform certain actions on a page (like scrolling down or clicking links) and you want to handle only the final result in your callback.
Represents a method to be called (and awaited if necessary) on a
nodriver.Tab object (e.g. "select", "save_screenshot", "evaluate", etc).
method is the name of the method, *args and **kwargs
are passed when calling such method. The return value
will be stored in the PageMethod.result attribute.
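The dispatch semantics can be pictured as: look up the named attribute on the page, call it with the stored arguments, await the result if it is awaitable, and store the return value. A simplified, hypothetical sketch (apply_page_method and DummyPage are illustrative, not part of the package):

```python
import asyncio
import inspect

class PageMethod:
    # simplified stand-in for scrapy_nodriver.page.PageMethod
    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None

async def apply_page_method(page, pm: PageMethod):
    attr = getattr(page, pm.method)   # look up e.g. page.evaluate
    value = attr(*pm.args, **pm.kwargs)
    if inspect.isawaitable(value):    # await coroutine methods
        value = await value
    pm.result = value                 # stored for the callback to read
    return pm.result

class DummyPage:
    async def evaluate(self, script):
        return f"evaluated: {script}"

pm = PageMethod("evaluate", "document.title")
asyncio.run(apply_page_method(DummyPage(), pm))
print(pm.result)  # evaluated: document.title
```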
For instance:
```python
def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={
            "nodriver": True,
            "nodriver_page_methods": [
                PageMethod("save_screenshot", filename="example.jpeg", full_page=True),
            ],
        },
    )

def parse(self, response, **kwargs):
    screenshot = response.meta["nodriver_page_methods"][0]
    # screenshot.result contains the image file path
```

produces the same effect as:
```python
def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={"nodriver": True, "nodriver_include_page": True},
    )

async def parse(self, response, **kwargs):
    page = response.meta["nodriver_page"]
    filepath = await page.save_screenshot(filename="example.jpeg", full_page=True)
    await page.close()
```

Refer to the upstream docs for the Tab class
to see available methods.
Scroll down on an infinite scroll page, take a screenshot of the full page:

```python
import scrapy

from scrapy_nodriver.page import PageMethod


class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                nodriver=True,
                nodriver_include_page=True,
                nodriver_page_methods=[
                    PageMethod("wait_for", "div.quote"),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response, **kwargs):
        page = response.meta["nodriver_page"]
        await page.save_screenshot(filename="quotes.jpeg", full_page=True)
        await page.close()
        return {"quote_count": len(response.css("div.quote"))}  # quotes from several pages
```

Specifying a proxy via the proxy Request meta key is not supported.
Before opening an issue please make sure the unexpected behavior can only be observed by using this package and not with standalone Nodriver. To do this, translate your spider code to a reasonably close Nodriver script: if the issue also occurs this way, you should instead report it upstream. For instance:
```python
import scrapy

from scrapy_nodriver.page import PageMethod


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                nodriver=True,
                nodriver_page_methods=[
                    PageMethod("save_screenshot", filename="example.jpeg", full_page=True),
                ],
            ),
        )
```

translates roughly to:

```python
import nodriver as uc


async def main():
    browser = await uc.start()
    page = await browser.get("https://example.org")
    await page.save_screenshot(filename="example.jpeg", full_page=True)
    await page.close()


if __name__ == "__main__":
    uc.loop().run_until_complete(main())
```