Skip to content
Merged
Show file tree
Hide file tree
Changes from 21 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
9a83471
create ZyteAPITextResponse and ZyteAPIResponse
BurnzZ Apr 26, 2022
8909473
update README and CHANGES with notes on new response classes
BurnzZ Apr 27, 2022
d0dc08d
set the encoding consistently to be 'utf-8'
BurnzZ Apr 28, 2022
109dbf0
improve example and docs
BurnzZ Apr 28, 2022
9695880
override replace() to prevent 'zyte_api_response' attribute from bein…
BurnzZ Apr 28, 2022
8812a05
fix mypy failures
BurnzZ Apr 28, 2022
ba64103
enforce 'utf-8' encoding on Text responses
BurnzZ Apr 28, 2022
84dac7d
update expectation for replacing zyte_api_response attribute
BurnzZ Apr 29, 2022
5b83443
update README regarding default params
BurnzZ Apr 29, 2022
fb0b412
remove 'Content-Encoding' header when returning responses
BurnzZ May 2, 2022
10a4603
remove the ZYTE_API_ENABLED setting
BurnzZ May 2, 2022
b7102fa
remove zyte_api_default_params in the spider
BurnzZ May 2, 2022
2b4a0fb
refactor TestAPI to have single producer of requests and responses
BurnzZ May 2, 2022
97ea1e4
implement ZYTE_API_DEFAULT_PARAMS in the settings
BurnzZ May 3, 2022
5dd1bec
fix failing tests
BurnzZ May 3, 2022
052d0d6
Merge pull request #14 from scrapy-plugins/fix-decompression-error
kmike May 11, 2022
48a4766
rename zyte_api_response into zyte_api
BurnzZ May 19, 2022
2455bdf
Merge pull request #13 from scrapy-plugins/default-settings
BurnzZ May 19, 2022
910085b
add tests for css/xpath selectors
BurnzZ May 25, 2022
e3214d8
enable css/xpath selectors on httpResponseBody
BurnzZ May 26, 2022
e530053
handle empty 'browserHtml' or 'httpResponseBody'
BurnzZ May 26, 2022
27c7a7d
Fix typos in docs
BurnzZ May 27, 2022
5b7cf6f
update how replace() works
BurnzZ May 27, 2022
2adc8a6
update README in line with the ZYTE_API_DEFAULT_PARAMS expectations
BurnzZ May 27, 2022
32faf3d
add test case to ensure zyte_api is intact when replacing other attribs
BurnzZ May 27, 2022
cec0677
make process_response() private
BurnzZ May 27, 2022
e0865e7
update tests to ensure other response attribs are not updated on .rep…
BurnzZ May 27, 2022
34a427f
raise an error if zyte_api is passed to .replace()
BurnzZ May 27, 2022
37a4cc7
rename '.zyte_api' attribute as '.raw_api_response'
BurnzZ May 27, 2022
f5a9bb0
refactor to accept 'True' and '{}' to trigger Zyte API Requests
BurnzZ May 30, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions CHANGES.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,14 @@
Changes
=======

TBD
---

* Introduce ``ZyteAPIResponse`` and ``ZyteAPITextResponse`` which are subclasses
of ``scrapy.http.Response`` and ``scrapy.http.TextResponse`` respectively.
These new response classes hold the raw Zyte API response in its
``zyte_api`` attribute.

0.1.0 (2022-02-03)
------------------

Expand Down
96 changes: 67 additions & 29 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,8 @@ Installation

This package requires Python 3.7+.

How to configure
----------------
Configuration
-------------

Replace the default ``http`` and ``https`` in Scrapy's
`DOWNLOAD_HANDLERS <https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DOWNLOAD_HANDLERS>`_
Expand All @@ -46,7 +46,7 @@ Lastly, make sure to `install the asyncio-based Twisted reactor
<https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor)>`_
in the ``settings.py`` file as well:

Here's example of the things needed inside a Scrapy project's ``settings.py`` file:
Here's an example of the things needed inside a Scrapy project's ``settings.py`` file:

.. code-block:: python

Expand All @@ -60,37 +60,75 @@ Here's example of the things needed inside a Scrapy project's ``settings.py`` fi

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

How to use
----------
Usage
-----

Set the ``zyte_api`` `Request.meta
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
key to download a request using Zyte API. Full list of parameters is provided in the
`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
To enable every request to be sent through Zyte API, you can set the following
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.

That's actually the behavior I'm fine with :) It looks useful e.g. to set default geolocation for Zyte API requests made from a spider. On the other hand, making every request going through Zyte API by using this option looks quite problematic; it'd require more thought. For example,

  • would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?
  • would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?

It seems some feature to "enable Zyte API transparently" makes sense after we put a bit more effort into "transparent" integration of scrapy Requests with Zyte API direct downloader, it doesn't look as straightforward as setting default parameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.

That's right. #13 attempted to not change the behavior fully. It only addressed to prevent the user from repeating the same Zyte API parameters in different parts of the code. In any case, hopefully it should serve as a base point to enable all requests go through Zyte API. :)

would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?

That's a good point. This also extends to other things like downloading an image, file, etc.

I think one option is to prevent browserHtml from being set in the ZYTE_API_DEFAULT_PARAMS. The same is true for javascript, product, articles, etc. I think one way to think of it is having ZYTE_API_DEFAULT_PARAMS support only a handful of parameters, like geolocation or httpResponseBody (though it needs to toggled off when browserHtml is turned on), etc.

This could be a bit tricky since Zyte API offers a lot of features that doesn't only cater to being a simply proxy. I think for users to have a seamless experience when using Zyte API, we should have an interface that could cover general Scrapy use. This could mean only downloading the httpResponseBody and perhaps httpResponseHeaders by default which is available to all scrapy.Request unless overridden by the zyte_api param in meta.

We can open up another PR to explore the different options here.

would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?

I think this could be alleviated somehow by the proposed interface in #20.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @kmike’s main point was that the documentation needs an update, as it is currently quite misleading. The other points are about issues if we decided to make the implementation match the current documentation, which I don’t think we want, at least not at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for clarifying! What do you think about this doc update? 2adc8a6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation change looks great.

I do wonder if we should cause zyte_api: {} to enable Zyte Data API with default parameters. I think that is what I would expect to happen as a user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm a good point. There could be a few cases for this:

  1. The user is aware that ZYTE_API_DEFAULT_PARAMS is set in the settings and wish to use exactly the same values. Thus, zyte_api would be represented as an empty {}.
  2. ZYTE_API_DEFAULT_PARAMS is not in the settings. The user might've thought that it was set but was actually not. This would result in the requests to be executed without using Zyte API.

We could use meta={"zyte_api": {}} to represent it but it kinda feels unnatural as an empty-dict. Perhaps we could also support meta={"zyte_api": True} as another alternative. The {} is still a valid use-case though since the user might have a dynamic process to determine the contents of meta field. For example:

meta= {"zyte_api": {}}  # default
if does_it_look_like_we_need_javascript():
    meta["zyte_api"].update({"javascript": True})
if how_about_using_fr_region():
    meta["zyte_api"].update({"geolocation": "FR"})
yield scrapy.Request(url, self.callback_func, meta=meta)

To summarize, we could use {} and True to enable Zyte API Requests bearing in mind the need for ZYTE_API_DEFAULT_PARAMS be set.

Any thoughts on this?

Copy link
Contributor

@Gallaecio Gallaecio May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to having {} and True work and behave the same. At some point Zyte Data API may (should? 🙂) work without parameter other than the URL (e.g. consider browserHtml or httpResponseBody enabled if no other output is enabled), in which case zyte_api=True may become a common pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the only remaining fix to make, the PR looks good to me otherwise 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored the code to accept True and {} in f5a9bb0.

in the ``settings.py`` file or `any other settings within Scrapy
<https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings>`_:

.. code-block:: python

import scrapy

ZYTE_API_DEFAULT_PARAMS = {
"browserHtml": True,
"geolocation": "US",
}

class TestSpider(scrapy.Spider):
name = "test"
You can see the full list of parameters in the `Zyte API Specification
<https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.

def start_requests(self):
On the other hand, you could also control it on a per request basis by setting the
``zyte_api`` key in `Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_.
When doing so, it will override any parameters that was set in the
``ZYTE_API_DEFAULT_PARAMS`` setting.

yield scrapy.Request(
url="http://books.toscrape.com/",
callback=self.parse,
meta={
"zyte_api": {
"browserHtml": True,
# You can set any GEOLocation region you want.
"geolocation": "US",
"javascript": True,
"echoData": {"something": True},
}
},
)
.. code-block:: python

def parse(self, response):
yield {"URL": response.url, "status": response.status, "HTML": response.body}
import scrapy


class SampleQuotesSpider(scrapy.Spider):
name = "sample_quotes"

def start_requests(self):

yield scrapy.Request(
url="http://books.toscrape.com/",
callback=self.parse,
meta={
"zyte_api": {
"browserHtml": True,
"geolocation": "US", # You can set any Geolocation region you want.
"javascript": True,
"echoData": {"some_value_I_could_track": 123},
}
},
)

def parse(self, response):
yield {"URL": response.url, "status": response.status, "HTML": response.body}

print(response.zyte_api)
# {
# 'url': 'https://quotes.toscrape.com/',
# 'browserHtml': '<html> ... </html>',
# 'echoData': {'some_value_I_could_track': 123},
# }

print(response.request.meta)
# {
# 'zyte_api': {
# 'browserHtml': True,
# 'geolocation': 'US',
# 'javascript': True,
# 'echoData': {'some_value_I_could_track': 123}
# },
# 'download_timeout': 180.0,
# 'download_slot': 'quotes.toscrape.com'
# }

The raw Zyte API Response can be accessed via the ``zyte_api`` attribute
of the response object. Note that such responses are of ``ZyteAPIResponse`` and
``ZyteAPITextResponse`` which are respectively subclasses of ``scrapy.http.Response``
and ``scrapy.http.TextResponse``. Such classes are needed to hold the raw Zyte API
responses.
56 changes: 17 additions & 39 deletions scrapy_zyte_api/handler.py
Original file line number Diff line number Diff line change
@@ -1,21 +1,22 @@
import json
import logging
import os
from base64 import b64decode
from typing import Any, Dict, Generator, List, Optional
from typing import Any, Dict, Generator, Optional, Union

from scrapy import Spider
from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
from scrapy.crawler import Crawler
from scrapy.exceptions import IgnoreRequest, NotConfigured
from scrapy.http import Request, Response, TextResponse
from scrapy.http import Request
from scrapy.settings import Settings
from scrapy.utils.defer import deferred_from_coro
from scrapy.utils.reactor import verify_installed_reactor
from twisted.internet.defer import Deferred, inlineCallbacks
from zyte_api.aio.client import AsyncClient, create_session
from zyte_api.aio.errors import RequestError

from .responses import ZyteAPIResponse, ZyteAPITextResponse, process_response

logger = logging.getLogger(__name__)


Expand All @@ -30,8 +31,8 @@ def __init__(
)
self._stats = crawler.stats
self._job_id = crawler.settings.get("JOB")
self._zyte_api_default_params = settings.getdict("ZYTE_API_DEFAULT_PARAMS")
self._session = create_session()
self._encoding = "utf-8"

@classmethod
def from_crawler(cls, crawler):
Expand All @@ -53,12 +54,17 @@ def download_request(self, request: Request, spider: Spider) -> Deferred:
else:
return super().download_request(request, spider)

async def _download_request(self, request: Request, spider: Spider) -> Response:
api_params: Dict[str, Any] = request.meta["zyte_api"]
if not isinstance(api_params, dict):
async def _download_request(
self, request: Request, spider: Spider
) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
api_params: Dict[str, Any] = self._zyte_api_default_params or {}
try:
api_params.update(request.meta.get("zyte_api") or {})
except TypeError:
logger.error(
"zyte_api parameters in the request meta should be "
f"provided as dictionary, got {type(api_params)} instead ({request.url})."
f"zyte_api parameters in the request meta should be "
f"provided as dictionary, got {type(request.meta.get('zyte_api'))} "
f"instead ({request.url})."
)
raise IgnoreRequest()
# Define url by default
Expand All @@ -80,31 +86,9 @@ async def _download_request(self, request: Request, spider: Spider) -> Response:
f"Got an error when processing Zyte API request ({request.url}): {er}"
)
raise IgnoreRequest()

self._stats.inc_value("scrapy-zyte-api/request_count")
headers = self._prepare_headers(api_response.get("httpResponseHeaders"))
# browserHtml and httpResponseBody are not allowed at the same time,
# but at least one of them should be present
if api_response.get("browserHtml"):
# Using TextResponse because browserHtml always returns a browser-rendered page
# even when requesting files (like images)
return TextResponse(
url=api_response["url"],
status=200,
body=api_response["browserHtml"].encode(self._encoding),
encoding=self._encoding,
request=request,
flags=["zyte-api"],
headers=headers,
)
else:
return Response(
url=api_response["url"],
status=200,
body=b64decode(api_response["httpResponseBody"]),
request=request,
flags=["zyte-api"],
headers=headers,
)
return process_response(api_response, request)

@inlineCallbacks
def close(self) -> Generator:
Expand All @@ -129,9 +113,3 @@ def _get_request_error_message(error: RequestError) -> str:
if error_data.get("detail"):
return error_data["detail"]
return base_message

@staticmethod
def _prepare_headers(init_headers: Optional[List[Dict[str, str]]]):
if not init_headers:
return None
return {h["name"]: h["value"] for h in init_headers}
123 changes: 123 additions & 0 deletions scrapy_zyte_api/responses.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
from base64 import b64decode
from typing import Dict, List, Optional, Union

from scrapy import Request
from scrapy.http import Response, TextResponse
from scrapy.responsetypes import responsetypes

_DEFAULT_ENCODING = "utf-8"


class ZyteAPIMixin:

REMOVE_HEADERS = {
# Zyte API already decompresses the HTTP Response Body. Scrapy's
# HttpCompressionMiddleware will error out when it attempts to
# decompress an already decompressed body based on this header.
"content-encoding"
}

def __init__(self, *args, zyte_api: Dict = None, **kwargs):
super().__init__(*args, **kwargs)
self._zyte_api = zyte_api

def replace(self, *args, **kwargs):
"""Create a new response with the same attributes except for those given
new values.
"""
return super().replace(*args, **kwargs)

@property
def zyte_api(self) -> Optional[Dict]:
"""Contains the raw API response from Zyte API.
To see the full list of parameters and their description, kindly refer to the
`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
"""
return self._zyte_api

@classmethod
def _prepare_headers(cls, init_headers: Optional[List[Dict[str, str]]]):
if not init_headers:
return None
return {
h["name"]: h["value"]
for h in init_headers
if h["name"].lower() not in cls.REMOVE_HEADERS
}


class ZyteAPITextResponse(ZyteAPIMixin, TextResponse):
@classmethod
def from_api_response(cls, api_response: Dict, *, request: Request = None):
"""Alternative constructor to instantiate the response from the raw
Zyte API response.
"""
body = None
encoding = None

if api_response.get("browserHtml"):
encoding = _DEFAULT_ENCODING # Zyte API has "utf-8" by default
body = api_response["browserHtml"].encode(encoding)
elif api_response.get("httpResponseBody"):
body = b64decode(api_response["httpResponseBody"])

return cls(
url=api_response["url"],
status=200,
body=body,
encoding=encoding,
request=request,
flags=["zyte-api"],
headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
zyte_api=api_response,
)


class ZyteAPIResponse(ZyteAPIMixin, Response):
@classmethod
def from_api_response(cls, api_response: Dict, *, request: Request = None):
"""Alternative constructor to instantiate the response from the raw
Zyte API response.
"""
return cls(
url=api_response["url"],
status=200,
body=b64decode(api_response.get("httpResponseBody") or ""),
request=request,
flags=["zyte-api"],
headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
zyte_api=api_response,
)


def process_response(
api_response: Dict[str, Union[List[Dict], str]], request: Request
) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
"""Given a Zyte API Response and the ``scrapy.Request`` that asked for it,
this returns either a ``ZyteAPITextResponse`` or ``ZyteAPIResponse`` depending
on which if it can properly decode the HTTP Body or have access to browserHtml.
"""

# NOTES: Currently, Zyte API does NOT only allow both 'browserHtml' and
# 'httpResponseBody' to be present at the same time. The support for both
# will be addressed in the future. Reference:
# - https://github.com/scrapy-plugins/scrapy-zyte-api/pull/10#issuecomment-1131406460
# For now, at least one of them should be present.

if api_response.get("browserHtml"):
# Using TextResponse because browserHtml always returns a browser-rendered page
# even when requesting files (like images)
return ZyteAPITextResponse.from_api_response(api_response, request=request)

if api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"):
response_cls = responsetypes.from_args(
headers=api_response["httpResponseHeaders"],
url=api_response["url"],
# FIXME: update this when python-zyte-api supports base64 decoding
body=b64decode(api_response["httpResponseBody"]), # type: ignore
)
if issubclass(response_cls, TextResponse):
return ZyteAPITextResponse.from_api_response(api_response, request=request)

return ZyteAPIResponse.from_api_response(api_response, request=request)
2 changes: 1 addition & 1 deletion tests/mockserver.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ def do_POST(self): # NOQA
{"name": "test_header", "value": "test_value"}
]
if post_data.get("jobId") is None:
browser_html = "<html></html>"
browser_html = "<html><body>Hello<h1>World!</h1></body></html>"
else:
browser_html = f"<html>{post_data['jobId']}</html>"
if post_data.get("browserHtml"):
Expand Down
Loading