Commit a188fcc

Merge pull request #10 from scrapy-plugins/zyte-api-response
introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response
2 parents 001e201 + f5a9bb0 commit a188fcc

7 files changed: +773 -138 lines changed

CHANGES.rst

Lines changed: 8 additions & 0 deletions
@@ -1,6 +1,14 @@
 Changes
 =======
 
+TBD
+---
+
+* Introduce ``ZyteAPIResponse`` and ``ZyteAPITextResponse`` which are subclasses
+  of ``scrapy.http.Response`` and ``scrapy.http.TextResponse`` respectively.
+  These new response classes hold the raw Zyte Data API response in the
+  ``raw_api_response`` attribute.
+
 0.1.0 (2022-02-03)
 ------------------
 

README.rst

Lines changed: 75 additions & 29 deletions
@@ -33,8 +33,8 @@ Installation
 
 This package requires Python 3.7+.
 
-How to configure
-----------------
+Configuration
+-------------
 
 Replace the default ``http`` and ``https`` in Scrapy's
 `DOWNLOAD_HANDLERS <https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DOWNLOAD_HANDLERS>`_
@@ -46,7 +46,7 @@ Lastly, make sure to `install the asyncio-based Twisted reactor
 <https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor)>`_
 in the ``settings.py`` file as well:
 
-Here's example of the things needed inside a Scrapy project's ``settings.py`` file:
+Here's an example of the things needed inside a Scrapy project's ``settings.py`` file:
 
 .. code-block:: python
 
@@ -60,37 +60,83 @@ Here's example of the things needed inside a Scrapy project's ``settings.py`` fi
 
     TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
 
-How to use
-----------
+Usage
+-----
 
-Set the ``zyte_api`` `Request.meta
-<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
-key to download a request using Zyte API. Full list of parameters is provided in the
-`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
+To enable a ``scrapy.Request`` to go through Zyte Data API, the ``zyte_api`` key in
+`Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
+must be present and have dict-like contents.
 
-.. code-block:: python
+To set the default parameters for Zyte API enabled requests, you can set the
+following in the ``settings.py`` file or `any other settings within Scrapy
+<https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings>`_:
 
-    import scrapy
+.. code-block:: python
 
+    ZYTE_API_DEFAULT_PARAMS = {
+        "browserHtml": True,
+        "geolocation": "US",
+    }
 
-    class TestSpider(scrapy.Spider):
-        name = "test"
+You can see the full list of parameters in the `Zyte Data API Specification
+<https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
 
-        def start_requests(self):
+Note that the ``ZYTE_API_DEFAULT_PARAMS`` would only work if the ``zyte_api``
+key in `Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_
+is set. When doing so, it will override any parameters set in the
+``ZYTE_API_DEFAULT_PARAMS`` setting.
 
-            yield scrapy.Request(
-                url="http://books.toscrape.com/",
-                callback=self.parse,
-                meta={
-                    "zyte_api": {
-                        "browserHtml": True,
-                        # You can set any GEOLocation region you want.
-                        "geolocation": "US",
-                        "javascript": True,
-                        "echoData": {"something": True},
-                    }
-                },
-            )
+.. code-block:: python
 
-        def parse(self, response):
-            yield {"URL": response.url, "status": response.status, "HTML": response.body}
+    import scrapy
+
+
+    class SampleQuotesSpider(scrapy.Spider):
+        name = "sample_quotes"
+
+        custom_settings = {
+            "ZYTE_API_DEFAULT_PARAMS": {
+                "geolocation": "US",  # You can set any Geolocation region you want.
+            }
+        }
+
+        def start_requests(self):
+            yield scrapy.Request(
+                url="http://books.toscrape.com/",
+                callback=self.parse,
+                meta={
+                    "zyte_api": {
+                        "browserHtml": True,
+                        "javascript": True,
+                        "echoData": {"some_value_I_could_track": 123},
+                    }
+                },
+            )
+
+        def parse(self, response):
+            yield {"URL": response.url, "status": response.status, "HTML": response.body}
+
+            print(response.raw_api_response)
+            # {
+            #     'url': 'https://quotes.toscrape.com/',
+            #     'browserHtml': '<html> ... </html>',
+            #     'echoData': {'some_value_I_could_track': 123},
+            # }
+
+            print(response.request.meta)
+            # {
+            #     'zyte_api': {
+            #         'browserHtml': True,
+            #         'geolocation': 'US',
+            #         'javascript': True,
+            #         'echoData': {'some_value_I_could_track': 123}
+            #     },
+            #     'download_timeout': 180.0,
+            #     'download_slot': 'quotes.toscrape.com'
+            # }
+
+The raw Zyte Data API response can be accessed via the ``raw_api_response`` attribute
+of the response object. Note that such responses are of ``ZyteAPIResponse`` and
+``ZyteAPITextResponse`` types, which are respectively subclasses of ``scrapy.http.Response``
+and ``scrapy.http.TextResponse``. Such classes are needed to hold the raw Zyte Data API
+responses.
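
The precedence between ``ZYTE_API_DEFAULT_PARAMS`` and the ``zyte_api`` meta key described in the README changes can be illustrated with plain dictionaries. This is a minimal sketch rather than the plugin's code: ``merge_params`` is a hypothetical helper name, and the sketch assumes that per-request values overwrite defaults key by key.

```python
from typing import Any, Dict


def merge_params(defaults: Dict[str, Any], meta_params: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay per-request parameters on the defaults (hypothetical helper)."""
    # A new dict is built so the defaults themselves are left untouched.
    return {**defaults, **meta_params}


# Values taken from the README example in the diff.
defaults = {"geolocation": "US"}
meta_params = {
    "browserHtml": True,
    "javascript": True,
    "echoData": {"some_value_I_could_track": 123},
}

merged = merge_params(defaults, meta_params)
print(merged)
```

A key present in both dictionaries takes the per-request value, so ``merge_params({"geolocation": "US"}, {"geolocation": "DE"})`` yields ``"DE"`` for ``geolocation``.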

scrapy_zyte_api/handler.py

Lines changed: 33 additions & 43 deletions
@@ -1,21 +1,22 @@
 import json
 import logging
 import os
-from base64 import b64decode
-from typing import Any, Dict, Generator, List, Optional
+from typing import Any, Dict, Generator, Optional, Union
 
 from scrapy import Spider
 from scrapy.core.downloader.handlers.http import HTTPDownloadHandler
 from scrapy.crawler import Crawler
 from scrapy.exceptions import IgnoreRequest, NotConfigured
-from scrapy.http import Request, Response, TextResponse
+from scrapy.http import Request
 from scrapy.settings import Settings
 from scrapy.utils.defer import deferred_from_coro
 from scrapy.utils.reactor import verify_installed_reactor
 from twisted.internet.defer import Deferred, inlineCallbacks
 from zyte_api.aio.client import AsyncClient, create_session
 from zyte_api.aio.errors import RequestError
 
+from .responses import ZyteAPIResponse, ZyteAPITextResponse, _process_response
+
 logger = logging.getLogger(__name__)
 
 
@@ -30,8 +31,8 @@ def __init__(
         )
         self._stats = crawler.stats
         self._job_id = crawler.settings.get("JOB")
+        self._zyte_api_default_params = settings.getdict("ZYTE_API_DEFAULT_PARAMS")
         self._session = create_session()
-        self._encoding = "utf-8"
 
     @classmethod
     def from_crawler(cls, crawler):
@@ -48,19 +49,36 @@ def from_crawler(cls, crawler):
         return cls(crawler.settings, crawler, client)
 
     def download_request(self, request: Request, spider: Spider) -> Deferred:
-        if request.meta.get("zyte_api"):
-            return deferred_from_coro(self._download_request(request, spider))
-        else:
-            return super().download_request(request, spider)
+        api_params = self._prepare_api_params(request)
+        if api_params:
+            return deferred_from_coro(
+                self._download_request(api_params, request, spider)
+            )
+        return super().download_request(request, spider)
+
+    def _prepare_api_params(self, request: Request) -> Optional[dict]:
+        meta_params = request.meta.get("zyte_api")
+        if not meta_params and meta_params != {}:
+            return None
+
+        if meta_params is True:
+            meta_params = {}
 
-    async def _download_request(self, request: Request, spider: Spider) -> Response:
-        api_params: Dict[str, Any] = request.meta["zyte_api"]
-        if not isinstance(api_params, dict):
+        api_params: Dict[str, Any] = self._zyte_api_default_params or {}
+        try:
+            api_params.update(meta_params)
+        except TypeError:
             logger.error(
-                "zyte_api parameters in the request meta should be "
-                f"provided as dictionary, got {type(api_params)} instead ({request.url})."
+                f"zyte_api parameters in the request meta should be "
+                f"provided as dictionary, got {type(request.meta.get('zyte_api'))} "
+                f"instead ({request.url})."
             )
             raise IgnoreRequest()
+        return api_params
+
+    async def _download_request(
+        self, api_params: dict, request: Request, spider: Spider
+    ) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
         # Define url by default
         api_data = {**{"url": request.url}, **api_params}
         if self._job_id is not None:
@@ -80,31 +98,9 @@ async def _download_request(self, request: Request, spider: Spider) -> Response:
                 f"Got an error when processing Zyte API request ({request.url}): {er}"
             )
             raise IgnoreRequest()
+
         self._stats.inc_value("scrapy-zyte-api/request_count")
-        headers = self._prepare_headers(api_response.get("httpResponseHeaders"))
-        # browserHtml and httpResponseBody are not allowed at the same time,
-        # but at least one of them should be present
-        if api_response.get("browserHtml"):
-            # Using TextResponse because browserHtml always returns a browser-rendered page
-            # even when requesting files (like images)
-            return TextResponse(
-                url=api_response["url"],
-                status=200,
-                body=api_response["browserHtml"].encode(self._encoding),
-                encoding=self._encoding,
-                request=request,
-                flags=["zyte-api"],
-                headers=headers,
-            )
-        else:
-            return Response(
-                url=api_response["url"],
-                status=200,
-                body=b64decode(api_response["httpResponseBody"]),
-                request=request,
-                flags=["zyte-api"],
-                headers=headers,
-            )
+        return _process_response(api_response, request)
 
     @inlineCallbacks
     def close(self) -> Generator:
@@ -129,9 +125,3 @@ def _get_request_error_message(error: RequestError) -> str:
         if error_data.get("detail"):
             return error_data["detail"]
         return base_message
-
-    @staticmethod
-    def _prepare_headers(init_headers: Optional[List[Dict[str, str]]]):
-        if not init_headers:
-            return None
-        return {h["name"]: h["value"] for h in init_headers}
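
The routing decision that the handler changes introduce (which ``zyte_api`` meta values send a request through Zyte API) can be sketched as a stand-alone function. This is an illustration under stated assumptions, not the handler itself: ``prepare_api_params`` is a hypothetical free function, it copies the defaults before merging, and it raises ``ValueError`` where the handler logs an error and raises ``IgnoreRequest``.

```python
from typing import Any, Dict, Optional


def prepare_api_params(
    meta_value: Any, defaults: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Decide whether a request opts in to Zyte API (hypothetical sketch)."""
    # Missing, None or False means "do not use Zyte API".
    # An empty dict is falsy too, but it still opts in.
    if not meta_value and meta_value != {}:
        return None

    # True means "use the default parameters only".
    if meta_value is True:
        meta_value = {}

    # Assumption of this sketch: work on a copy of the defaults.
    params = dict(defaults)
    try:
        params.update(meta_value)
    except (TypeError, ValueError):
        # Stand-in for the handler's error log plus IgnoreRequest.
        raise ValueError(
            f"zyte_api parameters should be a dictionary, got {type(meta_value)}"
        )
    return params


defaults = {"geolocation": "US"}
print(prepare_api_params(None, defaults))
print(prepare_api_params(True, defaults))
print(prepare_api_params({"browserHtml": True}, defaults))
```

An empty result is itself falsy, so a caller that tests the returned dict, as ``download_request`` does in the diff, falls back to the regular HTTP handler when neither defaults nor per-request parameters are set.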

scrapy_zyte_api/responses.py

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+from base64 import b64decode
+from typing import Dict, List, Optional, Tuple, Union
+
+from scrapy import Request
+from scrapy.http import Response, TextResponse
+from scrapy.responsetypes import responsetypes
+
+_DEFAULT_ENCODING = "utf-8"
+
+
+class ZyteAPIMixin:
+
+    REMOVE_HEADERS = {
+        # Zyte API already decompresses the HTTP Response Body. Scrapy's
+        # HttpCompressionMiddleware will error out when it attempts to
+        # decompress an already decompressed body based on this header.
+        "content-encoding"
+    }
+
+    def __init__(self, *args, raw_api_response: Dict = None, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._raw_api_response = raw_api_response
+
+    def replace(self, *args, **kwargs):
+        if kwargs.get("raw_api_response"):
+            raise ValueError("Replacing the value of 'raw_api_response' isn't allowed.")
+        return super().replace(*args, **kwargs)
+
+    @property
+    def raw_api_response(self) -> Optional[Dict]:
+        """Contains the raw API response from Zyte API.
+
+        To see the full list of parameters and their description, kindly refer to the
+        `Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_.
+        """
+        return self._raw_api_response
+
+    @classmethod
+    def _prepare_headers(cls, init_headers: Optional[List[Dict[str, str]]]):
+        if not init_headers:
+            return None
+        return {
+            h["name"]: h["value"]
+            for h in init_headers
+            if h["name"].lower() not in cls.REMOVE_HEADERS
+        }
+
+
+class ZyteAPITextResponse(ZyteAPIMixin, TextResponse):
+
+    attributes: Tuple[str, ...] = TextResponse.attributes + ("raw_api_response",)
+
+    @classmethod
+    def from_api_response(cls, api_response: Dict, *, request: Request = None):
+        """Alternative constructor to instantiate the response from the raw
+        Zyte API response.
+        """
+        body = None
+        encoding = None
+
+        if api_response.get("browserHtml"):
+            encoding = _DEFAULT_ENCODING  # Zyte API has "utf-8" by default
+            body = api_response["browserHtml"].encode(encoding)
+        elif api_response.get("httpResponseBody"):
+            body = b64decode(api_response["httpResponseBody"])
+
+        return cls(
+            url=api_response["url"],
+            status=200,
+            body=body,
+            encoding=encoding,
+            request=request,
+            flags=["zyte-api"],
+            headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
+            raw_api_response=api_response,
+        )
+
+
+class ZyteAPIResponse(ZyteAPIMixin, Response):
+
+    attributes: Tuple[str, ...] = Response.attributes + ("raw_api_response",)
+
+    @classmethod
+    def from_api_response(cls, api_response: Dict, *, request: Request = None):
+        """Alternative constructor to instantiate the response from the raw
+        Zyte API response.
+        """
+        return cls(
+            url=api_response["url"],
+            status=200,
+            body=b64decode(api_response.get("httpResponseBody") or ""),
+            request=request,
+            flags=["zyte-api"],
+            headers=cls._prepare_headers(api_response.get("httpResponseHeaders")),
+            raw_api_response=api_response,
+        )
+
+
+def _process_response(
+    api_response: Dict[str, Union[List[Dict], str]], request: Request
+) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]:
+    """Given a Zyte API Response and the ``scrapy.Request`` that asked for it,
+    this returns either a ``ZyteAPITextResponse`` or ``ZyteAPIResponse`` depending
+    on whether it can properly decode the HTTP body or has access to browserHtml.
+    """
+
+    # NOTES: Currently, Zyte API does NOT allow both 'browserHtml' and
+    # 'httpResponseBody' to be present at the same time. The support for both
+    # will be addressed in the future. Reference:
+    # - https://github.com/scrapy-plugins/scrapy-zyte-api/pull/10#issuecomment-1131406460
+    # For now, at least one of them should be present.
+
+    if api_response.get("browserHtml"):
+        # Using TextResponse because browserHtml always returns a browser-rendered page
+        # even when requesting files (like images)
+        return ZyteAPITextResponse.from_api_response(api_response, request=request)
+
+    if api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"):
+        response_cls = responsetypes.from_args(
+            headers=api_response["httpResponseHeaders"],
+            url=api_response["url"],
+            # FIXME: update this when python-zyte-api supports base64 decoding
+            body=b64decode(api_response["httpResponseBody"]),  # type: ignore
+        )
+        if issubclass(response_cls, TextResponse):
+            return ZyteAPITextResponse.from_api_response(api_response, request=request)
+
+    return ZyteAPIResponse.from_api_response(api_response, request=request)
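
The header filtering and body decoding steps in ``responses.py`` can be tried without Scrapy. The sketch below copies the filtering logic into a hypothetical free function, ``prepare_headers``, and runs it on a made-up API response. The ``httpResponseBody`` and ``httpResponseHeaders`` field names come from the diff; the URL and payload are assumptions for illustration.

```python
from base64 import b64decode, b64encode
from typing import Dict, List, Optional

# Same header set as in the diff: the body is already decompressed.
REMOVE_HEADERS = {"content-encoding"}


def prepare_headers(
    init_headers: Optional[List[Dict[str, str]]]
) -> Optional[Dict[str, str]]:
    """Turn a list of name/value pairs into a dict, dropping unwanted headers."""
    if not init_headers:
        return None
    return {
        h["name"]: h["value"]
        for h in init_headers
        if h["name"].lower() not in REMOVE_HEADERS
    }


# Made-up API response; the body is base64-encoded as in the diff.
api_response = {
    "url": "https://example.com/",
    "httpResponseBody": b64encode(b"<html>hello</html>").decode("ascii"),
    "httpResponseHeaders": [
        {"name": "Content-Type", "value": "text/html"},
        {"name": "Content-Encoding", "value": "gzip"},
    ],
}

headers = prepare_headers(api_response["httpResponseHeaders"])
body = b64decode(api_response["httpResponseBody"])
print(headers)
print(body)
```

The comparison is case-insensitive because header names are lowercased before the lookup, so ``Content-Encoding`` is dropped while ``Content-Type`` is kept.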
