introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response #10
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -33,8 +33,8 @@ Installation | |
|
|
||
| This package requires Python 3.7+. | ||
|
|
||
| How to configure | ||
| ---------------- | ||
| Configuration | ||
| ------------- | ||
|
|
||
| Replace the default ``http`` and ``https`` in Scrapy's | ||
| `DOWNLOAD_HANDLERS <https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DOWNLOAD_HANDLERS>`_ | ||
|
|
@@ -46,7 +46,7 @@ Lastly, make sure to `install the asyncio-based Twisted reactor | |
| <https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor>`_ | ||
| in the ``settings.py`` file as well: | ||
|
|
||
| Here's example of the things needed inside a Scrapy project's ``settings.py`` file: | ||
| Here's an example of the things needed inside a Scrapy project's ``settings.py`` file: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
|
|
@@ -60,37 +60,75 @@ Here's example of the things needed inside a Scrapy project's ``settings.py`` fi | |
|
|
||
| TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" | ||
|
|
||
| How to use | ||
| ---------- | ||
| Usage | ||
| ----- | ||
|
|
||
| Set the ``zyte_api`` `Request.meta | ||
| <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_ | ||
| key to download a request using Zyte API. Full list of parameters is provided in the | ||
| `Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
| To enable every request to be sent through Zyte API, you can set the following | ||
|
||
| in the ``settings.py`` file or `any other settings within Scrapy | ||
| <https://docs.scrapy.org/en/latest/topics/settings.html#populating-the-settings>`_: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import scrapy | ||
|
|
||
| ZYTE_API_DEFAULT_PARAMS = { | ||
| "browserHtml": True, | ||
| "geolocation": "US", | ||
| } | ||
|
|
||
| class TestSpider(scrapy.Spider): | ||
| name = "test" | ||
| You can see the full list of parameters in the `Zyte API Specification | ||
| <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
|
|
||
| def start_requests(self): | ||
| Alternatively, you can control this on a per-request basis by setting the | ||
|
||
| ``zyte_api`` key in `Request.meta <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_. | ||
| When doing so, it will override any parameters that were set in the | ||
|
||
| ``ZYTE_API_DEFAULT_PARAMS`` setting. | ||
|
|
||
| yield scrapy.Request( | ||
| url="http://books.toscrape.com/", | ||
| callback=self.parse, | ||
| meta={ | ||
| "zyte_api": { | ||
| "browserHtml": True, | ||
| # You can set any GEOLocation region you want. | ||
| "geolocation": "US", | ||
| "javascript": True, | ||
| "echoData": {"something": True}, | ||
| } | ||
| }, | ||
| ) | ||
| .. code-block:: python | ||
|
|
||
| def parse(self, response): | ||
| yield {"URL": response.url, "status": response.status, "HTML": response.body} | ||
| import scrapy | ||
|
|
||
|
|
||
| class SampleQuotesSpider(scrapy.Spider): | ||
| name = "sample_quotes" | ||
|
|
||
| def start_requests(self): | ||
|
|
||
| yield scrapy.Request( | ||
| url="https://quotes.toscrape.com/", | ||
| callback=self.parse, | ||
| meta={ | ||
| "zyte_api": { | ||
| "browserHtml": True, | ||
| "geolocation": "US", # You can set any Geolocation region you want. | ||
| "javascript": True, | ||
| "echoData": {"some_value_I_could_track": 123}, | ||
| } | ||
| }, | ||
| ) | ||
|
|
||
| def parse(self, response): | ||
| yield {"URL": response.url, "status": response.status, "HTML": response.body} | ||
|
|
||
| print(response.zyte_api) | ||
| # { | ||
| # 'url': 'https://quotes.toscrape.com/', | ||
| # 'browserHtml': '<html> ... </html>', | ||
| # 'echoData': {'some_value_I_could_track': 123}, | ||
| # } | ||
|
|
||
| print(response.request.meta) | ||
| # { | ||
| # 'zyte_api': { | ||
| # 'browserHtml': True, | ||
| # 'geolocation': 'US', | ||
| # 'javascript': True, | ||
| # 'echoData': {'some_value_I_could_track': 123} | ||
| # }, | ||
| # 'download_timeout': 180.0, | ||
| # 'download_slot': 'quotes.toscrape.com' | ||
| # } | ||
|
|
||
| The raw Zyte API Response can be accessed via the ``zyte_api`` attribute | ||
|
||
| of the response object. Note that such responses are instances of ``ZyteAPIResponse`` or | ||
| ``ZyteAPITextResponse``, which are respectively subclasses of ``scrapy.http.Response`` | ||
|
||
| and ``scrapy.http.TextResponse``. Such classes are needed to hold the raw Zyte API | ||
|
||
| responses. | ||
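
The README text above states that a per-request ``zyte_api`` value overrides ``ZYTE_API_DEFAULT_PARAMS``, but it does not say whether the override happens key by key or wholesale. The following is a minimal sketch that assumes a key-by-key merge. `effective_params` is a hypothetical helper written for illustration, not a function from this PR.

```python
from typing import Any, Dict, Optional


def effective_params(
    default_params: Optional[Dict[str, Any]], meta: Dict[str, Any]
) -> Dict[str, Any]:
    """Hypothetical helper: combine settings-level defaults with per-request values.

    Assumption: keys in ``meta["zyte_api"]`` win over the defaults, and
    defaults that the request does not mention are kept.
    """
    per_request = meta.get("zyte_api") or {}
    # Build a new dict so the shared defaults are never mutated.
    return {**(default_params or {}), **per_request}


# Values resembling ZYTE_API_DEFAULT_PARAMS and a Request.meta dict.
defaults = {"browserHtml": True, "geolocation": "US"}
meta = {"zyte_api": {"geolocation": "DE", "echoData": {"id": 1}}}
print(effective_params(defaults, meta))
```

Under this assumption, the request keeps ``browserHtml`` from the defaults while its own ``geolocation`` replaces the default one. If the handler replaces the defaults wholesale instead, only the per-request keys would be sent.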
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| from base64 import b64decode | ||
| from typing import Dict, List, Optional, Union | ||
|
|
||
| from scrapy import Request | ||
| from scrapy.http import Response, TextResponse | ||
| from scrapy.responsetypes import responsetypes | ||
|
|
||
| _DEFAULT_ENCODING = "utf-8" | ||
|
|
||
|
|
||
| class ZyteAPIMixin: | ||
|
||
|
|
||
| REMOVE_HEADERS = { | ||
| # Zyte API already decompresses the HTTP Response Body. Scrapy's | ||
| # HttpCompressionMiddleware will error out when it attempts to | ||
| # decompress an already decompressed body based on this header. | ||
| "content-encoding" | ||
| } | ||
|
|
||
| def __init__(self, *args, zyte_api: Optional[Dict] = None, **kwargs): | ||
| super().__init__(*args, **kwargs) | ||
| self._zyte_api = zyte_api | ||
|
|
||
| def replace(self, *args, **kwargs): | ||
| """Create a new response with the same attributes except for those given | ||
| new values. | ||
| """ | ||
| return super().replace(*args, **kwargs) | ||
|
|
||
| @property | ||
| def zyte_api(self) -> Optional[Dict]: | ||
|
||
| """Contains the raw API response from Zyte API. | ||
| To see the full list of parameters and their description, kindly refer to the | ||
| `Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
| """ | ||
| return self._zyte_api | ||
|
|
||
| @classmethod | ||
| def _prepare_headers(cls, init_headers: Optional[List[Dict[str, str]]]): | ||
| if not init_headers: | ||
| return None | ||
| return { | ||
| h["name"]: h["value"] | ||
| for h in init_headers | ||
| if h["name"].lower() not in cls.REMOVE_HEADERS | ||
| } | ||
|
|
||
|
|
||
| class ZyteAPITextResponse(ZyteAPIMixin, TextResponse): | ||
| @classmethod | ||
| def from_api_response(cls, api_response: Dict, *, request: Optional[Request] = None): | ||
| """Alternative constructor to instantiate the response from the raw | ||
| Zyte API response. | ||
| """ | ||
| body = None | ||
| encoding = None | ||
|
|
||
| if api_response.get("browserHtml"): | ||
| encoding = _DEFAULT_ENCODING # Zyte API has "utf-8" by default | ||
| body = api_response["browserHtml"].encode(encoding) | ||
| elif api_response.get("httpResponseBody"): | ||
| body = b64decode(api_response["httpResponseBody"]) | ||
|
|
||
| return cls( | ||
| url=api_response["url"], | ||
| status=200, | ||
|
||
| body=body, | ||
| encoding=encoding, | ||
| request=request, | ||
| flags=["zyte-api"], | ||
| headers=cls._prepare_headers(api_response.get("httpResponseHeaders")), | ||
| zyte_api=api_response, | ||
| ) | ||
|
|
||
|
|
||
| class ZyteAPIResponse(ZyteAPIMixin, Response): | ||
| @classmethod | ||
| def from_api_response(cls, api_response: Dict, *, request: Optional[Request] = None): | ||
| """Alternative constructor to instantiate the response from the raw | ||
| Zyte API response. | ||
| """ | ||
| return cls( | ||
| url=api_response["url"], | ||
| status=200, | ||
| body=b64decode(api_response.get("httpResponseBody") or ""), | ||
| request=request, | ||
| flags=["zyte-api"], | ||
| headers=cls._prepare_headers(api_response.get("httpResponseHeaders")), | ||
| zyte_api=api_response, | ||
| ) | ||
|
|
||
|
|
||
| def process_response( | ||
|
||
| api_response: Dict[str, Union[List[Dict], str]], request: Request | ||
| ) -> Optional[Union[ZyteAPITextResponse, ZyteAPIResponse]]: | ||
| """Given a Zyte API Response and the ``scrapy.Request`` that asked for it, | ||
| return a ``ZyteAPITextResponse`` if ``browserHtml`` is present or the HTTP body | ||
| is identified as text, and a ``ZyteAPIResponse`` otherwise. | ||
| """ | ||
|
|
||
| # NOTE: Currently, Zyte API does NOT allow both 'browserHtml' and | ||
| # 'httpResponseBody' to be present at the same time. The support for both | ||
| # will be addressed in the future. Reference: | ||
| # - https://github.com/scrapy-plugins/scrapy-zyte-api/pull/10#issuecomment-1131406460 | ||
| # For now, at least one of them should be present. | ||
|
|
||
| if api_response.get("browserHtml"): | ||
| # Using TextResponse because browserHtml always returns a browser-rendered page | ||
| # even when requesting files (like images) | ||
| return ZyteAPITextResponse.from_api_response(api_response, request=request) | ||
|
|
||
| if api_response.get("httpResponseHeaders") and api_response.get("httpResponseBody"): | ||
| response_cls = responsetypes.from_args( | ||
| headers=api_response["httpResponseHeaders"], | ||
| url=api_response["url"], | ||
| # FIXME: update this when python-zyte-api supports base64 decoding | ||
| body=b64decode(api_response["httpResponseBody"]), # type: ignore | ||
| ) | ||
| if issubclass(response_cls, TextResponse): | ||
| return ZyteAPITextResponse.from_api_response(api_response, request=request) | ||
|
|
||
| return ZyteAPIResponse.from_api_response(api_response, request=request) | ||
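
To see the header filtering and body decoding in isolation, here is a standalone sketch that mirrors `_prepare_headers` and the body handling of `from_api_response` using only the standard library. `prepare_headers`, `decode_body`, and the sample payloads are illustrative names and data, not part of the PR, and the sketch leaves out the `responsetypes` lookup that picks the response class.

```python
from base64 import b64decode, b64encode
from typing import Dict, List, Optional, Tuple

# Same idea as ZyteAPIMixin.REMOVE_HEADERS: the body is already decompressed,
# so the content-encoding header must not reach later processing.
REMOVE_HEADERS = {"content-encoding"}


def prepare_headers(
    init_headers: Optional[List[Dict[str, str]]]
) -> Optional[Dict[str, str]]:
    """Turn a list of {"name": ..., "value": ...} items into a plain dict."""
    if not init_headers:
        return None
    return {
        h["name"]: h["value"]
        for h in init_headers
        if h["name"].lower() not in REMOVE_HEADERS
    }


def decode_body(api_response: Dict) -> Tuple[bytes, Optional[str]]:
    """Return (body, encoding) following the same precedence as the diff."""
    if api_response.get("browserHtml"):
        # browserHtml arrives as text, so it is encoded as UTF-8.
        return api_response["browserHtml"].encode("utf-8"), "utf-8"
    # httpResponseBody is base64 text; a missing body decodes to b"".
    return b64decode(api_response.get("httpResponseBody") or ""), None


# Made-up payloads shaped like the fields used in the diff.
browser_payload = {"url": "https://example.com", "browserHtml": "<html></html>"}
http_payload = {
    "url": "https://example.com",
    "httpResponseBody": b64encode(b"hello").decode("ascii"),
    "httpResponseHeaders": [
        {"name": "Content-Encoding", "value": "gzip"},
        {"name": "Content-Type", "value": "text/plain"},
    ],
}

print(decode_body(browser_payload))
print(decode_body(http_payload))
print(prepare_headers(http_payload["httpResponseHeaders"]))
```

The header name is lowercased before the lookup, so `Content-Encoding` is dropped regardless of how the origin server capitalized it.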