-
Notifications
You must be signed in to change notification settings - Fork 35
Add support for extra options in the fullDomFetcher #1173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work on this.
However, look at my comments, but it seems to me that the core issue hasn't been fully addressed.
Could you update the PR to include this, or let me know if I’m misunderstanding something.
Updated logic has been implemented in the latest commits and replicated into htmlonlyfetcher as well. |
Hi @LVerneyEC, I went ahead and refactored some parts of the code and moved the proxy credentials into the browser object instead of keeping them in the module. Since I’m a bit tied up at the moment, it was quicker to implement than to explain. |
I can confirm it works as expected. |
@LVerneyEC, just one more step before we can deliver your contribution. Could you lint the commit messages and add a changelog entry? |
Should be rebased and cleaned. I have no way to test the CHANGELOG locally and tried to adapt to the best I understood. Please feel free to manually edit the CHANGELOG if not matching your practices before merging, as I don't want to be delaying this PR too much and this is becoming a blocking issue on our side. Thanks! |
Add support for proxy, disabling the sandboxing (which is required by some Docker setups) and disabling headless mode in the fullDomFetcher.
As stated in the CONTRIBUTING guidelines, each changelog release must include sponsor information. For example, when a release is developed with the support of the French Ministry for Foreign Affairs, we include:
You can find other examples in the CHANGELOG.md. Could you provide the sponsor information for this release? |
Based on #1135, the sponsor information should be:
@LVerneyEC let us know if something different should be used. @Ndpnt I think it's fair that the core team takes responsibility for ensuring the changelog quality and makes the changes directly. Contributors can focus on their code changes, the publishing responsibility stays with the gatekeepers 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems stalled for too long. @Ndpnt let's have a quick call about how to unblock this.
CHANGELOG.md
Outdated
- Add extra options for fetchers, including support for (authentifying) proxies and switchin on/off headless mode and | ||
sandboxing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Formatting is inappropriate. We don't use line wrapping.
- There is a typo (
switchin
). - One line per entry please, with details on how to use 🙂
- This seems to be
Added
and notChanged
.
@Ndpnt please take over changelog copywriting. Contributors are responsible for use case description, implementation, following requirements and documenting their contributions. We are responsible for quality control, packaging and distribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had already planned to do it but was waiting for the funder information to complete everything at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care of aligning the style here! As a contributor, I would ideally expect anything under my control and responsibility to be covered by unit tests and PR pipelines (or a PR template/a bot auto-replying to my PR and pointing to the checks that I have to do on my own). Hopefully, this way I can iterate on my own without having to deal with back and forth reviews for stylistic issues :)
CHANGELOG.md
Outdated
|
||
All changes that impact users of this module are documented in this file, in the [Common Changelog](https://common-changelog.org) format with some additional specifications defined in the CONTRIBUTING file. This codebase adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). | ||
|
||
## Unreleased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
## Unreleased | |
## Unreleased [minor] |
browser = await puppeteer.launch({ headless: true }); | ||
const options = { | ||
args: [], | ||
headless: !process.env.FETCHER_NO_HEADLESS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I understand this might be frustrating to handle at this stage of the PR, but I don't understand: why are env variables used instead of config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using environment variables for http_proxy/https_proxy feature (by design, this is the way they are defined in Unix).
SANDBOX/HEADLESS is a different use case, and they could have been moved into configuration indeed. My main argument for keeping them as environment variables is that they are often used for debugging purposes. For instance, I would use often the FETCHER_NO_HEADLESS
to have visual feedback of the page (for things as simple as checking that resources are correctly loaded, that the page is visually OK, that we don't hit an antibot, etc.) and the scraping pipeline when the output is not the one expected when developing a new declaration.
Similarly, SANDBOX control is useful when switching to a particular environment (e.g. Docker) and is easy to define as an environment variable on the fly.
Not sure if it makes sense to you, but having an environment variable makes it easy to configure on the fly across various environments/setups (e.g. running with node locally, and switching to Docker to confirm the issue in a different context) and when debugging. Configuration files, on the other end, seemed more like a static configuration, to describe the pipeline as a whole (and something which you would likely want to share easily across your environments).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @LVerneyEC's approach.
The current environment variables make sense to me because:
http_proxy/https_proxy
follow Unix conventionsFETCHER_NO_HEADLESS
is a debugging tool that can need frequent toggling during development (I also have this need)FETCHER_NO_SANDBOX
is deployment-context specific (not dev/prod/test environments)
If you agree @MattiSG, I'll add this decision to our contributing guidelines with this simple rule for future choices:
- Environment variables for secrets (ex: API keys, passwords), debugging flags (ex: visual feedback toggles), Unix standards (ex: proxy settings), runtime overrides (ex: Docker-specific settings)
- Config files for engine behavior (ex: timeouts, schedules), service settings (ex: API endpoints, repository URLs), static infrastructure (ex: file paths, database names, storage locations)
We could also consider allowing some environment variables to override some config values for flexibility, but given our current needs are well-served by the existing approach, this enhancement does not seem necessary right now.
Totally agree. I just needed the sponsor info to update it myself. |
Co-authored-by: Matti Schneider <[email protected]>
I'm not sure to fully grasp the notion of "sponsor". Also it is quite hard to understand from an external contributor point of view as I would be contributing a feature and not a release (and the sponsor is attached to the release). Also given that in most open-source projects the release cycle is out of hands of individual contributors and OTA is very specific in this regards with a very short release cycle often with a single contribution. Also, the wording "sponsor" is quite ambiguous and can imply a notion of money or advertisement (e.g. think of sponsored chairs in academia, where one would just pay for branding) beyond the mere contribution on a position paid by a third party. Anyways, this contribution was made for the need of the EC instance running at https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/ on work time at the EC. I would rather not attach any sponsor wording but I guess the proposed sponsor wording could work (provided you switch "Union" to "Commission") if you need one. I would rather go for a more neutral wording such as
|
* main: Release v9.1.1 Improve changelog Add test when mime type contains charset Fix PDF extraction when MIME type contains charset
Hey,
This is a follow-up of #1144 as discussed there.
This PR makes the fullDomFetcher configurable through environment variables to support a wider range of setups:
Thanks