Skip to content

Conversation

LVerneyEC
Copy link

@LVerneyEC LVerneyEC commented Jul 11, 2025

Hey,

This is a follow-up of #1144 as discussed there.

This PR makes the fullDomFetcher configurable through environment variables to support a wider range of setups:

  • Add support for proxy
  • Sandboxing can be disabled through environment variables (which is required by some Docker setups - e.g. AWS Fargate)
  • Disabling headless mode can be done through environment variables (which might be useful for some visual debugging of declarations or for additional stealth).

Thanks

Copy link
Member

@Ndpnt Ndpnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this.
However, look at my comments, but it seems to me that the core issue hasn't been fully addressed.
Could you update the PR to include this, or let me know if I’m misunderstanding something.

@LVerneyEC
Copy link
Author

Updated logic has been implemented in the latest commits and replicated into htmlonlyfetcher as well.

@Ndpnt
Copy link
Member

Ndpnt commented Sep 10, 2025

Hi @LVerneyEC,

I went ahead and refactored some parts of the code and moved the proxy credentials into the browser object instead of keeping them in the module. Since I’m a bit tied up at the moment, it was quicker to implement than to explain.
I haven’t had a chance to test it yet — could you run it on your end and confirm that everything still works?

@LVerneyEC
Copy link
Author

LVerneyEC commented Sep 22, 2025

I went ahead and refactored some parts of the code and moved the proxy credentials into the browser object instead of keeping them in the module. Since I’m a bit tied up at the moment, it was quicker to implement than to explain. I haven’t had a chance to test it yet — could you run it on your end and confirm that everything still works?

I can confirm it works as expected.

@Ndpnt
Copy link
Member

Ndpnt commented Sep 30, 2025

@LVerneyEC, just one more step before we can deliver your contribution. Could you lint the commit messages and add a changelog entry?

@LVerneyEC
Copy link
Author

Should be rebased and cleaned. I have no way to test the CHANGELOG locally and tried to adapt to the best I understood. Please feel free to manually edit the CHANGELOG if not matching your practices before merging, as I don't want to be delaying this PR too much and this is becoming a blocking issue on our side. Thanks!

Add support for proxy, disabling the sandboxing (which is required by
some Docker setups) and disabling headless mode in the fullDomFetcher.
@Ndpnt
Copy link
Member

Ndpnt commented Oct 6, 2025

As stated in the CONTRIBUTING guidelines, each changelog release must include sponsor information.

For example, when a release is developed with the support of the French Ministry for Foreign Affairs, we include:

Development of this release was supported by the French Ministry for Foreign Affairs through its ministerial State Startups incubator, under the aegis of the Ambassador for Digital Affairs.

You can find other examples in the CHANGELOG.md.

Could you provide the sponsor information for this release?

@MattiSG
Copy link
Member

MattiSG commented Oct 7, 2025

Could you provide the sponsor information for this release?

Based on #1135, the sponsor information should be:

> Development of this release was supported by the [European Union](https://commission.europa.eu/).

@LVerneyEC let us know if something different should be used.

@Ndpnt I think it's fair that the core team takes responsibility for ensuring the changelog quality and makes the changes directly. Contributors can focus on their code changes, the publishing responsibility stays with the gatekeepers 🙂

Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems stalled for too long. @Ndpnt let's have a quick call about how to unblock this.

CHANGELOG.md Outdated
Comment on lines 9 to 10
- Add extra options for fetchers, including support for (authentifying) proxies and switchin on/off headless mode and
sandboxing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Formatting is inappropriate. We don't use line wrapping.
  • There is a typo (switchin).
  • One line per entry please, with details on how to use 🙂
  • This seems to be Added and not Changed.

@Ndpnt please take over changelog copywriting. Contributors are responsible for use case description, implementation, following requirements and documenting their contributions. We are responsible for quality control, packaging and distribution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had already planned to do it but was waiting for the funder information to complete everything at once.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of aligning the style here! As a contributor, I would ideally expect anything under my control and responsibility to be covered by unit tests and PR pipelines (or a PR template/a bot auto-replying to my PR and pointing to the checks that I have to do on my own). Hopefully, this way I can iterate on my own without having to deal with back and forth reviews for stylistic issues :)

CHANGELOG.md Outdated

All changes that impact users of this module are documented in this file, in the [Common Changelog](https://common-changelog.org) format with some additional specifications defined in the CONTRIBUTING file. This codebase adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Unreleased
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Unreleased
## Unreleased [minor]

browser = await puppeteer.launch({ headless: true });
const options = {
args: [],
headless: !process.env.FETCHER_NO_HEADLESS,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I understand this might be frustrating to handle at this stage of the PR, but I don't understand: why are env variables used instead of config?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using environment variables for http_proxy/https_proxy feature (by design, this is the way they are defined in Unix).

SANDBOX/HEADLESS is a different use case, and they could have been moved into configuration indeed. My main argument for keeping them as environment variables is that they are often used for debugging purposes. For instance, I would use often the FETCHER_NO_HEADLESS to have visual feedback of the page (for things as simple as checking that resources are correctly loaded, that the page is visually OK, that we don't hit an antibot, etc.) and the scraping pipeline when the output is not the one expected when developing a new declaration.

Similarly, SANDBOX control is useful when switching to a particular environment (e.g. Docker) and is easy to define as an environment variable on the fly.

Not sure if it makes sense to you, but having an environment variable makes it easy to configure on the fly across various environments/setups (e.g. running with node locally, and switching to Docker to confirm the issue in a different context) and when debugging. Configuration files, on the other end, seemed more like a static configuration, to describe the pipeline as a whole (and something which you would likely want to share easily across your environments).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @LVerneyEC's approach.
The current environment variables make sense to me because:

  • http_proxy/https_proxy follow Unix conventions
  • FETCHER_NO_HEADLESS is a debugging tool that can need frequent toggling during development (I also have this need)
  • FETCHER_NO_SANDBOX is deployment-context specific (not dev/prod/test environments)

If you agree @MattiSG, I'll add this decision to our contributing guidelines with this simple rule for future choices:

  • Environment variables for secrets (ex: API keys, passwords), debugging flags (ex: visual feedback toggles), Unix standards (ex: proxy settings), runtime overrides (ex: Docker-specific settings)
  • Config files for engine behavior (ex: timeouts, schedules), service settings (ex: API endpoints, repository URLs), static infrastructure (ex: file paths, database names, storage locations)

We could also consider allowing some environment variables to override some config values for flexibility, but given our current needs are well-served by the existing approach, this enhancement does not seem necessary right now.

@Ndpnt
Copy link
Member

Ndpnt commented Oct 8, 2025

@Ndpnt I think it's fair that the core team takes responsibility for ensuring the changelog quality and makes the changes directly. Contributors can focus on their code changes, the publishing responsibility stays with the gatekeepers 🙂

Totally agree. I just needed the sponsor info to update it myself.

Co-authored-by: Matti Schneider <[email protected]>
@LVerneyEC
Copy link
Author

@LVerneyEC let us know if something different should be used.

I'm not sure to fully grasp the notion of "sponsor". Also it is quite hard to understand from an external contributor point of view as I would be contributing a feature and not a release (and the sponsor is attached to the release). Also given that in most open-source projects the release cycle is out of hands of individual contributors and OTA is very specific in this regards with a very short release cycle often with a single contribution.

Also, the wording "sponsor" is quite ambiguous and can imply a notion of money or advertisement (e.g. think of sponsored chairs in academia, where one would just pay for branding) beyond the mere contribution on a position paid by a third party.

Anyways, this contribution was made for the need of the EC instance running at https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/ on work time at the EC. I would rather not attach any sponsor wording but I guess the proposed sponsor wording could work (provided you switch "Union" to "Commission") if you need one. I would rather go for a more neutral wording such as

Development of this feature was contributed by the European Commission for its VLOP/VLOSE instance.

Ndpnt added 5 commits October 8, 2025 17:50
* main:
  Release v9.1.1
  Improve changelog
  Add test when mime type contains charset
  Fix PDF extraction when MIME type contains charset
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants