Added GCP logs in Household and Metadata services to assist further investigation of the 502 errors #2682
base: master
Conversation
…etadata Service. Further investigating the cause of the 502 errors.
Hi @NareshThotakuri, I've highlighted a few blocking issues/questions. Also, please check out the tests; our integration-ish simulation test currently fails.
    # Look in computed_household cache table
    try:
        row = local_database.query(
            "SELECT * FROM computed_household WHERE household_id = ? AND policy_id = ? AND api_version = ?",
issue, blocking: This code removes the f-string designation, which breaks this database query
I've fixed the issue by restoring the f-string formatting in the query
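For illustration, a minimal sketch of the difference, assuming part of the statement relies on f-string interpolation (names here are hypothetical, not the PR's exact code):

    # Hypothetical illustration: without the f prefix, any {...} interpolation
    # in the query string is sent to the database as literal text.
    table = "computed_household"  # assumed: a fragment interpolated via f-string

    broken_query = "SELECT * FROM {table} WHERE household_id = ?"     # literal "{table}"
    restored_query = f"SELECT * FROM {table} WHERE household_id = ?"  # interpolated

    print(broken_query)    # SELECT * FROM {table} WHERE household_id = ?
    print(restored_query)  # SELECT * FROM computed_household WHERE household_id = ?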
    try:
        country = COUNTRIES.get(country_id)
        if country is None:
question, blocking: Are you sure this functions the same way as the original code?
This codebase is not the most type-safe or the most well-tested, so I want to confirm this behavior.
Just to clarify: I had previously updated if country == None to if country is None to follow Python best practices. However, to avoid introducing any subtle behavior changes in this legacy codebase, I've reverted it back to match the original code (== None).
The only addition now is a logger, which I added to help with observability during the metadata fetch. The control flow and output remain exactly the same as before, and I've tested it locally to confirm.
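For reference, a minimal illustration of why the two forms can differ in principle, which is what makes the revert the cautious choice (the class here is hypothetical):

    # `is None` checks identity; `== None` calls __eq__, which a class can override.
    class AlwaysEqual:
        def __eq__(self, other):
            return True

    obj = AlwaysEqual()
    print(obj == None)  # True: the overridden __eq__ hijacks the comparison
    print(obj is None)  # False: identity cannot be overridden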
@@ -88,14 +89,66 @@ def get_household_under_policy(

    api_version = COUNTRY_PACKAGE_VERSIONS.get(country_id)

    # Look in computed_households to see if already computed
    # Log start of request
    logger.log_struct(
question, blocking: Is it possible to use more technology-agnostic logging here?
We created gcp_logging.logger as a workaround to the fact that we couldn't find a way to emit JSON-structured logs to non-GCP environments. However, if you rolled this API yourself in, say, AWS, this logging wouldn't work correctly. Is there an implementation-agnostic logging method that has this logging depth, but isn't tied to GCP? Ideally within Python's native logging code.
Yes, we've updated the logging implementation to be cloud-agnostic while retaining structured, JSON-formatted logs using Python’s native logging module.
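For reference, a minimal sketch of that approach using only the standard library; the class and field names here are illustrative, not necessarily the PR's exact code:

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        """Render each record as a single JSON line."""
        def format(self, record):
            payload = {
                "severity": record.levelname,
                "message": record.getMessage(),
                "logger": record.name,
            }
            # Merge structured fields attached via the `extra` kwarg, if any
            payload.update(getattr(record, "input_data", {}) or {})
            return json.dumps(payload)

    logger = logging.getLogger("policyengine-api")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

    logger.info("household fetch", extra={"input_data": {"country_id": "us"}})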
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master    #2682      +/-   ##
    ==========================================
    - Coverage   81.16%   80.42%   -0.75%
    ==========================================
      Files          49       50       +1
      Lines        1609     1655      +46
      Branches      208      212       +4
    ==========================================
    + Hits         1306     1331      +25
    - Misses        255      274      +19
    - Partials       48       50       +2
Thanks @NareshThotakuri. I've flagged a couple of blocking issues and some suggestions to make the logs more readable/maintainable. Looking forward to your modifications.
    log_struct(
        event="get_household_under_policy_start",
        input_data={
            "country_id": country_id,
issue, blocking: Could you create a request_id value (or something like that) that we can then pass all the way through the logs to make debugging the entire flow easier?
This comment applies throughout the PR.
Yes, we can generate a request_id at the start of each request and pass it through all subsequent logs in the flow.
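A minimal sketch of the idea, reusing the log_struct name from the diff (the wiring shown is illustrative, not the PR's final code):

    import uuid

    def get_household_under_policy(country_id, household_id, policy_id):
        request_id = str(uuid.uuid4())  # one ID per request, threaded through every log
        log_context = {
            "request_id": request_id,
            "country_id": country_id,
            "household_id": household_id,
            "policy_id": policy_id,
        }
        log_struct(
            event="get_household_under_policy_start",
            input_data=log_context,
            message="Request started.",
            severity="INFO",
        )
        # ...every subsequent log in this flow passes the same log_context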
"household_id": household_id, | ||
"country_id": country_id, |
suggestion: I'd prefer if we defined all of the values to be included here in one place, then consistently passed them into all logs. E.g., here we have only household_id and country_id, but at another point, some logs don't have country_id and do have policy_id and household_id.
Also, doing so and just deconstructing an object of values into input_data would remove some duplicative code and improve readability.
This comment applies throughout the changes
Noted - we’ve implemented a centralized context object containing all relevant IDs, which we now pass into input_data for every log, eliminating duplication and improving readability.
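A sketch of that shape, with placeholder values and field names assumed from the thread rather than taken from the merged code:

    # Placeholders standing in for real request values
    request_id, country_id, household_id, policy_id = "req-1", "us", 1234, 5678

    # One context object per request; every log site reuses it
    log_context = {
        "request_id": request_id,
        "country_id": country_id,
        "household_id": household_id,
        "policy_id": policy_id,
    }

    # Each log site deconstructs the shared context into input_data,
    # extending it only when a call has extra fields:
    input_data = {**log_context, "api_version": "1.2.3"}
    print(input_data)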
@@ -227,7 +338,25 @@ def get_calculate(country_id: str, add_missing: bool = False) -> dict:

    try:
        result = country.calculate(household_json, policy_json)
        log_struct(
            event="calculation_success",
suggestion: Distinguish between endpoint types in the event name. E.g., calculation_success refers to two different events, one with database storage underneath and one without.
This comment applies throughout the PR.
Agreed. We've updated the event names to clearly distinguish between endpoint types.
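For instance, with hypothetical names (the PR's final names are not shown in this thread):

    # Hypothetical event names distinguishing the two endpoint variants
    EVENT_CALCULATE_SUCCESS = "calculate_success"            # stateless calculation
    EVENT_CALCULATE_FULL_SUCCESS = "calculate_full_success"  # variant with database storage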
severity="ERROR", | ||
) | ||
|
||
raise RuntimeError(error_msg) |
issue, blocking: I'm not sure I'd raise here. This will return a 500 SERVER ERROR instead of the more apt 404 NOT FOUND. I'd recommend returning a structured response instead.
Understood - for now, we’ll revert to the existing production code, since changes in /metadata_routes.py are also required; we can implement and test the update once the lower environment is ready.
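For when that work is picked up, a minimal sketch of the structured-response alternative, assuming a Flask handler (the payload shape is illustrative, not an agreed design):

    import json
    from flask import Response

    def not_found_response(country_id):
        body = {
            "status": "error",
            "message": f"Country '{country_id}' not found.",
            "result": None,
        }
        return Response(json.dumps(body), status=404, mimetype="application/json")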
"country_id": country_id, | ||
"error": error_msg, | ||
}, | ||
message=f"Metadata successfully retrieved for country_id '{country_id}'", |
issue, blocking: I believe this error message is incorrect.
    # If using GCP logging, add a CloudLoggingHandler
    # For more advanced GCP integration, consider enabling CloudLoggingHandler.
    # if gcp_logging and CloudLoggingHandler:
    #     client = gcp_logging.Client()
    #     gcp_handler = CloudLoggingHandler(client, name=name)
    #     gcp_handler.setFormatter(JsonFormatter())  # Optional
    #     logger.addHandler(gcp_handler)
comment: I'd recommend getting rid of this if you're not reusing it somehow.
done
    # If no handlers are set, add a StreamHandler with JSON formatting
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
question: I'm curious if this worked in GCP correctly. I believe you said you tested locally and it does; is that correct?
I’ve tested it locally and confirmed the logger outputs JSON in the terminal; if you want, we can also verify the logs in GCP Logs Explorer once the lower environment is set up.
I would argue that creating and sending a GCP log to the prod server is a low-risk activity for the following reasons:
- It shouldn't impact the service itself
- If configured properly, it shouldn't delete any logs
Prior to the deployment of any QA environments, and assuming you have the necessary permissions, could you write a Python script using the relevant log-writing snippet to confirm that this structure logs correctly to GCP? I'd have my money on it logging everything as a massive piece of text.
@anth-volk - I tested it by sending logs from local and it's working correctly: the logs show up in Logs Explorer as structured JSON with the expected fields instead of one large text string.
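For reference, a standalone verification script along these lines (assuming the google-cloud-logging client library and application-default credentials; the log name and payload fields are illustrative):

    from google.cloud import logging as gcp_logging

    client = gcp_logging.Client()
    logger = client.logger("policyengine-api-test")  # hypothetical log name

    # If this arrives as jsonPayload (not a single textPayload string) in
    # Logs Explorer, the structure is being preserved end to end.
    logger.log_struct(
        {
            "event": "gcp_logging_smoke_test",
            "input_data": {"country_id": "us", "household_id": 12345},
            "message": "Verifying structured JSON logging to GCP.",
        },
        severity="INFO",
    )
    print("Log sent; check Logs Explorer for a jsonPayload entry.")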
Thanks @NareshThotakuri. I have one minor blocking change. I've also asked if you could test to ensure that the logs properly update the production GCP instance. I myself, in testing code, have configured logs to properly output JSON to the terminal, only to find them formatted as a text block in GCP.
    if row is not None:
        household = dict(row)
        household["household_json"] = json.loads(household["household_json"])
        log_struct(
            event="household_data_loaded",
            input_data=log_context,
            message="Loaded household data from DB.",
            severity="INFO",
        )
    else:
        log_struct(
            event="household_not_found",
            input_data=log_context,
            message=f"Household #{household_id} not found.",
            severity="WARNING",
        )
question, not blocking: Why not add the try/catch on this and the policy table below?
I'd rather get this over the line, so please don't block on this, just curious on thinking.
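(For context, a hedged sketch of the guard the question is gesturing at, reusing the local_database/log_struct/log_context names from the diff; not the PR's code:)

    try:
        row = local_database.query(
            "SELECT * FROM household WHERE id = ?",  # hypothetical query
            (household_id,),
        )
    except Exception as exc:  # e.g., operational database errors
        log_struct(
            event="household_lookup_failed",
            input_data={**log_context, "error": str(exc)},
            message="Database lookup for household failed.",
            severity="ERROR",
        )
        row = None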
    log_context = {
        "request_id": request_id,
        "country_id": country_id,
        "request_path": request.path,
    }
suggestion, blocking: Don't compose this log_context until the payload has been fully parsed. Add the household_json and policy_json structures into the log_context, if possible.
Without those, we don't know what we're actually debugging.
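A minimal sketch of that ordering, assuming a Flask-style request object (field names follow the diff but are illustrative):

    payload = request.get_json()  # parse the body first, then build the context
    household_json = payload.get("household")
    policy_json = payload.get("policy")

    log_context = {
        "request_id": request_id,
        "country_id": country_id,
        "request_path": request.path,
        "household_json": household_json,
        "policy_json": policy_json,
    }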
fixes #2668