Conversation

@NareshThotakuri (author):

fixes #2668

Naresh Thotakuri added 3 commits July 31, 2025 12:40
@NareshThotakuri self-assigned this Aug 1, 2025
@anth-volk left a comment:

Hi @NareshThotakuri, I've highlighted a few blocking issues/questions. Also, please check out the tests; our integration-ish simulation test currently fails.

# Look in computed_household cache table
try:
    row = local_database.query(
        "SELECT * FROM computed_household WHERE household_id = ? AND policy_id = ? AND api_version = ?",
@anth-volk:

issue, blocking: This code removes the f-string designation, which breaks this database query

@NareshThotakuri (author):

I've fixed the issue by restoring the f-string formatting in the query.


try:
    country = COUNTRIES.get(country_id)
    if country is None:
@anth-volk:

question, blocking: Are you sure this functions the same way as the original code?

This code base is not the most type-safe or the most well-tested, so want to confirm this behavior.

@NareshThotakuri (author), Aug 8, 2025:

Just to clarify — I had previously updated `if country == None` to `if country is None` to follow Python best practices. However, to avoid introducing any subtle behavior changes in this legacy codebase, I’ve reverted it back to match the original code (`== None`).

The only addition now is a logger, which I added to help with observability during metadata fetch. The control flow and output remain exactly the same as before, and I’ve tested it locally to confirm.

@@ -88,14 +89,66 @@ def get_household_under_policy(

api_version = COUNTRY_PACKAGE_VERSIONS.get(country_id)

# Look in computed_households to see if already computed
# Log start of request
logger.log_struct(
@anth-volk:

question, blocking: Is it possible to use more technology-agnostic logging here?

We created gcp_logging.logger as a workaround to the fact that we couldn't find a way to emit JSON-structured logs to non-GCP environments. However, if you rolled this API yourself in, say, AWS, this logging wouldn't work correctly. Is there an implementation-agnostic logging method that has this logging depth, but isn't tied to GCP? Ideally within Python's native logger code.

@NareshThotakuri (author):

Yes, we've updated the logging implementation to be cloud-agnostic while retaining structured, JSON-formatted logs using Python’s native logging module.
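
For reference, a minimal sketch of that cloud-agnostic setup using only the standard library (the names here are illustrative; the real implementation in this PR lives in policyengine_api/structured_logger.py):

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Serialize each record as one JSON object per line; GCP, AWS, and
    most log collectors parse this shape without vendor-specific hooks."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # GCP's agent maps a top-level "severity" key automatically
            "severity": record.levelname,
            "message": record.getMessage(),
            # Structured fields attached via the `extra` kwarg below
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload, default=str)


logger = logging.getLogger("policyengine_api")
if not logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


def log_struct(event, input_data=None, message="", severity="INFO"):
    """Emit one structured, JSON-formatted log line."""
    logger.log(
        getattr(logging, severity, logging.INFO),
        message,
        extra={"fields": {"event": event, "input_data": input_data or {}}},
    )
```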

codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 75.55556% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.42%. Comparing base (cd6e653) to head (047ad3a).
⚠️ Report is 42 commits behind head on master.

Files with missing lines                 | Patch % | Lines
policyengine_api/endpoints/household.py  | 60.86%  | 9 Missing ⚠️
policyengine_api/structured_logger.py    | 90.90%  | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2682      +/-   ##
==========================================
- Coverage   81.16%   80.42%   -0.75%     
==========================================
  Files          49       50       +1     
  Lines        1609     1655      +46     
  Branches      208      212       +4     
==========================================
+ Hits         1306     1331      +25     
- Misses        255      274      +19     
- Partials       48       50       +2     


@anth-volk self-requested a review August 14, 2025 20:06
@anth-volk left a comment:

Thanks @NareshThotakuri. I've flagged a couple blocking issues and some suggestions to make the logs more readable/maintainable. Looking forward to your modifications.

log_struct(
    event="get_household_under_policy_start",
    input_data={
        "country_id": country_id,
@anth-volk:

issue, blocking: Could you create a request_id value (or something like that) that we can then pass all the way through the logs to make debugging the entire flow easier?

@anth-volk:

This comment applies throughout the PR.

@NareshThotakuri (author):

Yes, we can generate a `request_id`.
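
For illustration, the ID could be minted once per request and attached to every subsequent log line (a sketch; the message text is hypothetical, and the event name is taken from the diff above):

```python
import uuid

# One ID per incoming request, carried through every log call
request_id = str(uuid.uuid4())

log_struct(
    event="get_household_under_policy_start",
    input_data={"request_id": request_id, "country_id": country_id},
    message="Request received.",
    severity="INFO",
)
```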

Comment on lines 178 to 179
"household_id": household_id,
"country_id": country_id,
@anth-volk:

suggestion: I'd prefer if we defined all of the values to be included here in one place, then consistently passed them into all logs. E.g., here we have only household_id and country_id, but at another point, some logs don't have country_id and do have policy_id and household_id.

Also, doing so and just unpacking a dict of values into input_data would remove some duplicative code and improve readability.

@anth-volk:

This comment applies throughout the changes.

@NareshThotakuri (author):

Noted - we’ve implemented a centralized context object containing all relevant IDs, which we now pass into input_data for every log, eliminating duplication and improving readability.
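
As a sketch of that pattern: the context dict is built once per request, and call-specific fields merge in via dict unpacking (the row_count field below is purely illustrative):

```python
# Defined once, near the top of the endpoint handler
log_context = {
    "request_id": request_id,
    "country_id": country_id,
    "household_id": household_id,
    "policy_id": policy_id,
}

# Reused in every log call; per-call fields merge in without duplication
log_struct(
    event="household_data_loaded",
    input_data={**log_context, "row_count": 1},
    message="Loaded household data from DB.",
    severity="INFO",
)
```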

@@ -227,7 +338,25 @@ def get_calculate(country_id: str, add_missing: bool = False) -> dict:

try:
    result = country.calculate(household_json, policy_json)
    log_struct(
        event="calculation_success",
@anth-volk:

suggestion: Distinguish between endpoint types in the event name

E.g., calculation_success refers to two different events, one with database storage underneath and one without

@anth-volk:

This comment applies throughout the PR.

@NareshThotakuri (author):

Agreed — we’ve updated the event names to clearly distinguish endpoint types.
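
For example, a scheme along these lines (the exact event names here are hypothetical):

```python
# Stateless calculation endpoint (no database write):
log_struct(
    event="calculate_success",
    input_data=log_context,
    message="Calculation completed.",
    severity="INFO",
)

# Calculation endpoint that also persists the result to the database:
log_struct(
    event="calculate_with_storage_success",
    input_data=log_context,
    message="Calculation completed and stored.",
    severity="INFO",
)
```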

severity="ERROR",
)

raise RuntimeError(error_msg)
@anth-volk:

issue, blocking: I'm not sure I'd raise here. This will return a 500 SERVER ERROR instead of the more apt 404 NOT FOUND. I'd recommend returning a structured response instead.
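
Something along these lines, assuming the Flask-style handlers used elsewhere in this API (the event name and response shape are illustrative):

```python
# Inside the endpoint handler: instead of raising (which surfaces as a
# 500), log the miss and return a structured 404.
log_struct(
    event="metadata_not_found",
    input_data={"country_id": country_id, "error": error_msg},
    message=error_msg,
    severity="ERROR",
)
return dict(
    status="error",
    message=error_msg,
    result=None,
), 404
```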

@NareshThotakuri (author):

Understood - for now, we’ll revert to the existing production code, since changes in /metadata_routes.py are also required; we can implement and test the update once the lower environment is ready.

"country_id": country_id,
"error": error_msg,
},
message=f"Metadata successfully retrieved for country_id '{country_id}'",
@anth-volk:

issue, blocking: I believe this error message is incorrect.

Comment on lines 39 to 45
# If using GCP logging, add a CloudLoggingHandler
# For more advanced GCP integration, consider enabling CloudLoggingHandler.
# if gcp_logging and CloudLoggingHandler:
#     client = gcp_logging.Client()
#     gcp_handler = CloudLoggingHandler(client, name=name)
#     gcp_handler.setFormatter(JsonFormatter())  # Optional
#     logger.addHandler(gcp_handler)
@anth-volk:

comment: I'd recommend getting rid of this if not reusing somehow.

@NareshThotakuri (author):

Done.


# If no handlers are set, add a StreamHandler with JSON formatting
if not logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
@anth-volk:

question: I'm curious if this worked in GCP correctly. I believe you said you tested locally and it does; is that correct?

@NareshThotakuri (author):

I’ve tested it locally and confirmed the logger outputs JSON in the terminal; if you want, we can also verify the logs in GCP Logs Explorer once the lower environment is set up.

@anth-volk:

I would argue that creating and sending a GCP log to the prod server is a low-risk activity for the following reasons:

  • It shouldn't impact the service itself
  • If configured properly, it shouldn't delete any logs

Prior to the deployment of any QA environments, and assuming you have the necessary permissions, could you write a Python script using the relevant log-writing snippet to confirm that this structure logs correctly to GCP? I'd have my money on it logging everything as a massive piece of text.
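
A smoke test of the kind being requested might be as small as this (a sketch: it assumes log_struct is importable from the new module and that the script runs somewhere whose stdout the GCP logging agent scrapes, e.g. Cloud Run or App Engine):

```python
# Emit one structured log through the new logger, then check GCP Logs
# Explorer: it should arrive as jsonPayload with these fields, not as a
# single textPayload string.
from policyengine_api.structured_logger import log_struct

log_struct(
    event="structured_logging_smoke_test",
    input_data={"request_id": "smoke-test-001"},
    message="Structured logging smoke test.",
    severity="INFO",
)
```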

@NareshThotakuri (author):

@anth-volk - I tested it by sending logs from local and it’s working correctly — the logs are showing up as structured JSON with the expected fields instead of one large text string in Logs Explorer.

[attached screenshot: log]

@anth-volk left a comment:

Thanks @NareshThotakuri. I have one minor blocking change. I've also asked if you could test to ensure that the logs properly update the production GCP instance. I myself, in testing code, have configured logs to properly output JSON to the terminal, only to find them formatted as a text block in GCP.

Comment on lines 171 to +188
if row is not None:
    household = dict(row)
    household["household_json"] = json.loads(household["household_json"])
    log_struct(
        event="household_data_loaded",
        input_data=log_context,
        message="Loaded household data from DB.",
        severity="INFO",
    )

else:
    log_struct(
        event="household_not_found",
        input_data=log_context,
        message=f"Household #{household_id} not found.",
        severity="WARNING",
    )

@anth-volk:

question, not blocking: Why not add the try/catch on this and the policy table below?

I'd rather get this over the line, so please don't block on this, just curious on thinking.

Comment on lines 314 to 318
log_context = {
    "request_id": request_id,
    "country_id": country_id,
    "request_path": request.path,
}
@anth-volk:

suggestion, blocking: Don't compose this log_context until the payload has been fully parsed. Add the household_json and policy_json structures into the log_context, if possible.

Without those, we don't know what we're actually debugging.
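
In other words, something like this (a sketch; the payload key names are assumed):

```python
# Inside the endpoint handler: parse the request body first...
payload = request.json
household_json = payload.get("household")
policy_json = payload.get("policy")

# ...then compose the context, so logs capture what was actually received
log_context = {
    "request_id": request_id,
    "country_id": country_id,
    "request_path": request.path,
    "household_json": household_json,
    "policy_json": policy_json,
}
```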



Development

Successfully merging this pull request may close these issues.

Better logs to help diagnose the frequent 502 errors (#2668)