Added GCP logs in Household and Metadata services to assist further investigation of the 502 errors #2682
base: master
Conversation
…etadata Service. Further investigating the cause of the 502 errors.
Hi @NareshThotakuri, I've highlighted a few blocking issues/questions. Also, please check out the tests; our integration-ish simulation test currently fails.
    # Look in computed_household cache table
    try:
        row = local_database.query(
            "SELECT * FROM computed_household WHERE household_id = ? AND policy_id = ? AND api_version = ?",
issue, blocking: This code removes the f-string designation, which breaks this database query
I've fixed the issue by restoring the f-string formatting in the query
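For illustration, a minimal sketch of the difference, assuming part of the statement relies on f-string interpolation (names here are hypothetical, not the PR's exact code):

    # Hypothetical illustration: without the f prefix, any {...} interpolation
    # in the query string is sent to the database as literal text.
    table = "computed_household"  # assumed: a fragment interpolated via f-string

    broken_query = "SELECT * FROM {table} WHERE household_id = ?"     # literal "{table}"
    restored_query = f"SELECT * FROM {table} WHERE household_id = ?"  # interpolated

    print(broken_query)    # SELECT * FROM {table} WHERE household_id = ?
    print(restored_query)  # SELECT * FROM computed_household WHERE household_id = ?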
    try:
        country = COUNTRIES.get(country_id)
        if country is None:
question, blocking: Are you sure this functions the same way as the original code?
This codebase is not the most type-safe or the most well-tested, so I want to confirm this behavior.
Just to clarify: I had previously updated if country == None to if country is None to follow Python best practices. However, to avoid introducing any subtle behavior changes in this legacy codebase, I've reverted it back to match the original code (== None).
The only addition now is a logger, which I added to help with observability during the metadata fetch. The control flow and output remain exactly the same as before, and I've tested it locally to confirm.
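For reference, a minimal illustration of why the two forms can differ in principle, which is what makes the revert the cautious choice (the class here is hypothetical):

    # `is None` checks identity; `== None` calls __eq__, which a class can override.
    class AlwaysEqual:
        def __eq__(self, other):
            return True

    obj = AlwaysEqual()
    print(obj == None)  # True: the overridden __eq__ hijacks the comparison
    print(obj is None)  # False: identity cannot be overridden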
@@ -88,14 +89,66 @@ def get_household_under_policy(

    api_version = COUNTRY_PACKAGE_VERSIONS.get(country_id)

    # Look in computed_households to see if already computed
    # Log start of request
    logger.log_struct(
question, blocking: Is it possible to use more technology-agnostic logging here?
We created gcp_logging.logger as a workaround to the fact that we couldn't find a way to emit JSON-structured logs to non-GCP environments. However, if you rolled this API yourself in, say, AWS, this logging wouldn't work correctly. Is there an implementation-agnostic logging method that has this logging depth, but isn't tied to GCP? Ideally within Python's native logging code.
Yes, we've updated the logging implementation to be cloud-agnostic while retaining structured, JSON-formatted logs using Python’s native logging module.
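For reference, a minimal sketch of that approach using only the standard library; the class and field names here are illustrative, not necessarily the PR's exact code:

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        """Render each record as a single JSON line."""
        def format(self, record):
            payload = {
                "severity": record.levelname,
                "message": record.getMessage(),
                "logger": record.name,
            }
            # Merge structured fields attached via the `extra` kwarg, if any
            payload.update(getattr(record, "input_data", {}) or {})
            return json.dumps(payload)

    logger = logging.getLogger("policyengine-api")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

    logger.info("household fetch", extra={"input_data": {"country_id": "us"}})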
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master    #2682      +/-   ##
    ==========================================
    - Coverage   81.16%   80.42%   -0.75%
    ==========================================
      Files          49       50       +1
      Lines        1609     1655      +46
      Branches      208      212       +4
    ==========================================
    + Hits         1306     1331      +25
    - Misses        255      274      +19
    - Partials       48       50       +2
Thanks @NareshThotakuri. I've flagged a couple of blocking issues and some suggestions to make the logs more readable/maintainable. Looking forward to your modifications.
    log_struct(
        event="get_household_under_policy_start",
        input_data={
            "country_id": country_id,
issue, blocking: Could you create a request_id value (or something like that) that we can then pass all the way through the logs to make debugging the entire flow easier?
This comment applies throughout the PR.
Yes, we can generate a request_id at the start of each request and pass it through all subsequent logs in the flow.
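A minimal sketch of the idea, reusing the log_struct name from the diff (the wiring shown is illustrative, not the PR's final code):

    import uuid

    def get_household_under_policy(country_id, household_id, policy_id):
        request_id = str(uuid.uuid4())  # one ID per request, threaded through every log
        log_context = {
            "request_id": request_id,
            "country_id": country_id,
            "household_id": household_id,
            "policy_id": policy_id,
        }
        log_struct(
            event="get_household_under_policy_start",
            input_data=log_context,
            message="Request started.",
            severity="INFO",
        )
        # ...every subsequent log in this flow passes the same log_context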
"household_id": household_id, | ||
"country_id": country_id, |
suggestion: I'd prefer if we defined all of the values to be included here in one place, then consistently passed them into all logs. E.g., here we have only household_id and country_id, but at another point, some logs don't have country_id and do have policy_id and household_id.
Also, doing so and just deconstructing an object of values into input_data would remove some duplicative code and improve readability.
This comment applies throughout the changes
Noted - we’ve implemented a centralized context object containing all relevant IDs, which we now pass into input_data for every log, eliminating duplication and improving readability.
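A sketch of that shape, with placeholder values and field names assumed from the thread rather than taken from the merged code:

    # Placeholders standing in for real request values
    request_id, country_id, household_id, policy_id = "req-1", "us", 1234, 5678

    # One context object per request; every log site reuses it
    log_context = {
        "request_id": request_id,
        "country_id": country_id,
        "household_id": household_id,
        "policy_id": policy_id,
    }

    # Each log site deconstructs the shared context into input_data,
    # extending it only when a call has extra fields:
    input_data = {**log_context, "api_version": "1.2.3"}
    print(input_data)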
@@ -227,7 +338,25 @@ def get_calculate(country_id: str, add_missing: bool = False) -> dict:

    try:
        result = country.calculate(household_json, policy_json)
        log_struct(
            event="calculation_success",
suggestion: Distinguish between endpoint types in the event name. E.g., calculation_success refers to two different events, one with database storage underneath and one without.
This comment applies throughout the PR.
Agreed. We've updated the event names to clearly distinguish between endpoint types.
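For instance, with hypothetical names (the PR's final names are not shown in this thread):

    # Hypothetical event names distinguishing the two endpoint variants
    EVENT_CALCULATE_SUCCESS = "calculate_success"            # stateless calculation
    EVENT_CALCULATE_FULL_SUCCESS = "calculate_full_success"  # variant with database storage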
severity="ERROR", | ||
) | ||
|
||
raise RuntimeError(error_msg) |
issue, blocking: I'm not sure I'd raise here. This will return a 500 SERVER ERROR instead of the more apt 404 NOT FOUND. I'd recommend returning a structured response instead.
Understood - for now, we’ll revert to the existing production code, since changes in /metadata_routes.py are also required; we can implement and test the update once the lower environment is ready.
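For when that work is picked up, a minimal sketch of the structured-response alternative, assuming a Flask handler (the payload shape is illustrative, not an agreed design):

    import json
    from flask import Response

    def not_found_response(country_id):
        body = {
            "status": "error",
            "message": f"Country '{country_id}' not found.",
            "result": None,
        }
        return Response(json.dumps(body), status=404, mimetype="application/json")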
"country_id": country_id, | ||
"error": error_msg, | ||
}, | ||
message=f"Metadata successfully retrieved for country_id '{country_id}'", |
issue, blocking: I believe this error message is incorrect.
    # If using GCP logging, add a CloudLoggingHandler
    # For more advanced GCP integration, consider enabling CloudLoggingHandler.
    # if gcp_logging and CloudLoggingHandler:
    #     client = gcp_logging.Client()
    #     gcp_handler = CloudLoggingHandler(client, name=name)
    #     gcp_handler.setFormatter(JsonFormatter())  # Optional
    #     logger.addHandler(gcp_handler)
comment: I'd recommend getting rid of this if you're not reusing it somehow.
done
    # If no handlers are set, add a StreamHandler with JSON formatting
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
question: I'm curious if this worked in GCP correctly. I believe you said you tested locally and it does; is that correct?
I’ve tested it locally and confirmed the logger outputs JSON in the terminal; if you want, we can also verify the logs in GCP Logs Explorer once the lower environment is set up.
I would argue that creating and sending a GCP log to the prod server is a low-risk activity for the following reasons:
- It shouldn't impact the service itself
- If configured properly, it shouldn't delete any logs
Prior to the deployment of any QA environments, and assuming you have the necessary permissions, could you write a Python script using the relevant log-writing snippet to confirm that this structure logs correctly to GCP? I'd have my money on it logging everything as a massive piece of text.
@anth-volk - I tested it by sending logs from local and it's working correctly: the logs show up in Logs Explorer as structured JSON with the expected fields instead of one large text string.
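For reference, a standalone verification script along these lines (assuming the google-cloud-logging client library and application-default credentials; the log name and payload fields are illustrative):

    from google.cloud import logging as gcp_logging

    client = gcp_logging.Client()
    logger = client.logger("policyengine-api-test")  # hypothetical log name

    # If this arrives as jsonPayload (not a single textPayload string) in
    # Logs Explorer, the structure is being preserved end to end.
    logger.log_struct(
        {
            "event": "gcp_logging_smoke_test",
            "input_data": {"country_id": "us", "household_id": 12345},
            "message": "Verifying structured JSON logging to GCP.",
        },
        severity="INFO",
    )
    print("Log sent; check Logs Explorer for a jsonPayload entry.")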
Thanks @NareshThotakuri. I have one minor blocking change. I've also asked if you could test to ensure that the logs properly update the production GCP instance. I myself, in testing code, have configured logs to properly output JSON to the terminal, only to find them formatted as a text block in GCP.
    if row is not None:
        household = dict(row)
        household["household_json"] = json.loads(household["household_json"])
        log_struct(
            event="household_data_loaded",
            input_data=log_context,
            message="Loaded household data from DB.",
            severity="INFO",
        )
    else:
        log_struct(
            event="household_not_found",
            input_data=log_context,
            message=f"Household #{household_id} not found.",
            severity="WARNING",
        )
question, not blocking: Why not add the try/catch on this and the policy table below?
I'd rather get this over the line, so please don't block on this, just curious on thinking.
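(For context, a hedged sketch of the guard the question is gesturing at, reusing the local_database/log_struct/log_context names from the diff; not the PR's code:)

    try:
        row = local_database.query(
            "SELECT * FROM household WHERE id = ?",  # hypothetical query
            (household_id,),
        )
    except Exception as exc:  # e.g., operational database errors
        log_struct(
            event="household_lookup_failed",
            input_data={**log_context, "error": str(exc)},
            message="Database lookup for household failed.",
            severity="ERROR",
        )
        row = None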
    log_context = {
        "request_id": request_id,
        "country_id": country_id,
        "request_path": request.path,
    }
suggestion, blocking: Don't compose this log_context until the payload has been fully parsed. Add the household_json and policy_json structures into the log_context, if possible.
Without those, we don't know what we're actually debugging.
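A minimal sketch of that ordering, assuming a Flask-style request object (field names follow the diff but are illustrative):

    payload = request.get_json()  # parse the body first, then build the context
    household_json = payload.get("household")
    policy_json = payload.get("policy")

    log_context = {
        "request_id": request_id,
        "country_id": country_id,
        "request_path": request.path,
        "household_json": household_json,
        "policy_json": policy_json,
    }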
fixes #2668