-
Notifications
You must be signed in to change notification settings - Fork 1.3k
PanicOnWarnings option to detect SQL warnings and fail the copy process #1500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Ran gh-ost locally on a Migrate cmd:
Generic
Then, ran these explains on the copy query (
Looks like there will be no significant change in perf? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new CLI flag (panic-on-warnings) that causes gh-ost to fail when SQL warnings are detected during row copying and also logs all SQL warnings.
- A new flag "--panic-on-warnings" is added in the main CLI and corresponding context fields.
- The iteration range calculation now also returns an expected row count with adjusted logic to process SQL warnings.
- Various SQL builder functions and tests have been updated to support the enhanced query structure and error handling.
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
dev.yml | Adds configuration for testing environments and dependency management. |
go/logic/applier.go | Updates range end calculation and SQL warning processing logic. |
go/sql/builder_test.go | Adjusts expected queries and argument lists in tests to match new behavior. |
go/logic/migrator.go | Incorporates expected range size into chunk iteration and warning checks. |
go/sql/builder.go | Modifies SQL builder queries with minor refactorings and marks an unused variable. |
go/cmd/gh-ost/main.go | Introduces the new CLI flag for panic-on-warnings. |
go/base/context.go | Adds new context fields for managing SQL warnings and panic configuration. |
Comments suppressed due to low confidence (1)
go/sql/builder.go:278
- [nitpick] The variable 'uniqueKeyColumnDescending' is declared but not used anywhere in the function. Removing it could improve code clarity.
uniqueKeyColumnDescending := make([]string, len(uniqueKeyColumnNames)) // TODO unused variable
Looks good! I will test the performance of this internally (hopefully this week) |
go/logic/applier.go
Outdated
continue | ||
} | ||
pkDuplicateSuffix := fmt.Sprintf("for key '%s.PRIMARY'", this.migrationContext.GetGhostTableName()) | ||
if strings.HasPrefix(message, "Duplicate entry") && strings.HasSuffix(message, pkDuplicateSuffix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may have a different shared unique key than the primary key (migrationContext.UniqueKey
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Based on how the shared key is selected in inspectOriginalAndGhostTables
, it seems the correct behavior here would be to ignore any warnings including migrationContext.UniqueKey.Name
instead of just hardcoding the pk. This is a logic spill from LHM - I'll fix that and see if I can add a unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 36eaf27 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost ready to merge, just a couple comments 🚀
There is another edge case where the warning check doesn't work properly. If the shared unique key is being renamed, the INSERT IGNORE
warning will say Duplicate entry ... for key _${table}_gho.${renamed_unique_shared_key}
, because migrationContext.UniqueKey.Name
is the name of the unique key on the original table.
For example, localtests/swap-uk-uk
test case fails with --panic-on-warnings
.
I believe we can fix it by changing migrationContext.UniqueKey.Name
to be the name of the shared key on the ghost table.
Second, the main use of panic on warnings is to prevent data loss on add unique key
migrations, but we should note that this PR doesn't detect dataloss from the binlog applier, which uses replace into
. We can address this in a followup PR.
- log insert warnings always - terminate if row count doesn't match and the PanicOnWarnings flag is set
…-type-change/extra_args Co-authored-by: Bastian Bartmann <[email protected]>
- documentation - remove dev.yml - remove unused variable
36eaf27
to
1487b5c
Compare
Working on a fix for this ⏳ It gets somewhat more involved as we seem to rely on the original table key name e.g. here. Edit: proposed fix in 1ca8ffc
That's a good point. I believe https://github.com/Shopify/gh-ost/blob/1487b5c84887e3753257eed3a897a0167513fb3e/go/logic/applier.go#L1372-L1379 would need to re-use the Edit: Tried to write a new unit test for the |
Nice! It looks like the exact |
…s when renaming unique keys Error message formats are different across mysql distributions and versions
fe68db4
to
71a63c4
Compare
Nice, glad that we caught it 😌 Fixed this in the latest commit 👍 |
Yes
We can consider it out of scope for this PR |
Thanks for your work on this! 🚀 |
The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500. The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order. An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them. Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future. Co-authored-by: Bastian Bartmann <[email protected]>
The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500. The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order. An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them. Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future. Co-authored-by: Bastian Bartmann <[email protected]>
…roblems with --panic-on-warnings (#12) Problem 1: The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500. The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order. An obvious fix would be to just add an `order by` to the Offset builder subquery. Problem 2: Between the execution of `CalculateNextIterationRangeEndValues` and `ApplyIterationInsertQuery` there could be new rows inserted/deleted into the range causing `expectedRowCount` to be inaccurate. With `--panic-on-warnings` this could lead to the migration erroneously panicking. Changes summary: - Removed count subqueries from range builders to restore the more performant approach from PR github#471 - Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches - This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications Co-authored-by: Bastian Bartmann <[email protected]>
…roblems with --panic-on-warnings (#1557) * Fix CalculateNextIterationRangeEndValues order by. The split between Temptable and Offset query builders was originally introduced in #471. Then, both query builders have been changed to calculate actual chunk sizes in #1500. The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order. An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them. Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future. Co-authored-by: Bastian Bartmann <[email protected]> * Fix builder's chunk order and drop panic-on-warnings row count check - Removed count subqueries from range builders to restore the more performant approach from PR #471 - Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches - This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications --------- Co-authored-by: Bastian Bartmann <[email protected]>
Hello |
Related issue: #1498
Description
This PR adds a CLI flag
--panic-on-warnings
that makesgh-ost
fail after copying each batch when:affectedRows
)I also decided to always log all SQL warnings, which could be useful to find issues with data truncation or not null constraints even when the row count matches.
Opening as draft with a few remaining items:
review documentationreview test coveragecheck explain analyze on the updated range select queryscript/cibuild
returns with no formatting errors, build errors or unit test errors.