PanicOnWarnings option to detect SQL warnings and fail the copy process #1500

Merged: 15 commits into github:master, Mar 26, 2025

Conversation

@grodowski (Contributor) commented Feb 28, 2025

Related issue: #1498

Description

This PR adds a CLI flag --panic-on-warnings that makes gh-ost fail after copying each batch when:

  • the expected (selected) row count differs from the actual copied row count (affectedRows)
  • SQL warnings are present, indicating why some rows were dropped

I also decided to always log all SQL warnings, which could be useful for finding issues with data truncation or NOT NULL constraints even when the row count matches.
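
Below is a minimal Go sketch of the idea, not the PR's actual implementation: after each chunked INSERT IGNORE ... SELECT, read back SHOW WARNINGS on the same transaction, log every warning, and fail when --panic-on-warnings is set and the batch either produced warnings or copied fewer rows than expected. The names readWarnings and checkBatch are illustrative only.

package copier

import (
	"database/sql"
	"fmt"
)

type sqlWarning struct {
	Level   string
	Code    int
	Message string
}

// readWarnings collects the warnings left by the previous statement; it must
// run on the same connection/transaction as that statement, hence *sql.Tx.
func readWarnings(tx *sql.Tx) ([]sqlWarning, error) {
	rows, err := tx.Query("SHOW WARNINGS")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var warnings []sqlWarning
	for rows.Next() {
		var w sqlWarning
		if err := rows.Scan(&w.Level, &w.Code, &w.Message); err != nil {
			return nil, err
		}
		warnings = append(warnings, w)
	}
	return warnings, rows.Err()
}

// checkBatch logs all warnings and, when panicOnWarnings is enabled, returns an
// error if the batch produced warnings or copied fewer rows than expected.
func checkBatch(tx *sql.Tx, expectedRows, affectedRows int64, panicOnWarnings bool) error {
	warnings, err := readWarnings(tx)
	if err != nil {
		return err
	}
	for _, w := range warnings {
		fmt.Printf("SQL warning %d (%s): %s\n", w.Code, w.Level, w.Message)
	}
	if panicOnWarnings && (len(warnings) > 0 || affectedRows != expectedRows) {
		return fmt.Errorf("copied %d of %d expected rows with %d SQL warnings",
			affectedRows, expectedRows, len(warnings))
	}
	return nil
}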

Opening as draft with a few remaining items:

  • review documentation
  • review test coverage
  • check explain analyze on the updated range select query

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@grodowski (Contributor, Author) commented Mar 11, 2025

Ran gh-ost locally on a 1.7 GiB table with 110k rows of generated blob data to check whether there's any noticeable impact from the added count(*) for each batch.

Migrate cmd:

gh-ost --allow-on-master --allow-master-master --table="large_books" --alter="add index author_index(author(255))" --database="schema_migrations_test_app_development" --user="root" --password="" --initially-drop-old-table --initially-drop-ghost-table --default-retries=3 --verbose --debug --host=$MYSQL_HOST --port=$MYSQL_PORT --execute

Generic time benchmark on two binaries, built from master (gh-ost-master) and grodowski/raise_on_warnings (gh-ost-new):

gh-ost-master --allow-on-master --allow-master-master ... 0.99s user 1.45s system 7% cpu 30.708 total
gh-ost-master --allow-on-master --allow-master-master ... 0.98s user 1.43s system 7% cpu 31.321 total
gh-ost-new --allow-on-master --allow-master-master ... 1.00s user 1.46s system 7% cpu 32.743 total
gh-ost-new --allow-on-master --allow-master-master ... 0.98s user 1.41s system 7% cpu 31.117 total

Then, ran these explains on the copy query (BuildUniqueKeyRangeEndPreparedQueryViaTemptable):

-- before
explain analyze select /* gh-ost mydb.tbl test */ id
from (
    select
        id
    from
        large_books
    where id >= 50000 and id <= 52000
    order by
        id asc
    limit 1000) select_osc_chunk
order by
    id desc
limit 1

| -> Limit: 1 row(s)  (cost=1021..1021 rows=1) (actual time=0.943..0.943 rows=1 loops=1)
    -> Sort: select_osc_chunk.id DESC, limit input to 1 row(s) per chunk  (cost=1021..1021 rows=1) (actual time=0.942..0.942 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.746..0.843 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.744..0.744 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.137..0.64 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.136..0.579 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.132..0.446 rows=1000 loops=1)
-- after
explain analyze select /* gh-ost mydb.tbl test */
    id,
    (select count(*) from (
        select
            id
        from
            large_books
        where id >= 50000 and id <= 52000
        order by
            id asc
        limit 1000
    ) select_osc_chunk)
from (
    select
        id
    from
        large_books
    where id >= 50000 and id <= 52000
    order by
        id asc
    limit 1000
) select_osc_chunk
order by
    id desc
limit 1;

| -> Limit: 1 row(s)  (cost=1021..1021 rows=1) (actual time=1.21..1.21 rows=1 loops=1)
    -> Sort: select_osc_chunk.id DESC, limit input to 1 row(s) per chunk  (cost=1021..1021 rows=1) (actual time=1.21..1.21 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.955..1.08 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.953..0.953 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.161..0.813 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.16..0.732 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.157..0.574 rows=1000 loops=1)
-> Select #2 (subquery in projection; run only once)
    -> Aggregate: count(0)  (cost=1021..1021 rows=1) (actual time=1.04..1.04 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.894..0.992 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.894..0.894 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.14..0.774 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.14..0.694 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.139..0.535 rows=1000 loops=1)

Looks like there will be no significant change in perf?

@grodowski grodowski marked this pull request as ready for review March 11, 2025 14:57
@Copilot Copilot AI review requested due to automatic review settings March 11, 2025 14:57
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a new CLI flag (panic-on-warnings) that causes gh-ost to fail when SQL warnings are detected during row copying and also logs all SQL warnings.

  • A new flag "--panic-on-warnings" is added in the main CLI and corresponding context fields.
  • The iteration range calculation now also returns an expected row count with adjusted logic to process SQL warnings.
  • Various SQL builder functions and tests have been updated to support the enhanced query structure and error handling.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Summary per file:

  • dev.yml: Adds configuration for testing environments and dependency management.
  • go/logic/applier.go: Updates range end calculation and SQL warning processing logic.
  • go/sql/builder_test.go: Adjusts expected queries and argument lists in tests to match the new behavior.
  • go/logic/migrator.go: Incorporates the expected range size into chunk iteration and warning checks.
  • go/sql/builder.go: Modifies SQL builder queries with minor refactorings and marks an unused variable.
  • go/cmd/gh-ost/main.go: Introduces the new CLI flag for panic-on-warnings.
  • go/base/context.go: Adds new context fields for managing SQL warnings and panic configuration.

Comments suppressed due to low confidence (1)

go/sql/builder.go:278

  • [nitpick] The variable 'uniqueKeyColumnDescending' is declared but not used anywhere in the function. Removing it could improve code clarity.
uniqueKeyColumnDescending := make([]string, len(uniqueKeyColumnNames)) // TODO unused variable

@meiji163 (Contributor) commented Mar 12, 2025

Looks good! I will test the performance of this internally (hopefully this week)

continue
}
pkDuplicateSuffix := fmt.Sprintf("for key '%s.PRIMARY'", this.migrationContext.GetGhostTableName())
if strings.HasPrefix(message, "Duplicate entry") && strings.HasSuffix(message, pkDuplicateSuffix) {
Contributor:

We may have a different shared unique key than the primary key (migrationContext.UniqueKey)

@grodowski (Contributor, Author):

Good catch! Based on how the shared key is selected in inspectOriginalAndGhostTables, it seems the correct behavior here would be to ignore any warnings that include migrationContext.UniqueKey.Name instead of hardcoding the PK. This is logic carried over from LHM; I'll fix it and see if I can add a unit test.

@grodowski (Contributor, Author):

Addressed in 36eaf27 👍

@meiji163 meiji163 added this to the v1.1.8 milestone Mar 18, 2025
@meiji163 (Contributor) left a comment

Almost ready to merge, just a couple comments 🚀

There is another edge case where the warning check doesn't work properly. If the shared unique key is being renamed, the INSERT IGNORE warning will say Duplicate entry ... for key _${table}_gho.${renamed_unique_shared_key}, because migrationContext.UniqueKey.Name is the name of the unique key on the original table.
For example, the localtests/swap-uk-uk test case fails with --panic-on-warnings.
I believe we can fix it by changing migrationContext.UniqueKey.Name to be the name of the shared key on the ghost table.

Second, the main use of panic-on-warnings is to prevent data loss on add-unique-key migrations, but we should note that this PR doesn't detect data loss from the binlog applier, which uses replace into. We can address this in a follow-up PR.

@grodowski grodowski force-pushed the grodowski/raise_on_warnings branch from 36eaf27 to 1487b5c Compare March 24, 2025 16:02
@grodowski (Contributor, Author) commented Mar 24, 2025

If the shared unique key is being renamed, the INSERT IGNORE warning will say Duplicate entry ... for key _${table}_gho.${renamed_unique_shared_key}, because migrationContext.UniqueKey.Name is the name of the unique key on the original table.

Working on a fix for this ⏳ It gets somewhat more involved, as we seem to rely on the original table's key name, e.g. here.

Edit: proposed fix in 1ca8ffc

Second, the main use of panic on warnings is to prevent data loss on add unique key migrations, but we should note that this PR doesn't detect dataloss from the binlog applier, which uses replace into. We can address this in a followup PR.

That's a good point. I believe https://github.com/Shopify/gh-ost/blob/1487b5c84887e3753257eed3a897a0167513fb3e/go/logic/applier.go#L1372-L1379 would need to reuse the show warnings code. Need to check whether it would work with the multi-statement query.

Edit: Tried to write a new unit test for the Applier and didn't see any SQL warnings written - it just overwrites on events by design. So is this about new rows being written by clients while a migration is running (is it a responsibility of the client to not do that?), or is there a race condition I don't see? An example scenario would be really helpful.

@meiji163 (Contributor):

proposed fix in 1ca8ffc

Nice! It looks like the exact Duplicate entry warning message differs between versions (the peril of parsing error messages 😅). Perhaps strings.HasPrefix(message, "Duplicate entry") && strings.Contains(message, this.migrationContext.UniqueKey.NameInGhostTable) would work?

…s when renaming unique keys

Error message formats are different across mysql distributions and versions
@grodowski grodowski force-pushed the grodowski/raise_on_warnings branch from fe68db4 to 71a63c4 Compare March 25, 2025 10:18
@grodowski (Contributor, Author):

Perhaps strings.HasPrefix(message, "Duplicate entry") && strings.Contains(message, this.migrationContext.UniqueKey.NameInGhostTable) would work?

Nice, glad that we caught it 😌 Fixed this in the latest commit 👍
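
A minimal sketch of the resulting warning filter, for illustration only (not the PR's exact code; isExpectedDuplicateKeyWarning and uniqueKeyNameInGhostTable are hypothetical names standing in for the real fields):

package copier

import "strings"

// isExpectedDuplicateKeyWarning reports whether a warning is the expected
// "Duplicate entry" produced by INSERT IGNORE on the shared unique key of the
// ghost table. Matching on the key name rather than a hardcoded "PRIMARY"
// also covers renamed unique keys and differing message formats.
func isExpectedDuplicateKeyWarning(message, uniqueKeyNameInGhostTable string) bool {
	return strings.HasPrefix(message, "Duplicate entry") &&
		strings.Contains(message, uniqueKeyNameInGhostTable)
}

// unexpectedWarnings drops the expected duplicate-key warnings and returns
// everything else, which --panic-on-warnings would treat as fatal.
func unexpectedWarnings(messages []string, uniqueKeyNameInGhostTable string) []string {
	var unexpected []string
	for _, message := range messages {
		if isExpectedDuplicateKeyWarning(message, uniqueKeyNameInGhostTable) {
			continue
		}
		unexpected = append(unexpected, message)
	}
	return unexpected
}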

@meiji163 (Contributor):

Edit: Tried to write a new unit test for the Applier and didn't see any SQL warnings written - it just overwrites on events by design. So is this about new rows being written by clients while a migration is running (is it a responsibility of the client to not do that?), or is there a race condition I don't see? An example scenario would be really helpful.

Yes, replace into doesn't raise warnings by design, so we'd have to change it to insert to detect data loss, e.g.:

  • migration adding a unique key
  • user inserts a row that has a duplicate key, where the row was already copied to the ghost table
  • binlog applier sees the insert and replaces the row, without raising any warnings

We can consider it out of scope for this PR
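
To make that scenario concrete, here is a hedged illustration with assumed table and column names (not gh-ost's actual applier code). With the current REPLACE INTO, the conflicting row silently replaces the already-copied one; with INSERT IGNORE, MySQL would keep the copied row and emit warning 1062, which a SHOW WARNINGS check could surface.

package copier

// Assumed ghost table: _tbl_gho (id INT PRIMARY KEY, email VARCHAR(255)),
// where the migration adds UNIQUE KEY uk_email (email). The row copier has
// already inserted (1, 'a@example.com'); the user then inserts
// (2, 'a@example.com') into the original table, which still allows it.
const (
	// Current applier behavior: the insert event is applied with REPLACE,
	// which deletes row id=1 and inserts id=2 -- no warning, so the loss of
	// row id=1 in the ghost table goes unnoticed.
	applierReplaceDML = "REPLACE INTO _tbl_gho (id, email) VALUES (2, 'a@example.com')"

	// Hypothetical alternative: INSERT IGNORE keeps row id=1, skips the new
	// row, and produces warning 1062 (Duplicate entry 'a@example.com' for key
	// '_tbl_gho.uk_email'), which panic-on-warnings could detect.
	applierInsertIgnoreDML = "INSERT IGNORE INTO _tbl_gho (id, email) VALUES (2, 'a@example.com')"
)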

@meiji163 (Contributor):

Thanks for your work on this! 🚀

@meiji163 meiji163 merged commit eedac87 into github:master Mar 26, 2025
8 checks passed
grodowski added a commit to Shopify/gh-ost that referenced this pull request May 22, 2025
The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them.

Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future.

Co-authored-by: Bastian Bartmann <[email protected]>
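
A hedged illustration of the missing order by issue described above (generic query shapes, not gh-ost's exact builder output): without an order by inside the subquery, LIMIT may return an arbitrary subset of the range, so the computed chunk end can overshoot the intended chunk boundary.

package copier

const (
	// Buggy shape: the inner LIMIT 1000 may return any 1000 rows from the
	// range, so the outer aggregate can land far beyond the 1000th smallest
	// id, producing a chunk much larger than --chunk-size.
	buggyRangeEndQuery = `
select max(id) from (
    select id from large_books
    where id >= ? and id <= ?
    limit 1000
) select_osc_chunk`

	// Fixed shape: ordering inside the subquery guarantees the LIMIT keeps
	// the 1000 smallest ids, so the outer aggregate is the true chunk end.
	fixedRangeEndQuery = `
select max(id) from (
    select id from large_books
    where id >= ? and id <= ?
    order by id asc
    limit 1000
) select_osc_chunk`
)
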
grodowski added a commit to Shopify/gh-ost that referenced this pull request Jun 5, 2025
…roblems with --panic-on-warnings (#12)

Problem 1: The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery. 

Problem 2: Between the execution of `CalculateNextIterationRangeEndValues` and `ApplyIterationInsertQuery` there could be new rows inserted/deleted into the range causing `expectedRowCount` to be inaccurate. With `--panic-on-warnings` this could lead to the migration erroneously panicking.

Changes summary:
- Removed count subqueries from range builders to restore the more performant approach from PR github#471
- Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches
- This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications

Co-authored-by: Bastian Bartmann <[email protected]>
meiji163 pushed a commit that referenced this pull request Jun 5, 2025
…roblems with --panic-on-warnings (#1557)

* Fix CalculateNextIterationRangeEndValues order by.

The split between Temptable and Offset query builders was originally introduced in #471. Then, both query builders have been changed to calculate actual chunk sizes in #1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them.

Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future.

Co-authored-by: Bastian Bartmann <[email protected]>

* Fix builder's chunk order and drop panic-on-warnings row count check

- Removed count subqueries from range builders to restore the more performant approach from PR #471
- Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches
- This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications

---------

Co-authored-by: Bastian Bartmann <[email protected]>
@xiaoxuanfeng commented Aug 18, 2025

Edit: Tried to write a new unit test for the Applier and didn't see any SQL warnings written - it just overwrites on events by design. So is this about new rows being written by clients while a migration is running (is it a responsibility of the client to not do that?), or is there a race condition I don't see? An example scenario would be really helpful.

Yes, replace into doesn't raise warnings by design, so we'd have to change it to insert to detect data loss, e.g.:

  • migration adding a unique key
  • user inserts a row that has a duplicate key, where the row was already copied to the ghost table
  • binlog applier sees the insert and replaces the row, without raising any warnings

We can consider it out of scope for this PR

Hello,
gh-ost applies incremental binlog data as follows:

  • an update event in the binlog corresponds to an UPDATE on the ghost table,
  • a delete event corresponds to a DELETE,
  • an insert event corresponds to a REPLACE.

REPLACE is used because the row copier may have already migrated that row to the ghost (shadow) table. In fact, we could use INSERT IGNORE to solve all of these problems easily, because we now have a way to handle the conflicts that INSERT IGNORE surfaces.
@meiji163
