PanicOnWarnings option to detect SQL warnings and fail the copy process #1500

Merged: 15 commits into github:master, Mar 26, 2025

Conversation

@grodowski (Contributor) commented Feb 28, 2025

Related issue: #1498

Description

This PR adds a CLI flag --panic-on-warnings that makes gh-ost fail after copying each batch when:

  • the expected (selected) row count differs from the actual copied row count (affectedRows)
  • SQL warnings are present, indicating why some rows were dropped

I also decided to always log all SQL warnings, which could be useful for finding issues with data truncation or NOT NULL constraints even when the row count matches.
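
Below is a minimal Go sketch of the idea, not the PR's actual implementation: after each chunked INSERT IGNORE ... SELECT, read back SHOW WARNINGS on the same transaction, log every warning, and fail when --panic-on-warnings is set and the batch either produced warnings or copied fewer rows than expected. The names readWarnings and checkBatch are illustrative only.

package copier

import (
	"database/sql"
	"fmt"
)

type sqlWarning struct {
	Level   string
	Code    int
	Message string
}

// readWarnings collects the warnings left by the previous statement; it must
// run on the same connection/transaction as that statement, hence *sql.Tx.
func readWarnings(tx *sql.Tx) ([]sqlWarning, error) {
	rows, err := tx.Query("SHOW WARNINGS")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var warnings []sqlWarning
	for rows.Next() {
		var w sqlWarning
		if err := rows.Scan(&w.Level, &w.Code, &w.Message); err != nil {
			return nil, err
		}
		warnings = append(warnings, w)
	}
	return warnings, rows.Err()
}

// checkBatch logs all warnings and, when panicOnWarnings is enabled, returns an
// error if the batch produced warnings or copied fewer rows than expected.
func checkBatch(tx *sql.Tx, expectedRows, affectedRows int64, panicOnWarnings bool) error {
	warnings, err := readWarnings(tx)
	if err != nil {
		return err
	}
	for _, w := range warnings {
		fmt.Printf("SQL warning %d (%s): %s\n", w.Code, w.Level, w.Message)
	}
	if panicOnWarnings && (len(warnings) > 0 || affectedRows != expectedRows) {
		return fmt.Errorf("copied %d of %d expected rows with %d SQL warnings",
			affectedRows, expectedRows, len(warnings))
	}
	return nil
}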

Opening as draft with a few remaining items:

  • review documentation
  • review test coverage
  • check explain analyze on the updated range select query

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@grodowski (Contributor, Author) commented Mar 11, 2025

Ran gh-ost locally on a 1.7 GiB table with 110k rows of generated blob data to check whether there's any noticeable impact from the added count(*) for each batch.

Migrate cmd:

gh-ost --allow-on-master --allow-master-master --table="large_books" --alter="add index author_index(author(255))" --database="schema_migrations_test_app_development" --user="root" --password="" --initially-drop-old-table --initially-drop-ghost-table --default-retries=3 --verbose --debug --host=$MYSQL_HOST --port=$MYSQL_PORT --execute

Generic time benchmark on two binaries, built from master (gh-ost-master) and grodowski/raise_on_warnings (gh-ost-new):

gh-ost-master --allow-on-master --allow-master-master ... 0.99s user 1.45s system 7% cpu 30.708 total
gh-ost-master --allow-on-master --allow-master-master ... 0.98s user 1.43s system 7% cpu 31.321 total
gh-ost-new --allow-on-master --allow-master-master ... 1.00s user 1.46s system 7% cpu 32.743 total
gh-ost-new --allow-on-master --allow-master-master ... 0.98s user 1.41s system 7% cpu 31.117 total

Then, ran these explains on the copy query (BuildUniqueKeyRangeEndPreparedQueryViaTemptable):

-- before
explain analyze select /* gh-ost mydb.tbl test */ id
from (
    select
        id
    from
        large_books
    where id >= 50000 and id <= 52000
    order by
        id asc
    limit 1000) select_osc_chunk
order by
    id desc
limit 1

| -> Limit: 1 row(s)  (cost=1021..1021 rows=1) (actual time=0.943..0.943 rows=1 loops=1)
    -> Sort: select_osc_chunk.id DESC, limit input to 1 row(s) per chunk  (cost=1021..1021 rows=1) (actual time=0.942..0.942 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.746..0.843 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.744..0.744 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.137..0.64 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.136..0.579 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.132..0.446 rows=1000 loops=1)
-- after
explain analyze select /* gh-ost mydb.tbl test */
    id,
    (select count(*) from (
        select
            id
        from
            large_books
        where id >= 50000 and id <= 52000
        order by
            id asc
        limit 1000
    ) select_osc_chunk)
from (
    select
        id
    from
        large_books
    where id >= 50000 and id <= 52000
    order by
        id asc
    limit 1000
) select_osc_chunk
order by
    id desc
limit 1;

| -> Limit: 1 row(s)  (cost=1021..1021 rows=1) (actual time=1.21..1.21 rows=1 loops=1)
    -> Sort: select_osc_chunk.id DESC, limit input to 1 row(s) per chunk  (cost=1021..1021 rows=1) (actual time=1.21..1.21 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.955..1.08 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.953..0.953 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.161..0.813 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.16..0.732 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.157..0.574 rows=1000 loops=1)
-> Select #2 (subquery in projection; run only once)
    -> Aggregate: count(0)  (cost=1021..1021 rows=1) (actual time=1.04..1.04 rows=1 loops=1)
        -> Table scan on select_osc_chunk  (cost=906..921 rows=1000) (actual time=0.894..0.992 rows=1000 loops=1)
            -> Materialize  (cost=906..906 rows=1000) (actual time=0.894..0.894 rows=1000 loops=1)
                -> Limit: 1000 row(s)  (cost=806 rows=1000) (actual time=0.14..0.774 rows=1000 loops=1)
                    -> Filter: ((large_books.id >= 50000) and (large_books.id <= 52000))  (cost=806 rows=3992) (actual time=0.14..0.694 rows=1000 loops=1)
                        -> Covering index range scan on large_books using PRIMARY over (50000 <= id <= 52000)  (cost=806 rows=3992) (actual time=0.139..0.535 rows=1000 loops=1)

Looks like there will be no significant change in perf?

@grodowski grodowski marked this pull request as ready for review March 11, 2025 14:57
@Copilot Copilot AI review requested due to automatic review settings March 11, 2025 14:57
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a new CLI flag (panic-on-warnings) that causes gh-ost to fail when SQL warnings are detected during row copying and also logs all SQL warnings.

  • A new flag "--panic-on-warnings" is added in the main CLI and corresponding context fields.
  • The iteration range calculation now also returns an expected row count with adjusted logic to process SQL warnings.
  • Various SQL builder functions and tests have been updated to support the enhanced query structure and error handling.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Summary per file:

  • dev.yml: Adds configuration for testing environments and dependency management.
  • go/logic/applier.go: Updates range end calculation and SQL warning processing logic.
  • go/sql/builder_test.go: Adjusts expected queries and argument lists in tests to match the new behavior.
  • go/logic/migrator.go: Incorporates the expected range size into chunk iteration and warning checks.
  • go/sql/builder.go: Modifies SQL builder queries with minor refactorings and marks an unused variable.
  • go/cmd/gh-ost/main.go: Introduces the new CLI flag for panic-on-warnings.
  • go/base/context.go: Adds new context fields for managing SQL warnings and panic configuration.

Comments suppressed due to low confidence (1)

go/sql/builder.go:278

  • [nitpick] The variable 'uniqueKeyColumnDescending' is declared but not used anywhere in the function. Removing it could improve code clarity.
uniqueKeyColumnDescending := make([]string, len(uniqueKeyColumnNames)) // TODO unused variable

@meiji163 (Contributor) commented Mar 12, 2025

Looks good! I will test the performance of this internally (hopefully this week)

continue
}
pkDuplicateSuffix := fmt.Sprintf("for key '%s.PRIMARY'", this.migrationContext.GetGhostTableName())
if strings.HasPrefix(message, "Duplicate entry") && strings.HasSuffix(message, pkDuplicateSuffix) {
Contributor:

We may have a different shared unique key than the primary key (migrationContext.UniqueKey)

@grodowski (Contributor, Author):

Good catch! Based on how the shared key is selected in inspectOriginalAndGhostTables, it seems the correct behavior here would be to ignore any warnings that include migrationContext.UniqueKey.Name instead of hardcoding the PK. This is logic carried over from LHM; I'll fix it and see if I can add a unit test.

@grodowski (Contributor, Author):

Addressed in 36eaf27 👍

@meiji163 meiji163 added this to the v1.1.8 milestone Mar 18, 2025
@meiji163 (Contributor) left a comment

Almost ready to merge, just a couple comments 🚀

There is another edge case where the warning check doesn't work properly. If the shared unique key is being renamed, the INSERT IGNORE warning will say Duplicate entry ... for key _${table}_gho.${renamed_unique_shared_key}, because migrationContext.UniqueKey.Name is the name of the unique key on the original table.
For example, the localtests/swap-uk-uk test case fails with --panic-on-warnings.
I believe we can fix it by changing migrationContext.UniqueKey.Name to be the name of the shared key on the ghost table.

Second, the main use of panic-on-warnings is to prevent data loss on add-unique-key migrations, but we should note that this PR doesn't detect data loss from the binlog applier, which uses replace into. We can address this in a follow-up PR.

@grodowski grodowski force-pushed the grodowski/raise_on_warnings branch from 36eaf27 to 1487b5c Compare March 24, 2025 16:02
@grodowski (Contributor, Author) commented Mar 24, 2025

If the shared unique key is being renamed, the INSERT IGNORE warning will say Duplicate entry ... for key _${table}_gho.${renamed_unique_shared_key}, because migrationContext.UniqueKey.Name is the name of the unique key on the original table.

Working on a fix for this ⏳ It gets somewhat more involved, as we seem to rely on the original table's key name, e.g. here.

Edit: proposed fix in 1ca8ffc

Second, the main use of panic on warnings is to prevent data loss on add unique key migrations, but we should note that this PR doesn't detect dataloss from the binlog applier, which uses replace into. We can address this in a followup PR.

That's a good point. I believe https://github.com/Shopify/gh-ost/blob/1487b5c84887e3753257eed3a897a0167513fb3e/go/logic/applier.go#L1372-L1379 would need to reuse the show warnings code. Need to check whether it would work with the multi-statement query.

Edit: Tried to write a new unit test for the Applier and didn't see any SQL warnings written - it just overwrites on events by design. So is this about new rows being written by clients while a migration is running (is it a responsibility of the client to not do that?), or is there a race condition I don't see? An example scenario would be really helpful.

@meiji163 (Contributor):

proposed fix in 1ca8ffc

Nice! It looks like the exact Duplicate entry warning message differs between versions (the peril of parsing error messages 😅). Perhaps strings.HasPrefix(message, "Duplicate entry") && strings.Contains(message, this.migrationContext.UniqueKey.NameInGhostTable) would work?

…s when renaming unique keys

Error message formats are different across mysql distributions and versions
@grodowski grodowski force-pushed the grodowski/raise_on_warnings branch from fe68db4 to 71a63c4 Compare March 25, 2025 10:18
@grodowski (Contributor, Author):

Perhaps strings.HasPrefix(message, "Duplicate entry") && strings.Contains(message, this.migrationContext.UniqueKey.NameInGhostTable) would work?

Nice, glad that we caught it 😌 Fixed this in the latest commit 👍
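
A minimal sketch of the resulting warning filter, for illustration only (not the PR's exact code; isExpectedDuplicateKeyWarning and uniqueKeyNameInGhostTable are hypothetical names standing in for the real fields):

package copier

import "strings"

// isExpectedDuplicateKeyWarning reports whether a warning is the expected
// "Duplicate entry" produced by INSERT IGNORE on the shared unique key of the
// ghost table. Matching on the key name rather than a hardcoded "PRIMARY"
// also covers renamed unique keys and differing message formats.
func isExpectedDuplicateKeyWarning(message, uniqueKeyNameInGhostTable string) bool {
	return strings.HasPrefix(message, "Duplicate entry") &&
		strings.Contains(message, uniqueKeyNameInGhostTable)
}

// unexpectedWarnings drops the expected duplicate-key warnings and returns
// everything else, which --panic-on-warnings would treat as fatal.
func unexpectedWarnings(messages []string, uniqueKeyNameInGhostTable string) []string {
	var unexpected []string
	for _, message := range messages {
		if isExpectedDuplicateKeyWarning(message, uniqueKeyNameInGhostTable) {
			continue
		}
		unexpected = append(unexpected, message)
	}
	return unexpected
}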

@meiji163 (Contributor):

Edit: Tried to write a new unit test for the Applier and didn't see any SQL warnings written - it just overwrites on events by design. So is this about new rows being written by clients while a migration is running (is it a responsibility of the client to not do that?), or is there a race condition I don't see? An example scenario would be really helpful.

Yes, replace into doesn't raise warnings by design, so we'd have to change it to insert to detect data loss, e.g.:

  • migration adding a unique key
  • user inserts a row that has a duplicate key, where the row was already copied to the ghost table
  • binlog applier sees the insert and replaces the row, without raising any warnings

We can consider it out of scope for this PR
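
To make that scenario concrete, here is a hedged illustration with assumed table and column names (not gh-ost's actual applier code). With the current REPLACE INTO, the conflicting row silently replaces the already-copied one; with INSERT IGNORE, MySQL would keep the copied row and emit warning 1062, which a SHOW WARNINGS check could surface.

package copier

// Assumed ghost table: _tbl_gho (id INT PRIMARY KEY, email VARCHAR(255)),
// where the migration adds UNIQUE KEY uk_email (email). The row copier has
// already inserted (1, 'a@example.com'); the user then inserts
// (2, 'a@example.com') into the original table, which still allows it.
const (
	// Current applier behavior: the insert event is applied with REPLACE,
	// which deletes row id=1 and inserts id=2 -- no warning, so the loss of
	// row id=1 in the ghost table goes unnoticed.
	applierReplaceDML = "REPLACE INTO _tbl_gho (id, email) VALUES (2, 'a@example.com')"

	// Hypothetical alternative: INSERT IGNORE keeps row id=1, skips the new
	// row, and produces warning 1062 (Duplicate entry 'a@example.com' for key
	// '_tbl_gho.uk_email'), which panic-on-warnings could detect.
	applierInsertIgnoreDML = "INSERT IGNORE INTO _tbl_gho (id, email) VALUES (2, 'a@example.com')"
)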

@meiji163 (Contributor):

Thanks for your work on this! 🚀

@meiji163 meiji163 merged commit eedac87 into github:master Mar 26, 2025
8 checks passed
grodowski added a commit to Shopify/gh-ost that referenced this pull request May 22, 2025
The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them.

Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future.

Co-authored-by: Bastian Bartmann <[email protected]>
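
A hedged illustration of the missing order by issue described above (generic query shapes, not gh-ost's exact builder output): without an order by inside the subquery, LIMIT may return an arbitrary subset of the range, so the computed chunk end can overshoot the intended chunk boundary.

package copier

const (
	// Buggy shape: the inner LIMIT 1000 may return any 1000 rows from the
	// range, so the outer aggregate can land far beyond the 1000th smallest
	// id, producing a chunk much larger than --chunk-size.
	buggyRangeEndQuery = `
select max(id) from (
    select id from large_books
    where id >= ? and id <= ?
    limit 1000
) select_osc_chunk`

	// Fixed shape: ordering inside the subquery guarantees the LIMIT keeps
	// the 1000 smallest ids, so the outer aggregate is the true chunk end.
	fixedRangeEndQuery = `
select max(id) from (
    select id from large_books
    where id >= ? and id <= ?
    order by id asc
    limit 1000
) select_osc_chunk`
)
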
grodowski added a commit to Shopify/gh-ost that referenced this pull request Jun 5, 2025
…roblems with --panic-on-warnings (#12)

Problem 1: The split between Temptable and Offset query builders was originally introduced in github#471. Then, both query builders have been changed to calculate actual chunk sizes in github#1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery. 

Problem 2: Between the execution of `CalculateNextIterationRangeEndValues` and `ApplyIterationInsertQuery` there could be new rows inserted/deleted into the range causing `expectedRowCount` to be inaccurate. With `--panic-on-warnings` this could lead to the migration erroneously panicking.

Changes summary:
- Removed count subqueries from range builders to restore the more performant approach from PR github#471
- Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches
- This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications

Co-authored-by: Bastian Bartmann <[email protected]>
meiji163 pushed a commit that referenced this pull request Jun 5, 2025
…roblems with --panic-on-warnings (#1557)

* Fix CalculateNextIterationRangeEndValues order by.

The split between Temptable and Offset query builders was originally introduced in #471. Then, both query builders have been changed to calculate actual chunk sizes in #1500.

The updated implementation of the Offset query introduced a bug, where a missing `order by` clause in `select_osc_chunk` can result in end values potentially surpassing the chunk_size. This wasn't detected during our initial testing where the query just returned rows in their original order, but may happen in real-world scenarios in case the db returns data in an undefined order.

An obvious fix would be to just add an `order by` to the Offset builder subquery, however since both builders use temptables now, it makes more sense to simplify and use only one of them.

Alternatively, the builder could only use the less performant count query variants when `--panic-on-warnings` is enabled and otherwise use the simpler ones from pull/471. We decided to not follow this path for now, hoping `--panic-on-warnings` becomes an updated and safer default in the future.

Co-authored-by: Bastian Bartmann <[email protected]>

* Fix builder's chunk order and drop panic-on-warnings row count check

- Removed count subqueries from range builders to restore the more performant approach from PR #471
- Modified panic-on-warnings logic to trigger errors based solely on SQL warnings, not on row count mismatches
- This addresses potential race conditions where row count comparisons could produce false positives due to concurrent table modifications

---------

Co-authored-by: Bastian Bartmann <[email protected]>
@xiaoxuanfeng commented Aug 18, 2025

Edit: Tried to write a new unit test for the Applier and didn't see any SQL warnings written - it just overwrites on events by design. So is this about new rows being written by clients while a migration is running (is it a responsibility of the client to not do that?), or is there a race condition I don't see? An example scenario would be really helpful.

Yes, replace into doesn't raise warnings by design, so we'd have to change it to insert to detect data loss, e.g.:

  • migration adding a unique key
  • user inserts a row that has a duplicate key, where the row was already copied to the ghost table
  • binlog applier sees the insert and replaces the row, without raising any warnings

We can consider it out of scope for this PR

Hello,
gh-ost applies incremental binlog data as follows:

  • an update event in the binlog corresponds to an UPDATE on the ghost table,
  • a delete event corresponds to a DELETE,
  • an insert event corresponds to a REPLACE.

REPLACE is used because the row copier may have already migrated that row to the ghost (shadow) table. In fact, we could use INSERT IGNORE to solve all of these problems easily, because we now have a way to handle the conflicts that INSERT IGNORE surfaces.
@meiji163
