Make Levenshtein a proper differ instead of a hacky mode #147

MikeSmithEU · 2020-11-08T00:39:32Z

Levenshtein should be a proper Differ class, not hacked into metrics.

This PR makes the necessary changes to achieve that:

- Extend the Differ interface to support get_opcode_counts and get_error_rate. For Levenshtein in particular this can allow huge speed and memory gains, because for large texts the algorithm needs a big matrix to determine the least expensive path.
- Change metrics, tests and documentation to reflect this change, and allow WER to be calculated using Levenshtein.
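As a rough illustration of the interface change, here is a minimal sketch. Only the method names get_opcode_counts and get_error_rate come from this PR; the class names, the constructor signature, and the way an uneven "replace" block is split into replace plus insert/delete are assumptions made for the example, not the code in this branch.

```python
from abc import ABC, abstractmethod
from difflib import SequenceMatcher


class Differ(ABC):
    """Hypothetical base class; the real interface in the repository may differ."""

    def __init__(self, ref, hyp):
        self.ref = list(ref)
        self.hyp = list(hyp)

    @abstractmethod
    def get_opcodes(self):
        """Return (tag, i1, i2, j1, j2) tuples describing the full diff."""

    def get_opcode_counts(self):
        # Default: derive the counts from the full opcodes. A differ that can
        # count edits without building the whole diff can override this.
        counts = {"equal": 0, "replace": 0, "insert": 0, "delete": 0}
        for tag, i1, i2, j1, j2 in self.get_opcodes():
            ref_len, hyp_len = i2 - i1, j2 - j1
            if tag == "replace":
                # Assumption: pair items one-to-one, count the surplus separately.
                paired = min(ref_len, hyp_len)
                counts["replace"] += paired
                counts["delete"] += ref_len - paired
                counts["insert"] += hyp_len - paired
            else:
                counts[tag] += max(ref_len, hyp_len)
        return counts

    def get_error_rate(self):
        # WER-style rate: (replace + insert + delete) / number of reference items.
        counts = self.get_opcode_counts()
        errors = counts["replace"] + counts["insert"] + counts["delete"]
        if not self.ref:
            # Arbitrary convention for an empty reference, chosen for this sketch.
            return 0.0 if errors == 0 else 1.0
        return errors / len(self.ref)


class SequenceMatcherDiffer(Differ):
    """Example differ backed by difflib from the standard library."""

    def get_opcodes(self):
        matcher = SequenceMatcher(None, self.ref, self.hyp, autojunk=False)
        return matcher.get_opcodes()
```

With this shape, a Levenshtein differ would only need to override get_opcode_counts or get_error_rate, and could refuse get_opcodes until a correct implementation is available.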

Known issues:

Getting the opcodes using Levenshtein is currently broken:
- Tried the Levenshtein package (only supports strings, so not usable).
- Using the edit-distance package at the moment, but it does not give the proper results for lists. The package is bugged for opcodes, and I cannot currently invest the time to fix it or even to file a proper bug report with unit tests to the maintainer.
- Tried some other packages, but either the same problems seem to occur, or only the edit distance is supported (no opcodes). Will look at other solutions again later.
- A pure Python implementation seems ill-advised because of the computational and space complexity of the algorithm (the quadratic time complexity of the Wagner-Fischer algorithm is the best we can do, and that might hold solely for the edit distance, not the opcodes). It could be a temporary stop-gap, but it probably requires too much time investment.
- Maybe disable Levenshtein for metrics requiring the opcodes? Currently these are DiffCounts, Diff and WER. WER could be factored out with some code changes (only for MODE_STRICT, although custom weights should be implementable using Levenshtein).
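To make the complexity point concrete, here is a minimal pure Python sketch of the Wagner-Fischer recurrence over lists of tokens. It is only an illustration, not the code in this branch, and the function name is made up for the example. The distance alone needs just two rows of the matrix; recovering full opcodes would require keeping the whole matrix (or backpointers) for the traceback, which is where the memory cost comes from.

```python
def levenshtein_distance(ref, hyp):
    """Edit distance between two sequences, keeping only two matrix rows."""
    ref, hyp = list(ref), list(hyp)
    # The distance is symmetric, so put the shorter sequence on the row axis
    # to keep the rows as small as possible.
    if len(hyp) > len(ref):
        ref, hyp = hyp, ref

    previous = list(range(len(hyp) + 1))  # distances from an empty prefix of ref
    for i, ref_item in enumerate(ref, 1):
        current = [i]  # turning i ref items into nothing costs i deletions
        for j, hyp_item in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_item != hyp_item), # match or substitution
            ))
        previous = current
    return previous[-1]
```

Time stays O(len(ref) * len(hyp)), but memory drops to O(min(len(ref), len(hyp))), which is why supporting only the distance is much cheaper than supporting opcodes.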

It seems to me that, for now, we should probably only support Levenshtein for the distance, not for the opcodes. Editops totals ("opcode counts") might be feasible; I have not yet looked at the implementations for this, as I really wanted to get full and correct opcodes so that we could display the entire diff between ref and hyp.
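One way the totals could be obtained without full opcodes is to carry the counts along with the cost in each cell of the dynamic programming table. The sketch below is an assumption about how that might look, not something implemented in this PR; the function name and the returned dict keys are hypothetical. When several minimum-cost paths exist, the split between replace, insert and delete depends on the tie-breaking order, but the total cost does not.

```python
def levenshtein_counts(ref, hyp):
    """Count equal/replace/insert/delete along one minimum-cost edit path."""
    ref, hyp = list(ref), list(hyp)
    # Each cell is (cost, equal, replace, insert, delete).
    previous = [(j, 0, 0, j, 0) for j in range(len(hyp) + 1)]
    for i, ref_item in enumerate(ref, 1):
        current = [(i, 0, 0, 0, i)]
        for j, hyp_item in enumerate(hyp, 1):
            diag, up, left = previous[j - 1], previous[j], current[j - 1]
            if ref_item == hyp_item:
                match = (diag[0], diag[1] + 1, diag[2], diag[3], diag[4])
            else:
                match = (diag[0] + 1, diag[1], diag[2] + 1, diag[3], diag[4])
            delete = (up[0] + 1, up[1], up[2], up[3], up[4] + 1)
            insert = (left[0] + 1, left[1], left[2], left[3] + 1, left[4])
            # Ties are broken in the order match, delete, insert.
            current.append(min(match, delete, insert, key=lambda cell: cell[0]))
        previous = current
    cost, equal, replace, insert, delete = previous[-1]
    return {"distance": cost, "equal": equal, "replace": replace,
            "insert": insert, "delete": delete}
```

This still only keeps two rows in memory, so it would give WER-style numbers (replace + insert + delete over the reference length) without the matrix needed to display the entire diff.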


aro-max · 2021-02-08T21:32:55Z

Thanks, Mike,
What is the bug in the edit-distance package? Do you have a test case?
Maybe the best option is to keep the code as it is and disable the diffcount with Levenshtein? As a next step we could develop our own version of Levenshtein distance at the word level.

MikeSmithEU marked this pull request as ready for review January 14, 2021 19:21

MikeSmithEU requested a review from danielthepope January 14, 2021 19:25

MikeSmithEU added 5 commits January 23, 2021 16:52

levenshtein (tmp commit, incorrect opcodes)

ec23dba

add get_opcode_counts to Diff classes (from metrics)

950c081

get_error_rate for WER

20b9570

cleanup

ab2217f

Don't support opcodes for levenshtein (missing or incorrectly impleme…

c23c4a4

…nted in all tested libraries)

MikeSmithEU force-pushed the cleanup-differs branch from d1a75f4 to c23c4a4 Compare January 23, 2021 16:06
