|
1 | 1 | # ICU Regular Expressions in Go |
2 | 2 |
|
3 | | -The [ICU library](https://github.com/unicode-org/icu) is used in MySQL to parse regular expressions. |
4 | | -Go's built-in regular expressions follow a different standard than ICU, and thus can cause inconsistencies when attempting to match MySQL's behavior. |
5 | | -These inconsistencies would hopefully result in an error (prompting user intervention), but may silently return unexpected results, raising no alarm when data is being modified in unexpected ways. |
| 3 | +Minimal bindings to [ICU4C](https://github.com/unicode-org/icu)'s regex implementation, for use in Go. |
6 | 4 |
|
7 | | -To get around this, we've implemented the necessary ICU functions by compiling them into a [WebAssembly](https://webassembly.org/) module, and running the module using the [wazero](https://github.com/tetratelabs/wazero) library. |
8 | | -Although this approach does come with a performance penalty, this allows for implementing packages to retain cross-compilation support, as CGo is not invoked due to this package. |
| 5 | +This package is not intended to be a general-purpose binding. Its primary purpose is to support [dolt](https://github.com/dolthub/dolt)'s need for ICU-compatible regexes in order to implement MySQL-compatible functionality. |
9 | 6 |
|
10 | | -## Building |
| 7 | +# Use |
11 | 8 |
|
12 | | -To make modifications to the compiled WASM module, we've included a [build script](icu/build.sh). |
13 | | -The requirements are as follows: |
| 9 | +```go |
| 10 | +// Create a regex (named re so that it does not shadow the regex package) |
| 11 | +re := regex.CreateRegex(1024) |
| 12 | +defer re.Close() |
| 13 | +// Set its pattern |
| 14 | +re.SetRegexString(context.TODO(), "[abc]+", regex.RegexFlags_None) |
| 15 | +// Set its match string |
| 16 | +re.SetMatchString(context.TODO(), "123abcabcabcdef") |
| 17 | +// Extract a matching substring; note that start and occurrence number are 1-indexed. |
| 18 | +substr, ok, err := re.Substring(context.TODO(), 1, 1) |
| 19 | +assert.NoError(t, err) |
| 20 | +assert.True(t, ok) |
| 21 | +assert.Equal(t, "abcabcabc", substr) |
| 22 | +``` |
14 | 23 |
|
15 | | -* Emscripten v3.1.38 |
16 | | -* wasm2wat |
17 | | -* wat2wasm |
| 24 | +# Building and Dependencies |
18 | 25 |
|
19 | | -Other Emscripten versions may compile just fine, however they have not been tested, and thus we restrict compilation to only the tested version. |
20 | | -This also means that the ICU library is version [68.1](https://github.com/unicode-org/icu/tree/5d81f6f9a0edc47892a1d2af7024f835b47deb82), as that is the only version that our supported version of Emscripten has ported. |
21 | | -Both `wasm2wat` and `wat2wasm` exist to expose the global stack variable, as not all platforms will expose the variable. |
22 | | -None of the exposed functions require [ICU's data](https://unicode-org.github.io/icu/userguide/icu_data/), thus it has been excluded to save on space and memory usage. |
23 | | -MySQL, although collation aware (and in spite of what the documentation may suggest), does not make use of any collation functionality in the context of regular expressions. |
| 26 | +This library, and consequently anything that depends on it, requires ICU4C to build and link against. This library does not ship a pre-compiled version of ICU4C and does not build ICU4C alongside itself as part of its Cgo binding. Consequently, building this library or anything that depends on it requires a C++ toolchain and a version of ICU4C installed. |
24 | 27 |
|
25 | | -## Notes |
| 28 | +For Windows, this library currently only supports MinGW. We are happy to accept changes to support other toolchains based on Go build tags, for example. |
26 | 29 |
|
27 | | -Due to the high startup-cost of the WASM runtime, this package _enforces_ that all Regex objects are closed before being dereferenced. |
28 | | -If any Regex objects are dereferenced before being closed, then a panic will occur at some non-deterministic point in the future. |
| 30 | +For Linux, a package like `libicu-dev` typically has the necessary library. |
| 31 | + |
| 32 | +For Windows, with msys2, `pacman -S icu-devel` installs the necessary development libraries. |
| 33 | + |
| 34 | +For macOS, `brew install icu4c` will install the necessary library, but it is not on the default search path of the toolchain. Building with something like `CGO_CPPFLAGS=-I$(brew --cellar icu4c)/$(ls $(brew --cellar icu4c))/include CGO_LDFLAGS=-L$(brew --cellar icu4c)/$(ls $(brew --cellar icu4c))/lib` may be necessary. |
| 35 | + |
| 36 | +There is some support for statically linking ICU4C, by building with the build tag `icu_static`. Currently this only changes the linker line for a Windows build. For a macOS or Linux build to link it statically, the build tag `icu_static` should still be used, but it should also be the case that the dynamic libraries are not installed. |
| 37 | + |
| 38 | +When using a self-built dynamic library on macOS, the resulting binaries work best if `runConfigureICU` is run with `--enable-rpath`, so that the ICU4C dynamic libraries are discoverable by the built binary at their installed location. |