Commit 78a38d4

Merge pull request #9 from dolthub/aaron/cgo-icu4c
Migrate go-icu-regex implementation to Cgo.
2 parents f2b78f5 + 2f5fb95 commit 78a38d4

16 files changed: +356 -1133 lines changed

.github/workflows/test.yml

Lines changed: 20 additions & 1 deletion
@@ -19,9 +19,28 @@ jobs:
         go-version: ${{ matrix.go-version }}
     - name: Checkout code
       uses: actions/checkout@v3
+    - name: Install ICU4C (MacOS)
+      if: ${{ matrix.platform == 'macos-latest' }}
+      run: |
+        dir=$(brew --cellar icu4c)
+        dir="$dir"/$(ls "$dir")
+        echo CGO_CPPFLAGS=-I$dir/include >> $GITHUB_ENV
+        echo CGO_LDFLAGS=-L$dir/lib >> $GITHUB_ENV
+    - name: Install ICU4C (Windows)
+      if: ${{ matrix.platform == 'windows-latest' }}
+      uses: msys2/setup-msys2@v2
+      with:
+        path-type: inherit
+        msystem: UCRT64
+        pacboy: icu:p toolchain:p pkg-config:p
     - name: Test
-      if: ${{ matrix.platform != 'ubuntu-latest' }}
+      if: ${{ matrix.platform == 'macos-latest' }}
       run: go test ./...
+    - name: Test
+      if: ${{ matrix.platform == 'windows-latest' }}
+      shell: msys2 {0}
+      run: |
+        go.exe test ./...
     - name: Test
       if: ${{ matrix.platform == 'ubuntu-latest' }}
       run: go test -race ./...
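
The macOS step above derives the Cgo include and library flags from the Homebrew cellar directory for `icu4c`. As a rough sketch of that derivation only, the Go snippet below computes the same two `KEY=VALUE` lines that the step appends to `$GITHUB_ENV`. The helper name `icuCgoEnv` and the sample cellar path and version are hypothetical and not part of this commit.

```go
package main

import (
	"fmt"
	"path"
)

// icuCgoEnv is a hypothetical helper mirroring the workflow's shell step.
// Given the Homebrew cellar directory for icu4c and the single version
// directory inside it, it returns the two KEY=VALUE lines that the step
// appends to $GITHUB_ENV.
func icuCgoEnv(cellar, version string) []string {
	dir := path.Join(cellar, version)
	return []string{
		"CGO_CPPFLAGS=-I" + path.Join(dir, "include"),
		"CGO_LDFLAGS=-L" + path.Join(dir, "lib"),
	}
}

func main() {
	// Sample values only; real paths depend on the runner's Homebrew setup.
	for _, line := range icuCgoEnv("/opt/homebrew/Cellar/icu4c", "73.2") {
		fmt.Println(line)
	}
}
```

Note that the shell version relies on `ls "$dir"` printing exactly one entry, so it implicitly assumes a single installed ICU4C version in the cellar.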

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -40,3 +40,5 @@ test-server
 
 # OSX Files
 .DS_Store
+
+*~

README.md

Lines changed: 29 additions & 19 deletions
@@ -1,28 +1,38 @@
 # ICU Regular Expressions in Go
 
-The [ICU library](https://github.com/unicode-org/icu) is used in MySQL to parse regular expressions.
-Go's built-in regular expressions follow a different standard than ICU, and thus can cause inconsistencies when attempting to match MySQL's behavior.
-These inconsistencies would hopefully result in an error (prompting user intervention), but may silently return unexpected results, raising no alarm when data is being modified in unexpected ways.
+Minimal bindings to [ICU4C](https://github.com/unicode-org/icu)'s regex implementation, for use in Go.
 
-To get around this, we've implemented the necessary ICU functions by compiling them into a [WebAssembly](https://webassembly.org/) module, and running the module using the [wazero](https://github.com/tetratelabs/wazero) library.
-Although this approach does come with a performance penalty, this allows for implementing packages to retain cross-compilation support, as CGo is not invoked due to this package.
+This package is not intended to be a general purpose binding. Its primary purpose is to support [dolt](https://github.com/dolthub/dolt)'s need for ICU-compatible regexes in order to implement MySQL-compatible functionality.
 
-## Building
+# Use
 
-To make modifications to the compiled WASM module, we've included a [build script](icu/build.sh).
-The requirements are as follows:
+```go
+// Create a regex (named re so it does not shadow the regex package)
+re := regex.CreateRegex(1024)
+defer re.Close()
+// Set its pattern
+re.SetRegexString(context.TODO(), "[abc]+", regex.RegexFlags_None)
+// Set its match string
+re.SetMatchString(context.TODO(), "123abcabcabcdef")
+// Extract a matching substring; note start and occurrence number are 1-indexed.
+substr, ok, err := re.Substring(context.TODO(), 1, 1)
+assert.NoError(t, err)
+assert.True(t, ok)
+assert.Equal(t, "abcabcabc", substr)
+```
 
-* Emscripten v3.1.38
-* wasm2wat
-* wat2wasm
+# Building and Dependencies
 
-Other Emscripten versions may compile just fine, however they have not been tested, and thus we restrict compilation to only the tested version.
-This also means that the ICU library is version [68.1](https://github.com/unicode-org/icu/tree/5d81f6f9a0edc47892a1d2af7024f835b47deb82), as that is the only version that our supported version of Emscripten has ported.
-Both `wasm2wat` and `wat2wasm` exist to expose the global stack variable, as not all platforms will expose the variable.
-None of the exposed functions require [ICU's data](https://unicode-org.github.io/icu/userguide/icu_data/), thus it has been excluded to save on space and memory usage.
-MySQL, although collation aware (and in spite of what the documentation may suggest), does not make use of any collation functionality in the context of regular expressions.
+This library, and consequently anything that depends on it, requires ICU4C to build and link against. This library does not ship a pre-compiled version of ICU4C and does not build ICU4C alongside itself as part of its Cgo binding. Consequently, building this library or anything that depends on it requires a C++ toolchain and a version of ICU4C installed.
 
-## Notes
+For Windows, this library currently only supports MinGW. We are happy to accept changes to support other toolchains based on Go build tags, for example.
 
-Due to the high startup-cost of the WASM runtime, this package _enforces_ that all Regex objects are closed before being dereferenced.
-If any Regex objects are dereferenced before being closed, then a panic will occur at some non-deterministic point in the future.
+For Linux, a package like `libicu-dev` typically has the necessary library.
+
+For Windows, with msys2, `pacman -S icu-devel` installs the necessary development libraries.
+
+For macOS, `brew install icu4c` will install the necessary library, but it is not on the default search path of the toolchain. Building with something like `CGO_CPPFLAGS=-I$(brew --cellar icu4c)/$(ls $(brew --cellar icu4c))/include CGO_LDFLAGS=-L$(brew --cellar icu4c)/$(ls $(brew --cellar icu4c))/lib` is potentially necessary.
+
+There is some support for statically linking ICU4C, by building with the build tag `icu_static`. Currently this only changes the linker line for a Windows build. For a macOS or Linux build to link it statically, the build tag `icu_static` should still be used, but it should also be the case that the dynamic libraries are not installed.
+
+When using a self-built dynamic library on macOS, the resulting binaries work best if `runConfigureICU` is run with `--enable-rpath`, so that the ICU4C dynamic libraries are discoverable by the built binary at their installed location.
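
The README example notes that the start position and occurrence number passed to `Substring` are 1-indexed. The sketch below illustrates only that indexing convention. It is not the binding: it uses Go's standard `regexp` package (RE2 syntax, not ICU) and byte offsets, whereas the unit the real binding uses for `start` is not stated in the README. The function name `substring` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// substring is a hypothetical stand-in for the binding's Substring call.
// It uses Go's regexp package (NOT ICU) and byte offsets, so it only
// illustrates the 1-indexed start and occurrence arguments.
func substring(pattern, s string, start, occurrence int) (string, bool, error) {
	if start < 1 || occurrence < 1 {
		return "", false, errors.New("start and occurrence are 1-indexed")
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", false, err
	}
	if start-1 > len(s) {
		return "", false, nil
	}
	// Begin searching at the 1-indexed start position and collect at most
	// `occurrence` matches; the last one collected is the one requested.
	matches := re.FindAllString(s[start-1:], occurrence)
	if len(matches) < occurrence {
		return "", false, nil
	}
	return matches[occurrence-1], true, nil
}

func main() {
	s := "123abcabcabcdef"
	// start=1, occurrence=1: the first match anywhere in the string.
	fmt.Println(substring("[abc]+", s, 1, 1))
	// start=5 skips "123a", so the match begins at the first "b".
	fmt.Println(substring("[abc]+", s, 5, 1))
	// occurrence=2 asks for the second match.
	fmt.Println(substring("[a-z]+", "ab12cd", 1, 2))
}
```

Slicing the input at `start-1` is a simplification: it changes how anchors such as `^` behave relative to a true "start searching here" option, which is one more reason to treat this as an illustration rather than a model of ICU's behavior.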
