Commit 2f5fb95

README.md: Update the readme.
1 parent 17f989d commit 2f5fb95

1 file changed

+29
-19
lines changed

README.md

Lines changed: 29 additions & 19 deletions
@@ -1,28 +1,38 @@
11
# ICU Regular Expressions in Go
22

3-
The [ICU library](https://github.com/unicode-org/icu) is used in MySQL to parse regular expressions.
4-
Go's built-in regular expressions follow a different standard than ICU, and thus can cause inconsistencies when attempting to match MySQL's behavior.
5-
These inconsistencies would hopefully result in an error (prompting user intervention), but may silently return unexpected results, raising no alarm when data is being modified in unexpected ways.
3+
Minimal bindings to [ICU4C](https://github.com/unicode-org/icu)'s regex implementation, for use in Go.
64

7-
To get around this, we've implemented the necessary ICU functions by compiling them into a [WebAssembly](https://webassembly.org/) module, and running the module using the [wazero](https://github.com/tetratelabs/wazero) library.
8-
Although this approach does come with a performance penalty, this allows for implementing packages to retain cross-compilation support, as CGo is not invoked due to this package.
5+
This package is not intended to be a general purpose binding. Its primary purpose is to support [dolt](https://github.com/dolthub/dolt)'s need for ICU-compatible regexes in order to implement MySQL-compatible functionality.
96

10-
## Building
7+
# Use
118

12-
To make modifications to the compiled WASM module, we've included a [build script](icu/build.sh).
13-
The requirements are as follows:
9+
```go
10+
// Create a regex; the variable is named re so it does not shadow the regex package
11+
re := regex.CreateRegex(1024)
12+
defer re.Close()
13+
// Set its pattern
14+
re.SetRegexString(context.TODO(), "[abc]+", regex.RegexFlags_None)
15+
// Set its match string
16+
re.SetMatchString(context.TODO(), "123abcabcabcdef")
17+
// Extract a matching substring; note start and occurrence number are 1 indexed.
18+
substr, ok, err := re.Substring(context.TODO(), 1, 1)
19+
assert.NoError(t, err)
20+
assert.True(t, ok)
21+
assert.Equal(t, "abcabcabc", substr)
22+
```
1423

15-
* Emscripten v3.1.38
16-
* wasm2wat
17-
* wat2wasm
24+
# Building and Dependencies
1825

19-
Other Emscripten versions may compile just fine, however they have not been tested, and thus we restrict compilation to only the tested version.
20-
This also means that the ICU library is version [68.1](https://github.com/unicode-org/icu/tree/5d81f6f9a0edc47892a1d2af7024f835b47deb82), as that is the only version that our supported version of Emscripten has ported.
21-
Both `wasm2wat` and `wat2wasm` exist to expose the global stack variable, as not all platforms will expose the variable.
22-
None of the exposed functions require [ICU's data](https://unicode-org.github.io/icu/userguide/icu_data/), thus it has been excluded to save on space and memory usage.
23-
MySQL, although collation aware (and in spite of what the documentation may suggest), does not make use of any collation functionality in the context of regular expressions.
26+
This library, and consequently anything that depends on it, requires ICU4C to build and link against. This library does not ship a pre-compiled version of ICU4C and does not build ICU4C alongside itself as part of its Cgo binding. Consequently, building this library or anything that depends on it requires a C++ toolchain and a version of ICU4C installed.
2427

25-
## Notes
28+
For Windows, this library currently only supports MinGW. We are happy to accept changes to support other toolchains based on Go build tags, for example.
2629

27-
Due to the high startup-cost of the WASM runtime, this package _enforces_ that all Regex objects are closed before being dereferenced.
28-
If any Regex objects are dereferenced before being closed, then a panic will occur at some non-deterministic point in the future.
30+
For Linux, a package like `libicu-dev` typically has the necessary library.
31+
32+
For Windows, with msys2, `pacman -S icu-devel` installs the necessary development libraries.
33+
34+
For macOS, `brew install icu4c` will install the necessary library, but it is not on the default search path of the toolchain. Building with something like `CGO_CPPFLAGS=-I$(brew --cellar icu4c)/$(ls $(brew --cellar icu4c))/include CGO_LDFLAGS=-L$(brew --cellar icu4c)/$(ls $(brew --cellar icu4c))/lib` may be necessary.
35+
36+
There is some support for statically linking ICU4C, by building with the build tag `icu_static`. Currently this only changes the linker line for a Windows build. For a macOS or Linux build to link it statically, the build tag `icu_static` should still be used, and the ICU4C dynamic libraries should not be installed.
37+
38+
When using a self-built dynamic library on macOS, the resulting binaries work best if `runConfigureICU` is run with `--enable-rpath`, so that the ICU4C dynamic libraries are discoverable by the built binary at their installed location.
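
The usage example in the new README notes that the start position and occurrence number passed to `Substring` are 1 indexed. The sketch below illustrates that convention only. It uses Go's standard `regexp` package rather than ICU, and the `substring` helper is a hypothetical stand-in, not this package's API. It also treats `start` as a byte offset, which is an assumption made for simplicity and may not match how ICU counts positions.

```go
package main

import (
	"fmt"
	"regexp"
)

// substring is a hypothetical helper (not part of this package) that returns
// the occurrence-th match of pattern in s, searching from the 1-indexed
// byte position start. The boolean reports whether such a match exists.
func substring(pattern, s string, start, occurrence int) (string, bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", false, err
	}
	// Reject positions and occurrence numbers outside the 1-indexed range.
	if start < 1 || start > len(s)+1 || occurrence < 1 {
		return "", false, nil
	}
	// Convert the 1-indexed start to a 0-indexed slice offset and collect
	// at most `occurrence` matches.
	matches := re.FindAllString(s[start-1:], occurrence)
	if len(matches) < occurrence {
		return "", false, nil
	}
	return matches[occurrence-1], true, nil
}

func main() {
	substr, ok, err := substring("[abc]+", "123abcabcabcdef", 1, 1)
	fmt.Println(substr, ok, err) // prints "abcabcabc true <nil>"
}
```

With the same inputs as the README example, a start of 1 means "search from the first character" and an occurrence of 1 means "the first match", so the result is `abcabcabc`.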
