Skip to content

Conversation

dkharms
Copy link
Member

@dkharms dkharms commented Sep 1, 2025

Description

Here's the benchmarks for KMP and bytes.Index (I excluded comparison of heap allocation since both version had zero heap allocations).

goos: linux
goarch: amd64
pkg: github.com/ozontech/seq-db/pattern
cpu: 12th Gen Intel(R) Core(TM) i5-12600K
                                              │    kmp.txt    │              bytes.txt               │
                                              │    sec/op     │    sec/op     vs base                │
FindSequence_Deterministic/regular-cases-0-16   19.605n ±  7%   8.456n ±  1%  -56.87% (p=0.000 n=10)
FindSequence_Deterministic/regular-cases-1-16   14.305n ±  2%   5.590n ±  1%  -60.92% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-0-16     91.35n ±  3%   22.86n ±  1%  -74.98% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-1-16    167.70n ±  2%   28.44n ±  1%  -83.04% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-2-16    2482.5n ±  1%   181.8n ±  1%  -92.68% (p=0.000 n=10)
FindSequence_Deterministic/corner-cases-3-16    39.509µ ±  1%   2.663µ ±  1%  -93.26% (p=0.000 n=10)
FindSequence_Random/tiny-16                      63.56n ± 34%   12.10n ± 11%  -80.96% (p=0.000 n=10)
FindSequence_Random/small-16                    190.70n ±  8%   23.70n ± 39%  -87.57% (p=0.000 n=10)
FindSequence_Random/medium-16                   681.70n ± 25%   41.71n ± 28%  -93.88% (p=0.000 n=10)
FindSequence_Random/large-16                    9804.5n ±  2%   538.2n ±  7%  -94.51% (p=0.000 n=10)
FindSequence_Random/extra-large-16              603.54µ ±  1%   58.99µ ±  2%  -90.23% (p=0.000 n=10)
geomean                                          702.6n         94.94n        -86.49%

                                   │    kmp.txt    │                bytes.txt                 │
                                   │      B/s      │      B/s        vs base                  │
FindSequence_Random/tiny-16          960.4Mi ± 52%   5044.3Mi ± 12%   +425.25% (p=0.000 n=10)
FindSequence_Random/small-16         1.250Gi ±  9%   10.424Gi ± 33%   +733.58% (p=0.000 n=10)
FindSequence_Random/medium-16        1.402Gi ± 20%   22.882Gi ± 32%  +1531.58% (p=0.000 n=10)
FindSequence_Random/large-16         1.556Gi ±  2%   28.363Gi ±  7%  +1722.42% (p=0.000 n=10)
FindSequence_Random/extra-large-16   1.618Gi ±  1%   16.555Gi ±  2%   +923.15% (p=0.000 n=10)
geomean                              1.329Gi          14.07Gi         +959.02%

So we see that on regular cases we have around 53% of speedup by using bytes.Index. And on large strings we see around of 90% speedup (I believe that's because we build LPS for needles, not the haystack).

Here are features of my CPU (it supports both SSE4.2 and AVX2):

Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 **sse4_2** x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 **avx2** smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities

Closes #98


  • I have read and followed all requirements in CONTRIBUTING.md;
  • I used LLM/AI assistance to make this pull request;

If you have used LLM/AI assistance please provide model name and full prompt:

Model: {{model-name}}
Prompt: {{prompt}}

Summary by CodeRabbit

  • Refactor

    • Switched substring handling to a leaner byte-oriented implementation for multi-part pattern searches, simplifying internals and improving search performance.
  • Tests

    • Replaced legacy substring tests with new sequence-focused tests and added deterministic and randomized benchmarks plus data generators to strengthen validation and performance coverage.

@dkharms dkharms self-assigned this Sep 1, 2025
@dkharms dkharms added benchmarks Features or improvements over our benchmark processes performance Features or improvements that positively affect seq-db performance labels Sep 1, 2025
@codecov-commenter
Copy link

codecov-commenter commented Sep 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.23%. Comparing base (09d9579) to head (fe6da01).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
- Coverage   71.50%   71.23%   -0.27%     
==========================================
  Files         201      200       -1     
  Lines       18170    18143      -27     
==========================================
- Hits        12993    12925      -68     
- Misses       4457     4500      +43     
+ Partials      720      718       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

coderabbitai bot commented Sep 1, 2025

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.
📝 Walkthrough

Walkthrough

Replaces KMP-based substring matcher with a bytes.Index-based findSequence, removes pattern/substring.go and its tests, updates wildcardSearch.middle from []*substring to [][]byte (using util.StringToByteUnsafe), and adds new findSequence tests and benchmarks (test generator contains a loop bug).

Changes

Cohort / File(s) Summary
Core algorithm & pattern update
pattern/substring.go (deleted), pattern/pattern.go
Deleted KMP-based substring implementation and helpers; changed wildcardSearch.middle from []*substring to [][]byte; replaced substring-pattern construction with val := util.StringToByteUnsafe(terms[i].Data) and s.middle = append(s.middle, val); adjusted middleLen to use len(val); added findSequence(haystack []byte, needles [][]byte) int implemented with bytes.Index. Introduces unsafe string→[]byte conversion.
Tests & benchmarks
pattern/substring_test.go (deleted), pattern/pattern_test.go
Removed all KMP-specific unit tests. Added TestFindSequence, BenchmarkFindSequence_Deterministic, BenchmarkFindSequence_Random, and test/benchmark helpers (generateTestData, generateRandomBytes, bb). Note: generateTestData currently uses for i := range needleCount (invalid) and will not compile as-is.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title “perf: use bytes.Index for substring search” precisely and concisely captures the primary change of replacing the custom KMP implementation with the standard library’s bytes.Index for improved performance. It highlights the key optimization without extraneous details, enabling a quick understanding of the PR’s focus. This aligns directly with the PR objectives around performance.
Linked Issues Check ✅ Passed The changes fully satisfy issue #98 by removing the KMP-based substring logic in pattern/substring.go, updating pattern/pattern.go to use bytes.Index in findSequence, and adding targeted tests and benchmarks to validate and measure the new implementation’s performance. All modifications directly implement the linked issue’s objective of leveraging vectorized instructions and the Rabin–Karp fallback in bytes.Index. There are no outstanding requirements from the issue left unaddressed.
Out of Scope Changes Check ✅ Passed All file changes pertain solely to replacing the KMP algorithm with bytes.Index, removing the old substring abstraction and associated tests, and adding new findSequence tests and benchmarks; no unrelated features or files are modified. Every change maps back to the performance objective defined in the linked issue. There are no extraneous or out-of-scope alterations.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch 98-bytes-index

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
pattern/substring.go (1)

7-16: Tighten the loop and avoid double indexing.

Minor cleanup and a tiny win: range with index+value and use val for bytes.Index. No behavior change.

-func findSequence(haystack []byte, needles [][]byte) int {
-	for cur := range needles {
-		val := needles[cur]
-		start := bytes.Index(haystack, needles[cur])
-		if start == -1 {
-			return cur
-		}
-		haystack = haystack[start+len(val):]
-	}
-	return len(needles)
-}
+func findSequence(haystack []byte, needles [][]byte) int {
+	for i, val := range needles {
+		start := bytes.Index(haystack, val)
+		if start == -1 {
+			return i
+		}
+		haystack = haystack[start+len(val):]
+	}
+	return len(needles)
+}

Do we ever expect empty needles? If yes, current logic treats them as “found” without advancing. If that’s undesirable, we should skip/forbid empties and add a unit test.

pattern/pattern.go (3)

78-81: Unsafe aliasing needs lifetime pinning (keep original strings alive).

middle now holds [][]byte produced via an unsafe string→[]byte view. To avoid GC lifetime hazards and future accidental mutation, keep the original strings alongside the byte slices.

 type wildcardSearch struct {
   baseSearch
   prefix    []byte
   suffix    []byte
-  middle    [][]byte
+  middle    [][]byte
+  // keep original strings to pin GC lifetime when using unsafe conversions
+  middleKeepAlive []string
   middleLen int
   narrowed  bool
 }

99-102: Pin the source strings when using StringToByteUnsafe.

Store the source string to ensure it stays alive for the duration of the search. Also document the read-only invariant.

-      val := util.StringToByteUnsafe(terms[i].Data)
-      s.middle = append(s.middle, val)
-      s.middleLen += len(val)
+      str := terms[i].Data
+      val := util.StringToByteUnsafe(str) // read-only view; do not mutate
+      s.middle = append(s.middle, val)
+      s.middleKeepAlive = append(s.middleKeepAlive, str) // pin lifetime
+      s.middleLen += len(val)

If avoiding any unsafe GC nuance is preferred, switch to a safe copy:

-      val := util.StringToByteUnsafe(terms[i].Data)
+      val := []byte(terms[i].Data) // safe copy

151-152: Guard slice bounds locally (defensive check).

checkMiddle guards length globally, but a local bound check keeps findSequence call obviously safe and future-proof if conditions change.

-  return findSequence(val[len(s.prefix):len(val)-len(s.suffix)], s.middle) == len(s.middle)
+  lo, hi := len(s.prefix), len(val)-len(s.suffix)
+  if hi < lo {
+    return false
+  }
+  return findSequence(val[lo:hi], s.middle) == len(s.middle)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 035c094 and 2b9cefd.

📒 Files selected for processing (3)
  • pattern/pattern.go (2 hunks)
  • pattern/substring.go (1 hunks)
  • pattern/substring_test.go (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (3)
pattern/substring_test.go (3)

24-40: Tests look good.

Nice coverage on order, overlaps, and corner growth cases.


125-131: LGTM.

Simple and fast random byte generator.


133-135: LGTM.

Nice helper for readability in deterministic cases.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
pattern/substring_test.go (3)

72-79: Use the canonical benchmark loop; drop redundant conversion.

Replace b.Loop() and []byte(c.haystack).

-      b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
-        for b.Loop() {
-          findSequence([]byte(c.haystack), c.needles)
-        }
-      })
+      b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
+        b.ReportAllocs()
+        for i := 0; i < b.N; i++ {
+          findSequence(c.haystack, c.needles)
+        }
+      })

97-108: Stabilize random bench; set bytes once.

Move SetBytes out of the loop and use the canonical loop.

   b.Run(size.name, func(b *testing.B) {
     haystack, needles := generateTestData(
       size.haystackSize, size.needleSize, size.needleCount, 256,
     )
-    b.ResetTimer()
-    for b.Loop() {
-      findSequence(haystack, needles)
-      b.SetBytes(int64(len(haystack)))
-    }
+    b.ReportAllocs()
+    b.SetBytes(int64(len(haystack)))
+    b.ResetTimer()
+    for i := 0; i < b.N; i++ {
+      findSequence(haystack, needles)
+    }
   })

111-123: Fix compile error and bound checks in data generator.

Looping over int with range doesn’t compile; rand.Intn(len(haystack)-needleSize) can panic; copy must be bounded.

 func generateTestData(haystackSize, needleSize, needleCount, charset int) ([]byte, [][]byte) {
   haystack := generateRandomBytes(haystackSize, charset)

   needles := make([][]byte, needleCount)
-  for i := range needleCount {
-    pattern := generateRandomBytes(needleSize, charset)
-    pos := rand.Intn(len(haystack) - needleSize)
-    copy(haystack[pos:], pattern)
-    needles[i] = pattern
-  }
+  if needleSize > haystackSize {
+    needleSize = haystackSize
+  }
+  for i := 0; i < needleCount; i++ {
+    pattern := generateRandomBytes(needleSize, charset)
+    if needleSize > 0 {
+      pos := rand.Intn(len(haystack)-needleSize+1)
+      copy(haystack[pos:pos+needleSize], pattern)
+    }
+    needles[i] = pattern
+  }

   return haystack, needles
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2b9cefd and abb9bb5.

📒 Files selected for processing (3)
  • pattern/pattern.go (2 hunks)
  • pattern/substring.go (1 hunks)
  • pattern/substring_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pattern/substring.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
pattern/pattern.go (1)

342-349: Ignore empty-range wrap-around risk
baseSearch.first is int(tp.FirstTID()) (≥1), and util.BinSearchInRange never returns below that, so on “not found”

s.last = s.first - 1  

yields s.last ≥ 0. Casting these non-negative ints to uint32 can’t wrap.

Likely an incorrect or invalid review comment.

@ssnd ssnd requested review from moflotas and eguguchkin September 8, 2025 10:44
@dkharms dkharms enabled auto-merge (squash) September 10, 2025 08:49
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
pattern/pattern_test.go (3)

487-515: Good new coverage; consider a couple more edge tests.

Add cases for empty haystack and/or empty needles to lock in boundary behavior (e.g., [], ["a"], [""]).


517-556: Nit: avoid redundant []byte conversion; optionally set bytes once.

c.haystack is already []byte. Also it helps to set bytes outside the loop.

Apply this diff:

-            b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
-                for b.Loop() {
-                    findSequence([]byte(c.haystack), c.needles)
-                }
-            })
+            b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
+                b.SetBytes(int64(len(c.haystack)))
+                for b.Loop() {
+                    findSequence(c.haystack, c.needles)
+                }
+            })

558-584: SetBytes should be outside the loop.

It’s constant per sub-benchmark; moving it reduces per-iter overhead.

Apply this diff:

-        b.Run(size.name, func(b *testing.B) {
+        b.Run(size.name, func(b *testing.B) {
             haystack, needles := generateTestData(
                 size.haystackSize, size.needleSize, size.needleCount, 256,
             )
-            b.ResetTimer()
+            b.SetBytes(int64(len(haystack)))
+            b.ResetTimer()
             for b.Loop() {
                 findSequence(haystack, needles)
-                b.SetBytes(int64(len(haystack)))
             }
         })
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abb9bb5 and c3b99d1.

📒 Files selected for processing (4)
  • pattern/pattern.go (3 hunks)
  • pattern/pattern_test.go (2 hunks)
  • pattern/substring.go (0 hunks)
  • pattern/substring_test.go (0 hunks)
💤 Files with no reviewable changes (2)
  • pattern/substring_test.go
  • pattern/substring.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pattern/pattern.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (4)
pattern/pattern_test.go (4)

9-9: Import looks correct.

Needed for strings.Repeat in tests/benchmarks.


600-607: LGTM.

Simple, deterministic byte generator for the given charset.


608-610: LGTM.

Handy helper to keep benchmarks tidy.


517-556: No action needed: go.mod and CI workflows already require Go 1.24+

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
pattern/pattern_test.go (1)

586-598: Fix compile error and off-by-one; guard oversized needles in generateTestData.

Looping over an int won’t compile; current pos calc can panic and excludes the last valid start.

 func generateTestData(haystackSize, needleSize, needleCount, charset int) ([]byte, [][]byte) {
 	haystack := generateRandomBytes(haystackSize, charset)
 
 	needles := make([][]byte, needleCount)
-	for i := range needleCount {
-		pattern := generateRandomBytes(needleSize, charset)
-		pos := rand.Intn(len(haystack) - needleSize)
+	for i := 0; i < needleCount; i++ {
+		plen := needleSize
+		if plen > len(haystack) {
+			plen = len(haystack)
+		}
+		pattern := generateRandomBytes(plen, charset)
+		maxStart := len(haystack) - plen
+		pos := 0
+		if maxStart > 0 {
+			pos = rand.Intn(maxStart + 1)
+		}
 		copy(haystack[pos:], pattern)
 		needles[i] = pattern
 	}
 
 	return haystack, needles
 }
🧹 Nitpick comments (1)
pattern/pattern_test.go (1)

487-494: Preallocate needlesB to avoid slice growth (tiny test micro-opt).

Pre-sizing eliminates reallocations in the helper.

-func testFindSequence(a *assert.Assertions, cnt int, needles []string, haystack string) {
-	var needlesB [][]byte
-	for _, needle := range needles {
-		needlesB = append(needlesB, []byte(needle))
-	}
+func testFindSequence(a *assert.Assertions, cnt int, needles []string, haystack string) {
+	needlesB := make([][]byte, len(needles))
+	for i, needle := range needles {
+		needlesB[i] = []byte(needle)
+	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3b99d1 and 258a412.

📒 Files selected for processing (4)
  • pattern/pattern.go (3 hunks)
  • pattern/pattern_test.go (2 hunks)
  • pattern/substring.go (0 hunks)
  • pattern/substring_test.go (0 hunks)
💤 Files with no reviewable changes (2)
  • pattern/substring_test.go
  • pattern/substring.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pattern/pattern.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
pattern/pattern_test.go (3)

517-556: Fix non-existent testing.B.Loop and drop redundant []byte cast; set bytes once.

This won’t compile (b.Loop doesn’t exist). Also []byte(c.haystack) is redundant; c.haystack is already []byte.

 for _, tc := range testCases {
   for i, c := range tc.cases {
     b.Run(tc.name+"-"+strconv.Itoa(i), func(b *testing.B) {
-      for b.Loop() {
-        findSequence([]byte(c.haystack), c.needles)
-      }
+      b.SetBytes(int64(len(c.haystack)))
+      b.ResetTimer()
+      for i := 0; i < b.N; i++ {
+        findSequence(c.haystack, c.needles)
+      }
     })
   }
 }

558-584: Fix benchmark loop and SetBytes placement.

Use the standard b.N loop; move SetBytes outside the per-iteration body.

 b.Run(size.name, func(b *testing.B) {
   haystack, needles := generateTestData(
     size.haystackSize, size.needleSize, size.needleCount, 256,
   )
-  b.ResetTimer()
-  for b.Loop() {
-    findSequence(haystack, needles)
-    b.SetBytes(int64(len(haystack)))
-  }
+  b.SetBytes(int64(len(haystack)))
+  b.ResetTimer()
+  for i := 0; i < b.N; i++ {
+    findSequence(haystack, needles)
+  }
 })

586-598: Fix compile error, off-by-one, and guard oversize needles.

  • for i := range needleCount won’t compile (cannot range over int).
  • rand.Intn(len(haystack)-needleSize) can panic and excludes the last valid start.
  • Allow needles longer than haystack by clamping length.
 func generateTestData(haystackSize, needleSize, needleCount, charset int) ([]byte, [][]byte) {
   haystack := generateRandomBytes(haystackSize, charset)

   needles := make([][]byte, needleCount)
-  for i := range needleCount {
-    pattern := generateRandomBytes(needleSize, charset)
-    pos := rand.Intn(len(haystack) - needleSize)
+  for i := 0; i < needleCount; i++ {
+    plen := needleSize
+    if plen > len(haystack) {
+      plen = len(haystack)
+    }
+    pattern := generateRandomBytes(plen, charset)
+    maxStart := len(haystack) - plen
+    pos := 0
+    if maxStart > 0 {
+      pos = rand.Intn(maxStart + 1) // include last valid index
+    }
     copy(haystack[pos:], pattern)
     needles[i] = pattern
   }

   return haystack, needles
 }
🧹 Nitpick comments (1)
pattern/pattern_test.go (1)

487-495: Helper looks good; preallocate slice capacity.

Minor nit: avoid reallocations when building needlesB.

 func testFindSequence(a *assert.Assertions, cnt int, needles []string, haystack string) {
-	var needlesB [][]byte
+	needlesB := make([][]byte, 0, len(needles))
 	for _, needle := range needles {
 		needlesB = append(needlesB, []byte(needle))
 	}
 	res := findSequence([]byte(haystack), needlesB)
 	a.Equal(cnt, res, "wrong total number of matches")
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 258a412 and 098bcb7.

📒 Files selected for processing (4)
  • pattern/pattern.go (3 hunks)
  • pattern/pattern_test.go (2 hunks)
  • pattern/substring.go (0 hunks)
  • pattern/substring_test.go (0 hunks)
💤 Files with no reviewable changes (2)
  • pattern/substring_test.go
  • pattern/substring.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pattern/pattern.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (4)
pattern/pattern_test.go (4)

9-9: Import of strings is appropriate.

Used by strings.Repeat in tests/benchmarks. LGTM.


496-515: Tests cover useful cases.

Nice spread of scenarios; no issues spotted.


600-606: LGTM.

Simple, efficient generator for benchmarks.


608-610: LGTM.

Tiny helper keeps test cases readable.

@dkharms
Copy link
Member Author

dkharms commented Sep 11, 2025

@coderabbitai pause

Copy link

coderabbitai bot commented Sep 11, 2025

✅ Actions performed

Reviews paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
benchmarks Features or improvements over our benchmark processes performance Features or improvements that positively affect seq-db performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use bytes.Index instead of KMP for substring search
3 participants