
Conversation

@sam-herman
Contributor

Description

Multiple database projects such as C* and OpenSearch/Solr/Lucene use an LSM mechanism with frequent merges.
This in turn creates high overhead, since we are forced to reconstruct the entire graph from scratch upon every merge.
A more economical approach would be to pick a leading graph that was previously persisted to disk and incrementally add the nodes of the smaller graphs to it.

This PR makes some of the required changes to support that behavior in upstream systems such as C* and OpenSearch.

Changes

  1. Serializable Neighbor Distance Cache - Add a serializable cache that stores the node distances within the OnDiskGraphIndex for faster re-creation of the OnHeapGraphIndex when it is read back from disk. The cache is separate and optional, so we can choose whether or not to apply it without any breaking changes to the current OnDiskGraphIndex. It can later be folded into the on-disk format if we choose to (a sketch of the per-node payload it would serialize follows this list).
  2. Helper Methods And Constructors - Add helper methods and constructors to facilitate easier usage by other projects with graph merge use cases.
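
For illustration, the per-node payload such a cache would serialize is roughly the neighbor ordinals plus their precomputed scores; the layout below is a sketch, not the exact format in this PR:

```java
// Sketch only: illustrative field layout for one cache entry, not the PR's actual format.
final class NeighborDistanceEntrySketch {
    final int node;          // graph ordinal of the node
    final int[] neighbors;   // up to maxDegree neighbor ordinals
    final float[] scores;    // precomputed similarity score for each neighbor

    NeighborDistanceEntrySketch(int node, int[] neighbors, float[] scores) {
        this.node = node;
        this.neighbors = neighbors;
        this.scores = scores;
    }
}
```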

Testing

Added tests for graph overlap and recall for reconstructed graphs.

jshook and others added 5 commits October 7, 2025 08:01
get incremental graph generation working
add serialization for NeighborsCache
add ord-mapping logic from PQ vectors to RAVV
add ord mapping from RAVV to graph creation

Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
@github-actions
Contributor

github-actions bot commented Oct 7, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@marianotepper
Contributor

marianotepper commented Oct 7, 2025

@sam-herman thanks for the contribution.

I created an alternative branch where:

  • Interface boundaries are respected (no unprotected downcasts).
  • The NeighborsScoreCache is removed in favor of recomputing the scores live when creating the OnHeapGraphIndex. The NeighborsScoreCache was essentially forcing us to write the graph to disk twice and read it twice (i.e., instantiate two copies in memory, the cache and the OnHeapGraph).

The last change is mainly due to a major refactoring that needs to happen in OnHeapGraphIndex so that it no longer stores the scores. Storing them is creating issues; see #522 for example. So in essence, once the scores are removed, we will have no more need for a cache or for recomputing the scores themselves. I'll be working on this change next week.

@sam-herman
Contributor Author


@marianotepper the issue with recomputing the scores of the entire graph is that it forces a brute-force pass.
For frequent merges between small graphs and large graphs, this is a situation we would prefer to avoid in order to keep the performance we've previously seen in opensearch-project/opensearch-jvector#167.

One option that we mentioned was to not include scores at all in the OnHeapGraphIndex and then only recalculate the relative neighbor scores upon traversal. Would it make sense to keep the score cache for now and remove it later, when we are ready to implement this approach?

@marianotepper
Contributor

I'm not sure what you mean by a "brute force pass." If you mean that it involves brute-force search, that's not the case.
It just requires computing maxDegree similarities for each node. Granted, this is more work than if the scores are written to disk, but it is still relatively minor and of similar algorithmic complexity to just reading the graph and the scores. Additionally, it could be parallelized with relative ease.
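
As a rough illustration of that recomputation (the Scorer callback and adjacency arrays below are illustrative stand-ins, not jVector's actual types), each node's stored neighbors can be re-scored independently, so the pass parallelizes trivially:

```java
// Sketch of the recomputation being discussed: for each node, re-score its (at most
// maxDegree) stored neighbors. Scorer and int[][] are stand-ins, not jVector's API.
import java.util.stream.IntStream;

final class RescoreSketch {
    interface Scorer { float score(int a, int b); }

    static float[][] rescore(int[][] neighbors, Scorer scorer) {
        float[][] scores = new float[neighbors.length][];
        // Each node is independent of the others, so the pass parallelizes trivially.
        IntStream.range(0, neighbors.length).parallel().forEach(node -> {
            int[] nbrs = neighbors[node];
            float[] s = new float[nbrs.length];
            for (int i = 0; i < nbrs.length; i++) {
                s[i] = scorer.score(node, nbrs[i]);   // one similarity per stored edge
            }
            scores[node] = s;
        });
        return scores;
    }
}
```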

However, the "cache" (it is not truly a cache, as it stores the whole OnHeapGraphIndex in memory, just with a very slightly different data structure) requires 2x the heap space of the OnHeapGraphIndex in memory plus additional storage. Additionally, as I mentioned in my original post, multiple abstractions had to be broken for the implementation to work. That is essentially because the class is copying the data structure of the OnHeapGraphIndex and persisting it. If we are going to do that, we might as well just write/load the OnHeapGraphIndex directly.

IMO, accepting a data structure that increases memory usage and breaks abstractions, just to deprecate it in a couple of weeks seems ill advised.

@sam-herman
Contributor Author

> I'm not sure what you mean by a "brute force pass." If you mean that it involves brute-force search, that's not the case. It just requires computing maxDegree similarities for each node.

Yes, I mean it's computing maxDegree similarities for each node.

> Granted, this is more work than if the scores are written to disk, but it is still relatively minor and of similar algorithmic complexity to just reading the graph and the scores. Additionally, it could be parallelized with relative ease.

Theoretical complexity is the same, but in practice, for example with vectors of 1536 dimensions (d),
we will be computing n * M * d operations, which is 1536x more than plainly copying a pre-calculated score.
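
To put illustrative numbers on it (example figures, not measurements from the PR), for n = 1,000,000 nodes and M = 32 neighbors:

  recompute:   n * M * d = 1,000,000 * 32 * 1536 ≈ 4.9 * 10^10 multiply-accumulates
  copy cached: n * M     = 1,000,000 * 32        = 3.2 * 10^7 score copies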

> However, the "cache" (it is not truly a cache, as it stores the whole OnHeapGraphIndex in memory, just with a very slightly different data structure) requires 2x the heap space of the OnHeapGraphIndex in memory plus additional storage. Additionally, as I mentioned in my original post, multiple abstractions had to be broken for the implementation to work. That is essentially because the class is copying the data structure of the OnHeapGraphIndex and persisting it. If we are going to do that, we might as well just write/load the OnHeapGraphIndex directly.

Agreed, keeping this in memory is not ideal. What if we add an iterator so that we never actually read it fully into memory? That's easy to do, and this way it can be used only during construction without increasing the memory footprint in any way.

> IMO, accepting a data structure that increases memory usage and breaks abstractions, just to deprecate it in a couple of weeks seems ill advised.

It is a good point which I can totally agree with, but are we sure this is going to be the case? What do you think?
I would imagine we have a few nuances to consider, such as how we would recompute distances upon insertions into an existing graph for pruning.

@marianotepper
Contributor

> Theoretical complexity is the same, but in practice, for example with vectors of 1536 dimensions (d),
> we will be computing n * M * d operations, which is 1536x more than plainly copying a pre-calculated score.

This calculation is a bit off; what truly counts is the number of vectorized computations, so the multiplication by d inflates the number arbitrarily. In particular, when building graphs with PQ vectors (a good idea for the OpenSearch plugin if you are not using it already), the amount of additional computation drops even further. Additionally, we can easily parallelize this step; it does not need to be carried out sequentially. That said, I did agree in my initial comment that this approach adds some compute, so I'm not going to argue this further.

In exchange, it spares memory. Consider that the node IDs + scores occupy 32 * (4 + 4) = 256 bytes per node for a maxDegree of 32. For a graph with 10^9 nodes this amounts to a heap usage of 238GB. Since we are loading it twice, we would need roughly 500GB just to store the graph and the scores. Half of that is not needed.
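
Spelling out that arithmetic:

  per node:   32 neighbors * (4-byte ID + 4-byte score) = 256 bytes
  one copy:   256 bytes * 10^9 nodes = 256 GB ≈ 238 GiB of heap
  two copies: ≈ 476 GiB, i.e. roughly 500 GB for the graph plus the cache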

> What if we add an iterator so that we never actually read it fully into memory?

As I mentioned, the NeighborsScoreCache severely breaks abstractions because it obtains a copy of the internal data structures of OnHeapGraphIndex just to provide a write/load mechanism, whether or not that copy is fully materialized as a second in-memory instance. Essentially,
OnHeapGraphIndex = List<Map<Integer, Neighbors>>
NeighborsScoreCache = Map<Integer, Map<Integer, Neighbors>>
so both structures are the same thing; they store the same information in the same way. There is absolutely no reason for NeighborsScoreCache to exist.

For this reason, in the current design, the ImmutableGraphIndex in OnHeapGraphIndex.convertToHeap is being used for absolutely no reason, as the NeighborsScoreCache already has all the information needed to instantiate the OnHeapGraph (because it IS an OnHeapGraph). As I mentioned in my previous comment, we might as well just provide write/load functions for OnHeapGraphIndex.

> It is a good point which I can totally agree with, but are we sure this is going to be the case? What do you think?
> I would imagine we have a few nuances to consider, such as how we would recompute distances upon insertions into an existing graph for pruning.

We are computing similarities for insertions anyway; I don't see why this would be a problem. Of course, there is a balance between storing things in memory versus doing additional compute, as in any SW problem. At billion+ scales, holding things in memory just to avoid a relatively benign amount of extra compute does not make sense to me. Of course, all of this needs to be carefully implemented and analyzed at a large scale. Not wanting to go off on a tangent, I'll let this point be for the context of this PR.


All in all, I think a better plan would be to deploy a solution in several stages:

  1. Live with the solution without the NeighborsScoreCache for now. I'd parallelize the node insertions during the conversions (I can help with that). This uses memory more conservatively, does not break abstractions, and, granted, carries additional computation. Ideally, we should evaluate the impact of this additional compute. If it's unsustainable as @sam-herman claims, I'd be happy to re-evaluate this solution, as always.
  2. We can create a new type of MutableGraphIndex that uses an underlying ImmutableGraphIndex (an on-disk graph) without loading it in memory, while having the ability to override those nodes that mutate in an in-memory cache (this cache holds inserted nodes too). Then, upon writing this mutable graph to disk, we only need to persist the changes. This approach is much leaner in its use of heap and would scale as long as we can bound the number of changes in each merge. It also avoids recomputing distances/scores for any node that is not altered by the merge. This is not hard to do and could easily be done in a couple of weeks (a rough sketch of the idea follows this list).
  3. Long-term, we should have the ability to mutate an on-disk graph directly, without going through an in-memory structure. The support is essentially in place in the random-access OnDiskGraphIndexWriter. The reason I'm delaying this approach in this plan is the insertion of back edges, which would be problematic. Maybe we could use an in-memory data structure for holding back edges; I'm not sure yet. This point requires more thinking before plowing forward.
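
To make point 2 of the plan concrete, here is a minimal sketch of the overlay idea; every name in it is hypothetical and not a class in the repository:

```java
// Sketch: keep the bulk of the graph on disk and hold only mutated/inserted nodes on heap.
// OverlayGraphSketch and DiskNeighbors are hypothetical names, not the repository's design.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class OverlayGraphSketch {
    interface DiskNeighbors { int[] neighborsOf(int node); }   // read-only on-disk view

    private final DiskNeighbors base;                           // persisted leading graph
    private final Map<Integer, int[]> overlay = new ConcurrentHashMap<>(); // changed/inserted nodes only

    OverlayGraphSketch(DiskNeighbors base) { this.base = base; }

    int[] neighborsOf(int node) {
        // The overlay wins; untouched nodes fall through to the on-disk graph.
        int[] changed = overlay.get(node);
        return changed != null ? changed : base.neighborsOf(node);
    }

    void setNeighbors(int node, int[] neighbors) { overlay.put(node, neighbors); }

    // On write-out, only the overlay (the delta) needs to be persisted/merged.
    Map<Integer, int[]> pendingChanges() { return overlay; }
}
```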

@marianotepper
Contributor

marianotepper commented Oct 8, 2025

After a virtual discussion with @sam-herman, we agreed on the following compromise:

  1. We will not be using the NeighborsScoreCache.
  2. We will temporarily use save/load methods in OnHeapGraphIndex to serialize/deserialize the on-heap graph. These methods are labeled as experimental, as we expect them to disappear in the future (see the plan in my previous comment, which still stands).

This solution still uses more disk space, but we can live with that in the short term. It does not have a heap memory overhead.

I have committed it to this branch, as I did not want to override the branch of this PR unilaterally. @sam-herman can you take a look?
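
For context, a rough sketch of the agreed workflow; every type and method signature below is an illustrative stand-in (the experimental save/load methods in the branch may look different):

```java
// Sketch of the merge path: deserialize the leading graph, insert the new segment's
// nodes incrementally, and persist the merged result. All names are hypothetical.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

final class MergeWorkflowSketch {
    interface HeapGraph {                       // stand-in for OnHeapGraphIndex
        void save(DataOutput out) throws IOException;
    }
    interface HeapGraphLoader {                 // stand-in for the experimental load path
        HeapGraph load(DataInput in) throws IOException;
    }
    interface Inserter { void addNode(int ordinal); }  // stand-in for incremental insertion

    static void merge(HeapGraphLoader loader, DataInput in, DataOutput out,
                      Inserter inserter, int[] newOrdinals) throws IOException {
        HeapGraph leading = loader.load(in);    // 1. deserialize the previously persisted leading graph
        for (int ord : newOrdinals) {
            inserter.addNode(ord);              // 2. incrementally insert the new segment's nodes
        }
        leading.save(out);                      // 3. persist the merged graph for the next merge round
    }
}
```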

@sam-herman
Contributor Author


Hi @marianotepper, thank you for the changes; they look good to me! Do you mind moving the changes over to this PR? I will later move them upstream as well, into opensearch-project/opensearch-jvector#167, to use the load and save methods to serialize/deserialize the OnHeapGraph.

@marianotepper
Contributor

Changes merged. This temporary solution still involves one downcast to be able to call OnHeapGraphIndex.save. The changes are labeled as experimental because they will not stick around in the long term.

Contributor

@jshook jshook left a comment


Thanks for the discussion, this looks ok for the interim

Contributor

@MarkWolters MarkWolters left a comment


I left a couple of comments on things that I think could be resolved but I don't see anything significant enough to block

Collaborator

@tlwillke tlwillke left a comment


LGTM.

@tlwillke tlwillke merged commit 6bcc19b into main Oct 10, 2025
12 checks passed
@tlwillke tlwillke deleted the reconstruct-heap-graph-from-disk-graph branch October 10, 2025 23:05
Comment on lines 228 to -230
}

public int getDimension() {
Member

@michaeljmarshall michaeljmarshall Oct 17, 2025


We are using this in C*, why did this get removed?

Member

@michaeljmarshall michaeljmarshall Oct 17, 2025


I submitted a PR proposing we re-add it. #550

/**
 * Helper method for the special case that mapping between graph node IDs and ravv ordinals is the identity function.
 */
static BuildScoreProvider randomAccessScoreProvider(RandomAccessVectorValues ravv, VectorSimilarityFunction similarityFunction) {
    return randomAccessScoreProvider(ravv, IntStream.range(0, ravv.size()).toArray(), similarityFunction);
}
Member


This method operates on a mutable ravv, so it's not valid to have a fixed array as a mapper. Further, this has linear memory complexity to represent an identity function. Instead, we should use the OrdinalMapper class to handle this use case, assuming I understand it correctly.
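
A sketch of that suggestion, assuming an OrdinalMapper is essentially an int-to-int mapping (the exact jVector interface may differ; IntUnaryOperator below is only a stand-in):

```java
// Contrast: a materialized identity array vs. an identity mapping expressed as a function.
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

final class OrdinalMappingSketch {
    // O(n) memory: a fixed array snapshot, which also goes stale if the ravv grows.
    static int[] materializedIdentity(int size) {
        return IntStream.range(0, size).toArray();
    }

    // O(1) memory: an identity mapping that stays valid as new ordinals are added.
    static IntUnaryOperator identityMapper() {
        return ordinal -> ordinal;
    }
}
```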
