
Conversation

@sam-herman
Contributor

Description

Multiple database projects such as C* and OpenSearch/Solr/Lucene use an LSM mechanism with frequent merges.
This in turn creates high overhead, since we are forced to reconstruct the entire graph from scratch upon every merge.
A more economical approach would be to pick a leading graph that was previously persisted to disk and incrementally add the nodes of the smaller graphs to it.

This PR makes some of the required changes to support that behavior in upstream systems such as C* and OpenSearch.

Changes

  1. Serializable Neighbor Distance Cache - Add a serializable cache that stores the node distances within the OnDiskGraphIndex for faster re-creation of the OnHeapGraphIndex when it is read back from disk. The cache is separate and optional, so we can choose whether or not to apply it without any breaking changes to the current OnDiskGraphIndex. It can later be folded into the on-disk format if we choose to (a sketch of the per-node payload it would serialize follows this list).
  2. Helper Methods And Constructors - Add helper methods and constructors to facilitate easier usage by other projects with graph merge use cases.
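
For illustration, the per-node payload such a cache would serialize is roughly the neighbor ordinals plus their precomputed scores; the layout below is a sketch, not the exact format in this PR:

```java
// Sketch only: illustrative field layout for one cache entry, not the PR's actual format.
final class NeighborDistanceEntrySketch {
    final int node;          // graph ordinal of the node
    final int[] neighbors;   // up to maxDegree neighbor ordinals
    final float[] scores;    // precomputed similarity score for each neighbor

    NeighborDistanceEntrySketch(int node, int[] neighbors, float[] scores) {
        this.node = node;
        this.neighbors = neighbors;
        this.scores = scores;
    }
}
```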

Testing

Added tests for graph overlap and recall for reconstructed graphs.

jshook and others added 5 commits October 7, 2025 08:01
get incremental graph generation working
add serialization for NeighborsCache
add ord-mapping logic from PQ vectors to RAVV
add ord mapping from RAVV to graph creation

Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
@github-actions
Contributor

github-actions bot commented Oct 7, 2025

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@marianotepper
Contributor

marianotepper commented Oct 7, 2025

@sam-herman thanks for the contribution.

I created an alternative branch where:

  • Interface boundaries are respected (no unprotected downcasts).
  • The NeighborsScoreCache is removed in favor of recomputing the scores live when creating the OnHeapGraphIndex. The NeighborsScoreCache was essentially forcing us to write the graph to disk twice and read it twice (i.e., instantiate two copies in memory, the cache and the OnHeapGraph).

The last change is mainly due to a major refactoring that needs to happen in OnHeapGraphIndex so that it no longer stores the scores. Storing them is creating issues; see #522 for example. So in essence, once the scores are removed, we will have no more need for a cache or for recomputing the scores themselves. I'll be working on this change next week.

@sam-herman
Contributor Author


@marianotepper the issue with recomputing the scores of the entire graph is that it forces a brute-force pass.
For frequent merges between small graphs and large graphs, this is a situation we would prefer to avoid in order to keep the performance we've previously seen in opensearch-project/opensearch-jvector#167.

One option that we mentioned was to not include scores at all in the OnHeapGraphIndex and then only recalculate the relative neighbor scores upon traversal. Would it make sense to keep the score cache for now and remove it later, when we are ready to implement this approach?

@marianotepper
Contributor

I'm not sure what you mean by a "brute force pass." If you mean that it involves brute-force search, that's not the case.
It just requires computing maxDegree similarities for each node. Granted, this is more work than if the scores are written to disk, but it is still relatively minor and of similar algorithmic complexity to just reading the graph and the scores. Additionally, it could be parallelized with relative ease.
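
As a rough illustration of that recomputation (the Scorer callback and adjacency arrays below are illustrative stand-ins, not jVector's actual types), each node's stored neighbors can be re-scored independently, so the pass parallelizes trivially:

```java
// Sketch of the recomputation being discussed: for each node, re-score its (at most
// maxDegree) stored neighbors. Scorer and int[][] are stand-ins, not jVector's API.
import java.util.stream.IntStream;

final class RescoreSketch {
    interface Scorer { float score(int a, int b); }

    static float[][] rescore(int[][] neighbors, Scorer scorer) {
        float[][] scores = new float[neighbors.length][];
        // Each node is independent of the others, so the pass parallelizes trivially.
        IntStream.range(0, neighbors.length).parallel().forEach(node -> {
            int[] nbrs = neighbors[node];
            float[] s = new float[nbrs.length];
            for (int i = 0; i < nbrs.length; i++) {
                s[i] = scorer.score(node, nbrs[i]);   // one similarity per stored edge
            }
            scores[node] = s;
        });
        return scores;
    }
}
```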

However, the "cache" (it is not truly a cache, as it stores the whole OnHeapGraphIndex in memory, just with a very slightly different data structure) requires 2x the heap space of the OnHeapGraphIndex in memory plus additional storage. Additionally, as I mentioned in my original post, multiple abstractions had to be broken for the implementation to work. That is essentially because the class is copying the data structure of the OnHeapGraphIndex and persisting it. If we are going to do that, we might as well just write/load the OnHeapGraphIndex directly.

IMO, accepting a data structure that increases memory usage and breaks abstractions, just to deprecate it in a couple of weeks seems ill advised.

@sam-herman
Contributor Author

> I'm not sure what you mean by a "brute force pass." If you mean that it involves brute-force search, that's not the case. It just requires computing maxDegree similarities for each node.

Yes, I mean it's computing maxDegree similarities for each node.

> Granted, this is more work than if the scores are written to disk, but it is still relatively minor and of similar algorithmic complexity to just reading the graph and the scores. Additionally, it could be parallelized with relative ease.

Theoretical complexity is the same, but in practice, for example with vectors of 1536 dimensions (d),
we will be computing n * M * d operations, which is 1536x more than plainly copying a pre-calculated score.
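
To put illustrative numbers on it (example figures, not measurements from the PR), for n = 1,000,000 nodes and M = 32 neighbors:

  recompute:   n * M * d = 1,000,000 * 32 * 1536 ≈ 4.9 * 10^10 multiply-accumulates
  copy cached: n * M     = 1,000,000 * 32        = 3.2 * 10^7 score copies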

> However, the "cache" (it is not truly a cache, as it stores the whole OnHeapGraphIndex in memory, just with a very slightly different data structure) requires 2x the heap space of the OnHeapGraphIndex in memory plus additional storage. Additionally, as I mentioned in my original post, multiple abstractions had to be broken for the implementation to work. That is essentially because the class is copying the data structure of the OnHeapGraphIndex and persisting it. If we are going to do that, we might as well just write/load the OnHeapGraphIndex directly.

Agreed, keeping this in memory is not ideal. What if we add an iterator so that we never actually read it fully into memory? That's easy to do, and this way it can be used only during construction without increasing the memory footprint in any way.

> IMO, accepting a data structure that increases memory usage and breaks abstractions, just to deprecate it in a couple of weeks seems ill advised.

It is a good point which I can totally agree with, but are we sure this is going to be the case? What do you think?
I would imagine we have a few nuances to consider, such as how we would recompute distances upon insertions into an existing graph for pruning.

@marianotepper
Contributor

> Theoretical complexity is the same, but in practice, for example with vectors of 1536 dimensions (d),
> we will be computing n * M * d operations, which is 1536x more than plainly copying a pre-calculated score.

This calculation is a bit off; what truly counts is the number of vectorized computations, so the multiplication by d inflates the number arbitrarily. In particular, when building graphs with PQ vectors (a good idea for the OpenSearch plugin if you are not using it already), the amount of additional computation drops even further. Additionally, we can easily parallelize this step; it does not need to be carried out sequentially. That said, I did agree in my initial comment that this approach adds some compute, so I'm not going to argue this further.

In exchange, it spares memory. Consider that the node IDs + scores occupy 32 * (4 + 4) = 256 bytes per node for a maxDegree of 32. For a graph with 10^9 nodes this amounts to a heap usage of 238GB. Since we are loading it twice, we would need roughly 500GB just to store the graph and the scores. Half of that is not needed.
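
Spelling out that arithmetic:

  per node:   32 neighbors * (4-byte ID + 4-byte score) = 256 bytes
  one copy:   256 bytes * 10^9 nodes = 256 GB ≈ 238 GiB of heap
  two copies: ≈ 476 GiB, i.e. roughly 500 GB for the graph plus the cache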

> What if we add an iterator so that we never actually read it fully into memory?

As I mentioned, the NeighborsScoreCache severely breaks abstractions because it obtains a copy of the internal data structures of OnHeapGraphIndex just to provide a write/load mechanism, whether or not that copy is fully materialized as a second in-memory instance. Essentially,
OnHeapGraphIndex = List<Map<Integer, Neighbors>>
NeighborsScoreCache = Map<Integer, Map<Integer, Neighbors>>
so both structures are the same thing; they store the same information in the same way. There is absolutely no reason for NeighborsScoreCache to exist.

For this reason, in the current design, the ImmutableGraphIndex in OnHeapGraphIndex.convertToHeap is being used for absolutely no reason, as the NeighborsScoreCache already has all the information needed to instantiate the OnHeapGraph (because it IS an OnHeapGraph). As I mentioned in my previous comment, we might as well just provide write/load functions for OnHeapGraphIndex.

> It is a good point which I can totally agree with, but are we sure this is going to be the case? What do you think?
> I would imagine we have a few nuances to consider, such as how we would recompute distances upon insertions into an existing graph for pruning.

We are computing similarities for insertions anyway; I don't see why this would be a problem. Of course, there is a balance between storing things in memory versus doing additional compute, as in any SW problem. At billion+ scales, holding things in memory just to avoid a relatively benign amount of extra compute does not make sense to me. Of course, all of this needs to be carefully implemented and analyzed at a large scale. Not wanting to go off on a tangent, I'll let this point be for the context of this PR.


All in all, I think a better plan would be to deploy a solution in several stages:

  1. Live with the solution without the NeighborsScoreCache for now. I'd parallelize the node insertions during the conversions (I can help with that). This uses memory more conservatively, does not break abstractions, and, granted, carries additional computation. Ideally, we should evaluate the impact of this additional compute. If it's unsustainable as @sam-herman claims, I'd be happy to re-evaluate this solution, as always.
  2. We can create a new type of MutableGraphIndex that uses an underlying ImmutableGraphIndex (an on-disk graph) without loading it in memory, while having the ability to override those nodes that mutate in an in-memory cache (this cache holds inserted nodes too). Then, upon writing this mutable graph to disk, we only need to persist the changes. This approach is much leaner in its use of heap and would scale as long as we can bound the number of changes in each merge. It also avoids recomputing distances/scores for any node that is not altered by the merge. This is not hard to do and could easily be done in a couple of weeks (a rough sketch of the idea follows this list).
  3. Long-term, we should have the ability to mutate an on-disk graph directly, without going through an in-memory structure. The support is essentially in place in the random-access OnDiskGraphIndexWriter. The reason I'm delaying this approach in this plan is the insertion of back edges, which would be problematic. Maybe we could use an in-memory data structure for holding back edges; I'm not sure yet. This point requires more thinking before plowing forward.
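
To make point 2 of the plan concrete, here is a minimal sketch of the overlay idea; every name in it is hypothetical and not a class in the repository:

```java
// Sketch: keep the bulk of the graph on disk and hold only mutated/inserted nodes on heap.
// OverlayGraphSketch and DiskNeighbors are hypothetical names, not the repository's design.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class OverlayGraphSketch {
    interface DiskNeighbors { int[] neighborsOf(int node); }   // read-only on-disk view

    private final DiskNeighbors base;                           // persisted leading graph
    private final Map<Integer, int[]> overlay = new ConcurrentHashMap<>(); // changed/inserted nodes only

    OverlayGraphSketch(DiskNeighbors base) { this.base = base; }

    int[] neighborsOf(int node) {
        // The overlay wins; untouched nodes fall through to the on-disk graph.
        int[] changed = overlay.get(node);
        return changed != null ? changed : base.neighborsOf(node);
    }

    void setNeighbors(int node, int[] neighbors) { overlay.put(node, neighbors); }

    // On write-out, only the overlay (the delta) needs to be persisted/merged.
    Map<Integer, int[]> pendingChanges() { return overlay; }
}
```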

@marianotepper
Contributor

marianotepper commented Oct 8, 2025

After a virtual discussion with @sam-herman, we agreed on the following compromise:

  1. We will not be using the NeighborsScoreCache.
  2. We will temporarily use save/load methods in OnHeapGraphIndex to serialize/deserialize the on-heap graph. These methods are labeled as experimental, as we expect them to disappear in the future (see the plan in my previous comment, which still stands).

This solution still uses more disk space, but we can live with that in the short term. It does not have a heap memory overhead.

I have committed it to this branch, as I did not want to override the branch of this PR unilaterally. @sam-herman can you take a look?
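
For context, a rough sketch of the agreed workflow; every type and method signature below is an illustrative stand-in (the experimental save/load methods in the branch may look different):

```java
// Sketch of the merge path: deserialize the leading graph, insert the new segment's
// nodes incrementally, and persist the merged result. All names are hypothetical.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

final class MergeWorkflowSketch {
    interface HeapGraph {                       // stand-in for OnHeapGraphIndex
        void save(DataOutput out) throws IOException;
    }
    interface HeapGraphLoader {                 // stand-in for the experimental load path
        HeapGraph load(DataInput in) throws IOException;
    }
    interface Inserter { void addNode(int ordinal); }  // stand-in for incremental insertion

    static void merge(HeapGraphLoader loader, DataInput in, DataOutput out,
                      Inserter inserter, int[] newOrdinals) throws IOException {
        HeapGraph leading = loader.load(in);    // 1. deserialize the previously persisted leading graph
        for (int ord : newOrdinals) {
            inserter.addNode(ord);              // 2. incrementally insert the new segment's nodes
        }
        leading.save(out);                      // 3. persist the merged graph for the next merge round
    }
}
```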

@sam-herman
Contributor Author


Hi @marianotepper, thank you for the changes; they look good to me! Do you mind moving the changes over to this PR? I will later move them upstream as well, into opensearch-project/opensearch-jvector#167, to use the load and save methods to serialize/deserialize the OnHeapGraph.

@marianotepper
Contributor

Changes merged. This temporary solution still involves one downcast to be able to call OnHeapGraphIndex.save. The changes are labeled as experimental because they will not stick around in the long term.

Contributor

@jshook jshook left a comment


Thanks for the discussion, this looks ok for the interim

Contributor

@MarkWolters MarkWolters left a comment


I left a couple of comments on things that I think could be resolved but I don't see anything significant enough to block

Collaborator

@tlwillke tlwillke left a comment


LGTM.

@tlwillke tlwillke merged commit 6bcc19b into main Oct 10, 2025
12 checks passed
@tlwillke tlwillke deleted the reconstruct-heap-graph-from-disk-graph branch October 10, 2025 23:05
Comment on lines 228 to -230
}

public int getDimension() {
Member

@michaeljmarshall michaeljmarshall Oct 17, 2025


We are using this in C*, why did this get removed?

Member

@michaeljmarshall michaeljmarshall Oct 17, 2025


I submitted a PR proposing we re-add it. #550

/**
 * Helper method for the special case that mapping between graph node IDs and ravv ordinals is the identity function.
 */
static BuildScoreProvider randomAccessScoreProvider(RandomAccessVectorValues ravv, VectorSimilarityFunction similarityFunction) {
    return randomAccessScoreProvider(ravv, IntStream.range(0, ravv.size()).toArray(), similarityFunction);
}
Member


This method operates on a mutable ravv, so it's not valid to have a fixed array as a mapper. Further, this has linear memory complexity to represent an identity function. Instead, we should use the OrdinalMapper class to handle this use case, assuming I understand it correctly.
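
A sketch of that suggestion, assuming an OrdinalMapper is essentially an int-to-int mapping (the exact jVector interface may differ; IntUnaryOperator below is only a stand-in):

```java
// Contrast: a materialized identity array vs. an identity mapping expressed as a function.
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

final class OrdinalMappingSketch {
    // O(n) memory: a fixed array snapshot, which also goes stale if the ravv grows.
    static int[] materializedIdentity(int size) {
        return IntStream.range(0, size).toArray();
    }

    // O(1) memory: an identity mapping that stays valid as new ordinals are added.
    static IntUnaryOperator identityMapper() {
        return ordinal -> ordinal;
    }
}
```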
