- 
                Notifications
    You must be signed in to change notification settings 
- Fork 139
Reconstruct heap graph from disk graph #536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
get incremental graph generation working add serialization for NeighborsCache add ord-mapping logic from PQ vectors to RAVV add ord mapping from RAVV to graph creation Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
Signed-off-by: Samuel Herman <[email protected]>
| Before you submit for review: 
 If you did not complete any of these, then please explain below. | 
| @sam-herman thanks for the contribution. I created an alternative branch where: 
 The last change is mainly due to a major refactoring that needs to happen in OnHeapGraphIndex to not store the scores anymore. This is creating issues, see #522 for example. So in essence, once the scores are removed, we will have no more need for a cache or for recomputing the scores themselves. I'll be working on this change next week. | 
| 
 @marianotepper the issue with recomputing scores of the entire graph is that it forces a brute force pass. One option that we mentioned was to not include scores at all in the OnHeapGraphIndex and then only re calculating the relative neighbor scores upon traversal. Would it make sense to keep the score cache for now and later remove it when we are ready to implement this approach? | 
| I'm not sure what you mean by a "brute force pass." If you mean that it involves brute force search, that's not the case. However, the "cache" (it is not truly a cache as it stores the whole OnHeapGraphIndex in memory, just with a very slightly different data structure) requires 2x the heap space of the OnHeapGraphIndex in memory and additional storage. Additionally, as I mention in my original post, multiple abstractions needed to be broken for the implementation to work. That is essentially because the class is copying the data structure of the OnHeapGraphIndex and persisting it. If we are going to do that, we might as well just write/load the OnHeapGraphIndex directly. IMO, accepting a data structure that increases memory usage and breaks abstractions, just to deprecate it in a couple of weeks seems ill advised. | 
| 
 Yes, I mean it's computing maxDegree similarities for each node. 
 Theoretical complexity is the same, but in practice for an example of vectors with 1536 dimension (d) 
 Agree, keeping this in-memory is not ideal, what about if we add an iterator such that we never actually read it fully to memory? That's easy to do and this way it can be used only during construction without increasing memory footprint in any way. 
 It is a good point which I can totally agree with, are we sure this is going to be the case though? What do you think? | 
| 
 This calculation is a bit off, what truly counts is the number of vectorized computations, so the multiplication by d is inflating the number arbitrarily. In particular, when building graphs with PQ vectors (a good idea for the OpenSearch plugin if you are not using it already) the amount of additional computation drops even further. Additionally, we can easily parallelize this step, it dose not need to be carried sequentially. That said, I did agree in my initial comment that this approach adds some compute, so I'm not going to argue this further. In exchange, it spares memory. Consider that the node IDs + scores occupy 32 * (4 + 4) = 256 bytes per node for a maxDegree of 32. For a graph with 10^9 nodes this amounts to a heap usage of 238GB. Since we are loading it twice, we would need 500GB just store the graph and the scores. Half of that, is not need. 
 As I mentioned, the NeighborsScoreCache severely breaks abstractions because it is obtaining a copy of the internal data structures of OnHeapGraphIndex just to provide a write/load mechanism, whether this is fully materialized as a second actual copy or not. Essentially, Because of this reason, in the current design, the ImmutableGraphIndex in OnHeapGraphIndex.converToHeap is being used for absolutely no reason, as the NeighborsScoreCache already has all the information needed to instantiate the OnHeapGraph (because it IS an OnHeapGraph). As I mention in my previous comment, we might as well just provide write/load functions for OnHeapGraphIndex. 
 We are computing similarities for insertions anyway. I don't see why this would be a problem. Of course, there is a balance between storing things in memory versus having additional compute, as in any SW problem. At billion+ scales, holding things in memory just to avoid a relatively benign amount of extra compute does not make sense to me. Of course, all of this needs to be carefully implemented and analyzed at a large scale. Not wanting to go on a tangent, I'll live this point be for the context of this PR. All in all, I think a better plan would be to deploy a solution in several stages: 
 | 
| After a virtual discussion with @sam-herman, we agreed on the following compromise: 
 This solution still uses more disk space, but we can live with that in the short term. It does not have a heap memory overhead. I have committed it to this branch. As I did not want to override the branch of this PR unilaterally. @sam-herman can you take a look? | 
| 
 Hi @marianotepper thank you for the changes, the changes look good to me! Do you mind moving the changes over to this PR? I will later move it upstream as well into opensearch-project/opensearch-jvector#167 to use the  | 
| Changes merged. This temporary solution still involves one downcast to be able to call OnHeapGraphIndex.save. The changes are labeled as experimental because they will not stick around in the long term. | 
        
          
                jvector-base/src/main/java/io/github/jbellis/jvector/graph/OnHeapGraphIndex.java
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the discussion, this looks ok for the interim
        
          
                jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
          
            Show resolved
            Hide resolved
        
              
          
                jvector-base/src/main/java/io/github/jbellis/jvector/graph/MutableGraphIndex.java
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of comments on things that I think could be resolved but I don't see anything significant enough to block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| } | ||
|  | ||
| public int getDimension() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using this in C*, why did this get removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted a PR proposing we re-add it. #550
| * Helper method for the special case that mapping between graph node IDs and ravv ordinals is the identity function. | ||
| */ | ||
| static BuildScoreProvider randomAccessScoreProvider(RandomAccessVectorValues ravv, VectorSimilarityFunction similarityFunction) { | ||
| return randomAccessScoreProvider(ravv, IntStream.range(0, ravv.size()).toArray(), similarityFunction); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method operates on a mutable ravv, so it's not valid to have a fixed array as a mapper. Further, this has linear memory complexity to represent an identify function. Instead, we should use the OrdinalMapper class to handle this use case, assuming I understand it correctly.
Description
Multiple database projects such as C* and OpenSearch/Solr/Lucene use LSM mechanism with frequent merges.
This in turn creates a high overhead that forces us to reconstruct the entire graph from scratch upon every merge.
A more economic approach would be to pick a leading graph that was previously persisted to disk and incrementally add small graph nodes to it.
This PR makes some of the required changes to support that behavior in upstream systems such as C* and OpenSearch.
Changes
OnDiskGraphIndexfor a faster re-creation of theOnHeapGraphIndexwhen read back from disk. The cache is separate and is optional, therefore we can choose whether to apply it or not, without any breaking changes to the currentOnDiskGraphIndex. This can later be augmented to a format if we choose to.Testing
Added tests for graph overlap and recall for reconstructed graphs.