
Conversation

@AmirhosseinPoolad (Contributor) commented Oct 21, 2025

Previously we used a non-conformant proxy iterator to sort the RR edges (which are stored in a struct-of-arrays fashion). That was very fragile and caused multiple problems when updating compilers and standard libraries. This PR replaces the previous edge-sorting approach with a slower but standards-conformant method.

Sorting edges is now slower and requires building four entirely new edge arrays, but since we only sort twice while building the RR graph, this should be fine.
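
To make the idea concrete, here is a minimal sketch of the permutation-plus-gather pattern described above. It is an illustration rather than the PR's code: plain std::vector and int fields stand in for the vtr::vector<RREdgeId, ...> columns, and sorting by source node is an assumed key.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Illustration only: std::vector and int fields stand in for the real
// vtr::vector<RREdgeId, ...> columns, and the sort key (source node) is an assumption.
struct EdgeColumns {
    std::vector<int> src_node;
    std::vector<int> dest_node;
    std::vector<int> switch_id;
};

void sort_edges(EdgeColumns& edges) {
    const std::size_t num_edges = edges.src_node.size();

    // 1. Sort a vector of plain indices rather than the struct-of-arrays data
    //    itself, so std::sort never needs a proxy iterator.
    std::vector<std::size_t> order(num_edges);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return edges.src_node[a] < edges.src_node[b];
    });

    // 2. Gather every column into freshly allocated arrays in the sorted order,
    //    then move them over the old columns.
    std::vector<int> new_src(num_edges), new_dest(num_edges), new_switch(num_edges);
    for (std::size_t new_index = 0; new_index < num_edges; ++new_index) {
        const std::size_t edge_index = order[new_index];
        new_src[new_index] = edges.src_node[edge_index];
        new_dest[new_index] = edges.dest_node[edge_index];
        new_switch[new_index] = edges.switch_id[edge_index];
    }
    edges.src_node = std::move(new_src);
    edges.dest_node = std::move(new_dest);
    edges.switch_id = std::move(new_switch);
}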

@github-actions github-actions bot added the lang-cpp C/C++ code label Oct 21, 2025
@AmirhosseinPoolad AmirhosseinPoolad changed the title [WIP] Add edge sorting method to rr_graph_storage Add edge sorting method to rr_graph_storage Oct 21, 2025
@AmirhosseinPoolad (Contributor, Author)

I ran some performance tests and the runtime numbers don't seem to have changed much. Now that I think about it, this might actually be faster than the in-place sorting we had before: this method swaps only indices while sorting and copies the edge data around just once, while the previous method swapped (and copied) edge data far more during sorting.

@soheilshahrouz (Contributor) left a comment

Thanks @AmirhosseinPoolad.

I suggested a few minor changes.

new_edge_src_node_[RREdgeId(new_index)] = edge_src_node_[RREdgeId(edge_index)];
new_edge_dest_node_[RREdgeId(new_index)] = edge_dest_node_[RREdgeId(edge_index)];
new_edge_switch_[RREdgeId(new_index)] = edge_switch_[RREdgeId(edge_index)];
new_edge_remapped_[RREdgeId(new_index)] = edge_remapped_[RREdgeId(edge_index)];
Contributor

If you use a vtr::vector<RREdgeId, RREdgeId> for edge_indices, you can use the vtr::vector::pairs() method to automatically iterate over new_index and edge_index. This would also get rid of the type casting.

Contributor Author

I think this would somewhat obfuscate what the code is doing and make it harder to understand. I did the new_index cast on a separate line to make these four lines more readable.

@github-actions github-actions bot added the VPR VPR FPGA Placement & Routing Tool label Oct 21, 2025
@AmirhosseinPoolad (Contributor, Author)

Ran the VTR circuits; the results look very good.

Metric                 Vs Master
Total VPR runtime      0.968
Max mem. usage         1.030
RRG generation time    0.805

This appears to be a pretty good change overall. The new code is (imo) more readable, less fragile, and faster, with only a slight memory cost.

@vaughnbetz (Contributor) commented Oct 22, 2025 via email

@AmirhosseinPoolad (Contributor, Author)

I did the permutations one by one instead of all at once, as you suggested. It certainly helps with memory usage and doesn't show a noticeable slowdown beyond what could be attributed to machine load/noise.

Metric                 At once vs Master    One by one vs Master
Total VPR runtime      0.968                0.971
Max mem. usage         1.030                1.012
RRG generation time    0.805                0.799

Same results, but only for circuits whose memory usage exceeds 200 MB:

Metric                 At once vs Master    One by one vs Master
Total VPR runtime      0.960                0.967
Max mem. usage         1.055                1.025
RRG generation time    0.803                0.800

Detailed results are added to the previous link.
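
For illustration, here is a minimal sketch of the one-by-one variant under the same simplifying assumptions as the earlier sketch (plain std::vector columns and a precomputed sort order): only one scratch column is alive at a time, so the peak overhead is one extra column rather than four.

#include <cstddef>
#include <utility>
#include <vector>

// Apply a precomputed permutation to a single column: order[new_index] is the
// old position whose element should end up at new_index.
template <typename T>
void apply_permutation(std::vector<T>& col, const std::vector<std::size_t>& order) {
    std::vector<T> scratch(col.size());
    for (std::size_t new_index = 0; new_index < order.size(); ++new_index) {
        scratch[new_index] = col[order[new_index]];
    }
    col = std::move(scratch); // the old buffer is released before the next column is permuted
}

// Hypothetical call sites, permuting one column at a time:
//   apply_permutation(edge_src_node, order);
//   apply_permutation(edge_dest_node, order);
//   apply_permutation(edge_switch, order);
//   apply_permutation(edge_remapped, order);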

@vaughnbetz (Contributor) left a comment

Nice! The code looks good, but I have a question/concern about .shrink_to_fit and .capacity. I think we can make the code safer about memory size (and maybe save some memory, depending on what the standard library is doing).


// Since vec could have any type, we need to figure out its type to allocate new_vec.
// The scary std::remove_reference stuff does exactly that; this does nothing other than build a new vector the same size as 'vec'.
typename std::remove_reference<decltype(vec)>::type new_vec(vec.size(), default_value);
Contributor

I wonder if we have to do a .reserve or .shrink_to_fit to make sure this new_vec does not have any extra capacity. We're careful to do that in the RR graph, so I imagine the original edge_src_node_ etc. have done that. If we don't do it for the new one, there's some chance the standard library created a vector with extra capacity for us, which we don't want with these big vectors.

Seems worth testing, and also a good practice to use .shrink_to_fit for safety in case some versions of the standard library default to capacity == requested constructor size for big vectors and some default to something else (e.g. 2x size for any constructor that asks for a size).
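
A minimal sketch of the safeguard being discussed, using std::vector for illustration (whether the real vtr::vector needs the same treatment is an assumption): shrink_to_fit() is a non-binding request, but it makes the intent explicit on these large per-edge vectors.

#include <cassert>
#include <cstddef>
#include <vector>

int main() {
    const std::size_t num_edges = 1'000'000;

    // The sized constructor usually yields capacity == size, but the standard
    // only guarantees capacity >= size.
    std::vector<int> new_vec(num_edges, 0);
    new_vec.shrink_to_fit(); // non-binding request to drop any slack capacity

    assert(new_vec.capacity() >= new_vec.size()); // guaranteed; equality is the expectation
    return 0;
}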

Contributor Author

t_rr_graph_storage actually has a shrink_to_fit method that does this for all the vectors in t_rr_graph_storage. Currently it's only used when building tileable RR graphs. I think we should test calling t_rr_graph_storage::shrink_to_fit at the end of RR graph generation in a separate PR, since doing it in the sorting function wouldn't change the peak memory usage of the sorting function and would only help in the long term. It might also have an extra memory and runtime cost (the underlying allocator might have to allocate a new, smaller buffer and copy all the elements to it).

@AmirhosseinPoolad (Contributor, Author) commented Oct 23, 2025

Ran Titan on the current version of the code (without shrink_to_fit). Full data is added to the link.

Metric                 Vs Master
Total VPR runtime      1.050
Max mem. usage         0.996
RRG generation time    0.878

I'm pretty sure the runtime values were affected by machine load. Packing (which has nothing to do with the RR graph) took 4.6% longer, placement took 7.5% longer (with the exact same number of swaps), and routing took 1.2% longer. QoR is exactly the same as master, which strongly implies that the RR graph has not changed at all. I also wrote the RR graph for stereo_vision to a file and the diff was empty.

I think the new code uses less memory here because the RR graph generation code over-allocates the edge arrays (it first allocates 10x the number of nodes for edges and doubles that each time it needs more), while the new sorted edge vectors allocate exactly what they need.

@vaughnb-cerebras

Thanks. Looks OK.
Placement may include rr-graph building time; we need to build it to create the placement delay estimator, so depending on where we put the timers it may end up in placement time. You should be able to tell from the log file, as we should print messages when we create the rr-graph, compute the placement delay matrix, etc.
I still think we should do a shrink_to_fit since it gives us more control .... Can you try that out?

@vaughnb-cerebras

Hmmm .... good point that the shrink_to_fit could hurt if we get a copy. How about this: just print out the .size() and .capacity() of these four vectors on some large design. If they're equal, I'll stop worrying about it, since at least one version of the standard library does the logical thing and sets .capacity to .size when the constructor specifies an initial size. Let me know what you find ...
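
A sketch of the diagnostic being requested; the helper below is hypothetical, and the commented-out call sites reuse the member names from the snippet quoted earlier, assuming vtr::vector exposes size() and capacity() like std::vector does.

#include <cstddef>
#include <cstdio>

// Print size vs. capacity for one container; anything exposing size() and
// capacity() (std::vector, and presumably vtr::vector) fits.
template <typename Vec>
void report_capacity(const char* name, const Vec& v) {
    std::printf("%-20s size = %zu  capacity = %zu\n", name,
                static_cast<std::size_t>(v.size()),
                static_cast<std::size_t>(v.capacity()));
}

// Hypothetical call sites on a large design:
//   report_capacity("edge_src_node_", edge_src_node_);
//   report_capacity("edge_dest_node_", edge_dest_node_);
//   report_capacity("edge_switch_", edge_switch_);
//   report_capacity("edge_remapped_", edge_remapped_);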
