Facilitate RP2040 XIP-cache-as-RAM feature #2654
Conversation
The pico-sdk and RP2040 hardware provide a few facilities that improve performance by moving runtime code and data into SRAM:

1. "pico/platform/sections.h" currently provides the "__not_in_flash", "__not_in_flash_func", and "__time_critical_func" macros for placing runtime code and data into SRAM by assigning them linker section names in the source code.
2. The pico-sdk CMake scripts allow any of four binary types to be selected with similarly named project properties for the RP2040: "default", "blocked_ram", "copy_to_ram", or "no_flash".
3. The RP2040's eXecute-In-Place (XIP) cache has its own connection to the main AHB bus and provides SRAM speeds on cache hits when retrieving runtime code and data from flash.

But this regime isn't perfect. The 16kB of XIP cache and its connection to the main AHB bus go mostly unused in PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds, leaving some performance opportunities unrealized. The XIP cache can be disabled by clearing its CTRL.EN bit, which allows its 16kB of memory to be used directly as SRAM.

These changes aim to update the pico-sdk to support the following:

1. Use the "__time_critical_func" macro to place runtime code into XIP RAM for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds.
2. Add two new binary types for the RP2040: "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram".
3. Add a new "PICO_USE_XIP_CACHE_AS_RAM" CMake property to enable the XIP cache's use as RAM for time-critical instructions in PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds.
4. Add two new CMake functions, "pico_sections_not_in_flash(TARGET [list_of_sources])" and "pico_sections_time_critical(TARGET [list_of_sources])", that target selected source files, or a whole CMake build target's list of source files, for placement into RAM and/or XIP RAM.

I believe I've achieved these 4 goals, but note: I've only tested them manually with CMake-based builds on the RP2040 hardware that I have. That said, I have made an effort to fail fast when configuration properties are incompatible, and to stay compatible with the preexisting section names and the preexisting linker scripts.
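As a rough illustration of how the pieces above might fit together in a project's CMakeLists.txt (the target name and source files are hypothetical, and the property/function usage is a sketch based on this PR's description, not a tested recipe):

```cmake
# Hypothetical project; pico_set_binary_type() is an existing pico-sdk function
add_executable(my_app main.c dsp.c)
target_link_libraries(my_app pico_stdlib)

# Option A: select one of the new binary types directly (goal 2)
pico_set_binary_type(my_app copy_to_ram_using_xip_ram)

# Option B: keep a preexisting binary type and flip the new toggle (goal 3);
# assumes PICO_USE_XIP_CACHE_AS_RAM is applied as a target property:
#   pico_set_binary_type(my_app copy_to_ram)
#   set_target_properties(my_app PROPERTIES PICO_USE_XIP_CACHE_AS_RAM 1)

# Goal 4: place all functions from dsp.c into the time-critical (XIP RAM) sections
pico_sections_time_critical(my_app dsp.c)
```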
this is basically a copy of memmap_copy_to_ram.ld with new XIP_RAM sections added
this is basically a copy of memmap_no_flash.ld with new XIP_RAM sections added
```diff
@@ -66,7 +66,8 @@ SECTIONS
     KEEP (*(.embedded_block))
     __embedded_block_end = .;
     KEEP (*(.reset))
-    }
+    . = ALIGN(4);
+    } > FLASH
```
This change was ported here to align with the implementation found in the rp2350/memmap_copy_to_ram.ld version of this file.
This isn't my area of expertise, but just out of curiosity...
Does this actually increase performance, or does it just increase the amount of usable memory by moving some stuff out of main SRAM into the XIP-SRAM?
Presumably this only works for binaries less than 16kB? And for binaries that small, wouldn't they mostly end up persisting in the XIP cache anyway when run as a regular flash binary? 🤔
For my application I'm running a PICO_COPY_TO_RAM build with dual cores and multiple DMA channels active in the background. I've got a lot of work to get done on the RP2040 in a limited CPU cycle budget. For testing I've got a custom systick based logging loop that I can run on both cores (C0 and C1) simultaneously. Here's what some of my performance numbers look like in the following 4 scenarios:
Each CPU can fetch 2 instructions in 1 cycle, so theoretically the two CPUs pulling instructions over the XIP I/O bus will see some contention at first but will then naturally start interleaving their instruction fetches and be able to sustain 1 instruction per CPU cycle. This contention likely accounts for the extra 2 CPU cycles seen while using XIP RAM in scenario 2 vs using main RAM in scenario 1. In scenarios 3 and 4, where there's heavy contention with the single DMA engine copying data around in main RAM, we see benefits from running instructions out of XIP RAM. It is faster, yes, but maybe more importantly the time ranges are tighter. So to answer your question: running instructions for both CPUs out of XIP RAM does imply a small performance hit vs main RAM when there's little contention in main RAM, but when there is contention in main RAM it can perform better in both speed and timing predictability.
You can use the new "pico_sections_time_critical()" CMake function to put your whole project's instructions into the 16kB of XIP RAM if they fit, but the changes I've provided are really geared towards moving a targeted subset of functions into the 16kB of XIP RAM. This is done either by applying the "__time_critical_func" macro in code directly, or by calling the "pico_sections_time_critical()" CMake function in build scripts with a list of source files whose functions should all be placed into XIP RAM. For COPY_TO_RAM and NO_FLASH builds, binaries are not resident in the XIP cache. Note: the linker complains and fails the build if your binaries exceed the 16kB capacity of the XIP RAM.
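The in-source route mentioned above uses the existing pico-sdk macro; a minimal sketch (the function and its body are hypothetical, shown only to illustrate where the macro attaches):

```c
// Sketch: marking one hot function for placement in the time-critical
// sections, which this PR routes to XIP RAM when the feature is enabled.
#include <stdint.h>
#include "pico/platform/sections.h"  // provides __time_critical_func

// Hypothetical hot loop; only this function moves, not the whole binary.
void __time_critical_func(process_block)(uint16_t *buf, int n) {
    for (int i = 0; i < n; i++) {
        buf[i] = (uint16_t)(buf[i] >> 1);  // example work
    }
}
```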
Thanks for answering my (probably naive) questions.
If your code is very timing-critical, I wonder if you might be able to get even better performance by allocating the non-striped SRAM banks to specific CPUs and/or DMA channels? See section 2.6.2 in https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf
I looked at this. The striped main RAM does a reasonably good job of distributing multiple CPU and DMA engine accesses across a smaller number of main RAM bus connections. Assigning dedicated SRAM banks is only a big performance win when access to an individual SRAM bank is serialized through just one CPU or DMA engine at a time. For example, the core0 processor's stack is placed in the scratch_y SRAM bank and the core1 processor's stack is placed in the scratch_x SRAM bank. Since stack access is limited to only one CPU thread, this works out nicely. Note: those scratch banks can also be used to store thread-local data for their respective CPUs without fear of access contention.

My main consideration for the changes here is that there can be contention between the CPU instruction fetches and the CPU data fetches/writes. In the RP2040's "default" and "blocked_ram" binary type builds the CPU instruction fetches go to the XIP cache, so the instruction+data contention doesn't happen there. But in the "copy_to_ram" and "no_flash" binary type builds the CPU instruction fetches share the four main RAM bus connections with the CPU data fetches/writes, so the instruction+data contention can happen. This PR introduces a build-time toggle for the "copy_to_ram" and "no_flash" binary type builds allowing their CPU instruction fetches to go to the XIP RAM, so the instruction+data contention can be avoided in these binary type builds too.

The RP2040 has limited bus connections, and it looks like the RP2350 would improve this situation dramatically for my use case, but my aim with this PR is not so grand. I did some work to figure this stuff out for my own exploration, and I want to share this change, which I think makes it easy for other developers to unlock the XIP bus connection for use within the RP2040's "copy_to_ram" and "no_flash" binary type builds.
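The per-core scratch-bank pattern described above can be expressed with the SDK's existing section attribute macros; a minimal sketch (the variable names and group labels are hypothetical):

```c
// Sketch: keeping per-core state out of the striped banks, matching the
// default stack placement noted above (core1 -> scratch_x, core0 -> scratch_y).
#include <stdint.h>
#include "pico/platform.h"  // provides __scratch_x(group) / __scratch_y(group)

static uint32_t __scratch_x("core1_state") core1_counter;  // touched only by core 1
static uint32_t __scratch_y("core0_state") core0_counter;  // touched only by core 0
```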
Fixes #2653