
Conversation

steffenyount

The pico-sdk and RP2040 hardware provide a few facilities that improve performance by moving runtime code and data into SRAM:

  1. "pico/platform/sections.h" currently provides the "__not_in_flash", "__not_in_flash_func", and "__time_critical_func" macros for placing runtime code and data into SRAM by assigning them linker section names in the source code.
  2. The pico-sdk CMake scripts allow any of four binary types to be selected with similarly named project properties for the RP2040: "default", "blocked_ram", "copy_to_ram", or "no_flash"
  3. The RP2040's eXecute-In-Place (XIP) cache has its own connection to the main AHB bus and provides SRAM speeds on cache hits when retrieving runtime code and data from flash
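
As a quick illustration of the macros in item 1, here is a minimal sketch of how they are typically applied (the variable and function names are made up for the example):

#include <stdint.h>
#include "pico/platform/sections.h"

// Illustrative only: a lookup table kept out of flash, in SRAM
static uint32_t __not_in_flash("example_tables") lookup_table[256];

// Illustrative only: a hot function placed in SRAM via its linker section name
static int __not_in_flash_func(example_fast_helper)(int x) {
    return lookup_table[x & 0xff] + x;
}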

But this regime isn't perfect. The 16kB of XIP cache and its connection to the main AHB bus go mostly unused for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds, leaving some performance opportunities unrealized in their implementations.

The RP2040's eXecute-In-Place (XIP) cache can be disabled by clearing its CTRL.EN bit, which allows its 16kB of memory to be used directly as SRAM.
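
For reference, a minimal sketch of that using the SDK's register definitions (assuming the RP2040 headers; the helper name is illustrative, and in practice this would need to happen early in startup, before anything is copied into XIP RAM):

#include "hardware/address_mapped.h"    // hw_clear_bits()
#include "hardware/structs/xip_ctrl.h"  // xip_ctrl_hw, XIP_CTRL_EN_BITS

// Illustrative helper: disable the XIP cache so its 16kB backing memory
// becomes directly addressable SRAM at XIP_SRAM_BASE (0x15000000 on RP2040).
static void example_enable_xip_cache_as_ram(void) {
    hw_clear_bits(&xip_ctrl_hw->ctrl, XIP_CTRL_EN_BITS);
}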

These changes aim to update the pico-sdk to support the following:

  1. Use the "__time_critical_func" macro to place runtime code into XIP RAM for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds
  2. Add two new binary types for the RP2040: "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram"
  3. Add a new "PICO_USE_XIP_CACHE_AS_RAM" CMake property to enable the XIP cache's use as RAM for time-critical instructions in PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds
  4. Add two new CMake functions, "pico_sections_not_in_flash(TARGET [list_of_sources])" and "pico_sections_time_critical(TARGET [list_of_sources])", that target selected source files, or a whole CMake build target's list of source files, for placement into RAM and/or XIP RAM

I believe I've achieved these 4 goals, but note that I've only tested them manually with CMake-based builds on the RP2040 hardware that I have. I have made an effort to fail fast when configuration properties are incompatible, and also to stay compatible with the preexisting section names and the preexisting linker scripts.

Fixes #2653

Author


this is basically a copy of memmap_copy_to_ram.ld with new XIP_RAM sections added

Author


this is basically a copy of memmap_no_flash.ld with new XIP_RAM sections added

@steffenyount steffenyount changed the title Facilitate RP2040 XIP-cache-as-RAM feature (#2653) Facilitate RP2040 XIP-cache-as-RAM feature #2653 Sep 9, 2025
@steffenyount steffenyount changed the title Facilitate RP2040 XIP-cache-as-RAM feature #2653 Facilitate RP2040 XIP-cache-as-RAM feature Sep 9, 2025
@@ -66,7 +66,8 @@ SECTIONS
         KEEP (*(.embedded_block))
         __embedded_block_end = .;
         KEEP (*(.reset))
-    }
+        . = ALIGN(4);
+    } > FLASH
Author


This change was ported here to align with the implementation found in the rp2350/memmap_copy_to_ram.ld version of this file.

@lurch
Contributor

lurch commented Sep 9, 2025

This isn't my area of expertise, but just out of curiosity...

But this regime isn't perfect. The 16kB of XIP cache and its connection to the main AHB bus go mostly unused for PICO_COPY_TO_RAM and PICO_NO_FLASH binary type builds, leaving some performance opportunities unrealized in their implementations.

Does this actually increase performance, or does it just increase the amount of usable memory by moving some stuff out of main SRAM into the XIP-SRAM?

Add a couple new "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram" binary type builds for the RP2040

Presumably this only works for binaries less than 16kB? And for binaries that small, wouldn't they mostly end up persisting in the XIP cache anyway when run as a regular flash binary? 🤔

@steffenyount
Author

Does this actually increase performance, or does it just increase the amount of usable memory by moving some stuff out of main SRAM into the XIP-SRAM?

For my application I'm running a PICO_COPY_TO_RAM build with dual cores and multiple DMA channels active in the background. I've got a lot of work to get done on the RP2040 in a limited CPU cycle budget.

For testing I've got a custom systick-based logging loop that I can run on both cores (C0 and C1) simultaneously (a rough sketch of the measurement technique follows the scenario data below). Here's what some of my performance numbers look like in the following four scenarios:

  1. instructions in main RAM shared with the logger's data output and no DMA activity in main RAM
  • One log write completes in ~20 CPU cycles (+-2)
print_tick_logs()
C1[   1] 16777195       20: 1 after
C0[   1] 16777193       22: 1 after
C1[   2] 16777175       20: 2 after
C0[   2] 16777173       20: 2 after
C1[   3] 16777155       20: 3 after
C0[   3] 16777153       20: 3 after
C1[   4] 16777134       21: 4 after
C0[   4] 16777133       20: 4 after
C1[   5] 16777114       20: 5 after
C0[   5] 16777113       20: 5 after
C1[   6] 16777094       20: 6 after
C0[   6] 16777093       20: 6 after
C1[   7] 16777074       20: 7 after
C0[   7] 16777072       21: 7 after
C0[   8] 16777052       20: 8 after
C1[   8] 16777052       22: 8 after
C0[   9] 16777030       22: 9 after
C1[   9] 16777030       22: 9 after
C0[  10] 16777012       18: 10 after
C1[  10] 16777010       20: 10 after
  2. instructions in XIP RAM with the logger's data output in main RAM and no DMA activity in main RAM
  • One log write completes in ~22 CPU cycles (+-2)
print_tick_logs()
C1[   1] 16777192       23: 1 after
C0[   1] 16777191       23: 1 after
C1[   2] 16777170       22: 2 after
C0[   2] 16777169       22: 2 after
C1[   3] 16777148       22: 3 after
C0[   3] 16777147       22: 3 after
C1[   4] 16777126       22: 4 after
C0[   4] 16777125       22: 4 after
C1[   5] 16777104       22: 5 after
C0[   5] 16777103       22: 5 after
C1[   6] 16777082       22: 6 after
C0[   6] 16777081       22: 6 after
C1[   7] 16777060       22: 7 after
C0[   7] 16777059       22: 7 after
C1[   8] 16777038       22: 8 after
C0[   8] 16777037       22: 8 after
C0[   9] 16777015       22: 9 after
C1[   9] 16777014       24: 9 after
C1[  10] 16776995       19: 10 after
C0[  10] 16776994       21: 10 after
  3. instructions in main RAM shared with the logger's data output and concurrent DMA activity in main RAM
  • One log write completes in ~26 CPU cycles (+-7)
print_tick_logs()
C0[   1] 16777190       24: 1 after
C1[   1] 16777182       33: 1 after
C0[   2] 16777165       25: 2 after
C1[   2] 16777152       30: 2 after
C0[   3] 16777136       29: 3 after
C1[   3] 16777125       27: 3 after
C0[   4] 16777111       25: 4 after
C1[   4] 16777099       26: 4 after
C0[   5] 16777087       24: 5 after
C1[   5] 16777069       30: 5 after
C0[   6] 16777062       25: 6 after
C1[   6] 16777043       26: 6 after
C0[   7] 16777036       26: 7 after
C1[   7] 16777022       21: 7 after
C0[   8] 16777011       25: 8 after
C1[   8] 16776987       35: 8 after
C0[   9] 16776982       29: 9 after
C0[  10] 16776961       21: 10 after
C1[   9] 16776953       34: 9 after
C1[  10] 16776926       27: 10 after
  4. instructions in XIP RAM with the logger's data output in main RAM and concurrent DMA activity in main RAM
  • One log write completes in ~24 CPU cycles (+-3)
print_tick_logs()
C0[   1] 16777190       25: 1 after
C1[   1] 16777188       27: 1 after
C0[   2] 16777165       25: 2 after
C1[   2] 16777162       26: 2 after
C0[   3] 16777142       23: 3 after
C1[   3] 16777139       23: 3 after
C0[   4] 16777117       25: 4 after
C1[   4] 16777114       25: 4 after
C0[   5] 16777093       24: 5 after
C1[   5] 16777091       23: 5 after
C0[   6] 16777070       23: 6 after
C1[   6] 16777069       22: 6 after
C0[   7] 16777045       25: 7 after
C1[   7] 16777043       26: 7 after
C0[   8] 16777022       23: 8 after
C1[   8] 16777019       24: 8 after
C0[   9] 16776995       27: 9 after
C1[   9] 16776994       25: 9 after
C0[  10] 16776972       23: 10 after
C1[  10] 16776969       25: 10 after
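
(For reference, the general shape of the SysTick-based measurement technique mentioned above, as a minimal illustrative sketch and not my actual logger code:)

#include <stdint.h>
#include <stdio.h>
#include "hardware/structs/systick.h"

// Illustrative only: time a code region in CPU cycles using the 24-bit
// SysTick down-counter clocked from the processor clock.
static void example_measure(void) {
    systick_hw->rvr = 0x00FFFFFF;   // maximum 24-bit reload value
    systick_hw->cvr = 0;            // writing clears the current value
    systick_hw->csr = 0x5;          // ENABLE | CLKSOURCE (processor clock)

    uint32_t before = systick_hw->cvr;
    // ... code under test ...
    uint32_t after = systick_hw->cvr;

    // the counter counts down, so elapsed cycles = (before - after) mod 2^24
    printf("%lu cycles\n", (unsigned long)((before - after) & 0x00FFFFFF));
}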

Each CPU fetch pulls two (16-bit) instructions per 32-bit bus read, so theoretically the two CPUs pulling instructions over the shared XIP I/O bus will see some contention at first, but will then naturally start interleaving their instruction fetches and be able to sustain one instruction per CPU per cycle. This contention likely accounts for the extra ~2 CPU cycles seen while using XIP RAM in scenario 2 vs using main RAM in scenario 1.
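
In round numbers (an illustrative back-of-the-envelope, assuming 16-bit instructions and one 32-bit read per cycle on the shared XIP bus): 1 fetch/cycle x 2 instructions/fetch = 2 instructions/cycle total, split across 2 CPUs = a sustained 1 instruction per CPU per cycle.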

In scenarios 3 and 4, where there's heavy contention with the single DMA engine copying data around in main RAM, we see benefits from running instructions out of XIP RAM: it is faster, yes, but maybe more importantly the timing spread is tighter.

So to answer your question: running instructions for both CPUs out of XIP RAM does imply a small performance hit vs main RAM when there's little contention in main RAM, but when there is contention in main RAM it can perform better in both speed and timing predictability.

Add a couple new "copy_to_ram_using_xip_ram" and "no_flash_using_xip_ram" binary type builds for the RP2040

Presumably this only works for binaries less than 16kB? And for binaries that small, wouldn't they mostly end up persisting in the XIP cache anyway when run as a regular flash binary? 🤔

You can use the new "pico_sections_time_critical()" CMake function to put your whole project's instructions into the 16kB of XIP RAM if they fit, but the changes I've provided are really geared towards moving a targeted subset of functions into that 16kB of XIP RAM. This can be done either by applying the "__time_critical_func" macro in code directly, or by calling the "pico_sections_time_critical()" CMake function in build scripts with a list of source files whose functions should all be placed into XIP RAM.

For COPY_TO_RAM and NO_FLASH builds, binaries are not resident in the XIP cache.

Note: the linker complains and fails the build if your binaries exceed the 16kB capacity of the XIP RAM.
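
For example, marking a single hot function in source would look something like this (a sketch with a made-up function name; with these changes and one of the XIP-RAM build configurations, it would be placed into XIP RAM rather than main RAM):

#include "pico/platform/sections.h"

// Illustrative only: a hot routine marked time critical so the linker places
// it according to the selected binary type (XIP RAM for the new build types)
static void __time_critical_func(example_dma_irq_handler)(void) {
    // ... service the interrupt ...
}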

@lurch
Contributor

lurch commented Sep 9, 2025

Thanks for answering my (probably naive) questions.

Each CPU fetch pulls two (16-bit) instructions per 32-bit bus read, so theoretically the two CPUs pulling instructions over the shared XIP I/O bus will see some contention at first, but will then naturally start interleaving their instruction fetches and be able to sustain one instruction per CPU per cycle. This contention likely accounts for the extra ~2 CPU cycles seen while using XIP RAM in scenario 2 vs using main RAM in scenario 1.

If your code is very timing-critical, I wonder if you might be able to get even better performance by allocating the non-striped SRAM banks to specific CPUs and/or DMA channels? See section 2.6.2 in https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf

@steffenyount
Author

steffenyount commented Sep 10, 2025

If your code is very timing-critical, I wonder if you might be able to get even better performance by allocating the non-striped SRAM banks to specific CPUs and/or DMA channels? See section 2.6.2 in https://datasheets.raspberrypi.com/rp2040/rp2040-datasheet.pdf

I looked at this. The striped main RAM does a reasonably good job of distributing accesses from multiple CPUs and DMA engines across the smaller number of main RAM bus connections. Assigning dedicated SRAM banks is only going to be a big performance win when access to an individual bank is serialized through a single CPU or DMA engine at a time.

For example, the core0 processor's stack is placed in the scratch_y SRAM bank and the core1 processor's stack in the scratch_x SRAM bank. Since stack access is limited to only one CPU thread, this works out nicely. Note: those scratch banks can also be used to store thread-local data for their respective CPUs without fear of access contention (as sketched below).
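
(A minimal sketch of that idea, using the SDK's existing __scratch_x/__scratch_y section macros; the variable names are illustrative:)

#include <stdint.h>
#include "pico/platform.h"

// Illustrative only: per-core state kept in each core's dedicated scratch bank,
// so neither the other core nor the DMA engines contend for these accesses.
static uint32_t __scratch_y("core0_state") core0_counter;  // core0's bank
static uint32_t __scratch_x("core1_state") core1_counter;  // core1's bank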

My main consideration for the changes here is the contention between CPU instruction fetches and CPU data fetches/writes. In the RP2040's "default" and "blocked_ram" binary type builds the CPU instruction fetches go to the XIP cache, so that instruction+data contention doesn't happen there. In the "copy_to_ram" and "no_flash" binary type builds, however, the CPU instruction fetches share the four main RAM bus connections with the CPU data fetches/writes, so the instruction+data contention can happen. This PR introduces a build-time toggle for the RP2040's "copy_to_ram" and "no_flash" binary type builds allowing their CPU instruction fetches to go to XIP RAM instead, so the instruction+data contention can be avoided in these binary type builds too.

The RP2040 has limited bus connections and it looks like the RP2350 would improve this situation dramatically for my use case, but my aim with this PR is not so grand. I did some work to figure this stuff out for my own exploration, and I want to share this change which I think makes it easy for other developers to unlock the XIP bus connection for use within the RP2040's "copy_to_ram" and "no_flash" binary type builds.
