Skip to content

Conversation

ArcaneNibble
Copy link

This PR implements manual d-cache support for the OHCI host controller driver, using the approach described in #3208

This has been tested with only the USB HID driver on the TI AM1808

The initial motivation for doing this is to remove the `used` flag
in the TD. If we use this flag, we end up being required to read from TDs
that the OHCI controller might be modifying (i.e. the OHCI controller
logically owns the TD). This happens when we try to allocate a
new, empty TD while the OHCI host controller is working on a transfer.

Move the `used` flag to `gtd_extra_data_t`. This data is only used
by the CPU, and the OHCI controller never accesses it.

The existing allocation method for TDs does *not* put an empty TD
onto each ED (i.e it does *not* do what is shown in Figure 5-6
of the OHCI specification). Instead, the NextTD field of the last
TD is set to 0. The TailP field of the ED is also set to 0.
This works in many cases. However, this implementation means that
the CPU may end up trying to write to the NextTD field of an
in-progress transfer while the OHCI host controller logically owns it.

Change the implementation to use an empty TD, as suggested by the
specification, for endpoints other than EP0. This avoids the
above issue. It is not necessary to make the change for EP0
because only at most one TD can ever be pending at a time.

The above change should also remove the need for the stall workaround.

In the future, we want to modify the code to access EDs through
an uncached mapping. Because uncached mappings are slow, we want to
access EDs as little as possible. Currently, when a TD completes,
we access an ED in order to figure out the device address and
endpoint number of the TD which was completed.

Because moving `used` to `gtd_extra_data_t` necessitates expanding it,
we have enough room to also store the device address and endpoint number
of the TD. This patch does so.

With the above two changes, we no longer need to access an ED when
a TD completes. Also remove the `index` field from TDs as it is
no longer necessary.
Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR. I have a few questions regarding the changes.

TU_ATTR_WEAK bool hcd_dcache_clean(void const* addr, uint32_t data_size) { (void) addr; (void) data_size; return true; }
TU_ATTR_WEAK bool hcd_dcache_invalidate(void const* addr, uint32_t data_size) { (void) addr; (void) data_size; return true; }
#ifndef hcd_dcache_uncached
#define hcd_dcache_uncached(x) (x)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what this function does specifically ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my application, I define it like this:

#define hcd_dcache_uncached(x) (*(volatile __typeof__(x) *)((uintptr_t)(&(x)) + MAGIC_OFFSET))

In words, it takes the address of x, adds a specific offset to it, and then accesses this new address (the address with the offset added) with the same type as the original x.

This is why it is a macro and not a function -- the return type is generic and always matches the type of x.

The reason this is able to work is because of my application's platform setup, which is not shown anywhere in the TinyUSB code. The SoC I am using has a MMU. As part of setting up the MMU, I configure:

Virtual Address Physical Address Attributes
0xCxxxxxxx 0xCxxxxxxx Cacheable
0xDxxxxxxx 0xCxxxxxxx Not cacheable

System RAM is at physical address 0xCxxxxxxx, and the system RAM can be accessed from two different virtual address ranges. Most data (including OHCI TDs) is accessed using the cacheable range at VA 0xCxxxxxxx. In order to pass data between the CPU and the OHCI controller, manual cache operations are used.

However, OHCI EDs are accessed using the uncacheable range at VA 0xDxxxxxxx. This accesses the same physical memory, but it bypasses the cache. This means that I do not have to use manual cache operations on EDs. I can access individual words inside an ED without worrying that the other words in the same ED will get cached.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we can use the following virt to phys to do the conversion. You can implement it in the app in the way to check the range to do the offset correctly ? That would make thing consistent

TU_ATTR_WEAK extern void* tusb_app_virt_to_phys(void *virt_addr);
TU_ATTR_WEAK extern void* tusb_app_phys_to_virt(void *phys_addr);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those functions have slightly different semantics, especially since TDs do not go through the alternate uncached memory address. TDs use the normal memory mapping and use explicit cache operations.

It could be possible to use an approach like the following:

  • Modify these functions to take a parameter specifying the type of object being translated (TD, ED, HCCA)
  • Add compiler directives to place uncached data in a different linker section (placing it at 0xDxxxxxxx)

Would this be a better approach?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just realized that this should be possible to implement without adding a type parameter.

Instead, the reason I did not originally implement it this way was because I would have to split up ohci_data_t into separate cacheable and uncacheable parts.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it, but hcd_dcache_uncached() is too specific to your set-up. It makes the code not as readible as I would like. Can you centralize the uncached in the get_ed() function only ?

TU_VERIFY_STATIC( sizeof(ohci_gtd_t) == 16, "size is not correct" );
TU_VERIFY_STATIC( sizeof(ohci_gtd_t) == CFG_TUH_MEM_DCACHE_ENABLE ? CFG_TUH_MEM_DCACHE_LINE_SIZE : 16, "size is not correct" );

typedef union {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we need to have these as separated typedef ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These separate typedefs are used because I want to be able to explicitly and manually read and write each word in the ED. This is because uncached memory access is slow.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use following typedef to keep the code least modified

  // Word 0
  union {
    struct {
        uint32_t dev_addr          : 7;
        uint32_t ep_number         : 4;
        uint32_t pid               : 2;
        uint32_t speed             : 1;
        uint32_t skip              : 1;
        uint32_t is_iso            : 1;
        uint32_t max_packet_size   : 11;
              // HCD: make use of 5 reserved bits
        uint32_t used              : 1;
        uint32_t is_interrupt_xfer : 1;
        uint32_t is_stalled        : 1;
        uint32_t                   : 2;
    };
    uint32_t word0;
  }


typedef struct {
uint16_t expected_bytes; // up to 8192 bytes so max is 13 bits
uint8_t dev_addr : 7;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need this here ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid having to access an ED in done_queue_isr, because I am in general trying to access an ED as infrequently as possible, and I noticed that there were two reasons to access an ED there:

  1. To apply a workaround for a STALL situation
  2. To obtain the dev_addr and ep_addr for calling hcd_event_xfer_complete

For reason 1, I believe the STALL workaround is not necessary if EDs have an empty TD like the OHCI specification suggests.

After making that change to TD allocation, I noticed that reason 2 can be eliminated by moving those fields out of the ED to somewhere else.

As an alternative, the dev_addr and ep_addr fields can be packed into unused bits in the TD word 0. However, this does not work for isochronous TDs (not enough free bits).

Using TD word 0 to store those fields and then packing used alongside expected_bytes can save some memory. Do you want me to use this approach instead?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you quote the section where OHCI specs suggest that. I wrote this long time ago and forgot most of the thing.

@hathach hathach requested a review from Copilot September 13, 2025 05:19
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements manual d-cache support for the OHCI host controller driver to prevent data corruption when the CPU and OHCI controller access shared memory structures. The changes add cache-conscious data structure alignment, explicit cache operations, and modify struct access patterns to work with cache line boundaries.

Key changes:

  • Adds conditional alignment of OHCI transfer descriptors to cache line boundaries when d-cache is enabled
  • Introduces weak cache operation functions (clean/invalidate) and uncached access macros
  • Refactors struct field access to use union types and cache-aware operations throughout the driver

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/portable/ohci/ohci.h Adds cache line alignment for TDs, introduces union types for ED fields, and extensive documentation on cache coherency design
src/portable/ohci/ohci.c Implements cache operations throughout the driver, refactors struct access patterns, and updates transfer descriptor management

This code is written very carefully to always use an uncached view
of memory to read/write EDs. An uncached view must *always* be used,
or else cache behavior can corrupt the ED.

As part of this change, combine access into as few word-sized accesses
as possible. This makes the code perform better. Doing this involves
giving type names to the bitfields that make up the ED's data words.
Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check below feedback


typedef struct {
uint16_t expected_bytes; // up to 8192 bytes so max is 13 bits
uint8_t dev_addr : 7;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you quote the section where OHCI specs suggest that. I wrote this long time ago and forgot most of the thing.

TU_ATTR_WEAK bool hcd_dcache_clean(void const* addr, uint32_t data_size) { (void) addr; (void) data_size; return true; }
TU_ATTR_WEAK bool hcd_dcache_invalidate(void const* addr, uint32_t data_size) { (void) addr; (void) data_size; return true; }
#ifndef hcd_dcache_uncached
#define hcd_dcache_uncached(x) (x)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it, but hcd_dcache_uncached() is too specific to your set-up. It makes the code not as readible as I would like. Can you centralize the uncached in the get_ed() function only ?

@ArcaneNibble
Copy link
Author

Are you referring to allocating an empty TD in the OHCI specification? That's shown in Figure 5-6:

Screen Shot 2025-09-16 at 5 28 35 AM Screen Shot 2025-09-16 at 5 29 08 AM

The reason for doing this is so that you don't have to reach back and modify the last TD (which might be in use by the OHCI hardware already) when inserting at the end.

@Ryzee119
Copy link
Contributor

Are you referring to allocating an empty TD in the OHCI specification? That's shown in Figure 5-6:

This will likely fix #2687 too

@hathach
Copy link
Owner

hathach commented Sep 23, 2025

@ArcaneNibble @Ryzee119 yeah, I agree we should follow the specs suggestion. I was trying to save SRAM footprint, but it cause more troubles than expected. I am in middle of other works, will review this asap, please be patient meanwhile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants