Skip to content

ICU-23056 CLDR 48 alpha0 integration #3600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

DraganBesevic
Copy link
Contributor

@DraganBesevic DraganBesevic commented Aug 20, 2025

Checklist

  • Required: Issue filed: ICU-23056
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

Alpha0 integration for CLDR 48

Known issues:
ICU-23182, (Formatting for Japanese calendars)
ICU-23183 (Time Zone round trip test)
ICU-23185 (hz and hv patterns)
ICU-23186 (Japanese calendars fails round trip unit tests)
ICU-23187 (TestPersonNames test fail for kk_Arab and shn)
- note, kk_Arab and shn are new locales not yet in ICU (until config file is updated)
ICU-23188 (Compact decimal format adds extra RLM mark for ar-EG)

ALLOW_MANY_COMMITS=true

@pedberg-icu
Copy link
Contributor

pedberg-icu commented Aug 20, 2025

@DraganBesevic getting a crash e.g. in the clang-cpp17 tests:

      IntlCalendarTest {
         ... 
         TestForceGannenNumbering {
intltest: smpdtfmt.cpp:1290: void icu_78::_appendSymbol(icu_78::UnicodeString &, int32_t, const icu_78::UnicodeString *, int32_t): Assertion `0 <= value && value < symbolsCount' failed.

Depending on what is going on here: Maybe skip this test with the logKnownIssue already covering Gannen formatting?

@DraganBesevic
Copy link
Contributor Author

@DraganBesevic getting a crash e.g. in the clang-cpp17 tests:

      IntlCalendarTest {
         ... 
         TestForceGannenNumbering {
intltest: smpdtfmt.cpp:1290: void icu_78::_appendSymbol(icu_78::UnicodeString &, int32_t, const icu_78::UnicodeString *, int32_t): Assertion `0 <= value && value < symbolsCount' failed.

Depending on what is going on here: Maybe skip this test with the logKnownIssue already covering Gannen formatting?

That's weird. I wonder why adding a known issue would cause to crash in CPP 17. I will skip it as it is already a known issue

@mihnita
Copy link
Contributor

mihnita commented Aug 22, 2025

I was able to reproduce it on an Ubuntu system at home.

I am trying to figure out what the cause is. But it is very puzzling, I see no reason for this to fail with the CLDR update and not fail without it.

Nothing in this PR can explain what is going on (just by looking at it).
I'm pretty sure that it will all be "obvious" once I figure out the cause :-)

@mihnita
Copy link
Contributor

mihnita commented Aug 22, 2025

The failure happens in icu4c/source/i18n/smpdtfmt.cpp, line 1536, in this call:

_appendSymbol(appendTo, value, fSymbols->fNarrowEras, fSymbols->fNarrowErasCount);

_appendSymbol fails the U_ASSERT(0 <= value && value < symbolsCount); (line 1290)
because value is 235, and symbolsCount is 232.

The locale failing is Japanese with alternate calendar, 235 is Heisei
(see icu4c/source/data/locales/root.txt in pull request)

I am not sure how to fix this, since it is not just a few entries added, the structure seems changed.
Inspect the icu4c/source/data/locales/root.txt (or any other locale file).
The eras changed from an array:

eras{
  abbreviated{                                                   
    "Taika (645–650)",
    "Hakuchi (650–671)",
    "Hakuhō (672–686)",
    "Shuchō (686–701)",
    ...
    "Heisei",
    ...

to a map:

eras{
  abbreviated{
    0{"Taika (645–650)"}
    1{"Hakuchi (650–671)"}
    10{"Tenpyō (729–749)"}
    ...
    235{"Heisei"}
    ...

@mihnita
Copy link
Contributor

mihnita commented Aug 22, 2025

Related?

3a66f8c 2025-08-11 22:24:01 +0000 "ICU-23172 part 1: Convert era names from array to map (#3588):"

@mihnita
Copy link
Contributor

mihnita commented Aug 22, 2025

Why it does not fail on non-Linux?
I suppose it is about how U_ASSERT is implemented.

On Linux the test results in a core-dump.

For macOS that should be enabled, it does not happen by default:
https://nasa.github.io/trick/howto_guides/How-to-dump-core-file-on-MacOS.html

Also, it looks like the test suite (on Linux) recovers after a core dump and continues.
I don't know if it is intentional.
But that makes it kind of hard to find where the failure happens.
You get one code dump for one test, and everything continues.

@mihnita
Copy link
Contributor

mihnita commented Aug 22, 2025

For the ja.txt file the eras/abbreviated has 237 entries, but the eras/narrow has only 232 entries.
Which "rhymes" with the symbolsCount being 232 in U_ASSERT.


This is very likely related to the new bugs filed by Dragan for the Japanese calendar that only happen with the new data.

@AEApple
Copy link

AEApple commented Aug 22, 2025

For the ja.txt file the eras/abbreviated has 237 entries, but the eras/narrow has only 232 entries. Which "rhymes" with the symbolsCount being 232 in U_ASSERT.

This is very likely related to the new bugs filed by Dragan for the Japanese calendar that only happen with the new data.

Per @pedberg-icu - Only the 5 most recent eras are the ones that differ from the abbreviated. It seems likely that those are not being copied for some reason, and if there were true that would explain why the formatting is not using the era code for those. It's not clear where in the integration this is happening.

@pedberg-icu
Copy link
Contributor

pedberg-icu commented Aug 22, 2025

For the ja.txt file the eras/abbreviated has 237 entries, but the eras/narrow has only 232 entries. Which "rhymes" with the symbolsCount being 232 in U_ASSERT.
This is very likely related to the new bugs filed by Dragan for the Japanese calendar that only happen with the new data.

Per @pedberg-icu - Only the 5 most recent eras are the ones that differ from the abbreviated. It seems likely that those are not being copied for some reason, and if there were true that would explain why the formatting is not using the era code for those. It's not clear where in the integration this is happening.

Yeah, it seems like inheritance is not working for the narrow Japanese eras using the new table format for the era names data. But I did run the ICU4C and ICU4J tests with data generated using the new table format, including with early eras deleted, and did not have these problems. I will try to take a look this weekend.

@markusicu
Copy link
Member

The CLDR structure for eras is changing, allowing sparse eras, so that early (inaccurate) Japanese eras can be removed (and other changes endorsed by the design group). Peter's recent PR prepared the eras-loading code so that it can handle either the old or the new data structure. He generated the new structure to test with, but didn't add the new data in the same PR (because the generation pulled in a lot of unrelated stuff), and I guess he didn't see these errors with his generated data.

I reviewed the new loading code, and it seemed fine. Maybe we overlooked something. I actually suspect that the formatting code goes directly into the array when it should use a getter -- there is a new minEras field, and the array omits those first entries. There are also conditions when future eras at the end are dropped.

And, I don't know how loading vs. using abbreviated vs. wide vs. narrow symbols works. Maybe something expects the arrays to has the same lengths, but if there is no abbreviated symbol then the formatting code should pick the narrow one, or vice versa.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants