ICU-23056 CLDR 48 alpha0 integration #3600

DraganBesevic · 2025-08-20T16:17:45Z

Checklist

Required: Issue filed: ICU-23056
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

Alpha0 integration for CLDR 48

Known issues:
ICU-23182, (Formatting for Japanese calendars)
ICU-23183 (Time Zone round trip test)
ICU-23185 (hz and hv patterns)
ICU-23186 (Japanese calendars fails round trip unit tests)
ICU-23187 (TestPersonNames test fail for kk_Arab and shn)
- note, kk_Arab and shn are new locales not yet in ICU (until config file is updated)
ICU-23188 (Compact decimal format adds extra RLM mark for ar-EG)

ALLOW_MANY_COMMITS=true

…iles

pedberg-icu · 2025-08-20T16:29:38Z

@DraganBesevic getting a crash e.g. in the clang-cpp17 tests:

      IntlCalendarTest {
         ... 
         TestForceGannenNumbering {
intltest: smpdtfmt.cpp:1290: void icu_78::_appendSymbol(icu_78::UnicodeString &, int32_t, const icu_78::UnicodeString *, int32_t): Assertion `0 <= value && value < symbolsCount' failed.

Depending on what is going on here: Maybe skip this test with the logKnownIssue already covering Gannen formatting?

DraganBesevic · 2025-08-20T16:57:40Z

@DraganBesevic getting a crash e.g. in the clang-cpp17 tests:
      IntlCalendarTest {
         ... 
         TestForceGannenNumbering {
intltest: smpdtfmt.cpp:1290: void icu_78::_appendSymbol(icu_78::UnicodeString &, int32_t, const icu_78::UnicodeString *, int32_t): Assertion `0 <= value && value < symbolsCount' failed.
Depending on what is going on here: Maybe skip this test with the logKnownIssue already covering Gannen formatting?

That's weird. I wonder why adding a known issue would cause to crash in CPP 17. I will skip it as it is already a known issue

…res for CPP 17

mihnita · 2025-08-22T01:03:38Z

I was able to reproduce it on an Ubuntu system at home.

I am trying to figure out what the cause is. But it is very puzzling, I see no reason for this to fail with the CLDR update and not fail without it.

Nothing in this PR can explain what is going on (just by looking at it).
I'm pretty sure that it will all be "obvious" once I figure out the cause :-)

mihnita · 2025-08-22T05:16:24Z

The failure happens in icu4c/source/i18n/smpdtfmt.cpp, line 1536, in this call:

_appendSymbol(appendTo, value, fSymbols->fNarrowEras, fSymbols->fNarrowErasCount);

_appendSymbol fails the U_ASSERT(0 <= value && value < symbolsCount); (line 1290)
because value is 235, and symbolsCount is 232.

The locale failing is Japanese with alternate calendar, 235 is Heisei
(see icu4c/source/data/locales/root.txt in pull request)

I am not sure how to fix this, since it is not just a few entries added, the structure seems changed.
Inspect the icu4c/source/data/locales/root.txt (or any other locale file).
The eras changed from an array:

eras{
  abbreviated{                                                   
    "Taika (645–650)",
    "Hakuchi (650–671)",
    "Hakuhō (672–686)",
    "Shuchō (686–701)",
    ...
    "Heisei",
    ...

to a map:

eras{
  abbreviated{
    0{"Taika (645–650)"}
    1{"Hakuchi (650–671)"}
    10{"Tenpyō (729–749)"}
    ...
    235{"Heisei"}
    ...

mihnita · 2025-08-22T05:26:31Z

Related?

3a66f8c 2025-08-11 22:24:01 +0000 "ICU-23172 part 1: Convert era names from array to map (#3588):"

mihnita · 2025-08-22T05:49:34Z

Why it does not fail on non-Linux?
I suppose it is about how U_ASSERT is implemented.

On Linux the test results in a core-dump.

For macOS that should be enabled, it does not happen by default:
https://nasa.github.io/trick/howto_guides/How-to-dump-core-file-on-MacOS.html

Also, it looks like the test suite (on Linux) recovers after a core dump and continues.
I don't know if it is intentional.
But that makes it kind of hard to find where the failure happens.
You get one code dump for one test, and everything continues.

mihnita · 2025-08-22T06:10:44Z

For the ja.txt file the eras/abbreviated has 237 entries, but the eras/narrow has only 232 entries.
Which "rhymes" with the symbolsCount being 232 in U_ASSERT.

This is very likely related to the new bugs filed by Dragan for the Japanese calendar that only happen with the new data.

AEApple · 2025-08-22T16:49:05Z

For the ja.txt file the eras/abbreviated has 237 entries, but the eras/narrow has only 232 entries. Which "rhymes" with the symbolsCount being 232 in U_ASSERT.

This is very likely related to the new bugs filed by Dragan for the Japanese calendar that only happen with the new data.

Per @pedberg-icu - Only the 5 most recent eras are the ones that differ from the abbreviated. It seems likely that those are not being copied for some reason, and if there were true that would explain why the formatting is not using the era code for those. It's not clear where in the integration this is happening.

pedberg-icu · 2025-08-22T16:58:19Z

For the ja.txt file the eras/abbreviated has 237 entries, but the eras/narrow has only 232 entries. Which "rhymes" with the symbolsCount being 232 in U_ASSERT.
This is very likely related to the new bugs filed by Dragan for the Japanese calendar that only happen with the new data.

Per @pedberg-icu - Only the 5 most recent eras are the ones that differ from the abbreviated. It seems likely that those are not being copied for some reason, and if there were true that would explain why the formatting is not using the era code for those. It's not clear where in the integration this is happening.

Yeah, it seems like inheritance is not working for the narrow Japanese eras using the new table format for the era names data. But I did run the ICU4C and ICU4J tests with data generated using the new table format, including with early eras deleted, and did not have these problems. I will try to take a look this weekend.

markusicu · 2025-08-22T17:00:08Z

The CLDR structure for eras is changing, allowing sparse eras, so that early (inaccurate) Japanese eras can be removed (and other changes endorsed by the design group). Peter's recent PR prepared the eras-loading code so that it can handle either the old or the new data structure. He generated the new structure to test with, but didn't add the new data in the same PR (because the generation pulled in a lot of unrelated stuff), and I guess he didn't see these errors with his generated data.

I reviewed the new loading code, and it seemed fine. Maybe we overlooked something. I actually suspect that the formatting code goes directly into the array when it should use a getter -- there is a new minEras field, and the array omits those first entries. There are also conditions when future eras at the end are dropped.

And, I don't know how loading vs. using abbreviated vs. wide vs. narrow symbols works. Maybe something expects the arrays to has the same lengths, but if there is no abbreviated symbol then the formatting code should pick the narrow one, or vice versa.

DraganBesevic added 3 commits August 20, 2025 09:06

ICU-23056 Integrate CLDR 48 release alpha0, part 1, binary files

317c4d5

ICU-23056 Integrate CLDR 48 release alpha0, part 2, data files

e5eade2

ICU-23056 Integrate CLDR 48 release alpha0, part 3, source and test f…

c5d1d7f

…iles

DraganBesevic requested review from mihnita, macchiati, markusicu, pedberg-icu and richgillam August 20, 2025 16:17

markusicu self-assigned this Aug 20, 2025

DraganBesevic added 2 commits August 20, 2025 10:09

ICU-23056 Integrate CLDR 48 release alpha0, part 4, fixing test failu…

5b162d6

…res for CPP 17

ICU-23056 Integrate CLDR 48 release alpha0, part 5, fixing test failu…

b82ca82

…res for CPP 17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ICU-23056 CLDR 48 alpha0 integration #3600

ICU-23056 CLDR 48 alpha0 integration #3600

DraganBesevic commented Aug 20, 2025 •

edited by markusicu

Loading

Uh oh!

pedberg-icu commented Aug 20, 2025 •

edited

Loading

Uh oh!

DraganBesevic commented Aug 20, 2025

Uh oh!

mihnita commented Aug 22, 2025

Uh oh!

mihnita commented Aug 22, 2025 •

edited

Loading

Uh oh!

mihnita commented Aug 22, 2025

Uh oh!

mihnita commented Aug 22, 2025 •

edited

Loading

Uh oh!

mihnita commented Aug 22, 2025

Uh oh!

AEApple commented Aug 22, 2025

Uh oh!

pedberg-icu commented Aug 22, 2025 •

edited

Loading

Uh oh!

markusicu commented Aug 22, 2025

Uh oh!

Uh oh!

Uh oh!

ICU-23056 CLDR 48 alpha0 integration #3600

Are you sure you want to change the base?

ICU-23056 CLDR 48 alpha0 integration #3600

Conversation

DraganBesevic commented Aug 20, 2025 • edited by markusicu Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

pedberg-icu commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DraganBesevic commented Aug 20, 2025

Uh oh!

mihnita commented Aug 22, 2025

Uh oh!

mihnita commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mihnita commented Aug 22, 2025

Uh oh!

mihnita commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mihnita commented Aug 22, 2025

Uh oh!

AEApple commented Aug 22, 2025

Uh oh!

pedberg-icu commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markusicu commented Aug 22, 2025

Uh oh!

Uh oh!

DraganBesevic commented Aug 20, 2025 •

edited by markusicu

Loading

pedberg-icu commented Aug 20, 2025 •

edited

Loading

mihnita commented Aug 22, 2025 •

edited

Loading

mihnita commented Aug 22, 2025 •

edited

Loading

pedberg-icu commented Aug 22, 2025 •

edited

Loading