Fix HtmlUtils unescape for supplementary chars #35477

juntae6942 · 2025-09-13T14:24:07Z

Currently, HtmlUtils.htmlUnescape() does not correctly handle numeric character references for Unicode supplementary characters (e.g., emojis).

For example, the entity &#128512; (😀, U+1F600) is incorrectly converted to the garbled character U+F600, because the code point is truncated to 16 bits.

Steps to Reproduce

import org.springframework.web.util.HtmlUtils;

public class HtmlUnescapeRepro {

    public static void main(String[] args) {
        // Test character: 'Grinning Face' emoji (😀)
        // Unicode code point: U+1F600
        // Hexadecimal: 1F600
        // Decimal: 128512

        // 1. Input value as a decimal numeric character reference
        String inputDecimal = "&#128512;";

        // 2. Input value as a hexadecimal numeric character reference
        String inputHex = "&#x1F600;";

        // 3. The expected result after correct conversion
        String expectedOutput = "😀";

        System.out.println("--- Decimal HTML Entity Test ---");
        System.out.println("Input: " + inputDecimal);

        // Unescape the decimal form
        String actualOutputDecimal = HtmlUtils.htmlUnescape(inputDecimal);

        System.out.println("Actual Output: " + actualOutputDecimal);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputDecimal));

        System.out.println("\n--- Hexadecimal HTML Entity Test ---");
        System.out.println("Input: " + inputHex);

        // Unescape the hexadecimal form
        String actualOutputHex = HtmlUtils.htmlUnescape(inputHex);

        System.out.println("Actual Output: " + actualOutputHex);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputHex));
    }
}

Cause

The root cause was a problematic cast to a 16-bit char in the HtmlCharacterEntityDecoder. This operation truncated any Unicode code point value greater than U+FFFF, leading to the loss of the most significant bits.
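The truncation can be reproduced without Spring. The following is a minimal sketch, not the actual HtmlCharacterEntityDecoder code; the class name CharCastTruncation and the helper truncate are hypothetical and exist only to illustrate the narrowing cast.

```java
class CharCastTruncation {

    // Hypothetical helper: narrows a code point to a 16-bit char,
    // which keeps only the low 16 bits of the int value.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // U+1F600, outside the Basic Multilingual Plane
        char truncated = truncate(grinningFace);

        // 0x1F600 & 0xFFFF == 0xF600, so the leading 0x1 is lost
        System.out.printf("U+%04X%n", (int) truncated); // prints U+F600
    }
}
```

The result U+F600 lies in the Private Use Area, which is why the output renders as a garbled character rather than the emoji.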

Solution

This PR resolves the issue by replacing the direct (char) cast with a call to StringBuilder.appendCodePoint().

The appendCodePoint() method is designed to handle the full range of Unicode code points. It correctly converts supplementary characters into a two-character surrogate pair, ensuring that all characters are unescaped without data loss. A corresponding unit test has been added to verify this fix.
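As an illustration of the approach, here is a small standalone sketch of decoding a numeric reference with StringBuilder.appendCodePoint(). It is not the code from this PR; the class CodePointAppend and the helper decodeNumericReference are hypothetical names, and the helper assumes the digits have already been extracted from the reference.

```java
class CodePointAppend {

    // Hypothetical helper: parses the digits of a numeric character reference
    // and appends the resulting code point without narrowing it to char.
    static String decodeNumericReference(String digits, int radix) {
        int codePoint = Integer.parseInt(digits, radix);
        StringBuilder sb = new StringBuilder();
        // Appends one char for BMP code points and a surrogate pair otherwise;
        // throws IllegalArgumentException if the value is not a valid code point.
        sb.appendCodePoint(codePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String fromHex = decodeNumericReference("1F600", 16);
        String fromDecimal = decodeNumericReference("128512", 10);

        System.out.println(fromHex.length());                  // 2 (surrogate pair)
        System.out.println(fromHex.equals("\uD83D\uDE00"));    // true
        System.out.println(fromHex.equals(fromDecimal));       // true
    }
}
```

Because appendCodePoint() rejects values above U+10FFFF with an IllegalArgumentException, a decoder built this way would likely need to decide how to treat out-of-range references; how the PR handles that case is not stated here.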

Signed-off-by: potato <[email protected]>

spring-projects-issues added the status: waiting-for-triage label Sep 13, 2025

juntae6942 mentioned this pull request Sep 13, 2025

HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536; #35426

Open

Fix HtmlUtils unescape for supplementary chars

0b60800

Signed-off-by: potato <[email protected]>

juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from bc095df to 369ffe4 on September 14, 2025 04:41

Test: Add case for basic HTML entities in HtmlUtils

a6efa2a

Signed-off-by: potato <[email protected]>

juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from 369ffe4 to a6efa2a on September 14, 2025 04:48

Style: Format code to align with project conventions

7378478

Signed-off-by: potato <[email protected]>
