@juntae6942 juntae6942 commented Sep 13, 2025

Closes: #35426

Currently, HtmlUtils.htmlUnescape() does not correctly handle numeric character references for Unicode supplementary characters (e.g., emojis).

For example, an entity like &#128512; (😀) is incorrectly converted to the garbled character U+F600, because the code point is truncated to 16 bits.

Steps to Reproduce

public static void main(String[] args) {
        // Test character: 'Grinning Face' emoji (😀)
        // Unicode code point: U+1F600
        // Hexadecimal: 1F600
        // Decimal: 128512

        // 1. Input value as a decimal HTML entity
        String inputDecimal = "&#128512;";

        // 2. Input value as a hexadecimal HTML entity
        String inputHex = "&#x1F600;";

        // 3. The expected result after correct conversion
        String expectedOutput = "😀";

        System.out.println("--- Decimal HTML Entity Test ---");
        System.out.println("Input: " + inputDecimal);

        // Call the HtmlUtils.htmlUnescape() method
        String actualOutputDecimal = HtmlUtils.htmlUnescape(inputDecimal);

        System.out.println("Actual Output: " + actualOutputDecimal);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputDecimal));

        System.out.println("\n--- Hexadecimal HTML Entity Test ---");
        System.out.println("Input: " + inputHex);

        // Call the HtmlUtils.htmlUnescape() method
        String actualOutputHex = HtmlUtils.htmlUnescape(inputHex);

        System.out.println("Actual Output: " + actualOutputHex);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputHex));
    }
[Screenshot, 2025-09-13 11:29:37 PM]

Cause

The root cause is a narrowing cast to a 16-bit char in HtmlCharacterEntityDecoder. The cast truncates any Unicode code point above U+FFFF and discards the most significant bits, so 0x1F600 becomes 0xF600.
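The truncation can be reproduced in isolation. The following is a minimal sketch, not the code from HtmlCharacterEntityDecoder; the class name CharCastTruncation and the helper viaCharCast are hypothetical and exist only for illustration.

```java
public class CharCastTruncation {

    // Hypothetical helper: mimics appending a code point through a (char) cast.
    // A char is 16 bits wide, so only the low 16 bits of the int survive.
    static String viaCharCast(int codePoint) {
        return new StringBuilder().append((char) codePoint).toString();
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // U+1F600, outside the Basic Multilingual Plane
        String truncated = viaCharCast(grinningFace);

        // The leading 0x1 of 0x1F600 is lost by the cast.
        System.out.printf("U+%04X%n", (int) truncated.charAt(0)); // prints "U+F600"
    }
}
```

U+F600 lies in the Private Use Area, which is why the result typically renders as a garbled or missing glyph.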

Solution

This PR resolves the issue by replacing the direct (char) cast with a call to StringBuilder.appendCodePoint().

The appendCodePoint() method is designed to handle the full range of Unicode code points. It correctly converts supplementary characters into a two-character surrogate pair, ensuring that all characters are unescaped without data loss. A corresponding unit test has been added to verify this fix.
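To illustrate the approach, here is a minimal sketch of decoding a single numeric character reference with StringBuilder.appendCodePoint(). It is an assumption-laden stand-in, not the actual change in this PR: the class NumericReferenceSketch and its decode method are hypothetical, and the real decoder handles far more cases (named entities, malformed input, surrounding text).

```java
public class NumericReferenceSketch {

    // Hypothetical helper, not Spring's implementation: decodes exactly one
    // numeric character reference such as "&#128512;" or "&#x1F600;".
    static String decode(String reference) {
        if (!reference.startsWith("&#") || !reference.endsWith(";")) {
            throw new IllegalArgumentException("Not a numeric character reference: " + reference);
        }
        String digits = reference.substring(2, reference.length() - 1);
        boolean hex = digits.startsWith("x") || digits.startsWith("X");
        int codePoint = hex
                ? Integer.parseInt(digits.substring(1), 16)
                : Integer.parseInt(digits);

        // appendCodePoint writes one char for code points up to U+FFFF and a
        // surrogate pair (two chars) for supplementary code points.
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&#x1F600;");
        System.out.println(decoded.length());                              // 2 (surrogate pair)
        System.out.println(decoded.codePointCount(0, decoded.length()));   // 1 (single code point)
        System.out.println(Integer.toHexString(decoded.codePointAt(0)));   // 1f600
    }
}
```

Both the decimal and hexadecimal forms yield the same two-char string "\uD83D\uDE00", which is the UTF-16 encoding of U+1F600.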

@juntae6942 juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from bc095df to 369ffe4 Compare September 14, 2025 04:41
@juntae6942 juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from 369ffe4 to a6efa2a Compare September 14, 2025 04:48
Labels
status: waiting-for-triage An issue we've not yet triaged or decided on
Development

Successfully merging this pull request may close these issues.

HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536;