Question How to input 5 digit unicode escape strings? (To create spam filter)

Discussion in 'Editor & General Support' started by mikejm_, Jul 23, 2023.

  1. mikejm_ (Joined: Oct 9, 2021, Posts: 346)

    I am trying to write a spam filter that catches spelling variations of words in text. For example, the normally typed version of the following should be filtered just the same as this Unicode variant:

    [Attached image: onlyfans scriptunicode.PNG]

    My strategy is to use this site:
    https://util.unicode.org/UnicodeJsps/confusables.jsp?a=a&r=None

    to make a list of confusables for all the standard A-Z and a-z letters. I figure I can create a List of the confusables for each letter, i.e.:

    Code (csharp):
    public static List<string> aCharList = new() {
        "a",
        "A",
        //small a:
        "\u0061",
        "\u0251",
        "\u03B1",
        "\u0430",
        "\u237A",
        "\u1D41A",
        "\u1D44E",
        "\u1D482",
        "\u1D4B6",
        "\u1D4EA",
        "\u1D51E",
        "\u1D552",
        "\u1D586",
        "\u1D5BA",
        "\u1D5EE",
        "\u1D622",
        "\u1D656",
        "\u1D68A",
        "\u1D6C2",
        "\u1D6FC",
        "\u1D736",
        "\u1D770",
        "\u1D7AA",
        "\uFF41",
    };

    I can then iterate over these lists to find "alternative character spellings" of the same words. But this isn't working. In C#, \u must be followed by exactly four hex digits, so it only covers U+0000 to U+FFFF:

    https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/char

    In Visual Studio I can see that the fifth digit of these Unicode codes is not recognized as part of the escape. Is there an easy way to implement this strategy, or any different ideas? Alternatively, I could just copy and paste the Unicode characters from that confusables website into my code, but I feel like using the codes for them is more robust against file format changes.

    Is there some solution or existing approach?
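For reference, a minimal sketch of one way to write code points above U+FFFF in C#. The language has a second escape, \U, which takes exactly eight hex digits, and char.ConvertFromUtf32 builds the same string at runtime. Note that inside a string literal "\u1D41A" does compile, but it means U+1D41 followed by the letter "A", which would explain why the list above does not match as intended. The class and member names here (ConfusableEscapes, SmallA, FromCodePoint) are hypothetical, and only a few of the entries are shown.

```csharp
using System;
using System.Collections.Generic;

public static class ConfusableEscapes
{
    // "\U" takes exactly eight hex digits, so U+1D41A is written \U0001D41A.
    // The compiler stores it as two UTF-16 code units (a surrogate pair).
    public static readonly List<string> SmallA = new List<string>
    {
        "\u0251",       // U+0251, fits in four digits
        "\uFF41",       // U+FF41, fits in four digits
        "\U0001D41A",   // U+1D41A, needs the eight-digit form
        "\U0001D44E",   // U+1D44E, needs the eight-digit form
    };

    // Runtime alternative: build the string from the numeric code point.
    public static string FromCodePoint(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }
}
```

Each eight-digit entry has a Length of 2, so the list has to stay a List<string> rather than a List<char>.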
     
  2. tleylan (Joined: Jun 17, 2020, Posts: 521)
  3. mikejm_

    Thanks but that didn't work. When I try that:

    Code (csharp):
    public static List<char> aCharList = new() {
        'a',
        'A',
        //small a:
        '\u0061',
        '\u0251',
        '\u03B1',
        '\u0430',
        '\u237A',
        '\u1D41A',
        '\u1D44E',
        '\u1D482',
        '\u1D4B6',
        '\u1D4EA',
        '\u1D51E',
        '\u1D552',
        '\u1D586',
        '\u1D5BA',
        '\u1D5EE',
        '\u1D622',
        '\u1D656',
        '\u1D68A',
        '\u1D6C2',
        '\u1D6FC',
        '\u1D736',
        '\u1D770',
        '\u1D7AA',
        '\uFF41',
    };

    I get the error "Too many characters in character literal" on all the five-digit ones.

    When I tried copying and pasting the characters in, the ones made from 5-digit codes show up pink (I don't know what that means):

    [Attached image: confusables 1.PNG]

    Any correct way to handle these characters?
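The "Too many characters in character literal" error follows from a C# char being a single 16-bit UTF-16 code unit, so it cannot hold a code point above U+FFFF. One possible workaround, sketched here under that assumption, is to store the code points as int values and compare with char.ConvertToUtf32. SmallAConfusables and IsConfusableAt are hypothetical names, and the set is only a subset of the list above.

```csharp
using System;
using System.Collections.Generic;

public static class SmallAConfusables
{
    // Code points stored as int, so values above 0xFFFF are fine.
    private static readonly HashSet<int> CodePoints = new HashSet<int>
    {
        0x0061, 0x0251, 0x03B1, 0x0430, 0x237A,
        0x1D41A, 0x1D44E, 0x1D482, 0xFF41,
    };

    // True if the code point starting at text[index] is in the set.
    // char.ConvertToUtf32 throws ArgumentException when index points at
    // the second half of a pair or at an unpaired surrogate.
    public static bool IsConfusableAt(string text, int index)
    {
        int codePoint = char.ConvertToUtf32(text, index);
        return CodePoints.Contains(codePoint);
    }
}
```

When walking a whole string this way, the index has to advance by two after a code point above U+FFFF and by one otherwise.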
     
  4. tleylan

    I didn't know about the 5-digit version. Apparently they are handled via something known as surrogate pairs. You might have to search for a solution you can use. You might post the answer here when you find it.
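As a rough illustration of the surrogate pairs mentioned here: UTF-16 encodes a code point above U+FFFF by subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves, which are added to 0xD800 and 0xDC00. The helper below (SurrogatePairs.ToEscape, a hypothetical name) is a sketch that prints the matching pair of four-digit \u escapes.

```csharp
using System;

public static class SurrogatePairs
{
    // Returns the "\uXXXX\uXXXX" escape text for a code point above U+FFFF.
    public static string ToEscape(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;     // 20-bit value
        int high = 0xD800 + (offset >> 10);   // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
        return "\\u" + high.ToString("X4") + "\\u" + low.ToString("X4");
    }
}
```

For U+1D41A the offset is 0xD41A, which gives the high surrogate 0xD835 and the low surrogate 0xDC1A, so "\uD835\uDC1A" is the same string as "\U0001D41A".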
     
  5. mikejm_

    Thanks, yes I just found it this morning also: http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm

    I presume it is safer to code with these than to copy and paste the characters themselves, in case they are lost on build that way, so that is what I will do. Thanks.
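With the escapes sorted out, one way the filter itself might use the confusables data is to fold each confusable code point back to its plain letter before matching words, rather than testing every spelling variant. This is only a sketch: ConfusableFolder, Fold, and the tiny table are hypothetical, and a real table would cover every letter from the confusables site.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ConfusableFolder
{
    // Hypothetical partial table: confusable code point -> plain letter.
    private static readonly Dictionary<int, char> Map = new Dictionary<int, char>
    {
        { 0x0251, 'a' }, { 0x03B1, 'a' }, { 0x0430, 'a' },
        { 0x1D41A, 'a' }, { 0xFF41, 'a' },
    };

    // Replaces every mapped code point with its plain letter.
    public static string Fold(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            int cp;
            if (char.IsSurrogatePair(text, i))
            {
                cp = char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate
            }
            else
            {
                cp = text[i];
            }

            if (Map.TryGetValue(cp, out char plain)) sb.Append(plain);
            else if (cp > 0xFFFF) sb.Append(char.ConvertFromUtf32(cp));
            else sb.Append((char)cp);
        }
        return sb.ToString();
    }
}
```

The folded string can then be lowercased and checked against an ordinary word list.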
     