Question I want to exclude special decorative characters.

Discussion in 'Scripting' started by saebashi, Jun 9, 2022.

  1. saebashi

    Joined:
    Nov 5, 2013
    Posts:
    13
    On this page you can create "ђẸˡO", "ɦɛʟʟօ", and other decorative characters (I don't know the correct name for them). Users enter strings like these, and that can cause problems when the string is retrieved on other platforms such as consoles.
    I would like to know how to replace or exclude such characters to avoid this.
     
  2. Kurt-Dekker

    Joined:
    Mar 16, 2013
    Posts:
    36,970
    I call them filthy dirty nasty characters and they should all be destroyed.

    Here's how:

    Code (csharp):
    using UnityEngine;

    // @kurtdekker
    // Detect filthy dirty nasty non-ASCII characters

    public class DetectFilthyDirtyNastyCharacters : MonoBehaviour
    {
        void Start()
        {
            string data = "Test \" ɦɛʟʟօ\" and";

            Debug.Log("Original:" + data);

            // iterate all characters
            foreach (var c in data)
            {
                // allow printable 7-bit characters only: space (32) through '~' (126);
                // 127 (DEL) is a control character, so it is excluded
                if (c >= (char)32 && c < (char)127)
                {
                    Debug.Log("Good:" + c);
                }
                else
                {
                    Debug.Log("Bad:" + c);
                }
            }
        }
    }
     
  3. Bunny83

    Joined:
    Oct 18, 2010
    Posts:
    3,572
    Those are not "decorative characters"; there's simply a world outside of ASCII (American Standard Code for Information Interchange). Different countries have different languages and letters. For example, ɛ is the Latin epsilon. Here in Germany we also have a couple of "umlauts", which of course do not exist in English. The Unicode character set is designed to support all languages of this planet, plus some additional characters and glyphs such as emojis. Whether you can display a character depends entirely on which "pages" of the Unicode standard your font supports. Of course you could restrict text to ASCII, but I guess that would make a lot of people unhappy. As an example, my family name contains an "ö". The common replacement is "oe", but other languages have letters that may not have a proper replacement.

    So this depends on the use case. In which context can users actually enter such text? Is it their username? Chat messages? ... You will always end up in situations where some platforms do not support the full standard. For example, the iPhone was one of the first phones that supported emoji, so iPhone users could send each other messages containing emojis, but users of other devices saw gibberish instead. Now most phones probably support the emoji page. Still, which fonts and characters a machine or device can display depends on the installed languages and supported fonts.

    (page update)

    Yes, if you want to stick to ASCII characters you can filter them as Kurt showed. See an ASCII table for more information. Characters below 32 are control characters, which are usually not printable. However, they include things like newline characters (#10, #13) and tabs (#9). So if you want to filter chat messages, be aware that you strip out line breaks this way.
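
As a rough illustration of the line-break caveat, here is a minimal sketch of an ASCII-only filter that keeps tab, line feed, and carriage return. The class and method names (`AsciiChatFilter`, `IsAllowed`, `Filter`) are made up for this example and are not part of Unity or the thread.

```csharp
using System.Text;

// Hypothetical helper: keeps printable ASCII plus tab and line breaks.
public static class AsciiChatFilter
{
    public static bool IsAllowed(char c)
    {
        // Whitespace control characters that chat text usually needs.
        if (c == '\t' || c == '\n' || c == '\r') return true;

        // Printable 7-bit ASCII: space (32) through '~' (126).
        return c >= ' ' && c <= '~';
    }

    public static string Filter(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Characters that are not allowed are dropped, not replaced.
            if (IsAllowed(c)) sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Dropping characters changes the string length; appending a placeholder such as '*' instead would keep the length stable, which may matter for fixed-width UI.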
     
  4. saebashi

    Thanks for the reply, Kurt-Dekker!
    However, there is still a problem with this: it also seems to identify and exclude characters from my language, Japanese, Chinese, etc.
     
  5. Kurt-Dekker

    Then you'd need to study the code point tables and decide which ones you want and which ones you don't want.
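
One way this could look in practice is an allow-list of code point ranges. The sketch below is only an assumption about which ranges a Japanese/Chinese-capable game might want; the range list and the names `CodePointFilter`, `IsAllowed`, and `Filter` are hypothetical and should be adjusted to the scripts your fonts actually cover.

```csharp
using System.Text;

// Hypothetical allow-list filter based on Unicode block ranges (BMP only).
public static class CodePointFilter
{
    // Inclusive ranges; an illustrative selection, not a complete list.
    static readonly (char First, char Last)[] AllowedRanges =
    {
        ('\u0020', '\u007E'), // printable ASCII
        ('\u3000', '\u303F'), // CJK Symbols and Punctuation
        ('\u3040', '\u309F'), // Hiragana
        ('\u30A0', '\u30FF'), // Katakana
        ('\u4E00', '\u9FFF'), // CJK Unified Ideographs
        ('\uFF00', '\uFFEF'), // Halfwidth and Fullwidth Forms
    };

    public static bool IsAllowed(char c)
    {
        foreach (var range in AllowedRanges)
        {
            if (c >= range.First && c <= range.Last) return true;
        }
        return false;
    }

    // Replaces every char outside the allow-list with the replacement char.
    public static string Filter(string input, char replacement = '*')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            sb.Append(IsAllowed(c) ? c : replacement);
        }
        return sb.ToString();
    }
}
```

With these ranges, ordinary Latin, kana, and kanji text passes through unchanged, while the letters in "ɦɛʟʟօ" (IPA Extensions and Armenian) are replaced.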
     
  6. saebashi

    Thank you both for your answers.
    After a little bit of struggling, I solved this by using TextMeshPro's HasCharacter() to replace any character that is not in the TMP_FontAsset.

    Code (CSharp):
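    // Assumes tmp_font_asset is a TMP_FontAsset field assigned elsewhere (not shown in the post).
    string CheckTextCharactor (string str_) {
        for (int i = 0; i < str_.Length; i++) {
            // second argument true: also search the font asset's fallback fonts
            if (!tmp_font_asset.HasCharacter (str_[i], true)) {
                Debug.Log ("bad:" + str_[i]);
                // replaces every occurrence of this character; the string length stays the same
                str_ = str_.Replace (str_[i].ToString (), "*");
            }
        }
        Debug.Log (str_);
        return str_;
    }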
     
  7. Bunny83

    That's probably the most reasonable way to approach such issues.

    I'm not sure which languages you want / need to support, but almost all relevant characters are in the BMP (Basic Multilingual Plane). Your font may not support all of those languages. Every character in the BMP can be represented by a single UTF-16 code unit (which is what a C# "char" value holds). You would run into issues when characters from other planes are used, in which case you get "surrogate pairs". The surrogate code points are themselves part of the BMP and encode 10 bits each. Two of them (one high and one low 16-bit char) are always required to form an actual character. I guess the fonts Unity provides do not have surrogate characters in them, so you could filter them out individually as you're currently doing. So while this works, it is probably not the most robust solution.
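
To make the surrogate-pair point concrete, here is a minimal sketch that treats a pair as one character and replaces it with a single placeholder, instead of producing two placeholders for one emoji. The name `SurrogateAwareFilter.ReplaceNonBmp` is made up for this example; `char.IsSurrogatePair` and `char.IsSurrogate` are standard .NET methods.

```csharp
using System.Text;

// Hypothetical helper: collapses each non-BMP character to one replacement char.
public static class SurrogateAwareFilter
{
    public static string ReplaceNonBmp(string input, char replacement = '*')
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsSurrogatePair(input, i))
            {
                // A valid high + low pair forms one character outside the BMP.
                sb.Append(replacement);
                i++; // skip the low surrogate that belongs to this pair
            }
            else if (char.IsSurrogate(input[i]))
            {
                // Unpaired surrogate: malformed UTF-16, replace it as well.
                sb.Append(replacement);
            }
            else
            {
                // Ordinary BMP character: copy unchanged.
                sb.Append(input[i]);
            }
        }
        return sb.ToString();
    }
}
```

BMP characters that pass through here would still need the font or range check discussed above; this sketch only handles the pairing.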

    PS: when you're only dealing with short names, your approach using String.Replace works just fine. It creates some garbage, but that shouldn't be too bad. For longer texts I would recommend using a StringBuilder and essentially copying char by char, "replacing" bad ones directly with an "if else" construct. Note that String.Replace has a char version; you currently use the string version. Not only is the string version slower (since it can match actual strings), but when directly replacing one char with another you can use

    Code (CSharp):
    .Replace (str_[i], '*');
    Note the char literal instead of a string literal as replacement. This does not work when you want to replace the char with an empty string as the char version can only replace 1 char with another. This replace method should also have less internal memory overhead since the input and output strings always have the same length.
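
The StringBuilder approach recommended for longer texts might look roughly like the sketch below. It is written without Unity dependencies so it can run anywhere: the support check is passed in as a delegate, and the names `TextSanitizer` and `ReplaceUnsupported` are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper: single pass, one output allocation.
public static class TextSanitizer
{
    public static string ReplaceUnsupported(
        string input, Func<char, bool> isSupported, char replacement = '*')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Copy supported chars as-is, swap the rest for the replacement.
            if (isSupported(c))
            {
                sb.Append(c);
            }
            else
            {
                sb.Append(replacement);
            }
        }
        return sb.ToString();
    }
}
```

In a Unity project, the delegate could wrap the font check from the earlier post, for example `TextSanitizer.ReplaceUnsupported(name, c => tmp_font_asset.HasCharacter(c, true))`, assuming `tmp_font_asset` is an assigned TMP_FontAsset.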
     
  8. saebashi

    Thanks Bunny83.
    I was very troubled by the problem after the release, but was able to solve it.