TextMesh Pro removing rich text tags from a string

Discussion in 'UGUI & TextMesh Pro' started by TheValar, May 7, 2018.

  1. TheValar

    Is there a way to strip out rich text tags from a string?

    I'm working on a chat feature in a game which will use rich text tags for alignment and coloring certain parts of the string, but I don't want users to be able to inject their own rich text tags. Is there a function somewhere in TextMeshPro that will help me remove all the rich text tags from a string before putting in my own?

    thanks
     
  2. Stephan_B

    The <noparse> </noparse> tag doesn't remove rich text tags but prevents them from being parsed / processed. As such, any tags placed between the opening and closing <noparse> tags will show up as plain text.

    There are several TMP users using this where the user text is enclosed in the <noparse> tags.
     
  3. TheValar

    I suppose I could probably make that work. It will be a pain to insert a bunch of noparse groups in between my own rich text tags that I add to the user's message. It would definitely be appreciated if TextMesh Pro exposed something to scrub that stuff out.
     
  4. Stephan_B

    Can you provide an example of the text a user would type vs. how you would reformat it to include your own tags?
     
  5. TheValar

    user might input the message
    hello everyone! This is great :D


    what I will write to the text component will end up being
    <align="right"><color=#00ffff><b>feralholtzem:</b></color> Hello everyone! This is great <sprite name=":D"></align>
     
  6. TheValar

    Right now I have it stripping out <noparse> and </noparse> strings from the user's message, and then closing and re-opening noparse around the emoji rich text that I insert myself. This seems to work.
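
A minimal sketch of the wrapping described above, assuming the message has already had its own <noparse> and </noparse> strings stripped and that the user name comes from a trusted source. `ChatLineBuilder`, `Build` and the one-entry emoticon table are hypothetical names for illustration, not TheValar's code; the tag layout is taken from the example line earlier in the thread.

```csharp
public static class ChatLineBuilder
{
    // Hypothetical emoticon list; each entry must match a sprite name in your TMP sprite asset.
    private static readonly string[] Emoticons = { ":D" };

    // `message` is assumed to contain no <noparse> or </noparse> strings any more.
    // `userName` is assumed to be trusted; it is inserted outside the noparse block.
    public static string Build(string userName, string message)
    {
        string body = message;
        foreach (string emoticon in Emoticons)
        {
            // Close the noparse block, emit our own sprite tag, then reopen the block.
            body = body.Replace(emoticon, "</noparse><sprite name=\"" + emoticon + "\"><noparse>");
        }

        return "<align=\"right\"><color=#00ffff><b>" + userName + ":</b></color> "
             + "<noparse>" + body + "</noparse></align>";
    }
}
```

Everything the user typed stays inside a <noparse> block, and only the tags added by the game sit outside it. An emoticon at the very end of the message leaves an empty <noparse></noparse> pair behind, which should be harmless.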
     
  7. jdeuce

    @Stephan_B: this solution seems woefully inadequate to prevent user tag injection.

    All the user would have to do is prepend their message with </noparse> to get back to being able to inject rich text tags. It would be best if TextMeshPro provided an escape method that could be called to guarantee that any rich text control characters entered by users have been escaped before we pass the text in to be processed.

    e.g.
    Code (CSharp):
    // TextMeshPro.escape is the proposed method; it does not exist in TextMesh Pro.
    textMeshComponent.text = "Some string from the user is: " + TextMeshPro.escape(unsafeString);

    // instead of

    textMeshComponent.text = "Some string from the user is: <noparse>" + unsafeString + "</noparse>";
     
    Last edited: May 9, 2019
  8. MrLucid72

    Hey folks -- did anyone ever find a non-exploitable solution to this? I'm having issues with this, myself ( https://forum.unity.com/threads/rich-text-exploit-with-tmp.694366/#post-4646368 ).

    Maybe .Replace("</noparse>", "")? It's really the only thing that would break things, I'd imagine. It surely won't do wonders for performance, though.

    EDIT: Yep, seems to do the trick!
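
One caveat with a single Replace pass: input such as `</nopa</noparse>rse>` turns back into `</noparse>` once the inner copy is removed. A hedged sketch of a strip that repeats until nothing is left is shown below. `NoParseGuard`, `RemoveAll` and `Wrap` are hypothetical names, and the case-insensitive match is a precaution on the assumption that TMP may accept other casings of the tag; other spellings TMP might tolerate are not covered.

```csharp
using System;

public static class NoParseGuard
{
    // Removes every occurrence of `token`, ignoring case, until none remain.
    public static string RemoveAll(string input, string token)
    {
        int index = input.IndexOf(token, StringComparison.OrdinalIgnoreCase);
        while (index != -1)
        {
            input = input.Remove(index, token.Length);
            // Search from the start again: the removal may have joined
            // two fragments into a fresh copy of the token.
            index = input.IndexOf(token, StringComparison.OrdinalIgnoreCase);
        }
        return input;
    }

    // Strips closing noparse tags from the user text, then wraps it.
    public static string Wrap(string unsafeText)
    {
        return "<noparse>" + RemoveAll(unsafeText, "</noparse>") + "</noparse>";
    }
}
```

With this, `NoParseGuard.Wrap("</nopa</noparse>rse><b>hi")` yields `<noparse><b>hi</noparse>`, whereas a single `Replace` call would have left `</noparse><b>hi` inside the wrapper.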



    EDIT: Hmm, the <noparse> method is actually pretty unfortunate, because if people copy the text they see, it copies WITH the noparse tags. I'm going to continue this in the other thread linked above.
     
    Last edited: Jun 18, 2019
  9. MrLucid72

    1.5 year bump - still seeking solution to this important issue.

    You cannot clear tags without some really nasty regex stripping. <noparse> works for uncopyable text only. However, for text that allows copy+paste, this copies ALL tags in between.
     
  10. Psycho8Vegemite

    I came across this problem a while ago and solved it by just detecting the rich text tags and removing them. The code below is easily expandable (imo) and works fast enough for my needs:

    https://gitlab.com/-/snippets/2031682
    Code (CSharp):
    public static string RemoveRichText(string input)
    {
        input = RemoveRichTextDynamicTag(input, "color");

        input = RemoveRichTextTag(input, "b");
        input = RemoveRichTextTag(input, "i");

        // TMP
        input = RemoveRichTextDynamicTag(input, "align");
        input = RemoveRichTextDynamicTag(input, "size");
        input = RemoveRichTextDynamicTag(input, "cspace");
        input = RemoveRichTextDynamicTag(input, "font");
        input = RemoveRichTextDynamicTag(input, "indent");
        input = RemoveRichTextDynamicTag(input, "line-height");
        input = RemoveRichTextDynamicTag(input, "line-indent");
        input = RemoveRichTextDynamicTag(input, "link");
        input = RemoveRichTextDynamicTag(input, "margin");
        input = RemoveRichTextDynamicTag(input, "margin-left");
        input = RemoveRichTextDynamicTag(input, "margin-right");
        input = RemoveRichTextDynamicTag(input, "mark");
        input = RemoveRichTextDynamicTag(input, "mspace");
        input = RemoveRichTextDynamicTag(input, "noparse");
        input = RemoveRichTextDynamicTag(input, "nobr");
        input = RemoveRichTextDynamicTag(input, "page");
        input = RemoveRichTextDynamicTag(input, "pos");
        input = RemoveRichTextDynamicTag(input, "space");
        input = RemoveRichTextDynamicTag(input, "sprite index");
        input = RemoveRichTextDynamicTag(input, "sprite name");
        input = RemoveRichTextDynamicTag(input, "sprite");
        input = RemoveRichTextDynamicTag(input, "style");
        input = RemoveRichTextDynamicTag(input, "voffset");
        input = RemoveRichTextDynamicTag(input, "width");

        input = RemoveRichTextTag(input, "u");
        input = RemoveRichTextTag(input, "s");
        input = RemoveRichTextTag(input, "sup");
        input = RemoveRichTextTag(input, "sub");
        input = RemoveRichTextTag(input, "allcaps");
        input = RemoveRichTextTag(input, "smallcaps");
        input = RemoveRichTextTag(input, "uppercase");
        // TMP end

        return input;
    }

    // Removes tags that carry a value, e.g. <color=#ff0000>, then the bare <tag> and </tag> forms.
    private static string RemoveRichTextDynamicTag(string input, string tag)
    {
        string prefix = $"<{tag}=";
        int index = input.IndexOf(prefix);
        while (index != -1)
        {
            int endIndex = input.IndexOf('>', index);
            if (endIndex == -1)
                break; // no closing '>': stop here, otherwise the loop would never end
            input = input.Remove(index, endIndex - index + 1);
            index = input.IndexOf(prefix);
        }
        // Also strips the bare opening tag, so that <noparse>, <nobr> and <mark> are covered.
        return RemoveRichTextTag(input, tag);
    }

    // Removes every <tag>; when isStart is true, every </tag> is removed afterwards as well.
    private static string RemoveRichTextTag(string input, string tag, bool isStart = true)
    {
        string token = isStart ? $"<{tag}>" : $"</{tag}>";
        int index = input.IndexOf(token);
        while (index != -1)
        {
            input = input.Remove(index, token.Length);
            index = input.IndexOf(token);
        }
        if (isStart)
            input = RemoveRichTextTag(input, tag, false);
        return input;
    }
     
  11. MrLucid72

    Hmm, any more graceful solution? We really could use a TMP .stripRichTags(opts)
     
  12. hk1ll3r

    TextMeshPro's current algorithm seems weird.
    It first replaces control sequences (\n, \t, \uXXXX, etc.) and then processes the tags. That prevents us from escaping special characters using \uXXXX. In every other rich text system, escape sequences bypass the tag processing algorithm.
    If that wasn't the case, a `SanitizeUserInput` helper would simply replace '<' and '>' with \uXXXX and \uYYYY.
    @Stephan_B thoughts?
     
  13. shivanshsaini17

    Code (CSharp):
    TextMeshProUGUI t = cards[playedCard].GetComponentInChildren<TextMeshProUGUI>();
    t.text = t.GetParsedText();
    This solves my problem.

    You can also set the richText boolean to false afterwards.
     
    Last edited: Jul 4, 2021
  14. mitaywalle

    Thanks!
     
  15. cuddlepunk

    this is more for anyone drifting through, since I imagine you've all now found a pretty nice solution, but I believe I have a relatively simple answer.

    Using regular expressions, it thankfully becomes simple to search a string for basic rich text tags, and then remove them while preserving the rest of the string.

    In the code featured below, the regular expression removes any instance of < >, alongside any characters found in between the arrows.

    Code (CSharp):
    using System.Text.RegularExpressions;
    Code (CSharp):

    Regex rich = new Regex(@"<[^>]*>");
    string text = "your text here";

    // Replace leaves the string unchanged when nothing matches; the IsMatch check only skips the call.
    if (rich.IsMatch(text))
    {
        text = rich.Replace(text, string.Empty);
    }
     
    Last edited: Nov 23, 2023
  16. Bamboy

    It's just really annoying to need a reference to a TMP component in order to call GetParsedText(); something like that should be a static method. The fact that it isn't a static method makes me concerned that the string it gives you may be different depending on the TMP component's settings. The documentation for this method is almost non-existent, aside from stating that the method exists.
     
  17. Foreman_Dev

    Thanks for this! I needed to remove rich text formatting in order for a text-to-speech service to read things properly, otherwise it was literally saying "B" for bolded text that used <b></b>.

    I ended up wrapping this into a string extension method for easy access throughout my project.
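
For anyone wanting the same convenience, one way such an extension method might look is sketched below. It reuses the `<[^>]*>` pattern from the regex post above; the class and method names (`RichTextStringExtensions`, `StripRichText`) are made up for illustration and are not Foreman_Dev's actual code.

```csharp
using System.Text.RegularExpressions;

public static class RichTextStringExtensions
{
    // Same pattern as the regex post above: '<', any run of characters other than '>', then '>'.
    private static readonly Regex AngleBracketSpan = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Returns the text with every <...> span removed; null and empty strings pass through.
    public static string StripRichText(this string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;
        return AngleBracketSpan.Replace(text, string.Empty);
    }
}
```

Usage is then `"<b>Hello</b> <color=#ff0000>world</color>".StripRichText()`, which gives `Hello world`. Note that the pattern removes anything between angle brackets, not only real tags, so `"1 < 2 and 3 > 2"` comes back as `"1  2"`. That is fine for text-to-speech, but lossy for chat.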
     
  18. jesper42

    Another solution.

    Replacing the normal "<" and ">" with their full-width versions (U+FF1C and U+FF1E) to prevent TextMeshPro tags in user input in chat.

    This still allows users to write funny "tags" <lol> and make ascii art.

    ps. As somebody wrote before, replacing "<>" with normal size unicode symbols doesn't work due to how TMP parses strings.


    Code (CSharp):
    protected static string DisableTags(string txt)
    {
        txt = txt.Replace("<", "\uFF1C");
        txt = txt.Replace(">", "\uFF1E");
        return txt;
    }
     
  19. shelim

    Found another hackish workaround:

    This solution escapes all tags so that they are displayed to the user as if rich text were disabled, while still allowing your own rich text tags elsewhere in the string. License: do whatever you want with it.

    Code (CSharp):
    using System.Text;
    using System.Text.RegularExpressions;


    public static class StringExtensions
    {
      private static readonly Regex _escapeRegex = new Regex("(<.*?>)", RegexOptions.Compiled);

      public static string EscapeTMP(this string str)
      {
        return _escapeRegex.Replace(str, "<noparse>$1</noparse>").Replace("</noparse></noparse>", "</noparse></<b></b>noparse>");
      }
    }
    Usage:

    Code (CSharp):
    string userdata = "I don't like your attitude!";
    tmp.text = "<b>User entered:</b> " + userdata.EscapeTMP();
    Explanation:

    This method first wraps each and every tag in <noparse> (without changing anything inside), and then applies a special-case step to keep </noparse> intact (if the user happens to write it directly, it will be preserved as well). Reasonably tested on malicious input (such as links).
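
To make the two steps concrete, here is a small standalone trace of the same transformation. The regex and the replacement strings are copied from the post; the wrapper class `EscapeDemo` is a hypothetical name, and the method is written as a plain static method so it can be tried outside Unity.

```csharp
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Copied from the post above: a lazy match for anything shaped like <...>.
    private static readonly Regex EscapeRegex = new Regex("(<.*?>)", RegexOptions.Compiled);

    public static string EscapeTMP(string str)
    {
        // Step 1: wrap every <...> span in its own noparse block.
        string wrapped = EscapeRegex.Replace(str, "<noparse>$1</noparse>");

        // Step 2: a user-typed </noparse> closes the block opened for it, leaving the
        // wrapper's own closing tag dangling; split that one with an empty <b></b> pair.
        return wrapped.Replace("</noparse></noparse>", "</noparse></<b></b>noparse>");
    }
}
```

For the input `<b>bold</b>` this returns `<noparse><b></noparse>bold<noparse></b></noparse>`. For the hostile input `</noparse><color=red>x` it returns `<noparse></noparse></<b></b>noparse><noparse><color=red></noparse>x`: the user's own `</noparse>` is consumed closing the first block, and the dangling wrapper tag is broken up by the empty bold pair, which is presumably why it shows up as literal text instead of being parsed.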

    ...

Unity, please give us proper built-in sanitization of user input.