Need help with checking if string exists in around 280k words text file

Discussion in 'Scripting' started by Luxurdo, Aug 11, 2022.

  1. Luxurdo


    Aug 11, 2022
    Hello! I am new to Unity and I am trying to make a word game.
    I load a text file and convert it into a list in my code. The issue is that whenever I check whether a word exists in the list, it says that it doesn't. The only word it finds is the one at the end of the list, which is "ZZZS". I tried Debug.Log(dictionary[0]) and it shows the first word in the text file, but if I enter that word in my game, it is detected as not a word. I don't think it's a case-sensitivity issue, because the game only uses capital letters and my text file uses capital letters as well. I think the problem is that my file has too many words.

    Here's my code:

    Code (CSharp):
    using System.Collections;
    using System.Collections.Generic;
    using UnityEngine;
    using TMPro;

    public class WordChecker : MonoBehaviour
    {
        public TextMeshPro display;
        public TextAsset Words;
        public string WordTest;
        private string theWord;

        void Start()
        {
            display = GameObject.Find("WordDisplay").GetComponent<TextMeshPro>();
        }

        public void OnMouseDown()
        {
            theWord = display.text;
            var content = Words.text;
            var AllWords = content.Split('\n');
            var dictionary = new List<string>(AllWords);

            if (dictionary.Contains(theWord))
            {
                Debug.Log("The word exists in the dictionary.");
            }
            else
            {
                Debug.Log("The word does not exist in the dictionary.");
            }
        }
    }

    If you happen to know a way to improve this code please let me know.

  2. mgear


    Aug 3, 2010
    Try printing the last item after splitting to see whether it gets added there (as the last line doesn't contain \n).
  3. kdgalla


    Mar 15, 2013
    Why create the dictionary over again every time someone clicks? Why not just do it once in Start or something?
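
    A minimal sketch of that idea, assuming the file holds one word per line. WordLookup is a hypothetical helper, not code from this thread, and swapping the List for a HashSet is an extra assumption so that each lookup doesn't scan all 280k entries. It is plain C# with no Unity dependency.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: parses the word file once and answers lookups from a set.
public sealed class WordLookup
{
    private readonly HashSet<string> words;

    public WordLookup(string content)
    {
        words = new HashSet<string>(StringComparer.Ordinal);
        foreach (string line in content.Split('\n'))
        {
            // Trim() removes surrounding whitespace, including a trailing '\r'.
            string word = line.Trim();
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
    }

    // Number of distinct words that were loaded.
    public int Count => words.Count;

    // Case-sensitive membership test.
    public bool Contains(string word) => words.Contains(word);
}
```

    In the WordChecker script, an instance could be created once in Start from Words.text and then reused in OnMouseDown.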
  4. oscarAbraham


    Jan 7, 2013
    Ah! Friend, you are another victim of the mess that is newline formats. I downloaded your file to make sure. Welcome to the club.

    Basically, Windows programs by default write new lines as "\r\n" instead of "\n". That is a carriage return before a line feed. It's so silly; they still treat text like a typewriter, where writing a new line meant returning the carriage to the first column and then moving the paper to the next row. And it will be like that until the end of civilization, because that's how tech works. Your file uses Windows-style newlines, so there's a '\r' char at the end of almost all the words (except the last one).
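
    To make the symptom concrete, here is a small sketch. The three-word string in the comments is a made-up stand-in for the real file, and NewlineDemo is a hypothetical name.

```csharp
using System;

public static class NewlineDemo
{
    // Splits on '\n' only, the same way the code in the first post does.
    public static string[] SplitOnLineFeed(string content)
    {
        return content.Split('\n');
    }

    // True if the array holds an entry exactly equal to word.
    public static bool Has(string[] entries, string word)
    {
        return Array.IndexOf(entries, word) >= 0;
    }
}

// With CRLF content such as "AAH\r\nAAHED\r\nZZZS":
//   SplitOnLineFeed(...) yields "AAH\r", "AAHED\r", "ZZZS"
//   Has(entries, "AAH")  is false, because the stored entry is "AAH\r"
//   Has(entries, "ZZZS") is true, because the last line has no line ending
```

    That matches what the first post describes: only the final word is found.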

    So, what can you do? You could split on the two-character sequence "\r\n" instead of using content.Split('\n'). Don't use content.Split('\r', '\n') because it will add an empty word for each "\r\n" pair. Don't use content.Split('\r', '\n', System.StringSplitOptions.RemoveEmptyEntries) because it's buggy in Mono in these kinds of cases; it will combine a lot of words into one sometimes.
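
    A rough sketch of the first two options. The LineSplitting class and its method names are made up for illustration, and accepting a bare "\n" as a fallback separator is my own assumption so that LF-only files still split.

```csharp
using System;

public static class LineSplitting
{
    // Splits on the two-character CRLF sequence; a bare LF is accepted as a fallback.
    public static string[] SplitOnNewlines(string content)
    {
        return content.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
    }

    // Treats '\r' and '\n' as independent separators, so every CRLF pair
    // leaves an empty entry between two words.
    public static string[] SplitOnEitherChar(string content)
    {
        return content.Split('\r', '\n');
    }
}

// For "AAH\r\nAAHED":
//   SplitOnNewlines   yields "AAH", "AAHED"
//   SplitOnEitherChar yields "AAH", "", "AAHED"
```

    Note that with StringSplitOptions.None a file that ends in a newline still produces one empty entry at the end, which is harmless for a Contains check on real words.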

    A solution that may be more solid is to process your file(s) to convert them from CRLF to LF; your code would then work without any change. There's software to do that, and some text editors on Windows have a setting for it already. That way, if someone on Mac or Linux ever authors a file, it will be compatible with your game.
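
    The same conversion can also be sketched in code, for example in a one-off script that rewrites the file. NewlineNormalizer is a hypothetical name, and also converting a lone "\r" is an extra assumption on my part.

```csharp
public static class NewlineNormalizer
{
    // Converts CRLF and lone CR line endings to LF.
    public static string ToLineFeeds(string content)
    {
        // CRLF is replaced first so the lone-CR pass cannot turn one line break into two.
        return content.Replace("\r\n", "\n").Replace("\r", "\n");
    }
}
```

    After this, splitting on '\n' as in the original script gives clean words.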

    That said, I think the best you could do is not to use spaces to separate words. That's a bug waiting to happen. The CSV format was made for these kinds of cases, so I'd start with that; it's just words separated by commas and it can be exported from Excel. Or you could use JSON; it's universal and it lets you add metadata to your words if you ever need it.

    I hope this helps.
    Last edited: Aug 12, 2022
  5. Luxurdo


    Aug 11, 2022
    It's finally working now! Thank you so much!
  6. Bunny83


    Oct 18, 2010
    Well, that's actually not silly and your interpretation is actually not quite true ^^. The "LF" character is called the "line feed" character (\n). It usually is not supposed to return the carriage to the first column. Yes, a lot of other systems use this convention, but they actually interpret the control characters incorrectly.

    Btw: HTTP is also still using \r\n as a line delimiter and that's true for all systems. Pretty much all text-line-based protocols use \r\n.

    Don't get me wrong, I also think a single character would make our lives a lot easier. IBM mainframes actually used a completely separate new line character, "NL" (0x15), though none of the other systems have adopted it. So the issue isn't really the two-character delimiter but the fact that each system rolls its own interpretation. The classic MacOS only used a single "\r", which is equally misinterpreted. Since there are essentially 3 different interpretations (and \r\n would be the only "correct" one), those are causing our headache in the first place. In the end it's just a matter of sticking to a standard, though none of the major systems are willing to let go of their interpretation. Well, MacOS essentially switched to the Unix interpretation. However, you always have a mix of legacy applications and newer ones. You cannot simply decide to switch to a new system "just because". Responsible companies do not break their standard, ask Linus Torvalds :)
  7. oscarAbraham


    Jan 7, 2013
    :) That was very illuminating, thanks. About this particular quote, the character I meant to say returns the carriage to the first column is "\r". I understand that there's a context where CRLF is more correct than LF, I just think it's a bit funny that this context is related to typewriters.