Best free word database?

Discussion in 'General Discussion' started by yoonitee, Feb 4, 2015.

  1. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    I'm looking for a simple list of words split into nouns, verbs, adjectives, etc. to do some simple sentence parsing. Mainly just for fun, but also maybe for a game. So I want to give it the sentence "The cow jumped over the moon." and have it know that "cow" is a noun, "jumped" is the verb, etc. Just as a first guess. I am thinking of making something that takes that sentence and then transforms it into an animation of a cow jumping over the moon.

    I've looked at WordNet but it seems a bit overcomplicated for my purposes. (They define "cat" as a verb to beat with a cat-o-9-tails. That is not likely to come up!) Any other suggestions?

    I've done similar things before but I just compiled my own list of words.

    I mean I could just get any dictionary and parse that to find the type of each word, but that would take a bit longer.
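
    A minimal sketch of that kind of first-guess lookup, assuming a tiny hand-compiled word list. The `WORD_TYPES` table and the `first_guess_tags` name are made up for illustration and do not come from WordNet or any other database:

```python
import re

# Hypothetical hand-compiled word list; a real one would be far larger.
WORD_TYPES = {
    "the": "determiner",
    "cow": "noun",
    "moon": "noun",
    "jumped": "verb",
    "over": "preposition",
}

def first_guess_tags(sentence):
    """Return (word, type) pairs; words missing from the list are 'unknown'."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [(word, WORD_TYPES.get(word, "unknown")) for word in words]

print(first_guess_tags("The cow jumped over the moon."))
```

    A flat lookup like this gives each word exactly one type, so it cannot tell a noun use from a verb use of the same word.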

    Anyone else done this sort of thing?
     
    Last edited: Feb 4, 2015
  2. Marble

    Marble

    Joined:
    Aug 29, 2005
    Posts:
    1,266
    Great question. I looked into this a while ago and couldn't find anything either.
     
  3. zombiegorilla

    zombiegorilla

    Moderator

    Joined:
    May 8, 2012
    Posts:
    8,952
    yoonitee likes this.
  4. Dameon_

    Dameon_

    Joined:
    Apr 11, 2014
    Posts:
    542
    You need more than a list of words to do sentence parsing. A lot more. Especially with English, which has such a loose sentence structure. A lot of it is context and positioning.

    For example, with the sentence "the cow jumps over the moon", we have a context. A poem wherein a cow jumps from one side of the moon to the other. However, you could take the same sentence and process it to mean "over the moon, the cow jumps." How does your program know whether to animate a cow hovering over the moon jumping up and down, or a cow flying from one side of the moon to the other?

    Natural language processing is fun, but hard, because computers require much more explicit directions than people.
     
  5. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    It would work like this:

    First it sees that cow is the subject of the sentence. Then it sees that cows usually live on farms. So it draws a farm with a cow in it. Then applies the animation "jump" to the cow. Then it might have to zoom out a bit to encompass both the earth and moon system. Admittedly, this one would be hard to animate in a physically realistic way.

    As to your point, it would know that "jump over" doesn't mean you are over something and jumping!!

    Take "The cat chased the mouse", however. It would see that cats usually live in houses, so it would draw a room with a cat. Chase is a type of run, so it would animate the cat running. And then it would draw a mouse. That one's quite simple.

    It won't get all the sentences but I think with a few actions like:

    "chase", "eat", "kill", "jump", "sit", "stand", "walk", "run", "talk"

    and a few models such as houses, bodies, animals, and so on, you could do quite a lot.

    "The man walked into the house" could be easily animated. (It might animate him walking through the wall, but it's close enough!)

    "Goldilocks ate the porridge." is fine. (As long as a previous sentence had said "There once lived a girl called Goldilocks with yellow hair.")

    "She had never felt these strange feelings before." might be a bit more tricky!

    It would mean writing practically the whole of the Sims + City Tycoon + Monkey Island + Scribblenauts into one program but I think it's doable! (Something to do in the next ten years!)

    I'd like to do it open source and kind of collaborative too. But I don't know how to prevent people just abusing the system, such as by adding "cats are a type of cutlery" and things.
     
    Last edited: Feb 4, 2015
  6. Aldo

    Aldo

    Joined:
    Aug 10, 2012
    Posts:
    173
    I was gonna tell you to get a dictionary txt and parse it into an SQL DB, but your examples make it harder.

    You say "Cow" and the game knows that cows live on farms. What if the player says Wolf? Dog, cat, unicorn, etc...

    What you would need is to limit your possibilities.

    Subjects
    Places
    Actions

    So you can have 100+ animals
    20 places
    50 actions

    And let the player work with that. Like scribblenauts
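
    A rough sketch of such a closed vocabulary, assuming three small hand-written sets. The set contents and the `classify` and `unsupported_words` names are placeholders for illustration:

```python
# Hypothetical closed vocabulary; the real lists would hold the
# 100+ animals, 20 places and 50 actions suggested above.
SUBJECTS = {"cow", "wolf", "dog", "cat", "unicorn"}
PLACES = {"farm", "forest", "house"}
ACTIONS = {"jump", "run", "sit"}

def classify(word):
    """Return the category of a word, or None if it is outside the vocabulary."""
    if word in SUBJECTS:
        return "subject"
    if word in PLACES:
        return "place"
    if word in ACTIONS:
        return "action"
    return None

def unsupported_words(words):
    """List the words the game would have to reject or ask about."""
    return [word for word in words if classify(word) is None]

print(unsupported_words(["cow", "jump", "spaceship"]))
```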
     
  7. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    I think this is a colossal enough task without getting the computer to think of random other stuff to ambiguously bring into consideration.

    To get an idea of how much more complex that makes the problem, consider: what if the sentence was changed to "The escaped cow jumped over the moon"? To handle this you'd have to somehow consider not only the assumptions you're making about words (eg: that cows are associated with farms) but also all possible assumptions that any word can subvert in any other word (eg: "escaped" decouples the "cow" to "farm" association... how many other such things will you have to know?).
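
    To make the "escaped cow" case concrete, here is a small sketch that assumes default settings live in one table and the modifiers that cancel them in another. Both tables and the `setting_for` name are hypothetical, and it covers only this single exception; the number of such exceptions is open-ended:

```python
# Hypothetical default associations between nouns and settings.
DEFAULT_SETTING = {"cow": "farm", "cat": "house"}

# Hypothetical modifiers that break the noun-to-setting association.
SETTING_BREAKERS = {"escaped"}

def setting_for(noun, modifiers=()):
    """Return the assumed setting, or None if a modifier cancels it."""
    if any(modifier in SETTING_BREAKERS for modifier in modifiers):
        return None
    return DEFAULT_SETTING.get(noun)

print(setting_for("cow"))
print(setting_for("cow", ["escaped"]))
```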
     
  8. delinx32

    delinx32

    Joined:
    Apr 20, 2012
    Posts:
    417
    yoonitee likes this.
  9. zombiegorilla

    zombiegorilla

    Moderator

    Joined:
    May 8, 2012
    Posts:
    8,952
    If you are really interested in this type of thing, you should spend some time looking at CYC (openCYC). http://www.cyc.com/platform/opencyc
    It is intensely fascinating stuff. The project was originally started as a way to prevent emergent computer intelligences in the future from making simple errors in logic like those that drove HAL to decide that killing his crew was an acceptable course so he didn't have to lie to them. Yes, the whole thing was inspired by the events in 2001.

    CYC is a repository of knowledge and a set of structures/formulas to understand relationships and parse statements. It's kind of a way to formalize common knowledge. It was part of the early exploration in AI, and of understanding how we learn and parse information. Among other things, it uses small assertions in context to understand and parse larger concepts. A huge chunk of our knowledge comes from understanding complex relationships and inferences without having to actually "think" about it, or parse it out. It's pretty fascinating, and often a bit funny.
    Some examples (most of which CYC can parse):

    If you smash a chunk of wood, you get several smaller chunks of wood. But if you smash a table, you don't get several smaller tables.

    Fred saw the plane flying over Zurich.
    Fred saw the mountains flying over Zurich.

    Babies cannot be doctors.

    Probably not very practical for making word games, but pretty interesting stuff, and it actually has been used for some types of games.
     
    angrypenguin likes this.
  10. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    Yeah, I had a look at that. I think what's missing from some of these databases is the geometric relationships between things. For example, it might know "A cat has four legs", but it should also have some kind of internal model of a cat so it knows where those legs are. And a "house has a door", but shouldn't it also have a typical house model in its memory, geometrically made of cubes and prisms, etc.? For example, if it had a 3D model of a typical mammal and was then told "A giraffe is a yellow mammal with a long neck and horns", it could take its typical mammal, stretch its neck bones, and then have a pretty good idea of what a giraffe looks like.
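
    A small sketch of the giraffe idea, assuming the "typical mammal" is just a dictionary of labelled parts with lengths. The `TYPICAL_MAMMAL` data, the `derive` function and the stretch factor of 6 are all invented for illustration; a real version would operate on a rigged 3D model:

```python
import copy

# Hypothetical stand-in for a 3D model: labelled parts with rough sizes.
TYPICAL_MAMMAL = {
    "colour": "brown",
    "parts": {
        "neck": {"length": 1.0},
        "legs": {"count": 4, "length": 1.0},
    },
}

def derive(base, colour=None, stretch=None, extra_parts=None):
    """Return a modified deep copy of base; base itself is left untouched."""
    model = copy.deepcopy(base)
    if colour is not None:
        model["colour"] = colour
    # Scale the length of each named part by its factor.
    for part, factor in (stretch or {}).items():
        model["parts"][part]["length"] *= factor
    # Add parts the base model does not have.
    model["parts"].update(copy.deepcopy(extra_parts or {}))
    return model

# "A giraffe is a yellow mammal with a long neck and horns."
giraffe = derive(
    TYPICAL_MAMMAL,
    colour="yellow",
    stretch={"neck": 6.0},
    extra_parts={"horns": {"count": 2}},
)
print(giraffe)
```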

    So, I'd like to combine this verbal database with a visual database of things.

    One thing that would be difficult, however, is inheritance. For example, if it had a 3D model of a bird with all the parts labelled and you then wanted to define a penguin, you would need to transform the bird model into a penguin shape. Or alternatively you would need a separate penguin model, but then you would have to match up all the parts. It gets kind of tricky at that point!

    On the other hand, a game like the Sims has the opposite problem. It contains all the 3D models but doesn't have the verbal database. So the Sims just speak in Simspeak!

    But the idea is to start with a few objects, a few actions, and simple sentences and just build it from there.

    Ultimately, I want it to be able to answer things like "How many penguins can you fit in a car?" Then it should say, "I've thought about it, and in a typical car you can fit about 7 typical penguins inside."
     
    Last edited: Feb 5, 2015
    angrypenguin likes this.
  11. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    As an example of what I'm talking about, consider the meanings of "into", "onto" and "over". They are really geometric descriptions of the path of an animation.

    into(x) = path{not in x, in x}
    onto(x) = path{not on x, on x}
    over(x) = path{not above x, above x, not above x (other side)}
    towards(x) = path{far from x, near to x}

    So each of these words could be translated into animations fairly easily with Bezier curves. Maybe with some ambiguity but that ambiguity is present in human thought too.
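
    One possible way to sketch this: store each preposition as its sequence of spatial relations, and turn "over" into a quadratic Bezier arc. The 2D coordinates (y is up), the `clearance` parameter and the function names are assumptions made for this example:

```python
# Relation sequences copied from the definitions above.
PATHS = {
    "into": ["not in x", "in x"],
    "onto": ["not on x", "on x"],
    "over": ["not above x", "above x", "not above x (other side)"],
    "towards": ["far from x", "near to x"],
}

def quadratic_bezier(p0, p1, p2, t):
    """Return the point at parameter t (0 to 1) on a quadratic Bezier curve."""
    u = 1.0 - t
    return tuple(u * u * a + 2 * u * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))

def over_path(start, end, top_of_x, clearance=1.0, steps=5):
    """Sample a 2D arc from start to end whose midpoint lies above x."""
    peak = top_of_x + clearance
    # A quadratic Bezier's midpoint is 0.25*p0 + 0.5*p1 + 0.25*p2, so solve
    # for the control point height that puts the midpoint at `peak`.
    control = ((start[0] + end[0]) / 2.0,
               2.0 * peak - 0.5 * (start[1] + end[1]))
    return [quadratic_bezier(start, control, end, i / (steps - 1))
            for i in range(steps)]

arc = over_path((0.0, 0.0), (4.0, 0.0), top_of_x=3.0)
print(arc)
```

    The first and last samples sit on either side of x and the middle sample is above it, which matches the three stages of over(x).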

    We describe even longer paths in sentences such as "The cat ran over the hill, into the forest, and up a tree."

    I think that "jumped over" and "jumped above" have two different meanings. One means that you are only above for a short period.
     
    Last edited: Feb 6, 2015
  12. delinx32

    delinx32

    Joined:
    Apr 20, 2012
    Posts:
    417
    That cat stayed over the dogs house, and played games into the night, and stayed up all of it.
     
  13. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    "That cat"= cat is subject, "that" referring to a cat in visual range.
    "stayed over" = common phrase meaning to stay in a place over night. Most commonly referring to humans. Thus interpret cat as anthropomorphised.
    "the dogs house" = phrase probably missing apostrophe interpret as "the dog's house" = "house of dog". Most commonly this is a "kennel".
    "played games" = verb=play, object=games, subject=cat, as before.
    "into [time period x]" = timepath{before x, during x}
    "stayed up" = common phrase meaning to keep awake, usually followed by a time quantifier.
    "all of {x}" = logical phrase referring to something which has quantity.
    "it" = looking for match in set {cat,dog,house,night}, highest probability corresponds to "night" as this is connected with "staying up" and is nearest word.

    QED.
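
    The last step (matching "it") could be sketched as a scoring function, assuming one point for an association with the nearby phrase plus a bonus for being close to the pronoun. The `ASSOCIATIONS` table, the weights and the word indices are all hypothetical:

```python
# Hypothetical association table: nouns that a phrase tends to go with.
ASSOCIATIONS = {"stayed up": {"night"}}

def resolve_pronoun(candidates, pronoun_index, nearby_phrase):
    """Pick the best antecedent from (noun, word_index) pairs.

    Score = 1 if the noun is associated with nearby_phrase, plus a
    recency bonus of 1 / distance to the pronoun.
    """
    related = ASSOCIATIONS.get(nearby_phrase, set())

    def score(candidate):
        noun, index = candidate
        association = 1.0 if noun in related else 0.0
        recency = 1.0 / (pronoun_index - index)
        return association + recency

    return max(candidates, key=score)[0]

# Word positions in "That cat stayed over the dogs house, and played
# games into the night, and stayed up all of it." ("it" is word 18).
candidates = [("cat", 1), ("dog", 5), ("house", 6), ("night", 12)]
print(resolve_pronoun(candidates, 18, "stayed up"))
```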
     
  14. delinx32

    delinx32

    Joined:
    Apr 20, 2012
    Posts:
    417
    Everything you say is true, but you used your human brain to figure it out. There is no word database that could do that for you without you programming the rules. Sure, "stayed over" is a common phrase in context, but what about "The sun stayed over the horizon throughout the day"? There, "stayed over" has a completely different contextual meaning.

    Human language is filled with nuance; computers are really bad at nuance, but really good at rigid rules.
     
  15. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    True. But people still write dictionaries even though they might not capture all nuances. It is still a worthwhile endeavour. Well, some people might disagree.

    I think most dictionaries worth their salt will have these definitions, e.g. this one.

    As for your example, you would have to encode:
    "x stayed over y" where x typeOf lifeform, y typeOf someone's abode.
    otherwise default to the more literal meaning.
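
    As a rough sketch of that rule, assuming a small table that maps words to types. The `TYPES` entries, the `interpret_stayed_over` name and the two result strings are placeholders:

```python
# Hypothetical type table for a handful of words.
TYPES = {
    "cat": "lifeform",
    "man": "lifeform",
    "kennel": "abode",
    "house": "abode",
    "sun": "star",
    "horizon": "place",
}

def interpret_stayed_over(x, y):
    """Apply the idiom only when x is a lifeform and y is someone's abode."""
    if TYPES.get(x) == "lifeform" and TYPES.get(y) == "abode":
        return "stayed overnight at"
    # Otherwise default to the more literal meaning.
    return "remained above"

print(interpret_stayed_over("cat", "kennel"))
print(interpret_stayed_over("sun", "horizon"))
```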

    In my opinion a human mind is just running through all the possibilities and seeing which one makes most sense. A computer could do just the same.

    The alternative is to let a computer self-learn. But then we can't be sure, just as with some humans, that it hasn't learned a lot of wrong things.

    But these are interesting examples I hadn't thought of! "stayed over" is surely just a shortened version of "stayed overnight at".

    Yes, context is definitely very important! But I think what you would do is just build something simple and then add on these more nuanced definitions as they crop up. Just like teaching a child. I think a bigger problem with my idea is that most sentences don't have a visual meaning, like most of the sentences I've written in this post! (How do we even make sense of sentences if we can't picture them in our minds? That's kind of weird, don't you think?)
     
    Last edited: Feb 7, 2015
  16. delinx32

    delinx32

    Joined:
    Apr 20, 2012
    Posts:
    417
    It's not that I don't think it's possible, I just think that you'll spend an awful lot of time parsing the English dictionary to determine what is a "thing, place, person, animal, etc", then developing the rules for how those things interact with common phrases. That's probably going to be the hardest part. Heck, you could probably write the code to parse the sentences in a couple of days, but the data entry is going to be huge.

    You may want to crowd source your data, ie:

    User types:
    "I spent the night in the lair of the aquadon!"

    You may not have "lair" in your database so you ask:

    "Was the lair similar to a: a) cave, b) apple, or c) form of currency"

    Or, maybe link to an online thesaurus to see if it matches any other words that you do have in your dictionary.
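
    The multiple-choice question could be sketched like this, assuming the lexicon is a plain dictionary from word to category. The `build_question` and `learn_from_answer` names and the lexicon contents are invented for the example:

```python
def build_question(unknown_word, options):
    """Format a multiple-choice question about an unknown word."""
    labels = "abcdefghijklmnopqrstuvwxyz"
    choices = ", ".join(f"{labels[i]}) {option}"
                        for i, option in enumerate(options))
    return f"Was the {unknown_word} similar to a: {choices}?"

def learn_from_answer(lexicon, unknown_word, chosen_option):
    """Give the new word the same category as the option the user picked."""
    lexicon[unknown_word] = lexicon[chosen_option]

# Hypothetical lexicon mapping known words to categories.
lexicon = {"cave": "place", "apple": "food", "coin": "form of currency"}

print(build_question("lair", ["cave", "apple", "coin"]))
learn_from_answer(lexicon, "lair", "cave")
print(lexicon["lair"])
```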

    And of course an aquadon is something I made up, so you might say: "Describe the aquadon"

    "Oh, it's a creature, kind of looks like a whale, but a bit like a brontosaurus, but it has fangs like a saber tooth cat, a green one, with blue stripes. Growls a lot, sleeps all night, and eats skittles."

    I say good luck, and I would look forward to seeing any project that could actually do this. You'd probably get hired by the Siri, Google Now, or Cortana team on the spot and make a gazillion dollars.
     
    yoonitee likes this.
  17. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    Is a gazillion

    a) a type of dinosaur b) a type of arch c) a type of boat

    Actually I found this ConceptNet, which is similar to what you said. I think it is crowd sourced too. Like I said, it has the semantic information but probably not the geometric information, such as that noses are usually in the middle of a face, below the eyes and above the mouth. Or things of that nature.

    But there are so many mistakes in ConceptNet that it is rendered useless. Try looking up the definition of "the", for example. My other strategy is to research the minimum number of words needed to form a language and just restrict it to those. Like this one.

    I think I might make it more into a super-high level programming language such as:

    Code (CSharp):
    Create a yellow cat.  //or maybe "Imagine a yellow cat"
    Create a white mouse.
    Make the cat chase the mouse.  //or maybe "Imagine the cat chases the mouse."
    Create a big dog.
    If the cat is near the dog make the dog bark.
    Let a giraffe be a yellow mammal with a long neck and horns and brown spots.
    Create a giraffe next to the house.
    Make all things that are chasing mice explode.
    etc.
    Then you should be able to add modules to it as header files. Because then it might be useful for games!
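
    A sketch of how the simplest of these statements might be parsed, assuming a fixed vocabulary of adjectives and nouns. Only the "Create a ..." form is handled; the word sets and the `parse_create` name are hypothetical:

```python
# Hypothetical vocabulary for the "Create a <adjectives> <noun>." form.
ADJECTIVES = {"yellow", "white", "big"}
NOUNS = {"cat", "mouse", "dog"}

def parse_create(line):
    """Parse 'Create a yellow cat.' into a noun and its adjectives.

    Returns None for anything that is not a simple create statement
    built from known words.
    """
    words = line.strip().rstrip(".").split()
    if len(words) < 3:
        return None
    if words[0] not in ("Create", "Imagine") or words[1] not in ("a", "an"):
        return None
    *adjectives, noun = words[2:]
    if noun not in NOUNS or any(a not in ADJECTIVES for a in adjectives):
        return None
    return {"noun": noun, "adjectives": adjectives}

print(parse_create("Create a yellow cat."))
print(parse_create("Create a giraffe next to the house."))
```

    Statements with conditions or positions ("If the cat is near the dog...", "next to the house") would need their own patterns; this sketch simply rejects them.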
     
    Last edited: Feb 8, 2015
  18. delinx32

    delinx32

    Joined:
    Apr 20, 2012
    Posts:
    417
    If you made your own small language then you could monetize via microtransactions through word packs. You could maybe start with a list of 50 animals, places and things, and then add more via 99 cent app store downloads. It would also let you control how things are drawn, since you'll know all the possible combinations ahead of time. It would simplify your life a great deal too :)
     
  19. yoonitee

    yoonitee

    Joined:
    Jun 27, 2013
    Posts:
    2,364
    We almost posted the exact same idea at the same time! :eek: Yeah I like the idea of adding on extra modules.

    Although there's something called Inform7 which seems very similar! Also this.

    Another advantage of linking a word database with some geometric data is that you could watch the system dream of new things such as "A green penguin driving a tractor".
     
    Last edited: Feb 9, 2015