Automatic Thesaurus
Last week I landed on another PhD-worthy research project.
Given a very large corpus of sentences, such as a digitized version of the Library of Congress or a less noisy version of the Internet, how can you automatically generate a thesaurus?
At first I thought the problem should be fairly easy, but the more I thought about it, the more difficult and daunting the task became. For example, as a first approach, we might assume that textual substitution would be a good proxy for identifying synonymous terms. That is, if two terms really are synonymous, then they ought to be substitutable for each other in a sentence. With a large enough set of sentences, we should be able to identify such situations, and thereby bootstrap the building of the thesaurus. But there’s a small problem, provided by my good friend EvB:
The sky is blue.
The ocean is blue.
But sky is not the same as ocean. Sure, they are similar. A poet could compose a nice metaphor of fish swimming through their sky above the bottom feeders. But this metaphorical relationship isn’t one that would necessarily make it into a human-compiled thesaurus. So textual substitution can easily lead us astray.
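As a quick sanity check, the substitution heuristic can be sketched in a few lines of Python. The toy corpus, the sentence-frame construction, and the pairing rule are all invented for illustration, and the output shows exactly the failure described above:

```python
from collections import defaultdict
from itertools import combinations

# Toy sketch of the substitution heuristic: words that fill the same
# sentence "frame" (the sentence with one word blanked out) are
# treated as candidate synonyms. The corpus is made up.
corpus = [
    "the sky is blue",
    "the ocean is blue",
    "the sky is vast",
    "the ocean is vast",
]

frames = defaultdict(set)  # frame -> set of words that fill the blank
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        frame = tuple(words[:i]) + ("_",) + tuple(words[i + 1:])
        frames[frame].add(w)

candidates = set()
for fillers in frames.values():
    for a, b in combinations(sorted(fillers), 2):
        candidates.add((a, b))

print(sorted(candidates))
# -> [('blue', 'vast'), ('ocean', 'sky')]
# Substitutability alone can't tell synonyms from merely similar words.
```

The heuristic happily pairs sky with ocean (and blue with vast), which is exactly the EvB problem.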
Continuing with EvB’s particularly good example, we can identify another problem. Suppose that we incorporate a bit of natural language understanding, enough to pull out parts of speech. Then the system would readily equate sky with blue, or ocean with blue. But neither of these equations is true either. Usually people take the sentence to mean not that sky and blue are the same thing, but that the sky belongs to the set of objects that have a property called color, the value of which is blue. So this understanding depends on what the definition of ‘is’ is (obviously not a simple affair). We would also like to avoid drawing relationships between pronouns and the rest of the language.
Next, let’s look at how people tend to write. Any good library is going to be full of metaphor, simile, pun, allusion, word play, sound play, and other highly nuanced expression. All of these will trump any reasonably simple attempt at drawing a link between synonymous words. Political propaganda and polemic will probably be particularly prone to equating terms that should be kept logically distinct. Furthermore, at least when I write, I’m reminded of other things during the process, things that are associated but not necessarily synonymous. These remindings are an important part of the essay-writing process, but they will certainly throw noise into the digital library.
But if it’s so hard to make a mechanical system for identifying synonyms, then how do humans do it? Here I have a hypothesis: similar words stimulate similar patterns in the brain. Thus when a human tries to think up synonyms, it’s really the same as playing word association with a filter. First, the word stimulates the brain, bringing up certain associations. These associations will be based on ‘brain distance’, a measure of the similarity of brain activity for certain words and thoughts. But some associations will be radically different from the synonyms we’re looking for. For example, antonyms and non sequiturs often come up in word association games. So a filter is applied to weed these out, and what’s left is passed through a dictionary/meaning check. Anything that passes this process is reported as a synonymous term.
So, in order to really generate a thesaurus, we do need AI (or at least an underlying cognitive model). When I first thought of the thesaurus problem, I was hoping that it was pared down enough, small and simple enough, that it would be doable without all this complexity. We might have to reduce the problem further, make it looser. Say, build an association dictionary rather than a thesaurus. An association dictionary might be possible because it forgoes the understanding of meaning and similarity: it doesn’t have to question or measure why two words should be associated, only record that they are used similarly.
So, if you can automate the building of a thesaurus, you should get a PhD in Linguistics.
There’s another problem that I didn’t flesh out in the piece above:
Some synonyms are not grammatically substitutable. For example, wonder and suppose are synonymous. Yet the sentence “I wonder whether xyz” is grammatically correct, while “I suppose whether xyz” is not. Because of this, it will be very difficult to discover synonyms without a natural language parser. I’d prefer to uncover everything using textual analysis, because of its relative simplicity, but this example demonstrates that we really do have to move beyond text and toward syntax.
I met with Ray (a linguistics major) about this issue, and together we found a crux of the current approaches: symmetry.
Most word association programs look for things like ‘both words occur in the same sentence’ or ‘both words are within 5 words of each other’. The problem is that these are actual metrics in an abstract space, and as such they have the mathematically nice property of symmetry. Unfortunately, after assembling all the words into a graph, with edge weights representing whatever metric was chosen, the graph is undirected because of the symmetry in the metric. When I look at an actual thesaurus, I don’t necessarily see this symmetry. I can look up X and find that Y is a synonym, but then look up Y and not find X as a synonym.
Clearly, if we are to make a realistic thesaurus we must break the symmetry in our lower-level metric. We need a quasi-metric. The easiest way that I could think to break the symmetry is to take English word order into account. A more complicated way might be to find a tree-distance in a sentence diagram. But I’m still hoping that understanding parts of speech isn’t necessary.
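One way to sketch an order-sensitive measure of this kind: count how often one word precedes another within a small window. Everything here (the corpus, the window size) is a made-up illustration rather than a vetted quasi-metric, but it shows how word order breaks the symmetry:

```python
from collections import Counter

# Order-sensitive association: count how often w1 precedes w2
# within a small window. Window size and corpus are arbitrary.
WINDOW = 3
corpus = [
    "i wonder whether it will rain",
    "i suppose it will rain",
    "it will rain i suppose",
]

precedes = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w1 in enumerate(words):
        for w2 in words[i + 1 : i + 1 + WINDOW]:
            precedes[(w1, w2)] += 1

# Unlike a true metric, this score is asymmetric:
print(precedes[("i", "wonder")], precedes[("wonder", "i")])
# -> 1 0
```

Because `precedes[(a, b)]` generally differs from `precedes[(b, a)]`, the resulting graph is directed, which is exactly the symmetry-breaking the thesaurus seems to need.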
Last night I spent some time with my friends Elias and Alex, and Alex gave me a workable solution to the problem. If we’re willing to piggyback on human effort, then it should be possible to automatically build an association map between words of different languages, by examining large amounts of human-translated works. It should be easier to identify synonyms across languages than synonyms within a language.
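Alex’s idea can be sketched as follows. Given sentence-aligned translations, two English words that keep lining up with the same foreign word become candidate synonyms. The tiny English/Spanish pairs and the co-occurrence test are invented for illustration, and the test is deliberately naive:

```python
from collections import defaultdict

# Toy sentence-aligned bitext (invented). Real corpora would be
# large collections of human-translated works.
aligned = [
    ("the house is big", "la casa es grande"),
    ("the home is big", "la casa es grande"),
    ("the house is small", "la casa es pequena"),
]

cooc = defaultdict(lambda: defaultdict(int))  # english -> foreign -> count
for en, es in aligned:
    for e in en.split():
        for f in es.split():
            cooc[e][f] += 1

def shares_translation(w1, w2):
    """Naive candidate-synonym test: do w1 and w2 co-occur with at
    least one common foreign word?"""
    return bool(set(cooc[w1]) & set(cooc[w2]))

print(shares_translation("house", "home"))  # both line up with "casa"
# -> True
```

A real system would need proper word alignment rather than whole-sentence co-occurrence (here even “house” and “big” share translations), but the piggybacking principle is the same.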
So I envision the approach as follows:
This isn’t exactly what I want, because I still think it should be possible to just read in a bunch of sentences in one language and spit out a thesaurus for that language, but at least it’s something.
While it may not be possible to build a thesaurus using the cross-language association maps described above, this helped to generate another idea.
If we make an asymmetric map of a single language, we can think of the result as a directed graph. Then a centrality measure of some kind should yield the most important words in that language. Hypothesis: this list will be similar (but not identical) across languages. It should also reveal a bit about how we comprehend the world.