Natural Language Processing

John Batali
Department of Cognitive Science
University of California at San Diego

================================================================

The term "natural languages" refers to the languages that people speak, like English and Japanese and Swahili, as opposed to artificial languages like programming languages or logic.

NLP = "Natural Language Processing". Programs that deal with natural language in some way or another.

"Computational Linguistics". Doing linguistics with computers. Related to NLP, but sometimes explicitly linguistic, for example building models of linguistic theories to test their properties, without any real desire to use them for interfaces or any other application.

Uses for NLP:

User interfaces -- better than obscure command languages. It would be nice if you could just tell the computer what you want it to do. Of course we are talking about a textual interface -- not speech.

Knowledge acquisition -- programs that could read books and manuals or the newspaper, so you don't have to explicitly encode all of the knowledge they need to solve problems or do whatever they do.

Information retrieval -- find articles about a given topic. The program has to be able somehow to determine whether the articles match a given query.

Translation -- it sure would be nice if machines could automatically translate from one language to another. This was one of the first tasks they tried applying computers to. It is very hard.

================================================================

Linguistic Levels of Analysis

Language obeys regularities and exhibits useful properties at a number of somewhat separable "levels".

Think of language as transfer of information. It is much more than that. But that is a good place to start. Suppose that the speaker has some meaning that they wish to convey to some hearer. Speech (or gesture) imposes a linearity on the signal. All you can play with is the properties of a sequence of tokens.
Actually, why tokens? Well, for one thing, that makes it possible to learn. So the other thing to play with is the order in which the tokens can occur.

So somehow, a meaning gets encoded as a sequence of tokens, each of which has some set of distinguishable properties, and is then interpreted by figuring out what meaning corresponds to those tokens in that order.

Another way to think about it is that the properties of the tokens and their sequence somehow "elicit" an understanding of the meaning. On this view, language is a set of resources that enable us to share meanings, but isn't best thought of as a means for *encoding* meanings. This is a sort of philosophical issue perhaps, but if this point of view is true, it makes much of the AI approach to NLP somewhat suspect, as it is really based on the "encoded meanings" view of language.

The lowest level is the actual properties of the signal stream:

    phonology -- speech sounds and how we make them
    morphology -- the structure of words
    syntax -- how the sequences are structured
    semantics -- meanings of the strings

There are important interfaces among all of these levels. For example, sometimes the meaning of a sentence can determine how individual words are pronounced.

At least this many levels are obviously needed. But language turns out to be more clever than this. For example, language can be more efficient by not having to say the same thing twice, so we have pronouns and other ways of making use of what has already been said:

    A bear went into the woods. It found a tree.

Also, since language is most often used among people who are in the same situation, it can make use of features of the situation:

    this/that
    you/me/they
    here/there
    now/then

The mechanisms that make use of features of the context, whether it is the context created by a sequence of sentences or the actual context where the speaking happens, are called "pragmatics".

Another issue has to do with the fact that the simple model of language as information transfer is clearly not right.
For one thing, we know there are at least the following three types of sentences:

    statements
    imperatives
    questions

And each of them can be used to do a different kind of thing. The first *might* be called information transfer. But what about imperatives? What about questions? To some degree the analysis of such sentences can involve the ideas of a basic notion of meaning. This leads to the notion of "speech acts".

There are other, higher levels of structuring that language exhibits. For example there is conversational structure, where people know when they get to talk in a conversation, and what constitutes a valid contribution. There is "narrative structure" whereby stories are put together in ways that make sense and are interesting. There is "expository structure" which involves the way that informative texts (like encyclopedias) are arranged so as to usefully convey information. These issues blend off from linguistics into literature and library science, among other things. Of course with hypertext and multi-media and virtual reality, these higher levels of structure are being explored in new ways.

================================================================

Issues in Syntax

For various reasons, a lot of attention in computational linguistics has been paid to syntax. Partly this has to do with the fact that real linguists have spent a lot of work on it. Partly it is because syntax needs to be done before just about anything else can be done.

I won't talk much about morphology. We will assume that words can be associated with a set of features or properties. For example the word "dog" is a noun, it is singular, and its meaning involves a kind of animal. The word "dogs" is related, obviously, but has the property of being plural. The word "eat" is a verb, it is in what we might call the "base" form, and it denotes a particular kind of action. The word "ate" is related, but it is in the "past tense" form.
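As a minimal sketch of associating words with features, the four example words above could be stored in a small table. The feature names (`category`, `number`, `form`, `base`, `sense`) and the helper `related` are assumptions chosen for illustration, not a standard scheme.

```python
# A toy lexicon: each word maps to a dictionary of features.
# The feature names are illustrative assumptions, not a standard scheme.
LEXICON = {
    "dog":  {"category": "noun", "number": "singular", "base": "dog",
             "sense": "a kind of animal"},
    "dogs": {"category": "noun", "number": "plural", "base": "dog",
             "sense": "a kind of animal"},
    "eat":  {"category": "verb", "form": "base", "base": "eat",
             "sense": "a kind of action"},
    "ate":  {"category": "verb", "form": "past", "base": "eat",
             "sense": "a kind of action"},
}


def related(word_a, word_b):
    """Treat two words as related if they are forms of the same base word."""
    return LEXICON[word_a]["base"] == LEXICON[word_b]["base"]


print(related("dog", "dogs"))  # True
print(related("dog", "ate"))   # False
```

A flat dictionary is the simplest possible choice here; a real system would more likely derive "dogs" from "dog" by a morphological rule than list both.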
You can imagine, I'm sure, that the techniques of knowledge representation that we have looked at can be applied to the problem of representing facts about the properties of words and the relations among them.

The key observation in the theory of syntax is that the words in a sentence can be more or less naturally grouped into what are called "phrases", and those phrases can often be treated as a unit. So in the sentence "The dog chased the bear," the sequence "the dog" forms a natural unit. The sequence "chased the bear" is a natural unit, as is "the bear".

Why do I say that "the dog" is a natural unit? Well, one thing is that I can replace it by another sequence that has the same referent, or a related referent. For example I could replace it by:

    Snoopy (a name)
    It (a pronoun)
    My brother's favorite pet (a more complex description)

What about "chased the bear"? Again, I could replace it by:

    died (a single word)
    was hit by a truck (a more complex event)

This basic structure, in English, is sometimes called the "subject-predicate" structure. The subject is a nominal, something that can refer to an object or thing. The predicate is a "verb phrase", which describes an action or event. Of course, as in the example, the verb phrase can also contain other constituents, for example another nominal.

These phrases also have structure. For example a noun phrase (a kind of nominal) can have a determiner, zero or more adjectives, and a noun, maybe followed by another phrase, like:

    the big dog that ate my homework

Verb phrases can have complicated "verb groups" like:

    will not be eaten

Syntactic theories try to predict and explain what patterns are used in a language. Sometimes this involves figuring out what patterns just don't work.
For example the following sentences have something wrong with them:

    * the dogs runs home
    * he died the book
    * she saw himself in the mirror
    * they told it to she

Figuring out exactly what is wrong with such sentences allows linguists to create theories that help understand the way that sentences get structured.

The general idea, in English, is that a sentence consists, as I said, of a subject and a predicate. A predicate is a verb followed by one or more nominals or prepositional phrases. Verbs often require a certain number of either nominals or prepositional phrases; these are called "complements". For example:

    it died (no complements, "intransitive")
    the horse kicked the farmer (one complement, "transitive")
    I gave her the book (two complements)
    I gave the book to her (one complement is a prepositional phrase)

================================================================

The sentences above are wrong for reasons that can be stated clearly. But another class of constraints was discovered in the early 60s. They generally involve sentences in which a component is moved out of its ordinary position, for example to make a question or relative clause.

Consider:

    I like flowers.

Can be transformed into:

    What do I like?

And

    He gave the fish to Ned.

Can be transformed to:

    Who did he give the fish to?

(Some people say this is ungrammatical. They are wrong. But even the "grammatical" version "to whom did he give the fish?" illustrates the point I am making.)

The general rule seems to be that you can take any nominal, replace it with a question word, and move it to the front of the sentence. But consider the following sentences:

    A    She likes ice cream and olives.
    A' * What does she like ice cream and?

    B    I know a Democrat who hates Clinton.
    B' * Who do you know a Democrat who hates?
Now these sentences are interesting because it is not exactly clear what sort of rule is being broken. You never see such sentences in language textbooks as the sort of thing to avoid, and children never produce them, even though children often make the sorts of errors mentioned previously.

================================================================

Other material may also be added to a sentence which is not required by the verb but which adds information about what is going on. These additions are called "adjuncts".

    it died yesterday (gives time)
    it died in the garage (gives location)
    it died because nobody fed it (gives reason)

Note that in the last example a "sentence" is part of another sentence. This can happen in various ways. For example some verbs take sentence-like units as complements:

    he thought I liked him

Or, as above, they can be used as adjuncts. Rather than call these sentences, they are sometimes called "clauses" -- a clause is a verb with some other arguments, usually its complements, sometimes (not always) a subject.

"Phrase structure trees" are often used to represent the configuration of sentences. These can show how the structural elements are related, and the relations among nodes in the tree can be used to describe constraints that have to hold.

One approach to characterizing syntactic structure involves giving rules to describe how phrases can be generated. For example here are some such rules:

    S  -> NP VP
    NP -> Det {Adj} Noun
    VP -> Verb {NP} {PP}
    PP -> Prep NP

A category in braces, {}, is optional.

Assuming that we have a "lexicon" of words, with their categories represented, these rules could be used to generate some syntactic structures that sentences may exhibit.

Suppose we add this rule:

    NP -> Det {Adj} Noun {PP}

For example "the man on the dock". This gives rise to the possibility that two sentences with the same sequence of words could be grouped differently.

    I saw the man with a telescope.
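The two groupings of this sentence can be made concrete with a small sketch that enumerates every tree the rules above allow. The optional categories are written out as separate alternatives, and the rule `NP -> Pronoun` and the tiny lexicon are assumptions added so that the example sentence can be covered at all.

```python
# Rules from the text, with each optional category expanded into explicit
# alternatives. "NP -> Pronoun" is an added assumption to handle "I".
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Pronoun"], ["Det", "Noun"], ["Det", "Adj", "Noun"],
           ["Det", "Noun", "PP"], ["Det", "Adj", "Noun", "PP"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "PP"], ["Verb", "NP", "PP"]],
    "PP": [["Prep", "NP"]],
}

# A hypothetical lexicon: one category per word, just enough for the example.
LEXICON = {"i": "Pronoun", "saw": "Verb", "the": "Det", "a": "Det",
           "man": "Noun", "telescope": "Noun", "with": "Prep"}


def parse(cat, words, i):
    """Yield (tree, j) for each way `cat` can cover words[i:j]."""
    if cat not in GRAMMAR:  # lexical category: must match the next word
        if i < len(words) and LEXICON.get(words[i]) == cat:
            yield (cat, words[i]), i + 1
        return
    for rhs in GRAMMAR[cat]:
        yield from expand(cat, rhs, words, i, [])


def expand(cat, rhs, words, i, children):
    """Match the categories in `rhs` one after another, starting at word i."""
    if not rhs:
        yield (cat, *children), i
        return
    for child, j in parse(rhs[0], words, i):
        yield from expand(cat, rhs[1:], words, j, children + [child])


def all_parses(sentence):
    """Return every tree for S that covers the whole sentence."""
    words = sentence.lower().rstrip(".").split()
    return [tree for tree, j in parse("S", words, 0) if j == len(words)]


for tree in all_parses("I saw the man with a telescope."):
    print(tree)  # prints two trees, one per grouping
```

In one tree the prepositional phrase sits inside the noun phrase "the man with a telescope"; in the other it attaches directly to the verb phrase.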
These different configurations can be associated with different meanings. This is called "syntactic ambiguity." Ambiguity is when a word or sentence can be taken as having more than one distinct meaning. For example some words have more than one meaning:

    I went to the bank.

Different meanings of words can cause sentences to be understood in very different ways:

    I saw her duck.
    Flying planes can be dangerous.

The sorts of rules that I have described are called "context-free" because the rewrite operation that they describe doesn't depend on any context in which the left-hand symbol occurs. But this can't capture some fairly simple regularities:

    Agreement:   * She saw himself.
    Complements: * He put the block.
    Case:        * They saw she.

To solve this, rules need to specify more than just what tree configurations can occur; they must also somehow indicate constraints that hold among the elements in the tree.

Another issue is that some sentences seem pretty directly related to others. For example consider the following pairs:

    he ate the fish
    the fish was eaten by him

    she read the book
    what did she read?

    the dog is at the corner
    the dog at the corner barked

There is a sense in which the second sentence or phrase is a "transformed" version of the first. This observation led to a powerful theory of syntactic structure called "transformational grammar" in which a language began with some simple context-free rules and some local constraints to create a set of basic sentences, which could then be transformed in various ways.

It turned out however that this didn't really work, so lately linguists are looking at a more abstract theory. The basic idea is that there is a general theory of phrase structure:

    X   -- lexical category (noun, preposition, verb)
    X'  -- "modified" lexical category (with complements)
    X'' -- "specified" lexical category

Constraints can be specified among phrases built up this way. And restrictions on movement can be stated.
The hypothesis goes even deeper than this, in that some linguists believe that this representation system is somehow innate, that it underlies all human linguistic knowledge. The evidence for this claim is the fact that all languages can be described using this terminology (more or less) and that it doesn't have to be this way. There is also evidence having to do with the fact that there are often relations between ordering rules in languages that seem to hold for all phrases, rather than for just one type of phrase.

For example there are languages in which the complements of a verb go after the verb (like English). In many of these languages, modifiers to nouns and complements to prepositions go after the modified element. (English does this for prepositions but not for nouns; French is a good example of a language that does both.) Obviously this doesn't always work, but it works often enough that some researchers think that there might be something there. Others think this whole notion is totally bogus (for example most people at UCSD).

================================================================

Parsing

Given all of the attention paid to syntax, it is not surprising that a lot of work has been done on getting computers to come up with a characterization of the syntactic structures of sentences.

Obviously, the way that this will work depends on the specific syntactic theory you believe in, but in general a parsing program is a search through the space of possible structural characterizations of the sentence, constrained by the fact that the structural characterization must be compatible with the given sequence of words.

Most of the research on automatic parsing has involved context-free grammars. Sometimes the basic ideas from context-free parsing are then augmented to make the parser able to handle non-context-free constraints.

The general idea of parsing with a set of context-free rules is to start generating possible tree structures, until a rule generates a lexical category.
This is then checked against the next word in the sentence. If it is of the appropriate category, the parse continues. If not, the parser must explore another node in the search space. For example:

    S  -> NP VP
    NP -> Det Noun
    VP -> Verb {NP} {PP}
    PP -> Prep NP

Suppose we are parsing:

    The dog barked in the yard.

We assume we have a sentence, so we start with the tree:

    S

We expand it using the rule:

    NP VP

Working from left to right, we expand the NP node:

    Det Noun

Now "Det" is a lexical category, so we look at the first word of the sentence. It is indeed a determiner, so we continue. The next category "Noun" is also a lexical category, so we check, and succeed.

Now we come to a non-lexical category, VP, so we find a rule for that. This rule has optional constituents, so we treat each optional possibility as a separate node. Our first node assumes that both optional constituents are omitted:

    VP -> Verb

And we create a node for each of the other possibilities:

    VP -> Verb NP
    VP -> Verb PP
    VP -> Verb NP PP

The first node predicts a verb and one is there, so we continue. However that rule says we should be done, and we aren't yet, so it fails, and we go back to the next node. This one also predicts a verb, so we continue. We expand an NP node, which predicts a determiner, but there is none there, so that one fails. The next node predicts a verb, and we expand the PP node to predict a preposition, which is what is there, and we continue on.

Obviously there can be lots more complexity to all of this, but the general idea in what is called "top down" parsing is a depth-first search down the left side of the tree until a lexical category is predicted. This is compared with the next word in the sentence.

To handle non-context-free phenomena, a context-free parser is sometimes augmented with some additional tests or operations to perform after the parser succeeds on the context-free operation, to possibly eliminate some sentences.
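The top-down walk-through above can be sketched as a depth-first search over states, where a state is the list of categories still predicted plus the number of words consumed so far. This is only a recognizer (it answers yes or no rather than building a tree), and the name `top_down_recognize`, the lexicon, and the expansion of the optional categories into four VP rules are assumptions made for illustration.

```python
# Rules from the example, with the optional NP and PP written out as
# separate alternatives in the order the walk-through tries them.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "PP"], ["Verb", "NP", "PP"]],
    "PP": [["Prep", "NP"]],
}

# A hypothetical lexicon covering only the example sentence.
LEXICON = {"the": "Det", "dog": "Noun", "yard": "Noun",
           "barked": "Verb", "in": "Prep"}


def top_down_recognize(words):
    """Depth-first search over (predicted categories, position) states."""
    agenda = [(["S"], 0)]
    while agenda:
        predicted, pos = agenda.pop()
        if not predicted:
            if pos == len(words):
                return True      # nothing left to predict, nothing left to read
            continue             # the rule says we are done, but words remain
        first, rest = predicted[0], predicted[1:]
        if first in RULES:
            # Push alternatives in reverse so the first rule is tried first.
            for rhs in reversed(RULES[first]):
                agenda.append((rhs + rest, pos))
        elif pos < len(words) and LEXICON.get(words[pos]) == first:
            # Lexical category matches the next word: consume it.
            agenda.append((rest, pos + 1))
        # Otherwise this state fails and the search backs up.
    return False


print(top_down_recognize("the dog barked in the yard".split()))  # True
```

On the example sentence the search tries `VP -> Verb`, then `VP -> Verb NP`, and succeeds with `VP -> Verb PP`, matching the order described above. The sketch implements only the context-free part; it has no agreement tests or other added constraints.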
For example we might have:

    S -> NP VP (= (number NP) (number VP))

where 'number' returns whether its argument is singular or plural. Of course we will have to augment our representation of the syntactic structure somehow to record this and other potentially relevant syntactic properties.

We will see a specific example of this next time, when we examine a parser that uses the machinery we developed for proving theorems.

================================================================

Issues in Semantics

Although it is hard to tell sometimes at linguistics talks, the only reason that people are interested in syntax is that the structure of a sentence is presumably related somehow to the meaning that it conveys.

One idea in semantics we have already seen -- the idea of hierarchies of objects. To some degree, the meanings of nouns and noun phrases can be understood with the sorts of knowledge representation ideas we have already looked at, and many of these ideas were developed for natural language understanding systems.

Another is the idea of the "referent" of a noun phrase -- the thing that it refers to, usually by satisfying some description.

The idea of hierarchies of objects can also be extended to the idea of hierarchies of actions and events. In the theory of "conceptual dependency" the claim is that the relations among complex events can be captured by composing them out of simpler events.

A key idea in representing events is that certain kinds of events have specific "participants". For example a "buy" event has a buyer and a seller and a thing bought. A "move" event has the thing that moves and possibly an initial and a final location and maybe a path along which the motion happens.

These observations lead to the theory of "case frames". A case frame is a representation of an action or event, along with its participants.
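As a minimal sketch, a case frame can be represented as an event name together with a fixed set of participant slots, some of which may be unfilled. The class `CaseFrame`, its methods, and the participant names are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CaseFrame:
    """An event together with its participant slots (a hypothetical layout)."""
    event: str
    roles: tuple
    fillers: dict = field(default_factory=dict)

    def fill(self, role, filler):
        """Assign a filler to a participant slot; reject unknown slots."""
        if role not in self.roles:
            raise KeyError(f"{self.event!r} has no participant {role!r}")
        self.fillers[role] = filler
        return self  # returning self allows calls to be chained

    def missing(self):
        """List the participants that have not been filled in yet."""
        return [r for r in self.roles if r not in self.fillers]


# Participant lists taken from the "buy" and "move" examples in the text.
BUY_ROLES = ("buyer", "seller", "thing_bought")
MOVE_ROLES = ("mover", "initial_location", "final_location", "path")

purchase = CaseFrame("buy", BUY_ROLES)
purchase.fill("buyer", "Mary").fill("thing_bought", "a book")
print(purchase.missing())  # ['seller']
```

Keeping the list of slots separate from the fillers makes it easy to notice which participants a sentence left unstated.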
The reason they are called "case" frames has to do with the fact that in many languages (though not English), nouns are assigned case depending on the role that the referent of the noun phrase plays in the sentence. For example in Latin, there is a different ending to indicate if the word is the subject of the sentence, the direct object, or if it refers to a location (and some more).

The idea of case frames is that each verb is associated with a specific case frame, and a set of "role mappings" which indicate how the syntactic arguments of the sentence are assigned to the participant slots in the case frame.

Here are some typical slots in case frames:

    agent
    object
    location
    source
    goal
    beneficiary

For example the verb "buy" might be associated with a "purchase" case frame with a buyer and a seller and a thing bought. So we will assume that it uses the "source" slot for the seller, the "goal" slot for the buyer, and the "object" slot for the thing bought.

Thus the verb "buy" maps the subject of the sentence to the "goal" slot, the direct object to the "object" slot, and the object of the preposition "from" to the "source" slot.

Note that prepositions are often used to assign case roles. Obviously, "from" often marks the "source" slot and "to" often marks the "goal" slot.

Now consider the verb "sell". This evokes the same case frame but with different mappings: the subject is now the source, the direct object is again the object, and the object of "to" is the goal.

[[Semantic Projection Stuff]]

================================================================

Issues in Pragmatics

Pragmatics usually refers to how contextual resources are used to work out the specific meanings of sentences. Sometimes the contextual resources are linguistic, for example referring expressions, and sometimes they are part of the speech situation, for example the speaker and hearer, and the time and place of the utterance.
So for example we have in English the difference between "definite" and "indefinite" reference. An "indefinite" expression gives a description and is often used to indicate that an object satisfying that description is to be newly introduced into the discourse. A "definite" referring expression is used to refer back to a previously mentioned entity. So in:

    A bear came to our campsite last night.
    The bear was eating our garbage.
    It scared my brother.

The first expression "a bear" is indefinite. It introduces the entity to the store. "The bear" is definite. It refers to the previously introduced bear. So does "it".

All of this requires some notion of a "structure" or "context" in which referring expressions are introduced.

The discourse situation must be represented also, for many references to be understood. For example we need to represent the speaker and hearer, and perhaps onlookers, if we are to work out the intended referents of "me" and "you" and "us" and "them". Also the times of "now" and "yesterday", and the locations of "here" and "there".

Different languages partition the speech situation differently from English. For example many languages have a second person plural, sort of like "you all". Some have two different kinds of first person plurals -- one that includes the hearer, and one that doesn't. Spanish, for example, has four spatial pronouns, one for near the speaker, one for near the hearer, one for in the region where both are, and one for a region far from both.

================================================================

Issues in Discourse

The next level of analysis is called "discourse theory". This is about the higher level relations that hold among sequences of sentences in a discourse or a narrative. It merges sometimes with literary theory, but also with pragmatics.

One thing to understand is that different sentences do different kinds of "work" in a discourse.
We have seen some examples of this already -- noun phrases that refer to new entities, or back to previously introduced ones. The same goes for whole sentences. Some introduce new events or relations, some use them to introduce something new.

    A car began rolling down the hill.
    It collided with a lamppost.

One important idea in discourse theory is the idea that much language is performed in the context of some mutual activity. For example two people could be working on some project together. In this case, they are probably both somewhat aware of the plan that they are both following, and so much of the pragmatic information needed to understand what they are talking about can be thought of in terms of that plan.

And sometimes utterances can be understood as if they were steps in the execution of a plan. For example if I say,

    please pass the salt

This could be thought of as a way to get me the salt, if having salt was part of a plan. Some people think of sentences like

    can you pass the salt

as "indirect speech acts" because they look like questions, but aren't really. One way to think about sentences like this is that the hearer understands that this is probably not a question, but is a conventionalized (and polite) means of asking for the salt.

Another analysis of this sort of sentence is that you are trying to avoid rejection. You do this by considering ways that your plan might fail. So you don't want to have this happen:

    please pass the salt
    I can't, I'm tied up with ropes.
    oh, sorry.

So you ask about potential problems first -- asking about ability. That way, if there is a problem, you don't have to ask directly and you won't be rejected. It is sort of like:

    are you doing anything Saturday night?
    yes, I'm feeding my goldfish

So you don't have to be rejected, as you might be if you actually asked for a date.
================================================================

Information Retrieval

An important new application of natural language processing is in the area of "information retrieval". In this field we aren't as much interested in working out the linguistic details of texts, but we are interested in finding information about some topic.

The general model is this:

    a huge database of articles (like an encyclopedia)
    a user "query", like "find me articles about Newt Gingrich's illegal PAC activities"

The goal is that the computer be able to figure out which articles in the database are relevant to your query.

There are two ways commonly used to assess the success of an information retrieval system:

    Relevance: all relevant articles are found
    Accuracy: all articles found are relevant

Note that it is easy to make a system that is very high in the "relevance" index -- simply return all articles in the database. Clearly this will return all relevant ones, but of course this isn't very useful. Accuracy is important, but much more difficult.

One approach to information retrieval is to use the NLP ideas described above to make a program that parses the text of all of the articles, and represents their meanings, and then does the same thing for a query. The problem with this approach is that few of the linguistic issues needed to really do this have been solved. Parsing alone is hard, and that is one of the best understood things.

A more practical approach, and one that actually works, is to forget linguistic structure, and just work with keywords. The idea is to take the query and to find articles that contain as many of the words in the query as possible.

Usually such systems use a "stemmer" to convert word forms into base words, and use them in both the query and articles to match. So we wouldn't search for "activities" but for "activity". Also we remove common words like "the" and "of". Sometimes the system will use a table of synonyms to enlarge the search more.
For example an article might not use the word "illegal", but might use words like "criminal" or "illicit" or "questionable".

This simple keyword approach works often, but will sometimes fail to do well in the accuracy measure. For example suppose I am looking for articles on "free markets". The problem is that lots of articles will contain the words "free" and "market" that don't say anything about free markets, for example there will be articles about free stuff you can get at supermarkets. So some IR systems allow you to specify specific ordering constraints, or boolean combinations of index keywords.

Right now, with the world wide web, there is a lot of interest in finding good IR systems -- ways to let someone describe a set of interests, and locate WWW pages that correspond to that set. If you can figure out a good way to do this you will probably get rich.
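The keyword approach and the two success measures described in this section can be sketched as follows. Everything here is an assumption made for illustration: the stop-word list, the very crude `stem` function (real stemmers are far more careful), the three toy articles, and the function names. The measures called "relevance" and "accuracy" in these notes are more commonly known as recall and precision.

```python
# A hypothetical stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "of", "a", "an", "about", "me", "find", "at", "in"}


def stem(word):
    """Very crude stemmer for illustration: handles only '-ies' and '-s'."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keywords(text):
    """Lower-case the text, drop stop words, and stem what is left."""
    words = (w.strip(".,\"'?!").lower() for w in text.split())
    return {stem(w) for w in words if w and w not in STOP_WORDS}


def search(query, articles, min_overlap=1):
    """Return article titles ordered by how many query keywords they share."""
    q = keywords(query)
    scored = [(len(q & keywords(text)), title)
              for title, text in articles.items()]
    return [title for score, title in sorted(scored, reverse=True)
            if score >= min_overlap]


def relevance(found, relevant):
    """Fraction of the relevant articles that were found."""
    return len(set(found) & set(relevant)) / len(relevant)


def accuracy(found, relevant):
    """Fraction of the found articles that are relevant."""
    return len(set(found) & set(relevant)) / len(found) if found else 0.0


# Toy database: only "economics" is really about free markets.
ARTICLES = {
    "economics": "Free markets and the role of prices",
    "coupons": "Free samples at the farmers market",
    "gardening": "Growing tomatoes at home",
}

found = search("free markets", ARTICLES)
print(found)
print(relevance(found, ["economics"]), accuracy(found, ["economics"]))
```

On this toy data the query "free markets" matches both the economics article and the one about free samples, so every relevant article is found but only half of the articles found are relevant, which is the failure described above.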