FR97/2011/G1 - Formal Models in Natural Language Processing

Materials for Lectures (FRVŠ, 2011: Horáček, Burgetová, Zámečníková)

Lecture  Topic                                     Presentation  Printable

Formal Description of Natural Languages
  01     Aims of Linguistic Theory                 PDF           PDF
  02     Generative Power of Natural Languages     PDF           PDF
NLP Formalisms
  03     Augmented Transition Networks             PDF           PDF
  04     Generalized Phrase Structure Grammar      PDF           PDF
  05     Head-driven Phrase Structure Grammar      PDF           PDF
  06     Lexical Functional Grammar                PDF           PDF
  07     Lexicalized Tree Adjoining Grammar        PDF           PDF
  08     Dependency Grammars                       PDF           PDF
Statistical NLP
  09     Probabilistic Context-Free Grammar        PDF           PDF
Alternative Approach
  10     Scattered Context Grammar                 PDF           PDF
Parsing
  11     LR Parsing                                PDF           PDF

         All Lectures in One Archive               ZIP           ZIP

Introduction

The study of natural languages was one of the key motivating factors in the foundation of Formal Language Theory (FLT). The earliest works in FLT were often driven by the effort to understand and formally describe the syntax of natural languages. Since then, however, FLT and Natural Language Processing (NLP) have grown into quite distinct areas.

  • NLP studies many other aspects of natural languages besides their syntax (such as morphology or semantics). It focuses mainly on practical applications and involves a variety of tasks (for example, spelling and grammar checking or machine translation; in a broader sense, even tasks such as speech recognition and synthesis).
  • FLT focuses on the purely theoretical study of formal models and their properties. Practical applications are often only briefly sketched.

The materials available here are based on the connections between NLP and FLT. For the NLP-oriented reader, they should provide an overview of the formal background of the tools used in NLP. Several well-known models are presented. For those with primary interest in formal languages, these materials contain information about an important application area of FLT. All discussed topics are illustrated by practical examples.

Topics

Formal Description of Natural Languages

  • Aims of Linguistic Theory PDF
    • The origins of computational linguistics are closely connected with artificial intelligence (AI) applications.
    • The main concern is the syntax of natural languages.
    • The notion of transformational rules - useful for capturing generalizations in languages (for example, transforming an active English sentence such as "The dog chased the cat" into its passive counterpart "The cat was chased by the dog").
  • Generative Power of Natural Languages PDF
    • Where do natural languages fit in the Chomsky Hierarchy?
    • Natural languages are not regular.
    • Are natural languages context-free? (See the cross-serial dependency example below.)
    • Introduction of Transformational Grammars.
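
As a worked illustration of the context-freeness question above (this is the standard argument from the literature, summarized here rather than taken from the slides): Swiss German allows cross-serial dependencies, where a sequence of noun phrases is followed by a sequence of verbs and the i-th verb determines the case of the i-th noun phrase. Intersecting such sentences with a suitable regular language yields a pattern of the form

  w a^m b^n x c^m d^n y   (m, n ≥ 0),

which can be shown non-context-free with the pumping lemma. Since context-free languages are closed under intersection with regular languages, natural language as a whole cannot be context-free (Shieber, 1985).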

References

  1. James Allen: Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc., 1995
  2. Robert N. Moll, Michael A. Arbib, Assaf J. Kfoury: An Introduction to Formal Language Theory. Springer-Verlag, 1988

NLP Formalisms

In this section, you can find several formalisms that are used in practice in NLP. Rather than trying to provide in-depth coverage of all aspects, the materials should present an overview of available alternatives. The core principles of the models are described, and illustrated by examples.

  • Augmented Transition Networks (ATN) PDF
    • One of the earliest models for the description of natural languages.
    • Recursive transition network extended with registers (a small sketch in Python follows this list).
    • ATNs are Turing-complete.
  • Generalized Phrase Structure Grammar (GPSG) PDF
    • Created as an attempt to show that it is possible to describe natural languages in a context-free framework, without using transformations [4].
    • Apart from context-free rules, GPSG includes features and metarules (see the metarule example after this list).
    • GPSG itself is not used in many practical applications, but its basic principles inspired considerably more successful models (such as HPSG).
  • Head-driven Phrase Structure Grammar (HPSG) PDF
    • Non-derivational, unification-based grammar with declarative constraints.
    • The basic HPSG type is the sign - formalized as a typed feature structure (see the example after this list).
    • Widely used in practice in NLP.
  • Lexical Functional Grammar (LFG) PDF
    • Main focus is still on syntax, but LFG also covers its relation to morphology, semantics and pragmatics.
    • Multiple dimensions of structure; the two fundamental levels are c-structure and f-structure (illustrated after this list).
    • Used as the theoretical basis of various machine translation tools.
  • Lexicalized Tree Adjoining Grammar (LTAG) PDF
    • Elementary objects are trees, not strings (TAG is a “tree-generating system”).
    • Two basic operations over trees: substitution and adjunction (sketched after this list).
    • Stronger generative capacity than CFG - TAGs can generate some non-context-free (mildly context-sensitive) languages.
    • Lexicalization means associating each elementary structure with a lexical item.
  • Dependency Grammars PDF
    • The term actually encompasses a whole group of particular formal models, such as:
      • Theory of Structural Syntax [8.5]
      • Word Grammar (WG) [8.2]
      • Functional Generative Description (FGD) [8.4]
      • Meaning-Text Theory (MTT) [8.3]
      • Extensible Dependency Grammar (XDG) [8.1]
    • Dependency grammars are an alternative to phrase structure grammars - they capture relations between words directly, with no phrasal nodes (see the example after this list).
    • Generally simpler, more robust and more portable than phrase structure grammars (PSG), but less informative.
    • Good for describing free word order languages (such as Czech).
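
The ATN mechanism can be made concrete with a few lines of Python. Everything here - the two-network layout, the toy lexicon, and the register names DET, HEAD and VERB - is invented for illustration; real ATN systems also have arc tests, actions and richer control:

  # A minimal ATN sketch: an S network that pushes to an NP subnetwork
  # and fills registers as arcs are traversed. All names are illustrative.

  LEXICON = {"the": "DET", "dog": "N", "barks": "V"}

  def parse_np(words, i, registers):
      """NP network: optional DET arc, then an obligatory N arc (pop)."""
      if i < len(words) and LEXICON.get(words[i]) == "DET":
          registers["DET"] = words[i]
          i += 1
      if i < len(words) and LEXICON.get(words[i]) == "N":
          registers["HEAD"] = words[i]
          return i + 1          # pop arc: NP recognized
      return None               # traversal fails

  def parse_s(words):
      """S network: PUSH arc into NP, then a V arc, then pop."""
      registers = {}            # simplified: one shared register set
      i = parse_np(words, 0, registers)
      if i is not None and i < len(words) and LEXICON.get(words[i]) == "V":
          registers["VERB"] = words[i]
          if i + 1 == len(words):
              return registers  # sentence accepted
      return None

  print(parse_s("the dog barks".split()))
  # -> {'DET': 'the', 'HEAD': 'dog', 'VERB': 'barks'}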
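
For GPSG, the best-known metarule is probably the passive metarule, which derives passive VP rules from active ones. Shown here schematically (the notation is simplified; see [4] for the precise formulation):

  VP → W, NP                 (any active VP rule introducing an NP object)
    ⇒
  VP[PAS] → W, (PP[by])      (its passive counterpart, with an optional by-phrase)

This captures the active-passive correspondence without transformations: the metarule operates on rules, not on sentences.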
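
For HPSG, the sign of the verb "walks" can be sketched as a typed feature structure (an illustrative fragment only; the attribute names follow common textbook usage rather than these slides):

  word
    PHON            <walks>
    SYNSEM|LOC|CAT  HEAD    verb
                    SUBCAT  <NP[nom, 3rd, sg]>

Parsing then amounts to combining such structures by unification, subject to the grammar's declarative constraints.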
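
For LFG, the two fundamental levels can be illustrated on a minimal sentence (again an illustrative fragment only):

  "John sleeps"

  c-structure:  [S [NP John] [VP [V sleeps]]]

  f-structure:  [ PRED   'sleep<SUBJ>'
                  TENSE  present
                  SUBJ   [ PRED 'John' ] ]

The c-structure records constituency and word order; the f-structure records grammatical functions such as SUBJ, abstracting away from word order.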
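
For LTAG, the two operations can be sketched with two elementary trees, written here in bracket notation (illustrative only):

  initial tree α (anchored by "sleeps"):     S(NP↓, VP(V(sleeps)))
  auxiliary tree β (anchored by "soundly"):  VP(VP*, Adv(soundly))

Substituting an NP tree for "John" at the NP↓ node of α yields "John sleeps"; adjoining β at the VP node of the result splices the adverb in, yielding "John sleeps soundly".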
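
For dependency grammars, the analysis of a short sentence consists only of word-to-word links; there are no phrasal nodes (the relation labels are illustrative; each formalism listed above has its own inventory):

  "The dog chased a cat"

  chased (root)
    subj -> dog
              det -> The
    obj  -> cat
              det -> a

The verb is the root; every other word depends on exactly one head word.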

References

  1. James Allen: Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc., 1995
  2. Doug Arnold: Lexical Functional Grammar. Dept of Language and Linguistics, University of Essex, 2011
  3. Andrew Carnie: Syntax: A Generative Introduction. Blackwell Publishing, Oxford, 2002
  4. Gerald Gazdar, Ewan H. Klein, Geoffrey K. Pullum, Ivan A. Sag: Generalized Phrase Structure Grammar. Harvard University Press, 1985
  5. Robert N. Moll, Michael A. Arbib, Assaf J. Kfoury: An Introduction to Formal Language Theory. Springer-Verlag, 1988
  6. Carl Pollard, Ivan A. Sag: Head-Driven Phrase Structure Grammar. University of Chicago Press, 1994
  7. Yves Schabes, Aravind K. Joshi: Tree-Adjoining Grammars and Lexicalized Grammars. 1991
  8. Dependency grammar formalisms
    1. Ralph Debusmann, Denys Duchier, Geert-Jan Kruijff: Extensible Dependency Grammar: A New Methodology. In Proceedings of the Workshop on Recent Advances in Dependency Grammar, pp. 78-85, 2004
    2. Richard Hudson: Word Grammar. Blackwell, 1984
    3. Igor Mel'čuk: Dependency Syntax: Theory and Practice. State University of New York Press, 1988
    4. Petr Sgall, Eva Hajičová, Jarmila Panevová, Jacob Mey: The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. Springer, 1986
    5. Lucien Tesnière: Éléments de syntaxe structurale. Éditions Klincksieck, 1959

Statistical NLP

With the availability of large text corpora (both annotated and unannotated) for various languages, statistical approaches have become prominent in modern NLP. Instead of the traditional approach, where we try to describe all relevant aspects of a language manually (hand-crafted rules, etc.), we can apply machine learning methods to extract generalized information from a corpus.

Most of the models discussed in the previous section can be (and have been) adapted for use in statistical NLP.

  • Probabilistic Context-Free Grammar (PCFG) PDF
    • Context-free grammar in which every rule is assigned a probability; the probabilities of all rules with the same left-hand side sum to one (a small sketch in Python follows this list).
    • While PCFG itself may not often be used in practice, it is presented here as a straightforward extension of CFG, a well-known model from classic FLT.
    • Also useful for demonstrating some principles of statistical NLP and machine learning (e.g., the inside-outside algorithm).
    • Practical applications include, for example, grammar checking.
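
A minimal sketch of the idea in Python (the toy grammar, its probabilities, and the function name are invented for illustration): the probability of a parse tree is the product of the probabilities of all rules used in its derivation.

  # Toy PCFG: each nonterminal maps to (right-hand side, probability) pairs;
  # the probabilities for each left-hand side sum to one.
  PCFG = {
      "S":  [(("NP", "VP"), 1.0)],
      "NP": [(("the", "dog"), 0.6), (("the", "cat"), 0.4)],
      "VP": [(("barks",), 0.7), (("sleeps",), 0.3)],
  }

  def derivation_probability(tree):
      """Probability of a parse tree = product of the used rule probabilities.
      A tree is a (label, child, child, ...) tuple; leaves are strings."""
      label, *children = tree
      rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
      prob = dict(PCFG[label])[rhs]            # probability of this rule
      for child in children:
          if not isinstance(child, str):
              prob *= derivation_probability(child)
      return prob

  tree = ("S", ("NP", "the", "dog"), ("VP", "barks"))
  print(derivation_probability(tree))          # 1.0 * 0.6 * 0.7 = 0.42

In a statistical parser, the rule probabilities would be estimated from a treebank rather than written by hand, and the parser would search for the most probable tree of a given sentence.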

References

  1. Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing. MIT Press, 1999

Alternative Approach

One of the major trends in FLT is so-called regulated rewriting. The basic idea is to increase the generative power of a given formal model (usually a CFG) by adding a mathematically simple mechanism that controls (regulates) the generation of sentences. A well-known example is the Matrix Grammar [1]. The generative capacity can also be increased by modifying the form of the rules, as is done in the Scattered Context Grammar [2].

Such models could be useful in NLP as well, because they make it possible to capture even non-context-free features of natural languages in a straightforward way.

  • Scattered Context Grammar (SCG) PDF
    • A modification of context-free grammar in which an n-tuple of nonterminals is rewritten simultaneously in each derivation step (see the example after this list).
    • Using SCG and Transformational SCG in the description of English syntax is proposed and demonstrated in [3].
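
A standard small example of what simultaneous rewriting buys (the grammar is a textbook-style illustration, not taken from [3]): the following SCG generates { a^n b^n c^n | n ≥ 1 }, a language no context-free grammar can generate.

  Nonterminals: S, A, B, C    Terminals: a, b, c    Start symbol: S

  Rules:
    (S)       → (ABC)
    (A, B, C) → (aA, bB, cC)
    (A, B, C) → (a, b, c)

  Sample derivation (n = 3):
    S ⇒ ABC ⇒ aAbBcC ⇒ aaAbbBccC ⇒ aaabbbccc

Because A, B and C must always be rewritten together, the numbers of a's, b's and c's stay synchronized.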

References

  1. Jürgen Dassow, Gheorghe Păun: Regulated Rewriting in Formal Language Theory. Springer, 1989
  2. Sheila A. Greibach, John E. Hopcroft: Scattered Context Grammars. In Journal of Computer and System Sciences, 3:233-247, 1969
  3. Alexander Meduna and Jiří Techet: Scattered Context Grammars and their Applications. WITpress, 2010

Parsing

In practice, having a formal description of the syntax of a language is usually not enough. We need not only to generate well-formed sentences of a language, but often also to analyze a given sentence - for instance, to decide whether it belongs to a given language, or to obtain further information about its constituents and structure.

Syntactic analysis (parsing) is an important part of FLT. For practical applications (in NLP, but also in other areas), we need efficient parsing methods and algorithms. This is the main reason why many of the formal models used in NLP are in some way based on CFG or context-free rules - CFG is the most powerful grammar type in the Chomsky hierarchy for which syntactic analysis can still be performed efficiently enough.

There are several well-known parsing algorithms for CFG, such as CKY or chart parsing. Some are applicable only to a certain subset of CFGs (for example, LL parsing), while others are general.

  • LR Parsing PDF
    • Bottom-up parsing.
    • Can be used for any language accepted by a deterministic pushdown automaton (the deterministic context-free languages).
    • (Generalized) LR parsing is one of the methods used in practice in NLP (a small sketch in Python follows this list).
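
A minimal sketch of table-driven LR parsing in Python (the toy grammar E → E + n | n and its hand-built parse table are for illustration only; a production parser would generate the table from the grammar):

  # ACTION and GOTO tables for the grammar  E -> E + n | n  (hand-built).
  ACTION = {
      (0, "n"): ("shift", 2),
      (1, "+"): ("shift", 3),
      (1, "$"): ("accept",),
      (2, "+"): ("reduce", "E", 1),   # E -> n
      (2, "$"): ("reduce", "E", 1),
      (3, "n"): ("shift", 4),
      (4, "+"): ("reduce", "E", 3),   # E -> E + n
      (4, "$"): ("reduce", "E", 3),
  }
  GOTO = {(0, "E"): 1}

  def lr_parse(tokens):
      stack = [0]                         # stack of parser states
      tokens = tokens + ["$"]             # end-of-input marker
      i = 0
      while True:
          entry = ACTION.get((stack[-1], tokens[i]))
          if entry is None:
              return False                # syntax error
          if entry[0] == "accept":
              return True
          if entry[0] == "shift":
              stack.append(entry[1])
              i += 1
          else:                           # reduce: pop |rhs| states, take GOTO
              _, lhs, rhs_len = entry
              del stack[-rhs_len:]
              stack.append(GOTO[(stack[-1], lhs)])

  print(lr_parse(["n", "+", "n"]))        # True
  print(lr_parse(["n", "+"]))             # False

Generalized LR parsing keeps the same table-driven scheme but uses a graph-structured stack, so that conflicting table entries (which arise with ambiguous natural-language grammars) can be pursued in parallel.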

References

  1. Alexander Meduna: Elements of Compiler Design. Taylor & Francis, New York, 2008

Acknowledgement

These materials were created with the support of the FRVŠ fund, project FR97/2011/G1.