Natural language processing (NLP) has been studied extensively over the last thirty years. However, some fundamental limitations in the field have not been overcome, and the technology has not yet achieved a revolutionary impact on society. Researchers have approached NLP problems from different perspectives, such as syntax, semantics, and pragmatics. A critical issue for many NLP systems is knowledge acquisition, and many methods of acquiring such knowledge have been proposed in previous studies [2].
Syntax is without doubt the most mature area of study in both computational linguistics and the closely related field of linguistics. Grammars have been devised to generate (that is, to specify) syntax. Since the definition of the Chomsky hierarchy (finite-state languages, context-free languages, context-sensitive languages, and recursively enumerable languages), many grammar formalisms have been introduced to describe syntactic phenomena, and various natural language phenomena have been handled by different grammars. However, no known grammar formalism can be applied universally. Work on grammar formalisms in computational linguistics has largely concentrated on extending Chomsky's context-free grammars to handle more natural language phenomena while preserving their attractive computational properties, which are well understood from their use in programming languages. Chomsky himself realized early on that context-free grammars cannot satisfy the requirements of NLP without modification. One successful extension, invented later, is transformational grammar, which played a dominant role among grammar formalisms for years; most NLP books cover it, as in [3]. Compared with more recently developed grammars, transformational grammars are straightforward to derive from context-free grammars, yet they still fall short in handling many NLP problems.
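As a concrete illustration of this limitation (our own sketch, not drawn from [3]), a context-free rule such as S → a S b captures nested dependencies like aⁿbⁿ, but by the pumping lemma no context-free grammar generates aⁿbⁿcⁿ, the abstract pattern behind cross-serial dependencies in languages such as Swiss German:

```python
# A minimal sketch: context-free rules capture nested dependencies.
# Grammar: S -> a S b | epsilon, which generates the language a^n b^n.

def nested(n: int) -> str:
    """Apply S -> a S b exactly n times, then S -> epsilon."""
    return "" if n == 0 else "a" + nested(n - 1) + "b"

print(nested(3))  # -> aaabbb

# No single recursive rule of this kind can synchronize THREE counts:
# a^n b^n c^n is provably not context-free (pumping lemma), which is
# what motivates the extensions surveyed below.
```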
In this survey, we focus on three grammar formalisms (indexed grammars, linear indexed grammars, and global indexed grammars) that were invented in succession. Unlike transformational grammars, these three formalisms have been thoroughly analyzed mathematically for their descriptive power. They are also of greater theoretical interest in dealing with NLP problems, in the sense that their mathematical and computational models are formally specified. Indexed grammars (IGs, Aho (1968)) are direct extensions of context-free grammars [1]. IGs are not semilinear (they generate, for example, the language {a^(2^n) : n ≥ 0}), nor are they computationally efficient. The later modification, linear indexed grammars (LIGs, Gazdar (1988)), was designed to achieve semilinearity by placing an additional constraint on the form of the productions [6]. Global indexed grammars (GIGs, Castaño (2003)) are the newest invention, using the stack of indices as a global control structure; this formalism provides a global but restricted context that can be updated at any local point in the derivation [4]. As the naming convention suggests, LIGs were developed from IGs, and GIGs were derived from LIGs. The languages generated by IGs are called indexed languages; they include all context-free languages and form a proper subset of the context-sensitive languages. As a result, indexed languages retain many of the closure properties and decidability results of context-free languages. The languages generated by the later formalisms, LIGs and GIGs, inherit most of the desirable properties of context-free languages, such as polynomial parsability and semilinearity.
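To make the stack-of-indices mechanism concrete, the following sketch (our own illustration, using a standard textbook IG formulation rather than an example taken from [1], [4], or [6]) replays an indexed grammar derivation of aⁿbⁿcⁿ, the canonical language beyond context-free power:

```python
# Indexed grammar for a^n b^n c^n.  Each nonterminal carries a stack
# of indices; pushes and pops synchronize the three counts.
#   S[sigma]   -> a S[f sigma]     (push index f for each 'a')
#   S[sigma]   -> T[sigma]         (hand the whole stack to T)
#   T[f sigma] -> b T[sigma] c     (pop f, emit a matching 'b' and 'c')
#   T[]        -> epsilon

def derive(n: int) -> str:
    """Replay the derivation of a^n b^n c^n step by step."""
    stack: list[str] = []          # the stack of indices on the nonterminal
    prefix, suffix = "", ""
    for _ in range(n):             # S[sigma] -> a S[f sigma]
        prefix += "a"
        stack.append("f")
    # S[sigma] -> T[sigma]
    while stack:                   # T[f sigma] -> b T[sigma] c
        stack.pop()
        prefix += "b"
        suffix = "c" + suffix
    return prefix + suffix         # T[] -> epsilon

print(derive(3))                   # -> aaabbbccc
```

Note that each production above happens to pass the index stack to at most one nonterminal on its right-hand side, which is precisely the constraint LIGs impose on all productions; a GIG, by contrast, maintains a single global stack shared by the entire derivation rather than one stack per nonterminal.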
We examine the formalisms of these three grammars in some detail in the following sections (IGs in Section 2, LIGs in Section 3, and GIGs in Section 4). Comparisons among them are drawn from a hierarchical perspective in Section 5. Challenges and future work in NLP related to grammar formalisms are discussed in Section 6.