Topics List 3: Languages

32. We have already noted that a language is a set of strings over some alphabet. We now define a "grammar" to be a means of specifying the strings which are in a language.

What form might a grammar take? In the case of natural languages, grammars for the general reader often take the form of a text written either in the same or in another natural language. (In language classes, for example, we may encounter an English grammar or a French grammar, written in English.) These are typically imprecise, and don't cover many situations, especially the more complex ones, which arise in normal use of the language.

Linguists working on linguistic theories often refer examples to native speakers, to enquire whether some utterance is reasonable in the language. (One of the hazards of employment in a language and literature department is to be pestered with such questions on a frequent basis.)

From a computational standpoint, these approaches are not satisfactory. Evidently, for these purposes, a grammar has to be some kind of effective procedure (i.e. a Tm).

Notice, by the way, that my definition of grammar at the start of this note is worded very carefully. "Means" is a neutral term; "method" or "set of rules" might imply "effective procedure", and rule out a book or consultation with a native speaker.

In the light of Note 35, my definition also refers only to acceptance or generation of strings in a language, not to rejection of those which are not.

33-36. The use of unrestricted Tms to specify languages is referred to in the next few sections. We won't pursue these in class, but you should know their contents. If you are interested, you should try to find the proof ideas; otherwise you will have to take the theorems on trust. Starting in section 37, we shall consider an alternative formulation for a grammar proposed in the 1950's by Noam Chomsky.

33. We can imagine two ways of using a Tm to specify strings: either by having it generate them all (recursive enumeration) or by having it accept ("parse") individual strings given to it. Both are useful. Generation has proved to be a very clear way to specify a language to a user; but a programming language compiler must be able to recognize the strings presented by the user.

(Actually, this definition of parsing is very undemanding. The grammar merely tells us that an utterance belongs to the language. Both an English language teacher and a programming language compiler writer demand more. They want the operation of "parsing" a sentence/program to reveal the sentence's structure. We shall pursue this later.)

34. As it happens, generation and acceptance are equivalent:

   (a) if a language is recursively enumerable (i.e. can be generated by a Tm), there is another Tm which accepts it,

   (b) if a language is accepted by a Tm, there is another Tm which recursively enumerates its strings.

35. The definition of acceptance of a language by a Tm calls for the Tm to halt and accept a string if it is in the language, but allows the Tm to compute for ever if the given string not in the language. Often we need a stronger requirement. We introduce the notion of a recursive language. Such a language requires a "decision procedure": a Tm guaranteed to halt, and either to accept or reject any string it is given.

36. We can prove the following theorems:

   (a) every recursive language is recursively enumerable (by definition),

   (b) if a language is recursive, so is its complement,

   (c) if a language and its complement are both recursively enumerable, the language is recursive,

We can also construct a language which is recursively enumerable, but whose complement is not (another diagonal argument!) and from (b), can then conclude that

   (d) this language is not recursive.

In summary, there is a language which is not recursively enumerable, and there is a recursively enumerable language which is not recursive. (In other words, the class of recursive languages is a proper subset of the class of recursively enumerable ones, and the latter a proper subset of the class of all languages.)

Chomsky grammars

In the 1950's, Chomsky proposed the concept of "generative phrase-structure grammars" as a means of specifying (the syntax of) languages.

(NOTE. Chomsky was concerned, as we are here, only with specifying which utterances in a language are grammatically correct, that is to say, with the syntax of the language, and not with question of whether these utterances are meaningful, that is, with the semantics. As an example of a syntactically correct but meaningless sentence, he coined the celebrated:

"Colorless green ideas sleep furiously." (his spelling!)

which has since been the subject of numerous discussions.)

Chomsky grammars are characterized by

(In the case of natural languages, the symbols of the terminal alphabet are normally words of the language, and the strings are sentences.)

To obtain a string of the language, we use the starting symbol as initial string (of length 1), and repeatedly replace substrings in accordance with the rewriting rules until the current string contains only terminal symbols.

So, if S is the starting symbol, and the rules are S -> bS, S -> T, T -> aT, T -> c, successive steps might be S, bS, bbS, bbT, bbaT, bbaaT, bbaaac.

Alternative choices of rewriting rules yield different strings. The terminal strings generated in this way are the strings/sentences of the language. (What is the language generated in the example?)

The sequence of strings from starting symbol to terminal string/sentence is called a "derivation" of the sentence, and members of the sequence are "sentential forms". If, during a derivation, no rewriting rule can be found whose left-hand side is a substring of the current string, that particular derivation is blocked and yields no sentence.

Chomsky proposed three successively more severe restrictions to the permitted shape of his rewriting rules, thereby obtaining:

Restriction 1 requires that the right-hand side of a rule shall be at least as long as the left-hand side.

Restriction 2 requires that the left-hand side of every rule shall be a single non-terminal and the right-hand side a non-empty string; so every rule has the form A -> X for some string X.

Restriction 3 requires that each of the rules shall have the form A -> aB or A -> b where A and B are single non-terminals and a and b are single terminals.

We are concerned here primarily with context-free and right-linear grammars, and the languages which they generate.

By the way, we often write several rules having the same left-hand side as one rule with several alternatives separated by |

S -> aBC, S -> bc, S -> C
as
S -> aBC | bc | C

37. Consider the following grammars.

(a) terminal alphabet: {a, b, +, *}
      non-terminal alphabet: {S ,T, F}
      initial symbol: S
      rules: S -> S + T, S -> T, T -> T * F, T -> F, F -> a, F -> b

(b) terminal alphabet: the words in upper case
      non-terminal alphabet: the items in brackets
      initial symbol: [sentence]
      rules:

        [sentence] -> [nphrase] [verb] [nphrase]
        [nphrase]  -> THE [nqual]
        [nqual]    -> [noun] | [adjseq] [noun] | [noun] [rel]
        [adjseq]   -> [adj] | [adjseq] [adj]
        [rel]      -> WHICH [nphrase] [verb]
        [adj]      -> BIG | BAD | RED
        [verb]     -> EAT | EATS | BITE | BUILDS | LIVE IN
        [noun]     -> HOUSE | CAT | CATS | DOGS | BRAN

(c) as for (a), except

      rules: S -> aT, S -> bT, S -> a, S -> b, T -> +S, T -> *S

(d) The "mini-English" grammar on page 93 of Sipser.

(e) terminal alphabet: {x = 16, x = x + 1, x = x - 1, x > 0, if, while, etc.}
      non-terminal alphabet: the items in brackets
      initial symbol: [program]
      rules:

        [program]    -> [statement] ; [program] | [statement]
        [statement]  -> [if-st] | [while-st] | [assignment]
        [if-st]      -> if [bool] then [program] else [program] endif
        [while-st]   -> while [bool] do [program] endwhile
        [bool]       -> x > 0
        [assignment] -> x = 16 | x = x + 1 | x = x - 1

For each grammar, derive some typical strings and describe what language is being generated.

Can you change (a) to allow pairs of parentheses where these are customary?

In (b), aside from their possibly nonsensical meaning, we get some of the sentences which are grammatically (syntactically) correct in English, but others which are not. Can you adapt the grammar to produce grammatical sentences only?

38. [trees] A derivation in a type 2 grammar can be represented by a tree. The tree is obtained if we (recursively) picture the rule A -> BCD... as creating a subtree with root A and descendants B, C, D, ... The tree imparts a structure to the derived string. We can regard the complete substring descended from any node in the tree as a "phrase" of the derived string. In most applications, these phrases are meaningful subunits of the string, and in fact, it is this kind of tree that the English teacher or the compiler writer demand when they ask for a given string to be parsed. Thus the choice of grammar may be important. There may be many different grammars which generate the same language, but some may impart phrase structures which correspond to meaningful substrings, others not. We will then prefer the former.

By the way, using a tree to represent a derivation by a tree conveniently eliminates a nuisance. In a context-free grammar (though not necessarily in less restricted grammars), the order of application of the rules is arbitrary. If you have a sentential form such as aBCDe, then sooner or later, rules of the form B -> PqR, etc. must be used to rewrite B, C, D. But these may be applied in any order without changing the terminal string. The different orders obviously produce trivially different derivation sequences, but all give the same tree. It is, in fact, the structural information in the tree which is useful, not the precise derivation sequence.

Returning to the examples of section 37, draw the trees for some of the strings you derived.

Grammars (a) and (c) generate the same language. (c) is right-linear, (a) is not. Consider the derivation tree for the string a + b * a + b in each of the languages. Would you consider one of the grammars "better" than the other for this language? Why?

In (d), consider the sentence:

the girl touches the boy with the flower

Can you find two different derivation trees for this sentence?

A few paragraphs ago, we talked of two derivations for a string/sentence which differed trivially in the order of applying rules. This is much more serious. This sentence is generated with two different trees, and is actually ambiguous. Each of the two trees suggests a different meaning for the sentence. We will return to this later.

The concept of a derivation tree is closely equivalent to a "phrase bracketing" notation, in which we write a derived string in the form [ab[c[def]g[h]]], where each [...] defines a phrase.

Examine the derivation trees for the right-linear grammar of 37(c). Derivation trees for type 3 grammars are boring. They all have essentially the same shape, and tell us nothing about the structure of the terminal string.

Consider any right-linear grammar, such as the one with start symbol S and rewriting rules such as form S->aT, S->bT, T->cS, S->a, S->c. What can you say about all sentential forms in a right-linear derivation except the terminal string?

39. [empty string] In the form described above, no grammar except type 0 can generate the empty string. In order to allow this, we permit the addition of a rule S -> l (where l is empty) to any grammar, provided that S is the initial symbol of the grammar and S does not occur on the right-hand side of any rule. We see later that any grammar can be expressed in this form.

40. [relationship of languages to machines]

(a) The languages generated by unrestricted Chomsky grammars and those accepted by Tms are precisely the same. Specifically we can prove, by giving constructions for the grammar and for the Tm repectively, that

    every language accepted by a Tm has a type 0 Chomsky grammar, and

    every language having a type 0 Chomsky grammar is generated by some Tm.

We will not concern ourselves with the proofs in this course.

(b) The right-linear languages are precisely the ones accepted by fsms. Specifically

    every language generated by a type 3 grammar is accepted by some fsm, and

    every language accepted by an fsm has a type 3 grammar.

Work out the construction which takes us from either to the other. There is a close relation between states and non-terminals, and between transitions and rewriting rules.

(c) The two previous subsections suggest the question: is there a class of machines which accept precisely the languages generated by context-free grammars? The answer is yes. They are called "pushdown automata". We will consider them later.

Similarly, there is a class of machines which accept languages generated by type 1 grammars. They are the "linear bounded automata". We will not consider them further.

In both cases, these are essentially Tms with restrictions on the way they use their tapes.

41. We might wonder about the "unsymmetric" choice which Chomsky made in formulating his third restriction: why right-linear rules? Why not left-linear rules, of the form A -> Ba, and thus a left-linear grammar? In fact it doesn't matter:

    if a language has a right-linear grammar, it also has a left-linear grammar, and vice versa.

There is presumably a direct way to construct one from the other, but I always go around the block. I begin by noting that, given a right-linear grammar to generate L, we can obtain a left-linear grammar for LR (reverse of L) merely by flipping all rules A -> aB into A->Ba.

So,

   given a right-linear grammar for L,
          obtain an fsm for L using section 40(b) part 1,
          then obtain an fsm for LR using section 9,
          then obtain a right-linear grammar for LR using 40(b) part 2,
          then obtain a left-linear grammar for L by flipping its rules.

   given a left-linear grammar for L,
          obtain a right-linear grammar for LR by flipping its rules,
          then obtain an fsm for LR using section 40(b) part 1,
          then obtain an fsm for L using section 9,
          then obtain a right-linear grammar for L using 40(b) part 2.

42-45. In these sections we will state a number of results. It is useful to know them, but we will not consider their proofs in this course.

42. [hierarchy] We can continue the hierarchy started in section 36 with the following theorems:

   (e) every language generated by a type 1 grammar (csg) is recursive,

   (f) there is a recursive language which is not generated by any csg,

   (g) there is a language generated by a csg which cannot be generated by any type 2 grammar (cfg), and a language generated by a cfg which cannot not be generated by any right-linear grammar.

In short, the type 1 languages are a proper subset of the recursive ones, the type 2 languages a proper subset of the type 1, and the type 3 a proper subset of the type 2.

[Appreciate that every type 3 language is also type 2, and every type 2 also type 1, since each restriction includes the previous one.]

43. [closure properties]

   (a) The union, concatenation, Kleene star and reverse of type n languages are also type n (n = 0, 1, 2, 3).

   (b) The intersection of type n languages is type n (n = 0, 1, 3). However, the intersection of type 2 languages is not always type 2.

   (c) The complement of a type 3 language is always type 3. The complement of a type n language is not always type n for n = 0, 2. The situation for n = 1 has not yet been proved either way.

44. There is a cfg "pumping" theorem, sometimes called "the uvwxy theorem", which is analogous to the one for fsms, and can be used to show that various languages are not cfls. One such language is

{anbncn| n >= 1}

which has a context-sensitive grammar, but not a cfg.

45. The pumping theorem can also be used to show that there is an algorithm to determine whether the language generated by a cfg is empty, finite or infinite. [There can be no such algorithms for csgs or above.]

Context-free grammars

46. [equivalence] Grammars of type 2 model some, though not all, of the syntax features which are desired in programming languages, and do so in a clear manner. (In contrast, grammars of type 1 are very opaque.) Cfgs are therefore commonly used to specify parts of programming language syntax.

If one is writing a compiler for some programming language, it is often useful to manipulate the grammar into some special form, to simplify parsing programs.

What we really mean is that, given a cfg G, we create a new cfg G' which is, in some sense, equivalent to G. So we have to consider the concept of equivalence of grammars.

Equivalence comes in several strengths.

The weakest equivalence between G and G' is merely that both grammars generate the same strings. But this is not enough. If the parsing operation is going to be used to uncover the structure of a string (and not just the fact that the string is in the language), we do not want a G' which produces a completely different tree than G.

At the other extreme, the strongest equivalence requires G and G' to produce a string with trees which are identical except for the names of metavariables at non-leaf nodes. This is too severe.

We will be satisfied with an intermediate form of equivalence. Consider the arithmetic expression:

a - b * c - d

It is important that its structure includes [b * c] and [a - b * c] among its phrases, indicating the subexpression [b * c] is a phrase whose value is to be computed and subtracted from a, and that d is to be subtracted from the result. It does not matter if the grammar creates the phrase structure
[ [a - [b * c] ] - d ], or divides the phrases more finely, perhaps to [ [a] - [ [b] * [c] ] - d ]. But the structures [a - b] * [c - d] and [a - [ [b * c] - d ] ], whose phrases suggest that the subtractions should carried out before the multiplication, or that the second subtraction should be done before the first, are unacceptable.

Most of the following "simplifications" produce a suitable intermediate equivalence: the shape of the derivation tree is altered, but not severely damaged.

47. ["simplified" forms of grammars: all types]

   (a) Any type n language can be generated by a type n grammar whose initial symbol does not appear on the right-hand side of any rule.

   (b) Any type n language (n = 0, 1, 2) can be generated by a type n grammar in which all rules have the forms X -> Y or A -> b, where X and Y are strings of non-terminals only, A is non-terminal, and b a terminal.

48. ["simplified" forms of grammars: cfgs] There are algorithms which allow us to do each of the following:

   (c) eliminate all non-terminals which yield no terminal strings,

   (d) eliminate all non-terminals which cannot appear in strings derived from S,

   (e) eliminate all rules of the form A -> B, where B is a single non-terminal,

   (f) [Chomsky normal form] put all rules in the form A -> BC or A -> b,

   (g) eliminate all direct left-recursive rules (rules of the form A -> AB...),

   (h) [Greibach normal form] put all rules in the form A -> bBCD... (BCD... possibly empty),

   (i) eliminate all non-terminals which generate only finitely many terminal strings.

49. [arithmetic expressions] A supplementary document shows twelve different grammars for simple arithmetic expressions without parentheses. These illustrate many of the manipulations mentioned in sections 47 and 48, and a few other features. Take some simple expressions and draw derivation trees in each grammar, to see the effect of the different grammars on the structure.

50. [ambiguity] As we have seen earlier (section 38 and arithmetic expression grammar #12), some cfgs are ambiguous. There are strings in the languages they generate which have two or more distinct derivation trees. This is a property of the grammar. The same language may be generated by another cfg which is not ambiguous. Some languages, however, are INHERENTLY ambiguous: every cfg which generates them is ambiguous. Unfortunately, there can be no algorithm which determines whether an arbitrary cfg is ambiguous, and no algorithm to determine whether an arbitrary cfg generates a language which is inherently ambiguous. [These problems are said to be "recursively unsolvable".] We will not concern ourselves with proofs.

51. [pushdown automata] Pushdown automata (pda) are machines with a finite set of states, which read an input tape from left to right, each character once only, and make use of an infinitely large "pushdown store" (i.e. a stack) as memory. The rules of operation permit us to access only the topmost symbol of the stack; to access lower symbols we must discard those above. There are two styles of pda: those which accept/reject input strings "by empty stack", and those which do so "by accepting state". The two styles can be proved equivalent (but we won't).

52. Every language generated by a cfg is recognized by a (non-deterministic) pda, and conversely.

53. Unfortunately, this time, it is not the case that every language recognized by a non-deterministic pda is also recognized by a deterministic pda. In fact, there are cfls which are not recognized by any deterministic pda.

Why do we care? Because the pda which recognizes a language can be turned into a straightforward algorithm to parse strings of the language. Imagine a programming language whose syntax is defined by a cfg. The corresponding pda shows us how to recognize (and as a matter fact, obtain the derivation tree for) the thousands of symbols in a program, by reading it once from left to right, pushing and popping a stack, based only on the current state, the next input symbol and the current top stack item. But if the pda is non-deterministic, several alternatives may be available at every moment, requiring massive backtracking or parallelism, and even the possibility of ambiguity (if more than one of the alternatives leads to recognition.)

Worse still, although we can prove that some specific cfgs have dpdas, and that some specific ones do not, the general question of whether a cfg has a dpda is not computable by a Tm! (Careful now. It is not good enough simply to apply the construction which leads to the proof of section 52, and to look for non-deterministic transitions. This construction often leads to an npda, but that doesn't prove that there is no deterministic pda to do the same job.)

So context-free languages (cfls) fall into two classes: deterministic cfls, which are recognized by dpdas, and the others, which are not.

The latter may be time-consuming to parse, and may even contain ambiguous strings. (This is hardly a good property for a programming language. Imagine writing a program, only to have the compiler unwittingly find a different, but correct, interpretation of it.)

The matter of finding subclasses of the cfls, or individual cfls, which can be proved to be deterministic is important in programming language design, since these have good parsing algorithms and are unambiguous. No doubt this topic will reappear in cisc458.

54. By way of examples:

(a) The language {anbn| n >= 1} and the palindrome language with centre marker are examples of dcfls. Give informal descriptions of dpdas which recognize them.

(The palindrome language contains strings which read the same backwards as forwards. The palindrome language with centre marker has a distinct character in the middle of every string. So if S is an alphabet {a, b, c, $}, and T is the subset {a, b, c}, the palindrome language with centre marker is the set {w$wR | w belongs to T* and wR is w reversed}.)

(b) If we omit the centre marker, the language is still a cfl. (Write down a cfg which generates it.)

But it is not a dcfl. We can see this informally. The problem is that to recognize its strings scanning character by character from the left, you have to know when you reach the centre. But there is no way to tell where the centre is until you get to the end. So, at every character, you must non-deterministically follow two paths: one in case you are at the centre, the other in case you are not.

55-57. Here are a few more results to consider if time allows.

55. The complement of a dcfl is a dcfl.

56. The languages L1 = {ambmcn| m, n >= 1} and L2 = {ambncn | m, n >= 1} are dcfls.

57. Consider L = complement L1 v complement L2. This is the union of two dcfls, and is therefore a cfl. But

complement L = L1 intersect L2 = {anbncn| n >= 1}

which is not a cfl. Thus L is a cfl whose complement is not a cfl (and so, not a dcfl). Thus L is a cfl but not a dcfl.

And indeed, we have here counterexamples which prove not only a statement in section 53 (that there is a cfl which is not a dcfl), but also that

(a) the complement of a cfl is not necessarily a cfl,

(b) the intersection of two cfls is not necessarily a cfl,

(c) the intersection of two dcfls is not necessarily a dcfl (or even a cfl),

(d) the union of two dcfls is not necessarily a dcfl.

Farewell

58. [variations/generalizations of Chomsky grammars] Various researchers have sought variations on, or generalizations of, Chomsky grammars; not, surely, in the hope of creating a class of languages above type 0, for if we believe Turing's thesis, this is doomed to failure; but rather to find mechanisms better tailored to various applications. For example, it would be nice to find a mechanism to rise above type 2 languages and capture certain aspects of the syntax of programming and natural languages in a clear way, without having to make use of context-sensitive rules.

We may note that the characteristics of a Chomsky grammar include TWO FINITE alphabets, a SINGLE starting string which is a SINGLE NON-TERMINAL symbol, a FINITE set of rules of a certain FORM, with the rules applied SEQUENTIALLY. Any of the characteristics noted in uppercase may be modified to yield further classes of grammar.

Grammars which have been studied (not necessarily all modifications to these Chomsky characteristics) include:

     Post systems: single alphabet, rules of a more general form; application: formal study of theorem-proving.

     Lindenmayer grammars: single alphabet, parallel application of rules; applications: modelling of biological phenomena, graphics.

     W-grammars (or two-level grammars): infinite sets of non-terminals and rules, themselves defined by rules; application: definition of programming languages.

     Time-varying grammars: the rules vary with each moment.

     Attribute grammars: the nodes of derivation trees have attributes attached which may be carried down the tree; applications to both programming and natural languages.

     Transformational grammars: the (shape of the) derivation tree can be transformed; application: definition of natural languages.

Regretably, we won't have time to study these here.