The Morphology

Following lexical selection and the retrieval of words requested by indirections, the VINCI morphology process creates the appropriate inflected word or phrase at each leaf-node, based on its lexicon entry, its attributes, and perhaps some information from the surrounding nodes. To define the appropriate form, each lexicon entry contains one or more morphological expressions.

To introduce morphological expressions, we draw an analogy with the much more familiar notion of a numerical expression.

A numerical expression computes numbers. It is built of components such as numbers, variables and functions, along with operators like + - * and /. The variables and functions compute numbers, and may themselves be defined by other numerical expressions. An example might be:


    x + 7.86 * y - cos(0.25 * pi)

This tells us how to combine the numbers produced by x, y and the cosine function. It is worth pointing out that this last component may compute a number using some numerical expression or may retrieve an entry from a table.

Analogously, a morphological expression computes words or phrases. It is built of components such as:

    fixed strings:          "er"
    lexical fields:         #7
    table elements:         futur(Nombre, Personne)
    rules:                  $er
    previous pass result:   !

referred to as primaries of the expressions, along with operators + - and &.

A typical morphological expression is:


    #7 + "er" + futur(Nombre, Personne)

This computes the future tense of a French regular -er verb. It asks VINCI to take field 7 of the lexicon entry, perhaps a stem such as "parl" or "donn", to attach "er", and then to append an element extracted from the table futur, using the Nombre and Personne values to select a table entry.

Let us consider the various primaries and operators.

The primaries each produce a string.

The fixed string requires no elaboration. ! produces the string computed in the previous pass. (See Multiple Passes below.) Each of the other primaries defines a further morphological expression (possibly just a fixed string), and calls for its evaluation.

A lexical field primary directs VINCI to the specified field of the lexicon entry. In the example above, we are expecting this to be a fixed string, a stem such as "parl" or "donn", but in general it may be any morphological expression. The field itself may be any except 2, 3 and 4. VINCI imposes content restrictions on these three which prevent them being interpreted as an appropriate expression.

A table element primary requests VINCI to obtain an element from a morphology table. The example table futur probably has the form:


    futur("ai", "as", "a", "ons", "ez", "ont")

in which each element is a fixed string, one of the endings used for the French future tense. But again the element may be any morphological expression. We describe tables more fully below.

A rule primary tells VINCI to evaluate a morphology rule. This has the form of a guarded rule; for example:


    rule er
       pres:   #7 + er_pres(Nombre, Personne);
       imparf: #7 + imparf_ending(Nombre, Personne);
       fut:    #7 + "er" + futur(Nombre, Personne);
       cond:   #7 + "er" + imparf_ending(Nombre, Personne);
    %

which generates four of the tenses of French regular -er verbs. In this simple example the guards are just attributes, which must be present in the leaf node of the verb for a guard match to be successful. In their full generality, the guards are much more complex; they will be discussed below. The right-hand sides are morphological expressions.

Three operators may appear in the expressions:

+ calls for adjacent components to be concatenated. It may, in fact, be omitted, since adjacency itself is adequate to denote concatenation. We generally use it because we believe it improves human readability.

- asks for the immediately previous letter to be deleted; so #1- + "ies" might be used to create the plural of nouns such as "ferry".

& asks for the immediately previous letter to be doubled; so #1& + "ed" might be used to create the past participle of verbs such as "omit".
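The effect of the two postfix operators can be sketched in Python. This is an illustration only, not VINCI code; the function names drop_last and double_last are invented for the purpose.

```python
def drop_last(s: str) -> str:
    """Model of the - operator: delete the immediately previous letter."""
    return s[:-1]


def double_last(s: str) -> str:
    """Model of the & operator: double the immediately previous letter."""
    return s + s[-1:]


# #1- + "ies" with field 1 holding "ferry"
plural = drop_last("ferry") + "ies"

# #1& + "ed" with field 1 holding "omit"
participle = double_last("omit") + "ed"

print(plural, participle)
```

Here plural is "ferries" and participle is "omitted".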

The morphology process begins with an initial expression (or expressions) taken from the lexicon entry of the leaf node. Details follow later.

The morphology process is closely similar to the macro-expansion system well known to computer specialists. Beginning with the initial expression, the process replaces each call, other than those of the fixed string kind, by the expression it specifies. This process continues until only fixed strings remain. The + operator is discarded; the others are carried out in passing. There is no issue about the order or priority of operators, since there is only a single kind of infix operator.

As a morphology expression may contain a primary which evaluates a further expression, it is possible to form a recursion, that is to say some primary may end up calling itself and thus forming an infinite loop. VINCI prevents this by limiting the depth of nesting (i.e. calls within calls) to 10. If this depth is reached, any deeper calls are discarded.
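The expansion process and its nesting limit can be modelled roughly as follows. This is a sketch under stated assumptions, not VINCI's implementation: expressions are represented as Python lists, only fixed strings and lexical field calls are handled, and the exact way VINCI counts nesting levels is assumed.

```python
MAX_DEPTH = 10  # nesting limit stated in the text


def expand(expr, fields, depth=0):
    """Expand a morphological expression into a final string.

    expr is a list whose items are (hypothetical encoding):
      ("str", text)   a fixed string
      ("field", n)    a call on lexical field n, itself an expression
      "-" or "&"      the delete and double operators
    Concatenation (+) is implied by adjacency in the list.
    """
    out = ""
    for item in expr:
        if item == "-":
            out = out[:-1]
        elif item == "&":
            out += out[-1:]
        elif item[0] == "str":
            out += item[1]
        elif item[0] == "field":
            if depth < MAX_DEPTH:  # deeper calls are discarded
                out += expand(fields[item[1]], fields, depth + 1)
    return out


fields = {
    5: [("field", 7), ("str", "er"), ("field", 8)],
    7: [("str", "parl")],
    8: [("str", "ai")],
    9: [("str", "x"), ("field", 9)],  # a field that calls itself
}

word = expand(fields[5], fields)
looped = expand(fields[9], fields)
print(word, len(looped))
```

The self-referencing field 9 does not loop forever: once the limit is reached the inner call is dropped and the expansion ends.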

Morphology Tables

As we have noted above, a morphology table is simply a list of morphology expressions. The set of tables is defined in a morphology tables file, each in the form:


    table_name(m1, m2, ..., mk)

where table_name is an identifier and m1, m2, ..., mk are the individual expressions. The definition contains no indication of the shape (i.e. the number of dimensions and size) of the table. This is left to the table element primary which calls for it. So, the table futur is being accessed as a rectangular table with a row for each Nombre value and a column for each Personne value. (There is no necessary requirement for consistency between different calls, but inconsistency would be rather strange on the part of the language describer.)

The order of the elements in the list is sometimes referred to as row-major order; in the case of futur, this is:


    (sing, p1), (sing, p2), (sing, p3),
                      (plur, p1), (plur, p2), (plur, p3)

where a value of the first type is held fixed while values of the second are cycled. This can be generalized to any number of dimensions.

Oh! I almost forgot. The order of values within a type is determined by their order in the type definition.

The items in parentheses in the primary futur(Nombre, Personne) are table indexes. The table indexes in this example are simple (i.e. non-compound) attribute types. VINCI searches the attribute list of the leaf node to find matching values, which it uses to select the correct table element.
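The row-major lookup can be sketched in Python as below. The TYPES dictionary stands in for the attribute type definitions, and the order of values in it (sing before plur, p1 before p2 before p3) is an assumption consistent with the ordering shown above; table_element is a hypothetical name.

```python
# Assumed attribute type definitions; the order of values matters.
TYPES = {
    "Nombre": ["sing", "plur"],
    "Personne": ["p1", "p2", "p3"],
}

FUTUR = ["ai", "as", "a", "ons", "ez", "ont"]


def table_element(table, index_types, node_attrs):
    """Select an element from a flat table held in row-major order."""
    offset = 0
    for type_name in index_types:
        values = TYPES[type_name]
        # Find the leaf-node attribute which belongs to this type.
        value = next(a for a in node_attrs if a in values)
        offset = offset * len(values) + values.index(value)
    return table[offset]


ending = table_element(FUTUR, ["Nombre", "Personne"], ["plur", "p2", "fut"])
print("parl" + "er" + ending)
```

For a leaf node carrying plur and p2, the offset is 1 * 3 + 1 = 4, which selects "ez".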

An index may also be an attribute value, rather than a type. In this case, the attribute list of the leaf node is not searched.

More generally, the index may be a deconstructed compound attribute. So, if a verb leaf node has an attribute such as sing.subj (perhaps to distinguish it from plur.objd), an index such as:


    Nombre/subj

may be used. VINCI locates the matching sing.subj or plur.subj, deleting the subj component to obtain the desired Nombre value.
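A minimal Python sketch of this deconstruction follows. It handles only the two-component case Type/value described here, and the names TYPES and deconstruct are invented for illustration.

```python
TYPES = {"Nombre": ["sing", "plur"]}  # assumed type definition


def deconstruct(pattern, node_attrs):
    """Resolve a pattern of the form Type/value, e.g. "Nombre/subj".

    Finds a compound attribute x.value in which x belongs to Type,
    and returns x with the value component deleted.
    """
    type_name, fixed = pattern.split("/")
    for attr in node_attrs:
        parts = attr.split(".")
        if len(parts) == 2 and parts[0] in TYPES[type_name] and parts[1] == fixed:
            return parts[0]
    return None  # no matching compound attribute


index = deconstruct("Nombre/subj", ["plur.objd", "sing.subj", "pres"])
print(index)
```

The attribute plur.objd is passed over because its second component is not subj, so the index obtained is sing.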

The index which results from the deconstruction in this example is once again a simple value, but this need not be the case. Consider a more complex example in which a pronoun leaf is given the compound attribute p2.plur.masc.subj, combining several pieces of information, to find a second person plural masculine subject pronoun in a table of pronouns. Based on the previous example, we may write the cumbersome call:


    pron_table(Personne/Nombre/Genre/subj,
                   /Personne.Nombre/Genre/subj,
                       /Personne/Nombre.Genre/subj,
                           /Personne/Nombre/Genre.subj)

in which the same compound attribute pattern is looked up four times, once to obtain each of the four indexes which the table requires. More simply, however, we may just write:


    pron_table(Personne.Nombre.Genre.subj)

VINCI locates the matching compound attribute in the leaf node, and uses this as an index. Indexing by a compound attribute value is the same as if the compounding dots were replaced by commas. This applies however many indexes there are. The indexes may contain deconstruction slashes.

One further convenience should be mentioned. It is permissible for a table name to appear without indexes as the last or only item in a lexical field; let us say, field 9. A morphological expression may then contain the primary #9(Nombre, Personne). The indexes in this primary supply the indexes missing from the field.

Contrary to my stated practice, I did read the code here. As far as I can see, VINCI permits this at the end of any morphology expression, as long as it is called by a primary whose context supplies the indexes. But I wouldn't bet on it, especially if the indexes are supplied higher up than the next level of nesting. The case cited in the main paragraph works.

Morphology Rules

A morphology rule is a guarded rule having the form:


    rule rule_name
        guard1: m1;
        guard2: m2;
        ...
        guardk: mk;
    %

where rule_name is an identifier, and m1, m2, ..., mk are morphology expressions. Each guard is terminated by a colon, each subrule (including the last) by a semicolon, and each rule is terminated by the symbol %.

We shall describe a default guard in the next section. If no guard is found true and there is no default guard, the morphology rule produces an empty string.
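The selection of a subrule can be sketched in Python as follows. The sketch assumes that subrules are tried in the order written and that the first true guard wins; guards are limited here to lists of attributes which must all be present, plus the default *. The name apply_rule is hypothetical, and the expressions are returned as text rather than evaluated.

```python
def apply_rule(subrules, node_attrs):
    """Return the expression of the first subrule whose guard is true."""
    for guard, expression in subrules:
        if guard == "*" or all(attr in node_attrs for attr in guard):
            return expression
    return ""  # no guard true and no default guard


ER_RULE = [
    (("pres",),   '#7 + er_pres(Nombre, Personne)'),
    (("imparf",), '#7 + imparf_ending(Nombre, Personne)'),
    (("fut",),    '#7 + "er" + futur(Nombre, Personne)'),
    (("cond",),   '#7 + "er" + imparf_ending(Nombre, Personne)'),
]

chosen = apply_rule(ER_RULE, ["plur", "p2", "fut"])
nothing = apply_rule(ER_RULE, ["plur", "p2", "subjonctif"])
print(repr(chosen), repr(nothing))
```

The second call matches no guard, so the rule yields an empty string.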

The set of rules is defined in a morphology rules file.

We now define the form of the guards, in which VINCI allows a large degree of generality.

A more limited form defines the guards of context-free syntax rules. We will come back to these at the end of this section.

Morphology and Context-free Rule Guards

Morphology rule guards are conditions, usually referred to as Boolean expressions, which compute one of two values: true or false.

The guards are built of (Boolean) primaries along with the operators and, denoted by commas, and or, denoted by vertical bars. As an example, one of our French verb rules contains the guard:


    (p1 | p2), plur, imparf

This indicates that the subrule is to be used if the verb's attribute list contains:


    either p1 or p2

      and

    plur

      and

    imparf

The parentheses play the same role as those in the numerical expression: (3 + 2) * 7. With them, the value of the expression is 35; without them, 17, because multiplication takes priority over addition. In a Boolean expression, or is analogous to addition; and, to multiplication. Thus, and takes priority over or.
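The priority of and over or can be made concrete with a small recursive-descent evaluator in Python. This is a sketch restricted to attribute primaries; eval_guard is an invented name and the parsing strategy is not taken from VINCI.

```python
import re


def eval_guard(guard, attrs):
    """Evaluate a guard built of attributes, commas (and), bars (or)."""
    tokens = re.findall(r"[A-Za-z_0-9.]+|[(),|]", guard)
    pos = 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = or_expr()
            pos += 1  # skip the closing ")"
            return value
        return tok in attrs

    def and_expr():
        nonlocal pos
        value = primary()
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            rhs = primary()  # always parse the operand, then combine
            value = value and rhs
        return value

    def or_expr():
        nonlocal pos
        value = and_expr()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = and_expr()
            value = value or rhs
        return value

    return or_expr()


with_parens = eval_guard("(p1 | p2), plur, imparf", {"p1"})
without_parens = eval_guard("p1 | p2, plur, imparf", {"p1"})
print(with_parens, without_parens)
```

With only p1 present, the parenthesized guard is false, while the unparenthesized one is read as p1 | (p2, plur, imparf) and is true.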

The guard primaries are:

    attributes:         plur
    lexical fields:     <2=DET>
    current word:       "cat"
    _pre_ phrases:      _pre_ hero
                        _is_ masc _pre_ victim
    default:            *

Attribute primaries are, in fact, compound attribute patterns. They are satisfied (i.e. true) if there is a matching attribute in the list attached to the leaf node. Deconstruction slashes serve no purpose here.

Lexical field primaries are very similar to the field restrictions attached to terminal nodes. They take the form <n=string> where n is the number of a field, and string is some string, which may or may not be in double quotes. As in lexical search, they ask whether the string matches one of the substrings of the field separated by commas, semicolons or colons. (Commas, semicolons and colons in double quotes don't count.)
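The matching just described might be sketched in Python as follows. The names split_field and field_matches are invented here. Two details are assumptions, not documented behaviour: that surrounding double quotes are stripped before comparison, and that a wildstar (*) matches any run of characters.

```python
import re


def split_field(field):
    """Split on commas, semicolons and colons outside double quotes."""
    parts, current, in_quotes = [], "", False
    for ch in field:
        if ch == '"':
            in_quotes = not in_quotes
            current += ch
        elif ch in ",;:" and not in_quotes:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    parts.append(current.strip())
    return parts


def field_matches(field, pattern):
    """True if pattern matches one of the substrings of the field."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return any(
        re.fullmatch(regex, part.strip('"')) is not None
        for part in split_field(field)
    )


print(split_field('a, "b,c"; d:e'))
print(field_matches("pomme", "p*"), field_matches("N", "DET"))
```

The quoted comma in "b,c" does not split the field, so that substring survives intact.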

In contrast to the analogous field restrictions, lexical field primaries may refer to fields 1 and 2, but not to field 3. Thus <1="p*"> asks if field 1 begins with "p", and <2=DET> asks whether the word category of the lexicon entry is DET. Note that the latter asks about the word category of the lexicon entry selected for the leaf node, not the leaf node metavariable. These may not be the same if an indirect or preselected word has a different word category. Primaries for higher-numbered fields allow morphology to depend on properties.

A digression.

Why is field 3 an exception? Mainly because, in contrast to the others, the issue of whether value plur matches type Nombre is not simply textual, but involves the attribute type definition. (There is also an implementation matter, but it is more relevant to lexical transformations.)

Unfortunately, this exception caused us to overlook an important feature in the design. We have primaries to inspect the attribute lists of the leaf node and its neighbours, but none to test the attributes of their lexicon entries. In effect, we can only inspect attributes which are on the tree itself. But information as to whether a French verb is reflexive, or whether its compound tenses are formed with avoir or être, which affect morphology, cannot be known until lexicon selection has taken place when the tree is complete. So these data cannot be represented by attribute values, and must instead be stored as properties where the morphology can see them.

Doubtless we will fix this in due course.

Current word primaries, which take the form of a string in double quotes, seek a match between the string and the word already generated for a node by the morphology process. Commonly they are used to look at neighbours of the node (see below). One application is to see if the following word begins with a vowel, in order to determine, for example, if the current word requires elision. In this case, the process must be taking place in a second pass (see Multiple Passes) since it is required that the next word will already have been processed. These primaries may contain wildstars.

_pre_ phrase primaries take the form _is_ C _pre_ D or _pre_ D, where C and D are compound attribute patterns. The latter asks if D can be matched among the preselection tags; the former, whether C is matched by an attribute in the lexicon entry preselected for tag D. Again, deconstruction slashes serve no purpose here.

The default guard is *, which in this context simply represents true.

Designators

To this point, the guard primaries have tested the features only of the leaf node whose word is under construction. Morphology may, however, depend on features of the surrounding words as well. If the French preposition de is followed by determiner le, the two words must be contracted to form the single word du. But this requires each to know about the other: one to produce the contracted form, the other to generate an empty string.

VINCI achieves this by the use of designators. A designator takes one of the forms: n= or -n=, where n is a small number. Positive numbers designate the successive leaf nodes following the current one, negative numbers those preceding it. 0 designates the current word itself.

Basically the idea is to place a designator before a guard primary to indicate that the primary is to be tested against the designated leaf node. The initial default is 0.

For the sake of abbreviation, a designator persists across commas but not across bars. So, if a, b, c, ... are guard primaries, each guard on the left below is equivalent to the one on the right:


     1= a, b, 2= c            1= a, 1= b, 2= c
     2= a | b | c             2= a | 0= b | 0= c

It is also distributed across parenthesized primaries:


     2= (a, b, c)             2= a, 2= b, 2= c
     2= (a | b | c)           2= a | 2= b | 2= c

but superseded by a local designator:


     2= (a, b, 3= c)          2= a, 2= b, 3= c

The designator reverts to its former value on passing over | or ). So, in the guard:


     p, q | a, 2= b, (c, d, 3= e), f | g

f is tested with designator 2= and g with 0=.
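The propagation rules for designators can be sketched in Python with a stack which records the designator in force on entry to each parenthesized group. This is one reading of the rules above, not VINCI's code; resolve_designators is an invented name, and primaries are limited to simple identifiers.

```python
import re


def resolve_designators(guard):
    """Return (designator, primary) pairs in order of appearance."""
    tokens = re.findall(r"-?\d+=|[A-Za-z_]\w*|[(),|]", guard)
    current = 0
    stack = [0]  # designator in force at entry to each open group
    result = []
    for tok in tokens:
        if tok.endswith("="):
            current = int(tok[:-1])   # a local designator takes over
        elif tok == "(":
            stack.append(current)     # distributed across the group
        elif tok == ")":
            current = stack.pop()     # revert on passing over )
        elif tok == "|":
            current = stack[-1]       # revert on passing over |
        elif tok != ",":
            result.append((current, tok))  # commas keep the designator
    return result


pairs = resolve_designators("p, q | a, 2= b, (c, d, 3= e), f | g")
print(pairs)
```

For the guard above this yields designator 2 for f and 0 for g, with e alone tested against node 3.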

Designators would also be useful in morphological expressions. This would allow new words to be created from pairs, triplets, ... of existing ones by context-free rules and morphology. Lexical transformations have an equivalent feature.

Context-free Guards

The guards in context-free syntax rules are a subset of these, using only the attribute and _pre_ phrase guards, along with the and and or operators and parentheses. Designators are not permitted. The default is normally marked by the symbol >, but < * : presumably also works.

Some Guard Techniques

(a) The special usually occurs before the general. In a not-so-regular French verb we may see a sequence of guards such as:


    (p1 | p2), plur, imparf: ... ;
    imparf:                  ... ;

The first subrule is chosen for the first or second person plural of the imperfect tense; the second for any other imperfect. There is no need to specify Nombre and Personne in the second guard; it cannot be reached for the cases covered in the first.

(b) Many French adjectives require the addition of e to the base word in the feminine, and s in the plural.

An appropriate rule might be:


    rule reg_adj
        fem, plur:  #1 + "es";
        fem, sing:  #1 + "e";
        masc, plur: #1 + "s";
        masc, sing: #1;
    %

This rule requires both attribute values to be present in the leaf node in all four cases. If masc is to be the default gender (i.e. if masc can be left out), then the rule might be:


    rule reg_adj2
        fem, plur:  #1 + "es";
        fem, sing:  #1 + "e";
        plur: #1 + "s";
        sing: #1;
    %

Replacing the last guard by * makes masc, sing the overall default (neither masc nor sing present), but both attributes must be present for the feminine cases. masc and sing can be separate defaults if we use two rules:


    rule reg_adj3
        fem:  $adj_fem;
        plur: #1 + "s";
        *:    #1;
    %

    rule adj_fem
        plur: #1 + "es";
        *:    #1 + "e";
    %

Of course the original rule, reg_adj, with both attributes present can also be replaced by:


    rule reg_adj4
        *: #1 + adj_endings(Nombre, Genre);
    %

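using the table adj_endings("", "e", "s", "es"), whose elements are in the row-major order (sing, masc), (sing, fem), (plur, masc), (plur, fem), assuming that Genre is defined with masc before fem. Indeed, the rule can be dispensed with entirely by including #1 in each table entry.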

In fact, every table is equivalent to a rule in which all combinations of the index values are enumerated as guards.

Inconsistencies in Notation

There is a similarity between terminal node attachments, morphology expression primaries and guard primaries, which is not reflected in the corresponding notation. This was caused by introducing extra features in each of these facilities, when others had been long established. It would be difficult to rectify this, both because of existing files and because some changes may call for non-obvious changes elsewhere.

The same applies to the use of parentheses in table definitions and indexes, rather than the square brackets used for attribute lists.

Initial Expressions and Multiple Passes

A problem arises in regard to the order in which morphology should be applied to the leaf nodes. On the face of it, the nodes should be processed from left-to-right. Consider, however, a sequence of nodes containing the French pronoun je and the verb être. If the verb is in the present tense, its correct form is suis and the two words should be rendered je suis. If in the imperfect, its form is étais, beginning with a vowel, and the pronoun must be elided to j'. The implication, then, is that the form of the pronoun cannot be determined until the form of the verb is known.

To overcome problems of this type, morphology may be carried out in several morphology passes. Each pass involves a left-to-right visit to the leaf nodes. A lexicon entry indicates which pass or passes it takes part in, and provides initial morphology expressions for the individual passes.

By default, morphology has two passes, and the initial expressions are in fields 5 and 6 respectively. The omission of an expression from one of these fields indicates that the lexicon entry is not to participate in that pass.

In the example above, the pronoun presumably takes part in the second pass, the verb in the first.

Suppose we extend this example to include the negative adverb ne between the pronoun and the verb. ne has the same elision requirements as je, and must therefore take part in the second pass of morphology. On the face of it, this would appear to relegate je to a third pass. We can avoid this if we arrange for ne to produce n during the first pass. This does not contribute to the final form of ne. Its sole purpose is to indicate to a predecessor that it begins with a consonant; in fact, any consonant would do.

If any word form does require more than one pass to create it, the string produced during one pass is available to the next one. This is the purpose of the ! morphology expression primary.
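The two-pass scheme, including the trick of letting ne produce n in the first pass, can be sketched in Python. Everything here is a simplified model: each leaf is a pair of functions (one per pass, None meaning that the entry takes no part in that pass), the vowel set is a rough assumption, and the names run_passes and elider are invented.

```python
VOWELS = "aeiouéèê"  # rough assumption, adequate for the example


def elider(full, elided):
    """Build a pass function which elides before a following vowel."""
    def step(words, i):
        nxt = words[i + 1] if i + 1 < len(words) else None
        starts_with_vowel = bool(nxt) and nxt[0] in VOWELS
        return elided if starts_with_vowel else full
    return step


def run_passes(leaves):
    words = [None] * len(leaves)           # word generated so far per leaf
    for p in range(2):                     # two passes by default
        for i, leaf in enumerate(leaves):  # each pass runs left to right
            step = leaf[p]
            if step is not None:           # omitted: no part in this pass
                words[i] = step(words, i)
    return words


JE = (None, elider("je", "j'"))
NE = (lambda words, i: "n", elider("ne", "n'"))  # pass 1 yields a consonant
ETRE_PRES = (lambda words, i: "suis", None)
ETRE_IMPARF = (lambda words, i: "étais", None)

print(run_passes([JE, ETRE_PRES]))
print(run_passes([JE, ETRE_IMPARF]))
print(run_passes([JE, NE, ETRE_IMPARF]))
```

In the last case je sees the placeholder n left by the first pass and is not elided, while ne, processed later in the same second pass, sees étais and becomes n'.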

How many passes are enough? This is a question of skill and taste, which will not be discussed here. (At some point, I will revise and post an essay on this written a few years ago.) VINCI does, however, allow the language designer to change the number of passes and their starting fields. Furthermore, the passes may be separately defined for each tree root. For example, the R_3 sentence may be produced using two passes with initial expressions in fields 5 and 6, while R_7 may use three passes with initial expressions in fields 22, 23 and 24.

In one application we have used this to produce both orthographic and phonological representations of the same sentence. Orthographic expressions, stems and initial expressions were stored in one set of lexical fields, phonological ones in another. All that is required is that two tree roots produce identical trees, but with the morphology process initiated by different expressions in the two cases.

Post-morphology

With the completion of the morphology process, the syntax trees are in their final form, and VINCI enters its post-morphology phase. Basically this gathers the leaf node words of each tree into a single string separated by spaces to form a generated sentence. Some small matters remain to be resolved.

One is the removal of certain spaces. If, for example, the French pronoun "je" is elided to "j'" before a verb beginning with a vowel, say, "ai", the form of the output should be "j'ai", not "j' ai", the space being eliminated.

To achieve this, we make use of the space-eater, the circumflex character ^, which is treated as an alphabetic character in any object language. If this character appears in any final string, it devours all spaces adjacent to it and then disappears. Thus the elided form of "je" should be "j'^".
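The behaviour of the space-eater can be sketched in Python with a regular expression. The function name post_morphology is invented, and the decision to skip leaf nodes which produced an empty string is an assumption made to avoid doubled spaces.

```python
import re


def post_morphology(words):
    """Join leaf-node words with spaces, then let ^ devour adjacent spaces."""
    sentence = " ".join(w for w in words if w)
    return re.sub(r" *\^ *", "", sentence)


print(post_morphology(["j'^", "ai"]))
```

The words j'^ and ai are first joined as "j'^ ai"; the space-eater then removes the space next to it and itself, leaving "j'ai".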

Punctuation

There are two ways for VINCI to punctuate the generated sentences.

One is simply to remember that VINCI is embedded in the ivi Editor, and to use the features of the Editor to add whatever punctuation is appropriate. If desired, one can combine the keystroke which triggers generation and the sequence of operations which add the punctuation into a single ivi function. The function key will then generate punctuated sentences.

The other is to regard pieces of punctuation as "words" in a separate word category, and arrange for the syntax rules to add punctuation nodes as necessary. A typical lexical entry might be:


    "^."|PUNCT|period||#1||

The space-eater will ensure that the period is joined to the preceding word.

The same approaches apply to capitalization based on syntax. Capital letters at the start of proper names in languages such as English appear as such in the lexicon, and require no special action. Some languages, however, require capitalization at the start of sentences. This can be handled by ivi editing operations. Alternatively VINCI provides a capitalizer, the backquote character, which may be treated as a punctuation symbol and placed as a "word" at the start of the tree. During the post-morphology phase, this symbol capitalizes the letter following it in the generated sentence and then disappears.
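A possible model of the capitalizer in Python is given below. It is a sketch only: apply_capitalizer is an invented name, and skipping any spaces between the backquote and the following letter is an assumption about how the symbol interacts with word separation. Python's str.upper works on characters, not on byte values, so it does not reproduce any encoding dependence of the real implementation.

```python
import re


def apply_capitalizer(sentence):
    """Capitalize the letter following each backquote, then drop the backquote."""
    return re.sub(r"` *(.)", lambda m: m.group(1).upper(), sentence)


print(apply_capitalizer("` je suis ici"))
```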

This symbol violates our attempts to keep VINCI independent of the object language. The conversion of lowercase letters to their uppercase equivalent assumes a certain relation between the bytes which encode them, which in turn relies on the properties of the ISO 8859-1 encoding.