The VINCI Manual: Preliminaries

Strings, Keywords and Identifiers

The language for which VINCI is generating utterances is called the object language.

The language, or sets of notation, used to describe the object language is called the metalanguage.

As is common in the computing field, sequences of letters and digits are called strings.

When strings in the object language appear in the metalanguage, they are enclosed in double quotes: "cat".

The following strings, if they occur outside double quotes, are reserved as keywords:


        INHERIT   inherit   CHOOSE    choose    SELECT    select

        RULE      rule      _and_     _makes_   _ilt_     _lsp_

        _rhs_     _pre_     _in_      _is_      PRIORITY  priority

        TRANSFORMATION      transformation

Note that VINCI does distinguish between uppercase and lowercase letters. The keywords in this list which differ only in regard to case are synonyms. They are retained for the sake of long-existing files.

Identifiers are strings of letters, digits and the underscore occurring outside double quotes and, with one exception, not starting with a digit. They are used for naming items in the metalanguage: attribute types and values, attribute variables, word categories, the tags of lexical pointers, morphology rules and tables, tree nodes and terminal nodes, and so on. (The exception relates to the names of morphology rules which were once numbered rather than named.)
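The definition above can be sketched as a small Python check. This is only an illustration, not VINCI's own tokenizer: the function name is_identifier and its allow_leading_digit flag (for the morphology rule exception) are hypothetical, and the choice of Latin-1 code points counted as letters is an assumption.

```python
import re

# Keywords listed in this Manual; they are not available as identifiers.
KEYWORDS = {
    "INHERIT", "inherit", "CHOOSE", "choose", "SELECT", "select",
    "RULE", "rule", "_and_", "_makes_", "_ilt_", "_lsp_",
    "_rhs_", "_pre_", "_in_", "_is_", "PRIORITY", "priority",
    "TRANSFORMATION", "transformation",
}

# Assumption: "letter" means the ASCII letters plus the Latin-1 accented
# letters (code points 0xC0-0xFE, minus the multiplication and division
# signs). Byte 255 is left out because it is an internal marker.
LETTER = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FE"
IDENT = re.compile(f"[{LETTER}_][{LETTER}0-9_]*")
IDENT_ANY_START = re.compile(f"[{LETTER}0-9_]+")


def is_identifier(s, allow_leading_digit=False):
    """Return True if s looks like an identifier under the assumptions above.

    allow_leading_digit models the exception for morphology rule names,
    which were once numbered rather than named.
    """
    pattern = IDENT_ANY_START if allow_leading_digit else IDENT
    return s not in KEYWORDS and pattern.fullmatch(s) is not None
```

Under these assumptions, is_identifier("dirobj") is true, while is_identifier("rule") and is_identifier("3rd") are false. A name with a leading underscore is accepted by the check even though the Manual advises against it.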

In some cases there is no clash if the same name is used for two different kinds of item. This is not recommended. Nor is the use of underscore as the first symbol, since we may want to add extra keywords in the future.

Identifiers have no structure and no inherent meaning. Thus, an identifier dirobj has no relation to identifiers dir and obj. If a certain attribute value is noted to cause "s" to be added to some English nouns and "were", rather than "was", to appear in some past tenses, it probably denotes the English plural, regardless of whether it is called plur, zyzzt or (misleadingly) fem. Indeed, if one substitutes a new identifier for an existing one systematically throughout a description, it does not affect the object language.

As noted in the Overview page, the following identifiers play a special part as rule names in the syntax:


ROOT PRESELECT QUESTION ANSWER R_3 R_4 R_5 ... R_20

These are therefore reserved identifiers.

The special names are ROOT, PRESELECT, ..., not Root or root, Preselect or preselect, or other variations. These are reserved identifiers, not keywords. Though this may seem to be splitting hairs, it affects the point at which possible errors are detected, and therefore, the error messages which may be displayed. Keywords are "tokenized", i.e. converted to distinct tokens, at a very early stage of language installation. Misspelt keywords commonly damage the whole structure of rules, and trigger error messages during installation. Misspelt reserved identifiers cause problems only during sentence generation, and except for ROOT may not lead to a detected error. They may just give bad results: an ANSWER sentence not generated, a preselection not carried out, etc.

Our Naming Conventions

The authors observe certain naming conventions, not required by VINCI, but helpful when reading our language descriptions. They are followed in this Manual:

We also follow conventions in naming the files which make up a language description. These will be mentioned at the appropriate time.

Characters and Character Sets

ivi/VINCI assumes that the character set in use for the program follows the ASCII standard, extended to ISO 8859-1 (Latin 1) if European accented characters are in use.

A displayable character is any which has an on-screen representation (as distinct from TAB and other control characters, which do not).

A letter is one of the alphabetical subset of these, including the European accented characters.

With a few exceptions, any letter (including accented ones) may be used in VINCI identifiers, and any displayable character may be used as a letter of the object language (i.e. within double quotes).

The exceptions are as follows:

ivi/VINCI uses byte 255 (y-umlaut) as an internal marker. This should not appear in a file, whether within double quotes or not. The same may possibly apply to bytes 254 (Icelandic thorn, lowercase) and 253 (y-acute).

(The writer has a distant recollection of this, but can no longer locate it in the code.)

The following have a special role in VINCI files, and should not be used as object language letters: | { } "

The symbols *, ^ (circumflex) and ` (backquote) in an object language string are interpreted as the wildstar, the space-eater and the capitalizer respectively. (See later.) They should not be used as normal letters in the object language.

The ivi Editor, in its reformat operation, regards the minus symbol as a hyphen. This may affect VINCI output if minus is used as an object language letter, and the output is reformatted by ivi.
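As a rough aid, the exceptions listed above can be turned into a checker for candidate object language strings. The following Python sketch is not part of ivi/VINCI: the function name check_object_string and the wording of its warnings are hypothetical. It assumes the string has been decoded from Latin-1, so that byte 255 appears as the character U+00FF.

```python
# Characters with a special role in VINCI files.
FORBIDDEN_IN_STRINGS = set('|{}"')

# Characters that VINCI interprets rather than treating as ordinary letters.
INTERPRETED = {"*": "wildstar", "^": "space-eater", "`": "capitalizer"}

# Byte 255 (y-umlaut in Latin-1), used by ivi/VINCI as an internal marker.
INTERNAL_MARKER = "\u00FF"


def check_object_string(s):
    """Return a list of (index, character, reason) warnings for string s."""
    warnings = []
    for i, ch in enumerate(s):
        if ch == INTERNAL_MARKER:
            warnings.append((i, ch, "internal marker (byte 255)"))
        elif ch in FORBIDDEN_IN_STRINGS:
            warnings.append((i, ch, "special role in VINCI files"))
        elif ch in INTERPRETED:
            warnings.append((i, ch, "interpreted as " + INTERPRETED[ch]))
    return warnings
```

A string such as "cat" produces no warnings, whereas "me*t" is reported because the * would be read as a wildstar. The sketch does not cover the possible restriction on bytes 253 and 254, nor the hyphen behaviour of the ivi reformat operation.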

ivi, incidentally, uses only ASCII characters (which include only unaccented letters) in its commands and error messages. VINCI does so too in the fixed parts of messages, but may also output identifiers and object language strings. It should, therefore, be possible to use fonts with the Cyrillic or Greek alphabets if these only replace the Latin-1 accented characters. The restriction on byte 255 still applies. (If y-umlaut, or any other character represented by byte 255, is essential, it must be represented in ivi/VINCI by a different byte. ivi has a feature which allows this to be displayed by y-umlaut, or whatever, on the screen, but it will have to be converted if it is to be used in any other program.)

Patterns, Matches and Searches

When a VINCI operation requires two objects to be matched, VINCI commonly allows one of them to be a pattern rather than a fixed object. For example, if the objects are strings, it may allow a pattern like "me*t", where * is permitted to match any substring, including the empty one (the one having no characters). So this pattern matches "meat", "meant", "met", "merit" and a host of others.

The symbol * in this context is a wildcard or wildstar.

Other examples:

    "t*"    a string beginning "t"
    "*ing"  a string ending "ing"
    "*"     any string

There may be several wildstars in the same string (but they should not be adjacent): "b*tt*r", but not "b**r".

If VINCI needs to determine which substrings match each *, we must be aware that, if there is more than one *, the result may be ambiguous. For example, "b*an*a" matches "banana" in two different ways.
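The ambiguity can be made concrete with a short Python sketch that lists every way a wildstar pattern matches a string. It illustrates the matching idea only and is not VINCI's own algorithm; the function name wildstar_matches is hypothetical.

```python
def wildstar_matches(pattern, text):
    """Return every way pattern matches text.

    Each result is a list holding the substring bound to each *, in order.
    An empty result list means the pattern does not match.
    """
    if "*" not in pattern:
        # No wildstar left: the rest must match literally.
        return [[]] if pattern == text else []
    head, rest = pattern.split("*", 1)
    if not text.startswith(head):
        return []
    remainder = text[len(head):]
    results = []
    # Try every possible length for the substring bound to this wildstar.
    for cut in range(len(remainder) + 1):
        for tail in wildstar_matches(rest, remainder[cut:]):
            results.append([remainder[:cut]] + tail)
    return results
```

For the pattern "b*an*a" and the string "banana" this returns two bindings: the first * empty and the second "an", or the first * "an" and the second empty. For "me*t" against "met" it returns a single binding in which the * is the empty string.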

When a match involves attribute values, VINCI often uses an attribute type as a wildcard representing any value of the type.

When searching a list to locate (or fail to locate) an object on it, VINCI may again allow the object to be a pattern. The pattern may match several objects on the list.

As we have noted elsewhere, a terminal node is really a pattern, a lexical search pattern, which may identify several lexicon entries matching it.

Guarded Rules

Guarded rules appear in three different VINCI contexts: context-free rules, syntax transformations and morphology rules. We discuss them here to avoid repeating the same information in three places.

A guarded rule consists of a sequence of subrules, each having a guard, i.e. a condition, and an action. Thus, it has a form such as:


    guard1 : action1;
    guard2 : action2;
    guard3 : action3;
    %

When a guarded rule is to be obeyed, the guards are tested in turn until one is found to be true. The action of the rule is simply the action corresponding to the first true guard. If no guard is true, the rule causes no action or returns no value, according to context. Optionally, the final subrule may be a default, marked by having no guard or a guard which is always true. If this default is present, its action is certain to occur if no other does.

Programmers who have not met such rules before should note that they generalize the if_then_else, if_then and switch_case statements, all of which are special cases.
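For readers who think more readily in a conventional programming language, the evaluation order of a guarded rule can be sketched in Python. This models the behaviour described in this section and is not VINCI syntax. The names apply_guarded_rule and plural_rule are hypothetical, as is the use of None to mark a default subrule, and the example rule is invented for illustration.

```python
def apply_guarded_rule(subrules, context):
    """Run the action of the first subrule whose guard is true.

    subrules is a list of (guard, action) pairs; a guard of None marks a
    default. Returns the action's result, or None if no guard is true.
    """
    for guard, action in subrules:
        if guard is None or guard(context):
            return action(context)
    return None


# Invented rule for an English noun ending (not taken from any VINCI file).
plural_rule = [
    (lambda c: c["number"] == "plur" and c["stem"].endswith("s"),
     lambda c: c["stem"] + "es"),
    (lambda c: c["number"] == "plur",
     lambda c: c["stem"] + "s"),
    (None, lambda c: c["stem"]),  # default subrule: no guard
]
```

Calling apply_guarded_rule(plural_rule, {"number": "plur", "stem": "bus"}) gives "buses", because the first guard is already true and the later subrules are never tested. If the default subrule is removed and no guard is true, the function returns None, which corresponds to the rule causing no action.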