The Lexicon

As we have seen, a lexicon is a set of records (or lexicon entries) corresponding to words, phrases, morphemes, and so on. Each record consists of textual fields, each terminated by a bar (|), and occupies one line of a lexicon file.

The records may be created by the ivi Editor in record mode, or by any other text editor (including ivi in text mode). If the former is used, each record will begin with the ivi record flag, \r, but VINCI is programmed to ignore this during installation. If a non-ivi editor is used, be sure that each record ends with a so-called "hard" linebreak, not the phantom visual break which results from an automatic line-wrap.

The number of fields and, with a few exceptions, their contents are determined by the language describer. The exceptions:

field 1
contains the lexicon entry's headword.
field 2
must contain a word category, or terminal, specified in the terminals file. (See Files and Installation.)
field 3
is empty or contains a list of attributes.
field 4
is empty or contains a number: the frequency value.
fields 5 and 6
by default, contain the initial morphological expressions corresponding to every tree root. (They may, of course, be empty.) The describer can alter the fields which are used.
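
The record layout described above can be sketched in Python. This is only an illustration, not VINCI's own installation code: the names LexiconEntry and parse_record are hypothetical, the sample record and its frequency value are invented, and the sketch assumes that no bar occurs inside a quoted string or a comment.

```python
from typing import List, NamedTuple, Optional


class LexiconEntry(NamedTuple):
    """Hypothetical container for the fields whose meaning is fixed."""
    headword: str              # field 1
    category: str              # field 2: a terminal (word category)
    attributes: List[str]      # field 3: possibly empty attribute list
    frequency: Optional[int]   # field 4: None when the field is empty
    other_fields: List[str]    # fields 5 onward, left uninterpreted


def parse_record(line: str) -> LexiconEntry:
    # Every field is terminated by a bar, so a well-formed record ends
    # with "|" and splitting leaves one empty trailing piece to drop.
    pieces = line.rstrip("\n").split("|")
    if pieces and pieces[-1].strip() == "":
        pieces.pop()
    fields = [p.strip() for p in pieces]
    fields += [""] * (4 - len(fields))  # pad short records (assumption)
    attributes = [a.strip() for a in fields[2].split(",") if a.strip()]
    frequency = int(fields[3]) if fields[3] else None
    return LexiconEntry(fields[0], fields[1], attributes, frequency, fields[4:])


# Invented sample record: field 5 holds "#3", field 6 is empty.
entry = parse_record('"chat"|N|masc, sing|10|#3||')
```

Fields 5 onward are kept as raw strings because, as noted above, the describer decides what they contain.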

Headword is, perhaps, a misnomer. It need not be a headword in the conventional sense used by lexicographers. It can be any string, with or without double quotes. It is, however, the target used by mandated choices and lexical pointers, and for that purpose it must be a string in double quotes; otherwise it will not be found.

We ourselves use forms like "table_1" and "table_2" for different senses of words. This is convenient for pointers, which presumably point to a specific sense.

If field 1 begins with the keyword _ilt_, the lexicon entry represents an inverted lexical transformation. (See later.)

Field 3 specifies the set of attribute values, simple or compound, for which the lexicon entry is valid. Every attribute in the list attached to a terminal node must be present for this lexicon entry to be chosen.

For example, if the terminal node is N[masc, sing], VINCI may select:

    "chat"|N|masc, sing|...

but not:

    "chats"|N|masc, plur|...

In practice, various abbreviations are used to shorten the set, illustrated in the following example:

    masc, >humain, Nombre, Personne.sing

The attribute type Nombre represents all values of that type.

The compound attribute pattern Personne.sing denotes all attributes formed by compounding a Personne value with sing.

>humain represents the value humain and all values greater than humain in its type.

We may also have <edible.objd, denoting all compound attributes formed by compounding the attribute edible, or any lesser value, with objd. These last two presume humain and edible to belong to types defined to be partially ordered.

Genre.Nombre represents all dotted combinations of the two types.
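
To make the abbreviations concrete, the following Python sketch expands a field 3 list into the full set of attributes it stands for, then applies the rule that every attribute on the terminal node must be present. It is a sketch under stated assumptions, not VINCI's algorithm: the type table TYPES, the ordering GREATER, the values p1, p2, p3, anime and adulte, and all function names are hypothetical. In a real description the types and their orderings come from the language describer.

```python
from itertools import product

# Hypothetical attribute types and their values.
TYPES = {
    "Genre": ["masc", "fem"],
    "Nombre": ["sing", "plur"],
    "Personne": ["p1", "p2", "p3"],
}
# Hypothetical partial order: value -> values immediately greater than it.
GREATER = {"anime": ["humain"], "humain": ["adulte"]}


def invert(edges):
    """Turn a 'greater than' table into a 'lesser than' table."""
    inverse = {}
    for low, highs in edges.items():
        for high in highs:
            inverse.setdefault(high, []).append(low)
    return inverse


LESSER = invert(GREATER)


def closure(value, edges):
    """Return value plus everything reachable from it through edges."""
    seen, stack = [], [value]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(edges.get(current, []))
    return seen


def expand_component(component):
    if component.startswith(">"):
        return closure(component[1:], GREATER)
    if component.startswith("<"):
        return closure(component[1:], LESSER)
    if component in TYPES:           # a type name stands for all its values
        return list(TYPES[component])
    return [component]               # a plain attribute value


def expand_attribute_list(field3):
    """Expand an abbreviated field 3 into a set of attribute values."""
    result = set()
    for item in (i.strip() for i in field3.split(",")):
        if item:
            parts = [expand_component(c) for c in item.split(".")]
            result |= {".".join(combo) for combo in product(*parts)}
    return result


def entry_is_valid(node_attributes, field3):
    # Every attribute attached to the terminal node must be present.
    return set(node_attributes) <= expand_attribute_list(field3)


expanded = expand_attribute_list("masc, >humain, Nombre, Personne.sing")
```

With these hypothetical tables, entry_is_valid(["masc", "sing"], "masc, Nombre") holds, while entry_is_valid(["masc", "sing"], "masc, plur") does not.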

Fields 2, 4, 5 and 6 need no further elaboration.

The other fields may contain whatever the language describer wants. In our lexicons this has included morphological expressions, including stems, lexical pointers, phonetic transcriptions, glosses in other languages, sample contexts, notes, properties and restrictions.

Our phonetic transcriptions and glosses are themselves morphological expressions, the latter commonly strings in double quotes. This allows them to be produced as part of the output. The same applies to sample contexts and notes. If notes and comments are intended just for the human reader, the VINCI comment feature (enclosure in brace brackets) can be used instead, the difference being that comments are discarded during installation and take up no space in the installed lexicon.

We have sometimes used the term properties for identifiers placed in particular fields, which are used to restrict lexical selection and control morphology, though there is no difference at all between these and any other components which can appear in lexical restrictions. Typical properties in a French lexicon might be:

    aspiré, M&M_Chap12

The latter could be used to restrict vocabulary to those words which occur in the early part of a grammar text. The former might mark words with an "aspirated" initial h to prevent morphology eliding words like "je", "ne", "le" before them. Such properties behave like attribute values except that they are not organized into types, are not pre-defined, and are unlimited in number.

The later fields may also contain attribute lists or terminals, though these can be used as such only if some process (say, a lexical transformation) transfers them to fields 3 and 2 respectively.

The apparent ambiguity which arises because a string (e.g. "cat") might be either a lexical pointer or a morphological expression is not important. The resolution is determined by the way VINCI comes upon them. Indeed the same field may function as a lexical pointer or morphological expression on different occasions.

Lexical Selection, Lexical Pointers and Indirections

One might expect subsections on these topics here. There is, however, nothing to add to the material already covered in the Overview and in the Syntax section.

Frequency Variation

Two features are provided to vary temporarily the frequency value in field 4 of a lexicon entry. One is the frequency variation attachment to a terminal node, mentioned in the Syntax section of the Manual; the other is a command, VFreq. They are very similar in nature. The former changes the frequency value of the single entry selected to match the terminal node. Thus we can prevent a particular entry being chosen a second time by setting its frequency value to zero, or raise or lower its probability. The latter changes all entries which match the parameter of the command. In both cases, the values revert to their original form when the lexicon is reinstalled.

As noted in the Syntax section, frequency variation attachments have one of the forms /+n or /$n, where n is a number. They are interpreted as follows:

    Attachment                  Effect
    +0, +1, ..., +9999          Change the frequency to this value
    +10000, ...                 Change the frequency to 9999
    $1 (or no vf attachment)    Make no change
    $2                          Add 5 to the frequency
    $3                          Double the frequency
    $4                          Halve the frequency (rounding down)
    $5, ...                     Make no change

In all cases, the maximum allowable frequency is 9999, and higher values are set to this.
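
The interpretation rules above can be restated as a small Python function. This is an illustrative sketch only: apply_variation and MAX_FREQUENCY are hypothetical names, and treating any $n outside 2 to 4 (including $0, which the rules do not mention) as "no change" is an assumption.

```python
from typing import Optional

MAX_FREQUENCY = 9999  # maximum allowable frequency stated above


def apply_variation(frequency: int, attachment: Optional[str] = None) -> int:
    """Return the new frequency after a +n or $n attachment is applied."""
    if not attachment:
        return frequency                 # no vf attachment: no change
    text = attachment.lstrip("/")        # accept "/+9" as well as "+9"
    kind, number = text[0], int(text[1:])
    if kind == "+":
        new_value = number               # set the frequency outright
    elif kind == "$":
        if number == 2:
            new_value = frequency + 5
        elif number == 3:
            new_value = frequency * 2
        elif number == 4:
            new_value = frequency // 2   # halve, rounding down
        else:
            new_value = frequency        # $1, $5 and above: no change
    else:
        raise ValueError("unrecognised attachment: " + attachment)
    return min(new_value, MAX_FREQUENCY)  # higher values are set to 9999
```

For instance, apply_variation(7, "$4") gives 3, and apply_variation(6000, "$3") is capped at 9999.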

Generation proceeds normally to the point where the node is looked up in the lexicon. The variation, if any, is then applied to the lexicon entry selected.

The VF command takes as parameter a string which is closely similar to a terminal node; for example:

    VF N[masc]/"m*"/5=#3/+9

As usual, the terminal node is a lexicon search pattern, which selects masculine nouns beginning with m and having #3 in field 5. It will normally include a frequency variation attachment (but see below); it is +9 in this example. Syntax transformations or _pre_ attachments are irrelevant and forbidden in this context. Indirections are permitted, but serve no purpose except for the trivial field restriction imposed by the first indirection in a sequence.
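
The effect of the example parameter can be imitated in Python on a toy lexicon. Everything here is hypothetical: the entries, the dictionary layout and the function select are invented for illustration, and fnmatchcase merely stands in for VINCI's own pattern matching, which may well differ for patterns other than "m*".

```python
from fnmatch import fnmatchcase

# Invented entries; "field5" stands for the content of field 5.
lexicon = [
    {"headword": "mur", "category": "N", "attributes": {"masc", "sing"},
     "frequency": 5, "field5": "#3"},
    {"headword": "maison", "category": "N", "attributes": {"fem", "sing"},
     "frequency": 5, "field5": "#3"},
    {"headword": "chat", "category": "N", "attributes": {"masc", "sing"},
     "frequency": 5, "field5": "#3"},
    {"headword": "mot", "category": "N", "attributes": {"masc", "sing"},
     "frequency": 5, "field5": "#1"},
]


def select(entries, category, attributes, headword_pattern, field5):
    """Return the entries matching every part of the search pattern."""
    return [
        e for e in entries
        if e["category"] == category
        and attributes <= e["attributes"]               # N[masc]
        and fnmatchcase(e["headword"], headword_pattern)  # "m*"
        and e["field5"] == field5                       # 5=#3
    ]


# Imitates: VF N[masc]/"m*"/5=#3/+9
matched = select(lexicon, "N", {"masc"}, "m*", "#3")
for e in matched:
    e["frequency"] = 9  # the +9 attachment sets the frequency to 9
```

Only "mur" satisfies all four conditions here, so it alone has its frequency changed; the other three entries keep their original value.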

The command does three things:

The command can, of course, be used to discover the number of lexicon entries of a particular kind -- terminal class, attributes, spelling, specific fields -- and display a list of them, by setting an appropriate parameter and either omitting a vf attachment or aborting the changes.

If the parameter metavariable is not a terminal, an error message is shown on the second-to-last line of the screen, and the command is ignored. Other parameter errors, e.g. "attribute not known", are reported in corefile 7, and the command continues. (So it might be wise to abort changes.) Because the user doesn't see these reports before the command is erased, the command itself is also written to corefile 7.

Warning: There is a limit of about 25,000 on the number of lexicon entries matching a terminal node which can be recorded. The command alters only the first 24,999 and the last.