Generating Oral French

Craig Thomas, Michael Levison, Greg Lessard.

Abstract

The goal of the research reported here was to combine a natural language generator and a speech synthesizer to produce spoken sentences (in French) which sound natural, that is, fairly close to the way in which a native speaker would utter them. For this, it was necessary to analyze samples of human speech to determine prosodic patterns and relate them to grammatical phenomena. The French grammar used by the generator was then adapted so that it generated prosodically marked phonological output. The resulting sentences were passed to the synthesizer and were judged satisfactory by several native speakers of French.

Introduction

Recent years have seen a significant amount of work on multimedia generation, particularly in the form of text-to-speech (and speech-to-text) systems, which have as their goal the production of multi-channel communication between human and machine. By all accounts, at least in languages such as English and French where the mapping between orthography and phonology is complex, an entirely satisfactory solution to the problem of multimedia input and output is still some years away. (For an overview, see Dutoit, 1996.)

Less work has been done on a variant approach: concept-to-speech (CTS), in which a high-level semantic and syntactic specification is used to drive two streams of output, one orthographic, the other phonetic. Among other advantages, the CTS model removes the need to pass through the phonology-orthography interface. Here we illustrate an implementation of the CTS approach using the VINCI natural language generation environment.

VINCI

VINCI is a natural language generation system that allows users to create a grammar that models a natural language by specifying lexicon, syntax and morphology, as well as limited semantics and lexical relations. The original purpose of VINCI was to create test exercises for language learning, but it has grown to become a general-purpose tool used to examine a variety of linguistic phenomena, including word formation, computer generated humour, and so on. Simply put, users of the system can write grammars describing the features of a language. The system can then generate utterances in the language that was specified.

Within the context of language learning, VINCI generates sets of related utterances ("exercises"), which may contain stimuli requiring a response from the learner ("questions") as well as the expected responses themselves ("answers"). The system includes a complex comparison procedure which can compare an actual learner response with the answer, reporting on differences of word order, as well as lexical and morphological errors (Levison et al., 2000). In a simple exercise, the questions may be randomly constructed sentences of some specified type, and the answers a transformation of the questions. For example, the question may be a declarative sentence, the required answer an interrogative form with the direct object expressed as a pronoun (Figure 1, and Levison and Lessard, 1996). In a more elaborate situation, VINCI has been made to generate a fairy tale based on a randomly chosen hero, villain, victim, and so on, along with actual questions about the plot, allowing a learner to be tested not only for grammar, but also for comprehension (Figure 2, and Lessard and Levison, 2002).

In the work reported here, we have extended the system so that the generated sentences and questions can be produced orally.

MBROLA

The basis for the oral output is its representation in a variant of the International Phonetic Alphabet (IPA). The French lexicon used by VINCI has been augmented with IPA representations for each entry, and a phonological morphology has been created to inflect the phonetic forms. Since there is no difference in concept between the IPA representation and the standard orthography, VINCI readily produces either or both (Figure 3).

With a simple conversion, the IPA form can be used as input to MBROLA (Dutoit, 2002), a generalized speech synthesizer based on diphones, which accepts a database corresponding to the language to be spoken and a phonological representation of an utterance, and produces the requested speech. In this research, we have used a database for a female voice speaking Parisian French, though other French voices might be substituted. The simplest form of input for MBROLA is shown in Figure 4. Each line specifies a phoneme and its duration in milliseconds.

If a sentence like this, lacking prosodic information, were played by MBROLA, it would sound monotone and devoid of character. To counter this, MBROLA permits the addition of "pitch-pairs" (Figure 5). Here, for example, the program is requested to raise the fundamental pitch from its base level of 80 Hz to 103 Hz at the point 50% of the way through the phoneme /a/. (The synthesizer interpolates the change linearly.) Such pitch-pairs, along with changes to the phoneme length, are used to implement the components of the prosody. In order to produce them, VINCI must be made to generate prosodic information as well as IPA output.
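The input format and the linear interpolation can be illustrated with a minimal Python sketch. It writes lines in the notation of Figures 4 and 5 and computes the pitch part-way between two targets. The function names are hypothetical, the parenthesized layout simply copies the figures (the exact syntax MBROLA accepts should be checked against its documentation), and placing each target at an absolute time in milliseconds is an assumption made for the example.

```python
from typing import Optional, Tuple

# A pitch-pair: (position as % of the phoneme's duration, target pitch in Hz).
PitchPair = Tuple[int, int]


def format_line(phoneme: str, duration_ms: int,
                pitch: Optional[PitchPair] = None) -> str:
    """Render one input line in the notation used in Figures 4 and 5."""
    line = f"{phoneme} {duration_ms}"
    if pitch is not None:
        position, hertz = pitch
        line += f" ({position}, {hertz})"
    return line


def interpolate(t_ms: float, t0_ms: float, f0_hz: float,
                t1_ms: float, f1_hz: float) -> float:
    """Pitch at time t_ms on the straight line between two pitch targets."""
    fraction = (t_ms - t0_ms) / (t1_ms - t0_ms)
    return f0_hz + fraction * (f1_hz - f0_hz)


# First three phonemes of Figure 5.
segments = [("l", 80, (50, 80)), ("a", 80, (50, 103)), ("d", 60, (50, 80))]
lines = [format_line(*segment) for segment in segments]

# The target on /l/ falls at 40 ms and the target on /a/ at 120 ms, so the
# pitch at the /l/-/a/ boundary (80 ms) lies halfway between 80 and 103 Hz.
boundary_pitch = interpolate(80, 40, 80, 120, 103)
```

Under these assumptions, `lines` reproduces the first three rows of Figure 5 and `boundary_pitch` is 91.5 Hz.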

French Prosody

We are concerned here with three aspects of prosody: phoneme length, pitch and stress.

The lengths of French phonemes have been studied extensively by Simon (1967) and Brichler-Labaeye (1970), who provide details of the lengths, some of which are context-dependent. We have already noted the need for an intermediate program to convert the phonological output produced by VINCI to the form required by MBROLA. Information about lengths is incorporated into this program, which creates the second column of the MBROLA input appropriately.
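As a rough illustration of this conversion step, the following Python sketch attaches a duration to each phoneme by table lookup, with an optional override keyed on the following phoneme to stand in for context-dependent lengths. The table holds only the values visible in Figure 4, not the published measurements, and the function name and the override mechanism are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Durations in milliseconds copied from Figure 4. A real table would be
# filled from the published measurements, which are not reproduced here.
DURATIONS_MS: Dict[str, int] = {
    "l": 80, "a": 80, "d": 60, "y": 80, "t": 80, "n": 80, "@": 80,
}

# Hypothetical context table: (phoneme, following phoneme) -> duration in ms.
ContextTable = Dict[Tuple[str, Optional[str]], int]


def durations_for(phonemes: List[str],
                  context_overrides: Optional[ContextTable] = None
                  ) -> List[Tuple[str, int]]:
    """Pair each phoneme with a duration, preferring a context-specific one."""
    overrides = context_overrides or {}
    rows = []
    for index, phoneme in enumerate(phonemes):
        following = phonemes[index + 1] if index + 1 < len(phonemes) else None
        duration = overrides.get((phoneme, following), DURATIONS_MS[phoneme])
        rows.append((phoneme, duration))
    return rows


# The fragment "l'adulte ne" from Figure 4.
fragment = ["l", "a", "d", "y", "l", "t", "n", "@"]
rows = durations_for(fragment)

# An invented override, for illustration only: /a/ lengthened before /d/.
rows_with_context = durations_for(fragment, {("a", "d"): 90})
```

Here `rows` gives the phoneme and duration columns of Figure 4, while `rows_with_context` differs only in the entry for /a/.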

Information about pitch changes was not available in the form we needed, and we have accordingly done a limited study of our own. It would have been possible to analyze a corpus of oral French. This, however, would have required us to transcribe the individual sentences into standard orthography, and to parse each into its constituent clauses and phrases in order to determine the relation between phrases and changes of pitch.

We therefore used VINCI, along with a simple syntax, to generate a series of sentences. In general, the sentences included a subject, a verb and perhaps a direct object and an adverb. We allowed sentences which were interrogative, declarative or imperative, positive or negative, with subjects being nouns, pronouns or proper names, using transitive or intransitive verbs. Instances of all reasonable combinations were created, 80 in all, and subsets were read by native French speakers, whose readings were digitally recorded as WAV files.

Each file was analyzed by software called Praat (Boersma, 2001). This produces a graph of fundamental frequency (pitch) vs time, on which the words of the sentence can be overlaid (Figure 6).

Qualitative analysis of these "Praat pictures" indicates that the curves can be approximated by about eight levels, and strongly suggests that pitch is a cumulative combination of three factors: word effects, phrase effects and sentence effects. Each word appears to have its own rise-fall curve, which is superimposed upon the phrase curve. Different types of phrases each have their own curve. For example, a noun phrase, which often consists of a determiner and a noun, has a rising pitch on the determiner and a falling pitch on the noun. Sentence effects are similar. Declarative sentences tend to start at the speaker's resting frequency, show a rising pitch and then a fall to a frequency below the resting frequency. Interrogative sentences tend to show an overall rising pitch.
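One simple way to picture such a cumulative combination is to treat each effect as an offset on a scale of eight pitch levels and add the offsets syllable by syllable. The Python sketch below does this under assumptions that go beyond the analysis reported here: the effects are taken to be strictly additive, the result is clamped to levels 1 to 8, and every number in the example contour is invented.

```python
from typing import List

LEVELS = 8  # the curves are approximated by about eight levels


def combine(sentence: List[int], phrase: List[int], word: List[int]) -> List[int]:
    """Add sentence, phrase and word effects per syllable, clamped to 1..LEVELS."""
    if not (len(sentence) == len(phrase) == len(word)):
        raise ValueError("all three contours must cover the same syllables")
    return [max(1, min(LEVELS, s + p + w))
            for s, p, w in zip(sentence, phrase, word)]


# Invented four-syllable declarative: the sentence contour rises and then
# falls below its starting level; phrase and word offsets ride on top of it.
sentence_contour = [3, 4, 5, 2]
phrase_offsets = [0, 1, 0, -1]
word_offsets = [0, 1, 1, 0]

contour = combine(sentence_contour, phrase_offsets, word_offsets)
```

With these invented values the combined contour is `[3, 6, 6, 1]`. A weighted or non-linear combination would fit the same description equally well.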

We can expect a fourth effect related to paragraphs. Some evidence for paragraph-level intonation effects was obtained by asking two of the francophones to read generated fairy tales. The sample, however, was too small to draw conclusions with confidence.

Stress in French is a complex phenomenon whose physical instantiations may include syllable length, intensity, and pitch. French is a fixed-stress language in that the last syllable of the syntactic unit (typically the clause) receives stress, while all preceding syllables are unstressed. Examples are "la FILLE", "la petite FILLE" and "la petite fille maLADE", where the stressed syllable is shown in upper case.
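The rule is easy to state procedurally. The Python sketch below applies it to a unit whose words are separated by spaces and whose syllables are separated by hyphens, upper-casing the final syllable as in the examples above. The function name, the hyphenated orthographic input and the use of upper case as the stress mark are all illustrative assumptions.

```python
def mark_stress(unit: str) -> str:
    """Upper-case the last syllable of the last word of a syntactic unit."""
    words = unit.split()
    if not words:
        return unit
    syllables = words[-1].split("-")
    syllables[-1] = syllables[-1].upper()
    words[-1] = "-".join(syllables)
    return " ".join(words)


examples = [mark_stress(unit) for unit in
            ["la fille", "la pe-tite fille", "la pe-tite fille ma-lade"]]
```

The three results are "la FILLE", "la pe-tite FILLE" and "la pe-tite fille ma-LADE": only the final syllable of each unit is marked, however long the unit becomes.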

In order for VINCI to produce prosodic markings, 28 symbols were added to the phonetic alphabet, representing absolute or relative changes of pitch, syllable breaks and stress markers. Intra-word pitch markers were simply included within the phonological spelling in the lexicon. Pitch variations related to phrases or sentences were inserted as new lexical entries, and these new "words" were introduced into the syntax rules. Stress marks were added, in accordance with the rule mentioned above, by the front-end used to convert VINCI output to MBROLA input. With these additions, VINCI can generate prosodically marked phonological output.

Experiment

A new set of sentences was created similar in form and variety to the ones produced originally. These were synthesized by MBROLA and recorded. Five judges, all native French speakers, were asked to listen and evaluate the quality of the results. Each judge was given a random selection of twelve sentences and asked to rate their quality on a five-point scale (Figure 7). The aggregated results are shown in Figure 8. In one or two cases questions were missed, leading to row totals below 60. Where a rating was N or lower, the judge usually pinpointed the deficiency. Most were minor problems, easily fixed: a phoneme too long, missing phonetic variants in the MBROLA database, and so on. Nonetheless, 68.7% of the responses (160 of 233) were in the A or SA categories.
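The proportion of favourable responses follows directly from the counts in Figure 8, as the short Python check below shows. The variable names are illustrative; the counts are those of the figure.

```python
# Counts from Figure 8, in the column order SD, D, N, A, SA.
RESULTS = {
    "Q1": [1, 9, 8, 21, 20],
    "Q2": [1, 9, 12, 18, 18],
    "Q3": [0, 5, 3, 25, 25],
    "Q4": [3, 9, 13, 25, 8],
}

# Sum each column across the four questions.
column_totals = [sum(column) for column in zip(*RESULTS.values())]
# Sum each row; fewer than 60 means some questions were left unanswered.
row_totals = {question: sum(counts) for question, counts in RESULTS.items()}

grand_total = sum(column_totals)
favourable = column_totals[3] + column_totals[4]  # A plus SA
favourable_percent = round(100 * favourable / grand_total, 1)
```

The column totals match the last row of Figure 8, each row total falls just short of the 60 possible responses, and `favourable_percent` comes to 68.7.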

Conclusions

The current research has demonstrated the ability to marry a language generator and a speech synthesizer to produce oral output close to human speech. In subsequent work we will examine longer and more complex sentences as well as paragraphs, and take into account features such as emotion.

The system described here has a number of applications which we are currently exploring. One is the testing of second language acquisition not only by means of oral and written stimuli, but by the mapping between the two (the traditional dictée). Another is its use as a tool for the teaching of phonetic transcription to linguists, since the error-checking mechanisms already present in VINCI may be applied equally well to phonetic as to orthographic transcriptions.

References

Boersma, Paul (2001). Praat, a system for doing phonetics by computer. Glot International 5:9/10, 341-345.

Brichler-Labaeye, Catherine (1970). Les voyelles françaises. Paris: Editions Klincksieck.

Dutoit, Thierry (1996). An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers. Dordrecht.

Dutoit, Thierry (2002). The MBROLA Project. http://tcts.fpms.ac.be/synthesis/mbrola.html

Lessard, Greg; Levison, Michael (2002). Dynamic Generation of Structured Texts for CALL. Annual Meeting of Association canadienne de linguistique appliquée/Canadian Association of Applied Linguistics, University of Toronto.

Levison, Michael; Lessard, Greg (1996). Using a language generation system for second language learning. Computer Assisted Language Learning, 9:2/3, 181-189.

Levison, Michael; Lessard, Greg; Walker, Derek (2000). A Multi-Level Approach to the Detection of Second Language Learner Errors. Literary and Linguistic Computing, 15:3, 313-322.

Simon, Pela (1967). Les consonnes françaises. Paris: Editions Klincksieck.



FIGURES

    Question: L'homme a mangé une pomme.
    Answer: L'homme l'a-t-il mangée?

Figure 1. An example of generation by VINCI. The question is a declarative sentence; the answer, the corresponding interrogative with a pronoun object.

---

    Il était une fois un roi qui s'appelait Pierre. Il était
    beau, fort, bon, et il avait des cheveux roux. Il avait une
    enfant, la princesse. Elle s'appelait Marie. Elle était belle,
    intelligente, faible, bonne, et elle avait des cheveux roux.

    Le roi interdit à la princesse de fuir le château. Par
    contre elle n'obéit pas. Elle alla dans la forêt où
    elle rencontra un sorcier. Il s'appelait Merloc. Il était
    laid, intelligent, fort, méchant, et il avait des cheveux
    blonds. Il l'enleva.

    Pour la sauver, le roi demanda l'aide d'un prince. Heureusement,
    le prince rencontra un lutin qui lui rendit un pantalon magique.
    Le prince l'utilisa pour tuer le sorcier. Il épousa la
    princesse et ils eurent beaucoup d'enfants.

    Question  : Comment s'appelait la princesse?
    Answer    : Elle s'appelait Marie.
    R_3       : Marie

    Question  : Le roi était-il bon ou méchant?
    Answer    : Il était bon.
    R_3       : Le roi était bon.
    R_4       : Bon

Figure 2. French fairy tale. The tale itself is composed of 15 generated sentences based on a single dramatis personae. The last clusters each involve a question and several alternative expected answers.

---

    Question  : Marie jouera lentement
    Answer    : ma-Ri Zu-Ra lât-mâ

Figure 3. A sentence generated by VINCI showing both orthographic and phonological output.

---

    l 80
    a 80
    d 60
    y 80
    l 80
    t 80
    n 80
    @ 80

Figure 4. An example of MBROLA input for the fragment "l'adulte ne". The full sentence is "l'adulte ne pleurerait jamais davantage" (The adult would never cry any more).

---

    l 80 (50, 80)
    a 80 (50, 103)
    d 60 (50, 80)
    y 80
    l 80
    t 80
    n 80
    @ 80 (50, 126)

Figure 5. The previous MBROLA input along with pitch-pairs.

---

[Image file: Figure6.jpg]

Figure 6. A "Praat picture" for the sentence "Maintenant les Tremblay ne chanteront pas" (Now the Tremblays will not sing).

---

    For each sentence you hear, please indicate one of the five replies:
    strongly disagree (SD), disagree (D), neutral to the statement (N),
    agree (A) and finally strongly agree (SA).

    Q1  The voice clearly pronounced words and was understandable.
    Q2  The rise and fall of the voice sounded natural.
    Q3  The speed at which the sentence was read was reasonable.
    Q4  Overall, this sentence sounded very similar to the way
           a human might have spoken it.

    Please add any comments you may wish to make on the sentence.

Figure 7. The questions put to the judges.

---

           SD     D     N     A    SA
    Q1      1     9     8    21    20
    Q2      1     9    12    18    18
    Q3      0     5     3    25    25
    Q4      3     9    13    25     8

    total   5    32    36    89    71

Figure 8. The results aggregated for all judges.

---