The MARY system is described here from two angles: first, the architecture of the system from a natural language processing point of view; second, the workings of the system from a technical point of view.
Four parts of the TtS system can be distinguished:
The preprocessing or text normalisation stage includes the tokeniser, abbreviation expansion, and numeral expansion. At the same time, a rudimentary internal XML structure is built around the input text, translating any SABLE annotation that may be present in the input text.
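The three normalisation steps can be illustrated with a minimal sketch. It is not the MARY implementation: the abbreviation and numeral tables are hypothetical toy data, the function name `normalise` is an assumption, and no XML structure is built.

```python
import re

# Hypothetical toy tables; a real system would use far larger resources.
ABBREVIATIONS = {"Dr.": "Doktor", "z.B.": "zum Beispiel"}
NUMERALS = {"1": "eins", "2": "zwei", "3": "drei"}  # single digits only


def normalise(text):
    """Tokenise text, expanding known abbreviations and single-digit numerals."""
    tokens = []
    for raw in text.split():
        # Abbreviation expansion: match the whole whitespace-delimited chunk,
        # so that the trailing period is not mistaken for punctuation.
        if raw in ABBREVIATIONS:
            tokens.extend(ABBREVIATIONS[raw].split())
            continue
        # Tokenisation: separate runs of word characters from punctuation marks.
        for piece in re.findall(r"\w+|[^\w\s]", raw):
            # Numeral expansion: replace a known digit by its written form.
            tokens.append(NUMERALS.get(piece, piece))
    return tokens


print(normalise("Dr. Meier hat 3 Hunde."))
```

Checking abbreviations before splitting off punctuation is one simple way to keep the period of "Dr." from being treated as a sentence end.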
The natural language processing is responsible for calculating speech-relevant data from the written input text, viz. phone symbols and intonation labels. In a first NLP step, part-of-speech labelling and shallow parsing (chunking) are performed. Then, each token is looked up in the pronunciation lexicon; unknown tokens are morphologically decomposed and phonemised by grapheme-to-phoneme (letter-to-sound) rules. Independently of the lexicon lookup, symbols for the intonation and phrase structure are assigned by rule, using punctuation, part-of-speech information, and the local syntactic information provided by the chunker. Finally, postlexical phonological rules are applied, modifying the phone symbols and/or the intonation symbols as a function of their context.
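The lookup-with-fallback logic described above can be sketched as follows. This is an illustration only: the lexicon holds two SAMPA entries copied from the example document in this section, the letter-to-sound table `TOY_RULES` is hypothetical, morphological decomposition is left out, and the method label "rules" is an assumption (only "lexicon" appears in the example).

```python
# Two entries copied from the MaryXML example in this section.
LEXICON = {"eine": "'?aI-n@", "echte": "'?EC-t@"}

# Hypothetical toy letter-to-sound table, not the MARY rule set.
TOY_RULES = {"sch": "S", "a": "a", "l": "l"}


def toy_letter_to_sound(word, rules=TOY_RULES):
    """Greedy longest-match rewriting of letter sequences into phone symbols."""
    phones = []
    i = 0
    while i < len(word):
        for length in (3, 2, 1):
            chunk = word[i:i + length]
            if chunk in rules:
                phones.append(rules[chunk])
                i += len(chunk)
                break
        else:
            # No rule matched: pass the letter through unchanged.
            phones.append(word[i])
            i += 1
    return "".join(phones)


def phonemise(token, lexicon=LEXICON, letter_to_sound=toy_letter_to_sound):
    """Return (transcription, method): lexicon lookup first, rules as fallback."""
    key = token.lower()
    if key in lexicon:
        return lexicon[key], "lexicon"
    return letter_to_sound(key), "rules"


print(phonemise("Eine"))
print(phonemise("Schal"))
```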
The NLP analysis is organised in a modular way, containing the following components:
The output of the NLP component is a rich MaryXML structure. (For its syntax, see MaryXML.xsd). Example:
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="de">
  <p>
    <s>
      <phrase>
        <t g2p_method="lexicon" pos="ART" sampa="'?aI-n@" syn_attach="1" syn_phrase="NP">
          Eine
        </t>
        <t accent="L+H*" g2p_method="lexicon" pos="ADJA" sampa="'?EC-t@" syn_attach="0" syn_phrase="NP">
          echte
        </t>
        <t accent="H*" g2p_method="lexicon" pos="NN" sampa="hE-'RaUs-fO6-d6-RUN" syn_attach="0" syn_phrase="NP">
          Herausforderung
        </t>
        <t pos="$." syn_attach="2" syn_phrase="_">
          .
        </t>
        <boundary breakindex="5" tone="L-%"/>
      </phrase>
    </s>
  </p>
</maryxml>
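As an illustration of how such a structure can be consumed, the following sketch reads the tokens and the phrase boundary from the example using Python's standard `xml.etree.ElementTree` module. The function names `read_tokens` and `read_boundaries` are assumptions made for this sketch and are not part of MARY; the XML declaration and the xsi namespace are omitted from the embedded copy for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URI as given in the example document.
NS = {"m": "http://mary.dfki.de/2002/MaryXML"}

# Compact copy of the example document.
MARYXML = """\
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.4" xml:lang="de">
<p><s><phrase>
<t g2p_method="lexicon" pos="ART" sampa="'?aI-n@" syn_attach="1" syn_phrase="NP"> Eine </t>
<t accent="L+H*" g2p_method="lexicon" pos="ADJA" sampa="'?EC-t@" syn_attach="0" syn_phrase="NP"> echte </t>
<t accent="H*" g2p_method="lexicon" pos="NN" sampa="hE-'RaUs-fO6-d6-RUN" syn_attach="0" syn_phrase="NP"> Herausforderung </t>
<t pos="$." syn_attach="2" syn_phrase="_"> . </t>
<boundary breakindex="5" tone="L-%"/>
</phrase></s></p>
</maryxml>
"""


def read_tokens(xml_text):
    """Return one dict per <t> element; missing attributes come back as None."""
    root = ET.fromstring(xml_text)
    return [
        {
            "text": (t.text or "").strip(),
            "pos": t.get("pos"),
            "sampa": t.get("sampa"),
            "accent": t.get("accent"),
        }
        for t in root.iterfind(".//m:t", NS)
    ]


def read_boundaries(xml_text):
    """Return (breakindex, tone) pairs for all <boundary> elements."""
    root = ET.fromstring(xml_text)
    return [(b.get("breakindex"), b.get("tone"))
            for b in root.iterfind(".//m:boundary", NS)]


for token in read_tokens(MARYXML):
    print(token)
print(read_boundaries(MARYXML))
```

Note that the punctuation token carries no `sampa` or `accent` attribute, so a consumer has to handle missing values.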
This rich structure is then translated into an acoustic parameter file by applying a model for duration (the so-called Klatt rules, adapted for German) and a model for intonation (a ToBI-based approach that translates intonation symbols into targets on declination lines, which can then be assigned precise frequency values).
The output is a parameter file of the kind used, in one form or another, by many speech synthesis systems. We use MBROLA as one type of waveform synthesizer, so the parameter output format is the MBROLA input format. Every phone symbol is assigned a duration in milliseconds; some phone symbols are also assigned a (time,frequency) target, where time is given in percent of the phone duration and frequency in Hertz. Example:
_ 10
aI 130 (0,209)
n 62
@ 52 (0,187)
_ 55
E 84 (50,232)
C 71
t 57
@ 52
h 61
E 71 (0,224)
R 63
aU 148 (50,174)
s 86
f 71
O 68
6 31
d 42
6 60
R 60
U 139
N 78 (100,160)
_ 400
#
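To make the meaning of the targets concrete, the following sketch converts the example above into absolute times. It assumes one phone per line and simply skips the final `#` line; the function name `parse_parameters` is hypothetical, and the parser covers only the notation shown in the example, not the full MBROLA input format.

```python
import re

# Matches a "(time,frequency)" target as written in the example.
TARGET = re.compile(r"\((\d+),(\d+)\)")

EXAMPLE = """\
_ 10
aI 130 (0,209)
n 62
@ 52 (0,187)
_ 55
E 84 (50,232)
C 71
t 57
@ 52
h 61
E 71 (0,224)
R 63
aU 148 (50,174)
s 86
f 71
O 68
6 31
d 42
6 60
R 60
U 139
N 78 (100,160)
_ 400
#
"""


def parse_parameters(text):
    """Return (phones, targets).

    phones:  list of (symbol, start_ms, duration_ms)
    targets: list of (absolute_time_ms, frequency_hz)
    """
    phones, targets = [], []
    start = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "#":
            continue  # skip blank lines and the closing "#" line
        fields = line.split()
        symbol, duration = fields[0], int(fields[1])
        phones.append((symbol, start, duration))
        for percent, hertz in TARGET.findall(line):
            # Time is a percentage of this phone's duration.
            targets.append((start + duration * int(percent) / 100, int(hertz)))
        start += duration
    return phones, targets


phones, targets = parse_parameters(EXAMPLE)
print(sum(d for _, _, d in phones), "ms in total")
for time_ms, hz in targets:
    print(f"{time_ms:7.1f} ms  {hz} Hz")
```

For instance, the target `(50,232)` on the first `E` (which starts at 309 ms and lasts 84 ms) falls at 351 ms.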
The system is composed of a main server or "manager" program, a number of modules doing the actual processing, and a client sending input data and receiving processing results. The system, implemented in Java, is