public class AlignerTrainer extends Object
The basic idea is to perform a Levenshtein search for the cheapest path and read off an alignment from that. The costs used in the distance computation are not uniform but estimated in an iterative process, according to -log of the relative frequencies of the respective operations in the previous iteration. Perform several iterations (e.g. 4) of aligning in order to get stable estimates of the costs (and a good alignment in turn).
The algorithm, in its essence, is implemented after a description of Levenshtein distance as it can be found in Wikipedia (see below); consider the costs used in the pseudo-code:
d[i, j] := minimum ( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + 1 // substitution )In our implementation there are only two operations, corresponding to deletion and insertion. So, if you look at the matrices in the wiki article, you can only go down and to the right, but not diagonal. Second, the costs are not 1 but set as explained in the following (note that this is a heuristic that seems to work fine but not a derived EM-algorithm).
"insertion" menas in our case, to insert something for (dependent on) the current input symbol. The cost for this operation is lower if the two symbols were already aligned in the preceding iteration, they are set to -log P(output-symbol|"insertion",input-symbol).
"deletion" means in our case to go to the next input symbol. If a deletion operation is performed without an preceding insertion operation (i.e. two subsequent deletion operations) this is called a "skip" and will produce costs, going to the next symbol after an insertion is free (this is to avoid unaligned input symbols). The skip costs are estimated from the preceding iteration and set to -log P(skip|"deletion").
In addition, I made the following optimization, described in Wikipedia:
We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.therefore the three arrays for all information and the swapping statements in the align method. (note that what are rows in Wikipedia are columns here)
Modifier and Type | Field and Description |
---|---|
protected Set<String> |
graphemeSet |
protected List<String[]> |
inSplit |
protected org.apache.log4j.Logger |
logger |
protected List<String> |
optInfo |
protected List<String[]> |
outSplit |
Constructor and Description |
---|
AlignerTrainer()
New AlignerTrainer for pairs of different symbol sets with no optional info.
|
AlignerTrainer(boolean inIsOutAlphabet,
boolean hasOptInfo) |
Modifier and Type | Method and Description |
---|---|
void |
addAlreadySplit(List<String> inStr,
List<String> outStr) |
void |
addAlreadySplit(List<String> inStr,
List<String> outStr,
String optionalInfo) |
void |
addAlreadySplit(String[] inStr,
String[] outStr) |
void |
addAlreadySplit(String[] inStr,
String[] outStr,
String optionalInfo) |
int[] |
align(String[] istr,
String[] ostr)
This computes the alignment that has the lowest distance between two Strings.
|
void |
alignIteration()
One iteration of alignment, using adapted Levenshtein distance.
|
StringPair[] |
getAlignment(int entryNr)
gets an alignment of the graphemes to the phones of an entry.
|
String[] |
getAlignmentString(int entryNr) |
StringPair[] |
getInfoAlignment(int entryNr)
gets an alignment of the graphemes to the phones of an entry.
|
Set<String> |
getInputSyms() |
int |
lexiconSize() |
void |
readLexicon(BufferedReader lexicon,
String splitSym)
This reads a lexicon where input and output strings are separated by a delimiter that can be specified (splitSym).
|
void |
splitAndAdd(String inStr,
String outStr)
This adds the input and output string in the most simple way: symbols are simply the characters of the strings - no
phonemisation/syllabification or whatsoever is performed.
|
public AlignerTrainer(boolean inIsOutAlphabet, boolean hasOptInfo)
inIsOutAlphabet
- boolean indicating as input and output strings should be considered as belonging to the same symbol sets
(alignment between identical symbol is then cost-free)hasOptInfo
- has opt infopublic AlignerTrainer()
public void readLexicon(BufferedReader lexicon, String splitSym) throws IOException
lexicon
- reader for lexiconsplitSym
- symbol to split columns of lexiconIOException
- IOExceptionpublic void splitAndAdd(String inStr, String outStr)
inStr
- inStroutStr
- outStrpublic void addAlreadySplit(List<String> inStr, List<String> outStr, String optionalInfo)
public void alignIteration()
public int lexiconSize()
public StringPair[] getAlignment(int entryNr)
entryNr
- nr of the lexicon entrypublic String[] getAlignmentString(int entryNr)
public StringPair[] getInfoAlignment(int entryNr)
entryNr
- nr of the lexicon entrypublic int[] align(String[] istr, String[] ostr)
istr
- the input stringostr
- the output stringCopyright © 2000–2016 DFKI GmbH. All rights reserved.