Transcription

Bayesian Learning of Non-compositional Phrases with Synchronous Parsing

Hao Zhang, Computer Science Department, University of Rochester, Rochester, NY, [email protected]
Chris Quirk, Microsoft Research, One Microsoft Way, Redmond, WA 98052, [email protected]
Robert C. Moore, Microsoft Research, One Microsoft Way, Redmond, WA 98052, [email protected]
Daniel Gildea, Computer Science Department, University of Rochester, Rochester, NY, [email protected]

Proceedings of ACL-08: HLT, pages 97–105, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

Abstract

We combine the strengths of Bayesian modeling and synchronous grammar in unsupervised learning of basic translation phrase pairs. The structured space of a synchronous grammar is a natural fit for phrase pair probability estimation, though the search space can be prohibitively large. Therefore we explore efficient algorithms for pruning this space that lead to empirically effective results. Incorporating a sparse prior using Variational Bayes biases the models toward generalizable, parsimonious parameter sets, leading to significant improvements in word alignment. This preference for sparse solutions together with effective pruning methods forms a phrase alignment regimen that produces better end-to-end translations than standard word alignment approaches.

1 Introduction

Most state-of-the-art statistical machine translation systems are based on large phrase tables extracted from parallel text using word-level alignments. These word-level alignments are most often obtained using Expectation Maximization on the conditional generative models of Brown et al. (1993) and Vogel et al. (1996). As these word-level alignment models restrict the word alignment complexity by requiring each target word to align to zero or one source words, results are improved by aligning both source-to-target as well as target-to-source, then heuristically combining these alignments. Finally, the set of phrases consistent with the word alignments is extracted from every sentence pair; these form the basis of the decoding process. While this approach has been very successful, poor word-level alignments are nonetheless a common source of error in machine translation systems.

A natural solution to several of these issues is to unite the word-level and phrase-level models into one learning procedure. Ideally, such a procedure would remedy the deficiencies of word-level alignment models, including the strong restrictions on the form of the alignment and the strong independence assumption between words. Furthermore, it would obviate the need for heuristic combination of word alignments. A unified procedure may also improve the identification of non-compositional phrasal translations, and the attachment decisions for unaligned words.

In this direction, Expectation Maximization at the phrase level was proposed by Marcu and Wong (2002), who, however, experienced two major difficulties: computational complexity and controlling overfitting. Computational complexity arises from the exponentially large number of decompositions of a sentence pair into phrase pairs; overfitting is a problem because as EM attempts to maximize the likelihood of its training data, it prefers to directly explain a sentence pair with a single phrase pair.

In this paper, we attempt to address these two issues in order to apply EM above the word level.

We attack computational complexity by adopting the polynomial-time Inversion Transduction Grammar framework, and by only learning small non-compositional phrases. We address the tendency of EM to overfit by using Bayesian methods, where sparse priors assign greater mass to parameter vectors with fewer non-zero values, therefore favoring shorter, more frequent phrases. We test our model by extracting longer phrases from our model's alignments using traditional phrase extraction, and find that a phrase table based on our system improves MT results over a phrase table extracted from traditional word-level alignments.

2 Phrasal Inversion Transduction Grammar

We use a phrasal extension of Inversion Transduction Grammar (Wu, 1997) as the generative framework. Our ITG has two nonterminals: X and C, where X represents compositional phrase pairs that can have recursive structures and C is the pre-terminal over terminal phrase pairs. There are three rules with X on the left-hand side:

X → [X X],
X → ⟨X X⟩,
X → C.

The first two rules are the straight rule and inverted rule respectively. They split the left-hand side constituent, which represents a phrase pair, into two smaller phrase pairs on the right-hand side and order them according to one of the two possible permutations. The rewriting process continues until the third rule is invoked. C is our unique pre-terminal for generating terminal multi-word pairs:

C → e/f.

We parameterize our probabilistic model in the manner of a PCFG: we associate a multinomial distribution with each nonterminal, where each outcome in this distribution corresponds to an expansion of that nonterminal. Specifically, we place one multinomial distribution θX over the three expansions of the nonterminal X, and another multinomial distribution θC over the expansions of C. Thus, the parameters in our model can be listed as

θX = (P⟨⟩, P[], PC),

where P⟨⟩ is for the inverted rule, P[] for the straight rule, PC for the third rule, satisfying P⟨⟩ + P[] + PC = 1, and

θC = (P(e/f), P(e′/f′), ...),

where Σ_{e/f} P(e/f) = 1 is a multinomial distribution over phrase pairs.
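To make this parameterization concrete, the following toy sketch (ours, not from the paper; the phrase pairs and probability values are invented) stores θX and θC as dictionaries and scores one ITG derivation as the product of the rule probabilities used at its nodes:

    # Toy parameters: theta_X over the three expansions of X, theta_C over phrase pairs.
    theta_X = {"straight": 0.4, "inverted": 0.3, "terminal": 0.3}    # P_[], P_<>, P_C
    theta_C = {("le chat", "the cat"): 0.6, ("noir", "black"): 0.4}  # P(e/f), sums to 1

    def derivation_prob(node):
        """Probability of an ITG derivation tree: product of the rule probabilities.
        A node is ("straight" | "inverted", left, right) or ("terminal", (f, e))."""
        label = node[0]
        if label == "terminal":
            return theta_X["terminal"] * theta_C[node[1]]   # X -> C, then C -> e/f
        return theta_X[label] * derivation_prob(node[1]) * derivation_prob(node[2])

    # e.g. an inverted node over two terminal phrase pairs:
    tree = ("inverted", ("terminal", ("le chat", "the cat")), ("terminal", ("noir", "black")))
    print(derivation_prob(tree))   # 0.3 * (0.3 * 0.6) * (0.3 * 0.4) = 0.00648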
This is our model in a nutshell. We can train this model using a two-dimensional extension of the inside-outside algorithm on bilingual data, assuming every phrase pair that can appear as a leaf in a parse tree of the grammar is a valid candidate. However, it is easy to show that maximum likelihood training will lead to the saturated solution where PC = 1: each sentence pair is generated by a single phrase spanning the whole sentence. From the computational point of view, the full EM algorithm runs in O(n^6), where n is the average length of the two input sentences, which is too slow in practice.

The key is to control the number of parameters, and therefore the size of the set of candidate phrases. We deal with this problem in two directions. First we change the objective function by incorporating a prior over the phrasal parameters. This has the effect of preferring parameter vectors in θC with fewer non-zero values. Our second approach is to constrain the search space using simpler alignment models, which has the further benefit of significantly speeding up training. First we train a lower-level word alignment model, then we place hard constraints on the phrasal alignment space using confident word links from this simpler model. Combining the two approaches, we have a staged training procedure going from the simplest unconstrained word-based model to a constrained Bayesian word-level ITG model, and finally proceeding to a constrained Bayesian phrasal model.

3 Variational Bayes for ITG

Goldwater and Griffiths (2007) and Johnson (2007) show that modifying an HMM to include a sparse prior over its parameters and using Bayesian estimation leads to improved accuracy for unsupervised part-of-speech tagging. In this section, we describe a Bayesian estimator for ITG: we select parameters that optimize the probability of the data given a prior. The traditional estimation method for word alignment models is the EM algorithm (Brown et al., 1993), which iteratively updates parameters to maximize the likelihood of the data. The drawback of maximum likelihood is obvious for phrase-based models: if we do not put any constraint on the distribution of phrases, EM overfits the data by memorizing every sentence pair. A sparse prior over a multinomial distribution such as the distribution of phrase pairs may bias the estimator toward skewed distributions that generalize better. In the context of phrasal models, this means learning the more representative phrases in the space of all possible phrases.

The Dirichlet distribution, which is parameterized by a vector of real values often interpreted as pseudo-counts, is a natural choice for the prior, for two main reasons. First, the Dirichlet is conjugate to the multinomial distribution, meaning that if we select a Dirichlet prior and a multinomial likelihood function, the posterior distribution will again be a Dirichlet. This makes parameter estimation quite simple. Second, Dirichlet distributions with small, non-zero parameters place more probability mass on multinomials on the edges or faces of the probability simplex, that is, on distributions with fewer non-zero parameters. Starting from the model of Section 2, we propose the following Bayesian extension, where A ∼ Dir(B) means the random variable A is distributed according to a Dirichlet with parameter B:

θX | αX ∼ Dir(αX),
θC | αC ∼ Dir(αC),
([X X], ⟨X X⟩, C) | θX ∼ Multi(θX),
(e/f) | θC ∼ Multi(θC).

The parameters αX and αC control the sparsity of the two distributions in our model. One is the distribution of the three possible branching choices. The other is the distribution of the phrase pairs. αC is crucial, since the multinomial it is controlling has a high dimension. By adjusting αC to a very small number, we hope to place more posterior mass on parsimonious solutions with fewer but more confident and general phrase pairs.

Having defined the Bayesian model, it remains to decide the inference procedure. We chose Variational Bayes, for its procedural similarity to EM and ease of implementation. Another potential option would be Gibbs sampling (or some other sampling technique). However, in experiments in unsupervised POS tag learning using HMM structured models, Johnson (2007) shows that VB is more effective than Gibbs sampling in approaching distributions that agree with Zipf's law, which is prominent in natural languages.

Kurihara and Sato (2006) describe VB for PCFGs, showing that the only change needed is in the M step of the EM algorithm. As in the case of maximum likelihood estimation, Bayesian estimation for ITGs is very similar to that for PCFGs, which follows from the strong isomorphism between the two models. Specific to our ITG case, the M step becomes:

P̃^(l+1)_[] = exp(ψ(E(X → [X X]) + αX)) / exp(ψ(E(X) + s·αX)),
P̃^(l+1)_⟨⟩ = exp(ψ(E(X → ⟨X X⟩) + αX)) / exp(ψ(E(X) + s·αX)),
P̃^(l+1)_C = exp(ψ(E(X → C) + αX)) / exp(ψ(E(X) + s·αX)),
P̃^(l+1)(e/f) = exp(ψ(E(e/f) + αC)) / exp(ψ(E(C) + m·αC)),

where ψ is the digamma function (Beal, 2003), s = 3 is the number of right-hand sides for X, and m is the number of observed phrase pairs in the data.
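As an illustration of this update (our sketch, not the authors' implementation), the function below applies the digamma-based M step to a dictionary of expected counts for one multinomial, next to the ordinary EM update; it assumes SciPy is available for the digamma function ψ, and that the counts come from the inside-outside E step:

    from math import exp
    from scipy.special import digamma

    def vb_m_step(expected_counts, alpha):
        """VB M step for one multinomial with a symmetric Dirichlet prior:
        P~(k) = exp(psi(E(k) + alpha)) / exp(psi(sum_k E(k) + K * alpha))."""
        total = sum(expected_counts.values())
        K = len(expected_counts)            # K = s = 3 for theta_X, K = m for theta_C
        denom = exp(digamma(total + K * alpha))
        return {k: exp(digamma(c + alpha)) / denom for k, c in expected_counts.items()}

    def em_m_step(expected_counts):
        """Ordinary EM M step (relative frequencies), shown for comparison."""
        total = sum(expected_counts.values())
        return {k: c / total for k, c in expected_counts.items()}

With a very small α, outcomes with tiny fractional counts receive weights close to zero, which is the sparsifying effect discussed next.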
The sole difference between EM and VB with a sparse prior α is that the raw fractional counts c are replaced by exp(ψ(c + α)), an operation that resembles smoothing. As pointed out by Johnson (2007), in effect this expression adds to c a small value that asymptotically approaches α − 0.5 as c approaches ∞, and 0 as c approaches 0. For small values of α the net effect is the opposite of typical smoothing, since it tends to redistribute probability mass away from unlikely events onto more likely ones.

4 Bitext Pruning Strategy

ITG is slow mainly because it considers every pair of spans in two sentences as a possible chart element. In reality, the set of useful chart elements is much smaller than the possible O(n^4), where n is the average sentence length. Pruning the span pairs (bitext cells) that can participate in a tree (either as terminals or non-terminals) serves not only to speed up ITG parsing, but also to provide a kind of initialization hint to the training procedure, encouraging it to focus on promising regions of the alignment space.

Given a bitext cell defined by the four boundary indices (i, j, l, m), as shown in Figure 1a, we prune based on a figure of merit V(i, j, l, m) approximating the utility of that cell in a full ITG parse. The figure of merit considers the Model 1 scores of not only the words inside a given cell, but also all the words not included in the source and target spans, as in Moore (2003) and Vogel (2005). Like Zhang and Gildea (2005), it is used to prune bitext cells rather than score phrases. The total score is the product of the Model 1 probabilities for each column; "inside" columns in the range [l, m] are scored according to the sum (or maximum) of Model 1 probabilities for [i, j], and "outside" columns use the sum (or maximum) of all probabilities not in the range [i, j].

Our pruning differs from Zhang and Gildea (2005) in two major ways. First, we perform pruning using both directions of the IBM Model 1 scores; instead of a single figure of merit V, we have two: VF and VB. Only those spans that pass the pruning threshold in both directions are kept. Second, we allow whole spans to be pruned. The figure of merit for a span is VF(i, j) = max_{l,m} VF(i, j, l, m). Only spans that are within some threshold of the unrestricted Model 1 scores VF and VB are kept:

VF(i, j) / VF ≥ τs and VB(l, m) / VB ≥ τs.

Amongst those spans retained by this first threshold, we keep only those bitext cells satisfying both

VF(i, j, l, m) / VF(i, j) ≥ τb and VB(i, j, l, m) / VB(l, m) ≥ τb.
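A small sketch of how these two thresholds might be applied, assuming the forward and backward figures of merit have already been computed; the dictionary-based representation and names are illustrative, not the paper's implementation, and the default thresholds mirror the values used later in the experiments (τs = 10^-6, τb = 10^-3):

    def prune_bitext_cells(VF, VB, VF_total, VB_total, tau_s=1e-6, tau_b=1e-3):
        """Keep the bitext cells that pass both the span and the cell thresholds.

        VF[(i, j, l, m)], VB[(i, j, l, m)]: forward/backward Model 1 figures of merit.
        VF_total, VB_total: the unrestricted Model 1 scores of the whole sentence pair.
        """
        span_F, span_B = {}, {}
        for (i, j, l, m), v in VF.items():                 # V_F(i, j) = max over (l, m)
            span_F[(i, j)] = max(span_F.get((i, j), 0.0), v)
        for (i, j, l, m), v in VB.items():                 # V_B(l, m) = max over (i, j)
            span_B[(l, m)] = max(span_B.get((l, m), 0.0), v)

        kept = set()
        for (i, j, l, m), vf in VF.items():
            # first threshold: both spans must be close to the unrestricted scores
            if span_F[(i, j)] / VF_total < tau_s or span_B[(l, m)] / VB_total < tau_s:
                continue
            # second threshold: the cell must be close to the best cell in its spans
            if vf / span_F[(i, j)] >= tau_b and VB[(i, j, l, m)] / span_B[(l, m)] >= tau_b:
                kept.add((i, j, l, m))
        return kept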

4.1 Fast Tic-tac-toe Pruning

The tic-tac-toe pruning algorithm (Zhang and Gildea, 2005) uses dynamic programming to compute the product of inside and outside scores for all cells in O(n^4) time. However, even this can be slow for large values of n. Therefore we describe an improved algorithm with best case n^3 performance. Although the worst case performance is also O(n^4), in practice it is significantly faster.

Figure 1: (a) shows the original tic-tac-toe score for a bitext cell (i, j, l, m). (b) demonstrates the finite state representation using the machine in (c), assuming a fixed source span (i, j).

To begin, let us restrict our attention to the forward direction for a fixed source span (i, j). Pruning bitext spans and cells requires VF(i, j), the score of the best bitext cell within a given span, as well as all cells within a given threshold of that best score. For a fixed i and j, we need to search over the starting and ending points l and m of the inside region. Note that there is an isomorphism between the set of spans and a simple finite state machine: any span (l, m) can be represented by a sequence of l OUTSIDE columns, followed by m − l + 1 INSIDE columns, followed by n − m + 1 OUTSIDE columns. This simple machine has the restricted form described in Figure 1c: it has three states, L, M, and R; each transition generates either an OUTSIDE column O or an INSIDE column I. The cost of generating an OUTSIDE column at position a is O(a) = P(t_a | NULL) + Σ_{b ∉ [i,j]} P(t_a | s_b); likewise the cost of generating an INSIDE column is I(a) = P(t_a | NULL) + Σ_{b ∈ [i,j]} P(t_a | s_b), with O(0) = O(n+1) = 1 and I(0) = I(n+1) = 0.

Directly computing O and I would take time O(n^2) for each source span, leading to an overall runtime of O(n^4). Luckily there are faster ways to find the inside and outside scores. First we can precompute the following arrays in O(n^2) time and space:

pre[0, l] := P(t_l | NULL),
pre[i, l] := pre[i−1, l] + P(t_l | s_i),
suf[n+1, l] := 0,
suf[i, l] := suf[i+1, l] + P(t_l | s_i).

Then for any (i, j), O(a) = P(t_a | NULL) + Σ_{b ∉ [i,j]} P(t_a | s_b) = pre[i−1, a] + suf[j+1, a]. I(a) can be incrementally updated as the source span varies: when i = j, I(a) = P(t_a | NULL) + P(t_a | s_i). As j is incremented, we add P(t_a | s_j) to I(a). Thus we have linear time updates for O and I.

We can then find the best scoring sequence using the familiar Viterbi algorithm. Let δ[a, σ] be the cost of the best scoring sequence ending in state σ at time a:

δ[0, σ] := 1 if σ = L, 0 otherwise,
δ[a, L] := δ[a−1, L] · O(a),
δ[a, M] := max_{σ ∈ {L,M}} δ[a−1, σ] · I(a),
δ[a, R] := max_{σ ∈ {M,R}} δ[a−1, σ] · O(a).

Then VF(i, j) = δ[n+1, R], using the isomorphism between state sequences and spans. This linear time algorithm allows us to compute span pruning in O(n^3) time. The same algorithm may be performed using the backward figure of merit after transposing rows and columns.

Having cast the problem in terms of finite state automata, we can use finite state algorithms for pruning. For instance, fixing a source span we can enumerate the target spans in decreasing order by score (Soong and Huang, 1991), stopping once we encounter the first span below threshold. In practice the overhead of maintaining the priority queue outweighs any benefit, as seen in Figure 2.

Figure 2: Speed comparison of the O(n^4) tic-tac-toe pruning algorithm, the A* top-x algorithm, and the fast tic-tac-toe pruning. All produce the same set of bitext cells, those within threshold of the best bitext cell.

An alternate approach that avoids this overhead is to enumerate spans by position. Note that δ[m, R] · ∏_{a=m+1}^{n} O(a) is within threshold iff there is a span with right boundary m′ ≤ m within threshold. Furthermore, if δ[m, M] · ∏_{a=m+1}^{n} O(a) is within threshold, then m is the right boundary of a span within threshold. Using these facts, we can gradually sweep the right boundary m from n toward 1 until the first condition fails to hold. For each value where the second condition holds, we pause to search for the set of left boundaries within threshold.

Likewise for the left edge, δ[l, M] · ∏_{a=l+1}^{m} I(a) · ∏_{a=m+1}^{n} O(a) is within threshold iff there is some l′ ≤ l identifying a span (l′, m) within threshold. Finally, if V(i, j, l, m) = δ[l−1, L] · ∏_{a=l}^{m} I(a) · ∏_{a=m+1}^{n} O(a) is within threshold, then (i, j, l, m) is a bitext cell within threshold. For right edges that are known to be within threshold, we can sweep the left edges leftward until the first condition no longer holds, keeping only those spans for which the second condition holds.

The filtering algorithm behaves extremely well. Although the worst case runtime is still O(n^4), the best case has improved to n^3; empirically it seems to significantly reduce the amount of time spent exploring spans. Figure 2 compares the speed of the fast tic-tac-toe algorithm against the algorithm in Zhang and Gildea (2005).
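The sketch below (ours, not the authors' code) implements the core of this forward computation for one fixed source span: the pre/suf arrays give O(a) in constant time, and a three-state Viterbi pass over L, M, R returns VF(i, j). It assumes Model 1 probabilities are supplied as 1-based nested lists; a full implementation would share the pre/suf tables across spans, update I(a) incrementally as j grows, and also recover the within-threshold cells as described above:

    def forward_span_score(p_null, p_cond, i, j):
        """V_F(i, j) for a fixed source span (i, j); all indices are 1-based.

        p_null[a]    ~ P(t_a | NULL) for target positions a = 1..n
        p_cond[a][b] ~ P(t_a | s_b)  for source positions b = 1..n_src
        """
        n = len(p_null) - 1
        n_src = len(p_cond[1]) - 1

        # pre[b][a] = P(t_a|NULL) + sum_{b' <= b} P(t_a|s_b'); suf[b][a] = sum_{b' >= b} P(t_a|s_b')
        pre = [[0.0] * (n + 1) for _ in range(n_src + 2)]
        suf = [[0.0] * (n + 1) for _ in range(n_src + 2)]
        for a in range(1, n + 1):
            pre[0][a] = p_null[a]
            for b in range(1, n_src + 1):
                pre[b][a] = pre[b - 1][a] + p_cond[a][b]
            for b in range(n_src, 0, -1):
                suf[b][a] = suf[b + 1][a] + p_cond[a][b]

        def O(a):   # target column a generated OUTSIDE the source span [i, j]
            return pre[i - 1][a] + suf[j + 1][a]

        def I(a):   # target column a generated INSIDE the source span [i, j]
            return p_null[a] + sum(p_cond[a][b] for b in range(i, j + 1))

        # Viterbi over states L (left outside), M (inside), R (right outside)
        dL, dM, dR = 1.0, 0.0, 0.0
        for a in range(1, n + 1):
            o, ins = O(a), I(a)
            dL, dM, dR = dL * o, max(dL, dM) * ins, max(dM, dR) * o
        # delta[n+1, R] with O(n+1) = 1: the inside span may also end at the last column
        return max(dM, dR)

The backward score VB would be obtained by running the same routine with the roles of source and target transposed.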

Figure 3: Example output from the ITG using non-compositional phrases. (a) is the Viterbi alignment from the word-based ITG. The shaded regions indicate phrasal alignments that are allowed by the non-compositional constraint; all other phrasal alignments will not be considered. (b) is the Viterbi alignment from the phrasal ITG, with the multi-word alignments highlighted.

5 Bootstrapping Phrasal ITG from Word-based ITG

This section introduces a technique that bootstraps candidate phrase pairs for phrase-based ITG from word-based ITG Viterbi alignments. The word-based ITG uses the same expansions for the nonterminal X, but the expansions of C are limited to generate only 1-1, 1-0, and 0-1 alignments:

C → e/f,
C → e/ε,
C → ε/f,

where ε indicates that no word was generated. Broadly speaking, the goal of this section is the same as that of the previous section, namely, to limit the set of phrase pairs that needs to be considered in the training process. The tic-tac-toe pruning relies on IBM Model 1 for scoring a given aligned area. In this part, we use word-based ITG alignments as anchor points in the alignment space to pin down the potential phrases. The scope of iterative phrasal ITG training, therefore, is limited to determining the boundaries of the phrases anchored on the given one-to-one word alignments.

The heuristic method is based on the Non-Compositional Constraint of Cherry and Lin (2007). Cherry and Lin (2007) use GIZA++ intersections, which have high precision, as anchor points in the bitext space to constrain ITG phrases. We use ITG Viterbi alignments instead. The benefit is two-fold. First of all, we do not have to run a GIZA++ aligner. Second, we do not need to worry about non-ITG word alignments, such as the (2, 4, 1, 3) permutation patterns. GIZA++ does not limit the set of permutations allowed during translation, so it can produce permutations that are not reachable using an ITG.

Formally, given a word-based ITG alignment, the bootstrapping algorithm finds all the phrase pairs according to the definition of Och and Ney (2004) and Chiang (2005), with the additional constraint that each phrase pair contains at most one word link. Mathematically, let e(i, j) count the number of word links that are emitted from the substring e_{i..j}, and f(l, m) count the number of word links emitted from the substring f_{l..m}. The non-compositional phrase pairs satisfy

e(i, j) = f(l, m) ≤ 1.

Figure 3(a) shows all possible non-compositional phrases given the Viterbi word alignment of the example sentence pair.
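A brute-force sketch of this bootstrapping step (our illustration; the names are invented, and the length bound is only there to keep the toy enumeration small) that keeps consistent phrase pairs containing at most one word link:

    def non_compositional_phrases(links, src_len, tgt_len, max_len=4):
        """Enumerate phrase pairs (i, j, l, m), 1-based and inclusive, that are
        consistent with the word alignment and satisfy e(i, j) = f(l, m) <= 1.

        links: set of (src_pos, tgt_pos) word links from the word-based ITG
        Viterbi alignment."""
        phrases = []
        for i in range(1, src_len + 1):
            for j in range(i, min(i + max_len, src_len + 1)):
                for l in range(1, tgt_len + 1):
                    for m in range(l, min(l + max_len, tgt_len + 1)):
                        inside = [(s, t) for (s, t) in links if i <= s <= j and l <= t <= m]
                        # consistency: no link may cross the box boundary on either side
                        src_leak = any(i <= s <= j and not (l <= t <= m) for (s, t) in links)
                        tgt_leak = any(l <= t <= m and not (i <= s <= j) for (s, t) in links)
                        if src_leak or tgt_leak:
                            continue
                        if len(inside) <= 1:        # the non-compositional constraint
                            phrases.append((i, j, l, m))
        return phrases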

6 Summary of the Pipeline

We summarize the pipeline of our system, demonstrating the interactions between the three main contributions of this paper: Variational Bayes, tic-tac-toe pruning, and word-to-phrase bootstrapping. We start from sentence-aligned bilingual data and run IBM Model 1 in both directions to obtain two translation tables. Then we use the efficient bidirectional tic-tac-toe pruning to prune the bitext space within each of the sentence pairs; ITG parsing will be carried out on only this sparse set of bitext cells. The first stage of training is word-based ITG, using the standard iterative training procedure, except that VB replaces EM to focus on a sparse prior. After several training iterations, we obtain the Viterbi alignments on the training data according to the final model. Now we transition into the second stage, the phrasal training. Before the training starts, we apply the non-compositional constraints over the pruned bitext space to further constrain the space of phrase pairs. Finally, we run phrasal ITG iterative training using VB for a certain number of iterations. In the end, a Viterbi pass for the phrasal ITG is executed to produce the non-compositional phrasal alignments. From this alignment, phrase pairs are extracted in the usual manner, and a phrase-based translation system is trained.

7 Experiments

The training data was a subset of 175K sentence pairs from the NIST Chinese-English training data, automatically selected to maximize character-level overlap with the source side of the test data. We put a length limit of 35 on both sides, producing a training set of 141K sentence pairs. 500 Chinese-English pairs from this set were manually aligned and used as a gold standard.

7.1 Word Alignment Evaluation

First, using evaluations of alignment quality, we demonstrate the effectiveness of VB over EM, and explore the effect of the prior.

Figure 4 examines the difference between EM and VB with varying sparse priors for the word-based model of ITG on the 500 sentence pairs, both after 10 iterations of training. Using EM, because of overfitting, AER drops first and increases again as the number of iterations varies from 1 to 10. The lowest AER using EM is achieved after the second iteration, which is .40. At iteration 10, AER for EM increases to .42. On the other hand, using VB, AER decreases monotonically over the 10 iterations and stabilizes at iteration 10. When αC is 10^-9, VB gets AER close to .35 at iteration 10.

Figure 4: AER drops as αC approaches zero; a more sparse solution leads to better results.

As we increase the bias toward sparsity, the AER decreases, following a long slow plateau. Although the magnitude of improvement is not large, the trend is encouraging.

These experiments also indicate that a very sparse prior is needed for machine translation tasks. Unlike Johnson (2007), who found optimal performance when α was approximately 10^-4, we observed monotonic increases in performance as α dropped. The dimensionality of this MT problem is significantly larger than that of the sequence problem, though, so it may take a stronger push from the prior to achieve the desired result.

7.2 End-to-end Evaluation

Given an unlimited amount of time, we would tune the prior to maximize end-to-end performance, using an objective function such as BLEU. Unfortunately these experiments are very slow. Since we observed monotonic increases in alignment performance with smaller values of αC, we simply fixed the prior at a very small value (10^-100) for all translation experiments. We do compare VB against EM in terms of final BLEU scores in the translation experiments to ensure that this sparse prior has a significant impact on the output.

We also trained a baseline model with GIZA++ (Och and Ney, 2003), following a regimen of 5 iterations of Model 1, 5 iterations of HMM, and 5 iterations of Model 4. We computed Chinese-to-English and English-to-Chinese word translation tables using five iterations of Model 1. These values were used to perform tic-tac-toe pruning with τb = 1 × 10^-3 and τs = 1 × 10^-6. Over the pruned charts, we ran 10 iterations of word-based ITG using EM or VB. The charts were then pruned further by applying the non-compositional constraint from the Viterbi alignment links of that model. Finally we ran 10 iterations of phrase-based ITG over the residual charts, using EM or VB, and extracted the Viterbi alignments.

For translation, we used the standard phrasal decoding approach, based on a re-implementation of the Pharaoh system (Koehn, 2004). The output of the word alignment systems (GIZA++ or ITG) was fed to a standard phrase extraction procedure that extracted all phrases of length up to 7 and estimated the conditional probabilities of source given target and target given source using relative frequencies. Thus our phrasal ITG learns only the minimal non-compositional phrases; the standard phrase-extraction algorithm learns larger combinations of these minimal units. In addition, the phrases were annotated with lexical weights using the IBM Model 1 tables. The decoder also used a trigram language model trained on the target side of the training data, as well as word count, phrase count, and distortion penalty features. Minimum Error Rate Training (Och, 2003) over BLEU was used to optimize the weights for each of these models over the development test data.

We used the NIST 2002 evaluation datasets for tuning and evaluation; the 10-reference development set was used for minimum error rate training, and the 4-reference test set was used for evaluation. We trained several phrasal translation systems, varying only the word alignment (or phrasal alignment) method.

Table 1 compares the four systems: the GIZA++ baseline, the ITG word-based model, the ITG multi-word model using EM training, and the ITG multi-word model using VB training. ITG-mwm-VB is our best model. We see an improvement of nearly 2 points on the dev set and nearly 1 point of improvement on the test set. We also observe the consistent superiority of VB over EM. The gain is especially large on the test data set, indicating VB is less prone to overfitting.

Table 1: Translation results on Chinese-English, using the subset of training data (141K sentence pairs) that has a length limit of 35 on both sides. (No length limit in translation.) [Dev and test BLEU for GIZA++, ITG-word, ITG-mwm (EM), and ITG-mwm (VB); the numeric scores did not survive transcription.]

8 Conclusion

We have presented an improved and more efficient method of estimating phrase pairs directly. By both changing the objective function to include a bias toward sparser models and improving the pruning techniques and efficiency, we achieve significant gains on test data with practical speed. In addition, these gains were shown without resorting to external models, such as GIZA++. We have shown that VB is both practical and effective for use in MT models. However, our best system does not apply VB to a single probability model, as we found an appreciable benefit from bootstrapping each model from simpler models, much as the IBM word alignment models are usually trained in succession. We find that VB alone is not sufficient to counteract the tendency of EM to prefer analyses with smaller trees using fewer rules and longer phrases. Both the tic-tac-toe pruning and the non-compositional constraint address this problem by reducing the space of possible phrase pairs. On top of these hard constraints, the sparse prior of VB helps make the model less prone to overfitting to infrequent phrase pairs, and thus improves the quality of the phrase pairs the model learns.

Acknowledgments

This work was done while the first author was at Microsoft Research; thanks to Xiaodong He, Mark Johnson, and Kristina Toutanova. The last author was supported by NSF IIS-0546554.

References

Matthew Beal. 2003. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, June.

Colin Cherry and Dekang Lin. 2007. Inversion transduction grammar for joint phrasal translation modeling. In Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation, pages 17–24, Rochester, New York, April. Association for Computational Linguistics.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL, pages 263–270, Ann Arbor, Michigan, USA.

Sharon Goldwater and Tom Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744–751, Prague, Czech Republic, June. Association for Computational Linguistics.

Mark Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305.

Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA), pages 115–124, Washington, USA, September.

Kenichi Kurihara and Taisuke Sato. 2006. Variational Bayesian grammar induction for natural language. In International Colloquium on Grammatical Inference, pages 84–96, Tokyo, Japan.

Dan
