Regularization of multilingual topic models
Keywords: multilingual topic model, probabilistic topic model, parallel corpus, comparable corpus, bilingual dictionary, regularization, cross-language search
A multilingual probabilistic topic model based on additive regularization of topic models (ARTM) is proposed. The model combines a parallel or comparable corpus with a bilingual translation dictionary. Two approaches to incorporating information from the dictionary are discussed: the first uses only the fact that two words are linked as translations, whereas the second learns translation probabilities for each topic. The quality of the proposed model is measured by cross-language search: for each query document in one language, its translation in the other language is to be found. It is shown that combining word translations from a bilingual dictionary with linked documents improves cross-language search compared with models that use only one of these information sources. Learning the word translation probabilities for the bilingual dictionary improves the quality of the model and makes it possible to determine, for each pair of word translations, the context (a set of topics) in which the translation is appropriate.
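The cross-language search used for evaluation can be illustrated with a minimal sketch. It assumes that each document is already represented by its topic distribution and that row i of the query matrix and row i of the candidate matrix are translations of each other. The cosine similarity, the function names, and the toy data are assumptions made for illustration only; the abstract does not specify the similarity measure or the quality metric.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def cross_language_search(theta_query, theta_candidates):
    """For each query document, return the index of the most similar
    candidate document in the other language."""
    return [
        max(range(len(theta_candidates)),
            key=lambda j: cosine(q, theta_candidates[j]))
        for q in theta_query
    ]


def top1_accuracy(theta_query, theta_candidates):
    """Share of queries whose nearest candidate is the true translation,
    assuming aligned row order in both matrices."""
    hits = cross_language_search(theta_query, theta_candidates)
    return sum(1 for i, j in enumerate(hits) if i == j) / len(hits)


# Hypothetical topic distributions for three aligned document pairs.
theta_en = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
theta_ru = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]

print(cross_language_search(theta_en, theta_ru))
print(top1_accuracy(theta_en, theta_ru))
```

Under these assumptions, a model whose topics are better aligned across languages places the true translation closer to the query and so yields a higher top-1 accuracy.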
- T. Hofmann, “Probabilistic Latent Semantic Indexing,” in Proc. 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Berkeley, August 15-19, 1999 (ACM Press, New York, 1999), pp. 50-57.
- D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet Allocation,” J. Mach. Learn. Res. 3, 993-1022 (2003).
- J. Boyd-Graber and D. M. Blei, “Multilingual Topic Models for Unaligned Text,” in Proc. 25th Conf. on Uncertainty in Artificial Intelligence, Montreal, June 18-21, 2009 (AUAI Press, Arlington, 2009), pp. 75-82.
- J. Jagarlamudi, H. Daumé, and R. Udupa, “From Bilingual Dictionaries to Interlingual Document Representations,” in Proc. 49th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, Portland, June 19-24, 2011 (ACL Press, Stroudsburg, 2011), Vol. 2, pp. 147-152.
- X. Ni, J.-T. Sun, J. Hu, and Z. Chen, “Cross Lingual Text Classification by Mining Multilingual Topics from Wikipedia,” in Proc. 4th ACM Int. Conf. on Web Search and Data Mining, Hong Kong, February 9-12, 2011 (ACM Press, New York, 2011), pp. 375-384.
- X. Ni, J.-T. Sun, J. Hu, and Z. Chen, “Mining Multilingual Topics from Wikipedia,” in Proc. 18th ACM Int. Conf. on World Wide Web, Madrid, April 20-24, 2009 (ACM Press, New York, 2009), pp. 1155-1156.
- D. Mimno, H. M. Wallach, J. Naradowsky, et al., “Polylingual Topic Models,” in Proc. 2009 Conf. on Empirical Methods in Natural Language Processing, Singapore, August 6-7, 2009 (ACL Press, Stroudsburg, 2009), Vol. 2, pp. 880-889.
- W. De Smet and M.-F. Moens, “Cross-Language Linking of News Stories on the Web Using Interlingual Topic Modelling,” in Proc. 2nd ACM Workshop on Social Web Search and Mining, Hong Kong, November 2, 2009 (ACM Press, New York, 2009), pp. 57-64.
- D. Zhang, Q. Mei, and C. X. Zhai, “Cross-Lingual Latent Topic Extraction,” in Proc. 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, July 11-16, 2010 (ACL Press, Stroudsburg, 2010), pp. 1128-1137.
- J. Boyd-Graber and P. Resnik, “Holistic Sentiment Analysis across Languages: Multilingual Supervised Latent Dirichlet Allocation,” in Proc. 2010 Conf. on Empirical Methods in Natural Language Processing, Cambridge, Massachusetts, October 9-11, 2010 (ACL Press, Stroudsburg, 2010), pp. 45-55.
- K. V. Vorontsov, “Additive Regularization for Topic Models of Text Collections,” Dokl. Akad. Nauk 456 (3), 268-271 (2014) [Dokl. Math. 89 (3), 301-304 (2014)].
- K. Vorontsov and A. Potapenko, “Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization,” in Analysis of Images, Social Networks and Texts. Communications in Computer and Information Science (Springer, Heidelberg, 2014), Vol. 436, pp. 29-46.
- K. V. Vorontsov and A. A. Potapenko, “Regularization of Probabilistic Topic Models to Improve Interpretability and Determine the Number of Topics,” in Computational Linguistics and Intellectual Technologies (Ross. Gos. Gumanitarn. Univ., Moscow, 2014), Issue 13, pp. 676-687.
- L. Si and R. Jin, “Adjusting Mixture Weights of Gaussian Mixture Model via Regularized Probabilistic Latent Semantic Analysis,” in Lecture Notes in Computer Science (Springer, Heidelberg, 2005), Vol. 3518, pp. 622-631.
- J.-T. Chien and M.-S. Wu, “Adaptive Bayesian Latent Semantic Analysis,” IEEE Trans. Audio, Speech and Lang. Proc. 16 (1), 198-207 (2008).
- Q. Mei, D. Cai, D. Zhang, and C. X. Zhai, “Topic Modeling with Network Regularization,” in Proc. 17th Int. Conf. on World Wide Web, Beijing, April 21-25, 2008 (ACM Press, New York, 2008), pp. 101-110.
- Q. Wang, J. Xu, H. Li, and N. Craswell, “Regularized Latent Semantic Indexing,” in Proc. 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Beijing, July 24-28, 2011 (ACM Press, New York, 2011), pp. 685-694.
- P. Koehn, “Europarl: A Parallel Corpus for Statistical Machine Translation,” in Proc. 10th Machine Translation Summit, Phuket, Thailand, September 12-16, 2005.
http://www.mt-archive.info/MTS-2005-Koehn.pdf/. Cited January 7, 2015.
- G. A. Miller, “WordNet: A Lexical Database for English,” Commun. ACM 38 (11), 39-41 (1995).
- P. Vossen, EuroWordNet: A Multilingual Database with Lexical Semantic Networks (Kluwer, Norwell, 1998).
- R. Navigli and S. P. Ponzetto, “BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network,” Artif. Intell. 193, 217-250 (2012).
- J. Lehmann, R. Isele, M. Jakob, et al., “DBpedia - A Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia,” Semantic Web Journal (2014).
http://www.semantic-web-journal.net/system/files/swj499.pdf . Cited January 7, 2015.
- G. de Melo and G. Weikum, “MENTA: Inducing Multilingual Taxonomies from Wikipedia,” in Proc. 19th ACM Int. Conf. on Information and Knowledge Management, Toronto, October 26-30, 2010 (ACM Press, New York, 2010), pp. 1099-1108.
- A. K. McCallum, “MALLET: A Machine Learning for Language Toolkit,”
http://mallet.cs.umass.edu . Cited January 7, 2015.