Regularization of multilingual topic models (original Russian title: Регуляризация многоязычных тематических моделей)

M.A. Dudarenko

doi:10.26089/NumMet.v16r104

https://doi.org/10.26089/NumMet.v16r104

Keywords: multilingual topic model; probabilistic topic model; parallel corpus; comparable corpus; bilingual dictionary; regularization; cross-language search

Abstract

A multilingual probabilistic topic model based on additive regularization (ARTM) is proposed. The model can combine a parallel or comparable corpus with a bilingual translation dictionary. Two approaches to incorporating information from a bilingual dictionary are discussed: the first takes into account only the fact that two words are translations of each other, whereas the second learns the translation probabilities for each topic. To measure the quality of the proposed multilingual topic model, a cross-language search is performed: for each query document in one language, its translation in the other language is retrieved. It is shown that combining word translations from a bilingual dictionary with linked documents improves cross-language search compared with models that use only one information source. Learning word translation probabilities for the bilingual dictionary improves the quality of the model and makes it possible to determine, for each pair of word translations, the context (a set of topics) in which the translation is appropriate.
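
The cross-language search used for evaluation can be illustrated with a minimal sketch. It assumes that each document is already represented by a topic distribution p(t|d) from a trained multilingual model, that documents are compared by cosine similarity, and that the query and target collections are aligned by index. The similarity measure, the function names, and the toy vectors are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_language_search(theta_query, theta_target):
    """For each query document, return the index of the most similar target document."""
    results = []
    for q in theta_query:
        best = max(range(len(theta_target)), key=lambda j: cosine(q, theta_target[j]))
        results.append(best)
    return results

def top1_accuracy(theta_query, theta_target):
    """Share of queries whose best match is the document with the same index.

    Assumes document i in the query language is the translation of
    document i in the target language (hypothetical aligned test set).
    """
    best = cross_language_search(theta_query, theta_target)
    hits = sum(1 for i, j in enumerate(best) if i == j)
    return hits / len(best)

# Hypothetical topic distributions for three document pairs and three topics.
theta_en = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
theta_ru = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.2, 0.2, 0.6]]

matches = cross_language_search(theta_en, theta_ru)
accuracy = top1_accuracy(theta_en, theta_ru)
print(matches, accuracy)
```

In this toy setting every query retrieves its own translation. On real data the accuracy would depend on how well the model aligns topics across the two languages, which is the property the evaluation is meant to measure.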

Downloads

PDF (in Russian)

Published

2015-01-29

Issue

Vol. 16 (2015): Issue 1.

Section

Section 1. Numerical methods and applications

Author

M.A. Dudarenko

Lomonosov Moscow State University,
Faculty of Computational Mathematics and Cybernetics
PhD student
