Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams




topic models, PLSA (Probabilistic Latent Semantic Analysis), word association measures, bigrams, topic coherence, perplexity


This paper reports an experimental study of adding bigrams to topic models and accounting for the similarity between bigrams and their unigram components. A novel algorithm, PLSA-SIM, is proposed as a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm: it incorporates bigrams and takes into account the similarity between them and their unigram components. Various word association measures are analyzed as a means of selecting top-ranked bigrams for integration into topic models. The target text collections are articles from various Russian electronic banking magazines, the English parts of the parallel corpora Europarl and JRC-Acquis, and the English digital archive of research papers in computational linguistics (ACL Anthology). The computational experiments show that a subgroup of the tested measures yields top-ranked bigrams whose inclusion in the PLSA-SIM algorithm significantly improves the quality of topic models on all collections. A novel unsupervised iterative algorithm, PLSA-ITER, is also proposed for adding the most relevant bigrams; the experiments show that it further improves the quality of topic models compared to the PLSA algorithm.
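
As an illustration of how a word association measure can rank candidate bigrams, the following minimal Python sketch scores adjacent word pairs by pointwise mutual information (PMI). The abstract does not say which measures were tested or which performed best, so the choice of PMI, the function name `rank_bigrams_by_pmi`, and the `min_count` cutoff are assumptions made for this example only; this is not the PLSA-SIM or PLSA-ITER algorithm.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(documents, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    documents: list of token lists (already tokenized and normalized).
    min_count: hypothetical frequency cutoff, since PMI is unreliable
               for very rare pairs.
    Returns a list of ((w1, w2), pmi) sorted by descending PMI.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in documents:
        unigram_counts.update(tokens)
        # Bigrams are adjacent token pairs inside a single document.
        bigram_counts.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        p_xy = count / total_bigrams
        p_x = unigram_counts[w1] / total_unigrams
        p_y = unigram_counts[w2] / total_unigrams
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))

    # Highest PMI first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored
```

In a pipeline of the kind the abstract describes, the top-ranked pairs from such a list would presumably be added to the vocabulary as single tokens before the topic model is trained; how many pairs to keep is a tuning decision that this sketch leaves open.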

Author Biographies

M.A. Nokel

N.V. Loukachevitch





How to Cite

Nokel M.A., Loukachevitch N.V. Topic Models: Adding Bigrams and Taking Account of the Similarity Between Unigrams and Bigrams // Numerical Methods and Programming (Vychislitel’nye Metody i Programmirovanie). 2015. 16. 215-234. doi 10.26089/NumMet.v16r222



Section 1. Numerical methods and applications