Methods of text fragment relevance estimation based on the topic model analysis in the text summarization problem

Authors

  • I.V. Mashechkin
  • M.I. Petrovskiy
  • D.V. Tsarev

Keywords:

text fragment relevance
automatic text summarization
semantic text models
topic models
latent semantic analysis (LSA)
singular value decomposition (SVD)
non-negative matrix factorization (NMF)
probabilistic topic models
probabilistic latent semantic analysis (PLSA)
latent Dirichlet allocation (LDA)

Abstract

State-of-the-art methods for estimating the relevance (significance, importance) of text fragments based on topic model analysis are considered. These methods build summaries in the form of generic extracts, i.e., summaries consisting entirely of fragments copied from the original document. The most popular topic models used in text mining are considered: models based on latent semantic analysis (LSA) and probabilistic topic models such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA). A new method for estimating text fragment relevance is proposed. The proposed sentence relevance estimate is based on normalizing the topic space obtained by non-negative matrix factorization (NMF) and then weighting each topic using the representation of the sentences in that space; NMF serves as the matrix decomposition in the latent semantic analysis model. Experiments with the considered relevance estimation methods show that, for single-document summarization, the LSA-based methods outperform the probabilistic topic models. The comparative experiments use the standard DUC 2001 and DUC 2002 datasets with the standard ROUGE metrics. In addition, the proposed method achieves better summarization quality than the other text summarization methods considered. The work was supported by the Ministry of Education and Science of the Russian Federation (state contract no. 14.514.11.4016) and by the Russian Foundation for Basic Research (project nos. 11-07-00616 and 12-07-00585).
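The abstract does not spell out the exact normalization and weighting scheme, so the following is only a minimal sketch of the general idea it describes: factor a term-by-sentence matrix with Lee-Seung multiplicative NMF updates, normalize the columns of the topic matrix (moving the scale into the sentence loadings), weight each topic by the total sentence loading it receives, and score each sentence as the weighted sum of its topic loadings. The toy matrix, the choice of topic weights, and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0):
    """Lee-Seung multiplicative updates: A ~ W @ H with non-negative factors."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + 1e-3   # terms x topics
    H = rng.random((k, n)) + 1e-3   # topics x sentences
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def sentence_relevance(A, k):
    """Sketch of NMF-based sentence scoring: normalize the topic space,
    weight topics by total sentence loading, score sentences accordingly."""
    W, H = nmf(A, k)
    norms = np.linalg.norm(W, axis=0) + 1e-9
    W, H = W / norms, H * norms[:, None]    # move column scale of W into H
    topic_weight = H.sum(axis=1) / H.sum()  # relative importance of each topic
    return topic_weight @ H                 # one relevance score per sentence

# Toy term-by-sentence count matrix: 5 terms x 4 sentences.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.],
              [1., 0., 1., 2.]])
scores = sentence_relevance(A, k=2)
summary_order = np.argsort(scores)[::-1]    # most relevant sentences first
```

An extract is then formed by taking sentences in `summary_order` until the desired summary length is reached.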


Published

2013-02-20

Issue

Section

Section 1. Numerical methods and applications


References

  1. Mani I., Maybury M. (Eds.) Advances in automatic text summarization. Cambridge: MIT Press, 1999.
  2. Jezek K., Steinberger J. Automatic text summarization (the state of the art 2007 and new challenges) // Proc. of Znalosti-2008. Bratislava, 2008. 1-12.
  3. Barzilay R., Elhadad M. Using lexical chains for text summarization // Proc. of the ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization. Madrid, 1997. 10-17.
  4. Mihalcea R., Tarau P. Textrank: bringing order into texts // Proc. of the Conference on Empirical Methods in Natural Language Processing. Barcelona, 2004. 404-411.
  5. Steinberger J., Jezek K. Text summarization and singular value decomposition // Lecture Notes in Computer Science. Vol. 3261. Heidelberg: Springer, 2004. 245-254.
  6. Bhandari H., Shimbo M., Ito T., Matsumoto Y. Generic text summarization using probabilistic latent semantic indexing // The Third International Joint Conference on Natural Language Processing. Hyderabad, 2008. 133-140.
  7. Arora R., Ravindran B. Latent Dirichlet allocation based multi-document summarization // Proc. of the Second Workshop on Analytics for Noisy Unstructured Text Data. New York: ACM Press, 2008. 91-97.
  8. Blei D. Probabilistic topic models // Communications of the ACM. 2012. 55, N 4. 77-84.
  9. Hofmann T. Probabilistic latent semantic indexing // Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 1999. 50-57.
  10. Arora S., Ge R., Moitra A. Learning topic models - going beyond SVD // Proc. of the 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS-2012). New Brunswick: IEEE Press, 2012. 1-10.
  11. Xu W., Liu X., Gong Y. Document clustering based on non-negative matrix factorization // Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 2003. 267-273.
  12. Berry M., Browne M., Langville A., Pauca V., Plemmons R. Algorithms and applications for approximate nonnegative matrix factorization // Computational Statistics and Data Analysis. 2007. 52, N 1. 155-173.
  13. WordNet: a lexical database for English [Online] (http://wordnet.princeton.edu/).
  14. Russian WordNet [Online] (http://wordnet.ru/).
  15. RussNet: a thesaurus of the Russian language [Online] (http://project.phil.spbu.ru/RussNet/index_ru.shtml).
  16. Mandala R., Takenobu T., Hozumi T. The use of WordNet in information retrieval // Proc. of ACL Workshop on Usage of WordNet in Natural Language Processing Systems. Montreal, 1998. 31-37.
  17. Rakesh P., Shivapratap G., Divya G., Soman K. Evaluation of SVD and NMF methods for latent semantic analysis // International Journal of Recent Trends in Engineering. 2009. 1, N 3. 308-310.
  18. Blei D., Ng A., Jordan M. Latent Dirichlet allocation // Journal of Machine Learning Research. 2003. 3. 993-1022.
  19. Blei D., Lafferty J. Topic models // Text Mining: Classification, Clustering, and Applications. Boca Raton: CRC Press, 2009. 71-94.
  20. Berry M.W., Dumais S.T., O’Brien G.W. Using linear algebra for intelligent information retrieval // SIAM Review. 1995. 37, N 4. 573-595.
  21. Ye Y. Comparing matrix methods in text-based information retrieval. Tech. Rep. School of Mathematical Sciences, Peking University. Beijing, 2000.
  22. Griffiths T.L., Steyvers M. Finding scientific topics // Proc. of the National Academy of Sciences. 2004. 101, N 1. 5228-5235.
  23. Mashechkin I., Petrovskiy M., Popov D., Tsarev D. Automatic text summarization using latent semantic analysis // Programming and Computer Software. 2011. 37, N 6. 299-305.
  24. Tsarev D., Petrovskiy M., Mashechkin I. Using NMF-based text summarization to improve supervised and unsupervised classification // Proc. of the 11th International Conference on Hybrid Intelligent Systems (HIS). Malacca: IEEE Press, 2011. 185-189.
  25. Chisholm E., Kolda T. New term weighting formulas for the vector space method in information retrieval // Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory. Oak Ridge, 1999.
  26. Landauer T., Dumais S. A solution to Plato’s problem: the latent semantic analysis theory of the acquisition, induction and representation of knowledge // Psychological Review. 1997. 104. 211-240.
  27. Gong Y., Liu X. Generic text summarization using relevance measure and latent semantic analysis // Proc. of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 2001. 19-25.
  28. Lee D., Seung H. Learning the parts of objects by non-negative matrix factorization // Nature. 1999. 401. 788-791.
  29. Steinberger J. Text summarization within the LSA framework. PhD Thesis. University of West Bohemia in Pilsen. Pilsen, 2007.
  30. Lee J.-H., Park S., Ahn C.-M., Kim D. Automatic generic document summarization based on non-negative matrix factorization // Information Processing and Management: an International Journal. 2009. 45, N 1. 20-34.
  31. Park S. Personalized summarization agent using non-negative matrix factorization // PRICAI 2008: Trends in Artificial Intelligence. Vol. 5351. Heidelberg: Springer, 2008. 1034-1038.
  32. Park S., Lee J.-H., Kim D.-H., Ahn C.-M. Multi-document summarization using weighted similarity between topic and clustering-based non-negative semantic feature // Advances in Data and Web Management. Vol. 4505. Heidelberg: Springer, 2007. 108-115.
  33. Liu S., Zhou M., Pan S., Qian W., Cai W., Lian X. Interactive, topic-based visual text summarization and analysis // Proc. of the 18th ACM Conference on Information and Knowledge Management. New York: ACM Press, 2009. 543-552.
  34. Daud A., Li J., Zhou L., Muhammad F. Knowledge discovery through directed probabilistic topic models: a survey // Frontiers of Computer Science in China. 2010. 4, N 2. 280-301.
  35. Matlab Topic Modeling Toolbox 1.4 [Online] (http://psiexp.ss.uci.edu/).
  36. Minka T., Lafferty J. Expectation-propagation for the generative aspect model // UAI’02 Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers, 2002. 352-359.
  37. ROUGE: Recall-Oriented Understudy for Gisting Evaluation [Online] (http://www.berouge.com/).
  38. Lin C.-Y. Looking for a few good metrics: automatic summarization evaluation - how many samples are enough? // Proc. of the NTCIR-5 Workshop. Tokyo: National Institute of Informatics, 2004. 1765-1776.
  39. Document Understanding Conferences [Online] (http://duc.nist.gov/).
  40. Lin C.-J. Projected gradient methods for non-negative matrix factorization // Neural Computation. 2007. 19, N 10. 2756-2779.
  41. Ding C., Li T., Peng W. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing // Computational Statistics & Data Analysis. 2008. 52, N 8. 3913-3927.
  42. Mirzal A. Converged algorithms for orthogonal nonnegative matrix factorizations // Computing Research Repository. Vol. 1010. 2010.
  43. Ding C., Li T., Peng W., Park H. Orthogonal nonnegative matrix tri-factorizations for clustering // KDD’06 Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2006. 126-135.
  44. Lin C.-J. On the convergence of multiplicative update algorithms for nonnegative matrix factorization // IEEE Transactions on Neural Networks. 2007. 18, N 6. 1589-1596.
  45. Steyvers M., Griffiths T. Probabilistic topic models // Handbook of Latent Semantic Analysis. Vol. 427. Philadelphia: Psychology Press, 2007. 414-440.
  46. Voevodin V.V., Kuznetsov Yu.A. Matrices and Computations. Moscow: Nauka, 1984.