Improving document ranking based on logs of search engine users

Authors

  • M.S. Ageev

Keywords

search engines
machine learning
log mining

Abstract

Search engine logs provide significant information about user preferences. We propose an algorithm that improves search engine ranking quality by combining log mining with machine learning. Evaluation on large-scale, real-world datasets shows a significant improvement in ranking quality. The proposed algorithm allows parallel processing of large-scale data using the MapReduce framework, and the developed approach is also applicable to a wide range of log mining tasks. This work is supported by the Russian Foundation for Basic Research (project 12-07-31225).
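
The following is a minimal illustrative sketch of the kind of log processing mentioned above: it aggregates click-log statistics per (query, URL) pair in a MapReduce style, simulated locally in Python in the spirit of a Hadoop Streaming job. The log format, field names, and features (show count, click count, CTR, average position) are assumptions made for illustration only; they are not the feature set or implementation used in the paper.

# Illustrative sketch only: MapReduce-style aggregation of click-log statistics
# per (query, URL) pair. The log format and features are assumed for illustration.
from collections import defaultdict

def map_record(line):
    """Map one log line 'query<TAB>url<TAB>position<TAB>clicked' to a keyed record."""
    query, url, position, clicked = line.rstrip("\n").split("\t")
    # Key by (query, url); the value carries one impression, the click flag, and the position.
    yield (query, url), (1, int(clicked), int(position))

def reduce_records(key, values):
    """Aggregate impressions and clicks for one key into simple behavioral features."""
    shows = clicks = pos_sum = 0
    for s, c, p in values:
        shows += s
        clicks += c
        pos_sum += p
    ctr = clicks / shows if shows else 0.0
    avg_position = pos_sum / shows if shows else 0.0
    yield key, {"shows": shows, "clicks": clicks, "ctr": ctr, "avg_position": avg_position}

def run(lines):
    """Simulate the shuffle phase locally: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            grouped[key].append(value)
    for key, values in grouped.items():
        yield from reduce_records(key, values)

if __name__ == "__main__":
    sample_log = [
        "weather moscow\thttp://a.example\t1\t1",
        "weather moscow\thttp://a.example\t1\t0",
        "weather moscow\thttp://b.example\t2\t1",
    ]
    for (query, url), features in run(sample_log):
        print(query, url, features)

In an actual Hadoop deployment the map and reduce functions would run as separate streaming tasks, with the framework performing the shuffle between them; the resulting per-(query, document) statistics could then serve as behavioral features for a learning-to-rank model.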


Published

2012-12-06

Section

Section 1. Numerical methods and applications
