{"id":207,"date":"2016-07-01T11:11:33","date_gmt":"2016-07-01T11:11:33","guid":{"rendered":"http:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/?p=207"},"modified":"2020-05-18T10:24:52","modified_gmt":"2020-05-18T10:24:52","slug":"topic-detection-and-tracking-system","status":"publish","type":"post","link":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/?p=207","title":{"rendered":"Topic detection and tracking system"},"content":{"rendered":"<p>Student(s):&nbsp;Daniel Br\u00fcggemann, Yannik Hermey, Carsten Orth, Darius Schneider, Stefan&nbsp;Selzer;<br \/>\nSupervisor(s): Dr. Gerasimos (Jerry) Spanakis;<br \/>\nSemester: 2015-2016;<\/p>\n<figure><img loading=\"lazy\" decoding=\"async\" style=\"float: left;padding-right: 20px\" src=\"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/wp-content\/uploads\/2016\/07\/MAI5-2016-1.png\" alt=\"\" width=\"900\" height=\"289\"><figcaption>Fig 1. &#8211; Dynamic Topic model with emerging topics on the documents of August&nbsp;and September 1996 from RCV1 Corpus<\/figcaption><\/figure>\n<figure><img loading=\"lazy\" decoding=\"async\" style=\"float: left;padding-right: 20px\" src=\"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/wp-content\/uploads\/2016\/07\/MAI5-2016-2.png\" alt=\"\" width=\"900\" height=\"289\"><figcaption>Fig 2. &#8211; NMF topics over time on the documents of Reuters archive 2015<\/figcaption><\/figure>\n<figure><img loading=\"lazy\" decoding=\"async\" style=\"float: left;padding-right: 20px\" src=\"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/wp-content\/uploads\/2016\/07\/MAI5-2016-3.png\" alt=\"\" width=\"900\" height=\"289\"><figcaption>Fig 3. 
&#8211; Dynamic Topic model with emerging topics on the documents of August&nbsp;and September 1996 from RCV1 Corpus<\/figcaption><\/figure>\n<h4>Problem statement and motivation:<\/h4>\n<p class=\"p1\">The growth of the internet has brought an ever larger and more complex body of text data from emails, news sources, forums, etc. As a consequence, in most cases it is impossible for a single person to keep track of all relevant text data, let alone to detect changes in trends or topics. Every company (and every person) would be interested in harnessing this free and cheap data to develop intelligent algorithms that react to emerging topics as fast as possible and at the same time track existing topics over long time spans. There are many techniques for topic extraction (such as Nonnegative Matrix Factorization (NMF) or Latent Dirichlet Allocation (LDA) [Blei et al., 2003]), but few of them have been extended to handle dynamic data. The goal of this project is to explore LDA (or other techniques) as a means to detect topics as they appear and track them through time. 
The corpus can be the (fully annotated and immediately available) RCV1 Reuters corpus (810,000 documents) and\/or the actual Reuters archive.<\/p>\n<figure><figcaption><\/figcaption><\/figure>\n<h4>Research questions\/hypotheses:<\/h4>\n<ul>\n<li>\n<p class=\"p1\">How can topics that change dynamically over a certain time span be detected, tracked and visualized in a large document collection?<\/p>\n<\/li>\n<\/ul>\n<h4>Main outcomes:<\/h4>\n<ul>\n<li>\n<p class=\"p1\">This report presents two approaches to detect evolving and dynamically changing topics in a large document collection, and visualizes them in the form of a topic river, allowing for easy and direct association between topics, terms and documents.<\/p>\n<\/li>\n<li>\n<p class=\"p1\">The LDA dynamic topic model was researched and applied to the news articles of the six weeks in August and September 1996 of the Reuters corpus RCV1. After applying careful preprocessing, it was possible to identify some of the main events happening at that time. Examples of detected topics with their corresponding main word descriptors are:<\/p>\n<p><strong>Child abuse in Belgium:<\/strong> child, police, woman, death, family, girl, dutroux<br \/>\n<strong>Tropical storm Edouard:<\/strong> storm, hurricane, north, wind, west, mph, mile<br \/>\n<strong>Peace talks in Palestine:<\/strong> israel, peace, israeli, netanyahu, minister, palestinian, arafat<br \/>\n<strong>Kurdish war in Iraq:<\/strong> iraq, iraqi, iran, kurdish, turkey, northern, arbil<\/p>\n<p class=\"p1\">Summarizing the LDA-based approach, the dynamic topic model produces topics that are on a more generalized level, at least when the same number of topics is chosen, similar to the annotated topics. 
A high frequency of topic evolution cannot be observed here.<\/p>\n<\/li>\n<li>\n<p class=\"p1\">For NMF over time, the NMF algorithm was applied to separate time steps of the data, and the resulting topics were then connected using a similarity metric, thus creating a topic river with evolving and emerging topics. Through extensive cleaning of the vocabulary during pre-processing, a fast data processing algorithm was developed that is able to process a year of data, with around 3000 text files per day and 50 topics generated per month, in circa 15 hours. The generated topics can easily be identified by their most relevant terms and associated with events happening in the corresponding time period. Examples of some main topics of 2015 are:<\/p>\n<p>topic #2: &nbsp; games, goals, wingers, play, periods<br \/>\ntopic #3: &nbsp; gmt, federal, banks, diaries, reservers<br \/>\ntopic #15: islamic, goto, pilots, jordan, jordanians<br \/>\ntopic #16: ukraine, russia, russians, sanctions, moscow<br \/>\ntopic #18: euros, greece, ecb, zones, germanic<\/p>\n<\/li>\n<li>\n<p class=\"p1\">The visualization with a stacked graph over time already works well. Performance for large data sets can still be improved: when not everything can be displayed at once, less expensive approximations of the original series can be built to decrease load time. Possible additions are flexible visualizations with user-defined time spans, or statistics for single topics (even if some tend to be very short-lived). Other improvements include more colors for the graph palette if there are too many topics to display at once, and, if needed, smoothing of the graph lines.<\/p>\n<\/li>\n<li>\n<p class=\"p1\">In this case, the Reuters Corpus was used as test data, but the developed systems are dynamic and reusable and can take an arbitrary corpus of text data to extract a topic river. 
Topics so far are mainly identified by their most relevant terms, which already give a sufficient overview of the topic\u2019s content. However, for a more comprehensive and sophisticated description of a topic, it is possible to create story lines or summaries by applying natural language processing techniques to the most relevant documents of a topic.<\/p>\n<\/li>\n<\/ul>\n<h4>References:<\/h4>\n<p class=\"p1\">[Blei and Lafferty, 2006] Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML \u201906, pages 113\u2013120, New York, NY, USA. ACM.<\/p>\n<p class=\"p1\">[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993\u20131022.<\/p>\n<p class=\"p1\">[Cao et al., 2007] Cao, B., Shen, D., Sun, J.-T., Wang, X., Yang, Q., and Chen, Z. (2007). Detect and track latent factors with online nonnegative matrix factorization. IJCAI, pages 2689\u20132694.<\/p>\n<p class=\"p1\">[Lewis et al., 2004] Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361\u2013397.<\/p>\n<p class=\"p1\">[Saha and Sindhwani, 2012] Saha, A. and Sindhwani, V. (2012). Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. Proceedings of the fifth ACM international conference on Web search and data mining, pages 693\u2013702.<\/p>\n<p class=\"p1\">[Tannenbaum et al., 2015] Tannenbaum, M., Fischer, A., and Scholtes, J. C. (2015). 
Dynamic topic detection and tracking using non-negative matrix&nbsp;factorization.<\/p>\n<h4>Downloads:<\/h4>\n<p><a href=\"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/wp-content\/uploads\/2016\/07\/Final-Report-Topic-Detection-and-Tracking-System-Group-5.pdf\" rel=\"\">Final Report<br \/>\n<\/a><a href=\"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/wp-content\/uploads\/2016\/07\/Final-Presentation_-Topic-Detection-and-Tracking-System.pdf\" rel=\"\">Final Presentation<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Student(s):&nbsp;Daniel Br\u00fcggemann, Yannik Hermey, Carsten Orth, Darius Schneider, Stefan&nbsp;Selzer; Supervisor(s): Dr. Gerasimos (Jerry) Spanakis; Semester: 2015-2016; Fig 1. &#8211; Dynamic Topic model with emerging topics on the documents of August&nbsp;and September 1996 from RCV1 Corpus Fig 2. &#8211; NMF topics over time on the documents of Reuters archive 2015 Fig 3. &#8211; Dynamic Topic model [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26,10,16,28],"tags":[47,35,46,45,48],"class_list":["post-207","post","type-post","status-publish","format-standard","hentry","category-data-analysis","category-machine-learning","category-master-ai-semester-project","category-rai","tag-lda","tag-machine-learning","tag-nmf","tag-text-mining","tag-visualization"],"_links":{"self":[{"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=\/wp\/v2\/posts\/207","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/project.dke.maastrichtuniversi
ty.nl\/studentprojects\/index.php?rest_route=\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=207"}],"version-history":[{"count":7,"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=\/wp\/v2\/posts\/207\/revisions"}],"predecessor-version":[{"id":309,"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=\/wp\/v2\/posts\/207\/revisions\/309"}],"wp:attachment":[{"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=207"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=207"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/project.dke.maastrichtuniversity.nl\/studentprojects\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=207"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}