Categories
Data analysis Machine Learning Master AI semester project RAI

Topic detection and tracking system

Student(s): Daniel Brüggemann, Yannik Hermey, Carsten Orth, Darius Schneider, Stefan Selzer;
Supervisor(s): Dr. Gerasimos (Jerry) Spanakis;
Semester: 2015-2016;

Fig 1. – Dynamic Topic model with emerging topics on the documents of August and September 1996 from RCV1 Corpus
Fig 2. – NMF topics over time on the documents of Reuters archive 2015
Fig 3. – Dynamic Topic model with emerging topics on the documents of August and September 1996 from RCV1 Corpus

Problem statement and motivation:

The growth of the internet has come with an increasingly large and complex amount of text data from emails, news sources, forums, etc. As a consequence, it is in most cases impossible for a single person to keep track of all relevant text data, let alone to detect changes in trends or topics. Every company (and every person) would be interested in harnessing this amount of free and cheap data in order to develop intelligent algorithms that are able to react to emerging topics as fast as possible and at the same time track existing topics over long time spans. There are many techniques for topic extraction (like Nonnegative Matrix Factorization (NMF) or Latent Dirichlet Allocation (LDA) [Blei et al., 2003]), but there are not many extensions for handling dynamic data. The goal of this project is to explore LDA (or other techniques) as a technique to detect topics as they appear and track them through time. The corpus can be the (fully annotated and immediately available) RCV1 Reuters corpus (810,000 documents) and/or the actual Reuters archive.

Research questions/hypotheses:

  • How can topics that change dynamically over a certain time span be detected, tracked and visualized in a large document collection?

Main outcomes:

  • This report presents two approaches to detect evolving and dynamically changing topics in a large document collection and visualizes them in the form of a topic river, allowing easy and direct association between topics, terms and documents.

  • The LDA dynamic topic model was researched and applied to the news articles from the 6 weeks in August and September 1996 of the Reuters corpus RCV1. After careful preprocessing, it was possible to identify some of the main events happening at that time. Examples of detected topics with their corresponding main word descriptors are:

    Child abuse in Belgium: child, police, woman, death, family, girl, dutroux
    Tropical storm Edouard: storm, hurricane, north, wind, west, mph, mile
    Peace talks in Palestine: israel, peace, israeli, netanyahu, minister, palestinian, arafat
    Kurdish war in Iraq: iraq, iraqi, iran, kurdish, turkey, northern, arbil

    Summarizing the LDA-based approach, the dynamic topic model produces topics that are on a more generalized level than, but similar to, the annotated topics, at least when the same number of topics is chosen. A high frequency of topic evolution could not be observed here.

  • For NMF over time, the NMF algorithm was applied to separate time steps of the data, which were then connected using a similarity metric, thus creating a topic river with evolving and emerging topics. By extensively cleaning the vocabulary during pre-processing, a fast data processing algorithm was developed that is able to process a year of data, with around 3000 text files per day and 50 topics generated per month, in about 15 hours. The generated topics can easily be identified by their most relevant terms and associated with events happening in the corresponding time period. Examples of some main topics of 2015 are:

    topic #2:   games, goals, wingers, play, periods
    topic #3:   gmt, federal, banks, diaries, reservers
    topic #15: islamic, goto, pilots, jordan, jordanians
    topic #16: ukraine, russia, russians, sanctions, moscow
    topic #18: euros, greece, ecb, zones, germanic
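The linking step between time steps can be sketched with cosine similarity between topic-term vectors; the toy matrices, threshold value, and function name below are illustrative assumptions, not the project's exact metric:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy topic-term matrices for two consecutive time steps
# (rows = topics, columns = vocabulary terms), e.g. the H factors of NMF.
H_prev = rng.random((3, 8))
H_next = rng.random((3, 8))

def cosine_links(H1, H2, threshold=0.5):
    """Match each topic in H2 to its most similar topic in H1."""
    a = H1 / np.linalg.norm(H1, axis=1, keepdims=True)
    b = H2 / np.linalg.norm(H2, axis=1, keepdims=True)
    sim = b @ a.T                       # (topics_next, topics_prev)
    best = sim.argmax(axis=1)
    # A topic with no sufficiently similar predecessor counts as emerging.
    return [(j, int(best[j])) if sim[j, best[j]] >= threshold else (j, None)
            for j in range(H2.shape[0])]

links = cosine_links(H_prev, H_next)
```

Topics whose best match stays above the threshold continue an existing river strand; the rest start a new, emerging strand.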

  • The visualization with a stacked graph over time already works well. What can be improved is the performance for large data sets: when not everything can be displayed at once, cheaper approximations of the original series could be built to decrease load time. Other possible additions are flexible visualizations with user-defined time spans and statistics for single topics (even if some tend to be very short-lived). Further improvements include more colors for the graph palette when there are too many topics to display at once and, if needed, smoothing of the graph lines.
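The suggested approximations and smoothing could, for instance, take the form of block-averaged downsampling and a moving average over each topic's weight series. This is a sketch under assumed daily topic weights; the function names are illustrative:

```python
import numpy as np

def approximate_series(series, max_points=100):
    """Downsample a long series by block-averaging so that at most
    max_points values need to be drawn in the stacked graph."""
    series = np.asarray(series, dtype=float)
    if len(series) <= max_points:
        return series
    step = int(np.ceil(len(series) / max_points))
    pad = (-len(series)) % step                     # pad to a multiple of step
    padded = np.append(series, np.full(pad, series[-1]))
    return padded.reshape(-1, step).mean(axis=1)

def smooth(series, window=5):
    """Simple moving average to smooth the graph lines."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="same")

daily_weights = np.random.default_rng(1).random(365)   # one topic, one year
approx = approximate_series(daily_weights, max_points=52)
```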

  • In this case, the Reuters Corpus was used as test data, but the developed systems are dynamic and reusable and can take an arbitrary corpus of text data to extract a topic river. Topics are so far mainly identified by their most relevant terms, which already gives a sufficient overview of a topic's content. However, for a more comprehensive and sophisticated description of a topic, it is possible to create story lines or summaries by applying natural language processing techniques to the most relevant documents of a topic.

References:

[Blei and Lafferty, 2006] Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 113–120, New York, NY, USA. ACM.

[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

[Cao et al., 2007] Cao, B., Shen, D., Sun, J.-T., Wang, X., Yang, Q., and Chen, Z. (2007). Detect and track latent factors with online nonnegative matrix factorization. IJCAI, pages 2689–2694.

[Lewis et al., 2004] Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5:361–397.

[Saha and Sindhwani, 2012] Saha, A. and Sindhwani, V. (2012). Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. Proceedings of the fifth ACM international conference on Web search and data mining, pages 693–702.

[Tannenbaum et al., 2015] Tannenbaum, M., Fischer, A., and Scholtes, J. C. (2015). Dynamic topic detection and tracking using non-negative matrix factorization.

Downloads:

Final Report
Final Presentation

Categories
BMI Data analysis Master OR semester project

OR@Heart: Torso and heart segmentation for ECG imaging

Student(s): Oskar Person, Yanik Dreiling, Justus Schwann, Ullaskrishnan Poikavila;
Supervisor(s): Dr. Pietro Bonizzi, Dr. Joel Karel, Matthijs Cluitmans;
Semester: 2015-2016;

Fig 1. – CT scans.

Problem statement and motivation:

Cardiovascular diseases (CVD), i.e. irregularities of the heart and related blood vessels, led to nearly 17.5 million deaths in 2011 alone, and this number is increasing at a steady rate. This makes CVD the leading cause of death worldwide. A majority of these deaths could have been prevented by earlier detection of symptoms. Electrocardiographic Imaging (ECGI) is a technique that helps to quickly detect cardiac irregularities and expedite diagnosis. Electrodes are placed on the torso to record cardiac electrical activity, but the skin and body mass between the heart and the torso dampen these signals, leading to an inaccurate visualization of cardiac activity. The inverse problem of ECGI tries to reconstruct the true cardiac electrical activity using the observed signals on the torso electrodes and geometric knowledge of the surfaces of heart and torso. Currently, the reconstruction of the heart surface and the segmentation of the electrodes from the torso surface are done manually. This makes the process both time- and energy-consuming and goes against the whole purpose of ECGI, namely quickening cardiac diagnosis. The aim of this project was to automate the reconstruction of heart and torso surfaces. CT scans are taken of the patient with the electrodes attached, and they represent the input to the automated segmentation and generation of the torso and heart surfaces. The implemented algorithms segment out the electrode strips to help reconstruct the torso surface and detect edges to help visualize the heart surface. A GUI is also provided to help the user run the algorithms. The group achieved successful automated segmentation of the body surface electrode strips, generation of a preliminary torso surface, and preliminary segmentation of the heart surface. Future work will focus on making the torso model more realistic, improving segmentation of the heart surface, and generating the heart model.

Research questions/hypotheses:

The aim of this project was to automate the reconstruction of heart and torso surfaces. CT scans are taken of the patient with the electrodes attached, and they represent the input to the automated segmentation and generation of the torso and heart surfaces.

Main outcomes:

The implemented algorithms segment out the electrode strips to help reconstruct the torso surface and detect edges to help visualize the heart surface. A GUI is also provided to help the user run the algorithms.
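As an illustration of the kind of intensity-based segmentation involved, the sketch below thresholds a synthetic 2D slice into a body mask and an electrode mask. The intensity values and geometry are invented stand-ins, not the project's actual pipeline:

```python
import numpy as np

# Synthetic 2D "CT slice": background, a torso-like disk, and a few
# bright spots standing in for the radio-opaque electrodes.
size = 64
y, x = np.mgrid[:size, :size]
slice_ = np.zeros((size, size))
slice_[(x - 32) ** 2 + (y - 32) ** 2 < 28 ** 2] = 300.0   # soft tissue
for cx in (10, 32, 54):
    slice_[(x - cx) ** 2 + (y - 6) ** 2 < 4] = 3000.0     # electrode markers

# Threshold segmentation: separate the body from the much brighter electrodes.
body_mask = slice_ > 100.0
electrode_mask = slice_ > 1000.0
```

In practice, the thresholds would be chosen from the Hounsfield-unit ranges of tissue and electrode material, followed by connected-component analysis to isolate individual strips.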

Categories
Data analysis Machine Learning Master AI semester project RAI

Comparing Two Techniques for Automatic Food Image Classification

Student(s): Nadine Hermans, Alexander Kroner, Wim Logister, Carsten Orth, Josephine Rutten;
Supervisor(s): Dr. Gerasimos (Jerry) Spanakis;
Semester: 2015-2016;

Fig 1. – Example instances of the original food image data set.

Problem statement and motivation:

The assignment of this project was to classify an existing data set of food images (available at Maastricht University in the context of a research project) into the correct food categories. We compared two techniques on this real data set of user-submitted food images: Convolutional Neural Networks (LeCun and Bengio, 1995) and a conventional classification setup using hand-crafted features with a Support Vector Machine (see e.g. Joutou and Yanai, 2009). Different data preprocessing steps and parameter configurations were applied for the two approaches, and their efficacy was then validated and compared by setting up and running experiments using these different steps and parameters.

Fig 2. – An illustration of how an image is decomposed into color, texture and shape, which are concatenated into a numerical feature vector. This vector serves as input for the SVM.
Fig 3. – A visualization of the learned network weights of the first convolutional layer in the deep convolutional neural network.
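The decomposition in Fig. 2 can be sketched as follows; the color, texture and shape descriptors below are simple toy stand-ins for the project's actual Bag-of-Features, Gabor-filter and histogram features:

```python
import numpy as np

def feature_vector(image):
    """Concatenate simple color, texture and shape descriptors
    into one numeric vector, serving as SVM input."""
    # Color: an 8-bin histogram per RGB channel.
    color = np.concatenate([
        np.histogram(image[..., c], bins=8, range=(0, 256))[0]
        for c in range(3)
    ])
    # Texture: mean gradient magnitude of the gray-scale image.
    gray = image.mean(axis=2)
    gy, gx = np.gradient(gray)
    texture = np.array([np.hypot(gx, gy).mean()])
    # Shape: aspect ratio of the (here, full-frame) image region.
    shape = np.array([image.shape[1] / image.shape[0]])
    return np.concatenate([color, texture, shape]).astype(float)

img = np.random.default_rng(2).integers(0, 256, (32, 48, 3))
vec = feature_vector(img)   # 24 color + 1 texture + 1 shape = 26 values
```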

Research questions/hypotheses:

For this project, we were presented with an existing data set of food images. We then tried to answer the following question:

  • What is the optimal classification approach to discriminate between the 19 different classes of food items that were defined for this data set?

More specifically:

  • Which classification model (this project focused on Support Vector Machines and Convolutional Neural Networks) performs best on this data set?
  • Which methods of preprocessing and feature extraction enhance the results the most?

Main outcomes:

  • After experiments with different pre-processing methods, feature-extraction combinations, kernels and parameters, we concluded that the Support Vector Machine with the chi-squared kernel and parameters C=1 and gamma=0.1 leads to the best results when using all features (Bag of Features, Gabor filter and histogram features) on the GrabCut-preprocessed dataset. GrabCut (Rother et al., 2004) separates the object from the background by coloring the background black.
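The winning configuration can be sketched with scikit-learn's chi-squared kernel and a precomputed-kernel SVM; the random histogram-like features below are toy assumptions standing in for the actual extracted features:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy non-negative histogram features (the chi-squared kernel expects
# non-negative inputs); two classes with slightly shifted distributions.
X_train = np.vstack([rng.random((20, 16)), rng.random((20, 16)) + 0.5])
y_train = np.array([0] * 20 + [1] * 20)

K_train = chi2_kernel(X_train, gamma=0.1)          # precomputed Gram matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

X_test = np.vstack([rng.random((5, 16)), rng.random((5, 16)) + 0.5])
K_test = chi2_kernel(X_test, X_train, gamma=0.1)   # test-vs-train kernel
pred = clf.predict(K_test)
```

Passing the precomputed kernel matrix to `SVC` is what makes the non-standard chi-squared kernel usable with the otherwise unchanged SVM machinery.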

  • A final architecture was decided on after a literature study of Convolutional Neural Networks. With this architecture, the results achieved by the CNN are slightly worse than those of the Support Vector Machine, but the difference is small.

  • In both approaches, problems were encountered because the dataset is extremely unbalanced: most instances end up being classified with the most frequent label, so balancing the data is necessary. Results are better with a reduced, balanced dataset (using fewer than all 19 classes and making sure the number of instances per class is equal), which increases performance while decreasing the number of labels. Moreover, the high variation of image types within a class and the varying quality of the pictures make classification challenging.
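The balancing step can be as simple as undersampling every class to the size of the smallest one; this is a generic sketch, not the exact procedure used in the project:

```python
import numpy as np

def balance(X, y, seed=0):
    """Undersample every class to the size of the smallest one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)          # unbalanced: 7 vs. 3 instances
Xb, yb = balance(X, y)                   # 3 instances per class remain
```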

  • Both approaches have their own advantages and disadvantages. While SVMs are fast to train, simple to implement and robust to over-fitting, features have to be extracted and many parameters have to be tuned. CNNs do not have as many parameters to choose and extract features automatically; however, an architecture has to be decided on, and training may take a while.

References:

Joutou, T. and Yanai, K. (2009). A food image recognition system with multiple kernel learning. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 285-288. IEEE.

LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech,and time series. The handbook of brain theory and neural networks, 3361(10).

Rother, C., Kolmogorov, V., and Blake, A. (2004). Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG),23(3):309-314.