There are two types of data: labeled and unlabeled. In this post I will focus on unlabeled text that resembles human dialog or email communication. There is a lot you can do with such texts; one option is clustering.
You can use already implemented algorithms like KMeans, but you will struggle with the textual representation, because human-generated text is full of misspellings and abbreviations. This makes TF-IDF very sparse and ineffective at catching similarities in the text. For example, “cat” and “kitten” would be as far apart as “cat” and “house”, because TF-IDF does not capture semantics.
To solve this we can use the work of Tomas Mikolov, who created a neural network model that captures the semantics of text in a low-dimensional space (usually 300 dimensions). I want to talk about it in more detail in a future post. We will use FastText because it is character-based, so it copes with the misspellings mentioned above. We will base our approach on this paper and represent each document by the coordinate-wise mean of its word vectors, because it is simple to implement and gives decent results.
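To make the coordinate-wise mean concrete, here is a minimal sketch using gensim's FastText implementation; the toy corpus and hyperparameters are placeholders, not the setup used for the real data:

```python
import numpy as np
from gensim.models import FastText

# Toy corpus standing in for the tokenized emails.
sentences = [
    ["please", "send", "the", "report", "today"],
    ["meeting", "moved", "to", "friday", "morning"],
]

# 300 dimensions, as mentioned above; the other hyperparameters are arbitrary.
model = FastText(sentences, vector_size=300, window=5, min_count=1, epochs=10)

def sentence_embedding(tokens, model):
    """Coordinate-wise mean of the FastText vectors of the tokens.

    Because FastText composes vectors from character n-grams, even
    misspelled or unseen words still get a sensible vector.
    """
    return np.mean([model.wv[t] for t in tokens], axis=0)

# Misspelled tokens still work, thanks to the character n-grams.
print(sentence_embedding(["plese", "send", "reprot"], model).shape)  # (300,)
```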
My data (taken from the Enron dataset) looks like this:
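Loading it might look roughly like the sketch below; the file name and the `text` column are assumptions about how the emails were exported:

```python
import pandas as pd

# Hypothetical export of the Enron emails: one message body per row.
df = pd.read_csv("enron_emails.csv")
texts = df["text"].tolist()
print(df.head())
```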
Here is the part where I take care of the embedding. I have already implemented a version that can use TF-IDF weights to adjust the embedding:
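A sketch of that idea, assuming scikit-learn's TfidfVectorizer and the gensim model from above; the helper name and the exact weighting scheme are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["please send the report today", "meeting moved to friday morning"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # rows: documents, cols: vocabulary

def tfidf_weighted_embedding(doc_index, docs, model, vectorizer, tfidf_matrix):
    """Mean of the FastText vectors of a document's words, where each
    word is weighted by its TF-IDF score so filler words count less."""
    vocab = vectorizer.vocabulary_
    analyze = vectorizer.build_analyzer()  # same tokenization TF-IDF used
    row = tfidf_matrix[doc_index]
    weighted = np.zeros(model.vector_size)
    total = 0.0
    for token in analyze(docs[doc_index]):
        # Fallback weight of 1.0 for tokens TF-IDF dropped (an assumption).
        weight = row[0, vocab[token]] if token in vocab else 1.0
        weighted += weight * model.wv[token]
        total += weight
    return weighted / max(total, 1e-9)
```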
To get the scores for the i-th document, we have to know its index:
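For example, a small helper (the name is hypothetical) that maps the words of the i-th document to their TF-IDF scores:

```python
def tfidf_scores(doc_index, vectorizer, tfidf_matrix):
    """Return {word: tf-idf score} for the non-zero entries of one row."""
    feature_names = vectorizer.get_feature_names_out()
    row = tfidf_matrix[doc_index].tocoo()
    return {feature_names[c]: s for c, s in zip(row.col, row.data)}

# Scores for document 0 of the toy corpus above.
print(tfidf_scores(0, vectorizer, tfidf_matrix))
```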
These are our helper functions. Now I will show you one iteration of clustering: I take the sentences most similar to my “seed” text and label them with some number/tag/whatever. Then I choose another “seed” and do the same again.
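A sketch of one such iteration, building on the helpers above; the similarity threshold is a made-up value you would tune on your own data:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def label_cluster(seed_index, embeddings, labels, tag, threshold=0.8):
    """One clustering iteration: give `tag` to every still-unlabeled
    document whose embedding is close enough to the seed's embedding."""
    seed = embeddings[seed_index]
    for i, emb in enumerate(embeddings):
        if labels[i] is None and cosine_similarity(seed, emb) >= threshold:
            labels[i] = tag
    return labels

embeddings = [tfidf_weighted_embedding(i, docs, model, vectorizer, tfidf_matrix)
              for i in range(len(docs))]
labels = [None] * len(docs)
labels = label_cluster(seed_index=0, embeddings=embeddings, labels=labels, tag=1)
print(labels)  # document 0 and anything similar to it now carry tag 1
```

Repeating this with a new seed drawn from the still-unlabeled documents grows the set of clusters one at a time.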