Topic Modeling Fake News
26 June ’18
I decided to change things up a little bit and take on an unsupervised learning task: topic modeling. For this, I explored an endlessly entertaining dataset, a database of fake news articles compiled by Kaggle. It comprises ~13K articles from 200 different sources, published circa Oct '16 - Nov '16 (a period that coincided with the 2016 US Presidential Election).
First, I set up a pipeline to express each article as a bag-of-words. I included a few preprocessing steps to clean the data (lowercase, remove URLs, remove HTML tags, and detect common phrases). I also used TF-IDF (term frequency - inverse document frequency) to assign higher weights to more "important" words. For the topic model itself, I used non-negative matrix factorization (NMF). Other popular approaches include probabilistic models like LDA.
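To make the preprocessing and weighting steps concrete, here is a minimal sketch using only the Python standard library. It is not the original pipeline: the regular expressions, the function names (`clean`, `tfidf`), the smoothed IDF formula, and the toy documents are all assumptions chosen for illustration. Phrase detection is left out, and in practice a library vectorizer would likely replace the hand-rolled TF-IDF.

```python
import html
import math
import re
from collections import Counter

# Illustrative patterns (assumptions, not the original pipeline's rules).
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def clean(text):
    """Lowercase, strip HTML tags and URLs, and split into word tokens."""
    text = html.unescape(text).lower()
    text = TAG_RE.sub(" ", text)  # remove tags first so link text survives
    text = URL_RE.sub(" ", text)
    return TOKEN_RE.findall(text)


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [clean(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: terms found in fewer documents get higher weights.
        row = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        weighted.append({term: w / norm for term, w in row.items()})
    return weighted


# Toy documents, made up for this example.
docs = [
    "Breaking: <b>Election</b> results at http://example.com/news",
    "Election polls and election rallies",
    "Weather report for the weekend",
]
weights = tfidf(docs)
```

In the first toy document, "breaking" appears in only one document while "election" appears in two, so "breaking" receives the higher weight. The resulting document-term weights are what a topic model such as NMF would then factorize.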
Whether using NMF or LDA, determining the number of topics to model is always a challenge. With too few topics, each topic will be broad and unspecific. With too many, topics will be overly narrow. We can build models with different numbers of topics, but how do we evaluate model performance? Two options:
- Topic Cohesiveness - The extent to which a topic’s top keywords are semantically related in our overall corpus. We can use word embeddings created with Word2Vec to measure the similarity between keywords. Higher = better.
- Topic Generality - The extent to which each topic's top keywords overlap with other topics' top keywords. We can use mean pairwise Jaccard similarity between topics for this. Lower = better.
Source: D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham, “An Analysis of the Coherence of Descriptors in Topic Modeling,” Expert Systems with Applications (ESWA), 2015.
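The two measures above can be sketched in a few lines of Python. This is one plausible reading of them, not the paper's exact definitions or the code I ran: `generality` is the mean pairwise Jaccard similarity between topics' keyword lists, and `cohesiveness` averages the pairwise cosine similarity of each topic's keywords. The `vectors` dictionary is a hypothetical stand-in for Word2Vec embeddings, filled with made-up 2-D values.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two keyword lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def generality(topics):
    """Mean pairwise Jaccard similarity between topics (lower = better)."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cohesiveness(topics, vectors):
    """Mean over topics of mean pairwise keyword similarity (higher = better)."""
    scores = []
    for keywords in topics:
        pairs = list(combinations(keywords, 2))
        sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
        scores.append(sum(sims) / len(sims))
    return sum(scores) / len(scores)


# Toy topics and made-up 2-D "embeddings" (real ones would come from Word2Vec).
topics = [["clinton", "email", "fbi"], ["trump", "campaign", "clinton"]]
vectors = {
    "clinton": [1.0, 0.2],
    "email": [0.9, 0.1],
    "fbi": [0.8, 0.3],
    "trump": [0.1, 1.0],
    "campaign": [0.2, 0.9],
}
overlap = generality(topics)
cohesion = cohesiveness(topics, vectors)
```

To choose the number of topics, one could compute both scores for each candidate model's top keywords and look for a value where cohesiveness is high while generality stays low.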
I selected a number of topics that balanced these two measures (n=40). The chart below summarizes the frequency and top keywords for each topic in the final model. Overall, I found the topics to be highly cohesive and relevant!
Note: I found that my solutions were highly sensitive to NMF regularization. With L1 regularization, I ended up with one very large topic and a long tail of small topics. L2 regularization caused the same problem to a lesser degree. In the end, I chose to use no regularization because I was okay with articles belonging to multiple topics.
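For reference, a common way to write a regularized NMF objective is shown below. The symbols are my own notation for this sketch: X is the TF-IDF document-term matrix, W holds document-topic weights, H holds topic-term weights, and λ₁ and λ₂ are the L1 and L2 penalty strengths. Library implementations may scale the penalty terms differently.

```latex
\min_{W \ge 0,\; H \ge 0} \;
  \tfrac{1}{2} \lVert X - WH \rVert_F^2
  + \lambda_1 \left( \lVert W \rVert_1 + \lVert H \rVert_1 \right)
  + \tfrac{\lambda_2}{2} \left( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \right)
```

Setting λ₁ = λ₂ = 0 recovers the unregularized model. The L1 term pushes entries of W and H toward exactly zero, which favors sparse assignments of articles to topics, while the L2 term only shrinks entries without zeroing them out.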