Speaker
Description
In the increasingly influential context of big data, unsupervised mining of textual content, such as topic modeling, is becoming a strategic task. Topic models, among which LDA, are thus extensively used for various tasks that may then make rich use of the topics extracted from text. However, previous works that performed subjective evaluations of LDA output have shown that LDA produces poor results when it is applied to the analysis of complex data, such as the analysis of research through scientific papers. Increasing the accuracy of topic modeling methods thus remains a central concern for performing suitable topic extraction and enhancing the quality of downstream studies. The objective of this paper is twofold: first, to propose improvements to an alternative topic modeling approach based on neural clustering and feature maximization; second, to propose a quantitative comparison between LDA and this new alternative method, which we call CFMf. Using a large-scale reference corpus of full-text philosophy of science articles (N=16,917), we compare the topics produced by the latter method to those obtained with LDA. The results show a highly significant improvement (+50%) across key quantitative performance measures such as coherence, independently of the number of topics. Moreover, by applying the principles of feature maximization to LDA results, we additionally show that it can correct LDA topic descriptions and significantly increase LDA performance. We discuss these promising results and outline rich avenues for future work.
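As a rough illustration of the kind of quantitative evaluation mentioned above (not the paper's actual pipeline or settings), the sketch below trains an LDA model with the gensim library and scores its topics with the C_v coherence measure; the toy corpus, number of topics, and hyperparameters are placeholders standing in for the full-text article collection.

```python
# Minimal sketch, assuming gensim is installed; the documents below are
# placeholders for a tokenized full-text corpus, not the study's data.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

documents = [
    ["topic", "modeling", "extracts", "latent", "themes", "from", "text"],
    ["lda", "is", "a", "probabilistic", "topic", "model", "for", "text"],
    ["feature", "maximization", "selects", "salient", "terms", "per", "cluster"],
    ["coherence", "measures", "how", "interpretable", "extracted", "topics", "are"],
]

# Build the vocabulary and the bag-of-words representation of each document.
dictionary = Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Train a small LDA model (placeholder hyperparameters).
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# C_v coherence over the top words of each topic; higher is better.
coherence = CoherenceModel(model=lda, texts=documents,
                           dictionary=dictionary, coherence="c_v")
print("LDA C_v coherence:", coherence.get_coherence())
```

The same coherence score can be computed for any other topic description (for example, topics obtained from a clustering-based method) by passing the lists of top terms via the `topics` argument of `CoherenceModel`, which is how two methods can be compared on equal footing.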