Unsupervised Feature Generation using Knowledge Repositories for Text Categorization
Speaker: 
R. Rajendra Prasath
Institution: 
Norwegian University of Science and Technology (NTNU)
Date: 
25 May 2010 - 12:00pm

In the digital era, the volume of text documents grows every day. In such a huge
collection, quickly identifying the documents relevant to a specific
topic is a challenging task. This leads to Text Categorization - the process
of assigning predefined category labels (topics / themes) to natural language
text documents based on their content. Text categorization has potential
uses in many real-time applications. In this talk, we describe unsupervised
feature generation methods for improving text categorization using vast
repositories of human knowledge available on the World Wide Web. It is better
to identify the actual context behind the features, rather than merely their
occurrences in the documents, by inducing additional features from external
knowledge sources. To this end, the context of every feature is extracted as
a knowledge concept through content and structure mining of a machine-readable,
open-source knowledge repository, in particular Wikipedia. We then
apply graph-based clustering to the features present in the extracted
knowledge concepts and obtain knowledge cluster vectors.
Using these knowledge cluster vectors as generated features, the input text
documents are mapped into a higher-dimensional feature space. We then
try to capture the actual context of the given documents so as to
classify them based on that context. Even though the knowledge present on
the World Wide Web is not readily available, text mining approaches that
identify the associated feature relations in a given text fragment result
in a better categorization process. Finally, we present some
preliminary results to show that the unsupervised feature generation using
knowledge repositories identifies the associated feature relations in the given
text fragment and yields improved classification accuracy on a standard
dataset such as Reuters-21578.
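
The pipeline outlined in the abstract can be illustrated with a minimal
Python sketch. It is not the speaker's implementation: the TOY_CONCEPTS
dictionary is a hypothetical stand-in for concepts mined from Wikipedia,
connected components are used as one simple form of graph-based clustering
(the abstract does not name the algorithm), and all function names are
invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for knowledge concepts mined from Wikipedia:
# each feature maps to the terms found in its extracted concept.
TOY_CONCEPTS = {
    "stock": {"market", "share", "trade"},
    "share": {"stock", "dividend"},
    "wheat": {"grain", "crop"},
    "corn": {"grain", "harvest"},
}


def build_graph(concepts):
    """Build an undirected graph linking each feature to its concept terms."""
    graph = defaultdict(set)
    for feature, related in concepts.items():
        graph[feature]  # make sure a feature with no terms is still a node
        for term in related:
            graph[feature].add(term)
            graph[term].add(feature)
    return graph


def connected_components(graph):
    """Assumed clustering step: each connected component is one cluster."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def map_document(tokens, vocabulary, clusters):
    """Append one generated feature per cluster to the bag-of-words vector."""
    bag = [tokens.count(term) for term in vocabulary]
    generated = [sum(tokens.count(term) for term in cluster) for cluster in clusters]
    return bag + generated


graph = build_graph(TOY_CONCEPTS)
clusters = connected_components(graph)
vocabulary = sorted(TOY_CONCEPTS)
doc = "wheat harvest and grain trade".split()
vector = map_document(doc, vocabulary, clusters)
print(vector)
```

In this toy example the document matches only one original feature
("wheat"), but the generated cluster features also register "harvest",
"grain" and "trade", which is the kind of contextual evidence the abstract
argues plain term occurrence misses.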