Gambal (from "Generador d'estrAts per a recuperacio d'inforMacio BAsat en metodes de cLassificacio") is a system for clustering and visualizing HTML documents. The system comprises several tools, some of which are described below as separate entities.
The Gambal System
- Crawling the web:
- Two alternative mechanisms can be used. The first gathers HTML documents starting from an initial file of HTTP addresses, then extends the search to the links found in those pages, and so on. The second retrieves web pages from the results of a Google query.
For each retrieved HTML document, the language used is determined and words are preprocessed accordingly (stop words are removed and a stemming algorithm is applied).
- Document representation:
- All documents of interest are transformed into a unified vector representation. At this point, non-relevant words (those whose frequency falls below a given threshold) are removed, and inverted indices are built.
- Similarity computation:
- Three alternative ways of computing the similarity between documents are considered. The first compares the lexical elements of the two documents (given the internal vector representation, this is just an element-wise comparison). The second compares words using a dictionary. The third compares words taking latent semantic analysis (LSA) into account (at this stage, the LSA decomposition is computed from the available web pages).
- Clustering and/or visualization:
- Three visualization methods have been implemented: (i) Hierarchical Spherical Clustering (HSC); (ii) Self-Organizing Maps (SOM); (iii) Hierarchical Self-Organizing Maps (HSOM). HSC is based on c-means applied to a Sammon's map (Sammon's mapping is a multidimensional scaling method).
The visualization tools allow the user to navigate the hierarchy, change the level of detail, and click on a particular document to display it. Documents similar to the one clicked are also linked.
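The preprocessing, document representation, and lexical similarity steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the Gambal implementation: the stop-word list, the toy `stem` function (a crude suffix stripper, not a real stemming algorithm), the `min_freq` threshold, and the use of cosine similarity as the element-wise comparison are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real system would keep one list per language.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "on", "to", "for"}


def stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_vectors(docs, min_freq=2):
    """Return (vocabulary, term-frequency vectors, inverted index).

    Words whose total frequency over the collection is below min_freq
    are treated as non-relevant and dropped.
    """
    token_lists = [preprocess(d) for d in docs]
    totals = Counter(t for tokens in token_lists for t in tokens)
    vocab = sorted(w for w, n in totals.items() if n >= min_freq)
    position = {w: i for i, w in enumerate(vocab)}
    vectors = []
    inverted = defaultdict(set)  # word -> ids of the documents containing it
    for doc_id, tokens in enumerate(token_lists):
        vec = [0] * len(vocab)
        for t in tokens:
            if t in position:
                vec[position[t]] += 1
                inverted[t].add(doc_id)
        vectors.append(vec)
    return vocab, vectors, dict(inverted)


def cosine(u, v):
    """Cosine similarity, one possible element-wise comparison of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "Clustering of web documents",
    "Document clustering for the web",
    "Cooking pasta and cooking rice",
]
vocab, vectors, inverted = build_vectors(docs)
```

With these three sample texts, the first two documents end up with identical vectors and the third shares no retained word with them, so `cosine(vectors[0], vectors[1])` is close to 1 while `cosine(vectors[0], vectors[2])` is 0. The dictionary-based and LSA-based similarities are not covered by this sketch.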
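As a rough illustration of the idea behind HSC (c-means on a low-dimensional map), the following sketch runs a plain hard c-means on 2D points. It does not reproduce HSC itself: the coordinates are made-up stand-ins for the output of a Sammon's mapping, the function name `c_means` and its parameters are hypothetical, and whether Gambal uses a hard or a fuzzy variant is not stated here.

```python
import math


def c_means(points, c, iterations=20):
    """Hard c-means (k-means) on 2D points; returns (centers, labels).

    Centers are seeded with the first c points to keep the sketch deterministic.
    """
    centers = [tuple(p) for p in points[:c]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(c), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(c):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, labels


# Hypothetical 2D coordinates, as a Sammon's map might produce for six documents.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
centers, labels = c_means(points, c=2)
```

Here the six points fall into two well-separated groups, so the labels alternate between the two clusters. A hierarchy could be obtained by applying the same procedure again inside each cluster, although that recursion is an assumption of this sketch rather than a description of HSC.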
See the following references for detailed information about SOM, HSOM and HSC:
- Kohonen, T., (1997), Self-Organizing Maps, 2nd edition, Springer-Verlag, Germany.
- Merkl, D., Rauber, A., (2000), Document Classification with Unsupervised Neural Networks, in F. Crestani, G. Pasi (Eds.), Soft Computing in Information Retrieval, Physica Verlag and Co, ISBN:3790812994, pp. 102 - 121, Germany.
- Torra, V., Miyamoto, S., (2002), Hierarchical Spherical Clustering, Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10:2, pp. 157 - 172.
- Torra, V., Miyamoto, S., Lanau, S., (2004), Exploration of textual databases using a fuzzy hierarchical clustering algorithm in the GAMBAL system, Information Processing and Management, in press.
The figure gives a snapshot of the Gambal system. On the left-hand side, it shows a particular layer of the HSC (with a representation of the documents) and a panel with a selected document together with a list of similar ones. On the right-hand side, it shows a web page selected from that list.
The Gambal system for information retrieval and web clustering