Self organization of a massive document collection

T Kohonen; S Kaski; K Lagus; J Salojarvi; J Honkela; V Paatero; A Saarela

doi:10.1109/72.846729

IEEE Trans Neural Netw. 2000;11(3):574-85. doi: 10.1109/72.846729.

Authors

T Kohonen¹, S Kaski, K Lagus, J Salojarvi, J Honkela, V Paatero, A Saarela

Affiliation

¹ Neural Networks Research Centre, Helsinki University of Technology, Espoo, Finland.

PMID: 18249786

Abstract

This article describes the implementation of a system that organizes vast document collections according to textual similarities. The system is based on the self-organizing map (SOM) algorithm. Statistical representations of the documents' vocabularies serve as feature vectors. The main goal of our work has been to scale up the SOM algorithm to handle large amounts of high-dimensional data. In a practical experiment, we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. The feature vectors were 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
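
The pipeline summarized in the abstract (weighted word histogram, random projection to a lower dimension, SOM training) can be illustrated with a minimal sketch. This is a toy illustration, not the authors' implementation. It makes several assumptions: the projection matrix has Gaussian entries, the word weights are uniform placeholders, the SOM is the plain online algorithm with a Gaussian neighborhood on a rectangular grid, and the sizes are tiny (8 dimensions, a 3x3 map) rather than the 500 dimensions and 1,002,240 nodes reported above. All function names are hypothetical.

```python
import math
import random

def make_projection(vocab_size, dim, seed=0):
    """One random Gaussian vector of length `dim` per vocabulary word (assumed distribution)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def project(histogram, weights, matrix):
    """Map a sparse word histogram {word_index: count} to a dense `dim`-vector."""
    dim = len(matrix[0])
    out = [0.0] * dim
    for word, count in histogram.items():
        scaled = count * weights[word]      # weighted histogram entry
        column = matrix[word]
        for k in range(dim):
            out[k] += scaled * column[k]    # accumulate the random projection
    return out

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(nodes, x):
    """Index of the map node whose model vector is closest to x."""
    return min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], x))

def train_som(data, rows, cols, epochs=20, seed=0):
    """Plain online SOM with linearly decaying learning rate and neighborhood radius."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [list(rng.choice(data)) for _ in range(rows * cols)]  # init from samples
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            frac = step / total
            alpha = 0.5 * (1.0 - frac)
            sigma = max(0.5, (max(rows, cols) / 2.0) * (1.0 - frac))
            br, bc = divmod(best_matching_unit(nodes, x), cols)
            for i, node in enumerate(nodes):
                r, c = divmod(i, cols)
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
                for k in range(dim):
                    node[k] += alpha * h * (x[k] - node[k])
            step += 1
    return nodes

# Toy data: four "documents" over a 50-word vocabulary, projected to 8 dimensions.
VOCAB, DIM, ROWS, COLS = 50, 8, 3, 3
weights = [1.0] * VOCAB                     # placeholder weights (assumption)
matrix = make_projection(VOCAB, DIM)
docs = [{0: 3, 1: 1}, {0: 2, 1: 2}, {40: 4, 41: 1}, {40: 1, 41: 3}]
vectors = [project(d, weights, matrix) for d in docs]
som = train_som(vectors, ROWS, COLS)
placements = [divmod(best_matching_unit(som, v), COLS) for v in vectors]
print(placements)                           # (row, col) grid position of each document
```

The projection touches only the nonzero histogram entries, so its cost depends on the number of distinct words in a document rather than on the vocabulary size. The training loop, by contrast, searches every node for each input, which would not be practical at the map size reported in the abstract.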