relative distances between the document vectors, while collapsing them
down into a much smaller set of dimensions. In this collapse,
information is lost, and content words are superimposed on one
another.
Information loss sounds like a bad thing, but here it is a blessing.
What we are losing is noise from our original term-document matrix,
revealing similarities that were latent in the document collection.
Similar things become more similar, while dissimilar things remain
distinct. This reductive mapping is what gives LSI its seemingly
intelligent behavior of being able to correlate semantically related
terms. We are really exploiting a property of natural language, namely
that words with similar meaning tend to occur together."
-- http://www.knowledgesearch.org/lsi/lsa_explanation.htm
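The reductive mapping described in the quote can be sketched with a truncated SVD on a toy term-document matrix. This is just an illustrative sketch using NumPy with made-up data, not the code from the linked tutorial: two documents that share no terms directly (one mentions only "car", the other only "auto") end up close together in the reduced space because both co-occur with the same third document.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Terms: car, auto, flower. Docs 0 and 1 share no terms, but both
# co-occur with doc 2; docs 3 and 4 are about something unrelated.
A = np.array([
    [1.0, 0.0, 1.0, 0.0, 0.0],  # car
    [0.0, 1.0, 1.0, 0.0, 0.0],  # auto
    [0.0, 0.0, 0.0, 1.0, 1.0],  # flower
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the top-k latent dimensions (the "collapse")
docs_k = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the reduced space

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After the reduction, the "car" doc and the "auto" doc are nearly
# identical, while the "flower" doc stays orthogonal to both.
print(cos(docs_k[0], docs_k[1]))  # close to 1.0
print(cos(docs_k[0], docs_k[3]))  # close to 0.0
```

Dropping the smallest singular values discards exactly the component that distinguishes "car" from "auto" here, which is the sense in which the information loss is a blessing.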
That page continues with a nice tutorial on how to use SVD:
http://www.knowledgesearch.org/lsi/tutorial.htm
Unfortunately, I haven't been able to get the corresponding code
to build on the latest Ubuntu:
http://semantic-engine.googlecode.com/
I'm contacting the developers and other project members to bug them about it.