relative distances between the document vectors, while collapsing them
down into a much smaller set of dimensions. In this collapse,
information is lost, and content words are superimposed on one
another.
Information loss sounds like a bad thing, but here it is a blessing.
What we are losing is noise from our original term-document matrix,
revealing similarities that were latent in the document collection.
Similar things become more similar, while dissimilar things remain
distinct. This reductive mapping is what gives LSI its seemingly
intelligent behavior of being able to correlate semantically related
terms. We are really exploiting a property of natural language, namely
that words with similar meaning tend to occur together."
-- http://www.knowledgesearch.org/lsi/lsa_explanation.htm
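The reductive mapping described in the quote can be sketched with a truncated SVD on a toy term-document matrix. This is just an illustrative sketch using NumPy with made-up data, not the code from the linked tutorial: two documents that share no terms directly (one mentions only "car", the other only "auto") end up close together in the reduced space because both co-occur with the same third document.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Terms: car, auto, flower. Docs 0 and 1 share no terms, but both
# co-occur with doc 2; docs 3 and 4 are about something unrelated.
A = np.array([
    [1.0, 0.0, 1.0, 0.0, 0.0],  # car
    [0.0, 1.0, 1.0, 0.0, 0.0],  # auto
    [0.0, 0.0, 0.0, 1.0, 1.0],  # flower
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the top-k latent dimensions (the "collapse")
docs_k = (np.diag(s[:k]) @ Vt[:k]).T  # documents in the reduced space

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After the reduction, the "car" doc and the "auto" doc are nearly
# identical, while the "flower" doc stays orthogonal to both.
print(cos(docs_k[0], docs_k[1]))  # close to 1.0
print(cos(docs_k[0], docs_k[3]))  # close to 0.0
```

Dropping the smallest singular values discards exactly the component that distinguishes "car" from "auto" here, which is the sense in which the information loss is a blessing.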
That page continues with a nice tutorial on how to use SVD:
http://www.knowledgesearch.org/lsi/tutorial.htm
Unfortunately, I haven't been able to get the corresponding code
to build on the latest Ubuntu:
http://semantic-engine.googlecode.com/
I'm contacting the developers and other project members to bug them about it.