Support Questions

butkiz · ‎09-17-2015

Dear Colleagues,

To run SSVD in Mahout, I represented the documents as a TF-IDF matrix using seq2sparse

(the row indices are the doc IDs and the column indices are the dictionary IDs, i.e. the word IDs).

The input for SSVD is this TF-IDF matrix.

The output of the SSVD job is the matrices U, S and V (transpose).

How can I interpret this output with respect to the original TF-IDF matrix? Should I multiply the original matrix by U, S or V?

What can I conclude from the result?

Thanks in advance and best regards,

butkiz

srowen · ‎09-17-2015

The output is as you say -- these are the products of the SVD. You can
do what you want with them, and it depends on what you're trying to
achieve. You can look at the matrix V S to study term similarities, or
U S to discover document similarities for example.
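
To make the "V S" and "U S" idea concrete, here is a minimal sketch in plain Python. It assumes the factors have already been exported from the SSVD output into ordinary lists; the numbers below are made up for illustration and are not real SSVD output, and the helper names are hypothetical. Since the tfidf matrix A is approximately U S V^T and the columns of V are orthonormal, A V is approximately U S, so the rows of U S are the documents projected into the reduced space, and the rows of V S play the same role for the terms.

```python
import math

def scale_by_singular_values(factor, sigma):
    """Return factor * diag(sigma): entry j of every row is scaled by sigma[j]."""
    return [[x * s for x, s in zip(row, sigma)] for row in factor]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy rank-2 factors (illustrative values only, not real SSVD output).
sigma = [3.0, 1.0]                              # singular values, diagonal of S
V = [[0.60, 0.10], [0.58, 0.12], [0.10, -0.70]] # one row per term
U = [[0.70, 0.10], [0.69, 0.05], [0.10, -0.90]] # one row per document

term_vecs = scale_by_singular_values(V, sigma)  # rows of V S
doc_vecs = scale_by_singular_values(U, sigma)   # rows of U S

# Terms 0 and 1 point in nearly the same direction; term 2 does not.
sim_01 = cosine(term_vecs[0], term_vecs[1])
sim_02 = cosine(term_vecs[0], term_vecs[2])
print(sim_01 > sim_02)
```

The same `cosine` call on rows of `doc_vecs` compares documents instead of terms.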

butkiz · ‎09-17-2015

Thanks! I am trying to figure out which terms are related to one topic. Should I first multiply the V and S matrices and then compute the distances between the "new" vectors? What is your understanding?

Thanks and regards,
butkiz

srowen · ‎09-17-2015

I suppose you can cluster term vectors in V S for this purpose, to
discover related terms and thus topics.
This is the type of problem where you might more usually use LDA.

I know you're using Mahout, but if you ever consider using Spark,
there's a chapter on exactly this in our book:
http://shop.oreilly.com/product/0636920035091.do
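
The clustering suggestion above could be sketched roughly as follows, again in plain Python with made-up numbers. This is a bare-bones Lloyd's k-means, not Mahout's own clustering job; the term list, the factor values and the choice to seed the centroids with the first k points are all assumptions made to keep the example small and deterministic. A real run would use a better initialisation and many more dimensions.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical terms and rank-2 factors (illustrative values only).
terms = ["cat", "stock", "dog", "bond"]
sigma = [3.0, 2.0]
V = [[0.70, 0.10], [0.10, 0.70], [0.68, 0.12], [0.12, 0.69]]

# Rows of V S: each term as a vector in the reduced space.
term_vecs = [[x * s for x, s in zip(row, sigma)] for row in V]

labels = kmeans(term_vecs, k=2)

# Group the terms by cluster label; each group is a candidate "topic".
topics = {}
for term, label in zip(terms, labels):
    topics.setdefault(label, []).append(term)
print(topics)
```

Each resulting group of terms is only a candidate topic; how well the groups line up with real topics depends on the rank chosen for the SSVD and on the number of clusters.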

Cloudera Community

Understanding the mahout SSVD output!