Understanding the Mahout SSVD output!

Contributor

Dear Colleagues,

 

To run SSVD in Mahout, I represented the documents as a TF-IDF matrix using seq2sparse

(the row indices are the document IDs and the column indices are the dictionary IDs, i.e. the word IDs).

 

The input for SSVD is this TF-IDF matrix.

The output of the SSVD job is the set of matrices U, S, and V (transposed).

 

How can I interpret this output with respect to the original TF-IDF matrix? Should I multiply the original matrix by U, S, or V?

What can be concluded from the result?

 

Thanks in advance and best regards,

 butkiz

 

3 REPLIES

Master Collaborator
The output is as you say -- these are the products of the SVD. You can
do what you want with them, and it depends on what you're trying to
achieve. You can look at the matrix V S to study term similarities, or
U S to discover document similarities for example.
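
To make the V S and U S idea concrete, here is a minimal sketch in plain Python. It does not call Mahout: the factors U, S, and V are hand-computed for a hypothetical 3-document, 4-term matrix, and V is assumed to be stored as terms x k (if your job gives you V transposed, swap rows and columns first). The helper names scale_columns and cosine are made up for this illustration.

```python
import math

# Toy SVD factors of A = [[1,1,0,0],[1,1,0,0],[0,0,1,1]] (3 docs x 4 terms).
# These values are hand-computed for illustration, not output of a real SSVD run.
r = 1 / math.sqrt(2)
U = [[r, 0.0], [r, 0.0], [0.0, 1.0]]           # docs x k
S = [2.0, math.sqrt(2)]                        # k singular values
V = [[r, 0.0], [r, 0.0], [0.0, r], [0.0, r]]   # terms x k (assumed not transposed)

def scale_columns(M, s):
    """Multiply column j of M by s[j], i.e. compute M * diag(s)."""
    return [[x * sj for x, sj in zip(row, s)] for row in M]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

doc_vectors = scale_columns(U, S)    # rows of U S: one latent vector per document
term_vectors = scale_columns(V, S)   # rows of V S: one latent vector per term

# Sanity check: U S V^T should reproduce the original matrix.
A_hat = [[sum(d * v for d, v in zip(doc, term)) for term in V] for doc in doc_vectors]

print("term 0 vs term 1:", cosine(term_vectors[0], term_vectors[1]))
print("term 0 vs term 2:", cosine(term_vectors[0], term_vectors[2]))
print("doc 0 vs doc 2:  ", cosine(doc_vectors[0], doc_vectors[2]))
```

In this toy case, terms 0 and 1 always occur together, so their rows in V S point in the same direction, while term 2 lives in a different latent dimension. With real SSVD output you would load the factor rows from the job's result files instead of typing them in.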

Contributor
Thanks! I am trying to figure out which terms are related to one topic. Should I first multiply the V and S matrices and then compute the distances between the "new" vectors? What is your understanding?

Thanks and regards,
butkiz

Master Collaborator
I suppose you can cluster term vectors in V S for this purpose, to
discover related terms and thus topics.
This is the type of problem where you might more usually use LDA.

I know you're using Mahout, but if you ever consider using Spark,
there's a chapter on exactly this in our book:
http://shop.oreilly.com/product/0636920035091.do
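
As a rough sketch of what clustering the term vectors in V S could look like, the following plain-Python k-means groups a handful of term vectors into topics. Everything here is an assumption for illustration: the four term vectors and the words attached to them are invented, the kmeans helper is a bare-bones Lloyd's algorithm with a deterministic farthest-point start, and a real run would use Mahout's or another library's clustering on the actual V S rows.

```python
import math

# Hypothetical rows of V S (one latent vector per term); values are invented.
terms = ["hadoop", "cluster", "recipe", "flour"]
term_vectors = [
    [math.sqrt(2), 0.0],
    [math.sqrt(2), 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-point initialisation keeps this sketch deterministic.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans(term_vectors, k=2)

# Group the terms by cluster label to read them as candidate topics.
topics = {}
for term, lab in zip(terms, labels):
    topics.setdefault(lab, []).append(term)
print(topics)
```

Euclidean distance is used here for brevity. If term frequency dominates the vector lengths in your data, normalising the rows of V S first (so that the clustering effectively works on cosine similarity) may give cleaner topics.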