Understanding the Mahout SSVD output!

Contributor

Dear Colleagues,

 

To run SSVD in Mahout, I represented the documents as a TF-IDF matrix using seq2sparse

(the row indices are the doc IDs and the column indices are the dictionary IDs, i.e. the word IDs).

 

The input to SSVD is this TF-IDF matrix.

The output of the SSVD job is the matrices U, S, and V (transpose).

 

How can I interpret this output with respect to the original TF-IDF matrix? Should I multiply the original matrix by U, S, or V?

What is the conclusion?

 

Thanks in advance and best regards,

 butkiz

 

3 REPLIES

Master Collaborator
The output is as you say -- these are the products of the SVD. You can
do what you want with them, and it depends on what you're trying to
achieve. You can look at the matrix V S to study term similarities, or
U S to discover document similarities for example.
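
To make "look at V S" concrete: if the TF-IDF matrix A is approximately U S V^T, then A^T A is approximately V S^2 V^T, so dot products between rows of V S approximate dot products between term columns of A. The same argument with A A^T gives document similarities from rows of U S. Below is a minimal sketch in plain Python, not Mahout code. The values in V and S are made up for illustration; in practice they would be read from the SSVD job output.

```python
import math

# Hypothetical rank-2 factors (illustrative numbers, not real SSVD output).
# V has one row per term (dictionary ID) and one column per singular value.
V = [
    [0.70, 0.10],   # term 0
    [0.69, 0.12],   # term 1
    [0.10, -0.80],  # term 2
]
S = [5.0, 2.0]      # singular values, i.e. the diagonal of S


def scale_rows(matrix, sigma):
    """Compute matrix * diag(sigma): column j of every row is scaled by sigma[j]."""
    return [[x * s for x, s in zip(row, sigma)] for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Rows of V S are the term vectors in the reduced space.
term_vectors = scale_rows(V, S)

sim_01 = cosine(term_vectors[0], term_vectors[1])  # terms 0 and 1
sim_02 = cosine(term_vectors[0], term_vectors[2])  # terms 0 and 2
print(sim_01, sim_02)
```

With these made-up numbers, terms 0 and 1 point in nearly the same direction and come out far more similar than terms 0 and 2. Swapping V for U gives the same comparison between documents.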

Contributor
Thanks! I am trying to figure out which terms are related to one topic. Should I first multiply the V and S matrices and then compute the distances between the "new" vectors? What's your understanding?

Thanks and regards,
butkiz

Master Collaborator
I suppose you can cluster term vectors in V S for this purpose, to
discover related terms and thus topics.
This is the type of problem where you might more usually use LDA.
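
As a rough illustration of clustering term vectors, here is a small k-means sketch in plain Python. It is not Mahout's own clustering job, and everything in it is an assumption for the sake of the example: the points stand in for rows of V S, the numbers are invented, Euclidean distance is used, and the centroids are seeded with the first k points to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means; the first k points are used as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical term vectors (rows of V S), one per dictionary ID.
term_vectors = [
    [3.5, 0.2],
    [0.5, -1.6],
    [3.4, 0.3],
    [0.4, -1.5],
    [3.6, 0.1],
]

labels = kmeans(term_vectors, k=2)
print(labels)
```

Terms that share a label sit close together in the reduced space and can be read as candidates for one topic. On real data the number of clusters and the distance measure are choices to experiment with.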

I know you're using Mahout, but if you ever consider using Spark,
there's a chapter on exactly this in our book:
http://shop.oreilly.com/product/0636920035091.do