Understanding the mahout SSVD output!


Contributor

Dear Colleagues,

 

To run SSVD in Mahout, I represented the documents as a TF-IDF matrix using seq2sparse (the row indices are the doc IDs and the column indices are the dictionary IDs, i.e. the word IDs).

 

The input for SSVD is this TF-IDF matrix.

The outputs of the SSVD job are the matrices U, S, and V (transpose).

 

How can I interpret this output with respect to the original TF-IDF matrix? Should I multiply the original matrix by U, S, or V?

What is the conclusion?

 

Thanks in advance and best regards,

 butkiz

 


Re: Understanding the mahout SSVD output!

Master Collaborator
The output is as you say -- these are the products of the SVD. You can
do what you want with them, and it depends on what you're trying to
achieve. You can look at the matrix V S to study term similarities, or
U S to discover document similarities for example.
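
To make the V S and U S suggestion concrete, here is a minimal sketch in NumPy rather than Mahout. The 4 x 5 TF-IDF matrix, the rank k = 2, and the helper name cosine_similarity_matrix are made up for illustration; with real SSVD output you would load U, S, and V from the job's output instead of calling np.linalg.svd.

```python
import numpy as np

# Hypothetical TF-IDF matrix: rows = documents, columns = terms.
# Docs 0-1 mostly use terms 0-1; docs 2-3 mostly use terms 2-4.
A = np.array([
    [1.0, 0.8, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9, 0.7],
    [0.0, 0.1, 0.8, 1.0, 0.9],
])

k = 2  # assumed reduced rank, chosen only for this toy example

# Exact SVD stands in for SSVD here: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

doc_vecs = U_k * s_k   # rows of U S: one k-dimensional vector per document
term_vecs = V_k * s_k  # rows of V S: one k-dimensional vector per term


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)
    return unit @ unit.T


term_sim = cosine_similarity_matrix(term_vecs)  # term-by-term similarities
doc_sim = cosine_similarity_matrix(doc_vecs)    # document-by-document similarities
```

Note that U S is the same as multiplying the original matrix by V (A V = U S), so the document vectors are the original rows projected onto the reduced term space.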

Re: Understanding the mahout SSVD output!

Contributor
Thanks! I am trying to figure out which terms are related to one topic. Should I first multiply the V and S matrices and then compute the distances between the "new" vectors? What's your understanding?

Thanks and regards,
butkiz

Re: Understanding the mahout SSVD output!

Master Collaborator
I suppose you can cluster term vectors in V S for this purpose, to
discover related terms and thus topics.
This is the type of problem where you might more usually use LDA.

I know you're using Mahout, but if you ever consider using Spark,
there's a chapter on exactly this in our book:
http://shop.oreilly.com/product/0636920035091.do
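
As a rough sketch of the clustering idea above (NumPy, not Mahout): the term vectors in V S are length-normalised and grouped with a small hand-written k-means. The matrix, the term names, the rank k = 2, the number of clusters, and the kmeans helper are all hypothetical and only meant to show the shape of the computation.

```python
import numpy as np

# Hypothetical vocabulary and TF-IDF matrix (rows = documents, columns = terms).
terms = ["cat", "dog", "stock", "bond", "market"]
A = np.array([
    [1.0, 0.8, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9, 0.7],
    [0.0, 0.1, 0.8, 1.0, 0.9],
])

k = 2  # assumed reduced rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = Vt[:k, :].T * s[:k]  # rows of V S

# Normalise rows so Euclidean distance behaves like cosine distance.
term_vecs = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)


def kmeans(X, n_clusters, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-first initialisation."""
    centroids = [X[0]]
    for _ in range(1, n_clusters):
        # Distance from every point to its nearest already-chosen centroid
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)

    # Final assignment against the last set of centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


labels = kmeans(term_vecs, n_clusters=2)
topics = {}
for term, label in zip(terms, labels):
    topics.setdefault(int(label), []).append(term)
```

Each entry in topics then lists the terms that ended up in one cluster, which you can read as a candidate topic. On real data you would need to choose the rank and the number of clusters yourself, and LDA may still give cleaner topics.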
