Mahout RowSimilarity

New Contributor

Hi all,

I'm new to the community and Mahout. Happy to be here. :-)

I have a problem that I'm having difficulty with. I've set up an instance on Amazon with Mahout and can run some basic machine learning tasks (just testing). Now I'm trying to do a specific task and am unsure how to proceed.

Imagine I have a data file containing the following columns: user_id, item_id, and rating, where rating is how each user rated the item on a scale of -1 to 1 (the necessity of negative ratings will become apparent in a minute). Ultimately, what I'm trying to do is create a similarity matrix that measures the similarity between all pairs of USERS. To do this, I would like to transform the users' ratings into a matrix (rows are users, columns are items) and then run RowSimilarityJob to find the dot product / cosine between all rows.
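
To make the target computation concrete, here is a minimal sketch in plain Python (not Mahout, and not distributed) of what I mean: build one sparse row per user from the (user_id, item_id, rating) triples, then take the cosine between every pair of rows. The function name and the sample data are made up for illustration, and I'm assuming an unrated item counts as 0.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def user_cosine_similarities(ratings):
    """Cosine similarity for every pair of users.

    ratings: iterable of (user_id, item_id, rating) triples.
    Returns a dict mapping (user_a, user_b) -> cosine, with user_a < user_b.
    """
    # Sparse user rows: user_id -> {item_id: rating}. Missing items count as 0.
    rows = defaultdict(dict)
    for user_id, item_id, rating in ratings:
        rows[user_id][item_id] = float(rating)

    # Euclidean norm of each user's row.
    norms = {u: sqrt(sum(r * r for r in row.values())) for u, row in rows.items()}

    sims = {}
    for u, v in combinations(sorted(rows), 2):
        # Only items rated by both users contribute to the dot product.
        shared = rows[u].keys() & rows[v].keys()
        dot = sum(rows[u][i] * rows[v][i] for i in shared)
        denom = norms[u] * norms[v]
        sims[(u, v)] = dot / denom if denom else 0.0
    return sims


# Hypothetical sample data: u1 and u2 agree, u3 is their opposite,
# and u4 shares no items with anyone.
sample = [
    ("u1", "a", 1.0), ("u1", "b", -1.0),
    ("u2", "a", 1.0), ("u2", "b", -1.0),
    ("u3", "a", -1.0), ("u3", "b", 1.0),
    ("u4", "c", 0.5),
]
sims = user_cosine_similarities(sample)
```

With negative ratings allowed, users with opposite tastes come out near -1 instead of just near 0, which is why I need the -1 to 1 scale. This loop visits every pair of users, so it only works for small data; what I'm looking for is the Mahout way to do the same thing at scale.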

I feel like my problem is simple and has probably been done 1000 times, but I can't seem to find any documentation directly on the subject. The best I've been able to do so far is use the itemsimilarity job (where I've swapped item for user), but it's mathematically not quite what I want. Help! haha

Thanks!

Jonathan

1 REPLY

Re: Mahout RowSimilarity

Contributor

Hi Jonathan,

You should first convert the data file into a sequence file using Mahout, e.g.:

mahout seqdirectory --input inputFolder --output outputFolder

or using the Java class

SequenceFilesFromDirectory

Then you can convert the sequence file(s) into a matrix of sparse vectors, e.g. using:

mahout seq2sparse -i seqfilesFolder -o outputFolder

or using the Java class

SparseVectorsFromSequenceFiles

After you have vectorized the input, you can run your jobs (e.g. k-means) on the output folder of seq2sparse.

Regards,

 butkiz