Support Questions

spark python tf idf


Hi, I have a problem with a TF-IDF implementation in Spark (Python). Could someone correct my code? In particular, I ran into a problem dealing with two keys (document and word): how do I do a reduceByKey per word and per document at the same time?
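To make the composite-key question concrete, here is the counting I am after in plain Python (toy data and names are my own): the key is the (document, word) tuple, so a single reduce counts per document and per word in one pass.

```python
# Plain-Python analogue of reduceByKey over a composite (document, word) key.
# The corpus here is toy example data, not real input.
from collections import defaultdict

docs = {"doc1": "spark python spark", "doc2": "python tf idf"}

counts = defaultdict(int)
for doc, text in docs.items():
    for word in text.lower().split():
        counts[(doc, word)] += 1  # one counter per (document, word) pair

# counts[("doc1", "spark")] is 2: reduced per word AND per document at once
```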

from pyspark import SparkContext
import math

sc = SparkContext()
rdd = sc.wholeTextFiles('')  # fill in the input directory; yields (filename, content) pairs

# TF: key by the (document, word) pair and count. Note split() tokenizes on
# whitespace; splitting on '\r\n' would yield lines, not words.
tf = rdd.flatMap(lambda x: [((x[0], w), 1) for w in x[1].lower().split()]) \
        .reduceByKey(lambda a, b: a + b) \
        .persist()

# IDF: count the documents containing each word, then take log(5/n)
# (5 = total number of documents, as in the original code).
idf = rdd.flatMap(lambda x: [(x[0], w) for w in x[1].lower().split()]) \
         .distinct() \
         .map(lambda dw: (dw[1], 1)) \
         .reduceByKey(lambda a, b: a + b) \
         .mapValues(lambda n: math.log(5.0 / n))

# Re-key tf by word so it can join with idf, multiply, and key the result
# back by (document, word).
tf_idf = tf.map(lambda kv: (kv[0][1], (kv[0][0], kv[1]))) \
           .join(idf) \
           .map(lambda kv: ((kv[1][0][0], kv[0]), kv[1][0][1] * kv[1][1])) \
           .collect()
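For reference, the same TF-IDF computation on a toy two-document corpus in plain Python (the example data is my own), mirroring the Spark pipeline: tf per (document, word), document frequency per word, then a join-and-multiply. This can be used to sanity-check the Spark output.

```python
# TF-IDF on a toy corpus in plain Python, mirroring the RDD pipeline.
import math

docs = {"doc1": "spark python spark", "doc2": "python tf idf"}
N = len(docs)  # total number of documents

tf, df = {}, {}
for doc, text in docs.items():
    words = text.lower().split()
    for w in words:
        tf[(doc, w)] = tf.get((doc, w), 0) + 1  # term count per document
    for w in set(words):
        df[w] = df.get(w, 0) + 1                # document frequency per word

# "Join" tf with idf on the word and multiply.
tf_idf = {(doc, w): c * math.log(N / df[w]) for (doc, w), c in tf.items()}
```

A word that appears in every document (here "python") gets idf = log(N/N) = 0, so its tf-idf is 0, while "spark" in doc1 scores 2 * log(2).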