Support Questions

gerard_alexande · ‎11-08-2016

This statement for bigram generation always gives the right result, no matter how much paralleliziation provided. Why does it always give the correct result? Due to sliding itself? Parallel processing for adding +1 to a number is easy to understand, but this here has to do with data over partitions.

Reading a big file or mapPartitions approach will result in minor loss of accuracy (verified practically although obvious), but why not here? It must be simple, but I cannot see it.

val rdd = sc.parallelize(Array("A", "B", "C", "D", "E", "F"),5)   
rdd.sliding(2).collect()

jfrazee · ‎11-08-2016

@Gerard Alexander sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams. Compare rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() and rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() to help with the intuition.

View solution in original post

jfrazee · ‎11-08-2016

@Gerard Alexander sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams. Compare rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() and rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() to help with the intuition.

gerard_alexande · ‎11-08-2016

Fantastic. OK, I sort of stated "due to sliding itself" but could not in fact make the link. Great stuff.

gerard_alexande · ‎11-09-2016

So, just to finish the dicussion I have an inferior sliding algorithm like this using file parallelization:

val sentences = sc.textFile("/FileStore/tables/flcpmtie1478689806159/small_file.txt",5)
val bigrams =   sentences.map(sentence => sentence.trim.split(' ')).flatMap( wordList =>
  for (i <- List.range(0, (wordList.length - 1))) yield ((wordList(i), wordList(i + 1)), 1)
)

I also always get the correct bigrams, except on line boundaries, but leaving that aside. The code is as suggested by someone else.

This leads me to the idea that flatMap will also present the data in sequence of the partitions as you previously stated. Or is this not so.

Then, do these things all work in parallel as per map? Or is it sequentially looking for the boundaries? Or a combination of both?

I actually thought I might get an error with the above when using partitions.

@jfrazee

Cloudera Community

Support Questions

sc.parallelize effect on parallel algorithms for bigram?