11-09-2016 11:41 AM
So, just to finish the discussion: I have an inferior sliding algorithm like this, using file parallelization:

val sentences = sc.textFile("/FileStore/tables/flcpmtie1478689806159/small_file.txt", 5)
val bigrams = sentences.map(sentence => sentence.trim.split(' ')).flatMap(wordList =>
  for (i <- List.range(0, wordList.length - 1)) yield ((wordList(i), wordList(i + 1)), 1)
)

I always get the correct bigrams, except across line boundaries, but leaving that aside. The code is as suggested by someone else. This leads me to think that flatMap also presents the data in partition order, as you previously stated. Or is this not so? Then, do these things all work in parallel, as with map? Or is it sequentially looking for the boundaries? Or a combination of both? I actually thought I might get an error with the above when using partitions. @jfrazee
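For what it's worth, here is a minimal plain-Scala sketch (no Spark involved) of why the bigrams across line boundaries go missing. The sample lines and the names BigramSketch and lineBigrams are made up for illustration. The flatMap body only ever sees the wordList of one line at a time, so no pair can span two lines, whatever the partitioning is. As far as I understand it, Spark applies map and flatMap element by element within each partition, which is also why the snippet should not raise an error when partitions are used.

```scala
// Plain-Scala sketch of the per-line logic; names and sample data are hypothetical.
object BigramSketch {
  // Same logic as the flatMap body above: adjacent word pairs within ONE line.
  def lineBigrams(line: String): List[((String, String), Int)] = {
    val wordList = line.trim.split(' ')
    for (i <- List.range(0, wordList.length - 1))
      yield ((wordList(i), wordList(i + 1)), 1)
  }

  // Two lines, standing in for two elements of the RDD returned by sc.textFile.
  val lines: List[String] = List("a b c", "d e")

  // Each line is handled independently, so the pair ("c", "d") is never produced.
  val bigrams: List[((String, String), Int)] = lines.flatMap(lineBigrams)
}
```

Because each element is processed on its own, the result for a given line does not depend on which partition it lands in; only pairs that would straddle two lines are lost.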