<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Best Approach - Operating on values of a Spark Pair RDD (discard key) in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122518#M30843</link>
    <description>&lt;P&gt;If your goal is simply to operate on the values one at a time, you can use the "values" method of the JavaPairRDD to get a plain RDD, and then use the map method on that. Something like this:&lt;/P&gt;&lt;PRE&gt;JavaPairRDD&amp;lt;LongWritable, BytesWritable&amp;gt; fixedFileRdd = getItSomeHow();
JavaRDD&amp;lt;String&amp;gt; resultantRdd = fixedFileRdd.values().map(
  new Function&amp;lt;BytesWritable, String&amp;gt;() {
    public String call(BytesWritable i) {
      // do stuff
      return System.currentTimeMillis() + "-" + new String(i.copyBytes());
    }
  });&lt;/PRE&gt;</description>
    <pubDate>Mon, 06 Jun 2016 21:09:40 GMT</pubDate>
    <dc:creator>clukasik</dc:creator>
    <dc:date>2016-06-06T21:09:40Z</dc:date>
    <item>
      <title>Best Approach - Operating on values of a Spark Pair RDD (discard key)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122517#M30842</link>
      <description>&lt;P&gt;Hi All, &lt;/P&gt;&lt;P&gt;I need a recommendation on the best approach for solving the problem below. I have included the code snippet that I have so far. &lt;/P&gt;&lt;P&gt;I read an HDFS file using a custom input format and get back a JavaPairRDD. I want to operate on the values one at a time, and I do not care about the keys. &lt;/P&gt;&lt;P&gt;Is a Java List a scalable data structure for holding the values? Please have a look at the code below and suggest alternatives. Also, does the parallelize at the end of the code give any benefit? &lt;/P&gt;&lt;PRE&gt;JavaPairRDD&amp;lt;LongWritable, BytesWritable&amp;gt; fixedFileRdd = getItSomeHow();
List&amp;lt;String&amp;gt; zeroValue = new ArrayList&amp;lt;String&amp;gt;();
Function2&amp;lt;List&amp;lt;String&amp;gt;, Tuple2&amp;lt;LongWritable, BytesWritable&amp;gt;, List&amp;lt;String&amp;gt;&amp;gt; seqOp = new Function2&amp;lt;List&amp;lt;String&amp;gt;, Tuple2&amp;lt;LongWritable, BytesWritable&amp;gt;, List&amp;lt;String&amp;gt;&amp;gt;() {
  public List&amp;lt;String&amp;gt; call(List&amp;lt;String&amp;gt; valueList, Tuple2&amp;lt;LongWritable, BytesWritable&amp;gt; eachKeyValue) throws Exception {
    valueList.add(doWhatever(new String(eachKeyValue._2.copyBytes())));
    return valueList;
  }
  private String doWhatever(String string) {
    // will be an external utility method call; this is for representational purposes only
    return System.currentTimeMillis() + "-" + string;
  }
};
Function2&amp;lt;List&amp;lt;String&amp;gt;, List&amp;lt;String&amp;gt;, List&amp;lt;String&amp;gt;&amp;gt; combOp = new Function2&amp;lt;List&amp;lt;String&amp;gt;, List&amp;lt;String&amp;gt;, List&amp;lt;String&amp;gt;&amp;gt;() {
  public List&amp;lt;String&amp;gt; call(List&amp;lt;String&amp;gt; listOne, List&amp;lt;String&amp;gt; listTwo) throws Exception {
    listOne.addAll(listTwo);
    return listOne;
  }
};
List&amp;lt;String&amp;gt; resultantList = fixedFileRdd.aggregate(zeroValue, seqOp, combOp);
JavaRDD&amp;lt;String&amp;gt; resultantRdd = jsc.parallelize(resultantList);
resultantRdd.saveAsTextFile("out-dir");&lt;/PRE&gt;</description>
      <pubDate>Mon, 06 Jun 2016 20:21:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122517#M30842</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-06-06T20:21:38Z</dc:date>
    </item>
    <item>
      <title>Re: Best Approach - Operating on values of a Spark Pair RDD (discard key)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122518#M30843</link>
      <description>&lt;P&gt;If your goal is simply to operate on the values one at a time, you can use the "values" method of the JavaPairRDD to get a plain RDD, and then use the map method on that. Something like this:&lt;/P&gt;&lt;PRE&gt;JavaPairRDD&amp;lt;LongWritable, BytesWritable&amp;gt; fixedFileRdd = getItSomeHow();
JavaRDD&amp;lt;String&amp;gt; resultantRdd = fixedFileRdd.values().map(
  new Function&amp;lt;BytesWritable, String&amp;gt;() {
    public String call(BytesWritable i) {
      // do stuff
      return System.currentTimeMillis() + "-" + new String(i.copyBytes());
    }
  });&lt;/PRE&gt;</description>
      <pubDate>Mon, 06 Jun 2016 21:09:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122518#M30843</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-06-06T21:09:40Z</dc:date>
    </item>
    <item>
      <title>Re: Best Approach - Operating on values of a Spark Pair RDD (discard key)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122519#M30844</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/10163/clukasik.html" nodeid="10163"&gt;@clukasik&lt;/A&gt;. That solves the problem. I was going around in circles trying to address this.
Also, on the second part of the question: does it make sense to parallelize a list before storing it to a file, as in the last two lines of my code?&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2016 21:16:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122519#M30844</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-06-06T21:16:57Z</dc:date>
    </item>
    <item>
      <title>Re: Best Approach - Operating on values of a Spark Pair RDD (discard key)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122520#M30845</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/10529/akeezhadath.html" nodeid="10529"&gt;@akeezhadat&lt;/A&gt;&lt;P&gt; - use parallelize when you have a collection in the local JVM (driver) that you want to split across the cluster. In your example, it would hurt to bring the RDD contents local (to driver JVM) and then push them back to the cluster (as a distributed data set)&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2016 21:25:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122520#M30845</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-06-06T21:25:52Z</dc:date>
    </item>
    <item>
      <title>Re: Best Approach - Operating on values of a Spark Pair RDD (discard key)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122521#M30846</link>
      <description>&lt;P&gt;Thanks &lt;A rel="user" href="https://community.cloudera.com/users/10163/clukasik.html" nodeid="10163"&gt;@clukasik&lt;/A&gt;. Got it!!&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2016 21:28:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-Approach-Operating-on-values-of-a-Spark-Pair-RDD/m-p/122521#M30846</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-06-06T21:28:37Z</dc:date>
    </item>
  </channel>
</rss>

