Member since 05-17-2016

- 190 Posts
- 46 Kudos Received
- 11 Solutions

**My Accepted Solutions**
| Title | Views | Posted |
|---|---|---|
|  | 1744 | 09-07-2017 06:24 PM |
|  | 2310 | 02-24-2017 06:33 AM |
|  | 3441 | 02-10-2017 09:18 PM |
|  | 7978 | 01-11-2017 08:55 PM |
|  | 5921 | 12-15-2016 06:16 PM |
**06-08-2016 01:53 PM**
Thanks @clukasik. Is there any performance difference in choosing client deploy mode over cluster mode? If I use the default client deploy mode, I get control over where my driver program runs. However, I wanted to be sure that it does not cause any performance issues.
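For reference, a minimal sketch of where each deploy mode puts the driver, assuming a standard spark-submit workflow. The mode is normally passed as `--deploy-mode` on the command line; `spark.submit.deployMode` is the matching configuration key, and the app name here is made up:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DeployModeDemo {
    public static void main(String[] args) {
        // "client": the driver runs on the machine that invoked spark-submit,
        // so driver <-> executor traffic crosses the network from that machine.
        // "cluster": the driver is launched inside the cluster, closer to the
        // executors, which can matter for jobs with chatty driver-side logic.
        SparkConf conf = new SparkConf()
                .setAppName("deploy-mode-demo")
                .set("spark.submit.deployMode", "client");

        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... job logic ...
        jsc.stop();
    }
}
```

Broadly, executor-side work dominates most jobs, so the choice mainly affects driver <-> executor round trips and where driver output lands, rather than raw throughput.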
**06-08-2016 01:36 PM**
@Rajkumar Singh: Yes, but here the file resides on the machine from which we trigger the spark-submit. So I was looking for a way to read it in the driver without actually having to move it to all the workers, or even to HDFS.
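A minimal sketch of that idea, assuming client deploy mode so the driver runs on the machine that holds the file; the path and class name are illustrative. The file is read with plain Java I/O on the driver, and its contents (not the file itself) are broadcast to the executors:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class DriverSideRead {
    public static void main(String[] args) throws IOException {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("driver-side-read"));

        // Plain Java I/O: the file only has to exist on the machine running
        // the driver, i.e. where spark-submit was invoked in client mode.
        List<String> lines = Files.readAllLines(Paths.get("/local/path/lookup.txt"));

        // Broadcast the contents so executor tasks can use them without the
        // file ever being copied to the workers or uploaded to HDFS.
        Broadcast<List<String>> lookup = jsc.broadcast(lines);

        // Demo task that consults the broadcast copy on the executors.
        long n = jsc.parallelize(lines)
                .filter(line -> lookup.value().contains(line))
                .count();
        System.out.println(n);

        jsc.stop();
    }
}
```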
**06-08-2016 01:33 PM**
Thanks for the suggestion @Jitendra Yadav. But since the file is small (under ~500 KB), I was wondering whether we really need to load it into HDFS. I was looking for some "hack".
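One such workaround, sketched on the assumption that shipping the file at job start is acceptable: `SparkContext.addFile` copies a driver-local file to each executor's working directory, and `SparkFiles.get` resolves its location inside a task. The path and class name are illustrative:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileDemo {
    public static void main(String[] args) {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("add-file-demo"));

        // Ships the driver-local file to every executor; no HDFS upload needed.
        jsc.addFile("/local/path/lookup.txt");

        jsc.parallelize(Arrays.asList(1, 2, 3)).foreach(i -> {
            // Inside a task, ask Spark where it placed the shipped file.
            String path = SparkFiles.get("lookup.txt");
            System.out.println("task " + i + " reads " + path);
        });

        jsc.stop();
    }
}
```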
**06-08-2016 01:26 PM**
Hi,

One of our Spark applications depends on a local file for some of its business logic.

We can read the file by referring to it as file:///, but for this to work a copy of the file needs to be on every worker, or every worker needs access to a common shared drive, such as an NFS mount.

Is there any other way of achieving this?
Labels: Apache Spark
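For concreteness, a sketch of the file:/// pattern described above; the path is illustrative, and the read only succeeds if that exact path is readable on every worker:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalFileRead {
    public static void main(String[] args) {
        JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("local-file-read"));

        // Partitions are read by whichever executor runs each task, so the
        // file must exist at this same path on every worker node (or on a
        // shared mount such as NFS that all of them can see).
        JavaRDD<String> lines = jsc.textFile("file:///etc/app/lookup.txt");
        System.out.println(lines.count());

        jsc.stop();
    }
}
```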
**06-06-2016 02:28 PM**
Thanks @clukasik. Got it!!
**06-06-2016 02:16 PM**
Thanks @clukasik. That solves the problem; I was going in an unwanted circle to address this. Also, on the second part of the question: does it make any sense to parallelize a list before actually storing it to a file, as in the last two lines of my code?
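A sketch of the alternative this question is circling, reusing `fixedFileRdd` and `doWhatever` from the snippet in the 01:21 PM post below (both are the poster's placeholders). It replaces the last lines of that snippet: operate on the values only and save directly from the executors.

```java
// Nothing is collected to the driver, so there is no driver-memory
// bottleneck and no parallelize() round trip before the write.
fixedFileRdd.values()
        .map(bytes -> doWhatever(new String(bytes.copyBytes())))
        .saveAsTextFile("out-dir");
```

Collecting to a List and re-parallelizing only pays off when the aggregated result genuinely has to live on the driver; for a plain transform-and-save it routes everything through driver memory for no benefit.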
**06-06-2016 01:21 PM**
Hi All,

I need a recommendation on the best approach to the problem below; I have included the code snippet I have so far.

I read an HDFS file using a custom input format and in turn get a PairRDD. I am interested in operating on the values one at a time, and I am not bothered about the keys.

Is a Java List a scalable data structure to hold the values? Please have a look at the code below and suggest alternatives. Also, does the parallelize at the end of the code give any benefit?

```java
JavaPairRDD<LongWritable, BytesWritable> fixedFileRdd = getItSomeHow();
List<String> zeroValue = new ArrayList<String>();

Function2<List<String>, Tuple2<LongWritable, BytesWritable>, List<String>> seqOp =
        new Function2<List<String>, Tuple2<LongWritable, BytesWritable>, List<String>>() {
    public List<String> call(List<String> valueList,
                             Tuple2<LongWritable, BytesWritable> eachKeyValue) throws Exception {
        valueList.add(doWhatever(new String(eachKeyValue._2.copyBytes())));
        return valueList;
    }

    private String doWhatever(String string) {
        // will be an external utility method call; this is for representational purposes only
        return System.currentTimeMillis() + "-" + string;
    }
};

Function2<List<String>, List<String>, List<String>> combOp =
        new Function2<List<String>, List<String>, List<String>>() {
    public List<String> call(List<String> listOne, List<String> listTwo) throws Exception {
        listOne.addAll(listTwo);
        return listOne;
    }
};

List<String> resultantList = fixedFileRdd.aggregate(zeroValue, seqOp, combOp);
JavaRDD<String> resultantRdd = jsc.parallelize(resultantList);
resultantRdd.saveAsTextFile("out-dir");
```
Labels: Apache Spark
**06-01-2016 05:19 AM**
Got it. I added an action, first(), to forcefully trigger it. And yes, the reason you mentioned, that Spark evaluates transformations lazily, was what stopped me.
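A minimal illustration of that behavior; the names are made up, and setMaster is only set so the snippet runs standalone. Transformations just record lineage, and nothing executes until an action such as first() forces it:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvalDemo {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]"));

        // map() is a transformation: it is only recorded, not executed.
        JavaRDD<Integer> doubled = jsc.parallelize(Arrays.asList(1, 2, 3))
                .map(x -> x * 2); // nothing has run yet

        // first() is an action: it triggers the actual computation.
        System.out.println(doubled.first()); // prints 2

        jsc.stop();
    }
}
```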