Member since: 03-22-2017
Posts: 26
Kudos Received: 3
Solutions: 0
11-08-2017
09:22 AM
Thanks for your suggestions. I will try to incorporate them and come back to you with more questions!
11-07-2017
12:42 PM
kgautam, Thanks for your reply.
1) Currently I'm not using any combiner. My map-phase output <key,value> pair is <string/text, string/text>. Because the value in the map output is text, I think it will be difficult to write a combiner; usually the combiner performs the same function as the reducer, and I cannot see how to write one for this particular problem.
2) Currently we tried compressing the map output with "-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec". Is this configuration enough to compress the map output? Do we have to modify or add any statements in our MapReduce code to use this compression? (See the configuration sketch below this reply.)
3) May I know where you got the rule of thumb "a reducer should process 1 GB of data"?
4) When I have 24 logical cores on one data node, why did you mention 20 * 7? I think it should be 24 * 7.
5) How do I handle a skewed key? Can I handle it using a partitioner, or is there another way? (See the partitioner sketch below this reply.)
Thanks.
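For point 2, here is the kind of configuration I have in mind; this is only a sketch in Scala against the Hadoop API (not our actual job code), and my assumption is that a separate flag has to turn map-output compression on in addition to the codec property. Both properties could equally be passed as -D options on the command line.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
// Sketch only: turn map-output compression on and pick the codec.
val conf = new Configuration()
conf.setBoolean("mapreduce.map.output.compress", true)
conf.set("mapreduce.map.output.compress.codec",
  "org.apache.hadoop.io.compress.Lz4Codec")
val job = Job.getInstance(conf, "my-job")   // the job name is a placeholder
For point 5, a sketch of the kind of partitioner I was thinking about; "HOT_KEY" is a hypothetical skewed key, and the sketch assumes more than one reducer.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Partitioner
// Sketch only: route one known hot key to a dedicated reducer and hash the rest
// over the remaining reducers, so every key still goes to exactly one reducer.
class SkewAwarePartitioner extends Partitioner[Text, Text] {
  override def getPartition(key: Text, value: Text, numPartitions: Int): Int = {
    if (key.toString == "HOT_KEY") numPartitions - 1
    else (key.hashCode & Int.MaxValue) % (numPartitions - 1)
  }
}
// registered in the driver with job.setPartitionerClass(classOf[SkewAwarePartitioner])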
11-07-2017
11:08 AM
Dear Community, I have a MapReduce job that processes a 1.8 TB data set. My map tasks generate around 2.5 TB of intermediate data, and the number of distinct keys easily crosses a billion. I have set the split size to 128 MB, so the total number of splits generated is approximately 14,000. I have set the number of reducers to 166. My cluster has 8 nodes: 7 are data nodes and 1 is the name node. Each data node has 24 logical cores and 128 GB RAM. When the job runs with this configuration, the map phase completes, but the reduce phase gets stuck at 26%. May I know what split size and how many reducers I should use for this problem with my current cluster size? Please provide suggestions. Thanks.
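As a back-of-the-envelope check on these numbers (approximate figures only, using the ~1 GB-per-reducer rule of thumb mentioned elsewhere in this thread):
// Rough sizing sketch; all figures are approximations from the post above.
val inputBytes        = 1.8e12                                       // ~1.8 TB of input
val splitBytes        = 128.0 * 1024 * 1024                          // 128 MB split size
val numSplits         = math.ceil(inputBytes / splitBytes).toLong    // ~13,400 map tasks
val intermediateBytes = 2.5e12                                       // ~2.5 TB of map output
val bytesPerReducer   = 1e9                                          // rule-of-thumb target per reducer
val numReducers       = math.ceil(intermediateBytes / bytesPerReducer).toLong  // ~2,500 reducers vs. 166 configured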
Labels: Apache Hadoop
08-17-2017
08:46 AM
1 Kudo
Hello Forum, I have read the following statement at http://www.dummies.com/programming/big-data/hadoop/input-splits-in-hadoops-mapreduce/: "In cases where the last record in a block is incomplete, the input split includes location information for the next block and the byte offset of the data needed to complete the record." I would like to know whether this statement is true. Thanks
Labels: Apache Hadoop
08-08-2017
03:20 PM
Hi Ankit, I'm already using Gzip to compress my reduce task output. But if I use gzip compression for the map output, I will not be able to split the map output among reducers; please correct me if I am wrong! That is why I did not use compression for the map output. How do I update the sort algorithm? Do you have any tutorial for doing this? Also, can you explain how to set the io.sort.mb and io.sort.factor parameters?
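For reference, a minimal sketch of setting those two parameters (illustrative values only, not recommendations); in Hadoop 2 they go by the newer property names shown below, and they can also be passed as -D options or set in mapred-site.xml.
import org.apache.hadoop.conf.Configuration
// Sketch only: the values are placeholders.
val conf = new Configuration()
conf.setInt("mapreduce.task.io.sort.mb", 512)      // io.sort.mb: in-memory sort buffer per map task, in MB
conf.setInt("mapreduce.task.io.sort.factor", 50)   // io.sort.factor: number of spill files merged at once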
08-02-2017
10:10 AM
@Bala Vignesh N V The number of splits is already 192. Should I still increase it above 192? My splits are not all the same size, because the length of each line in my data set is not fixed. I used the linespermap property so that every map task gets the same number of lines to process, but since line lengths vary, the split sizes differ across mappers. In this situation, is it good or bad to use all the cores available in the cluster for map tasks? How about using Tachyon for this problem; will I see any performance improvement if I use Tachyon? Thanks in advance for your replies.
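For reference, a rough sketch of wiring up NLineInputFormat in a driver (the lines-per-split value is a placeholder, not what we actually use):
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
// Placeholder value: every map task gets exactly this many input lines, but the
// byte size of each split still varies with line length, as described above.
val job = Job.getInstance()
job.setInputFormatClass(classOf[NLineInputFormat])
NLineInputFormat.setNumLinesPerSplit(job, 1000)    // same effect as mapreduce.input.lineinputformat.linespermap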
08-02-2017
03:44 AM
Hello, We are working on a problem that takes a 10.2 GB data set as input. We have written a MapReduce program that analyzes this 10.2 GB dataset; the map tasks produce 33 GB of intermediate data and the reduce tasks generate 25 GB of output. We use NLineInputFormat as the input format, and we run this MapReduce job on a 24-node Hadoop 2 cluster.
Each node has the following configuration: Intel i7 processor, 8 cores, 8 GB RAM, 360 GB hard disk, 1 Gbps network interface card, and a 1 Gbps switch. We are not using a dedicated switch for the cluster.
Since 24 * 8 = 192 cores are available, we use all 192 cores for our map tasks. That is, we divided the data set into 192 splits, so 192 map tasks are created and all 192 cores are used. We have set the number of reduce tasks to 170. Apart from setting the number of map and reduce tasks, we have not touched any Hadoop parameters. Currently the job takes 9 minutes 30 seconds to run.
We would like to know whether we are missing any Hadoop parameter settings that could improve the job's running time. It would be really helpful if you could help us improve the performance of this job. Thanks in advance.
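As a rough sanity check on the reducer count (the 0.95x and 1.75x multipliers are a commonly quoted starting point for sizing reducers against cluster capacity, not something specific to our job):
// Figures from the post above; 0.95x targets one wave of reducers,
// 1.75x trades a second wave for better load balancing.
val nodes         = 24
val coresPerNode  = 8
val parallelSlots = nodes * coresPerNode           // 192 tasks can run at once
val oneWave       = (0.95 * parallelSlots).toInt   // ~182 reducers
val twoWaves      = (1.75 * parallelSlots).toInt   // ~336 reducers (we currently use 170)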
Labels: Apache Hadoop
07-31-2017
01:44 PM
Hello All, I would like to know where a map task's intermediate data is written: does context.write() write the data to the hard disk or to the network immediately after it is generated? Which Hadoop parameters should be tuned when a map() task generates over 45 GB of intermediate data and a huge amount of data (for example, above 50 GB) has to be shuffled over the network in a multi-node cluster? Will I see any performance improvement if I increase the io.sort.mb parameter when the map() task generates a huge amount of data? Thanks in advance.
Labels: Apache Hadoop
04-27-2017
01:50 PM
Thanks for your reply. My input data set is 10.2 GB, but after the map function the amount of intermediate data generated is around 40 GB. The number of unique keys is around 4,55,55,500. I think Spark does not keep intermediate data this large in memory but spills it to the hard disk. What is the best way to tune Spark when we have to reduce a massive number of unique keys?
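For context, a minimal sketch of two common knobs for this situation: giving reduceByKey an explicit (larger) partition count and letting the shuffled result spill to disk. The RDD name and the partition count are placeholders, not from my actual job.
import org.apache.spark.storage.StorageLevel
// Sketch only: "kmers" stands for the (kmer, 1) pair RDD from the job below,
// and 400 partitions is an arbitrary example; more partitions mean smaller
// per-task aggregation state.
val counts = kmers
  .reduceByKey(_ + _, 400)                 // explicit shuffle partition count
  .persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that do not fit in RAM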
04-14-2017
10:26 AM
Hello Members, I have executed the following program (a k-mer counter) on a 2-node Spark cluster with a 10.2 GB data set. It takes around 35 minutes to run. In another experiment, I executed a Java program for the same k-mer counter problem on a 2-node Hadoop 2 cluster, and it also took around 35 minutes. Spark is generally expected to outperform Hadoop, but in my case the running time on both clusters is the same. I would like to know whether I am using all the available resources of the 2-node Spark cluster efficiently, and whether my Scala program for the k-mer counter can be improved over what I have now.
My hardware is as follows. CPU: Intel i7 with 8 cores on both machines in the cluster. RAM: 8 GB on both machines. OS: Ubuntu 16.04. Hard disk: 1 TB on the master node and 500 GB on the slave node. The nodes are connected through a switch. IP address of the master: 172.30.16.233; IP address of the slave: 172.30.17.15.
In case you need some explanation of the k-mer counter problem: if the given string is ATCGATGATT and the k-mer size is 5, then the following k-mers are generated:
ATCGA 1
TCGAT 1
CGATG 1
GATGA 1
ATGAT 1
TGATT 1
My execution command is as follows:
spark-submit --class Kmer1 --master spark://saravanan:7077 --executor-memory 5g /home/hduser/sparkapp/target/scala-2.11/sparkapp_2.11-0.1.jar hdfs://172.30.16.233:54310//input hdfs://172.30.16.233:54310//output
Sample input:
@SRR292770.1 FCB067LABXX:4:1101:1155:2103/1
GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA
+
FFFFCFGDCGGGFCGBGFFFAEGFG;B7A@GEFBFGGFFGFGEFCFFFB
My code filters out the 1st, 3rd, and 4th lines of each record. The number of executors created is 2 and the number of partitions created is 78.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object Kmer1 {
//
def main(args: Array[String]): Unit = {
//
val sparkConf = new SparkConf().setAppName("Kmer1")
val sc = new SparkContext(sparkConf)
val input = args(0)
val K = 25
val broadcastK = sc.broadcast(K)
val records = sc.textFile(input)
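// drop every 4th line of each 4-line FASTQ record (the quality line)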
val only_three = records.zipWithIndex.filter { case (_, i) => (i + 1) % 4 != 0 }.map { case (e, _) => e }
// remove records that are not actual sequence data
val filteredRDD = only_three.filter(line => {
!(
line.startsWith("@") ||
line.startsWith("+") ||
line.startsWith(";") ||
line.startsWith("!") ||
line.startsWith("~") ||
line.startsWith(">")
)
})
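// slide a window of length K over each read to emit (kmer, 1) pairs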
val kmers = filteredRDD.flatMap(_.sliding(broadcastK.value, 1).map((_, 1)))
// find frequencies of kmers
val kmersGrouped = kmers.reduceByKey(_ + _)
kmersGrouped.saveAsTextFile(args(1))
// done!
sc.stop()
}
}
Labels: Apache Spark