Member since: 03-22-2017 · 26 Posts · 3 Kudos Received · 0 Solutions
11-08-2017
09:22 AM
Thanks for your suggestions! I will try to incorporate them and come back to you with more questions.
08-17-2017
02:15 PM
Yes @Saravanan Selvam. If a record is large and cannot fit entirely within one split, the remainder of that record spills over into the next split. It also depends on the compression codec in use: within Hadoop there are multiple ways of compressing a file, such as record-compressed and block-compressed SequenceFiles. In either case a sync marker is available to identify where a record begins and ends, and the splits themselves are computed on the client side by InputFormat.getSplits. I came across a brief and clear explanation of the same kind of question; please do check it: https://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries Hope it helps!
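For reference, here is a minimal sketch (against the standard Hadoop mapreduce API) of the check an input format typically performs to decide whether a compressed file may be split at all; only codecs implementing SplittableCompressionCodec (such as bzip2) allow a split to start mid-file, which is why the sync markers matter. The class name here is just for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {

    // Returns true when the file can be divided into multiple input splits.
    public static boolean isSplittable(Configuration conf, Path file) {
        // Resolve the codec from the file extension (.gz, .bz2, ...).
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null) {
            return true; // uncompressed data is always splittable
        }
        // Only splittable codecs let a reader start in the middle of the file
        // and seek forward to the next record/sync boundary.
        return codec instanceof SplittableCompressionCodec;
    }
}
```

This mirrors the check TextInputFormat performs internally before it will generate more than one split for a file.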
08-02-2017
10:10 AM
@Bala Vignesh N V The number of splits is already 192. Should I still increase it above 192? My splits are not of equal size, because the length of each line in my data set is not fixed. I used the linespermap property so that every map gets the same number of lines to process, but since line lengths vary, the split sizes differ between mappers. Is it good or bad to utilize all the cores available in the cluster for map tasks in this situation? How about using Tachyon for this problem: will I see any performance improvement if I use it? Thanks in advance for your replies.
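For context, a minimal sketch of how that per-line splitting is typically wired up with NLineInputFormat (the class behind the linespermap property, mapreduce.input.lineinputformat.linespermap); the class and method names are from the standard Hadoop mapreduce API, while the job name and value are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineJobSetup {

    public static Job configure(Configuration conf, int linesPerMap) throws Exception {
        Job job = Job.getInstance(conf, "nline-example"); // placeholder job name
        job.setInputFormatClass(NLineInputFormat.class);
        // Every split receives exactly linesPerMap lines, but the split sizes
        // in bytes still differ when line lengths vary.
        NLineInputFormat.setNumLinesPerSplit(job, linesPerMap);
        return job;
    }
}
```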
08-08-2017
03:20 PM
Hi Ankit, I'm already using Gzip to compress the output of my reduce tasks. But if I use gzip compression for the map output, will I be unable to split the map output among reducers? Correct me if I am wrong! That is why I did not use compression for the map output. How do I update the sort algorithm, and do you have any tutorial for doing this? Also, can you explain how to set the io.sort.mb and io.sort.factor parameters?
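For reference, a minimal sketch of how these knobs can be set through the Hadoop Configuration API. The property names (the newer equivalents of io.sort.mb and io.sort.factor) are standard; the values and the choice of Snappy are purely illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {

    public static Configuration tune(Configuration conf) {
        // Compress intermediate map output (independent of the final job output codec).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");

        // In-memory sort buffer per map task, in MB (newer name for io.sort.mb).
        conf.setInt("mapreduce.task.io.sort.mb", 512);
        // Number of streams merged at once during sort/merge (newer name for io.sort.factor).
        conf.setInt("mapreduce.task.io.sort.factor", 50);
        return conf;
    }
}
```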
04-27-2017
01:50 PM
Thanks for your reply. My input data set is 10.2 GB, but after the map function in the program the amount of intermediate data generated is around 40 GB. The number of unique keys is around 4,55,55500. I think Spark does not keep intermediate data this large in memory but spills it to disk. What is the best way to tune Spark when we have to reduce a massive number of unique keys?
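For reference, a minimal sketch of the configuration knobs usually looked at first when the shuffled data is much larger than the input; the values are only illustrative starting points, not tuned recommendations (another common first step is to prefer reduceByKey/aggregateByKey over groupByKey so the map side pre-aggregates):

```java
import org.apache.spark.SparkConf;

public class ShuffleHeavyJobConf {

    public static SparkConf build() {
        return new SparkConf()
            .setAppName("shuffle-heavy-job") // placeholder app name
            // Keep shuffle output and spills compressed (these are the defaults).
            .set("spark.shuffle.compress", "true")
            .set("spark.shuffle.spill.compress", "true")
            // Partition count controls how much shuffle data each task handles;
            // ~40 GB over 400 partitions gives roughly 100 MB per task.
            .set("spark.default.parallelism", "400")
            // Kryo typically shrinks shuffle records versus Java serialization.
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    }
}
```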
04-05-2017
04:05 AM
1 Kudo
The post below also discusses a similar question: https://community.hortonworks.com/questions/92707/what-are-spark-executors-executor-instances-execut.html#comment-92924
04-04-2017
06:53 PM
7 Kudos
@Saravanan Selvam, In YARN mode you can control the total number of executors for an application with the --num-executors option. However, if you do not explicitly specify --num-executors for a Spark application in YARN mode, it will typically start one executor on each NodeManager. Spark also has a feature called dynamic resource allocation, which lets a Spark application dynamically scale the set of cluster resources allocated to it up and down based on the workload. This way you can make sure the application is not over-utilizing resources. http://spark.apache.org/docs/1.2.0/job-scheduling.html#dynamic-resource-allocation
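As a concrete illustration, here is a minimal sketch of both approaches via SparkConf (the --num-executors flag maps to spark.executor.instances); the specific numbers are placeholders:

```java
import org.apache.spark.SparkConf;

public class YarnExecutorConf {

    // Fixed sizing: the programmatic equivalent of
    // --num-executors 4 --executor-cores 2 --executor-memory 4g
    public static SparkConf fixedAllocation() {
        return new SparkConf()
            .set("spark.executor.instances", "4")
            .set("spark.executor.cores", "2")
            .set("spark.executor.memory", "4g");
    }

    // Dynamic allocation on YARN; requires the external shuffle service.
    public static SparkConf dynamicAllocation() {
        return new SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "1")
            .set("spark.dynamicAllocation.maxExecutors", "10");
    }
}
```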
07-26-2019
04:15 AM
Hello, Below are the default configuration values that a Spark job will use if they are not overridden at submit time:

# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1)
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)

SPARK_EXECUTOR_INSTANCES -> the number of workers to start; for a job, this is the maximum number of executors it can take from the cluster resource manager.
SPARK_EXECUTOR_CORES -> the number of cores in each executor; the Spark TaskScheduler will ask for this many cores to be allocated/blocked on each executor machine.
SPARK_EXECUTOR_MEMORY -> the maximum amount of RAM/memory required in each executor.

All of these are requested by the TaskScheduler from the cluster manager (Spark standalone, YARN, Mesos, or Kubernetes from Spark 2.3 onward) before the job execution actually starts.

Also, please note that the initial number of executor instances depends on "--num-executors", but when there is more data to be processed and "spark.dynamicAllocation.enabled" is set to true, Spark will dynamically add more executors based on "spark.dynamicAllocation.initialExecutors". Note: for that setting to take effect as the starting point, "spark.dynamicAllocation.initialExecutors" should be configured greater than "--num-executors".

From the Spark configuration reference:
spark.dynamicAllocation.initialExecutors (default: spark.dynamicAllocation.minExecutors) -- Initial number of executors to run if dynamic allocation is enabled. If --num-executors (or spark.executor.instances) is set and larger than this value, it will be used as the initial number of executors.
spark.executor.memory (default: 1g) -- Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t"), e.g. 512m, 2g.
spark.executor.cores (default: 1 in YARN mode; all available cores on the worker in standalone and Mesos coarse-grained modes) -- The number of cores to use on each executor; see the Spark documentation for more detail on the standalone and Mesos coarse-grained modes.
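If it helps, here is a small sketch for checking which values a job actually ended up with at runtime; the fallbacks passed as defaults are the documented ones quoted above, and the class name is just for illustration:

```java
import org.apache.spark.SparkConf;

public class EffectiveExecutorSettings {

    // Print the effective executor settings, falling back to the documented defaults.
    public static void print(SparkConf conf) {
        System.out.println("executor instances = " + conf.get("spark.executor.instances", "2"));
        System.out.println("executor cores     = " + conf.get("spark.executor.cores", "1"));
        System.out.println("executor memory    = " + conf.get("spark.executor.memory", "1g"));
        System.out.println("dynamic allocation = " + conf.get("spark.dynamicAllocation.enabled", "false"));
    }
}
```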
04-03-2017
05:57 AM
1 Kudo
@Saravanan Selvam Maybe you are running into the following issues: https://issues.apache.org/jira/browse/YARN-3432 https://issues.apache.org/jira/browse/YARN-3243 See this HCC post: https://community.hortonworks.com/articles/74210/resource-manager-ui-shows-used-memory-more-than-to.html
03-30-2017
12:04 PM
@saravanan Selvam If your cluster is running in pseudo-distributed mode (a single-node cluster), then there is no need to start Hadoop. But if the cluster has multiple nodes, then you do need to start Hadoop; only then will you be able to access data through Pig. So this also depends on the cluster setup. In general, local mode does not need Hadoop, whereas in MapReduce mode you should start the Hadoop services to run Pig scripts.
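For illustration, a minimal sketch of the same distinction using Pig's embedded Java API (the command-line equivalents are pig -x local and pig -x mapreduce); the file paths and aliases below are just placeholders:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigModeExample {

    public static void main(String[] args) throws Exception {
        // Local mode: reads from the local filesystem; no Hadoop daemons required.
        PigServer local = new PigServer(ExecType.LOCAL);
        local.registerQuery("a = LOAD 'input.txt' AS (line:chararray);"); // placeholder path
        local.store("a", "out-local");

        // MapReduce mode: submits MapReduce jobs, so HDFS and YARN must be running.
        PigServer mr = new PigServer(ExecType.MAPREDUCE);
        mr.registerQuery("b = LOAD '/user/data/input.txt' AS (line:chararray);"); // placeholder path
        mr.store("b", "/user/data/out-mr");
    }
}
```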