Member since: 01-28-2016
Posts: 10
Kudos Received: 0
Solutions: 0
08-25-2016
05:37 AM
Hi, I am using Spark SQL to join Hive tables. The Hive data is highly skewed, so the joins take a very long time. There are approximately 63 million records, and I can see that a few executors are heavily loaded while the others are nearly idle. I want to apply range partitioning over the DataFrame to divide the load evenly. Any pointers? Thank you.
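Spark 1.6 DataFrames (the version shipped with HDP 2.4) have no repartitionByRange, which only arrived in Spark 2.3, so a common alternative for a skewed join key is salting. Below is a minimal sketch under stated assumptions: the join key is called id, hc is the HiveContext, and the salt factor of 16 is illustrative.

import org.apache.spark.sql.functions._

val SALT = 16

// Large, skewed side: tag every row with a random salt in [0, SALT).
val left = hc.table("table1")
  .withColumn("salt", (rand() * SALT).cast("int"))

// Smaller side: replicate each row once per salt value so every
// (id, salt) bucket on the left can still find its match.
val salts = hc.range(0, SALT).withColumnRenamed("id", "salt")
val right = hc.table("table2").join(salts) // cartesian replication

// Joining on (id, salt) spreads each hot id over SALT partitions.
val joined = left.join(right, Seq("id", "salt"), "left_outer")
  .drop("salt")

The trade-off is that the smaller side is duplicated SALT times, so keep the factor modest.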
08-16-2016
12:12 PM
Hi, I have one query. I have 2 tables which I want to join, inserting the result into a third table. The only query in my Scala code is below:

hc.sql("insert overwrite table table3 select a.id, a.col1, b.col2, b.col3 from table1 a left join table2 b on a.id = b.id")

Below are some stats for the two tables:

hive> select year, count(*) from table1 group by year;

  year        count
  2009       110228
  2010        14638
  2011        23854
  2012        23647
  2013       312103
  2014     18132609
  2015     44628890

hive> select year, count(*) from table2 group by year;

  year        count
  2011           22
  2012           29
  2013       180718
  2014      8315513
  2015     38881691

Note: the tables are not partitioned.

The query is executed via spark-submit (shown below), resulting in a highly skewed execution: the load is uneven across executors, two executors carry 75% of it, and the overall run takes more than 2 hours. The same query in Hive finishes in 15 minutes.

spark-submit --class myDriver --master yarn --deploy-mode cluster \
  --driver-memory 15g --num-executors 25 --executor-cores 5 \
  --executor-memory 15g --driver-cores 3 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:PermSize=1g -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms10g -Xmx10g -XX:NewRatio=2 -XX:InitiatingHeapOccupancyPercent=30 -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5" \
  --conf spark.yarn.executor.memoryOverhead=2000 \
  --conf spark.sql.shuffle.partitions=50 \
  --conf spark.storage.memoryFraction=0.20 \
  --conf spark.shuffle.memoryFraction=0.60 \
  --conf spark.shuffle.manager=tungsten-sort \
  my.jar

I have 15 active nodes, with 6 cores per node for Spark and ~49 GB RAM per node. Any tips to reduce the skew?

Regards,
Praveen Khare
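One semantically safe thing to check before restructuring the join: in a left join, every row whose join key is NULL hashes to the same shuffle partition, which produces exactly this "two hot executors" pattern. A minimal sketch of routing NULL ids around the shuffle, assuming id can be NULL and col2/col3 are strings (both assumptions; table and column names are from the query above):

import org.apache.spark.sql.functions._

val left  = hc.table("table1")
val right = hc.table("table2").select("id", "col2", "col3")

// NULL ids can never match, so only shuffle the non-NULL rows.
val matched = left.filter(col("id").isNotNull)
  .join(right, Seq("id"), "left_outer")

// Rows with a NULL id keep their left-join semantics: the columns
// from table2 stay NULL (the cast assumes col2/col3 are strings).
val unmatched = left.filter(col("id").isNull)
  .withColumn("col2", lit(null).cast("string"))
  .withColumn("col3", lit(null).cast("string"))

matched.select("id", "col1", "col2", "col3")
  .unionAll(unmatched.select("id", "col1", "col2", "col3"))
  .write.mode("overwrite").insertInto("table3")

If the hot keys turn out to be real (non-NULL) values, the salting approach sketched above applies instead.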
07-19-2016
10:00 AM
The reason I asked this question is that I am running my job in client mode, and I am not sure whether the settings below take effect in client mode:

ContextService.getHiveContext.sql("SET spark.yarn.executor.memoryOverhead=3000");
ContextService.getHiveContext.sql("SET spark.yarn.am.memoryOverhead=3000");

spark.yarn.executor.memoryOverhead works in cluster mode, and spark.yarn.am.memoryOverhead is the same as spark.yarn.driver.memoryOverhead, but applies to the YARN Application Master in client mode.
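Note that spark.yarn.* properties are consumed when the application and its YARN containers are launched, so a SET issued through the HiveContext after startup comes too late for them in either deploy mode. A minimal sketch of passing them at submission instead (only the relevant flags shown; the rest of the command is elided):

spark-submit --master yarn-client \
  --conf spark.yarn.am.memoryOverhead=3000 \
  --conf spark.yarn.executor.memoryOverhead=3000 \
  ...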
07-19-2016
09:48 AM
Hi Puneet, as per the suggestion I tried with --driver-memory 4g --num-executors 15 --total-executor-cores 30 --executor-memory 10g --driver-cores 2, and it failed with: Exception in thread "dispatcher-event-loop-3" java.lang.OutOfMemoryError: Java heap space. What I suspect is that the partitioning is pushing a huge amount of data onto one or more executors, and that is where it fails. I also looked at the Spark job environment and found

spark.yarn.driver.memoryOverhead = 384
spark.yarn.executor.memoryOverhead = 384

which is very low. I referred to the documentation, and it says spark.yarn.executor.memoryOverhead defaults to executorMemory * 0.10, with a minimum of 384. How can we set it to 1 GB or more?
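Both properties take a size in MB and must be supplied at submit time rather than from inside the job; a minimal sketch with illustrative 1 GB values (the rest of the command is elided):

spark-submit ... \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  ...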
07-18-2016
01:07 PM
Before your suggestion, I had started a run with the same configuration, and I got the issues below in my logs:

16/07/18 09:24:52 INFO RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over . Trying to fail over immediately. java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : : : Already tried 8 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby

It seems there is some issue in HDFS/NameNode or the cluster itself.
07-18-2016
09:03 AM
Okay, I will try these options and update. Thank you.
07-18-2016
05:37 AM
Thanks Puneet for the reply. Here are my command and other information:

spark-submit --master yarn-client --driver-memory 15g --num-executors 25 \
  --total-executor-cores 60 --executor-memory 15g --driver-cores 2 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms10g -Xmx10g -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=20" \
  --class logicdriver logic.jar

Configuration:

ContextService.getHiveContext.sql("SET hive.execution.engine=tez");
ContextService.getHiveContext.sql("SET hive.optimize.tez=true");
ContextService.getHiveContext.sql("SET hive.vectorized.execution.enabled=true");
ContextService.getHiveContext.sql("SET hive.vectorized.execution.reduce.enabled=true");
ContextService.getHiveContext.sql("SET spark.sql.shuffle.partitions=2050");
ContextService.getHiveContext.sql("SET spark.sql.hive.metastore.version=0.14.0.2.2.4.10-1");
ContextService.getHiveContext.sql("SET hive.warehouse.data.skipTrash=true");
ContextService.getHiveContext.sql("SET hive.exec.dynamic.partition=true");
ContextService.getHiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict");
ContextService.getHiveContext.sql("SET spark.driver.maxResultSize=8192");
ContextService.getHiveContext.sql("SET spark.default.parallelism=350");
ContextService.getHiveContext.sql("SET spark.yarn.executor.memoryOverhead=1024");

Data: the job reads data from 2 tables, performs a join, and puts the result in a DataFrame; it then reads new tables and joins them against the previous DataFrame. This cycle repeats 7-8 times, and finally it inserts the result into Hive. The first table has 63,245,969 records and the second has 49,275,922 records; all the tables have record counts in this range.
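For concreteness, a minimal sketch of the cycle described above, assuming a shared join key id and hypothetical table names (the real pipeline, keys, and join types are not shown in the post):

import org.apache.spark.sql.DataFrame

val hc = ContextService.getHiveContext

// Start from the first two tables, then fold each subsequent table
// into the accumulated result, 7-8 rounds in total.
var acc: DataFrame = hc.table("table1")
  .join(hc.table("table2"), Seq("id"), "left_outer")

for (t <- Seq("table3", "table4", "table5")) { // ...through round 7-8
  acc = acc.join(hc.table(t), Seq("id"), "left_outer")
}

// Final insert into the Hive target table (name hypothetical).
acc.write.mode("overwrite").insertInto("final_table")

Because nothing is materialized between rounds, the final insert executes all 7-8 joins as one job, which is why the executor sizing and memory overhead above matter so much.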
07-17-2016
02:07 PM
Hi, I am working on HDP 2.4.2 (Hadoop 2.7, Hive 1.2.1, JDK 1.8, Scala 2.10.5). My Spark/Scala job reads Hive tables (using Spark SQL) into DataFrames, performs a few left joins, and inserts the final results into a partitioned Hive table. The source tables have approximately 50 million records each. Spark creates 74 stages for this job. It executes 72 stages successfully, but hangs at the 499th task of stage 73 and never reaches the final stage, 74. I can see many messages on the console like "INFO: BlockManagerInfo: Removed broadcast in memory", but it shows no error or exception; even after an hour it doesn't come out, and the only option is to kill the job. I have 15 nodes in total, each with 40 GB RAM and 6 cores. I am using spark-submit in YARN client mode. Scheduling is configured as FIFO and my job consumes 79% of the resources. Can anybody advise on this? What could be the issue?

Regards,
Praveen Khare
01-28-2016
01:19 PM
Hi, I am very new to Hortonworks. I have recently upgraded my laptop to Windows 10. Can I install HDP 2.3 on Windows 10?