Created 08-12-2016 09:39 PM
I am evaluating the new LLAP feature of Hive. I provisioned a new cluster in AWS using Cloudbreak with the HDP 2.5 tech preview and the EDW-ANALYTICS: APACHE HIVE 2 LLAP, APACHE ZEPPELIN configuration.
I logged into the master node and switched to the hdfs user:
sudo -u hdfs -s
wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
unzip hive14.zip
cd hive-testbench-hive14
./tpcds-build.sh
The build succeeds but when I try to generate data, I get an error loading the text into external tables:
[hdfs@ip-10-0-3-85 hive-testbench-hive14]$ ./tpcds-setup.sh 10
ls: `/tmp/tpcds-generate/10': No such file or directory
Generating data at scale factor 10.
16/08/12 21:09:42 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-3-85.us-west-2.compute.internal:8188/ws/v1/timeline/
16/08/12 21:09:42 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-3-85.us-west-2.compute.internal/10.0.3.85:8050
16/08/12 21:09:42 INFO input.FileInputFormat: Total input paths to process : 1
16/08/12 21:09:43 INFO mapreduce.JobSubmitter: number of splits:10
16/08/12 21:09:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1471027682172_0026
16/08/12 21:09:43 INFO impl.YarnClientImpl: Submitted application application_1471027682172_0026
16/08/12 21:09:43 INFO mapreduce.Job: The url to track the job: http://ip-10-0-3-85.us-west-2.compute.internal:8088/proxy/application_1471027682172_0026/
16/08/12 21:09:43 INFO mapreduce.Job: Running job: job_1471027682172_0026
16/08/12 21:09:49 INFO mapreduce.Job: Job job_1471027682172_0026 running in uber mode : false
16/08/12 21:09:49 INFO mapreduce.Job:  map 0% reduce 0%
16/08/12 21:10:01 INFO mapreduce.Job:  map 10% reduce 0%
16/08/12 21:10:02 INFO mapreduce.Job:  map 30% reduce 0%
16/08/12 21:10:03 INFO mapreduce.Job:  map 40% reduce 0%
16/08/12 21:10:04 INFO mapreduce.Job:  map 50% reduce 0%
16/08/12 21:13:20 INFO mapreduce.Job:  map 60% reduce 0%
16/08/12 21:13:23 INFO mapreduce.Job:  map 70% reduce 0%
16/08/12 21:14:23 INFO mapreduce.Job:  map 80% reduce 0%
16/08/12 21:14:27 INFO mapreduce.Job:  map 90% reduce 0%
16/08/12 21:14:40 INFO mapreduce.Job:  map 100% reduce 0%
16/08/12 21:24:06 INFO mapreduce.Job: Job job_1471027682172_0026 completed successfully
16/08/12 21:24:06 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=1441630
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=4699
		HDFS: Number of bytes written=3718681220
		HDFS: Number of read operations=50
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=89
	Job Counters
		Launched map tasks=10
		Other local map tasks=10
		Total time spent by all maps in occupied slots (ms)=2721848
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=2721848
		Total vcore-milliseconds taken by all map tasks=2721848
		Total megabyte-milliseconds taken by all map tasks=4180758528
	Map-Reduce Framework
		Map input records=10
		Map output records=0
		Input split bytes=1380
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=12044
		CPU time spent (ms)=1468280
		Physical memory (bytes) snapshot=2743345152
		Virtual memory (bytes) snapshot=21381529600
		Total committed heap usage (bytes)=2911895552
	File Input Format Counters
		Bytes Read=3319
	File Output Format Counters
		Bytes Written=0
TPC-DS text data generation complete.
Loading text data into external tables.
make: *** [date_dim] Error 1
make: *** Waiting for unfinished jobs....
make: *** [time_dim] Error 1
Data loaded into database tpcds_bin_partitioned_orc_10.
Created 08-13-2016 12:47 AM
I figured out the answer by looking at tpcds-setup.sh. I saw that it checks the DEBUG_SCRIPT environment variable, so I set it to X to get debug output:
export DEBUG_SCRIPT=X
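For context, the reason this works: tpcds-setup.sh guards shell tracing on that variable. Paraphrased below (the exact script text may differ slightly); the echo stands in for the tracing side effect:

```shell
# Paraphrased from tpcds-setup.sh (exact wording may differ): any
# non-empty DEBUG_SCRIPT value turns on `set -x` command tracing,
# which is what surfaces the underlying Hive/Tez error messages.
DEBUG_SCRIPT=X   # simulates `export DEBUG_SCRIPT=X`
if [ "X$DEBUG_SCRIPT" != "X" ]; then
  echo "debug tracing enabled"   # the real script runs `set -x` here
fi
```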
When I ran the script again, I saw the following error:
Dag submit failed due to Invalid TaskLaunchCmdOpts defined for Vertex Map 1 : Invalid/conflicting GC options found,
cmdOpts="-server -Djava.net.preferIPv4Stack=true -Dhdp.version=2.5.0.0-1061 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA "
stack trace: [
  org.apache.tez.dag.api.DAG.createDag(DAG.java:866),
  org.apache.tez.client.TezClientUtils.prepareAndCreateDAGPlan(TezClientUtils.java:694),
  org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:520),
  org.apache.tez.client.TezClient.submitDAG(TezClient.java:466),
  org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:439),
  org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180),
  org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160),
  org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89),
  org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:75)
] retrying...
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
make: *** [date_dim] Error 1
make: *** Waiting for unfinished jobs....
The identical "Dag submit failed ... Invalid/conflicting GC options found" error (same cmdOpts, same stack trace) was then reported for the second table:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
make: *** [time_dim] Error 1
+ echo 'Data loaded into database tpcds_bin_partitioned_orc_10.'
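The key detail is in the cmdOpts string itself: it requests both -XX:+UseParallelGC (from the testbench settings) and -XX:+UseG1GC (appended from the cluster-wide Tez opts), and a JVM accepts only one garbage collector. A quick sketch of spotting the conflict (illustration only, not part of the testbench):

```shell
# Excerpt of the cmdOpts from the error above: two collectors are
# requested, which Tez rejects because the JVM allows only one.
opts='-XX:+UseNUMA -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseG1GC'
if echo "$opts" | grep -q 'UseParallelGC' && echo "$opts" | grep -q 'UseG1GC'; then
  echo "conflicting GC options: UseParallelGC vs UseG1GC"
fi
```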
This led me to the solution:
https://community.hortonworks.com/questions/23988/not-able-to-run-hive-benchmark-test.html
I used the first suggested solution below, reran the script, and it is now working:
1. Change hive.tez.java.opts in hive-testbench/settings/load-partitioned.sql to use UseParallelGC (recommended):
set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;
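One way to apply that edit is a single sed substitution over the settings file; the sketch below demonstrates it on a temp copy so nothing real is overwritten (on the actual cluster the target would be settings/load-partitioned.sql):

```shell
# Sketch: swap the conflicting UseG1GC flag for UseParallelGC in the
# hive.tez.java.opts line. Demonstrated on a temp file for safety.
f=$(mktemp)
echo 'set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+UseG1GC;' > "$f"
sed -i 's/-XX:+UseG1GC/-XX:+UseParallelGC/' "$f"
cat "$f"
rm -f "$f"
```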
Created 10-04-2016 04:18 AM
That article is not available any longer; not sure why. Which GC options were invalid or conflicting?
Created 04-20-2020 11:25 PM
How do you debug scripts? I used bash -x tpcds-setup.sh but could not find the error, and I used your method but it also reports errors.
Created 05-10-2018 04:27 PM
In my case, "export DEBUG_SCRIPT=X" showed that I had permission issues: the hive user didn't have write permissions to the /tmp/hive folder on HDFS. Fixing that fixed this issue.
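For anyone hitting this permission variant, a sketch of checking and opening up the HDFS scratch directory (requires a running cluster; the path and the 1777 sticky-bit mode are assumptions, adjust to your cluster's policy):

```shell
# Sketch (needs a live HDFS cluster; path and mode are assumptions):
# inspect current permissions on /tmp, then grant world write with the
# sticky bit on /tmp/hive so the hive user can create scratch dirs.
sudo -u hdfs hdfs dfs -ls /tmp
sudo -u hdfs hdfs dfs -chmod -R 1777 /tmp/hive
```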