
hive testbench error when generating data

I am evaluating the new LLAP feature of Hive. I provisioned a new cluster in AWS using Cloudbreak with the HDP 2.5 tech preview version and the EDW-ANALYTICS: APACHE HIVE 2 LLAP, APACHE ZEPPELIN configuration.

I logged into the master node and switched to the hdfs user:

sudo -u hdfs -s

wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip

unzip hive14.zip

cd hive-testbench-hive14

./tpcds-build.sh

The build succeeds but when I try to generate data, I get an error loading the text into external tables:

[hdfs@ip-10-0-3-85 hive-testbench-hive14]$ ./tpcds-setup.sh 10

ls: `/tmp/tpcds-generate/10': No such file or directory

Generating data at scale factor 10.

16/08/12 21:09:42 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-0-3-85.us-west-2.compute.internal:8188/ws/v1/timeline/

16/08/12 21:09:42 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-3-85.us-west-2.compute.internal/10.0.3.85:8050

16/08/12 21:09:42 INFO input.FileInputFormat: Total input paths to process : 1

16/08/12 21:09:43 INFO mapreduce.JobSubmitter: number of splits:10

16/08/12 21:09:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1471027682172_0026

16/08/12 21:09:43 INFO impl.YarnClientImpl: Submitted application application_1471027682172_0026

16/08/12 21:09:43 INFO mapreduce.Job: The url to track the job: http://ip-10-0-3-85.us-west-2.compute.internal:8088/proxy/application_1471027682172_0026/

16/08/12 21:09:43 INFO mapreduce.Job: Running job: job_1471027682172_0026

16/08/12 21:09:49 INFO mapreduce.Job: Job job_1471027682172_0026 running in uber mode : false

16/08/12 21:09:49 INFO mapreduce.Job:  map 0% reduce 0%

16/08/12 21:10:01 INFO mapreduce.Job:  map 10% reduce 0%

16/08/12 21:10:02 INFO mapreduce.Job:  map 30% reduce 0%

16/08/12 21:10:03 INFO mapreduce.Job:  map 40% reduce 0%

16/08/12 21:10:04 INFO mapreduce.Job:  map 50% reduce 0%

16/08/12 21:13:20 INFO mapreduce.Job:  map 60% reduce 0%

16/08/12 21:13:23 INFO mapreduce.Job:  map 70% reduce 0%

16/08/12 21:14:23 INFO mapreduce.Job:  map 80% reduce 0%

16/08/12 21:14:27 INFO mapreduce.Job:  map 90% reduce 0%

16/08/12 21:14:40 INFO mapreduce.Job:  map 100% reduce 0%

16/08/12 21:24:06 INFO mapreduce.Job: Job job_1471027682172_0026 completed successfully

16/08/12 21:24:06 INFO mapreduce.Job: Counters: 30

        File System Counters

                FILE: Number of bytes read=0

                FILE: Number of bytes written=1441630

                FILE: Number of read operations=0

                FILE: Number of large read operations=0

                FILE: Number of write operations=0

                HDFS: Number of bytes read=4699

                HDFS: Number of bytes written=3718681220

                HDFS: Number of read operations=50

                HDFS: Number of large read operations=0

                HDFS: Number of write operations=89

        Job Counters

                Launched map tasks=10

                Other local map tasks=10

                Total time spent by all maps in occupied slots (ms)=2721848

                Total time spent by all reduces in occupied slots (ms)=0

                Total time spent by all map tasks (ms)=2721848

                Total vcore-milliseconds taken by all map tasks=2721848

                Total megabyte-milliseconds taken by all map tasks=4180758528

        Map-Reduce Framework

                Map input records=10

                Map output records=0

                Input split bytes=1380

                Spilled Records=0

                Failed Shuffles=0

                Merged Map outputs=0

                GC time elapsed (ms)=12044

                CPU time spent (ms)=1468280

                Physical memory (bytes) snapshot=2743345152

                Virtual memory (bytes) snapshot=21381529600

                Total committed heap usage (bytes)=2911895552

        File Input Format Counters

                Bytes Read=3319

        File Output Format Counters

                Bytes Written=0

TPC-DS text data generation complete.

Loading text data into external tables.

make: *** [date_dim] Error 1

make: *** Waiting for unfinished jobs....

make: *** [time_dim] Error 1

Data loaded into database tpcds_bin_partitioned_orc_10.
1 ACCEPTED SOLUTION

I figured out the answer by looking at tpcds-setup.sh. It checks the DEBUG_SCRIPT environment variable, so I set it to X to get debug output:

export DEBUG_SCRIPT=X
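For reference, this hook works like a standard bash debug switch: if DEBUG_SCRIPT is set to any non-empty value, the script turns on command tracing with set -x. A minimal sketch of the pattern (the variable name comes from tpcds-setup.sh; the body here is illustrative):

```shell
#!/usr/bin/env bash
# Illustrative sketch of the DEBUG_SCRIPT hook: any non-empty value
# enables bash command tracing for the rest of the script.
if [ "X$DEBUG_SCRIPT" != "X" ]; then
    set -x   # print each command (prefixed with '+') before running it
fi

echo "generating data..."
```

With the variable exported, every command the script runs (including the failing hive invocations) is echoed with a + prefix, which is how the Tez error below became visible.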

When I ran the script again, I saw the following error:

Dag submit failed due to Invalid TaskLaunchCmdOpts defined for Vertex Map 1 : Invalid/conflicting GC options found, cmdOpts="-server -Djava.net.preferIPv4Stack=true -Dhdp.version=2.5.0.0-1061 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA " stack trace: [org.apache.tez.dag.api.DAG.createDag(DAG.java:866), org.apache.tez.client.TezClientUtils.prepareAndCreateDAGPlan(TezClientUtils.java:694), org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:520), org.apache.tez.client.TezClient.submitDAG(TezClient.java:466), org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:439), org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180), org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160), org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89), org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:75)] retrying...

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

make: *** [date_dim] Error 1

make: *** Waiting for unfinished jobs....

Dag submit failed due to Invalid TaskLaunchCmdOpts defined for Vertex Map 1 : Invalid/conflicting GC options found, cmdOpts="-server -Djava.net.preferIPv4Stack=true -Dhdp.version=2.5.0.0-1061 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA " stack trace: [org.apache.tez.dag.api.DAG.createDag(DAG.java:866), org.apache.tez.client.TezClientUtils.prepareAndCreateDAGPlan(TezClientUtils.java:694), org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:520), org.apache.tez.client.TezClient.submitDAG(TezClient.java:466), org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:439), org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180), org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160), org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89), org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:75)] retrying...

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

make: *** [time_dim] Error 1

+ echo 'Data loaded into database tpcds_bin_partitioned_orc_10.'

This led me to the solution:

https://community.hortonworks.com/questions/23988/not-able-to-run-hive-benchmark-test.html

I used the first suggested solution, reran the script, and it worked:

1. Change hive.tez.java.opts in hive-testbench/settings/load-partitioned.sql to use UseParallelGC (recommended).
set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;
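The conflict is visible in the cmdOpts string above: the merged Tez launch options end up selecting two garbage collectors at once (-XX:+UseParallelGC from the container defaults plus -XX:+UseG1GC, apparently from hive.tez.java.opts), and the JVM cannot run with both, so Tez rejects the vertex. A quick, illustrative way to spot such a clash in an opts string (this helper is made up, not part of the testbench):

```shell
# Illustrative check (not part of the testbench): flag a JVM opts string
# that selects more than one garbage collector.
check_gc_opts() {
    # word-split the opts string and count collector-selection flags
    count=$(printf '%s\n' $1 | grep -c -- '-XX:+Use.*GC$')
    if [ "$count" -gt 1 ]; then
        echo "conflicting GC options"
    else
        echo "ok"
    fi
}

check_gc_opts "-server -XX:+UseParallelGC -XX:+UseG1GC"   # conflicting GC options
check_gc_opts "-server -XX:+UseParallelGC"                # ok
```

The fix above resolves the clash by making hive.tez.java.opts use the same collector (ParallelGC) as the container defaults.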



4 REPLIES



Super Guru

That article is no longer available; not sure why. Which GC options were invalid or conflicting?

New Contributor

How do you debug the scripts? I used bash -x tpcds-setup.sh but couldn't find the error, and when I used your method it also reported errors.

New Contributor

In my case, "export DEBUG_SCRIPT=X" showed that I had permission issues: the hive user didn't have write permission to the /tmp/hive folder on HDFS. Fixing that fixed this issue.
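For anyone hitting that variant, a hedged sketch of the repair (the /tmp/hive path comes from the reply above; mode 777 is an assumption, a common choice for a shared scratch directory, so check it against your cluster's policy):

```shell
# Hypothetical fix for the permissions variant: make the HDFS scratch
# directory writable by the hive user. Needs a cluster node with an
# hdfs client; guarded so the sketch is still runnable elsewhere.
if command -v hdfs >/dev/null 2>&1; then
    sudo -u hdfs hdfs dfs -chmod -R 777 /tmp/hive
else
    echo "hdfs client not found; run this on a cluster node"
fi
```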
