Created on 01-30-2018 02:16 AM - edited 09-16-2022 05:48 AM
version: cloudera-manager-5.13.1
os: Ubuntu 14.04
Here are my steps:
1.
create table new_tmp(action_type string,event_detail string, uuid string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
2.
load data inpath '/tmp/new.txt' into table new_tmp;
3.
create table new_tmp_orc(action_type string,event_detail string, uuid string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS ORC;
4.
insert into table new_tmp_orc select * from new_tmp;
And it reports an error.
In the log file /var/log/hive/hadoop-cmf-hive-HIVESERVER2-node1.log.out, it looks like this:
2018-01-30 17:15:46,331 ERROR org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor: [HiveServer2-Background-Pool: Thread-58]: Status: Failed
......... (some DEBUG- and INFO-level messages omitted)
2018-01-30 17:15:46,372 ERROR org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-58]: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
......... (some DEBUG- and INFO-level messages omitted)
2018-01-30 17:15:46,444 ERROR org.apache.hive.service.cli.operation.Operation: [HiveServer2-Background-Pool: Thread-58]: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:400)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:238)
at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:89)
at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:301)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:314)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I've searched the internet for days and still don't have a clue.
I would really appreciate it if you could help.
Created 02-01-2018 07:37 AM
It's not quite clear what the issue is. We will probably need the HS2 and Spark logs to understand it.
However, I am curious if the second step succeeded.
"load data inpath '/tmp/new.txt' into table new_tmp;"
This appears to be a local path, but there is no "LOCAL" keyword in the command. Have you verified that the data was actually inserted into the new_tmp table after this step?
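For reference, the two variants behave differently (the path here is just your example path):

-- moves a file that already lives on HDFS into the table's directory
load data inpath '/tmp/new.txt' into table new_tmp;

-- copies a file from the local filesystem of the client machine instead
load data local inpath '/tmp/new.txt' into table new_tmp;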
Also, what version of CDH is this? Thanks.
Created 02-04-2018 07:55 PM
@NaveenGangam, thanks for your reply.
1. The second step finished successfully; "select * from tablename" shows the contents exactly right.
2. There is no 'LOCAL' keyword because '/tmp/new.txt' is on HDFS.
3. The CDH version is 5.13.1.
4. My question is: how can I make this step succeed?
insert into table new_tmp_orc select * from new_tmp;
Is there any documentation on loading JSON data into an ORC table? I couldn't find anything on the internet, and I'm feeling pretty hopeless.
Created 02-11-2018 06:23 PM
It could be a few things. However, a detailed log message should be available in the Spark History Server / YARN Resource Manager UI when you click on the failed job. The error will be in one of the executor logs.
1// Invalid JSON
You could have some invalid JSON that is failing to parse. Hive will not skip erroneous records; it will simply fail the entire job.
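If you want to flush out a bad record, you can force the SerDe to parse every column of every row; something like this against your new_tmp table (an untested sketch):

-- counting each column forces full deserialization of every record;
-- a malformed JSON line should make this query fail with a parse error
SELECT count(action_type), count(event_detail), count(uuid) FROM new_tmp;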
2// Not Installing the SerDe
This can be confusing for users, but have you installed the JSON SerDe into the Hive auxiliary directory?
The file that contains this JSON SerDe class is hive-hcatalog-core.jar.
It can be found in several places in the CDH distribution. It needs to be installed into the Hive auxiliary directory, and the HiveServer2 instances subsequently need to be restarted.
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_mc_hive_udf.html
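If you just want a quick test before touching the auxiliary directory, a session-level ADD JAR also works (the parcel path below is typical for CDH, but verify it on your install):

-- makes the SerDe class visible for this session only
ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;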
Created 02-11-2018 06:27 PM
Also, we at Cloudera are partial to the Apache Parquet format:
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_parquet.html
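For example, with the tables from your post, a Parquet version of the target would simply drop the JSON SerDe clause; a rough sketch:

-- the target table needs no JSON SerDe; Parquet brings its own
create table new_tmp_parquet(action_type string, event_detail string, uuid string)
STORED AS PARQUET;

insert into table new_tmp_parquet select * from new_tmp;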
Created 04-17-2018 02:39 AM
I am having the same issue while selecting and inserting data from a JSON SerDe table into a Parquet table. I also have the required "hive-hcatalog-core-1.1.0-cdh5.14.0.jar" in the hive_aux path (without that jar, the JSON table didn't even read the data properly). I looked into the Spark logs and found this:
Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 12, hwslave02.com, executor 1): java.lang.RuntimeException: Error processing row: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":985840392984780801,"created_at":"Mon Apr 16 11:20:40 +0000 2018","source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","favorited":false,"retweeted_status":{"text":"Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},"retweet_count":15},"entities":{"urls":[{"expanded_url":"http://madridistareal.com/opinionreal-isco-sobresale-en-un-madrid-firme/"}],"user_mentions":[{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},{"screen_name":"DbenavidesMReal","name":"Dani Benavides"}],"hashtags":[{"text":"OpinionReal"},{"text":"HalaMadrid"},{"text":"MadridistaReal"}]},"text":"RT @RMadridistaReal: Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"mariadelmadrid","name":"Carmen Madridista","friends_count":4991,"followers_count":3661,"statuses_count":55872,"verified":false,"utc_offset":7200,"time_zone":"Madrid"},"in_reply_to_screen_name":null}
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:154)
at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":985840392984780801,"created_at":"Mon Apr 16 11:20:40 +0000 2018","source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","favorited":false,"retweeted_status":{"text":"Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},"retweet_count":15},"entities":{"urls":[{"expanded_url":"http://madridistareal.com/opinionreal-isco-sobresale-en-un-madrid-firme/"}],"user_mentions":[{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},{"screen_name":"DbenavidesMReal","name":"Dani Benavides"}],"hashtags":[{"text":"OpinionReal"},{"text":"HalaMadrid"},{"text":"MadridistaReal"}]},"text":"RT @RMadridistaReal: Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"mariadelmadrid","name":"Carmen Madridista","friends_count":4991,"followers_count":3661,"statuses_count":55872,"verified":false,"utc_offset":7200,"time_zone":"Madrid"},"in_reply_to_screen_name":null}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:141)
... 16 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:149)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:717)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:98)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
... 17 more
Driver stacktrace:
The JSON table contains Twitter data.
My guess is that the conversion to Parquet cannot preserve the file's nested schema.
Has anyone found a solution or the cause of this problem?
Created 04-17-2018 02:51 AM
I have managed to solve my problem; it was a silly little mistake I was making. I created the JSON table using:
ADD JAR hdfs://hwmaster01.com/user/root/hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE tweets_pqt (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
This was after putting the above-mentioned JAR file in Cloudera Manager's "Hive Auxiliary JARs Directory" (on an unmanaged machine, you can find that under the "hive.aux.jars.path" property in hive-site.xml).
Then I created the Parquet table with the same structure as above, with one small change:
ADD JAR hdfs://hwmaster01.com/user/root/hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE tweets_pqt (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
--ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET;
The insert into the Parquet table succeeded the moment I commented out that line.
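For anyone who lands here with the original ORC question: the same fix should apply. Declare the target table without the JSON SerDe and let ORC use its own, e.g. (a sketch using the table names from the first post):

-- the JSON SerDe stays only on the staging table (new_tmp);
-- the ORC table relies on ORC's built-in SerDe
create table new_tmp_orc(action_type string, event_detail string, uuid string)
STORED AS ORC;

insert into table new_tmp_orc select * from new_tmp;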