Created on 01-30-2018 02:16 AM - edited 09-16-2022 05:48 AM
version: cloudera-manager-5.13.1
os: Ubuntu 14.04
Here are my steps:
1.
create table new_tmp(action_type string,event_detail string, uuid string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
2.
load data inpath '/tmp/new.txt' into table new_tmp;
3.
create table new_tmp_orc(action_type string,event_detail string, uuid string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS ORC;
4.
insert into table new_tmp_orc select * from new_tmp;
And it reports an error.
In the log file /var/log/hive/hadoop-cmf-hive-HIVESERVER2-node1.log.out, it looks like this:
2018-01-30 17:15:46,331 ERROR org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor: [HiveServer2-Background-Pool: Thread-58]: Status: Failed
......... (some DEBUG- and INFO-level messages omitted)
2018-01-30 17:15:46,372 ERROR org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-58]: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
......... (some DEBUG- and INFO-level messages omitted)
2018-01-30 17:15:46,444 ERROR org.apache.hive.service.cli.operation.Operation: [HiveServer2-Background-Pool: Thread-58]: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:400)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:238)
at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:89)
at org.apache.hive.service.cli.operation.SQLOperation$3$1.run(SQLOperation.java:301)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hive.service.cli.operation.SQLOperation$3.run(SQLOperation.java:314)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I've searched the internet for days and still don't have a clue.
I would really appreciate it if you could help.
Created 02-01-2018 07:37 AM
It's not quite clear what the issue is. We will probably need the HS2 and Spark logs to understand it.
However, I am curious if the second step succeeded.
"load data inpath '/tmp/new.txt' into table new_tmp;"
This appears to be a local path, but there is no "LOCAL" keyword in the command. Have you verified that the data was actually inserted into the new_tmp table after this step?
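For reference, the two variants behave differently (the path here is just your example path):

-- moves a file that already lives on HDFS into the table's directory
load data inpath '/tmp/new.txt' into table new_tmp;

-- copies a file from the local filesystem of the client machine instead
load data local inpath '/tmp/new.txt' into table new_tmp;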
Also, what version of CDH is this? Thanks.
Created 02-04-2018 07:55 PM
@NaveenGangam, thanks for your reply.
1. The second step finished successfully; "select * from tablename" shows the contents exactly right.
2. There is no 'LOCAL' keyword because '/tmp/new.txt' is on HDFS.
3. The CDH version is 5.13.1.
4. My question is: how can I make this step succeed?
insert into table new_tmp_orc select * from new_tmp;
Is there any documentation on loading JSON data into an ORC table? I couldn't find anything on the internet, and I'm feeling pretty hopeless.
Created 02-11-2018 06:23 PM
It could be a few things. However, a detailed log message should be available in the Spark History Server / YARN Resource Manager UI when you click on the failed job. The error will be in one of the executor logs.
1// Invalid JSON
You could have some invalid JSON that is failing to parse. Hive will not skip erroneous records; it will simply fail the entire job.
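If you want to flush out a bad record, you can force the SerDe to parse every column of every row; something like this against your new_tmp table (an untested sketch):

-- counting each column forces full deserialization of every record;
-- a malformed JSON line should make this query fail with a parse error
SELECT count(action_type), count(event_detail), count(uuid) FROM new_tmp;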
2// Not Installing the SerDe
This can be confusing for users, but have you installed the JSON SerDe into the Hive auxiliary directory?
The file that contains this JSON SerDe class is hive-hcatalog-core.jar.
It can be found in several places in the CDH distribution. It needs to be installed into the Hive auxiliary directory, and the HiveServer2 instances subsequently need to be restarted.
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cm_mc_hive_udf.html
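If you just want a quick test before touching the auxiliary directory, a session-level ADD JAR also works (the parcel path below is typical for CDH, but verify it on your install):

-- makes the SerDe class visible for this session only
ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;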
Created 02-11-2018 06:27 PM
Also, we at Cloudera are partial to the Apache Parquet format:
https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_parquet.html
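For example, with the tables from your post, a Parquet version of the target would simply drop the JSON SerDe clause; a rough sketch:

-- the target table needs no JSON SerDe; Parquet brings its own
create table new_tmp_parquet(action_type string, event_detail string, uuid string)
STORED AS PARQUET;

insert into table new_tmp_parquet select * from new_tmp;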
Created 04-17-2018 02:39 AM
I am having the same issue while selecting and inserting data from a JSON SerDe table into a Parquet table. I also have the required "hive-hcatalog-core-1.1.0-cdh5.14.0.jar" in the hive_aux path (without that jar, the JSON table didn't even read the data properly). I looked into the Spark logs and found this:
Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 12, hwslave02.com, executor 1): java.lang.RuntimeException: Error processing row: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":985840392984780801,"created_at":"Mon Apr 16 11:20:40 +0000 2018","source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","favorited":false,"retweeted_status":{"text":"Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},"retweet_count":15},"entities":{"urls":[{"expanded_url":"http://madridistareal.com/opinionreal-isco-sobresale-en-un-madrid-firme/"}],"user_mentions":[{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},{"screen_name":"DbenavidesMReal","name":"Dani Benavides"}],"hashtags":[{"text":"OpinionReal"},{"text":"HalaMadrid"},{"text":"MadridistaReal"}]},"text":"RT @RMadridistaReal: Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"mariadelmadrid","name":"Carmen Madridista","friends_count":4991,"followers_count":3661,"statuses_count":55872,"verified":false,"utc_offset":7200,"time_zone":"Madrid"},"in_reply_to_screen_name":null}
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:154)
at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":985840392984780801,"created_at":"Mon Apr 16 11:20:40 +0000 2018","source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","favorited":false,"retweeted_status":{"text":"Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},"retweet_count":15},"entities":{"urls":[{"expanded_url":"http://madridistareal.com/opinionreal-isco-sobresale-en-un-madrid-firme/"}],"user_mentions":[{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},{"screen_name":"DbenavidesMReal","name":"Dani Benavides"}],"hashtags":[{"text":"OpinionReal"},{"text":"HalaMadrid"},{"text":"MadridistaReal"}]},"text":"RT @RMadridistaReal: Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"mariadelmadrid","name":"Carmen Madridista","friends_count":4991,"followers_count":3661,"statuses_count":55872,"verified":false,"utc_offset":7200,"time_zone":"Madrid"},"in_reply_to_screen_name":null}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:141)
... 16 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:149)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:717)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:98)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
... 17 more
Driver stacktrace:
The JSON table contains Twitter data.
My guess is that the conversion to Parquet cannot preserve the file's nested schema.
Has anyone found a solution or the cause of this problem?
Created 04-17-2018 02:51 AM
I have managed to solve my problem; it was a silly little mistake I was making. I created the JSON table using:
ADD JAR hdfs://hwmaster01.com/user/root/hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE tweets_pqt (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
This was after putting the above-mentioned JAR file in Cloudera Manager's "Hive Auxiliary JARs Directory" (on an unmanaged machine, you can find that under the "hive.aux.jars.path" property in hive-site.xml).
Then I created the Parquet table with the same structure as above, with one small change:
ADD JAR hdfs://hwmaster01.com/user/root/hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE tweets_pqt (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
--ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET;
The insert into the Parquet table succeeded the moment I commented out that line.
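For anyone who lands here with the original ORC question: the same fix should apply. Declare the target table without the JSON SerDe and let ORC use its own, e.g. (a sketch using the table names from the first post):

-- the JSON SerDe stays only on the staging table (new_tmp);
-- the ORC table relies on ORC's built-in SerDe
create table new_tmp_orc(action_type string, event_detail string, uuid string)
STORED AS ORC;

insert into table new_tmp_orc select * from new_tmp;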