04-17-2018
02:51 AM
I have managed to solve my problem; it was a silly little mistake I was making. I created the JSON table using:

ADD JAR hdfs://hwmaster01.com/user/root/hive-serdes-1.0-SNAPSHOT.jar;
CREATE TABLE tweets_pqt (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

This worked after putting the above-mentioned JAR file in Cloudera Manager's "Hive Auxiliary JARs Directory" (on an unmanaged machine you can find that under the "hive.aux.jars.path" property in hive-site.xml). Then I created the Parquet table with the same structure as above, with one small change:

ADD JAR hdfs://hwmaster01.com/user/root/hive-serdes-1.0-SNAPSHOT.jar;
CREATE TABLE tweets_pqt (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
--ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET;

The insert into the Parquet table succeeded the moment I commented out that ROW FORMAT SERDE line.
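For reference, the copy itself is then just a plain INSERT ... SELECT between the two tables. A minimal sketch, assuming the JSON-backed table is named tweets_json (substitute your own JSON table's name):

```sql
-- Copy all rows from the JSON SerDe table into the Parquet table.
-- tweets_json is a hypothetical name for the JSON-backed source table.
-- `user` is backticked in case USER is a reserved word in your Hive version.
INSERT OVERWRITE TABLE tweets_pqt
SELECT id, created_at, source, favorited, retweeted_status,
       entities, text, `user`, in_reply_to_screen_name
FROM tweets_json;
```

Hive handles the nested STRUCT and ARRAY columns transparently here, since both tables declare the identical schema; only the storage format and SerDe differ.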
04-17-2018
02:39 AM
I am having the same issue while selecting and inserting data from a JSON SerDe table into a Parquet table. I also have the required "hive-hcatalog-core-1.1.0-cdh5.14.0.jar" in the hive_aux path (without that JAR the JSON table didn't even read the data properly). I looked into the Spark logs and found this:

Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 12, hwslave02.com, executor 1): java.lang.RuntimeException: Error processing row: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":985840392984780801,"created_at":"Mon Apr 16 11:20:40 +0000 2018","source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","favorited":false,"retweeted_status":{"text":"Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},"retweet_count":15},"entities":{"urls":[{"expanded_url":"http://madridistareal.com/opinionreal-isco-sobresale-en-un-madrid-firme/"}],"user_mentions":[{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},{"screen_name":"DbenavidesMReal","name":"Dani Benavides"}],"hashtags":[{"text":"OpinionReal"},{"text":"HalaMadrid"},{"text":"MadridistaReal"}]},"text":"RT @RMadridistaReal: Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"mariadelmadrid","name":"Carmen Madridista","friends_count":4991,"followers_count":3661,"statuses_count":55872,"verified":false,"utc_offset":7200,"time_zone":"Madrid"},"in_reply_to_screen_name":null}
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:154)
at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:95)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$15.apply(AsyncRDDActions.scala:120)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2022)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":985840392984780801,"created_at":"Mon Apr 16 11:20:40 +0000 2018","source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","favorited":false,"retweeted_status":{"text":"Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},"retweet_count":15},"entities":{"urls":[{"expanded_url":"http://madridistareal.com/opinionreal-isco-sobresale-en-un-madrid-firme/"}],"user_mentions":[{"screen_name":"RMadridistaReal","name":"#MadridistaReal"},{"screen_name":"DbenavidesMReal","name":"Dani Benavides"}],"hashtags":[{"text":"OpinionReal"},{"text":"HalaMadrid"},{"text":"MadridistaReal"}]},"text":"RT @RMadridistaReal: Os dejamos la #OpinionReal\nde @DbenavidesMReal\n\nLeedla amigos! 😉\n\n#HalaMadrid #MadridistaReal \n\nhttps://t.co/pVthhZThxF","user":{"screen_name":"mariadelmadrid","name":"Carmen Madridista","friends_count":4991,"followers_count":3661,"statuses_count":55872,"verified":false,"utc_offset":7200,"time_zone":"Madrid"},"in_reply_to_screen_name":null}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:141)
... 16 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.ParquetHiveRecord
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:149)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:717)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:98)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
... 17 more
Driver stacktrace:

The JSON table contains Twitter data. My guess is that, while converting to Parquet, Hive cannot preserve the file's nested schema. Has anyone found a solution to, or the cause of, this problem?