Created 01-11-2019 06:18 AM
I am getting an error when querying a Hive table over JSON data. I think the root cause is the JSON SerDe. Do I have to change the Hive native JSON SerDe to a different JSON SerDe? Please advise.
Error: Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
Created 01-11-2019 07:27 PM
Need a bit more information - the schema, a sample of the JSON file, and the stack trace. The standard Hive SerDe should be able to read most common JSON, but we need more detail to tell.
Created 01-12-2019 04:08 AM
Thank you! Please find the requested details below.
Hive Schema:
CREATE TABLE IF NOT EXISTS TestJson (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;
Sample JSON:
[ { "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }, { "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }, ]
I also tried the JSON in the format below:
{ "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }
{ "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }
Error stack trace:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:250)
    at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:373)
    at org.apache.ambari.view.hive20.actor.ResultSetIterator.getNext(ResultSetIterator.java:119)
    at org.apache.ambari.view.hive20.actor.ResultSetIterator.handleMessage(ResultSetIterator.java:78)
    at org.apache.ambari.view.hive20.actor.HiveActor.onReceive(HiveActor.java:38)
    at akka.actor.UntypedActor$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
    at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:416)
    at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:243)
    at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:793)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
    at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
    at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
    at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
    at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:523)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:709)
    at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1617)
    at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1602)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:520)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:427)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1782)
    at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:411)
    ... 25 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
    at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:186)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:501)
    ... 29 more
Caused by: java.io.IOException: Start token not found where expected
    at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:172)
    ... 30 more
Created 01-14-2019 05:55 PM
Hmm, this is pretty simple JSON (the format with one complete document on each line of the text file is correct), and it pretty much just worked for me (Sandbox 2.6.0). Also, this SerDe is usually available by default and doesn't require you to add any extra libs; you can see from "at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:172)" that Hive IS finding the SerDe jar. I would try creating the data file again, uploading it, creating an external table, and then testing.
# place in a text file and upload into HDFS at /data/testjson
{ "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }
{ "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }
CREATE EXTERNAL TABLE IF NOT EXISTS TestJson (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/testjson/';
CREATE TABLE IF NOT EXISTS TestJsonInt (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;
INSERT OVERWRITE TABLE TestJsonInt SELECT * FROM TestJson;
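As a quick sanity check after the upload (the table names and the /data/testjson path are the ones from the statements above; adjust if yours differ):
-- verify the external table parses the JSON, then check the managed copy
SELECT id, nm, yrs FROM TestJson LIMIT 2;
SELECT COUNT(*) FROM TestJsonInt;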
Created 01-14-2019 10:31 PM
Hi, I can see that the first JSON is incorrect. Could you please run the JSON through a JSON validator and fix it?
- Then recreate the table and try again
Created 01-16-2019 04:52 PM
@atrivedi Thank you. Do you mean that it starts with a square bracket "[" instead of a curly bracket "{"?
Created 01-15-2019 12:02 AM
Thank you @jbarnett. Yes, the simple JSON format worked for me as well after correcting the schema, but the other format, where the JSON starts with a square bracket, needs a tweak to make it work.
We receive hundreds of files, with thousands of array elements in each file, in the format below:
[{ "id":1, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }, { "id":2, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }]
However, the JSON SerDe only supports single-line JSON, meaning each JSON document has to be on its own line; otherwise the array of JSON objects has to be restructured into the updated format below:
{ "jsondata":[{ "id":1,
"nm":"Edward the Elder", "cty":"United
Kingdom", "hse":"House of Wessex",
"yrs":"899925" }] }
And the schema:
CREATE EXTERNAL TABLE IF NOT EXISTS TestJson1 (jsondata array<struct<id:int,nm:varchar(30),cty:varchar(30),hse:varchar(30),yrs:varchar(20)>>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/3rdPartyData/Hive';
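For completeness, this is the flattening query I would expect to run against that TestJson1 schema (just a sketch, not yet verified against the real files):
-- explode the jsondata array so each struct becomes its own row
SELECT j.id, j.nm, j.cty, j.hse, j.yrs
FROM TestJson1
LATERAL VIEW explode(jsondata) jd AS j;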
Any thoughts on how to manage the SerDe/schema here without updating the source files?
Created 01-15-2019 12:45 AM
If the file is one big text string without newline characters, you could treat it as one line and parse it with Python (see https://community.hortonworks.com/content/kbentry/155544/how-to-troubleshoot-hive-udtf-functions.htm...), or define it as a single string column and then parse it with JSON functions and normalize the array. You would have to make sure the data isn't too big in any one file, and also consider splits to ensure the whole file gets read as one. In any case, it's not very scalable.
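As a rough sketch of the "single string plus JSON functions" idea (raw_json and its line column are placeholder names I'm making up here, and this assumes the whole array sits on one line and you only need a fixed, small number of elements):
-- one string column over the raw file, then pick array elements by index
CREATE EXTERNAL TABLE IF NOT EXISTS raw_json (line string) LOCATION '/data/3rdPartyData/Hive';
SELECT get_json_object(line, '$[0].id') AS id,
       get_json_object(line, '$[0].nm') AS nm
FROM raw_json;
To normalize every element of the array rather than a fixed index, you'd be back to a UDTF or preprocessing, along the lines of the article linked above.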
If you have newline characters, you are pretty much stuck, as the JSON SerDe is based on the text SerDe and each newline is considered a new record. If that is the case, you are going to have to preprocess, maybe with Python, or, if you need more scale, with Spark or PySpark.
Created 01-16-2019 04:49 PM
Thank you @jbarnett. I have worked on a similar JSON structure before with Spark, but now I am checking whether it is possible to ingest the data using only shell scripts and Hive scripts.
Created 01-18-2019 08:53 PM
Based on my analysis, I understand that org.apache.hive.hcatalog.data.JsonSerDe doesn't support JSON that starts with a square bracket "[", such as:
[{ "id":1, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }, { "id":2, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }]