Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Issue with Hive JSON SerDe

Solved Go to solution

Issue with Hive JSON SerDe

New Contributor

I am getting an error when querying Hive table over JSON Data. I think the root cause would be JSON SerDe. Should I have to change Hive native JSON SerDe to any other JSON SerDe? Please advise

Error: Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Issue with Hive JSON SerDe

New Contributor

Thank you @jbarnett. I worked on the similar JSON structure before with Spark, but I am checking now the possibility to ingest data using only shell scripts and Hive scripts.

9 REPLIES 9

Re: Issue with Hive JSON SerDe

Expert Contributor

Need a bit more information - the schema, a sample of the json file, stack trace. The standard Hive Serde should be able to read most common json. But need more detail to tell.

Re: Issue with Hive JSON SerDe

New Contributor

@jbarnett

Thank you! Please find below requested details

Hive Schema:

CREATE TABLE IF NOT EXISTS TestJson (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;

Sample Json:

[ { "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }, { "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }, ]

Also tried with below JSON format as well:

{ "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }

{ "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }

Error stack trace:

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:250)
	at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:373)
	at org.apache.ambari.view.hive20.actor.ResultSetIterator.getNext(ResultSetIterator.java:119)
	at org.apache.ambari.view.hive20.actor.ResultSetIterator.handleMessage(ResultSetIterator.java:78)
	at org.apache.ambari.view.hive20.actor.HiveActor.onReceive(HiveActor.java:38)
	at akka.actor.UntypedActor$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
	at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:416)
	at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:243)
	at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:793)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
	at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:523)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:709)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1617)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1602)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:520)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:427)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1782)
	at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:411)
	... 25 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
	at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:186)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:501)
	... 29 more
Caused by: java.io.IOException: Start token not found where expected
	at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:172) 

... 30 more

Re: Issue with Hive JSON SerDe

Expert Contributor

Hmm, This is pretty simple json (the format with one complete document on each line of text file is correct) and it pretty much just worked for me (Sandbox 2.6.0). Also, this serde is usually available by default and doesn't require you to add any extra libs. you can see from the "at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:172)" That Hive IS finding the serde jar. I would try creating the data file again, uploading and creating an external table then test.

# place in text file and upload into hdfs /data/testjson

{ "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }

{ "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }

CREATE EXTERNAL TABLE IF NOT EXISTS TestJson (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/testjson/';

CREATE TABLE IF NOT EXISTS TestJsonInt (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20)) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE; insert overwrite table TestJsonInt select * from TestJson;

Re: Issue with Hive JSON SerDe

New Contributor

@Ravi Yandra

Hi, I can see the first JSON is incorrect, could you please use the JSON validator and validate the JSON

- Then recreate the table and try again

Re: Issue with Hive JSON SerDe

New Contributor

@atrivedi Thank you. do you mean starting with Square bracket "[" instead of curly bracket "{"?

Re: Issue with Hive JSON SerDe

New Contributor

Thank you @jbarnett Yes, it worked as well to me the simple json format after correcting the schema but the other format JSON starts with Square bracket that requires a tweak to work it.

Receiving hundreds of files, and 1000s of Array elements in each file as below format:

[{ "id":1, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }, { "id":2, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }]

However, JSON SerDe only supporting single line JSON meaning each JSON doc have to be in a different line otherwise Array of JSON objects in the below updated format:

{ "jsondata":[{ "id":1, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }] }

And, the schema:

CREATE EXTERNAL TABLE IF NOT EXISTS TestJson1 (jsondata array<struct<id:int,nm:varchar(30),cty:varchar(30),hse:varchar(30),yrs:varchar(20)>>)

ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'

Location '/data/3rdPartyData/Hive';

Any thoughts here to manage SerDe/Schema without updating the source files?

Re: Issue with Hive JSON SerDe

Expert Contributor

If the file is one big text string without newline characters, you could treat it as one line and parse with python (See https://community.hortonworks.com/content/kbentry/155544/how-to-troubleshoot-hive-udtf-functions.htm... or define it as a single string and then parse with Json functions and normalize the array. You would have to make sure that the data wasn't too big in any file and also have to consider splits to ensure the whole file got read as one. In any case, not very scalable.

If you have newline characters, you are pretty much stuck as the json serde is based on the text serde and each newline is considered a new record. If that is the case, you are going to have to preprocess maybe with python, or if you need more scale then Spark or PySpark.

Re: Issue with Hive JSON SerDe

New Contributor

Thank you @jbarnett. I worked on the similar JSON structure before with Spark, but I am checking now the possibility to ingest data using only shell scripts and Hive scripts.

Highlighted

Re: Issue with Hive JSON SerDe

New Contributor

As per my analysis, I understand that org.apache.hive.hcatalog.data.JsonSerDe doesn't support the JSON which starts with square bracket "[".

[{ "id":1, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }, { "id":2, "nm":"Edward the Elder", "cty":"United Kingdom", "hse":"House of Wessex", "yrs":"899925" }]