
Error loading CSV file using Hive

Contributor

I created a table using a Java client:

CREATE TABLE csvdemo (id Int, name String, email String) STORED AS PARQUET

I used the Java Hadoop FileSystem API to copy the CSV file from the local filesystem into HDFS.

When I run this load command it looks successful (running from Ambari):

load data inpath '/user/admin/MOCK_DATA.csv' into table csvdemo;

But when I try to read from it using:

select * from csvdemo limit 1;

I get this error:

org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:250)
	at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:373)
	at org.apache.ambari.view.hive20.actor.ResultSetIterator.getNext(ResultSetIterator.java:119)
	at org.apache.ambari.view.hive20.actor.ResultSetIterator.handleMessage(ResultSetIterator.java:78)
	at org.apache.ambari.view.hive20.actor.HiveActor.onReceive(HiveActor.java:38)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:414)
	at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:233)
	at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:784)
	at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
	at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:520)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:709)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1557)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1542)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:520)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:427)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1765)
	at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:409)
	... 24 more
Caused by: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:255)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:97)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:83)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:71)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:694)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:332)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:458)
	... 28 more
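For context, the byte arrays in the error message are ASCII codes, and decoding them (a quick Python sketch, independent of Hive) shows exactly what the Parquet reader found:

```python
# Hive's Parquet reader validates the last 4 bytes of the file (the
# "magic number") before reading anything else. Decoding the two byte
# arrays from the error message:
expected = bytes([80, 65, 82, 49])    # -> b"PAR1", the Parquet footer magic
found = bytes([103, 111, 118, 10])    # -> b"gov\n", the tail of a CSV line

print(expected.decode())   # PAR1
print(found.decode())      # gov (followed by a newline)
```

In other words, `load data inpath` only moves the file into the table's HDFS directory; it does not convert it, so the reader finds CSV text where it expects a Parquet footer.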

1 ACCEPTED SOLUTION

Contributor

@Jay SenSharma @Shu

So, now it works. I just connected as the admin user instead of the hive user, and everything works fine.

Thanks for your help.


14 REPLIES

Master Guru

Hi @Yair Ogen,

Right now the csvdemo table is a Parquet table, so it expects the data loaded into it to be in Parquet format, but you are loading CSV data into a Parquet table.

1. Create a normal text table:

CREATE TABLE csvdemo (id Int, name String, email String) 
row format delimited 
fields terminated by ',' 
STORED AS TEXTFILE;

2. Load the data into the text table:

load data inpath '/user/admin/MOCK_DATA.csv' into table csvdemo;

3. Then create another table stored as Parquet:

CREATE TABLE csvdemo_prq (id Int, name String, email String) stored as parquet;

4. Insert into the Parquet table by selecting from the text table:

insert into csvdemo_prq select * from csvdemo;

Your csvdemo_prq table is then in Parquet format; Hive doesn't convert CSV data directly into Parquet.
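The footer check that failed in the original error can be mimicked locally. A minimal Python sketch (file names are hypothetical, and this is only an illustration of the check, not Hive's actual code path):

```python
import os
import tempfile

def looks_like_parquet(path: str) -> bool:
    """Mimic the reader's sanity check: a Parquet file must end with b'PAR1'."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data) >= 4 and data[-4:] == b"PAR1"

# A CSV file fails the check, since its last bytes are just text.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(b"1,Alice,alice@example.gov\n")
    csv_path = f.name

print(looks_like_parquet(csv_path))   # False
os.unlink(csv_path)
```

This is why step 4 above is needed: the `insert ... select` runs through Hive's Parquet writer, which produces a file with the proper footer.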

Contributor

Thanks @Shu. Is Parquet worth the double load? I mean, if I run just the first load it seems to work fine, so what am I losing here?

Also, I have an issue running this load from the Java client (a permissions issue?):

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:283) ~[hive-jdbc-2.0.0.jar:2.0.0]

Contributor

@Shu

Given that the initial load works for me, what are the pros and cons of the Parquet format? Is creating a new table and copying the data from the text-based table into it worth the trouble?

Master Guru

@Yair Ogen,

If you are using the Hortonworks distribution, it's better to use the ORC format, as it is optimized for the Tez execution engine. If file size is a concern, ORC also compresses better than Parquet.

Best practices for ORC on HCC:

https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html

Pros and cons of the Parquet format:

https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-t...

ORC vs. Parquet:

https://community.hortonworks.com/questions/2067/orc-vs-parquet-when-to-use-one-over-the-other.html

Contributor

Super, thanks.

Master Guru

@Yair Ogen,
Can you run:

desc formatted csvdemo;

That shows the HDFS location where the table is stored; change the permissions on that parent directory.

Without the recursive (-R) flag, try running the command below:

hadoop fs -chmod 777 <table location directory>
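For readers unfamiliar with mode 777 (read/write/execute for owner, group, and others), a small local sketch using Python's os module as a stand-in for the HDFS command:

```python
import os
import stat
import tempfile

# Create a scratch directory and open it up the same way
# `hadoop fs -chmod 777 <dir>` would on HDFS.
d = tempfile.mkdtemp()
os.chmod(d, 0o777)                       # rwxrwxrwx
mode = stat.S_IMODE(os.stat(d).st_mode)
print(oct(mode))                         # 0o777
os.rmdir(d)
```

Note that 777 is a blunt instrument for debugging; once the real permissions problem is identified, a tighter mode (or proper ownership) is the better fix.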

Contributor

@Shu
I ran:

hadoop fs -chmod 777 /apps/hive/warehouse/csvdemo

Still the same error in my Java client:

java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

Master Guru
@Yair Ogen

Can you share a screenshot after executing the command below:

hadoop fs -ls /apps/hive/warehouse/

and attach the Hive logs from this directory:

/var/log/hive/

(or)

Go to the ResourceManager UI and attach the error logs.

For reference:

https://community.hortonworks.com/questions/49759/viewing-logs-for-hive-query-executions.html

Contributor

@Shu

The ls output:

Found 1 items
drwxrwxrwx   - hive hadoop          0 2017-10-01 18:01 /apps/hive/warehouse/csvdemo

Which of these do you need?

drwxr-xr-x  2 hive hadoop     4096 Oct  2 00:00 ./
drwxrwxr-x 40 root syslog     4096 Oct  2 06:25 ../
-rw-r--r--  1 hive hadoop        0 Oct  1 13:51 atlas_hook_failed_messages.log
-rw-r--r--  1 hive hadoop      312 Oct  1 14:36 hive.err
-rw-r--r--  1 hive hadoop   581226 Oct  2 11:51 hivemetastore.log
-rw-r--r--  1 hive hadoop   290424 Sep 28 23:57 hivemetastore.log.2017-09-28
-rw-r--r--  1 hive hadoop  1195294 Sep 29 23:57 hivemetastore.log.2017-09-29
-rw-r--r--  1 hive hadoop  1173882 Sep 30 23:57 hivemetastore.log.2017-09-30
-rw-r--r--  1 hive hadoop  1309208 Oct  1 23:57 hivemetastore.log.2017-10-01
-rw-r--r--  1 hive hadoop       31 Oct  1 14:36 hive.out
-rw-r--r--  1 hive hadoop   112944 Oct  1 18:02 hive-server2.err
-rw-r--r--  1 hive hadoop  1189192 Oct  2 11:51 hiveserver2.log
-rw-r--r--  1 hive hadoop 32095865 Sep 27 16:30 hiveserver2.log.2017-09-27
-rw-r--r--  1 hive hadoop 88093181 Sep 28 23:57 hiveserver2.log.2017-09-28
-rw-r--r--  1 hive hadoop  2368106 Sep 29 23:57 hiveserver2.log.2017-09-29
-rw-r--r--  1 hive hadoop  2368598 Sep 30 23:57 hiveserver2.log.2017-09-30
-rw-r--r--  1 hive hadoop 23657102 Oct  1 23:57 hiveserver2.log.2017-10-01
-rw-r--r--  1 hive hadoop        0 Oct  1 14:37 hive-server2.out