Created 10-01-2017 01:47 PM
I created a table using a Java client:
CREATE TABLE csvdemo (id Int, name String, email String) STORED AS PARQUET
I use the Java Hadoop FileSystem API to copy the CSV file from the local disk into HDFS.
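Roughly like this (a simplified sketch of my copy step, not the exact code; the local and HDFS paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CsvUploader {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy the local CSV into HDFS (paths are illustrative)
        fs.copyFromLocalFile(new Path("/tmp/MOCK_DATA.csv"),
                new Path("/user/admin/MOCK_DATA.csv"));
        fs.close();
    }
}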
When I run this load command from Ambari, it looks successful:
load data inpath '/user/admin/MOCK_DATA.csv' into table csvdemo;
But when I try to read from it using:
select * from csvdemo limit 1;
I get this error:
org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:250)
	at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:373)
	at org.apache.ambari.view.hive20.actor.ResultSetIterator.getNext(ResultSetIterator.java:119)
	at org.apache.ambari.view.hive20.actor.ResultSetIterator.handleMessage(ResultSetIterator.java:78)
	at org.apache.ambari.view.hive20.actor.HiveActor.onReceive(HiveActor.java:38)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:414)
	at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:233)
	at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:784)
	at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
	at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:520)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:709)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1557)
	at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1542)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:520)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:427)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1765)
	at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:409)
	... 24 more
Caused by: java.lang.RuntimeException: hdfs://my-host:8020/apps/hive/warehouse/csvdemo/MOCK_DATA.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 111, 118, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:423)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:255)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:97)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:83)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:71)
	at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:694)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:332)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:458)
	... 28 more
Created 10-03-2017 04:36 AM
So, now it works. I just connected as the admin user instead of the hive user, and now it all works fine.
Thanks for your help.
Created 10-01-2017 01:57 PM
Hi @Yair Ogen,
Right now the csvdemo table is a Parquet table, so it expects the data loaded into it to be in Parquet format, but you are loading CSV data into it. (The expected magic number [80, 65, 82, 49] is ASCII for "PAR1", the marker at the tail of every Parquet file; the [103, 111, 118, 10] it found instead is just the last bytes of your CSV.) To convert:
1. Create a normal text table:
CREATE TABLE csvdemo (id Int, name String, email String) row format delimited fields terminated by ',' STORED AS TEXTFILE;
2. Load the data into the text table:
load data inpath '/user/admin/MOCK_DATA.csv' into table csvdemo;
3. Then create another table stored as Parquet:
CREATE TABLE csvdemo_prq (id Int, name String, email String) stored as parquet;
4. Insert into the Parquet table by selecting from the text table:
insert into csvdemo_prq select * from csvdemo;
Then your csvdemo_prq table is in Parquet format. Hive doesn't convert CSV data directly into Parquet; LOAD DATA only moves the file into the table directory, which is why the Parquet reader then chokes on it.
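Since you are driving this from a Java client, the whole conversion over JDBC looks roughly like the sketch below (the connection URL, user, and password are placeholders; assumes the hive-jdbc driver is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvToParquet {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 URL and credentials
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://my-host:10000/default", "admin", "");
             Statement st = con.createStatement()) {
            // 1. Text staging table matching the CSV layout
            st.execute("CREATE TABLE IF NOT EXISTS csvdemo (id Int, name String, email String) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE");
            // 2. Move the CSV from HDFS into the staging table
            st.execute("LOAD DATA INPATH '/user/admin/MOCK_DATA.csv' INTO TABLE csvdemo");
            // 3. Parquet target table
            st.execute("CREATE TABLE IF NOT EXISTS csvdemo_prq (id Int, name String, email String) "
                    + "STORED AS PARQUET");
            // 4. Rewrite the rows as Parquet files
            st.execute("INSERT INTO csvdemo_prq SELECT * FROM csvdemo");
        }
    }
}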
Created 10-01-2017 02:07 PM
Thanks @Shu. Is Parquet worth the double load? I mean, if I run just the first load it seems to work fine, so what am I losing here?
Also, I have an issue running this load from the Java client (a permissions issue?):
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:283) ~[hive-jdbc-2.0.0.jar:2.0.0]
Created 10-03-2017 04:42 AM
Now that the initial load works for me: what are the pros and cons of the Parquet format? Is creating a new table and copying the data from the text-based table into it worth the trouble?
Created 10-03-2017 12:11 PM
If you are using the Hortonworks distribution, then it's better to use the ORC format, as it is optimized for the Tez execution engine; and if you are concerned about file size, ORC compresses better than Parquet.
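If you go with ORC, the conversion has the same shape as before. A minimal sketch, reusing the text staging table (csvdemo_orc is just an illustrative name; URL and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TextToOrc {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 URL and credentials, as in the earlier sketch
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://my-host:10000/default", "admin", "");
             Statement st = con.createStatement()) {
            // ORC target table instead of Parquet
            st.execute("CREATE TABLE IF NOT EXISTS csvdemo_orc (id Int, name String, email String) "
                    + "STORED AS ORC");
            // Rewrite the text-table rows as ORC files
            st.execute("INSERT INTO csvdemo_orc SELECT * FROM csvdemo");
        }
    }
}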
Best practices for ORC in HCC:
https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html
Pros and cons of the Parquet format, ORC vs Parquet:
https://community.hortonworks.com/questions/2067/orc-vs-parquet-when-to-use-one-over-the-other.html
Created 10-03-2017 12:21 PM
Super, thanks.
Created 10-01-2017 02:54 PM
@Yair Ogen,
Can you run
desc formatted csvdemo;
That will show the HDFS location where the table is stored; change the permissions on that parent directory.
Remove the recursive -R flag and try running the command below:
hadoop fs -chmod 777 <table location directory>
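If you'd rather do this from your Java client, a rough equivalent with the FileSystem API (the path is the table location reported by desc formatted; 777 is a blunt setting, fine for debugging only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class OpenTableDir {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        // Table location from desc formatted; adjust to your cluster
        fs.setPermission(new Path("/apps/hive/warehouse/csvdemo"),
                new FsPermission((short) 0777));
        fs.close();
    }
}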
Created 10-01-2017 03:02 PM
@Shu
I ran
hadoop fs -chmod 777 /apps/hive/warehouse/csvdemo
Still the same error in my Java client:
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Created 10-01-2017 04:29 PM
Can you share a screenshot after executing the command below:
hadoop fs -ls /apps/hive/warehouse/
and attach the Hive logs from this directory:
/var/log/hive/
(or)
go to the Resource Manager UI and attach the error logs.
For reference:
https://community.hortonworks.com/questions/49759/viewing-logs-for-hive-query-executions.html
Created 10-02-2017 08:53 AM
the ls output:
Found 1 items
drwxrwxrwx   - hive hadoop          0 2017-10-01 18:01 /apps/hive/warehouse/csvdemo
which of these do you need?
drwxr-xr-x  2 hive hadoop     4096 Oct  2 00:00 ./
drwxrwxr-x 40 root syslog     4096 Oct  2 06:25 ../
-rw-r--r--  1 hive hadoop        0 Oct  1 13:51 atlas_hook_failed_messages.log
-rw-r--r--  1 hive hadoop      312 Oct  1 14:36 hive.err
-rw-r--r--  1 hive hadoop   581226 Oct  2 11:51 hivemetastore.log
-rw-r--r--  1 hive hadoop   290424 Sep 28 23:57 hivemetastore.log.2017-09-28
-rw-r--r--  1 hive hadoop  1195294 Sep 29 23:57 hivemetastore.log.2017-09-29
-rw-r--r--  1 hive hadoop  1173882 Sep 30 23:57 hivemetastore.log.2017-09-30
-rw-r--r--  1 hive hadoop  1309208 Oct  1 23:57 hivemetastore.log.2017-10-01
-rw-r--r--  1 hive hadoop       31 Oct  1 14:36 hive.out
-rw-r--r--  1 hive hadoop   112944 Oct  1 18:02 hive-server2.err
-rw-r--r--  1 hive hadoop  1189192 Oct  2 11:51 hiveserver2.log
-rw-r--r--  1 hive hadoop 32095865 Sep 27 16:30 hiveserver2.log.2017-09-27
-rw-r--r--  1 hive hadoop 88093181 Sep 28 23:57 hiveserver2.log.2017-09-28
-rw-r--r--  1 hive hadoop  2368106 Sep 29 23:57 hiveserver2.log.2017-09-29
-rw-r--r--  1 hive hadoop  2368598 Sep 30 23:57 hiveserver2.log.2017-09-30
-rw-r--r--  1 hive hadoop 23657102 Oct  1 23:57 hiveserver2.log.2017-10-01
-rw-r--r--  1 hive hadoop        0 Oct  1 14:37 hive-server2.out