Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

create hive orc table

avatar
Rising Star

Im trying to create a table in hive with orc format and load this table with data that I have in a ".tbl" file. In the ".tbl" files each row have this format:

1|Customer#000000001|IVhzIApeRb ot,c,E|15|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|

I create a hive table with orc format like this:

create table if not exists partsupp (PS_PARTKEY BIGINT, PS_SUPPKEY BIGINT, PS_AVAILQTY INT, PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING)STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")

Now Im trying to load data into the table like this:

LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' [OVERWRITE] INTO TABLE partsupp;

My questions are, do you know if this is a correct method to do this? And if it is, do you know why this error is happening when I do the load data inpatch command?

Failed: Parse exception mismatched input '[' expecting into near '/tables/partsupp/partsupp.tbl in load statement
1 ACCEPTED SOLUTION

avatar
Master Guru

[OVERWRITE] in a syntax diagram means that you can optionally use the OVERWRITE keyword. If you want to append data remove the keyword. If you want to overrwrite the table remove the square brackets.

( Also in new hive versions snappy compression is slower or equivalent and much more space hungry than the default zip )

View solution in original post

7 REPLIES 7

avatar
Master Guru

[OVERWRITE] in a syntax diagram means that you can optionally use the OVERWRITE keyword. If you want to append data remove the keyword. If you want to overrwrite the table remove the square brackets.

( Also in new hive versions snappy compression is slower or equivalent and much more space hungry than the default zip )

avatar
Rising Star

Thanks for your answer, Im using hive 1.2.1. And I read that parquet and orc formats because they are columnar are fastest. And I want to query this data from spark later. But so, its better store the data in the table as text?

avatar
Master Guru

Use ORC in HDP ( and parquet in CDH ). They are much much faster than text. Spark also supports reading from ORC.

Some tips on how to get the best performance out of your table setup ( partitioning, predicate pushdown ... )

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

Oh and finally if you want to use mostly SQL then Hive on Tez with all new features ( CBO, Vectorization, ORC, ... ) beats the pants of Spark no questions asked. Spark is cool if you want to do something like Streaming, Data mining do some programming style data transformation or your data sets are small and you can pin them in memory for repeated queries.

avatar
Rising Star

Thanks for your help again. Im doing this and then I will compare with hive on tez to check the difference. But now I didnt understand well what you said, Im still a beginner in big data, and I read that store the tables in hive is better because then the queries are fastest because orc is a compressed format so the data size is smaller. But you are saying that dont, and we should use orc in hadoop? I have a .tbl file so I should convert that file into orc before store into hadoop?

avatar
Master Guru

Storing data in Hive is nothing but storing data in HDFS ( hadoop ). Just look in /apps/hive/warhouse/databasename/tablename

Hive is very simple that way. No magical tablespaces or other things. A table is just a folder in HDFS with some files in it and an entry in the metastore telling hive what format the data in that folder has.

So Using ORC in Hive is the same as storing ORC in HDFS. So yes make an external table on the tbl file and transform it into an ORC table. All good.

avatar
Rising Star

Thanks really. so that is what Im trying to do I guess. And your code in your first answer its working now. But when I execute a query in hive just to test if the data is inside the table lile "select * from partsupp", dont return any results, because it shows this error: "Failed: Execution error, return code 2 from org.apache.hadoop.hive.sql.exec.mr.MapRedTask". Do you have any idea for this?

avatar
Master Guru

@John CodUnfortunately too vague. Also it looks to me like you use MapReduce? May I ask which distribution you are using? If its CDH then you won't have Tez and hive will be slow. Cloudera has their own query engine Impala and are now going to Hive on Spark so they do not really support the latest of the Open Source Hive. On CDH I would go with Parquet+Impala then. ( or switch to Hive and HDP or any other OpenHadoop distribution)