create hive orc table
Labels: Apache Hive
Created 05-08-2016 01:28 PM
I'm trying to create a table in Hive in ORC format and load it with data that I have in a ".tbl" file. In the ".tbl" file each row has this format:
1|Customer#000000001|IVhzIApeRb ot,c,E|15|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
I create a Hive table in ORC format like this:
CREATE TABLE IF NOT EXISTS partsupp (
  PS_PARTKEY BIGINT,
  PS_SUPPKEY BIGINT,
  PS_AVAILQTY INT,
  PS_SUPPLYCOST DOUBLE,
  PS_COMMENT STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");
Now I'm trying to load data into the table like this:
LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' [OVERWRITE] INTO TABLE partsupp;
My questions are: do you know if this is a correct method? And if it is, do you know why this error happens when I run the LOAD DATA INPATH command?
Failed: Parse exception mismatched input '[' expecting into near '/tables/partsupp/partsupp.tbl in load statement
Created 05-08-2016 03:16 PM
[OVERWRITE] in a syntax diagram means that you can optionally use the OVERWRITE keyword. If you want to append data, remove the keyword entirely. If you want to overwrite the table, keep the keyword but remove the square brackets.
(Also, in newer Hive versions, Snappy compression is slower than or equivalent to the default ZLIB, and much more space-hungry.)
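In other words, the two valid forms are (using the path from the question; note that LOAD DATA only moves the file into the table's folder, it does not convert the text data to ORC):

LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp;
LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' OVERWRITE INTO TABLE partsupp;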
Created 05-08-2016 03:32 PM
Thanks for your answer. I'm using Hive 1.2.1, and I read that the Parquet and ORC formats are fastest because they are columnar. And I want to query this data from Spark later. But so, is it better to store the data in the table as text?
Created 05-08-2016 03:50 PM
Use ORC in HDP (and Parquet in CDH). They are much, much faster than text. Spark also supports reading from ORC.
Here are some tips on how to get the best performance out of your table setup (partitioning, predicate pushdown, ...):
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
Oh, and finally: if you want to use mostly SQL, then Hive on Tez with all the new features (CBO, vectorization, ORC, ...) beats the pants off Spark, no questions asked. Spark is cool if you want to do something like streaming or data mining, do programming-style data transformations, or if your data sets are small and you can pin them in memory for repeated queries.
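As a rough sketch, switching engines is just a session setting in Hive 1.x (standard property names, shown here for illustration):

SET hive.execution.engine=tez;              -- run on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true; -- batched row processing, pays off with ORC
SET hive.cbo.enable=true;                   -- cost-based optimizer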
Created 05-08-2016 05:48 PM
Thanks for your help again. I'm doing this, and then I will compare with Hive on Tez to check the difference. But I didn't understand well what you said; I'm still a beginner in big data. I read that storing the tables in Hive is better because then the queries are faster, since ORC is a compressed format so the data size is smaller. But are you saying that's not it, and we should use ORC in Hadoop? I have a .tbl file, so should I convert that file into ORC before storing it in Hadoop?
Created 05-08-2016 06:25 PM
Storing data in Hive is nothing but storing data in HDFS (Hadoop). Just look in /apps/hive/warehouse/databasename/tablename.
Hive is very simple that way. No magical tablespaces or other things. A table is just a folder in HDFS with some files in it, plus an entry in the metastore telling Hive what format the data in that folder has.
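For example, you can find the folder behind any table from Hive itself (using the partsupp table from the question):

DESCRIBE FORMATTED partsupp;  -- the Location: field shows the table's HDFS folder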
So using ORC in Hive is the same as storing ORC in HDFS. So yes: create an external table on the .tbl file and transform it into an ORC table. All good.
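A minimal sketch of that two-step approach, assuming the '|'-delimited partsupp.tbl from the question has been uploaded to /tables/partsupp in HDFS (the staging table name partsupp_text is just an example):

-- 1. External text table that points at the existing '|'-delimited file
CREATE EXTERNAL TABLE IF NOT EXISTS partsupp_text (
  PS_PARTKEY BIGINT,
  PS_SUPPKEY BIGINT,
  PS_AVAILQTY INT,
  PS_SUPPLYCOST DOUBLE,
  PS_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/tables/partsupp';

-- 2. Rewrite the text data into the ORC table in one pass
INSERT OVERWRITE TABLE partsupp
SELECT * FROM partsupp_text;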
Created 05-08-2016 06:47 PM
Thanks, really. So that is what I'm trying to do, I guess. And the code in your first answer is working now. But when I execute a query in Hive just to test whether the data is inside the table, like "select * from partsupp", it doesn't return any results, because it shows this error: "Failed: Execution error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask". Do you have any idea about this?
Created 05-09-2016 11:26 AM
@John Cod Unfortunately that is too vague. Also, it looks to me like you are using MapReduce? May I ask which distribution you are using? If it's CDH then you won't have Tez and Hive will be slow. Cloudera has their own query engine, Impala, and is now moving to Hive on Spark, so they do not really support the latest of the open-source Hive. On CDH I would go with Parquet + Impala then (or switch to Hive and HDP or any other open Hadoop distribution).
