Created 07-27-2016 10:26 AM
Hello community,
How can I upload ORC files to Hive?
I was given an ORC file to import into Hive. The upload table functionality in Ambari, which I have always used, supports only CSV, JSON and XML.
I thought of saving the file in HDFS in the respective database directory and then creating a table in Hive and loading the data. Unfortunately, I cannot open the file (I only have Notepad) to see its schema.
Thank you.
Created 07-27-2016 10:46 AM
If you already have data in ORC format, you can just create a Hive table on top of that, probably as an external table.
As you don't currently have the schema with which to create your table, you will need to get hold of it. Fortunately, the schema is embedded in the ORC file, and you can read it with the ORC file dump utility. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUti... for info on how to get the schema out of your ORC file so you can create a table.
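For reference, here is a minimal sketch of those two steps, assuming the file has already been copied from your workstation; the HDFS path and file name below are placeholders for illustration, not anything from your environment:

# copy the ORC file into an HDFS directory (path and file name are hypothetical)
hdfs dfs -mkdir -p /user/<username>/orcdata
hdfs dfs -copyFromLocal mydata.orc /user/<username>/orcdata/

# read the embedded schema with the ORC file dump utility
hive --orcfiledump /user/<username>/orcdata/mydata.orc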
Created 07-27-2016 10:59 AM
Damn, not fast enough; I was about to write the same. You get the column count, types and some statistics out of it; you will have to invent the column names yourself, though.
[root@sandbox ~]# hadoop fs -ls /apps/hive/warehouse/torc
Found 2 items
-rwxrwxrwx   3 root hdfs      16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0
-rwxrwxrwx   3 root hdfs      16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0_copy_1
[root@sandbox ~]# hive --orcfiledump /apps/hive/warehouse/torc/000000_0
WARNING: Use "yarn jar" to launch YARN applications.
Processing data file /apps/hive/warehouse/torc/000000_0 [length: 16653]
Structure for /apps/hive/warehouse/torc/000000_0
File Version: 0.12 with HIVE_8732
16/07/27 10:57:36 INFO orc.ReaderImpl: Reading ORC rows from /apps/hive/warehouse/torc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
16/07/27 10:57:36 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 823
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 823 hasNull: false
    Column 1: count: 823 hasNull: false min: 00-0000 max: 53-7199 sum: 5761
    Column 2: count: 823 hasNull: false min: Accountants and auditors max: Zoologists and wildlife biologists sum: 28550
    Column 3: count: 823 hasNull: false min: 340 max: 134354250 sum: 403062800
    Column 4: count: 819 hasNull: true min: 16700 max: 192780 sum: 39282210
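Based on that dump (Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>), a sketch of a matching external table could look like the following; the table name, column names and HDFS location are invented for illustration, since only the types can be recovered from the file:

-- column names are made up; only the types come from the ORC dump
CREATE EXTERNAL TABLE IF NOT EXISTS occupations_orc(
  occ_code STRING,
  occ_title STRING,
  tot_emp INT,
  annual_wage INT)
STORED AS ORC
LOCATION '/user/<username>/orcdata';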
Created 07-27-2016 03:28 PM
Thanks Simon,
Do you know of a tool that would enable me to read ORC files in a Windows environment (I am not supposed to have access to a Linux terminal)? And ideally a way to automate the import, since I have hundreds of ORC files?
Created 07-27-2016 10:48 AM
You can create an external table, map the schema, and move the file to HDFS:
CREATE EXTERNAL TABLE IF NOT EXISTS Cars(
  Name STRING,
  Miles_per_Gallon INT,
  Cylinders INT,
  Displacement INT,
  Horsepower INT,
  Weight_in_lbs INT,
  Acceleration DECIMAL,
  Year DATE,
  Origin CHAR(1))
COMMENT 'Data about cars from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/<username>/visdata';
hdfs dfs -copyFromLocal cars.csv /user/<username>/visdata
Now create the ORC table:
CREATE TABLE IF NOT EXISTS mycars(
  Name STRING,
  Miles_per_Gallon INT,
  Cylinders INT,
  Displacement INT,
  Horsepower INT,
  Weight_in_lbs INT,
  Acceleration DECIMAL,
  Year DATE,
  Origin CHAR(1))
COMMENT 'Data about cars from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
Insert the data from the external table to the Hive ORC table.
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
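As a quick sanity check (my own suggestion, not part of the original steps), you can compare the row counts of the two tables and confirm the storage format of the new one:

SELECT COUNT(*) FROM cars;
SELECT COUNT(*) FROM mycars;
DESCRIBE FORMATTED mycars;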