Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How can I upload ORC files to Hive?

avatar
Contributor

Hello community,

How can I upload ORC files to Hive?

I was given an ORC file to import into hive. The upload table functionality in Ambari, which I always used, supports only csv, json and xml.

I thought of saving the file in HDFS in the respective database directory and then create table in hive and load the data. Unfortunately, I cannot open the the file (I have only notepad) to see it's schema.

Thank you.

1 ACCEPTED SOLUTION

avatar
Guru

If you already have data in ORC format, you can just create a Hive table on top of that, probably as an external table.

As you don't currently have the schema with which to create your table, you will need to get hold of this. Fortunately, the schema is embedded in the ORC file, and you can read it with the orc dump utlity. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUti... for info on how to get the schema out of your ORC file so you can create a table.

View solution in original post

4 REPLIES 4

avatar
Guru

If you already have data in ORC format, you can just create a Hive table on top of that, probably as an external table.

As you don't currently have the schema with which to create your table, you will need to get hold of this. Fortunately, the schema is embedded in the ORC file, and you can read it with the orc dump utlity. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUti... for info on how to get the schema out of your ORC file so you can create a table.

avatar
Master Guru

Damn not fast enough, was about to write this, you get the column counts, types and some statistics out of it, you will have to invent the column names though.

[root@sandbox ~]# hadoop fs -ls /apps/hive/warehouse/torc
Found 2 items
-rwxrwxrwx   3 root hdfs      16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0
-rwxrwxrwx   3 root hdfs      16653 2016-03-14 15:35 /apps/hive/warehouse/torc/000000_0_copy_1
[root@sandbox ~]# hive --orcfiledump /apps/hive/warehouse/torc/000000_0
WARNING: Use "yarn jar" to launch YARN applications.
Processing data file /apps/hive/warehouse/torc/000000_0 [length: 16653]
Structure for /apps/hive/warehouse/torc/000000_0
File Version: 0.12 with HIVE_8732
16/07/27 10:57:36 INFO orc.ReaderImpl: Reading ORC rows from /apps/hive/warehouse/torc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
16/07/27 10:57:36 INFO orc.RecordReaderFactory: Schema is not specified on read. Using file schema.
Rows: 823
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string,_col2:int,_col3:int>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 823 hasNull: false
    Column 1: count: 823 hasNull: false min: 00-0000 max: 53-7199 sum: 5761
    Column 2: count: 823 hasNull: false min: Accountants and auditors max: Zoologists and wildlife biologists sum: 28550
    Column 3: count: 823 hasNull: false min: 340 max: 134354250 sum: 403062800
Column 4: count: 819 hasNull: true min: 16700 max: 192780 sum: 39282210

avatar
Contributor

Thanx Simon,

do you know perhaps of a tool that can enable me to read the ORC files in windows environment (I am not supposed to have access to Linux terminal)? and even hopefully automate the data (I have hundrends of ORC files)?

avatar

You can create external table and map schema and move file to HDFS,

CREATE EXTERNAL TABLE IF NOT EXISTS Cars(
        Name STRING, 
        Miles_per_Gallon INT,
        Cylinders INT,
        Displacement INT,
        Horsepower INT, 
        Weight_in_lbs INT,
        Acceleration DECIMAL,
        Year DATE,
        Origin CHAR(1))
    COMMENT 'Data about cars from a public database'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    location '/user/<username>/visdata';
hdfs dfs -copyFromLocal cars.csv /user/<username>/visdata

Now create ORC table :-

CREATE TABLE IF NOT EXISTS mycars(
        Name STRING, 
        Miles_per_Gallon INT,
        Cylinders INT,
        Displacement INT,
        Horsepower INT, 
        Weight_in_lbs INT,
        Acceleration DECIMAL,
        Year DATE,
        Origin CHAR(1))
    COMMENT 'Data about cars from a public database'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS ORC;

Insert the data from the external table to the Hive ORC table.

INSERT OVERWRITE TABLE mycars SELECT * FROM cars;