
ORC table created from Avro table, some confusions

Expert Contributor

Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1

The steps that I have executed so far:

  • Using Sqoop, imported a SQL Server table to HDFS as Avro. The resultant files are as shown:

[Image: 4124-sqoop-avro-files.png — Avro files produced by the Sqoop import]

  • Created an EXTERNAL TABLE in Hive in, say, schema_1 using ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'. This table has the data, and SELECT yields rows properly (a sketch of this DDL is shown after the ORC transcript below)
  • Copied the create script of the above table
  • Created another Hive schema, say schema_2, and in it created an empty EXTERNAL ORC table:
hive>
    > CREATE EXTERNAL TABLE `dimsampledesc`(
    >   `smapiname_ver` string COMMENT '',
    >   `smapicolname` string COMMENT '',
    >   `charttype` int COMMENT '',
    >   `x_indexet` int COMMENT '',
    >   `y_indexet` int COMMENT '',
    >   `x_tick` string COMMENT '',
    >   `y_tick` string COMMENT '',
    >   `x_tickrange` string COMMENT '',
    >   `x_tickrangefrom` string COMMENT '',
    >   `x_tickrangetom` string COMMENT '',
    >   `y_tickrange` string COMMENT '',
    >   `y_tickrangefrom` string COMMENT '',
    >   `y_tickrangetom` string COMMENT '',
    >   `indexcount` int COMMENT '',
    >   `x_indexcount` int COMMENT '',
    >   `y_indexcount` int COMMENT '',
    >   `x_symbol` string COMMENT '',
    >   `x_symbolname` string COMMENT '',
    >   `x_symboldescr` string COMMENT '',
    >   `y_symbol` string COMMENT '',
    >   `y_symbolname` string COMMENT '',
    >   `y_symboldescr` string COMMENT '',
    >   `smapiname` string COMMENT '',
    >   `incorrect_ver_fl` boolean COMMENT '')
    > STORED AS ORC
    > LOCATION
    >   '/datastore/hdfstohive/reio/odpdw/may2016/DimSampleDesc';
OK
Time taken: 0.167 seconds
hive>
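For reference, the schema_1 Avro-backed table from the earlier step was created roughly along these lines (a minimal sketch only; the LOCATION path and schema URL below are placeholders, not the actual values from my cluster):

CREATE EXTERNAL TABLE dimsampledesc
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/sqoop/avro/output/DimSampleDesc'                       -- placeholder path
TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/DimSampleDesc.avsc');   -- placeholder schema file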
  • Selected data from schema_1's Avro table and inserted it into schema_2's empty ORC table:
hive>
    > insert into dimsampledesc select * from odp_dw_may2016.dimsampledesc;
Query ID = hive_20160509165854_63961c2e-427d-4ad8-8ddb-3333c27a64ea
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1446726117927_0024)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 7.50 s
--------------------------------------------------------------------------------
Loading data to table odp_dw_may2016_orc.dimsampledesc
Table odp_dw_may2016_orc.dimsampledesc stats: [numFiles=2, numRows=46629, totalSize=333637, rawDataSize=73668605]
OK
Time taken: 8.966 seconds
hive>
  • Now, the contents of the ORC table's LOCATION are as follows:

[Image: 4125-hive-orc-location.png — files under the ORC table's LOCATION]

  • SELECT queries on the schema_2 ORC table return rows properly
  • Even if I drop the schema_1 Avro table, the schema_2 ORC table is unaffected

The ultimate objective: import all the SQL Server tables onto HDFS, experiment with different Hive formats (ORC, Avro, Parquet) and strategies (bucketing, partitioning, etc.), and finalize the approach that gives the best performance on Hive (the existing SQL Server queries will be run against Hive).
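For the partitioning and bucketing experiments, the DDL would look roughly like the following (a sketch only; the partition and bucket columns here are illustrative guesses, not choices derived from the actual data):

CREATE EXTERNAL TABLE dimsampledesc_part (
  smapicolname string,
  charttype int
  -- ... remaining columns as in the DDL above, minus the partition column
)
PARTITIONED BY (smapiname_ver string)
CLUSTERED BY (smapicolname) INTO 8 BUCKETS
STORED AS ORC
LOCATION '/path/to/DimSampleDesc_part';   -- placeholder path

-- dynamic-partition insert from the Avro-backed staging table
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
INSERT INTO TABLE dimsampledesc_part PARTITION (smapiname_ver)
SELECT smapicolname, charttype, /* ... other columns ... */ smapiname_ver
FROM odp_dw_may2016.dimsampledesc;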

Accordingly, I have the following questions:

  • The first step will always be to import data from SQL Server to HDFS, and I wish to go with the Avro format (this may help in the future in case code in other languages needs to read the data without going through Hive). Are there any pitfalls that I have failed to consider, e.g. don't use Avro at all and stick to ORC during the Sqoop import itself?
  • As evident from the images, the Avro files and the Hive ORC files (the output of the INSERT INTO ... SELECT) differ in size and format. Is this approach correct? Will it hamper the creation of tables and performance later, when partitioning and bucketing come into the picture?
  • I'm planning to drop the Avro tables once the Hive ORC tables are created; since they are external, this will keep the data intact. Can these Avro tables provide better performance than the Hive ORC tables, i.e. do I need to focus only on Hive ORC and partitioning, or should I also evaluate the SQL Server queries against the Avro tables?
  • I have not considered storing the HDFS data as text or CSV files, but does the underlying HDFS format (text, Avro) have a performance implication, for example, will data stored as text be faster than data stored as Avro?

@Kaliyug Antagonist

1) Avro is a serialization platform that defines a row-based file format. The pro of Avro is schema evolution, but I don't see a real need for this in real-world projects. In my opinion, schemas don't evolve that often, and there are more basic approaches to handling it than the file format. I will be happy to hear other opinions on this. Is schema evolution a requirement in your context?

ORC and Parquet, on the other hand, are columnar formats. Columnar file formats are best suited for storing data that will be accessed with SQL (which is your case: SQL Server data and Hive). Also, ORC has been designed to give Hive the best performance. Together with cost-based optimization and Tez, these projects introduced a lot of innovation in Hive (also known as the Stinger initiative). Benchmarks show that ORC files with Hive and Tez give the best performance.
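When you benchmark, it helps to make the relevant engine settings explicit per session so every format is compared under the same conditions (a sketch; your defaults on HDP 2.3 may already match these):

SET hive.execution.engine=tez;
SET hive.cbo.enable=true;
SET hive.vectorized.execution.enabled=true;
SET hive.stats.fetch.column.stats=true;
-- gather statistics so the cost-based optimizer has something to work with
ANALYZE TABLE odp_dw_may2016_orc.dimsampledesc COMPUTE STATISTICS;
ANALYZE TABLE odp_dw_may2016_orc.dimsampledesc COMPUTE STATISTICS FOR COLUMNS;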

I really like your approach of testing and benchmarking the different options. Even if published benchmarks give you an overview for comparing performance, the best thing to do is to test with your own data and your own queries so the result is meaningful for your business. You can share your results with us afterwards by writing an article.

2) The size difference is normal since the file formats are different. Look into the compression options for each file format too.
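For example (a sketch; ZLIB and Snappy are common choices, not recommendations): with ORC the codec is a table property, while for Avro written from Hive it is set on the session before writing:

-- ORC: compression codec as a table property (ZLIB is the default)
CREATE TABLE dimsampledesc_orc_zlib
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM odp_dw_may2016.dimsampledesc;

-- Avro: compression codec set on the session before the insert
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;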

3) I don't see any problem in dropping the Avro tables. The data is still there, so you won't lose much. ORC will be more efficient, as I said (do your benchmarks).

4) Storing as text will be faster to write since there is less work to do at write time. However, practice shows that read time is what matters most: you will write your data once, but you will read it several times, maybe every day.

Hope this helps

Expert Contributor

In our team, we all agree that ORC is the way to go, but my main concern is the format in which the data should be imported onto HDFS. I didn't Sqoop-import directly into Hive because:

  • I want to keep the imported data intact; even if I delete everything in Hive and start afresh, the data should still be available
  • I don't know how to achieve the above and create an EXTERNAL Hive table directly from Sqoop, hence I chose to first import in Avro format and then move to ORC
  • In the above approach, after the ORC table is populated by a SELECT * from the Avro table, does this ORC table have any dependency on the underlying Avro files? I don't think so, as the data has been inserted into the Hive ORC table and is now 100% ORC; can you confirm or reject this (see the sketch below)?
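One way I thought of checking this myself (a sketch; the values shown are what I would expect to see, not captured output): after the insert, the ORC table should point only at its own location and input format, with no reference back to the Avro files:

DESCRIBE FORMATTED odp_dw_may2016_orc.dimsampledesc;
-- Location:       /datastore/hdfstohive/reio/odpdw/may2016/DimSampleDesc
-- InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
-- SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde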