Support Questions

Find answers, ask questions, and share your expertise

how to create and store the avro files in hive table?

avatar

i was trying to create hive table for storing avro file and i have stored my avro shema(.avsc file),my avro file in single location.

could anyone help me to create the table in hive?

1 ACCEPTED SOLUTION

avatar
Master Guru

If you already have your Avro file and Avro schema, upload them to HDFS and use

CREATE EXTERNAL TABLE my_avro_tbl
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/...'
  TBLPROPERTIES ('avro.schema.url'='hdfs://name-node.fqdn:8020/user/.../schema.avsc');

If your Avro file already contains the schema in its header you can just say

CREATE EXTERNAL TABLE tbl-name(... declarations ...) STORED AS AVRO LOCATION '...';

without specifying the schema. I have been testing it last few days and can confirm that it works on HDP-2.4 (Hive-1.2) for all scalar types like string, int, float, double, boolean etc. If you are using some complex types (like union) it might not work.

View solution in original post

13 REPLIES 13

avatar

Hi @manikandan ayyasamy,

Have you looked to the Hive doc on AvroSerDe : https://cwiki.apache.org/confluence/display/Hive/AvroSerDe ? you have examples on how to do it following your Hive version.

Please try to implement it and ask the community if you have a particular problem with your approach.

Note also that ORC can give the best performance with cost base optimization and Vectorization. Give it a try if this is a viable option.

avatar

i have tried the following,but i am not able to create the table.

create external table tweets(name string,time int,tweet string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.AvroSerDe'

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'

OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

TBLPROPERTIES ( 'avro.schema.url'='http://trvlhaddv2gw1.tsh.thomson.com:8888/filebrowser/view/user/c400351/tweet.avsc');

i did not get any error messages.

avatar

What's the problem you are facing If you didn't get any error messages ? can you see the table tweets when you do :

show tables;

avatar

@manikandan ayyasamy

Ok, from reading your several posts and comments I think that Hive is not able to access your schema on HDFS. I see that you are trying to give the schema that's stored on HDFS. However, you are using http and port 8888 which is not good. Can you try with this modification

TBLPROPERTIES ('avro.schema.url'='hdfs://trvlhaddv2gw1.tsh.thomson.com:8020/PATH_TO_YOUR_AVSC_FILE');

avatar
Contributor

Hi Mani,

Below are my observations

1. specifying schema in your DDL and also the Avro schema. i.e column names as part of DDL and Avro URL. use only one.

2. Is the schema in provided URL accessible from HDFS? you can move it HDFS and test

3. The serde is pointing to hcatalog, try using org.apache.hadoop.hive.serde2.avro.AvroSerDe

avatar
Master Guru

You are missing the "LOCATION" for your external table. Uplaod your Avro file somewhere in HDFS and provide the directory where your Avro file is located as your LOCATION. There is also no need to declare name, time etc., they are given in your avsc file. See my answer. After that try "SELECT * FROM tweets LIMIT 10;".

avatar
Master Guru

If you already have your Avro file and Avro schema, upload them to HDFS and use

CREATE EXTERNAL TABLE my_avro_tbl
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/...'
  TBLPROPERTIES ('avro.schema.url'='hdfs://name-node.fqdn:8020/user/.../schema.avsc');

If your Avro file already contains the schema in its header you can just say

CREATE EXTERNAL TABLE tbl-name(... declarations ...) STORED AS AVRO LOCATION '...';

without specifying the schema. I have been testing it last few days and can confirm that it works on HDP-2.4 (Hive-1.2) for all scalar types like string, int, float, double, boolean etc. If you are using some complex types (like union) it might not work.

avatar
Rising Star

@Predrag Minovic So I've been battling with Avro and Hive for days and am getting nowhere. See thread here and here. If you can shed any light, it would be greatly appreciated.

avatar
Master Guru

Mark, here is what I did: Having a csv file as an input I created a Hive text table, loaded csv file onto the text table, created an Avro table "STORED AS AVRO", and inserted all recored from the text table into the avro table. I also tried to create an Avro table using a schema file. Then I tested external tables using avro table data and scheme file. I'll create an article and post this week.