Support Questions

Find answers, ask questions, and share your expertise

Is there a way to create a Hive table based on Avro data directly?


I have a dataset of almost 600 GB in Avro format in HDFS. What is the most efficient way to create a Hive table directly on this dataset?

For smaller datasets, I can copy the data to local disk, use Avro tools to extract the schema, upload the schema to HDFS, and create a Hive table based on that schema. Is there a way to extract the Avro schema directly from a dataset in HDFS without writing Java code?

1 ACCEPTED SOLUTION

Master Mentor

You can try the following: cat your large file, grab the first few lines, and write them to a new file on the local filesystem. I'll be curious to know whether that works with Avro serialization.

http://stackoverflow.com/questions/22852063/how-to-copy-first-few-lines-of-a-large-file-in-hadoop-to...

Then use avro-tools to extract the schema.


10 REPLIES

CREATE EXTERNAL TABLE avro_hive_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/avro/applog_avro';
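
To sanity-check that Hive picked the schema up from the .avsc file, you can inspect the table afterwards; these are standard Hive commands, nothing specific to this dataset:

-- List the columns Hive derived from the Avro schema
DESCRIBE avro_hive_table;

-- Show SerDe, location, and table properties for troubleshooting
DESCRIBE FORMATTED avro_hive_table;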


Thanks @Ancil McBarnett @Sunile Manjee. I don't have a .avsc file for the schema. How can I extract the Avro schema from this data?

Super Collaborator

@Ancil McBarnett

Are you sure Avro-backed tables can be created as external tables? If I run your statement, I get errors on the LOCATION clause; it seems Hive does not expect the LOCATION clause there.

Edit:

Never mind, you can, but the order of the clauses matters:

This works:

CREATE EXTERNAL TABLE as_avro 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
LOCATION '/user/root/as_avro' 
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro.avsc');
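
As an aside, if you would rather not host a schema file on HDFS at all, the Avro SerDe also accepts the schema inline through the avro.schema.literal property. A minimal sketch with a hypothetical two-field schema (the table name and fields are made up for illustration):

CREATE EXTERNAL TABLE as_avro_inline
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/root/as_avro'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"event","fields":[{"name":"id","type":"long"},{"name":"name","type":"string"}]}');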

Master Guru

Try this:

CREATE EXTERNAL TABLE tableName
PARTITIONED BY (ingestiondatetime BIGINT, recordtype STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/file.avsc');
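
One caveat with the partitioned variant: an external partitioned table shows no data until its partitions are registered. A minimal sketch, with placeholder partition values and path:

-- Register a single partition; the values and path are placeholders
ALTER TABLE tableName ADD PARTITION (ingestiondatetime=20160101, recordtype='activity')
LOCATION '/user/data/20160101/activity';

-- Or, if the data is laid out as key=value directories under the table location,
-- let Hive discover the partitions itself
MSCK REPAIR TABLE tableName;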

Super Collaborator

This statement is missing the LOCATION clause, so it is not really an external table over the existing data.
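
For completeness, the same statement with a LOCATION clause added would look like this (the path is a placeholder):

CREATE EXTERNAL TABLE tableName
PARTITIONED BY (ingestiondatetime BIGINT, recordtype STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/avro/applog_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/file.avsc');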


Thank you @Artem Ervits @Sunile Manjee @Ancil McBarnett. I was able to get to a solution using a combination of commands:

# Take the first 10 KB of the Avro file; this prefix includes the header, which carries the schema
hdfs dfs -cat $1 | head --bytes 10K > $SAMPLE_FILE 
# Extract the schema from the sample with avro-tools
java -jar $AVRO_TOOLS_PATH/avro-tools-1.7.7.jar getschema $SAMPLE_FILE > $AVRO_SCHEMA_FILE 
# Upload the schema to HDFS so it can be referenced via avro.schema.url
hdfs dfs -put $AVRO_SCHEMA_FILE $AVRO_SCHEMA_DIR

The head command needs the --bytes option to take the first 10K bytes rather than lines, since an Avro file is binary and has no meaningful line structure. That prefix is enough because an Avro container file stores its schema in the file header. I then used Avro tools to retrieve the schema and copied the schema file back to HDFS.

Master Mentor

You taught me something new, awesome community!

New Contributor

Perhaps this page can help you:

Create Hive table based on Avro