Support Questions


Is there a way to create a Hive table based on Avro data directly?


I have a dataset of almost 600 GB in Avro format in HDFS. What is the most efficient way to create a Hive table directly on this dataset?

For smaller datasets, I can move the data to local disk, use avro-tools to extract the schema, upload the schema to HDFS, and create a Hive table based on that schema. Is there a way to extract the Avro schema directly from a dataset in HDFS without writing Java code?
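For reference, the small-dataset workflow described above looks roughly like this (the paths and file names are placeholders, not my real ones):

# Copy a small Avro file out of HDFS to local disk
hdfs dfs -get /data/small/part-00000.avro /tmp/sample.avro
# Extract its schema with avro-tools
java -jar avro-tools-1.7.7.jar getschema /tmp/sample.avro > /tmp/sample.avsc
# Upload the schema to HDFS so Hive can reference it
hdfs dfs -put /tmp/sample.avsc /user/avro/schemas/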

1 ACCEPTED SOLUTION

Master Mentor

You can try the following: cat your large file and grab the first few lines into a new file on the local FS. I'll be curious to know whether that works with Avro serialization.

http://stackoverflow.com/questions/22852063/how-to-copy-first-few-lines-of-a-large-file-in-hadoop-to...

Then use avro-tools to extract schema.
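A minimal sketch of that idea, assuming GNU head is available (Avro files are binary, so grabbing a byte prefix is safer than grabbing lines; the paths are placeholders):

# Stream the first ~1 MB of the large HDFS file into a local sample
hdfs dfs -cat /user/avro/applog_avro/part-00000.avro | head -c 1M > /tmp/sample.avro
# Extract the schema from the sample
java -jar avro-tools-1.7.7.jar getschema /tmp/sample.avro > /tmp/schema.avsc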


10 REPLIES

CREATE EXTERNAL TABLE avro_hive_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/avro/applog_avro';


Thanks @Ancil McBarnett @Sunile Manjee. I don't have a .avsc file for the schema. How can I extract the Avro schema from this data?

Super Collaborator

@Ancil McBarnett

Are you sure Avro-backed tables can be created as external tables? If I run your statement, I get errors on the LOCATION clause; it seems Hive does not expect it there.

Edit:

Never mind, you can, but the order of the clauses matters:

This works:

CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/root/as_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro.avsc');
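To verify, Hive should derive the columns from the schema file; a quick sanity check on the table above:

-- Columns should match the fields declared in avro.avsc
DESCRIBE as_avro;
-- Confirm the underlying data is readable
SELECT * FROM as_avro LIMIT 10;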

Master Guru

Try this:

CREATE EXTERNAL TABLE tableName
PARTITIONED BY (ingestiondatetime BIGINT, recordtype STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/file.avsc');

Super Collaborator

This statement is missing the LOCATION clause, so it does not point the external table at the existing data.



Thank You @Artem Ervits @Sunile Manjee @Ancil McBarnett. I was able to get my solution using a combination of commands.

# Grab the first 10 KB of the Avro file as a local sample
hdfs dfs -cat $1 | head --bytes 10K > $SAMPLE_FILE
# Extract the Avro schema from the sample
java -jar $AVRO_TOOLS_PATH/avro-tools-1.7.7.jar getschema $SAMPLE_FILE > $AVRO_SCHEMA_FILE
# Copy the schema back into HDFS
hdfs dfs -put $AVRO_SCHEMA_FILE $AVRO_SCHEMA_DIR

The head command needs to be used with the --bytes option to take the first 10 KB (Avro data files store their schema in the file header, so a byte prefix is enough). Then I used avro-tools to retrieve the schema and copied the schema back to HDFS.
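From there, creating the table works as in the statements above; a sketch with placeholder names (my_avro_table and both paths are hypothetical):

CREATE EXTERNAL TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/dataset'
TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schema.avsc');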

Master Mentor

You taught me something new, awesome community!


Perhaps this page can help you:

create hive table based avro