Created 03-09-2016 03:42 PM
I have a dataset that is almost 600GB in Avro format in HDFS. What is the most efficient way to create a Hive table directly on this dataset?
For smaller datasets, I can move my data to disk, use Avro tools to extract the schema, upload the schema to HDFS, and create a Hive table based on that schema. Is there a way to extract the Avro schema directly from a dataset in HDFS without writing Java code?
Created 03-09-2016 04:00 PM
You can try the following: cat your large file, grab a few lines, and output them to a new file on the local FS. I'll be curious to know whether that works with Avro serialization.
Then use avro-tools to extract the schema.
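A minimal sketch of that approach; the data file name, sample path, and avro-tools jar location are placeholders, not values from this thread:

hdfs dfs -cat /user/avro/applog_avro/part-00000.avro | head --bytes 10K > /tmp/sample.avro   # first bytes include the Avro container header
java -jar avro-tools-1.7.7.jar getschema /tmp/sample.avro > /tmp/schema.avsc                 # print the schema stored in that header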
Created 03-09-2016 03:45 PM
CREATE EXTERNAL TABLE avro_hive_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/avro/applog_avro';
Created 03-09-2016 03:57 PM
Thanks @Ancil McBarnett @Sunile Manjee. I don't have a .avsc file for the schema. How can I extract the Avro schema for this data?
Created 09-23-2016 10:45 AM
Are you sure Avro-backed tables can be created as external tables? If I run your statement I get problems on the LOCATION clause; Hive does not seem to expect the LOCATION clause there.
Edit:
Never mind, you can, but the order of the clauses matters:
This works:
CREATE EXTERNAL TABLE as_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/root/as_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro.avsc');
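As a quick sanity check after creating the table (using the table name from the statement above), something like this should show the columns inferred from the schema and a few rows:

DESCRIBE FORMATTED as_avro;
SELECT * FROM as_avro LIMIT 5;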
Created 03-09-2016 03:46 PM
Try this:
CREATE EXTERNAL TABLE tableName
PARTITIONED BY (ingestiondatetime BIGINT, recordtype STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/file.avsc');
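Since the table is partitioned, each partition and its data location would still need to be registered afterwards; a sketch with hypothetical partition values and path:

ALTER TABLE tableName ADD PARTITION (ingestiondatetime=20160309, recordtype='activity')
LOCATION '/user/avro/applog_avro/20160309/activity';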
Created 09-23-2016 11:07 AM
This statement is missing the LOCATION clause, so it is not really an external table.
Created 03-09-2016 06:22 PM
Thank you @Artem Ervits @Sunile Manjee @Ancil McBarnett. I was able to get to a solution using a combination of commands:
hdfs dfs -cat $1 | head --bytes 10K > $SAMPLE_FILE
java -jar $AVRO_TOOLS_PATH/avro-tools-1.7.7.jar getschema $SAMPLE_FILE > $AVRO_SCHEMA_FILE
hdfs dfs -put $AVRO_SCHEMA_FILE $AVRO_SCHEMA_DIR
The head command needs to be used with the --bytes option to get the first 10K bytes; because an Avro container file stores its schema in the file header at the very start of the file, those first bytes are enough for avro-tools to read it. Then I used avro-tools to retrieve the schema and copied the schema back to HDFS.
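Wrapped up as a small script with the variables defined (the paths below are placeholders, not the original ones):

#!/bin/bash
# Extract the Avro schema from a sample of an HDFS dataset and upload it back to HDFS.
# All paths here are hypothetical; adjust them to your environment.
SAMPLE_FILE=/tmp/sample.avro
AVRO_TOOLS_PATH=/opt/avro
AVRO_SCHEMA_FILE=/tmp/schema.avsc
AVRO_SCHEMA_DIR=/user/avro/schemas

# $1 is the HDFS path of one Avro data file; its first 10K bytes include the header with the schema.
hdfs dfs -cat "$1" | head --bytes 10K > "$SAMPLE_FILE"
java -jar "$AVRO_TOOLS_PATH"/avro-tools-1.7.7.jar getschema "$SAMPLE_FILE" > "$AVRO_SCHEMA_FILE"
hdfs dfs -put "$AVRO_SCHEMA_FILE" "$AVRO_SCHEMA_DIR"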
Created 03-09-2016 11:25 PM
You taught me something new, awesome community!
Created 12-22-2016 05:41 AM
Perhaps this page can help you: