
Converting Hive ORC data to AVRO and JSON format?

Contributor

Hi,

Our data resides in Hive in ORC format, and we need to convert it to AVRO and JSON format. Is there a way to achieve this conversion?

4 REPLIES

Explorer

Hi

The simplest way is probably to:

create two Hive tables in JSON and Avro format using the correct SerDe (CREATE TABLE ... ROW FORMAT SERDE ... / STORED AS AVRO),

then INSERT OVERWRITE them from the original ORC table (a sketch follows the links below).

SERDE:

https://github.com/rcongiu/Hive-JSON-Serde

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
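
For illustration, here is a minimal sketch of that two-table approach driven from PySpark via spark.sql (the same statements work in the Hive shell). The database/table names are placeholders, and the JSON table assumes the Hive-JSON-Serde jar from the first link is on the classpath; a CTAS collapses the create-then-INSERT OVERWRITE into one statement per format:

from pyspark.sql import SparkSession

# Hive-enabled session so the DDL below runs against the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Avro copy: the AvroSerDe ships with Hive, so STORED AS AVRO is enough.
spark.sql("""
    CREATE TABLE mydb.events_avro STORED AS AVRO
    AS SELECT * FROM mydb.events_orc
""")

# JSON copy: uses the openx JsonSerDe from the first link above.
spark.sql("""
    CREATE TABLE mydb.events_json
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE
    AS SELECT * FROM mydb.events_orc
""")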

Also, the ORC tools (orc-contents) offer basic ORC-to-JSON conversion, file by file: https://orc.apache.org/docs/tools.html

Rgds,

Contributor

@Benoit Rousseau Thanks for looking into it. We don't intend to create another table; the idea is to do the conversion on the fly using PySpark, without using any other tool. Any help/suggestions on converting the data using PySpark would be appreciated.

Explorer

@Vijay Parmar

Hi,

If you just want an ephemeral table, you can use CREATE TEMPORARY EXTERNAL TABLE.

Using Spark is also possible once the ORC data is loaded as an RDD.

JSON: rdd.toDF.toJSON.saveAsTextFile("/path/json_out")

Avro: import com.databricks.spark.avro._; rdd.toDF.write.avro("/path/avro_out")

Here is a nice gist that explains it for Scala:

https://gist.github.com/mannharleen/b1f2e60457cb2b08a2f14db40b7ffa0f

Writing JSON in PySpark is df.write.format('json').save('/path/json_out').

Here is the API for spark-avro, available in Scala and Python: https://github.com/databricks/spark-avro

Writing Avro in PySpark is

df.write.format("com.databricks.spark.avro").save("/path/avro_out")

Master Guru

In Apache NiFi, use the SelectHiveQL processor and then ConvertRecord to convert to JSON. When SelectHiveQL pulls the data out of the table, it is already Avro with an embedded schema.

https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.h...

https://community.hortonworks.com/articles/149891/handling-hl7-records-and-storing-in-apache-hive-fo...