Converting Hive ORC data to AVRO and JSON format?
Labels: Apache Hive
Created ‎01-08-2018 10:09 PM
Hi,
Our data resides in Hive in ORC format, and we need to convert it to Avro and JSON format. Is there a way to achieve this conversion?
Created ‎01-08-2018 10:50 PM
Hi,
The simplest way is probably to:
create two Hive tables in JSON and Avro format using the correct SerDe (CREATE TABLE ... ROW FORMAT SERDE ... / STORED AS AVRO)
then INSERT OVERWRITE into them from the original ORC table
SerDes:
https://github.com/rcongiu/Hive-JSON-Serde
https://orc.apache.org/docs/tools.html
Also, the orc-contents tool offers basic ORC-to-JSON conversion, file by file: https://orc.apache.org/docs/tools.html
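The two steps above might look like the following in HiveQL. Table names are placeholders, and the CTAS form combines the create and the insert into one statement (with separate steps you would instead CREATE the target table and then run INSERT OVERWRITE from the ORC table):

```sql
-- Avro copy (Hive 0.14+ understands STORED AS AVRO natively)
CREATE TABLE my_table_avro STORED AS AVRO
  AS SELECT * FROM my_table_orc;

-- JSON copy via the JsonSerDe from the first link above
CREATE TABLE my_table_json
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  STORED AS TEXTFILE
  AS SELECT * FROM my_table_orc;
```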
Rgds,
Created ‎01-08-2018 11:33 PM
@Benoit Rousseau Thanks for looking into it. We don't intend to create another table; the idea is to do the conversion on the fly using PySpark, without using any other tool. Any help or suggestions on converting the data using PySpark would be appreciated.
Created ‎01-09-2018 11:37 PM
Hi,
If you just want an ephemeral table, you can use CREATE TEMPORARY EXTERNAL TABLE.
Using Spark is also possible once the ORC data is loaded as an RDD/DataFrame.
JSON: rdd.toDF.toJSON.saveAsTextFile()
Avro: import com.databricks.spark.avro._; rdd.toDF.write.avro()
Here is a nice gist that explains it for Scala:
https://gist.github.com/mannharleen/b1f2e60457cb2b08a2f14db40b7ffa0f
Writing JSON in PySpark is df.write.format('json').save()
Here is the API for spark-avro, available in Scala and Python: https://github.com/databricks/spark-avro
Writing Avro in PySpark is df.write.format("com.databricks.spark.avro").save()
Created ‎01-10-2018 05:17 AM
In Apache NiFi, use the SelectHiveQL processor and then ConvertRecord to JSON. When you pull the data out of the table, it is automatically Avro with a schema.
