Created 01-08-2018 10:09 PM
Hi,
Our data resides in Hive tables stored in ORC format. We need to convert this data to Avro and JSON format. Is there a way to achieve this conversion?
Created 01-08-2018 10:50 PM
Hi
The simplest way is probably to:
create two Hive tables in JSON and Avro format using the correct SerDe (CREATE TABLE ... ROW FORMAT SERDE ...),
then INSERT OVERWRITE into them from the original ORC table, as sketched below.
SerDes:
https://github.com/rcongiu/Hive-JSON-Serde
https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
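A minimal sketch of those two steps, here wrapped in spark.sql so it can also run from a Spark session; the src_orc table, its (id, name) schema, and the target table names are made-up placeholders, and the JsonSerDe jar from the first link must be on the classpath:

from pyspark.sql import SparkSession

# Hive support is assumed; all table names and the (id, name) schema are placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# JSON-backed table using the openx JsonSerDe linked above (its jar must be available).
spark.sql("""
    CREATE TABLE src_json (id INT, name STRING)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE
""")
spark.sql("INSERT OVERWRITE TABLE src_json SELECT * FROM src_orc")

# Avro-backed table: STORED AS AVRO wires up the AvroSerDe automatically (Hive 0.14+).
spark.sql("CREATE TABLE src_avro (id INT, name STRING) STORED AS AVRO")
spark.sql("INSERT OVERWRITE TABLE src_avro SELECT * FROM src_orc")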
Also, the orc-contents tool offers basic ORC to JSON conversion, file by file: https://orc.apache.org/docs/tools.html
Rgds,
Created 01-08-2018 11:33 PM
@Benoit Rousseau Thanks for looking into it. We don't intend to create another table. The idea is to do the conversion on the fly using PySpark, and we don't want to use any other tool either. Any help or suggestions on converting the data using PySpark would be appreciated.
Created 01-09-2018 11:37 PM
Hi,
If you just want an ephemeral table, you can use CREATE TEMPORARY EXTERNAL TABLE.
Using Spark is also possible once the ORC data is loaded as an RDD:
JSON: rdd.toDF.toJSON.rdd.saveAsTextFile(path) (the extra .rdd is needed on Spark 2.x, where toJSON returns a Dataset)
Avro: import com.databricks.spark.avro._; rdd.toDF.write.avro(path)
Here is a nice gist that explains it for Scala:
https://gist.github.com/mannharleen/b1f2e60457cb2b08a2f14db40b7ffa0f
Writing JSON in PySpark is df.write.format('json').save(path).
Here is the API for spark-avro, available in Scala and Python: https://github.com/databricks/spark-avro
Writing Avro in PySpark is df.write.format("com.databricks.spark.avro").save(path).
Created 01-10-2018 05:17 AM
In Apache NiFi, use the SelectHiveQL processor and then ConvertRecord to convert to JSON. When SelectHiveQL pulls the data out of the table, it's automatically Avro with a schema.