
Hadoop File Formats

Explorer

I need to consider how to write my data to Hadoop.

I'm using Spark and consuming messages from a Kafka topic; each message is a JSON record.

I have around 200B records per day.

The data fields may change (not a lot, but they may change in the future).

I need fast writes, fast reads, and a small footprint on disk.

What should I choose? Avro or Parquet?

If I choose Parquet/Avro, do I need to create the table with all the fields of my JSON?

If not, what is the way to create the table in Parquet format and in Avro format?
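For context, the write path I'm testing looks roughly like this (a minimal sketch; the broker address, topic name, schema fields, and HDFS paths are placeholders, and the output format is exactly the part I'm trying to decide):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()
import spark.implicits._

// Placeholder schema -- the real JSON has more fields and may gain new ones later.
val schema = new StructType()
  .add("event_id", StringType)
  .add("event_time", TimestampType)
  .add("payload", StringType)

// Read the raw JSON messages from the Kafka topic.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my_topic")
  .load()

// Parse the JSON value into columns.
val parsed = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

// Write to HDFS -- this is where I need to pick Parquet, Avro, or ORC.
parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .start()
```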

Thanks!!

1 REPLY

Expert Contributor

Hi @Ya ko,

Why not consider the new ORC?

https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487

That way you will get the best performance when querying from Hive. And yes, you have to define your table with all the fields.
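For example, defining the table with every field might look like this (a sketch only; the column names are placeholders for whatever your JSON actually contains):

```scala
// Sketch of a table declared with every field up front (placeholder column names).
spark.sql("""
  CREATE TABLE IF NOT EXISTS events_orc (
    event_id   STRING,
    event_time TIMESTAMP,
    payload    STRING
  )
  USING ORC
""")
```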

Slide 20 shows how to specify the new ORC library; you will just have to add the location setting to point to where your data will be stored in HDFS.
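As a rough sketch of those two steps (the config key is Spark 2.3's switch for its native ORC implementation; the HDFS path is a placeholder):

```scala
// Switch Spark 2.3+ to its native ORC reader/writer (the "new ORC library" from the slides).
spark.conf.set("spark.sql.orc.impl", "native")

// Point the table at the HDFS directory where the ORC data is stored (placeholder path).
spark.sql("ALTER TABLE events_orc SET LOCATION 'hdfs:///data/events_orc'")
```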

Michel