
Hadoop file formats Read/Write
I need to decide how to write my data to Hadoop.

I'm using Spark and consuming messages from a Kafka topic; each message is a JSON record.

I have around 200B records per day.

The data fields may change (not a lot, but they may change in the future).

I need fast writes, fast reads, and a small footprint on disk.

What should I choose? Avro or Parquet?

If I choose Parquet/Avro, do I need to create the table with all the fields of my JSON?

If not, how do I create the table in Parquet format and in Avro format?
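For reference, a minimal sketch of what such tables could look like in Hive DDL. The table names, column names, and HDFS locations below are placeholders, not something from this thread:

```sql
-- Hypothetical Parquet-backed table; columns and location are placeholders
CREATE EXTERNAL TABLE events_parquet (
  event_id   STRING,
  event_time BIGINT,
  payload    STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/data/events_parquet';

-- Hypothetical Avro-backed table; alternatively, the schema can be
-- supplied from an external .avsc file via the avro.schema.url property
CREATE EXTERNAL TABLE events_avro (
  event_id   STRING,
  event_time BIGINT,
  payload    STRING
)
PARTITIONED BY (dt STRING)
STORED AS AVRO
LOCATION '/data/events_avro';
```

`STORED AS PARQUET` and `STORED AS AVRO` are shorthands available in modern Hive versions; both formats support adding columns later, which matters if the JSON fields evolve.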



Re: Hadoop file formats Read/Write

Super Guru

If you need fast reads and writes, I recommend using HBase instead. Store the JSON in a table column. You can create a Hive table on top of the HBase table at any point using the Hive HBase SerDe. All of this is included in your HDP stack.
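As a sketch of the Hive-on-HBase approach mentioned above, using the standard `HBaseStorageHandler`; the table name `messages` and column family `cf` are hypothetical:

```sql
-- Hypothetical Hive table mapped onto an existing HBase table;
-- ':key' maps the HBase row key, 'cf:json' maps the JSON column
CREATE EXTERNAL TABLE messages_hbase (
  rowkey STRING,
  json   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:json')
TBLPROPERTIES ('hbase.table.name' = 'messages');
```

This lets you query the raw JSON with Hive (e.g. via `get_json_object`) while HBase handles the fast point reads and writes.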

Using HDFS for fast reads/writes is setting yourself up for failure. Use the correct tool based on your use case.

Re: Hadoop file formats Read/Write


Thanks for your answer.

And if I want to use only HDFS for reads/writes, is it not recommended with a volume of 200B records per day?