I need to decide how to write my data to Hadoop.
I'm using Spark to consume messages from a Kafka topic; each message is a JSON record.
I have around 200B records per day.
The data fields may change (not a lot, but they may change in the future),
and I need fast writes, fast reads, and a small footprint on disk.
Which should I choose: Avro or Parquet?
If I choose Parquet/Avro, do I need to create the table with all the fields of my JSON up front?
If not, what is the way to create the table in Parquet format and in Avro format?
If you need fast reads and writes, I recommend using HBase instead. Store the JSON in a table column. You can create a Hive table on top of the HBase table at any point using the Hive HBase SerDe. All of this is included in your HDP stack.
Using HDFS for fast reads/writes is setting yourself up for failure. Use the correct tool based on your use case.
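The Hive-on-HBase layering described above can be sketched as a DDL statement; the table, column family, and column names here (`events`, `d`, `json`) are placeholders you would replace with your own:

```sql
-- Hive table mapped onto an existing HBase table via the HBase storage handler.
-- ":key" maps the HBase row key; "d:json" maps the column holding the raw JSON.
CREATE EXTERNAL TABLE events_hbase (rowkey STRING, json STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:json")
TBLPROPERTIES ("hbase.table.name" = "events");
```

Because the JSON is stored as an opaque string, schema changes in the messages do not require altering this table.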
Thanks for your answer.
And if I want to use only HDFS for reads/writes, is that not recommended at a volume of around 200B records per day?