Created 07-02-2018 08:14 AM
I need to decide how to write my data to Hadoop.
I'm using Spark and consuming messages from a Kafka topic; each message is a JSON record.
I have around 200B records per day.
The data fields may change (not a lot, but they may change in the future),
and I need fast writes, fast reads, and a small footprint on disk.
What should I choose? Avro or Parquet?
If I choose Parquet/Avro, do I need to create the table with all the fields of my JSON?
If not, what is the way to create the table in Parquet format and in Avro format?
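For reference, here is roughly what the job looks like today (the broker address, topic name, schema fields, and output paths below are simplified placeholders); the open question is which format to use for the sink:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()
import spark.implicits._

// Explicit schema for the JSON payload (simplified; the real messages have more fields)
val schema = new StructType()
  .add("event_id", StringType)
  .add("event_time", LongType)
  .add("payload", StringType)

// Read the raw messages from Kafka as a stream
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my_topic")
  .load()

// Parse the JSON value into columns
val parsed = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")

// Write out as Parquet (or swap the format to Avro via the spark-avro package)
val query = parsed.writeStream
  .format("parquet")
  .option("path", "/data/events/parquet")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .start()

query.awaitTermination()
```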
Thanks!!
Created 07-02-2018 04:03 PM
If you need fast reads and writes, I recommend you use HBase instead. Store the JSON in a table column. You can create a Hive table on top of the HBase table at any point using the Hive HBase SerDe. All of this is included in your HDP stack.
Using HDFS for fast reads/writes is setting yourself up for failure. Use the correct tool based on your use case.
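For example, a minimal sketch with the plain HBase Java client (the table name "json_events", column family "cf", and row key scheme below are placeholders, not something specific to your setup):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Placeholder table/column names: table "json_events" with column family "cf"
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("json_events"))

// Store the whole JSON record in a single column, keyed by something unique
// (for example message id plus timestamp)
def writeRecord(rowKey: String, json: String): Unit = {
  val put = new Put(Bytes.toBytes(rowKey))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes(json))
  table.put(put)
}

// Example usage
writeRecord("msg-42|1530520000000", """{"event_id":"msg-42","payload":"..."}""")

table.close()
connection.close()
```

In a Spark job you would normally open the connection inside foreachPartition rather than per record. A Hive table declared with STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' and an "hbase.columns.mapping" of ":key,cf:json" then exposes the same rows to Hive queries.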
Created 07-03-2018 06:04 AM
Thanks for your answer.
And if I want to use only HDFS for reads/writes, is it not recommended for me at a volume of 200B records per day?