How is HDFS storage used in the real world, and what's the most commonly used file format for it? It may differ per requirement, but I'm wondering what the common understanding across the big data community is: which storage approach in HDFS suits most big data project requirements?
Good question. There are lots of use cases for HDFS in the real world. It's important to realize that it's a filesystem, and that it can be used in many ways to accomplish many things.
For example, you can store JSON data and apply a Hive schema on top of it to make it available through SQL and tools like Tableau.
On the same cluster and the same HDFS filesystem, you could simultaneously be ingesting CSV data and converting it to compressed ORC to store it more efficiently for use in some other process.
It really just boils down to choosing what makes the most sense for the business implementing the solution.
Great question. Since Hadoop provides a distributed file system, let's think about the characteristics a file format should have.
1. As with file formats outside Hadoop, binary formats are generally more efficient than text, both for storage and from a read/write perspective.
2. Hadoop splits a file into blocks and distributes them across machines, so it's important that a Hadoop file format be splittable. ORC, SequenceFile, Avro, and CSV (not recommended) are all splittable.
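As a toy illustration of the splittability point (plain Python with made-up data; no Hadoop required): a newline-delimited text file can be processed starting at any block boundary, but a whole-file gzip stream cannot, which is why non-splittable compression of text files hurts parallelism.

```python
import gzip

# Hypothetical newline-delimited records, like a CSV file on HDFS.
rows = b"".join(b"id-%d,value\n" % i for i in range(1_000))
compressed = gzip.compress(rows)

# Plain text: a worker assigned a mid-file block skips the partial first
# record and then reads intact records from there.
mid = len(rows) // 2
intact_records = rows[mid:].split(b"\n")[1:-1]
print(len(intact_records), "records recovered from a mid-file split")

# Gzip: decompression can only start at the stream header, so a block
# beginning mid-file is unusable on its own.
try:
    gzip.decompress(compressed[len(compressed) // 2 :])
except OSError:
    print("cannot decompress starting mid-stream")
```

This is why splittable container formats (or block-level compression inside them) matter for parallel processing.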
You should also consider the type of engine you will use to process the data.
1. If you are running Hive to read data and you will only select a subset of columns, then ORC makes the most sense. After all, it is a columnar format, optimized for reading a few columns at a time.
2. When you are going to read all columns in a table, Avro makes more sense (some might suggest SequenceFiles too, but I would recommend you test both and see which gives you the best performance).
3. If you are using HBase, it stores its own HFiles in Hadoop, so you don't need to worry about the data format. The same goes for Solr.
4. When you are landing data in Hadoop, it's better to use a row-oriented format like SequenceFile or Avro, as these are more efficient for writes; columnar formats are slower to write. You might have SLAs that require faster ingestion; in those cases, use Avro or SequenceFile (CSV, tab-delimited, or other text formats will also work).
5. Finally, make sure you compress the data. That is more storage efficient and will not hurt performance, since most jobs are likely to be I/O bound and decompression happens on the CPU. (If you are CPU bound, factor decompression in, but do a cost/benefit analysis.)
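A quick sketch of the compression point, using Python's built-in gzip on made-up repetitive log data (on a real cluster you would configure a codec such as Snappy or ZLIB instead):

```python
import gzip

# Hypothetical log lines; real event data is similarly repetitive, which is
# why it compresses so well.
log_lines = b"2017-01-01T00:00:00,click,user42,page7\n" * 10_000
compressed = gzip.compress(log_lines)

ratio = len(compressed) / len(log_lines)
print(f"compressed to {ratio:.1%} of the original size")
```

The more repetitive the data, the bigger the storage (and therefore I/O) saving.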
So, basically, it boils down to:
1. Avro or SequenceFile when you are reading all columns in a table.
2. ORC when you read a subset of columns.
3. Avro, SequenceFile, or plain text formats like CSV or tab-delimited when landing data in a staging area.
4. You can also store all sorts of binary data in its raw format (images, videos, audio, etc.), but you need a way (an engine) to read and process that data.
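The summary above can be condensed into a small helper; the function name and category strings here are my own sketch, not any official API:

```python
def suggest_format(stage: str, read_pattern: str = "all_columns") -> str:
    """Rough mapping of the guidelines above to a suggested HDFS format."""
    if stage == "landing":
        # Row-oriented formats write faster, which matters for ingest SLAs.
        return "Avro / SequenceFile (or plain text like CSV)"
    if read_pattern == "subset_of_columns":
        # Columnar formats read only the columns a query touches.
        return "ORC"
    if read_pattern == "all_columns":
        return "Avro / SequenceFile"
    # Raw binary data (images, video, audio) needs its own processing engine.
    return "raw binary"

print(suggest_format("processing", "subset_of_columns"))  # → ORC
```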
Hope this helps.
Thanks a lot @mqureshi for the detailed response. It helped.
One interesting thing we observed is that the compression levels for CSV and Avro are 97% and 90% respectively, which actually makes compressed CSV about 3 times smaller in size.
So where we need 10 TB of compressed CSV, we need around 26-30 TB of compressed Avro.
Also, with any binary format, including Avro, we lose the ability to read the file in its raw, human-readable form.
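Spelling out the arithmetic behind those numbers (the 100 TB raw figure is hypothetical, chosen for round math): 97% compression leaves 3% of the raw size and 90% leaves 10%, so compressed Avro comes out roughly 3.3 times larger than compressed CSV.

```python
# Hypothetical raw data volume, in TB.
raw = 100.0
csv_compressed = raw * (1 - 0.97)   # 97% compression -> 3 TB remain
avro_compressed = raw * (1 - 0.90)  # 90% compression -> 10 TB remain

print(avro_compressed / csv_compressed)  # roughly 3.3x larger
```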
That may be true, but did you check the performance? What is the cost-to-performance ratio? Also, which compression codec did you use? I would recommend Snappy; it gives you a good balance between performance and storage.
Storage in Hadoop is relatively inexpensive to begin with, so the post-compression size differences are usually not that significant. Also, what happens to your compression if you change the data format to ORC?
You should test all three for storage and performance; CSV may not be your best option. If my answer helped, can you please accept it?
One more thing comes to mind about why your CSV compresses better. Consider an integer column: in CSV, everything is a string, so a one-digit integer actually takes only one byte, while the same one-digit integer stored in Avro as a fixed-width type takes four bytes. That can cause some savings in disk space. But when you do calculations, you first have to cast the CSV strings to integers, which will significantly impact performance; with Avro no casting is required and your queries will be faster.
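That trade-off can be seen with a few lines of plain Python (using `struct` to mimic a fixed-width binary encoding; this is an illustration, not Avro's actual wire format):

```python
import struct

# A one-digit integer as CSV text vs. a fixed-width 4-byte binary int.
as_text = b"7"
as_binary = struct.pack("<i", 7)

print(len(as_text), len(as_binary))  # 1 byte vs. 4 bytes on disk

# But the text value must be parsed (cast) before any arithmetic, which
# costs CPU on every query; the binary value decodes straight to an int.
value = int(as_text)
(decoded,) = struct.unpack("<i", as_binary)
assert value == decoded == 7
```

So the smaller on-disk text can trade storage for per-query CPU, which is the point made above.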
I would add to @mqureshi's extensive response that Hive ACID transactions are only possible when using ORC as the file format. Also, Parquet is another option. Parquet and other columnar formats handle a common Hadoop access pattern, reading a few columns from very wide tables, very efficiently. There is an entire debate about Parquet vs. ORC; Parquet is still the choice for some products in the big data tools ecosystem, but ORC is getting more and more popular.