Hi, we have a situation where our HDFS file system is approaching its capacity limit; the FS is ~90% full. We want to check whether any compression is enabled, and also how to enable it.
Any help is highly appreciated.
Thanks in advance.
1. HDFS supports compression, and you can check the supported compression codecs under
Cloudera Manager -> HDFS -> Configuration -> Search for "compression"
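As a quick check from the shell as well (a sketch; `hadoop checknative` reports native library availability on a cluster node, and its exact output format varies by version):

```shell
# Report which native compression libraries (zlib, snappy, lz4, bzip2, ...)
# this node can load. Guarded so it is a no-op where the hadoop CLI is absent.
if command -v hadoop >/dev/null 2>&1; then
  # Lines typically look like "snappy:  true /path/to/libsnappy.so";
  # keep only the codecs reported as available.
  hadoop checknative -a 2>/dev/null | grep -E ':[[:space:]]+true' || true
else
  echo "hadoop CLI not found; run this on a cluster node" >&2
fi
```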
2. HDFS is the primary storage system for all the Hadoop applications/tools, so we cannot simply switch compression on for HDFS as a whole, since every other tool depends on it. That means we can compress individual files or groups of files, but there is no single configuration that compresses all files in HDFS (someone correct me if I'm wrong)
3. You need to check the extension of the individual file(s) or group of files to confirm whether they have been compressed or not
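A rough way to do that extension audit from the shell (a sketch; the function name is mine, and note that extensions only indicate container-level compression — formats like Parquet compress internally without a telltale suffix):

```shell
# Summarize file extensions from an `hdfs dfs -ls -R` listing read on stdin.
# Compressed files typically end in .gz, .bz2, .snappy, or .lz4; plain text
# often has .txt or no extension at all.
# Usage on a cluster node (path is a placeholder):
#   hdfs dfs -ls -R /data | summarize_extensions
summarize_extensions() {
  awk '
    $1 !~ /^d/ && NF >= 8 {       # skip directory entries and header lines
      n = split($NF, parts, "/")  # last field is the full path
      f = parts[n]
      if (f ~ /\./) { ext = f; sub(/.*\./, "", ext) } else { ext = "(none)" }
      count[ext]++
    }
    END { for (e in count) print count[e], e }
  ' | sort -nr
}
```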
4. To make it simple, you need to narrow down further, like:
a. Which particular application (e.g. MapReduce, Spark, Hive, Impala, HBase, etc.) is consuming the most space?
b. For example, if Hive is consuming the most space, which particular DB/table in Hive is consuming the most space?
c. List the top DB(s)/table(s) consuming the most space.
d. Once you have identified the table, there are several storage formats that compress individual tables, such as Parquet, RCFile, ORC, etc. (If you are using both Hive & Impala, my vote is for Snappy-compressed Parquet.)
e. If you are using Sqoop to import data, apply Parquet during the import. This will compress your data and also improve performance.
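For steps a–c, one way to rank space consumers from the shell (a sketch; `/user/hive/warehouse` is the common default Hive warehouse path, so adjust it for your cluster):

```shell
# Show the ten largest entries under the Hive warehouse.
# `hdfs dfs -du` prints the size in bytes as the first column, so a numeric
# reverse sort on that column ranks the biggest directories first.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -du /user/hive/warehouse | sort -k1 -nr | head -n 10
fi
```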
5. You can refer to the documentation below for more details
Sqoop: (search for "parquet")
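For point 4e, the import would look roughly like this (a sketch only — the JDBC URL, username, table, and target directory are placeholders, and you should confirm the exact flags against the Sqoop docs for your version):

```shell
# Import an RDBMS table as Snappy-compressed Parquet (all connection details
# below are hypothetical examples). Guarded so this is a no-op without sqoop.
if command -v sqoop >/dev/null 2>&1; then
  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl -P \
    --table orders \
    --as-parquetfile \
    --compression-codec snappy \
    --target-dir /user/etl/orders_parquet
fi
```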
I want to know which files are uncompressed in HDFS, from the Cloudera Manager reports (Directory Usage). Is any such option available in Cloudera Manager?