Created 01-22-2017 06:55 PM
Hi Team,
Consider a Hadoop cluster with a default block size of 64 MB. We have a use case where we would like to use Hadoop to store historical data and retrieve it as needed.
The historical data would be archives containing many small files (millions of them), which is why we are considering reducing the default block size in Hadoop to 32 MB.
I also understand that changing the default size to 32 MB may adversely affect the cluster if we later use it for applications that store very large files,
so can anyone advise what to do in such a situation?
Created 01-23-2017 05:40 AM
Here is the general answer: reducing the default block size will result in the creation of many more blocks, which adds overhead on the NameNode. By architecture, each node on the Hadoop cluster (in newer architectures it is each storage type per node, but that conversation is for a different time) sends a storage report and a block report back to the NameNode, which are then used when retrieving/accessing the data later. So, as you would imagine, this will increase the chattiness between the NameNode and DataNodes, as well as increase the metadata held on the NameNode itself.
Also, once you start hitting the range of hundreds of millions of files, your NameNode will start filling up its memory and may go through a major garbage collection, which is a stop-the-world operation and may result in your whole cluster being down for a few minutes. There are ways around this, like increasing the NameNode memory size or changing the GC, but none of those are economical or easy.
These are the downsides of reducing the block size, and of the small-file problem in general.
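To make the NameNode overhead concrete, here is a rough back-of-envelope sketch. The dataset size, average file size, and the ~150 bytes of heap per namespace object are all illustrative assumptions (the per-object figure is a commonly cited rule of thumb, not an exact number):

```python
# Rough estimate of NameNode namespace objects for the same dataset
# at different block sizes. All figures are illustrative assumptions.

BYTES_PER_OBJECT = 150          # rough heap cost per file/block object
DATASET = 100 * 1024**4         # example: 100 TB of data
AVG_FILE = 256 * 1024**2        # example: 256 MB average file size

def namenode_objects(block_size):
    files = DATASET // AVG_FILE
    blocks_per_file = -(-AVG_FILE // block_size)   # ceiling division
    blocks = files * blocks_per_file
    return files + blocks        # one object per file + one per block

for mb in (32, 64, 128):
    objs = namenode_objects(mb * 1024**2)
    heap_gb = objs * BYTES_PER_OBJECT / 1024**3
    print(f"{mb:>4} MB blocks -> {objs:,} objects, ~{heap_gb:.2f} GB heap")
```

Halving the block size roughly doubles the block count for large files; shrink the files themselves and the object count grows even faster than this sketch shows.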
Now, coming to your specific use case: why do you have so many small files? Is there a way you can merge many of them into a larger file? One of my customers had a similar issue while storing tick symbols; they mitigated it by combining the tick data on an hourly basis. Another customer received quite small files via FTP and mitigated the problem by gzipping batches of those files into one really large file. Archiving the data to Hive is another option.
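The batching approach described above can be sketched with the standard library; the file names and hourly grouping here are illustrative assumptions, not part of any specific product:

```python
# Sketch: bundle many small XML files into one gzipped tar archive,
# so HDFS stores a single large object instead of millions of tiny
# ones. Paths and grouping (e.g. hourly) are illustrative.
import tarfile
from pathlib import Path

def bundle(small_files, archive_path):
    """Combine small files into one .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in small_files:
            tar.add(f, arcname=Path(f).name)
    return archive_path
```

The resulting archive can then be copied into HDFS as one large file, which keeps the NameNode object count low.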
The bottom line is that the small-file issue on Hadoop must be viewed as a combined technical and business problem, and you will be best off looking for ways to eliminate the situation from the business standpoint as well. Simply playing around with the block size is not going to give you the most mileage.
Lastly, if you felt this answer was helpful, please upvote and accept the answer. Thank you!
Created 01-22-2017 11:57 PM
Before I answer your question, please read the following discussion, which will help you understand why larger block sizes are required for Hadoop.
https://community.hortonworks.com/questions/51408/hdfs-federation-1.html
Now, assuming you have read the link above, you understand why small files will not work well with Hadoop. So not only do you need a 64 MB block size, you should actually bump it up to 128 MB (that is the default in HDP).
This is not bad news for your use case. There are literally 1000-plus deployments at this point where historical data is archived in Hadoop. Why do you have small files? Are those files small because the whole table is only a few MB (less than 64 MB)? What is the total amount of data you are looking to offload into Hadoop? Once we know this, we can answer better, but offloading historical data is a classic Hadoop use case and you shouldn't run into the small-files problem.
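For reference, the block size is controlled by the `dfs.blocksize` property in `hdfs-site.xml`; the value below is 128 MB expressed in bytes (a sketch only, adjust to your environment):

```xml
<!-- hdfs-site.xml: default block size for newly written files -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 bytes -->
</property>
```

Note this only affects files written after the change; existing files keep the block size they were written with.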
Created 01-29-2017 03:40 PM
@mqureshi Hi, our application generates small XML files, which are stored on a NAS, with the XML's associated metadata in a DB.
The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive, say 10 GB each; the data is 3 months old.
So I wanted to know the best solution for storing and accessing this archived data in Hadoop: HDFS, Hive, or HBase.
Please advise which approach you think is better for reading this archived data.
Suppose I store this archived data in Hive: how do I retrieve it?
Please guide me on storing archived data in Hive, and also on retrieving/reading it from Hive when needed.
Created 01-29-2017 04:09 PM
This sounds pretty simple. Here is how I would do it, but you can follow your own path.
1. Import the XML archive data into Hadoop. My next step is optional, but to me it's the right way to do it.
2. I would flatten the XML into Avro and then ORC (a lot of material is available on this). I would use nested types to retain the XML structure; it's going to be more efficient when reading.
https://orc.apache.org/docs/types.html
Like I said, this is optional. You can keep your data in XML and read the XML directly from Hive.
3. I would initially keep compression enabled with Snappy, but might disable it if the data set is not too large and queries bottleneck on CPU.
That's pretty much it. It's a pretty straightforward use case.
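As a minimal sketch of the flattening step (step 2), here is what turning one small XML document into a flat record might look like. The element names are invented for illustration; a real archive would need schema-aware handling of nesting and repeated elements:

```python
# Sketch: flatten a simple XML document into a flat dict, as a
# precursor to writing Avro/ORC. Element names are illustrative.
import xml.etree.ElementTree as ET

def flatten(xml_text):
    """Turn one small XML document into a flat record (dict)."""
    root = ET.fromstring(xml_text)
    record = dict(root.attrib)          # keep root attributes
    for child in root:
        record[child.tag] = child.text  # one column per child element
    return record

doc = '<trade id="42"><symbol>HDP</symbol><price>9.99</price></trade>'
print(flatten(doc))   # {'id': '42', 'symbol': 'HDP', 'price': '9.99'}
```

In practice you would run something like this in a distributed job over the archive contents and write the records out in ORC, which is where the nested-types link above comes in.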
Created 01-29-2017 06:26 PM
@mqureshi Thanks a lot for your help and for guiding me 🙂. Thanks for explaining in detail.
Created 01-29-2017 03:42 PM
I appreciate your inputs.
Please advise on how to store and read archived data from Hive.
While storing data in Hive, should I save it as a .har (Hadoop archive) in HDFS?
Created 02-01-2017 03:53 PM
I think this question is similar to this one https://community.hortonworks.com/questions/79103/what-is-the-best-way-to-store-small-files-in-hadoo... and I have posted my answer there.
Created 01-26-2017 01:20 AM
Based on the assumption that storage savings is the primary goal here, I'd like to suggest the following:
1. Leverage the HDFS tiered storage tier called ARCHIVE: http://www.ebaytechblog.com/2015/01/12/hdfs-storage-efficiency-using-tiered-storage/
2. Erasure coding is a new mechanism, soon to be delivered in HDP, that promises the same fault-tolerance guarantees as a replication factor of 3 but with only 1.5x storage overhead instead of 3x. That means you no longer store 3 full replicas of each block, only the equivalent of 1.5 copies. https://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.htm
I'd consider these paths before reducing the block size.
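The storage comparison in point 2 is simple arithmetic; here is a sketch assuming a Reed-Solomon (6, 3) layout (6 data cells plus 3 parity cells), which is one common erasure-coding configuration:

```python
# Back-of-envelope storage comparison: 3x replication vs an
# erasure-coding layout with 6 data + 3 parity units, which stores
# 9 units for every 6 units of data -> 1.5x overhead.

def stored_bytes(data_tb, replication=None, ec=None):
    """Raw storage on disk for a given logical dataset size (in TB)."""
    if replication is not None:
        return data_tb * replication
    data_units, parity_units = ec
    return data_tb * (data_units + parity_units) / data_units

logical = 100  # TB of logical data, illustrative
print(stored_bytes(logical, replication=3))   # 300 TB raw
print(stored_bytes(logical, ec=(6, 3)))       # 150.0 TB raw
```

For the same fault tolerance (loss of up to 3 units), erasure coding halves the raw storage needed compared with 3x replication, at the cost of extra CPU for encoding and reconstruction.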