Member since: 12-14-2015
Posts: 70
Kudos Received: 94
Solutions: 16
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6007 | 03-14-2017 03:56 PM |
| | 1305 | 03-07-2017 07:20 PM |
| | 4196 | 01-23-2017 05:57 AM |
| | 5464 | 01-23-2017 05:40 AM |
| | 1682 | 10-18-2016 03:36 PM |
01-23-2017
05:40 AM
4 Kudos
@ripunjay godhani Here is the general answer: reducing the default block size results in the creation of many more blocks, which creates overhead on the NameNode. By design, each node in the Hadoop cluster (in newer versions it is actually each storage type per node, but that is a conversation for another time) sends a storage report and a block report back to the NameNode, and those reports are used when the data is retrieved or accessed later. As you would imagine, more blocks increases the chattiness between the NameNode and the DataNodes and also increases the metadata held on the NameNode itself. Once you get into the range of hundreds of millions of files, the NameNode heap starts filling up and may go through a major garbage collection, which is a stop-the-world operation and can leave your whole cluster unavailable for a few minutes. There are ways around this, such as increasing the NameNode memory or changing the GC, but none of them are economical or easy. These are the downsides of reducing the block size, and of the small-files problem in general.

Now to your specific use case: why do you have so many small files? Is there a way you can merge several of them into a larger file (see the sketch below)? One of my customers had a similar issue storing tick symbols and mitigated it by combining the tick data on an hourly basis. Another customer received many small FTP-ed source files and mitigated it by combining and gzipping batches of them into much larger files. Archiving data into Hive is another option.

The bottom line is that the small-files issue on Hadoop must be viewed as a combination of a technical and a business problem, and you will be best off looking for ways to eliminate the situation from the business standpoint as well. Simply playing with the block size is not going to give you much mileage.

Lastly, if you found this answer helpful, please upvote and accept it. Thank you!
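As a rough illustration of the merge approach mentioned above, here is a minimal shell sketch; the paths, file pattern, and archive name are placeholders, not anything specific to your cluster:

# concatenate many small HDFS files into one larger file, then remove the originals
hadoop fs -cat /data/ticks/2016-10-18/part-* | hadoop fs -put - /data/ticks-merged/2016-10-18.txt
hadoop fs -rm /data/ticks/2016-10-18/part-*

# alternatively, pack a whole directory of small files into a Hadoop Archive (HAR)
# so the NameNode has far fewer objects to track
hadoop archive -archiveName ticks-2016-10-18.har -p /data/ticks 2016-10-18 /data/archive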
10-18-2016
04:01 PM
Thanks my friend!
10-18-2016
03:49 PM
1 Kudo
Labels:
- Apache Hadoop
10-18-2016
03:42 PM
1 Kudo
Labels:
- Apache Hadoop
- Apache Ranger
10-18-2016
03:36 PM
3 Kudos
@vpemawat If you are not using log4j: if you want to delete the files for good, there are not many options other than rm -rf; however, there are a few tweaks that can make it faster. You can run multiple rm commands in parallel. To do that, you need to be able to logically separate the log files, either by folder or by name format. Once you have done that, you can run multiple rm commands in the background, like the ones below:

nohup rm -fr app1-2016* > /tmp/nohup.out 2>&1 &
nohup rm -fr app1-2015* > /tmp/nohup.out 2>&1 &

If you are using log4j: you should probably use a DailyRollingFileAppender with maxBackupIndex, which caps how many old log files are kept and purges anything beyond that. More details here: http://www.codeproject.com/Articles/81462/DailyRollingFileAppender-with-maxBackupIndex

Outside of this, consider two things for future use cases: organize the logs by folder (commonly broken down like /logs/appname/yyyy/mm/dd/hh/<log files>), and have a mechanism that either deletes the old log files or archives them to a separate log archive server (a sketch follows below).

Hopefully this helps. If it does, please accept and upvote the answer. Thank you!!
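For the ongoing cleanup or archiving mentioned above, a minimal sketch you could run from cron; the /logs/appname path and the 30-day retention are assumptions to adjust for your environment:

# delete application log files older than 30 days (drop -delete first to do a dry run that only prints)
find /logs/appname -type f -name '*.log*' -mtime +30 -print -delete

# or ship an old month's folder to an archive location before removing it
tar -czf /tmp/appname-2016-09.tar.gz /logs/appname/2016/09 && rm -rf /logs/appname/2016/09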
10-09-2016
10:16 PM
1 Kudo
You are welcome. Glad it worked.
10-09-2016
10:11 PM
2 Kudos
Go to Ambari >> Kafka >> Configs and look for the port the Kafka broker listens on. If it is the HDP 2.4 sandbox, it will most probably be 6667, so you should run the command below instead:

./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic test1

Let me know if this works. If not, post the exact exception and we can look deeper. If this answer helps you, please don't forget to upvote / accept it.
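If you want to sanity-check the broker and topic from the same bin directory, something along these lines should work on that sandbox; it assumes ZooKeeper is on the default port 2181, and the topic name test1 comes from your producer command:

# confirm the topic exists
./kafka-topics.sh --list --zookeeper sandbox.hortonworks.com:2181

# read back whatever the producer wrote
./kafka-console-consumer.sh --zookeeper sandbox.hortonworks.com:2181 --topic test1 --from-beginning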
09-26-2016
05:48 AM
3 Kudos
@Bala Vignesh N V Unfortunately, you cannot run multiple INSERT commands against the same destination table at the same time (technically you can, but the jobs will execute one after the other). However, if you use an external table, you can achieve parallelism by writing multiple files into your destination folder and creating a Hive external table on top of that folder. It will look something like this:

CREATE EXTERNAL TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
LOCATION '/logs/mywebapp/';

Here '/logs/mywebapp/' is your HDFS directory, and you write multiple files (one for each of your parallel jobs) into it.

** If this answers your question, please don't forget to upvote and Accept the answer **
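To illustrate the parallel-write idea, a minimal sketch; the local part-*.csv files are placeholders for the output of your parallel jobs, and /logs/mywebapp/ is the table location from the DDL above:

# each job drops its own file into the external table's location; Hive reads them all at query time
hadoop fs -put /tmp/part-0001.csv /logs/mywebapp/ &
hadoop fs -put /tmp/part-0002.csv /logs/mywebapp/ &
wait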
09-22-2016
02:54 AM
@Girish Chaudhari What happened right after you executed the ALTER TABLE command? Did you get any errors? I am assuming you tried describe extended <table_name> to determine the location it is referring to?
09-21-2016
03:17 AM
Thanks @Randy Gelhausen