Member since: 12-14-2015
Posts: 70
Kudos Received: 94
Solutions: 16

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6159 | 03-14-2017 03:56 PM |
 | 1356 | 03-07-2017 07:20 PM |
 | 4358 | 01-23-2017 05:57 AM |
 | 5692 | 01-23-2017 05:40 AM |
 | 1724 | 10-18-2016 03:36 PM |
01-23-2017
05:40 AM
4 Kudos
@ripunjay godhani Here is the general answer: reducing the default block size results in the creation of many more blocks, which adds overhead on the NameNode. By design, each node in the Hadoop cluster (in newer versions it is each storage type per node, but that is a conversation for another time) sends a storage report and a block report back to the NameNode, which the NameNode then uses when the data is retrieved or accessed later. As you would imagine, more blocks increases the chattiness between the NameNode and the DataNodes, and it also increases the metadata held on the NameNode itself. Once you get into the range of hundreds of millions of files, the NameNode heap starts filling up and may go through a major garbage collection, which is a stop-the-world operation and can leave your whole cluster down for a few minutes. There are ways around this, such as increasing the NameNode memory or changing the GC settings, but none of them are economical or easy. These are the downsides of reducing the block size, and of the small-file problem in general.

Now to your specific use case: why do you have so many small files? Is there a way you can merge several of them into a larger file? One of my customers had a similar issue while storing tick symbols; they mitigated it by combining the tick data on an hourly basis. Another customer received very small files over FTP and mitigated the problem by gzipping batches of those files into one large file. Archiving data into Hive is another option. The bottom line is that the small-file issue on Hadoop must be viewed as a combination of a technical and a business problem, and you will be best off looking for ways to eliminate the situation from the business standpoint as well. Simply playing with the block size is not going to give you much mileage.

Lastly, if you found this answer helpful, please upvote and accept it. Thank you!
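As an illustration of the merge/archive approach described above, here is a rough shell sketch; the /data/ticks path, the hourly directory layout, and the archive destination are assumptions made up for this example, not details from the original question:

```
# Gauge the scale of the small-file problem for a directory
# (-count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME)
hdfs dfs -count /data/ticks

# Option 1: merge one hour's worth of small files into a single local file,
# compress it, and push the result back to HDFS (assumed hourly layout)
hdfs dfs -getmerge /data/ticks/2016/10/18/15 /tmp/ticks-2016101815.csv
gzip /tmp/ticks-2016101815.csv
hdfs dfs -put /tmp/ticks-2016101815.csv.gz /data/ticks_merged/2016/10/18/

# Option 2: pack an old partition into a Hadoop Archive (HAR) so the
# NameNode has far fewer objects to track
hadoop archive -archiveName ticks-2016.har -p /data/ticks 2016 /data/archive
```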
10-18-2016
04:01 PM
Thanks, my friend!
10-18-2016
03:49 PM
1 Kudo
Labels:
- Apache Hadoop
10-18-2016
03:42 PM
1 Kudo
Labels:
- Apache Hadoop
- Apache Ranger
10-18-2016
03:36 PM
3 Kudos
@vpemawat

If you are not using log4j: If you are looking to delete the files for good, there are not many options available other than rm -rf; however, there are a few tweaks you can make to speed it up:
- Run multiple rm scripts in parallel (multiple threads).
- To do this, you need to be able to logically separate the log files, either by folder or by name format.
- Once you have done that, you can run multiple rm commands in the background, something like the following:

nohup rm -fr app1-2016* > /tmp/nohup.out 2>&1 &
nohup rm -fr app1-2015* > /tmp/nohup.out 2>&1 &

If you are using log4j: You should probably be using 'DailyRollingFileAppender' with 'maxBackupIndex'; this will essentially cap the size of your logs and purge the older contents. More details here: http://www.codeproject.com/Articles/81462/DailyRollingFileAppender-with-maxBackupIndex

Outside of this, consider the following two things for future use cases:
- Organize the logs by folder (normally broken down like /logs/appname/yyyy/mm/dd/hh/<log files>).
- Have a mechanism that either deletes the old log files or archives them to a separate log archive server (a rough sketch follows this answer).

Hopefully this helps. If it does, please 'accept' and 'upvote' the answer. Thank you!!
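For that last point, a minimal cleanup sketch that could run from cron is shown below; the /logs/appname path, the 30-day retention window, and the archive-host name are assumptions for illustration only:

```
#!/bin/bash
# Remove local log files older than 30 days (assumed retention policy)
find /logs/appname -type f -name '*.log' -mtime +30 -delete

# Or, instead of deleting outright, ship old logs to an archive server and
# delete only the files that were copied successfully
find /logs/appname -type f -name '*.log' -mtime +30 \
  -exec rsync -a {} archive-host:/archive/appname/ \; -delete
```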
10-09-2016
10:16 PM
1 Kudo
You are welcome. Glad it worked.
10-09-2016
10:11 PM
2 Kudos
Go to your Ambari >> Kafka >> Configs and look for the port the Kafka broker listens on. If it is the HDP 2.4 sandbox, it will most probably be 6667, and you should therefore run the command below instead:

./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic test1

Let me know if this works. Otherwise, post the full exception and we can look deeper.

If this answer helps you, please don't forget to upvote / accept it.
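If you would rather check the port from the shell than from the Ambari UI, here is a rough sketch; the server.properties path is the usual HDP location but treat it as an assumption, and older sandbox Kafka versions take --zookeeper for the console consumer while newer ones use --bootstrap-server:

```
# Confirm which port the broker is actually listening on
grep '^listeners' /usr/hdp/current/kafka-broker/config/server.properties
netstat -tlnp | grep 6667

# Round-trip test: produce a message...
./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic test1

# ...and read it back in another terminal (adjust the flag to your Kafka version)
./kafka-console-consumer.sh --zookeeper sandbox.hortonworks.com:2181 \
  --topic test1 --from-beginning
```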
09-26-2016
05:48 AM
3 Kudos
@Bala Vignesh N V Unfortunately, you cannot run multiple INSERT commands against the same destination table at the same time (technically you can, but the jobs will execute one after the other). However, if you are using external files, you can achieve parallelism by writing multiple files into your destination folder and creating a Hive external table on top of that folder. It will look something like this:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
LOCATION '/logs/mywebapp/';

Here '/logs/mywebapp/' is your HDFS directory, and you write multiple files (one for each of your parallel jobs) into this directory; a short shell sketch of that parallel write follows this answer.

** If this answers your question, please don't forget to upvote and Accept the answer **
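A minimal shell sketch of the "one file per parallel job" idea; the file names are hypothetical, and the DDL above would also need a ROW FORMAT clause matching whatever delimiter those files use:

```
# Each parallel job drops its own output file into the table's LOCATION;
# Hive reads every file in the directory at query time.
hdfs dfs -mkdir -p /logs/mywebapp
hdfs dfs -put job1_output.csv /logs/mywebapp/ &
hdfs dfs -put job2_output.csv /logs/mywebapp/ &
hdfs dfs -put job3_output.csv /logs/mywebapp/ &
wait

# Rows from all files are visible through the external table
hive -e "SELECT COUNT(*) FROM page_view;"
```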
09-22-2016
02:54 AM
@Girish Chaudhari What happened right after you executed the ALTER TABLE command? Did you get any errors? I am assuming you ran describe extended <table_name> to determine the location it is referring to?
09-21-2016
03:17 AM
Thanks @Randy Gelhausen