Member since: 09-25-2015
Posts: 22
Kudos Received: 9
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
https://community.hortonworks.com/content/supportkb/178111/enable-multiple-hive-llap-instances-in-a-single-hd.html | 3241 | 06-12-2018 09:26 PM
10-19-2017
08:13 PM
6 Kudos
This article shows how to identify where most of the small files are located in a large HDFS cluster. Here are some related articles on the small-file problem and how to analyze it:
https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html
https://community.hortonworks.com/repos/105768/identifying-small-file-offenders.html
https://community.hortonworks.com/articles/142016/analyze-fsimage-from-a-non-smartsense-cluster.html

SmartSense Activity Analyzer currently does not report small files by HDFS location, but it does show other useful information such as the top N users with small files, trends, etc. In a large HDFS cluster with a heavy workload, it is often hard to locate where the largest numbers of small files live using 'fsck' or 'hdfs dfs -ls -R' output: those commands can take a long time to return data, and you have to repeat them several times to get the desired output. I have taken the approach below to spot the HDFS locations where most of the small files exist, so users can look into the data and find the origin of the files (for example, an incorrect table partition key).

Read the fsimage and store it in HDFS:
- Copy the fsimage file to a different location and convert it to delimited text. (Note: please do not run the command below on the live fsimage file.)
hdfs oiv -p Delimited -delimiter "|" -t /tmp/tmpdir/ -i /fsimage_copy -o /fsimage.out
hdfs dfs -put /fsimage.out /user/ambari-qa/

Load into Hive. Create the tables in Hive:

CREATE TABLE `fsimage_txt1`(
`path` varchar(2000),
`repl` int,
`mdate` date,
`atime` date,
`pblksize` int,
`blkcnt` bigint,
`fsize` bigint,
`nsquota` bigint,
`dsquota` bigint,
`permi` varchar(100),
`usrname` varchar(100),
`grpname` varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://mycluster/apps/hive/warehouse/fsimage_txt';
-- (The TBLPROPERTIES with metastore statistics such as numFiles/totalSize came from SHOW CREATE TABLE output and are omitted here)
CREATE TABLE `file_info1`(
`path` varchar(200),
`fsize` bigint,
`usrname` varchar(100),
`depth` int)
STORED AS ORC;

Load data into the Hive table:

LOAD DATA INPATH '/user/ambari-qa/fsimage.out' INTO TABLE fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/') as path1, fsize ,usrname, 1 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/') as path1, fsize ,usrname, 2 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/') as path1, fsize ,usrname, 3 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/') as path1, fsize ,usrname, 4 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/') as path1, fsize ,usrname, 5 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/' ,split(path,'/')[6] , '/') as path1, fsize ,usrname, 6 from fsimage_txt1;

Search for the directories with the highest number of small files:

-- The query below shows the depth-2 directories with the most small files (HDFS files smaller than ~30 MB)
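The depth expansion in the inserts above can be sketched in Python as a quick sanity check (a hypothetical helper, not part of the pipeline; note that Hive's split(path,'/')[1] is the first path component because splitting an absolute path on '/' yields an empty element at index 0):

```python
def dir_prefix(path: str, depth: int) -> str:
    """Return the directory prefix of `path` truncated to `depth` components,
    mirroring the concat(split(path,'/')[1] .. split(path,'/')[depth]) expressions
    in the Hive INSERT statements. This sketch does not reproduce Hive's NULL
    result when `depth` exceeds the actual path depth."""
    parts = path.split('/')  # leading '/' puts '' at index 0
    return '/' + '/'.join(parts[1:depth + 1]) + '/'

# Example: expand one fsimage path to prefixes of depth 1..3
path = "/user/abc/logs/2017/part-00000"
prefixes = [dir_prefix(path, d) for d in (1, 2, 3)]
print(prefixes)  # ['/user/', '/user/abc/', '/user/abc/logs/']
```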
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and depth = 2 group by path order by cnt desc limit 20;
Sample output
-------------
/user/abc/ 13400550
/hadoop/data/ 10949499
...
/tmp/ 340400
-- Take the directory with the most files (from the output above), or any directory of interest, and drill down using the 'depth' column
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and path like '/user/abc/%' and depth = 3 group by path order by cnt desc limit 20;
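For a small fsimage dump, the same aggregation can be approximated in plain Python without Hive (a hypothetical sketch; field positions follow the fsimage_txt1 definition above, with path as the first '|'-delimited field and fsize as the seventh):

```python
from collections import Counter

SMALL = 30_000_000  # bytes, matching the fsize <= 30000000 filter in the Hive query

def top_small_file_dirs(lines, depth=2, limit=20):
    """Count files smaller than SMALL per directory prefix of `depth` components.
    Each line is '|'-delimited like `hdfs oiv -p Delimited` output:
    path|repl|mdate|atime|pblksize|blkcnt|fsize|...  (fsize is the 7th field)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        path, fsize = fields[0], int(fields[6])
        parts = path.split("/")
        # skip entries shallower than `depth` (e.g. files sitting directly under /)
        if fsize <= SMALL and len(parts) > depth + 1:
            counts["/" + "/".join(parts[1:depth + 1]) + "/"] += 1
    return counts.most_common(limit)

# Tiny made-up sample in the delimited fsimage format
sample = [
    "/user/abc/f1|3|2017-01-01|2017-01-01|134217728|1|1024|0|0|rw-r--r--|abc|hdfs",
    "/user/abc/f2|3|2017-01-01|2017-01-01|134217728|1|2048|0|0|rw-r--r--|abc|hdfs",
    "/tmp/x/big|3|2017-01-01|2017-01-01|134217728|2|200000000|0|0|rw-r--r--|abc|hdfs",
]
print(top_small_file_dirs(sample))  # [('/user/abc/', 2)]
```

This only replaces the Hive step for dumps that fit in memory on one machine; for a large cluster the Hive tables above remain the practical route.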
09-21-2017
01:01 PM
Spark performs better when all the data fits in memory, especially on dedicated clusters; Hadoop MapReduce is designed for data that doesn't fit in memory, and it can run well alongside other services. See more: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
05-15-2017
02:52 PM
@btandel - A possible cause for this error is a Kerberos misconfiguration on the datanode. Verify that the JCE security libraries are deployed correctly on the datanode and that its krb5.conf file is set up the same way as on the other datanodes.
02-14-2017
03:31 PM
Hi @Bala Vignesh N V
Here are some links you can go through to understand the differences between MR and Tez.

What is the difference between MR and Tez, and why is Tez faster?
http://hortonworks.com/apache/tez/#section_2
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey (see slides 9 to 16)

Why does a job fail in Tez but run in MR? There could be several reasons for the failure. It could be because of OOM exceptions when memory is not tuned correctly (refer: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/determine-hdp-memory-config.html). I have also seen cases where a job fails in MR but runs fine when the Tez execution engine is used.
02-01-2017
07:06 PM
Hi @Saikrishna Tarapareddy
Can you provide the output of the following?
1. The table definition: hive> show create table test_display;
2. The files in the HDFS location: hadoop fs -ls <test_display table HDFS location>
10-12-2016
04:38 PM
1 Kudo
Just wanted to give an update on this thread. This feature is available in HDP 2.3.4. However, the value set for hive.query.name will not show up in the Resource Manager UI; instead it shows in the Tez UI, because Hive reuses YARN applications to run multiple different queries by different users. The RM UI displays the YARN application name, and since Hive/Tez reuses applications heavily, there is no 1-to-1 relationship between an application and a DAG or query.
11-17-2015
08:16 PM
@Andrew Watson - Thank you for the quick response. Unless a customer is dealing with a special type of data, is it safe to assume that 1) 128 MB is the optimal value for dfs.blocksize, and 2) this value is not changed often?
11-17-2015
04:30 PM
Also, did we recommend that any customers go with a higher block size? If so, what observations led to those recommendations?
Labels:
- Apache Hadoop
10-02-2015
02:24 AM
1 Kudo
With doAs=false, every Hive table has to be owned by the "hive" user, but if anyone creates an external table, it won't be owned by the hive user. Are we supposed to restrict the usage of external tables?
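For reference, the doAs behavior discussed here is controlled by the standard HiveServer2 property hive.server2.enable.doAs in hive-site.xml; a minimal fragment:

```xml
<!-- hive-site.xml: run queries as the hive service user rather than the end user -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
```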
Labels:
- Apache Ranger