Member since: 09-25-2015
Posts: 22
Kudos Received: 9
Solutions: 1
My Accepted Solutions
Title | Views | Posted |
---|---|---|
3327 | 06-12-2018 09:26 PM |
06-12-2018
09:26 PM
https://community.hortonworks.com/content/supportkb/178111/enable-multiple-hive-llap-instances-in-a-single-hd.html
10-19-2017
08:13 PM
6 Kudos
This article describes how to identify where most of the small files are located in a large HDFS cluster. The articles below cover the small-file problem and how to analyze it:
https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html
https://community.hortonworks.com/repos/105768/identifying-small-file-offenders.html
https://community.hortonworks.com/articles/142016/analyze-fsimage-from-a-non-smartsense-cluster.html

The SmartSense Activity Analyzer currently does not report by HDFS location, but it shows other useful information such as the top N users with small files, trends, etc.

In a large HDFS cluster with a heavy workload, it is often hard to find where most of the small files are located using 'fsck' or 'hdfs dfs -ls -R' output, because these commands can take a long time to return data and have to be repeated several times to get the desired result.

I took the approach below to spot the HDFS locations that hold most of the small files in a large cluster, so that users can look into the data and find the origin of the files (for example, an incorrect table partition key).

Read fsimage and store in HDFS:
- Copy the fsimage file to a different location. (Note: do not run the command below on the live fsimage file.)
hdfs oiv -p Delimited -delimiter "|" -t /tmp/tmpdir/ -i /fsimage_copy -o /fsimage.out
hdfs dfs -put /fsimage.out /user/ambari-qa/

Load to Hive:
Create tables in Hive:
CREATE TABLE `fsimage_txt1`(
`path` varchar(2000),
`repl` int,
`mdate` date,
`atime` date,
`pblksize` int,
`blkcnt` bigint,
`fsize` bigint,
`nsquota` bigint,
`dsquota` bigint,
`permi` varchar(100),
`usrname` varchar(100),
`grpname` varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://mycluster/apps/hive/warehouse/fsimage_txt'
TBLPROPERTIES (
'last_modified_by'='ambari-qa',
'last_modified_time'='1508277903',
'numFiles'='1',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='12458332430',
'transient_lastDdlTime'='1508277903');
CREATE TABLE `file_info1`(
`path` varchar(200),
`fsize` bigint,
`usrname` varchar(100),
`depth` int)
STORED AS ORC;

Load data into the Hive tables:
LOAD DATA INPATH '/user/ambari-qa/fsimage.out' INTO TABLE fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/') as path1, fsize ,usrname, 1 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/') as path1, fsize ,usrname, 2 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/') as path1, fsize ,usrname, 3 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/') as path1, fsize ,usrname, 4 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/') as path1, fsize ,usrname, 5 from fsimage_txt1;
INSERT INTO TABLE file_info1 select concat('/' , split(path,'/')[1] , '/' , split(path,'/')[2] , '/', split(path,'/')[3] , '/', split(path,'/')[4] , '/', split(path,'/')[5] , '/' ,split(path,'/')[6] , '/') as path1, fsize ,usrname, 6 from fsimage_txt1;

Search for the directories with the highest number of small files:
-- The query below shows the directories at depth 2 with the most small files (HDFS files of 30 MB or less)
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and depth = 2 group by path order by cnt desc limit 20;
Sample output
-------------
/user/abc/ 13400550
/hadoop/data/ 10949499
...
/tmp/ 340400
-- Take the directory with the most files (from the output above), or any directory of interest, and drill down using the 'depth' column
select path, count(1) as cnt from file_info1 where fsize <= 30000000 and path like '/user/abc/%' and depth = 3 group by path order by cnt desc limit 20;
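
For a quick check on a sample of the delimited output before loading it into Hive, the same aggregation can be sketched in Python. This is a minimal sketch and not part of the original procedure: the function names `dir_prefix` and `count_small_files` are hypothetical, and the column positions are assumed to follow the `fsimage_txt1` definition above (path in the first field, fsize in the seventh).

```python
from collections import Counter

SMALL_FILE_BYTES = 30_000_000  # same threshold as the Hive queries above


def dir_prefix(path, depth):
    """Return the first `depth` components of `path` as '/a/b/'.

    Returns None when the path has fewer than `depth` components
    (the Hive INSERTs produce a NULL path for such rows; this sketch skips them).
    """
    parts = path.split("/")[1:]  # drop the empty string before the leading '/'
    if len(parts) < depth:
        return None
    return "/" + "/".join(parts[:depth]) + "/"


def count_small_files(lines, depth, under="/", delimiter="|", top=20):
    """Count rows with fsize <= SMALL_FILE_BYTES, grouped by path prefix at `depth`.

    `lines` is any iterable of delimited rows; `under` restricts the result to
    prefixes starting with that string (like the `path like '...%'` filter).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 12:
            continue  # not a complete 12-column row
        try:
            fsize = int(fields[6])  # assumed position of the file size column
        except ValueError:
            continue  # non-numeric size, for example a header row
        if fsize > SMALL_FILE_BYTES:
            continue
        prefix = dir_prefix(fields[0], depth)
        if prefix is not None and prefix.startswith(under):
            counts[prefix] += 1
    return counts.most_common(top)


# Example usage (assumes a local copy of the delimited output):
# with open("/fsimage.out") as f:
#     for path, cnt in count_small_files(f, depth=2):
#         print(path, cnt)
```

Passing `depth=3, under="/user/abc/"` mirrors the drill-down query. A single pass in memory is only practical for a sample or a small cluster; for a multi-gigabyte fsimage dump the Hive tables remain the better fit.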
09-21-2017
01:01 PM
Spark performs better when all the data fits in memory, especially on dedicated clusters; Hadoop MapReduce is designed for data that doesn’t fit in memory, and it can run well alongside other services. See more: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
05-15-2017
02:52 PM
@btandel - A possible cause of this error is a Kerberos misconfiguration on the DataNode. Verify that the JCE security libraries are deployed correctly on the DataNode and that the krb5.conf file is set up the same way as on the other DataNodes.
02-14-2017
03:31 PM
Hi @Bala Vignesh N V
Here are some links that you can go through to understand the differences between MR and Tez.

Difference between MR and Tez, and why is Tez faster?
http://hortonworks.com/apache/tez/#section_2
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey (see slides 9 to 16)

Why does a job fail in Tez but run in MR?
There could be several reasons for the job failure. One is OOM exceptions when memory is not tuned correctly (refer to: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/determine-hdp-memory-config.html). I have also seen cases where a job fails in MR but runs fine when the Tez execution engine is used.
02-01-2017
07:06 PM
Hi @Saikrishna Tarapareddy
Can you provide the output of the following?
1. The table definition:
hive> show create table test_display;
2. The files in the table's HDFS location:
hadoop fs -ls <table_display table hdfs location>
10-12-2016
04:38 PM
1 Kudo
Just wanted to give an update on this thread. This feature is available in HDP 2.3.4. However, the value set for hive.query.name will not show up in the ResourceManager UI; it shows up in the Tez UI instead. The RM UI displays the YARN application name, and Hive on Tez heavily reuses YARN applications to run multiple queries from different users, so there is no one-to-one relationship between an application and a DAG or query.
11-17-2015
08:16 PM
@Andrew Watson - Thank you for the quick response. Unless a customer is dealing with a special type of data, is it safe to assume that 1) 128 MB is the optimal value for dfs.blocksize, and 2) this value is not changed often?
11-17-2015
04:30 PM
Also, have we recommended a larger block size to any customers? If so, what observations led to that recommendation?
Labels: Apache Hadoop
10-02-2015
02:24 AM
1 Kudo
With doAs=false, every Hive table has to be owned by the “hive” user, but an external table created by anyone else won’t be owned by the “hive” user. Are we supposed to restrict the usage of external tables?
Labels: Apache Ranger