Member since: 02-27-2017
Posts: 171
Kudos Received: 9
Solutions: 0
08-04-2017 05:34 AM
@Peter Kim
1) DataNode services have been installed on slave1, slave2, and slave3.
2) The DataNode directories are /grid/data1/hadoop/hdfs/data, /grid/data2/hadoop/hdfs/data, and /grid/data3/hadoop/hdfs/data.
3) Yes, I checked the disk mount list on the slaves; I have attached a screenshot for slave1 (disk-mount-point.png), which shows the disk mounted at /grid/data1.
Please let me know if anything else is required. Thanks
08-03-2017 05:27 PM
Hi @Sonu Sahi, thanks for your reply. Are you suggesting that we should create 4 HDFS config groups for master, slave1, slave2, and slave3, and set the DataNode directories (dfs.datanode.data.dir) to /grid/data1, /grid/data2, and /grid/data3 for slave1, slave2, and slave3 respectively, with each config group containing only its own node's entry (i.e., the slave1 config group would list only /grid/data1 as the DataNode directory)? That would ensure that HDFS data on slave1 goes only into /grid/data1, with nothing written to /grid/data2 or /grid/data3 on slave1, and likewise for the other two slave nodes. And do we need to change the replication factor as well? Please correct me if I have understood this incorrectly. One more thing: if the above is the solution to our problem, what about the data that already exists under /grid/master, /grid/data2, and /grid/data3 on slave1? How should we manage that data? Thanks
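If that is the approach, a rough sketch of what the per-host config groups might look like (the property name and exact paths here are my assumption, extrapolated from the directories mentioned in this thread, not a confirmed answer):

Config group "slave1" -> dfs.datanode.data.dir = /grid/data1/hadoop/hdfs/data
Config group "slave2" -> dfs.datanode.data.dir = /grid/data2/hadoop/hdfs/data
Config group "slave3" -> dfs.datanode.data.dir = /grid/data3/hadoop/hdfs/data

Each group would be assigned only its own host, so the other mount-point paths would never appear in that host's effective configuration.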
08-03-2017 06:59 AM
We have a 6-node Hadoop cluster (1 edge node, 2 masters (primary, secondary), and 3 slave nodes) running on Azure VMs. We have attached 1 TB disks, mounted at /grid/master, /grid/data1, /grid/data2, and /grid/data3 on the master, slave1, slave2, and slave3 respectively (one disk per node). Our replication factor is 3. In Ambari we have specified /grid/data1, /grid/data2, and /grid/data3 as the DataNode directories and /grid/master1/hadoop/hdfs/namenode as the NameNode directory. But since the other three mount points (/grid/data2, /grid/data3, and /grid/master) do not exist on slave1, the Hadoop services have started creating those folders on slave1's local filesystem, and the same is happening on the other two slave nodes. This is filling up our local filesystems very fast. Is there any way to deal with this scenario? Are there specific properties in Ambari that need to be checked to prevent this from happening? And since some data (replicated or otherwise) has already landed on the local filesystems of different nodes, can we handle this safely by backing it up without losing any data? Does the replication factor need to be changed to 1? Could someone suggest an approach for handling this safely? Any help would be much appreciated. Thanks, Rahul
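As a hedged aside (not part of the original question): a quick way to see which of the configured DataNode directories are real mounts versus folders silently created on the OS disk is to check them against the mount table on each slave, for example on slave1:

df -h /grid/data1 /grid/data2 /grid/data3
# any path that is not a separate mount is reported against the root ("/") filesystem,
# which is where the unwanted DataNode data is landing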
Labels:
- Apache Ambari
- Apache Hadoop
07-13-2017 11:57 AM
I have a fixed-width file that I am trying to load into Hive. The problem is that one of the lines in the file contains a '\n' character, which causes the record to split and the regex to fail. I am creating the table in Hive with the following statement:

create external table test.abc1_ext(
  a STRING, b STRING, c STRING, d STRING, e STRING, f STRING,
  g STRING, h STRING, i STRING, j STRING, k STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{12})(.{1})(.{50})(.{30})(.{5})(.{30})(.{4})(.{26})(.{10})(.{10})(.{8})")
LOCATION '/abc/';

Column d contains the '\n' that causes the record to split. Is there a way to handle that in Hive? Regards, Rahul
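Hive's default text input format always treats '\n' as a record separator, so the RegexSerDe never sees the two halves as one row. One workaround, offered only as a hedged sketch (the file names and the exact handling of the stray newline are my assumptions, not part of the original question), is to re-join the split fragments before loading, using the fact that a full record is 186 characters wide (12+1+50+30+5+30+4+26+10+10+8):

import scala.io.Source
import java.io.PrintWriter

object FixFixedWidth {
  // total of the column widths in the RegexSerDe pattern above
  val RecordWidth = 186

  def main(args: Array[String]): Unit = {
    val in  = Source.fromFile("abc_raw.txt")       // hypothetical raw input file
    val out = new PrintWriter("abc_clean.txt")     // hypothetical cleaned file to upload to LOCATION '/abc/'
    val buf = new StringBuilder
    for (line <- in.getLines()) {
      buf.append(line)
      // assumption: a line shorter than the record width is a fragment produced by an
      // embedded '\n'; keep appending until a full-width record has been assembled
      if (buf.length >= RecordWidth) {
        out.println(buf.toString())
        buf.clear()
      }
    }
    if (buf.nonEmpty) out.println(buf.toString())  // flush any trailing partial record
    out.close()
    in.close()
  }
}

If the embedded newline replaces a character of column d rather than being an extra character, the rejoined record will be one character short and the threshold (or a padding step) would need adjusting; the sketch is only meant to show the re-joining idea.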
Labels:
- Apache Hive
06-07-2017 10:28 AM
I am trying to run a Spark application that reads data from Hive tables into dataframes and joins them. When I run the dataframe operations individually in spark-shell, all the joins work fine and I am able to persist the data in ORC format in HDFS. But when I run it as an application using spark-submit, I get the error below:

Missing an output location for shuffle 2

I did some research on this and found it to be related to a memory issue. What I don't understand is why this error does not occur in spark-shell with the same configuration, where I am able to persist everything. The command I am using to run the application is:

spark-submit --master yarn-client --driver-memory 10g --num-executors 3 --executor-memory 10g --executor-cores 2 --class main.scala.test.Cences --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar /home/talend/test_2.11-0.0.1.jar

My cluster configuration is 2 master nodes, 3 slave nodes (4 cores and 28 GB each), and 1 edge node. The Hive tables I am reading from are only around 150 MB in size, which is very small compared to the memory I am giving the Spark application. I am calling dataframe functions such as saveAsTable(), write.format(), and persist() at various points in the application. Any suggestions would be really helpful.
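As a hedged aside (not a confirmed fix for this case): the "Missing an output location for shuffle" message generally appears when an executor holding shuffle output was lost, often because YARN killed its container for exceeding its memory limit. A commonly tried mitigation is to leave more headroom for off-heap/overhead memory, for example along these lines (the 8g and 2048 values are illustrative assumptions, not recommendations specific to this cluster):

spark-submit --master yarn-client --driver-memory 10g --num-executors 3 --executor-memory 8g --executor-cores 2 --conf spark.yarn.executor.memoryOverhead=2048 --class main.scala.test.Cences --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars <same datanucleus jars as above> /home/talend/test_2.11-0.0.1.jar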
Labels:
- Apache Spark
05-08-2017 05:49 AM
I have 2 dataframes in Spark, as shown below:

val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
test: org.apache.spark.sql.DataFrame = [test_dt: string]

val test1 = hivecontext.table("testing")

where test1 has columns like id, name, age, audit_dt. I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt, but somehow I am not able to do that. I am able to compare audit_dt with a literal date using the lit function, but not with a column of another dataframe. The literal comparison that works is:

val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))

Can anyone suggest a way to compare it with the column of dataframe test? Thanks, Rahul
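A hedged sketch of one way to do this (names taken from the question; it assumes test really returns exactly one row and that the column in test1 matches the date column referred to above):

import org.apache.spark.sql.functions.{col, lit, to_date}

val maxDt = test.first().getString(0)                 // pull out the single max(test_dt) value, e.g. "2017-03-23"
val output = test1.filter(to_date(test1("audit_dt")).gt(lit(maxDt)))

// Alternative that stays entirely in the DataFrame API: a cartesian join of the
// one-row dataframe onto test1, then a column-to-column comparison.
// val output2 = test1.join(test).filter(to_date(col("audit_dt")).gt(to_date(col("test_dt"))))

Either way the filter ends up comparing a column against a literal or against another column, which is what the DataFrame filter API expects.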
Labels:
- Apache Spark
05-02-2017 05:14 PM
Hello @Vipin Rathor, that's great. I have a 6-node cluster (1 edge node, 1 primary NN, 1 secondary NN, and 2 slave nodes). Just to confirm: shall I set up the new MIT KDC on my Ambari server node (master node 1) and then go for the Ambari automated Kerberos security setup? I assume that approach would be best in my case, since I need to set up the MIT KDC as well. I am following the Hortonworks doc below for the Kerberos setup. http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.2/bk_dataflow-security/content/_optional_install_a_new_mit_kdc.html Thanks in advance!!
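For reference, the general flow for standing up a new MIT KDC on the chosen host looks roughly like the outline below (a hedged sketch only; the package names are for RHEL/CentOS and the realm and principal details are placeholders, not values from this thread):

yum install krb5-server krb5-libs krb5-workstation      # on the KDC host, e.g. the Ambari server node
# edit /etc/krb5.conf and /var/kerberos/krb5kdc/kdc.conf with your realm and KDC hostname
kdb5_util create -s                                      # create the Kerberos database
kadmin.local -q "addprinc admin/admin"                   # admin principal for the Ambari Kerberos wizard to use
service krb5kdc start; service kadmin start              # or the systemctl equivalents

Once the KDC is up, the Ambari "Enable Kerberos" wizard can be pointed at it using the admin principal created above.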
05-02-2017 05:23 AM
I have installed Ranger/Ranger KMS on an HDP 2.5 cluster using Ambari. Just wanted to check: can we set up Kerberos on HDP 2.5 after Ranger/Ranger KMS have been installed on the cluster, or do we need to remove Ranger/Ranger KMS first in order to install Kerberos? Thanks
Labels:
- Apache Ranger
04-11-2017 12:39 PM
I have installed Ranger KMS on a 6-node cluster and am now trying to create an encryption zone. For that I am following the link below: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_hdfs_admin_tools/content/hdfs-encr-appendix.html Step 5 of that link says to add the newly created group to the dfs.permissions.superusergroup property in Ambari. So I set that property to "hdfs,cdp" in Ambari and restarted HDFS. But I am not able to run the hdfs dfsadmin -report command as user "mgr", who is in group "cdp". I wanted to check whether we can put 2 different values in the same property in Ambari, or whether we need to keep only the newly created group in dfs.permissions.superusergroup. If so, will removing hdfs have any implications? We have created a few HDFS directories using the hdfs user and there is data in those. Or is there a way to provide both values as superuser groups?
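Not an answer from the thread, but a quick hedged check that can help narrow this down: superuser checks depend on the groups the NameNode resolves for the calling user, so it is worth confirming what HDFS actually sees for "mgr" on the NameNode host:

hdfs groups mgr                       # groups the NameNode resolves for user mgr
sudo -u mgr hdfs dfsadmin -report     # the failing command, re-run for comparison

If "cdp" does not appear in the first command's output, the issue is group resolution on the NameNode rather than the dfs.permissions.superusergroup value itself.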
Labels:
- Apache Ambari
- Apache Ranger
04-03-2017 11:40 AM
@Deepak Sharma Yeah, I missed the main thing: I was not restarting the LDAP service. Thanks for the answer. By the way, do you have any reference links to help me connect to Hive through Knox? Thanks