Member since: 02-27-2017
Posts: 171
Kudos Received: 9
Solutions: 0
04-17-2018
07:32 AM
I have a 6-node HDP 2.5 cluster (1 edge node, 1 master, 1 secondary master, 3 data nodes) running on Azure VMs. I am looking to install Kafka on that cluster, along with the Kafka Connect API and Kafka Streams API. When I go to "Add Service" in Ambari, it shows Kafka 0.10 for HDP 2.5. What would be the best way to install Kafka on HDP 2.5 in my case? How many brokers should I install, and on which nodes should the brokers and ZooKeeper run? Any help would be appreciated. Thanks in advance.
Labels: Apache Kafka
01-15-2018
10:29 AM
Hi @Bala Vignesh N V, I am not able to open the shell even though the cluster is only around 60% utilized, as seen in the YARN running applications. By "opening the shell" I mean launching the Hive shell and running individual queries in it. It is not related to reducers; the data and the tuning of the jobs have already been taken care of. The problem is that the shell cannot be opened while Spark shells or Spark jobs are running in yarn-cluster mode. Thanks, Rahul
01-04-2018
09:14 AM
yarn-running-applications.png Hi, I am facing a lot of trouble with slowness/unresponsiveness of the Hive CLI. Initially I thought it might be due to a lack of resources in the cluster, but when I checked the running applications in YARN I found that plenty of resources were still free. I face the same issue when I run a Spark application that acquires around 50-60% of the cluster. Please note that I have not set up queuing in YARN; all my applications go to the default queue. I am not able to understand why opening the Hive CLI gets stuck even when resources are available in the cluster. Could anyone from the community help me resolve this? Do I need to set up queuing? I am also attaching a screenshot of the running applications in YARN at the moment I try to open the Hive shell.
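If queuing does turn out to be the way to go, here is a minimal Capacity Scheduler sketch, assuming a single extra "spark" queue alongside "default"; the queue name and the percentages are illustrative only, not taken from this cluster:

yarn.scheduler.capacity.root.queues=default,spark
yarn.scheduler.capacity.root.default.capacity=60
yarn.scheduler.capacity.root.default.maximum-capacity=80
yarn.scheduler.capacity.root.spark.capacity=40
yarn.scheduler.capacity.root.spark.maximum-capacity=60

Spark jobs would then be submitted with --queue spark so that Hive sessions in the default queue are not starved of containers.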
Labels: Apache Ambari, Apache Hive
09-22-2017
10:44 AM
@Bala Vignesh N V Hi Bala, thanks for your reply. I am planning to go with the first solution as well. I assume that id in solution 1 means the primary key, i.e. there should be a single record for each id in both the source and the target table? In my case id is not unique, so I have to join on multiple columns to get a single record for each combination key. Please suggest. Thanks, Rahul
09-22-2017
06:39 AM
Hi @Bala Vignesh N V, thanks for your reply. max(row_number) is not working in Hive. Do you mean that we need to get the latest value of each partition? If yes, is there any other function to do that, since max is not working? Thanks, Rahul
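As a side note, a common Hive pattern for picking the latest row per key without max(row_number()) is a subquery with ROW_NUMBER() filtered to 1. This is only a sketch; target_table, id and audit_dt are placeholder names, not the actual schema from this thread:

SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY audit_dt DESC) AS rn
  FROM target_table t
) x
WHERE x.rn = 1;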
09-22-2017
06:38 AM
Hi @Bala Vignesh N V, thanks for your reply. max(row_number()) is not working in Hive. Do you mean that we need to get the latest value of the target table? Is there any other function to achieve that, since max gives an error? Thanks, Rahul
09-18-2017
10:29 AM
I need to ingest a large amount of data (around 48 million rows) into a Hive ORC table using Talend. I am also decrypting and normalizing/denormalizing a few of the columns before ingesting the data. Due to memory issues I cannot ingest everything at once, so I need to ingest it in chunks (by dividing on the basis of some columns). I tried using LIMIT but was not sure whether LIMIT would return new rows on each run of the job. Is there any other Hive analytics function that fits this scenario? I want to break the data across 3-4 runs of the job so that it does not cause an out-of-memory error. Example: I ran a query grouping by relation and it showed the results below.

hive> select count(1), relation from refinedlayer.customerpref group by relation;
OK
1719076   NULL
2523      CHILD
33522     OTHER
121394    PARTNER
3282312   SELF

I would like to break the records of the SELF relation into 5-6 runs. Can I use any analytics function on top of it?
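One option, as a rough sketch, is Hive's NTILE windowing function, which deals the SELF rows into a fixed number of roughly equal buckets that each run of the job can then filter on. The customer_id column below is a placeholder, not a column from the actual table:

SELECT *
FROM (
  SELECT c.*,
         NTILE(6) OVER (ORDER BY c.customer_id) AS bucket_id
  FROM refinedlayer.customerpref c
  WHERE c.relation = 'SELF'
) t
WHERE t.bucket_id = 1;   -- run 1 of 6; later runs filter on bucket_id = 2, 3, ... 6

A deterministic alternative that avoids the global ORDER BY is a hash bucket in the WHERE clause, e.g. pmod(abs(hash(customer_id)), 6) = 0 through 5.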
Labels: Apache Hive
08-04-2017
05:34 AM
disk-mount-point.png @Peter Kim 1) The DataNode service has been installed on slave1, slave2 and slave3. 2) The DataNode directories are /grid/data1/hadoop/hdfs/data, /grid/data2/hadoop/hdfs/data and /grid/data3/hadoop/hdfs/data. 3) Yes, I checked the disk mount list on the slaves; I have attached a screenshot for slave1. The disk is mounted on /grid/data1, as shown in the screenshot. Please let me know if anything else is required. Thanks
08-03-2017
05:27 PM
Hi @Sonu Sahi, thanks for your reply. Are you suggesting that we should create four HDFS config groups for master, slave1, slave2 and slave3, and set the DataNode directories property (dfs.datanode.data.dir) to /grid/data1, /grid/data2 and /grid/data3 for slave1, slave2 and slave3 respectively, with each config group containing only the entry for its own node, i.e. the slave1 config group would have just /grid/data1 as its DataNode directory, and so on? This would make sure that HDFS data on slave1 goes into /grid/data1 and none goes into /grid/data2 or /grid/data3 on slave1, and likewise for the other two slave nodes. Do we also need to change the replication factor? Please correct me if I have misunderstood. One more thing: if the above scenario is the solution to our problem, what about the data that already exists under /grid/master, /grid/data2 and /grid/data3 on slave1? How do we manage that data? Thanks
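For illustration, the per-host values under that approach might look like the following in each config group's DataNode directories field; the paths come from this thread, but whether a single directory per node is appropriate depends on how the disks are actually attached:

slave1 config group: dfs.datanode.data.dir = /grid/data1/hadoop/hdfs/data
slave2 config group: dfs.datanode.data.dir = /grid/data2/hadoop/hdfs/data
slave3 config group: dfs.datanode.data.dir = /grid/data3/hadoop/hdfs/data

With three DataNodes and a replication factor of 3, each block would be stored once per node, so the replication factor would not necessarily need to change.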
08-03-2017
06:59 AM
We have a 6-node Hadoop cluster (1 edge node, 2 masters (primary, secondary) and 3 slave nodes) running on Azure VMs. We have attached 1 TB data disks to the nodes, mounted at /grid/master, /grid/data1, /grid/data2 and /grid/data3 on the master, slave1, slave2 and slave3 respectively. Our replication factor is 3, and in Ambari we have specified /grid/data1, /grid/data2 and /grid/data3 as the DataNode directories and /grid/master1/hadoop/hdfs/namenode as the NameNode directory. But since the other three mount points, i.e. /grid/data2, /grid/data3 and /grid/master, do not exist on slave1, the Hadoop services have started creating those folders on the local filesystem of slave node 1. The same is happening on the other two slave nodes, and it is filling up our local filesystems very fast. Is there any way to deal with this scenario? Are there any specific properties in Ambari that need to be checked to prevent this from happening? And since some of the data (replicated or otherwise) already sits on the local filesystems of different nodes, can we tackle this safely by backing it up without losing any data? Does the replication factor need to be changed to 1? Could someone suggest an approach for handling this situation safely? Any help would be much appreciated. Thanks, Rahul
Labels: Apache Ambari, Apache Hadoop
07-13-2017
11:57 AM
I have a fixed-width file that I am trying to load into Hive, but one of the lines in the file contains a '\n' character, which splits the record and causes the regex to fail. I am creating the table in Hive using the approach below.

create external table test.abc1_ext(a STRING, b STRING, c STRING, d STRING, e STRING, f STRING, g STRING, h STRING, i STRING, j STRING, k STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{12})(.{1})(.{50})(.{30})(.{5})(.{30})(.{4})(.{26})(.{10})(.{10})(.{8})")
LOCATION '/abc/';

Column d contains the '\n' that splits the record. Is there a way to handle that in Hive? Regards, Rahul
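Hive's text input format always treats '\n' as a record delimiter, so the stray newline generally has to be cleaned before the RegexSerDe sees it. Here is a rough Spark sketch of one way to do that, assuming the newlines inside column d are stray line breaks rather than counted data characters; the input path is a placeholder, not from the post:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fixFixedWidth"))
val recordLen = 186  // 12+1+50+30+5+30+4+26+10+10+8, per the regex above

// Read each input file whole, drop every line break, and re-slice the stream
// into 186-character records; this only works because the layout is fixed width.
val cleaned = sc.wholeTextFiles("/abc_raw")
  .flatMap { case (_, content) =>
    content.replace("\r", "").replace("\n", "").grouped(recordLen)
  }

cleaned.saveAsTextFile("/abc")   // the external table's LOCATION

Note that wholeTextFiles pulls each file into memory on a single executor, so this sketch suits modestly sized files.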
Tags: Data Processing, Hive
Labels: Apache Hive
06-07-2017
10:28 AM
I am trying to run a Spark application that reads data from Hive tables into dataframes and joins them. When I run the dataframe operations individually in the Spark shell, all the joins work fine and I am able to persist the data in ORC format in HDFS. But when I run it as an application with spark-submit, I get the error below.

Missing an output location for shuffle 2

I did some research on this and found it to be related to a memory issue, but I do not understand why the error does not appear in the Spark shell with the same configuration, where I am able to persist everything. The command I am using to run the application is:

spark-submit --master yarn-client --driver-memory 10g --num-executors 3 --executor-memory 10g --executor-cores 2 --class main.scala.test.Cences --files /usr/hdp/current/spark-client/conf/hive-site.xml --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar /home/talend/test_2.11-0.0.1.jar

My cluster configuration is 2 master nodes, 3 slave nodes (4 cores and 28 GB each) and 1 edge node. The Hive tables I am reading from are only around 150 MB in size, which is very small compared to the memory I am giving the Spark program. In between, the application calls dataframe functions such as saveAsTable(), write.format() and persist(). Any suggestions would be really helpful.
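"Missing an output location for shuffle N" usually means an executor that held shuffle output was lost, often because YARN killed its container for exceeding the requested memory. One common mitigation is to give each executor container extra off-heap headroom; the value below is illustrative only, and the same setting can equally be passed on the command line as --conf spark.yarn.executor.memoryOverhead=1536:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: reserve extra non-heap memory (in MB, Spark 1.x) per executor container
// so YARN does not kill executors that stray slightly over the 10g heap request.
val conf = new SparkConf()
  .setAppName("Cences")
  .set("spark.yarn.executor.memoryOverhead", "1536")
val sc = new SparkContext(conf)

In yarn-client mode the executors are requested after the SparkContext is created, so setting the overhead in SparkConf like this does take effect.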
Labels: Apache Spark
05-08-2017
05:49 AM
I have two dataframes in Spark, as shown below.

val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing")

where test1 has columns like id, name, age, audit_dt. I want to compare these two dataframes and filter the rows from test1 where audit_dt > test_dt, but somehow I am not able to do that. I am able to compare audit_dt with a literal date using the lit function, but I am not able to compare it with a column of another dataframe. The literal comparison that works is:

val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))

Can anyone suggest a way to compare it with the column of dataframe test? Thanks, Rahul
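Since test is a single-row, single-column dataframe, one straightforward approach is to pull that value out of it and use it as a literal, as in this rough sketch (it assumes test_dt and audit_dt are strings in a comparable yyyy-MM-dd format, as the post suggests):

import org.apache.spark.sql.functions.{lit, to_date}

// Materialise the scalar max(test_dt) on the driver, then reuse the working
// literal-style filter with that value.
val maxDt = test.first().getString(0)
val output = test1.filter(to_date(test1("audit_dt")).gt(lit(maxDt)))

A join-based alternative (cross-joining the one-row test onto test1 and comparing the two columns) also works and keeps everything distributed, at the cost of a slightly more involved query.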
Labels: Apache Spark
05-02-2017
05:14 PM
Hello @Vipin Rathor, that's great. I have a 6-node cluster (1 edge node, 1 primary NN, 1 secondary NN and 2 slave nodes). Just to confirm: shall I set up the new MIT KDC on my Ambari server node (master node 1) and then go for the Ambari automated Kerberos security setup? I assume that approach is best in my case, since I need to set up an MIT KDC as well. I am following the Hortonworks doc below for the Kerberos setup. http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.2/bk_dataflow-security/content/_optional_install_a_new_mit_kdc.html Thanks in advance!
05-02-2017
05:23 AM
I have installed Ranger/Ranger KMS on an HDP 2.5 cluster using Ambari. I just wanted to check whether we can set up Kerberos on HDP 2.5 after Ranger/Ranger KMS have already been installed on the cluster, or whether we need to remove Ranger/Ranger KMS first before installing Kerberos. Thanks
Labels: Apache Ranger
04-26-2017
05:53 AM
@Ana Gillan I am facing the same issue. I have already specified the properties mentioned above in Talend, but it is still not working. I am able to access the files from the edge node CLI using the Talend sudo user, but not from Talend itself. Thanks
04-21-2017
01:03 PM
1 Kudo
I am trying to convert a string column in a dataframe to a date/time type. I am loading the dataframe from Hive tables and have tried the functions below to convert the string to a date, but they do not give the correct output: all values are converted to null.

(unix_timestamp($"BIRTHDT","MM-dd-yyyy").cast("date")) and (to_date(($"BIRTHDT","MM-dd-yyyy").cast("date"))

The values in my BIRTHDT column look like:

20061202
20061203
20061205
20061206
20061208

Am I missing something? Thanks, Rahul
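Since the BIRTHDT values are in yyyyMMdd form (e.g. 20061202) while the pattern passed in is MM-dd-yyyy, the mismatch is what turns every value into null. Here is a minimal sketch with the matching pattern; df is a placeholder for the dataframe loaded from the Hive table:

import org.apache.spark.sql.functions.{col, from_unixtime, to_date, unix_timestamp}

// unix_timestamp returns null when the pattern does not match, so the
// pattern string has to mirror the stored layout exactly.
val withDate = df.withColumn(
  "birth_date",
  to_date(from_unixtime(unix_timestamp(col("BIRTHDT"), "yyyyMMdd")))
)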
Labels: Apache Spark
04-14-2017
05:59 PM
I have created a HiveContext in Spark and I am reading Hive ORC tables through it into Spark dataframes, which I have registered as temporary tables. I am looking for how to specify a left outer join when running SQL queries on those temporary tables. Any help would be appreciated. The code I am running is below.

import org.apache.spark.sql._
val hivecontext = new org.apache.spark.sql.hive.HiveContext(sc)
val a = hivecontext.table("customer.a_orc")
val b = hivecontext.table("customer.b_orc")
a.registerTempTable("a")
b.registerTempTable("b")
val output = hivecontext.sql("select a.*,b.* from a,b where a left outer join b on (a.id=b.id))

The output dataframe above gives an error. Is there a way to specify a left outer join like this, or do I have to create separate dataframes?
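The error comes from mixing the comma-join and JOIN syntaxes in one FROM clause; the ON clause should replace the WHERE/comma form. Two equivalent sketches using the same names as above:

// Spark SQL on the registered temp tables
val viaSql = hivecontext.sql("select a.*, b.* from a left outer join b on (a.id = b.id)")

// DataFrame API, no temp tables required
val viaApi = a.join(b, a("id") === b("id"), "left_outer")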
Labels: Apache Spark
04-11-2017
12:39 PM
I have installed Ranger KMS on a 6-node cluster and I am now trying to create an encryption zone, following the link below. http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_hdfs_admin_tools/content/hdfs-encr-appendix.html In step 5 of that link it says to add the newly created group to the dfs.permissions.superusergroup property in Ambari, so I set it to "hdfs,cdp" and restarted HDFS. But I am not able to run hdfs dfsadmin -report as the user "mgr" in group "cdp". I wanted to check whether we can put two different values in that property in Ambari, or whether we need to keep only the newly created group in dfs.permissions.superusergroup. If the latter, will removing hdfs have any implications? We have created a few HDFS directories as the hdfs user and there is data in them. Or is there another way to make both groups superuser groups?
Labels: Apache Ambari, Apache Ranger
04-11-2017
11:19 AM
@Deepak Sharma @Sagar Shimpi @Vipin Rathor
04-11-2017
11:10 AM
I have installed Ranger KMS on a 6-node cluster and I am now trying to create an encryption zone, following the link below. http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_hdfs_admin_tools/content/hdfs-encr-appendix.html In step 5 of that link it says to add the newly created group to the dfs.permissions.superusergroup property in Ambari, so I set it to "hdfs,cdp" and restarted HDFS. But I am not able to run hdfs dfsadmin -report as the user "mgr" in group "cdp". I wanted to check whether we can put two different values in that property in Ambari, or whether we need to keep only the newly created group in dfs.permissions.superusergroup. If the latter, will removing hdfs have any implications? We have created a few HDFS directories as the hdfs user and there is data in them. Or is there another way to make both groups superuser groups?
Labels: Apache Ambari, Apache Ranger
04-11-2017
06:53 AM
Hi @Tim Armstrong and @Adar, I was able to resolve the GitHub SSL certificate issue. I am now trying to download the quickstart OVA file from the link below, but that repo seems to be very slow. Is there any other repo from which we can download the Kudu quickstart VM file? http://cloudera-kudu-beta.s2.amazonaws.com/cloudera-quickstart-vm-5.10.0-kudu-virtualbox.ova Thanks, Rahul
04-11-2017
01:11 AM
Hi Adar, thanks for your reply. I am trying the option of installing the Kudu quickstart VM in VirtualBox on Windows. I have installed and set up Ubuntu in VirtualBox on the Windows OS, and I am following the link below to set up Kudu. https://kudu.apache.org/docs/quickstart.html When I run the curl command, it does not do or return anything. The command is: curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash I think it is unable to establish an SSL connection. Is there a workaround for this, or am I doing something wrong? Please suggest. Regards, Rahul
04-11-2017
01:09 AM
Hi Tim, thanks for your reply. I am trying the option of installing the Kudu quickstart VM in VirtualBox on Windows. I have installed and set up Ubuntu in VirtualBox on the Windows OS, and I am following the link below to set up Kudu. https://kudu.apache.org/docs/quickstart.html When I run the curl command, it does not do or return anything. The command is: curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash I think it is unable to establish an SSL connection. Is there a workaround for this, or am I doing something wrong? Please suggest.
04-10-2017
11:14 AM
Thanks Tim. I checked CDH 5.10 and it seems I would need to install a full cluster with Cloudera Manager, whereas I wanted a quickstart kind of setup that can be up and running quickly. I actually need to show my client a demo of Kudu with Flume and Spark within a few days, and I want to keep it simple and set it up fast. In that case, shall I go ahead with the Kudu quickstart demo on Ubuntu (Windows VirtualBox), or is there a quicker way to have this demo up and running? I was watching Ryan Bosshart's videos on Safari and he used the Kudu quickstart on Mac OS. Please suggest.
04-10-2017
10:10 AM
Hi, is there a way to install Kudu on the Cloudera CDH 5.8 quickstart VM? Or is there a way to install the Kudu quickstart on a Windows machine? I am using the Windows operating system and I want to try the integration of Flume and Spark with Kudu. Could you please tell me how to install the Kudu quickstart on Oracle VirtualBox? Thanks
04-10-2017
04:21 AM
Are there any steps to install Kudu on the Cloudera quickstart VM and link it to Impala to create tables?
I have installed Kudu using yum on the Cloudera CDH 5.8 quickstart VM and I am able to create tables using Impala. But when I query such a table I get the error below:
"Kudu features are disabled by startup flag --disable-kudu"
Please suggest.
04-04-2017
02:45 PM
@vperiasamy Yes, I typed a few characters in the database field and it showed up, but for tables it is not showing up.