Member since: 02-24-2016
Posts: 175
Kudos Received: 56
Solutions: 3

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 654 | 06-16-2017 10:40 AM
 | 4705 | 05-27-2016 04:06 PM
 | 807 | 03-17-2016 01:29 PM
08-14-2019
10:39 AM
Hi, We have a 30-node production cluster. We want to add 5 DataNodes for additional storage to handle an interim spike of data (around 2 TB). This data will be stored temporarily, and we want to get rid of it after 15 days. Is it possible to make sure that the incoming interim data (2 TB) is stored only on the newly added DataNodes? I am looking for something similar to YARN node labelling. Regards, SS
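In case it helps others with the same question: HDFS has no direct equivalent of YARN node labels, but one hedged workaround is to tag the disks on the five new DataNodes with the ARCHIVE storage type and pin the interim data's directory to a matching storage policy. This is a sketch only; the paths and the COLD policy choice are illustrative assumptions, and it only works if nothing else in the cluster relies on ARCHIVE storage.

# On each of the 5 new DataNodes only, tag the disks as ARCHIVE storage in
# hdfs-site.xml (dfs.datanode.data.dir), e.g. [ARCHIVE]/grid/0/hadoop/hdfs/data,
# then restart those DataNodes.

# Pin the landing directory for the interim data to ARCHIVE-only placement
# (the path is a placeholder; COLD places all replicas on ARCHIVE storage):
hdfs storagepolicies -setStoragePolicy -path /data/interim -policy COLD

# Verify the policy took effect:
hdfs storagepolicies -getStoragePolicy -path /data/interim

# After the 15 days, drop the data:
hdfs dfs -rm -r -skipTrash /data/interim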
Labels:
- Apache Hadoop
07-09-2018
09:00 PM
Nope @Josh Nicholson
... View more
03-15-2018
04:46 PM
2 Kudos
Hi, I was going through a SmartSense recommendation which suggests enabling "tez.task.scale.memory.enabled". The official Tez documentation describes it as: "Whether to scale down memory requested by each component if the total exceeds the available JVM memory." I am keen to understand: if we enable this auto-scaling of memory for tasks, what are the possible advantages and disadvantages? Thanks for sharing your experience. Regards,
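For anyone who wants to compare behavior before changing it cluster-wide, here is a minimal sketch of toggling the setting per session from beeline (the JDBC URL is a placeholder; the companion reserve-fraction property and its 0.3 default are taken from the Tez docs):

# Scope the change to a single session and compare a representative query
beeline -u "jdbc:hive2://hs2-host:10000/default" -e "
  set tez.task.scale.memory.enabled=true;
  -- fraction of JVM heap held back when Tez scales component requests down
  set tez.task.scale.memory.reserve-fraction=0.3;
  select count(*) from some_db.some_table;
"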
10-26-2017
10:41 AM
2 Kudos
Hi guys, I have installed and built an Anaconda virtual environment on a node outside of the HDP cluster. To use this with Spark, we need to have it on the HDP cluster, and around this I have a couple of questions. 1) Do we need to install Anaconda on all the nodes? We would like to avoid this, as we do not have internet access from the cluster and an Anaconda installation downloads libraries during install; I did not find officially supported offline repos for Linux installations. 2) If we need to distribute the environment by copying it to all the nodes before starting any Spark applications, then when submitting the Spark job from the edge node, how do we make sure the job uses the Anaconda virtual environment? (On a single node it is easy, as we can switch Anaconda environments.) Thanks, SS
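One common pattern, offered here as a hedged sketch rather than HDP-specific guidance: zip the environment once on the build node (built for the same OS/architecture as the cluster), ship it with the job via --archives, and point the Python executables at the unpacked copy. The environment name and paths are assumptions.

# On the build node: package the environment once
cd /opt/anaconda/envs && zip -r /tmp/py_env.zip py_env

# Submit from the edge node; YARN localizes the zip on every node that runs
# the job, so nothing needs to be pre-installed on the cluster
spark-submit \
  --master yarn --deploy-mode cluster \
  --archives /tmp/py_env.zip#PYENV \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./PYENV/py_env/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./PYENV/py_env/bin/python \
  my_job.py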
08-22-2017
08:55 AM
Hi, I understand that the mechanism of Hive (with Tez) and Hive (with MR) is different from traditional RDBMS databases. We have a set of analysts who run "select * from view limit n" kinds of queries many times. Since all our analysts/BI users come from a traditional RDBMS background, they compare the waiting time for an RDBMS query and a Hive query to return results. For example, select top 10 * from db.view on SQL Server, against a much larger dataset, completes in: run 1: 0 seconds; run 2: 0 seconds; run 3: 0 seconds; and so on. Running the same query through Hive over Knox (or even with beeline), SELECT * FROM db.view limit 10 takes much longer: run 1: 36 seconds; run 2: 18 seconds; run 3: 38 seconds; and so on. This is one example of a db/table combination, but it is a common scenario for almost all the tables in a few databases. I tried analyze/compute statistics on the underlying tables these queries run against, but the query times did not change. I understand we are not comparing apples to apples here; this question is more about improving the end-user experience: how best can we help avoid long wait times? (This is on HDP 2.6.x.) Regards, SS
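One knob worth trying for exactly this pattern, sketched per session below (the properties are standard Hive ones; the threshold value is illustrative): hive.fetch.task.conversion lets simple SELECT/LIMIT queries skip launching a Tez DAG and stream rows directly, though a view built on joins or aggregations will still need a DAG.

beeline -u "jdbc:hive2://hs2-host:10000/default" -e "
  -- allow simple scans/limits to bypass Tez and fetch rows directly
  set hive.fetch.task.conversion=more;
  -- raise the input-size ceiling (bytes) under which the conversion applies
  set hive.fetch.task.conversion.threshold=1073741824;
  select * from db.view limit 10;
"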
Labels:
- Apache Hive
07-06-2017
09:07 AM
Thank you @Manish Gupta. This is something we can try by configuring a proxy to load balance HS2. I would also like to understand: are there changes to be made for ZooKeeper? And is there any other way, without using a non-HDP component or external network changes, to achieve load balancing of HS2? Regards,
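For reference, the HDP-native route is usually ZooKeeper service discovery: each HS2 instance registers in ZooKeeper and JDBC clients resolve an instance from the ensemble per connection, which spreads sessions across servers without an external balancer. A connection sketch (hosts and namespace are placeholders):

# Clients connect via the ZooKeeper ensemble instead of a fixed HS2 host;
# the driver picks one of the registered HiveServer2 instances per connection
beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"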
07-05-2017
03:47 PM
1 Kudo
Hi, Using https://knox.apache.org/books/knox-0-9-0/user-guide.html, I have configured a Knox topology for HiveServer2 high availability. I also noticed Dynamic Service Discovery Through ZooKeeper in the documentation. I see that all queries/connections go through only one HiveServer2 instance; if that HS2 instance goes down, connections/queries go through another instance. My question is: on a busy cluster with multiple HS2 servers installed, is it possible to load balance (possibly round robin) so that one server does not get overloaded? If yes, how? Regards, SS
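For context on what Knox itself provides: the HaProvider for the HIVE service (per the Knox user guide linked above) does failover plus ZooKeeper-based discovery rather than strict round robin. A sketch of the provider stanza that goes inside the topology's gateway element, printed here with placeholder ensemble values:

# Hedged sketch of the HaProvider entry for the Knox topology file;
# the zookeeperEnsemble/zookeeperNamespace values are placeholders
cat <<'EOF'
<provider>
  <role>ha</role>
  <name>HaProvider</name>
  <enabled>true</enabled>
  <param>
    <name>HIVE</name>
    <value>maxFailoverAttempts=3;failoverSleep=1000;enabled=true;zookeeperEnsemble=zk1:2181,zk2:2181,zk3:2181;zookeeperNamespace=hiveserver2</value>
  </param>
</provider>
EOF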
Labels:
- Apache Hive
07-03-2017
02:50 PM
Hi @vshukla, the use case is to let end users run their interactive queries in the queues they generally use, from tools like beeline and other JDBC clients.
06-30-2017
10:32 AM
@Sandeep Nemuri Thank you for the confirmation. When can we expect this fix?
06-27-2017
11:35 AM
Using auth=HTTPKerberosAuth() will pass your Kerberos ticket, in my understanding. It is similar to --negotiate in curl.
06-27-2017
10:25 AM
Hi @Javert Kirilov, I was facing this issue when trying to access Livy with Python scripts. If curl is blocking you, please try something like this. You may need to install Python's requests and requests-kerberos packages.

import json
import requests
from requests_kerberos import HTTPKerberosAuth

# Livy endpoint (replace host/port with your values)
host = 'http://LIVY_HOST:LIVY_PORT'

# Ask Livy for a new interactive Spark session
data = {'kind': 'spark'}
headers = {'Requested-By': 'MY_USER_ID', 'Content-Type': 'application/json'}

# Passes the Kerberos ticket from the current ticket cache,
# similar to --negotiate in curl
auth = HTTPKerberosAuth()

r0 = requests.post(host + '/sessions', data=json.dumps(data), headers=headers, auth=auth)
print(r0.json())

Regards, SS
06-27-2017
09:08 AM
This works for us. Thanks.
06-26-2017
03:42 PM
Hi, On HDP 2.6, I have configured the Spark Thrift Server for Spark 1.6.x based on the community wiki, and queries are executed as the end user: when a user connects to the Spark-1 Thrift Server using beeline, I see a new YARN application listed under the Resource Manager running as that end user.

Now I am trying to configure the Spark2 Thrift Server, following the official documentation:
1. Added hive.server2.enable.doAs=true
2. Added spark.jars to the classpath (the DataNucleus jars)
3. Set spark.master to local
4. Restarted the Spark2 Thrift Server

In my understanding, now:
1. Queries should run as the end user (they are still running as the hive user).
2. With spark.master=local, the per-user application should be listed under the Resource Manager UI as the end user (I do not see it listed).
3. When all JDBC connections to STS are closed, the STS application should disappear, since STS is started in local mode and, for each user/connection if not shared, queries are executed by a Spark Application Master launched on behalf of the end user.

None of the three holds for the Spark2 Thrift Server (but with impersonation support in the Spark-1 Thrift Server, all three work as expected). Attaching screenshots of the anomalies; I am not sure if I missed something. 1. Queries still run as hive. 2. STS is not listed under the Resource Manager. 3. The Spark2 Thrift Server still runs as the hive user. Thanks in advance. Regards, SS. Any inputs? @cdraper, @amcbarnett, @Ana Gillan?
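For anyone debugging the same setup, a small verification sketch (host, port, and user are placeholders; port 10016 is a common Spark2 STS default on HDP but may differ on your cluster):

# Open a JDBC session to the Spark2 Thrift Server as a normal end user
beeline -u "jdbc:hive2://sts-host:10016/default" -n enduser -e "select 1"

# While the session is open, check which user owns the running YARN applications
yarn application -list -appStates RUNNING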
Labels:
- Apache Spark
06-26-2017
11:33 AM
Hi, I have configured the Spark Thrift Server with impersonation, so queries now run as the end user. "spark.yarn.queue" under "Advanced spark2-thrift-sparkconf" is configured to point to a certain queue (say Test_Q). In my understanding, with impersonation Spark spawns a new container and an STS instance to serve each user. Now I see a single queue (Test_Q) used for all users. We would like end users to be able to override the queue at run time rather than share the one queue configured for the Spark Thrift Server. If I am correct, we are looking for a property that can override the value of spark.yarn.queue under spark-thrift-sparkconf.conf, something like --queue thequeue for spark-submit.
Thanks,
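A hedged idea to experiment with, not something I have verified on HDP: HiveServer2-style JDBC URLs accept session configuration after a '?', so if the impersonated per-user STS application picks up the session conf at launch, a queue override might look like the following (host, port, and queue name are placeholders):

# Attempt to pass the queue as a session conf at connect time
beeline -u "jdbc:hive2://sts-host:10016/default?spark.yarn.queue=analyst_q" -n enduser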
Labels:
- Apache Spark
06-22-2017
06:11 AM
Hi @Bala Vignesh N V, I have a similar issue. I have done the above settings, but they do not help. I have posted a question on HCC: https://community.hortonworks.com/questions/109365/controlling-number-of-small-files-while-inserting.html.
06-22-2017
05:51 AM
Hi, We do "insert into 'target_table' select a,b,c from x where .." kind of queries for a nightly load. This insert goes in a new partition of the target_table. Now the concern is : this inserts load hardly any data ( I would say less than 128 MB per day) but 1200 files. Each file in few KiloBytes. This is slowing down the performance. How can we make sure, this load does not generate lot of small files? I have already set : hive.merge.mapfiles and hive.merge.mapredfiles to true in custom/advanced hive-site.xml. But still the load job loads data with 1200 small files. I know why 1200 is, this is the value of maximum number of reducers/containers available in one of the hive-sites. (I do not think its a good idea to do cluster wide setting, as this can affect other jobs which can use cluster when it has free containers) What could be other way/settings, so that the hive insert do not take 1200 slots and generate lots of small files? I also have another question which is partly contrary to above : (This is relatively less important) When I reload this table by creating another table by doing select on target table, this newly created table does not contain too many small files. What could be the reason?
Labels:
- Apache Hive
- Apache Tez
06-16-2017
10:40 AM
Well, this worked as-is in the North Virginia region! Earlier I was using a different region.
06-16-2017
08:45 AM
Hi @anatva, it does spawn a Tez job:
> select * from DATABASENAME.TABLE_NAME limit 10;
INFO : Session is already open
INFO : Dag name: select * from DATABASENAME.TAB...10(Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id application_1496614688621_2617)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 33 33 0 0 0 0
Reducer 2 ...... SUCCEEDED 227 227 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 17.63 s
--------------------------------------------------------------------------------
Here are the answers.
1. Are you using beeline? If yes, HiveServer2 could be busy at times.
Yes, plus other tools which connect over JDBC through Knox.
2. Are you querying a table or a view?
A view.
3. What is the amount of memory on the edge node?
256 GB
06-16-2017
08:35 AM
Adding experts 🙂 @Jonas Straub, @cdraper
... View more
06-15-2017
11:09 AM
Hi all, when I run this query repeatedly on the same cluster, the time taken varies wildly (this is Hive on Tez): select * from database.tableName limit 10; The time to run it ranges from 2 seconds to 10 minutes! If one run took 2 minutes and the next took 2 minutes 20 seconds, that would still be fine, but 2 seconds to 10 minutes is not. What could be the possible reasons? How can we make sure the runs take similar time? Regards, SS
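One hypothesis worth testing, offered as a sketch with standard Hive property names: the fast runs may hit an already-warm Tez session while the slow runs wait on YARN container allocation, in which case pre-warming containers narrows the spread (the JDBC URL and container count are placeholders):

beeline -u "jdbc:hive2://hs2-host:10000/default" -e "
  -- hold a few Tez containers ready so the first DAG does not wait on YARN
  set hive.prewarm.enabled=true;
  set hive.prewarm.numcontainers=10;
  select * from database.tableName limit 10;
"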
Labels:
- Apache Hive
- Apache Tez
- Apache YARN
06-14-2017
01:38 PM
Hi, here are some stats before the question, from Ambari -> YARN:
- Available containers: 1000+
- Running YARN applications: 22 (of which 18 are Hive on Tez)
- Allocated containers: 780
- Containers pending allocation: 12000+
My question is: when queues are allowed to burst up to 100% of cluster capacity, why would the Resource Manager not allocate containers to the pending/running jobs? What could be the reason for the huge number of pending allocations? Is it that Tez calculates the number of containers required to complete a job and adds that to the pending count, which then gets allocated gradually as needed (for example, a reduce stage's 3000 containers being blocked on the map stage)? Can anyone enlighten me, please?
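For inspecting where the allocations stand, a couple of standard YARN CLI checks (the queue name is a placeholder):

# Capacity, used capacity, and active applications for one queue
yarn queue -status your_queue_name

# Running applications and their progress, to see which jobs hold allocations
yarn application -list -appStates RUNNING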
Labels:
- Apache Hive
- Apache Tez
- Apache YARN
06-13-2017
06:39 AM
Thanks Carl @cdraper. I enabled EhCache and enabled logging for EhCache. In our use case we use Knox only for Hive, and as we discussed, Hive queries do not go through re-authentication during a session, so I do not think we will get any benefit from enabling the cache. What are your views on this? Regards, SS
06-12-2017
01:04 PM
Thanks @pdarvasi
06-12-2017
01:02 PM
Thanks Carl @cdraper, I will try this. An additional question: suppose an end user issues a connect command from a JDBC client like beeline. For example: 1) start beeline; 2) enter the !connect string with Knox:port; 3) enter the AD username/password; 4) AD authenticates OK; 5) the user submits queries; 6) more queries. In the default case, without enabling EhCache, does further AD authentication happen for steps 5, 6, and onwards? Or, since it is all part of the same session, does it not need to re-authenticate? I am wondering what percentage of AD round trips can be avoided on a busy production cluster with a cache timeout of 2 minutes. Thanks,
06-12-2017
10:51 AM
Tagging SME @Kevin Minder
06-12-2017
09:00 AM
Hi, I was going through the HDP documentation on enabling caching for Knox LDAP authentication: https://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.0/bk_dataflow-security/content/ldap_authentication_caching.html What is the default cache expiry time, and how can I reduce or increase it? Regards, SS
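For reference, a hedged sketch of where this knob usually lives: the caching in that document is Shiro's EhCache-backed realm cache, so expiry is governed by the cache's time-to-live in an EhCache configuration. The cache name below is an assumption and the 120-second value is illustrative, not the documented default:

# Illustrative EhCache fragment controlling authentication-cache expiry
cat <<'EOF'
<ehcache>
  <!-- cache name is an assumption; match it to your realm's authentication cache -->
  <cache name="org.apache.shiro.realm.ldap.JndiLdapRealm.authenticationCache"
         maxEntriesLocalHeap="1000"
         timeToLiveSeconds="120"/>
</ehcache>
EOF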
Labels:
- Apache Knox
06-06-2017
08:36 AM
Hi, In our production cluster, with impersonation disabled, we have at least 5000 queries run on a daily basis. I suspect that a few queries which are part of batch jobs (1000+) are eating up a lot of cluster resources, possibly because they are written poorly. How do I find those possibly 'resource hungry' queries? Regards, SS
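One way to start narrowing this down without extra tooling (the RM host is a placeholder): the YARN Resource Manager REST API reports per-application memorySeconds and vcoreSeconds, which can be sorted to surface the heaviest jobs:

# Pull finished applications from the RM REST API and rank by memory-seconds
curl -s "http://rm-host:8088/ws/v1/cluster/apps?states=FINISHED" \
  | python -c '
import json, sys
apps = json.load(sys.stdin)["apps"]["app"]
for a in sorted(apps, key=lambda a: a.get("memorySeconds", 0), reverse=True)[:20]:
    print(a["memorySeconds"], a["user"], a["name"])
'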
Labels:
- Apache Hive
- Apache YARN
- Cloudera Manager
04-26-2017
08:32 PM
Hi @William Gonzalez, I cleared the exam on 26/March/2017 but have not received any communication from Hortonworks about the badge. After that, I wrote and cleared HDPCA on 23/April; for HDPCA I got the digital badge, but not for HCA. I wrote 4 emails to certification at hortonworks dot com and got ticket numbers from Zendesk, but unfortunately I have not received any response. Kindly help. Best regards.
04-19-2017
09:02 PM
Hi Gurus, Following the Hortonworks documentation: https://2xbbhjxc6wk3v21p62t8n4d4-wpengine.netdna-ssl.com/wp-content/uploads/2015/04/HDPCA-PracticeExamGuide.pdf, I selected the HDPCA AMI and the c3.4xlarge instance type, and created a security group allowing incoming traffic from all addresses on ports 5901, 9999, and 8888 (the last two are not in the documentation, but I wanted to make sure my instance runs). Open ports for incoming traffic:
- 22 tcp 0.0.0.0/0, ::/0
- 5901 tcp 0.0.0.0/0, ::/0
- 8888 tcp 0.0.0.0/0, ::/0
- 9999 tcp 0.0.0.0/0, ::/0
Now, as per the instructions, I am trying to connect to the instance using VNC Viewer. I copy-paste the DNS name/IP from the instance's public DNS/IP columns and use DNSName:5901 or DNSName:9999 or IP:5901, etc. It does not work; every time I see: "Cannot establish connection. Are you sure you have entered the correct network address, and port number if necessary?"
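A quick way to separate a security-group problem from a VNC-server problem (substitute your instance's public DNS; the hostname below is a placeholder):

# If this fails, the security group or network path is the problem; if it
# succeeds but VNC Viewer still cannot connect, the VNC server on the
# instance is likely not running on display :1
nc -vz ec2-XX-XX-XX-XX.compute-1.amazonaws.com 5901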
Tags:
- Hadoop Core
- hdpca
Labels:
- Security