Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2085 | 08-12-2016 01:02 PM |
| | 1233 | 08-08-2016 10:00 AM |
| | 1212 | 08-03-2016 04:44 PM |
| | 2542 | 08-03-2016 02:53 PM |
| | 707 | 08-01-2016 02:38 PM |
05-05-2022
04:36 PM
As a general statement this is not right by any means. LDAP provides secure and encrypted authentication (encrypted user passwords and SSL/TLS communication), together with user/group management. It's just that the Hadoop stack does not support it: the only two authentication methods implemented for all the CDP components are the dummy simple auth (described above) and Kerberos authentication (used in combination with PAM or LDAP for user/group mappings). As an example, nothing less than Knox (the security gateway to HDP or CDP) implements full authentication using only LDAP (with TLS), and it relies on Kerberos only to authenticate a single service/proxy user that communicates with the rest of the cluster.
... View more
04-28-2021
05:00 PM
Frustrating that the link is hidden behind a 'paywall'. I have an account but I am not allowed to view it without contacting sales.
... View more
09-29-2020
05:27 AM
We are facing issues with region availability, and it seems to be due to compactions. We get the exception below when we try to access the region: org.apache.hadoop.hbase.NotServingRegionException: Region is not online. But when we checked the corresponding region server logs, we can see a lot of compactions happening on the table. Does the table become inaccessible during compaction? Is there a way to reduce the number of compactions through some setting?
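For reference (not part of the original question): a couple of hbase shell checks and the settings usually looked at when compaction pressure is suspected; the table name is a placeholder.

# check whether a compaction is running on the table, or kick a major compaction off manually
hbase shell <<'EOF'
compaction_state 'my_table'
major_compact 'my_table'
EOF

# settings commonly tuned in hbase-site.xml (or per table) to control compaction frequency:
#   hbase.hregion.majorcompaction     - interval between automatic major compactions (0 disables them)
#   hbase.hstore.compaction.min       - minimum store files before a minor compaction starts
#   hbase.hstore.blockingStoreFiles   - store-file count at which new writes get blocked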
... View more
08-30-2020
12:39 AM
Pic 1 - Running containers is maxed out at 50. Pic 2 - Free resources. Pic 3 - Tez application in the default queue. I was able to get 4 vCores per container. The number of containers for the Tez application doesn't go beyond 50, even though I have free vCores and memory (Pic 2).
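Not from the original post, but two things commonly checked when a single app stops at a fixed container count despite free resources; the queue is assumed to be default:

# configured vs. current capacity and limits for the queue
yarn queue -status default

# capacity-scheduler settings that often cap a single user/app even with free cluster resources:
#   yarn.scheduler.capacity.root.default.maximum-capacity
#   yarn.scheduler.capacity.root.default.user-limit-factor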
... View more
04-28-2020
07:28 AM
Please check the command below; here 2> /dev/null discards all the log and error output (stderr), while standard output is still written to op.txt:
beeline -u jdbc:hive2://somehost_ip/ -f hive.hql > op.txt 2> /dev/null
If you like this please give me kudos. Thanks!!!
... View more
11-11-2019
12:00 PM
Which menu option in Ambari can I use to check this information?
... View more
10-29-2019
02:04 PM
hive -e 'select col1,col2 from schema.your_table_name' --hiveconf tez.queue.name=YOUR_QUEUE_NAME > /yourdir/subdir/my_sample_output.csv
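One caveat: the hive CLI writes tab-separated output even if the file is named .csv. A hedged alternative, assuming HiveServer2 is reachable, is beeline with a real CSV output format (host is a placeholder):

beeline -u jdbc:hive2://your_hs2_host:10000/ --silent=true --outputformat=csv2 \
  --hiveconf tez.queue.name=YOUR_QUEUE_NAME \
  -e 'select col1, col2 from schema.your_table_name' > /yourdir/subdir/my_sample_output.csv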
... View more
10-04-2019
02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). In my company (a telco) we're planning to use 2 hot clusters with dual ingest, because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it into HDFS and consume it with Hive/Impala. I'm thinking about a first approach of making simple counts on the Hive/Impala tables on both clusters every hour/half hour and comparing them. If something is missing in one of the clusters, we will have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering whether you have dealt with similar scenarios, and what suggestions you may have. Thanks in advance!
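In case it's useful, a very rough sketch of the hourly count-and-compare idea, assuming both clusters expose HiveServer2 and the table is partitioned by an hour column; hosts, database, table, and column names are all placeholders:

#!/bin/bash
HOUR="$1"   # e.g. 2019-10-04-14
Q="select count(*) from mydb.events where hour_partition='${HOUR}'"

# run the same count on both clusters
C1=$(beeline -u jdbc:hive2://cluster1-hs2:10000/ --silent=true --outputformat=tsv2 --showHeader=false -e "$Q")
C2=$(beeline -u jdbc:hive2://cluster2-hs2:10000/ --silent=true --outputformat=tsv2 --showHeader=false -e "$Q")

if [ "$C1" != "$C2" ]; then
  # alert here, then trigger a re-ingest or a BDR copy of the missing partition
  echo "Mismatch for hour ${HOUR}: cluster1=${C1} cluster2=${C2}"
fi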
... View more
08-01-2016
06:23 PM
I should have read the post a little closer; I thought you were doing a groupByKey. You are correct, you need to use groupBy to keep the execution within the DataFrame and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, then do an inner join first, then do your left join on the result. The result will most likely be broadcast to do the left join. This is a pattern that Holden described at Strata this year in one of her sessions.
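A hedged sketch of that pattern in Spark SQL (table and column names are made up): pre-filter the large right-hand side with an inner join, then left join the much smaller result back.

spark-sql -e "
select l.*, r.extra_col
from small_left l
left join (
  -- inner join first, so only the matching slice of the big table survives
  select r.* from big_right r join small_left l2 on r.key = l2.key
) r on l.key = r.key
"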
... View more
11-04-2017
12:19 PM
Hi @Jeff Watson. You are correct about SAS's use of String datatypes. Good catch! One of my customers also had to deal with this. String datatype conversions can perform very poorly in SAS. With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with the SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS. However, for restricting column size SAS specifically recommends using the VARCHAR datatype in Hive whenever possible. http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1eyvh2l3367ql.htm
Use Case
Large table, all columns of type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows. By default, SAS/ACCESS converts String to $32K, i.e. 32K in length per character column. The math for a table this size yields roughly a 1.2MB row length x 500M rows (on the order of 600 TB). This brings the system to a halt: far too large to store in LASR or WORK. The following techniques can be used to work around the challenge in SAS, and they all work:
- Use char and varchar in Hive instead of String.
- Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
- In Hive, use "SET TBLPROPERTIES SASFMT" to add SAS formats to the schema in Hive.
- Add formatting to the SAS code during inbound reads (example: Sequence Length 8 Informat 10. Format 10.).
I hope this helps.
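To make the Hive-side options concrete, a minimal sketch; the table and column names, and the exact SASFMT property syntax, are assumptions to be checked against the SAS/ACCESS documentation linked above:

beeline -u jdbc:hive2://your_hs2_host:10000/default -e "
  -- Prefer bounded types over STRING so SAS does not default to 32K characters per column:
  create table customer_v (id bigint, name varchar(100), city varchar(50));

  -- For an existing STRING column, hint the intended width to SAS via a table property:
  alter table customer set tblproperties ('SASFMT:name'='CHAR(100)');
"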
... View more
07-23-2016
07:57 PM
I've seen it as far back as 2.1; that's why I was surprised it was missing in my install.
... View more
07-22-2016
10:10 AM
4 Kudos
It is a Tez application. They stay around for a while to wait for new DAGs (execution graphs); otherwise you would need to create a new session for every query, which adds around 20s to your query time. It is configured here (normally a couple of minutes): tez.session.am.dag.submit.timeout.secs
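A quick sketch of how to inspect or override it from the Hive side; the 300-second value is just an example, not a recommendation:

hive -e 'set tez.session.am.dag.submit.timeout.secs;'                        # print the current value
hive --hiveconf tez.session.am.dag.submit.timeout.secs=300 -e 'select 1;'    # override for one invocation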
... View more
07-26-2016
12:58 PM
Very neatly explained!
... View more
11-17-2017
05:54 PM
I had the same problem and I also solved it by adding the user to all controller nodes. Run a script to add them from one node to all:

#!/bin/bash
# Linux/UNIX box with ssh key based login
SERVERS=/root/hadoop_hosts
# SSH user name
USR="root"
# Email
SUBJECT="Server user login report"
EMAIL="your_e-mail@here"
EMAILMESSAGE="/tmp/sshpool_`date +%Y%m%d-%H:%M`.txt"
# create new file
> $EMAILMESSAGE
# connect to each host and run the command passed as the first argument
# (the original "pull up user listing" call is left commented out)
for host in `cat $SERVERS`
do
  echo "--------------------------------" >> $EMAILMESSAGE
  echo "* HOST: $host " >> $EMAILMESSAGE
  echo "--------------------------------" >> $EMAILMESSAGE
  ### ssh $USR@$host w >> $EMAILMESSAGE
  ssh -tq -o "BatchMode yes" $USR@$host $1 >> $EMAILMESSAGE
done
# send an email using /bin/mail
###### /bin/mailx -s "$SUBJECT" "$EMAIL" < $EMAILMESSAGE
echo ">>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<"
echo ">>> check the output file " $EMAILMESSAGE

Put the DNS names of the servers into /root/hadoop_hosts. Also, on Linux there is a good command called pssh to run commands in parallel across computer clusters 🙂
... View more
07-27-2016
11:16 AM
The cluster is fairly small, as it's mostly experimental, but 3 out of the 4 nodes in the cluster each have 4 vCores and 1GB of memory, with a global YARN minimum memory container size of 256MB. So when you say slots, I'm assuming that would translate into potentially 12 slots/containers, i.e. a container representing 1 vCore + 256MB? I had assumed that the resources (CPU/RAM) available in my cluster would be more than enough for the query I'm running on the dataset sizes I'm working with, i.e. 30-40k records.
... View more
07-11-2016
04:28 PM
1 Kudo
I think the majority of people do not use ssh fencing at all. The reason for this is that NameNode HA works fine without it. The only issue can be that during a network partition, old connections to the old standby might still exist and return stale data during read-only operations.
- They cannot do any write transactions, since the JournalNode majority prohibits that.
- Normally, if ZKFC works correctly, an active NameNode will not go into zombie mode; it is either dead or not.
So the chances of a split brain are low and the impact is pretty limited. If you use ssh fencing, the important part is that your script cannot block, otherwise the failover will be stopped; all scripts need to return in a sensible amount of time even if access is not possible. Fencing by definition is always an attempt, since most of the time the node is simply down, and the scripts need to return success in the end. So you need to fork with a timeout and then return true. https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Verifying_automatic_failover
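For illustration only, a minimal non-blocking fence script of the kind described above, assuming it is registered via dfs.ha.fencing.methods as shell(/path/to/fence.sh); target_host is the variable the shell fencer exports (see the linked HA docs), and the kill command is just a placeholder:

#!/bin/bash
# Best-effort fencing: try to stop the old NameNode, but never block the failover.
timeout 10 ssh -o "BatchMode yes" "$target_host" "pkill -u hdfs -f namenode"
# The node may simply be down or unreachable; report success anyway so the
# failover can proceed.
exit 0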
... View more
07-06-2016
12:03 PM
You mean to exclude two columns? That one would definitely work: (id1|id2)?+.+ Your version would say: id1 once or not at all, followed by id2 once or not at all, followed by anything else. So it should work too, I think.
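For context, this is the usual way such a regex column specification is used in Hive to select everything except id1 and id2 (the table name is made up):

hive -e 'set hive.support.quoted.identifiers=none;
select `(id1|id2)?+.+` from my_table;'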
... View more
06-19-2016
03:02 PM
Normally mappers don't fail with OOM, and 8192M is pretty good. I suspect that you either have some big records while reading from the CSV, or you are doing some memory-intensive operation inside the mapper. Could you please share the task log for this attempt: attempt_1466342436828_0001_m_000008_2
... View more
06-16-2016
05:16 PM
That is amazing!
... View more
06-14-2016
09:20 AM
Good you fixed it. I would just read a good Hadoop book and understand the Map/Combine/Shuffle/Reduce process in detail. After that the majority of markers should be pretty self-evident. https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
... View more
06-09-2016
06:45 PM
Rajkumar, have you tried connecting directly with the Hive JDBC driver? I'm suspecting it's a jar conflict somewhere. Here's my Hive driver config in IntelliJ; I obviously took the shotgun approach and added all the client jars, but the main required ones are hive-common and hive-jdbc.
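If it helps narrow things down, a quick way to verify the same connection details outside the IDE (host, port, database, and user are placeholders):

beeline -u "jdbc:hive2://your_hs2_host:10000/default" -n your_user -e "show databases;"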
... View more
05-24-2016
04:28 PM
Yes, you would need to configure user sync with LDAP/AD in the Ranger UI. Alternatively, use UNIX user sync in Ranger to sync with the local operating system. (Works as well.)
... View more
06-09-2016
10:34 AM
Can you give me the top 50 min, max and average? Also, did you try the query? What was the behaviour? The reason I am asking is that if your query runs very long while using only a few reducers, for example, that may imply skew, and so one way to maximize usage of the cluster is to look at surrogate key creation.
... View more
05-16-2016
11:14 AM
1) Yes, you can see the "Tez session was closed ..." message.
2) In anything after HDP 2, Tez is enabled by default. MapReduce might be going away as an option anyway.
3) You can still set the execution engine per query: set hive.execution.engine=mr or tez.
4) Not sure what you mean by utility. The Tez view in Ambari would provide that functionality; I am not completely sure about the out-of-the-box integration with the Resource Manager. https://www.youtube.com/watch?v=xyqct59LxLY
... View more
05-10-2016
05:11 PM
2 Kudos
Here is a great writeup on file compression in Hadoop - http://comphadoop.weebly.com/
... View more
05-10-2016
01:42 PM
Hi Ed, It would be useful to know if you are aiming for HA or performance. Since it is a small cluster you may use it as a POC and not care much about HA, I don't know. One option not mentioned below is going with 3 masters and 3 slaves in a small HA cluster setup. That allows you to balance services on the masters more and/or dedicate one to be mostly an edge node. If security is a topic that may come in handy. Cheers, Christian
... View more
05-07-2016
07:34 PM
1 Kudo
1) You essentially have two options. Use Sqoop import-all-tables with exclude as you mention; however, in that case you have a single Sqoop action in Oozie and no parallelism in Oozie (Sqoop itself might provide some), and you have some limitations (only straight imports of all columns, ...). Alternatively, you make an Oozie flow that uses a fork and then one single-table Sqoop action per table. In that case you have fine-grained control over how much you want to run in parallel. (You could, for example, load 4 at a time by doing Start -> Fork -> 4 Sqoop Actions -> Join -> Fork -> 4 Sqoop Actions -> Join -> End.)
2) If you want incremental load, I don't think Sqoop import-all-tables is possible. So one Sqoop action per table it is. Essentially you can either use Sqoop's incremental import functionality (using a property file) or use WHERE conditions and pass through the date parameter from the coordinator. You can use coord:dateformat to transform your execution date.
3) Run one coordinator for each table, OR have a Decision action in the Oozie workflow that skips some Sqoop actions, like Start -> Sqoop1 where date = mydate -> Decision: if mydate % 3 = 0 then Sqoop2, else End.
4) Incremental imports load the new data into a folder in HDFS. If you re-run, the folder needs to be deleted; if you use append, it doesn't delete the old data in HDFS. Now you may ask why you would ever not want append, and the reason is that you usually do something with the data afterwards, like importing the new data into a partitioned Hive table. If you used append, it would load the same data over and over.
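For what it's worth, a rough sketch of the kind of per-table Sqoop call each action would wrap, with the date handed down from the coordinator; the connection string, credentials path, table, and column names are placeholders:

sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username etl_user --password-file /user/etl/.db_password \
  --table ORDERS \
  --where "last_updated >= '${coordDate}'" \
  --target-dir /data/raw/orders/${coordDate} \
  -m 4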
... View more
04-28-2016
04:57 PM
3 Kudos
You would have to make sure that mapreduce.framework.name is set correctly (yarn, I suppose) and the mapred files are there, but first please verify that your nameNode parameter is set correctly. HDFS is very exact about it and requires the hdfs:// in front, so hdfs://namenode:8020 instead of namenode:8020.
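A sketch of what the relevant job.properties entries usually look like; the host names and ports here are placeholders, not values from your cluster:

cat > job.properties <<'EOF'
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=default
EOF
# mapreduce.framework.name=yarn belongs in the cluster/action configuration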
... View more
12-08-2017
03:45 AM
@Joseph Niemiec How can I do this command "select * from table where date <= '2017-12-08'" in nested-partition form, in the case where the table is partitioned by day, month, year?
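Not an answer from the thread, just a sketch of how that filter is usually rewritten against year/month/day partition columns so partition pruning can kick in (the table name, column names, and integer types are assumptions):

hive -e "
select *
from your_table
where (year < 2017)
   or (year = 2017 and month < 12)
   or (year = 2017 and month = 12 and day <= 8);
"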
... View more