Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5199 | 08-12-2016 01:02 PM
 | 2153 | 08-08-2016 10:00 AM
 | 2531 | 08-03-2016 04:44 PM
 | 5349 | 08-03-2016 02:53 PM
 | 1373 | 08-01-2016 02:38 PM
10-29-2019
02:04 PM
hive -e 'select col1,col2 from schema.your_table_name' --hiveconf tez.queue.name=YOUR_QUEUE_NAME > /yourdir/subdir/my_sample_output.csv
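The Hive CLI separates output columns with tabs, so if you need actual commas in that .csv you can translate them on the way out. A minimal sketch, using the same hypothetical query, queue name, and output path as above:

```bash
# Hive CLI output is tab-separated by default; translate tabs to commas before writing the file.
# Query, queue name, and paths are the hypothetical ones from the command above.
hive -e 'select col1,col2 from schema.your_table_name' \
  --hiveconf tez.queue.name=YOUR_QUEUE_NAME \
  | sed 's/\t/,/g' > /yourdir/subdir/my_sample_output.csv
```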
10-04-2019
02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). In my company (a telco) we're planning to use 2 hot clusters with dual ingest, because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it to HDFS, and consume it with Hive/Impala. As a first approach I'm thinking of running simple counts against the Hive/Impala tables on both clusters every hour or half hour and comparing the results. If something is missing on one of the clusters, we will have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering whether you have dealt with similar scenarios, or what suggestions you may have. Thanks in advance!
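Something like this is what I have in mind for the hourly check (hostnames, table name, and the hourly partition column below are hypothetical):

```bash
#!/bin/bash
# Rough sketch: compare the row count of one table/partition on both clusters.
# Hostnames, table name, and the partition column are hypothetical.
TABLE="mydb.events"
PART=$(date -d '1 hour ago' +%Y%m%d%H)   # assumes hourly partitions keyed like this
QUERY="SELECT COUNT(*) FROM ${TABLE} WHERE part_hour='${PART}'"

run_count() {   # $1 = HiveServer2 host of one cluster
  beeline -u "jdbc:hive2://$1:10000" --silent=true --showHeader=false \
          --outputformat=tsv2 -e "$QUERY" | tail -1
}

C1=$(run_count cluster1-hs2.example.com)
C2=$(run_count cluster2-hs2.example.com)

if [ "$C1" != "$C2" ]; then
  echo "MISMATCH on ${TABLE} partition ${PART}: cluster1=${C1} cluster2=${C2}"
  # this is where we would trigger the re-ingest / BDR copy and re-processing
fi
```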
08-01-2016
06:23 PM
I should have read the post a little closer; I thought you were doing a groupByKey. You are correct: you need to use groupBy to keep the execution within the DataFrame and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, then do an inner join first, and then do your left join on the result. That result will most likely be broadcast to do the left join. This is a pattern that Holden described at Strata this year in one of her sessions.
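A rough sketch of that pattern (table and column names are hypothetical), written in Spark SQL so the whole plan stays inside the engine rather than dropping into Python:

```bash
# Inner join first to shrink the large right-hand side, then left join against the
# (now small) result, which Spark will usually broadcast if it is below
# spark.sql.autoBroadcastJoinThreshold. Table and column names are hypothetical.
spark-sql -e '
WITH matched AS (
  SELECT l.id, r.extra_col
  FROM left_tbl l
  JOIN big_right_tbl r ON l.id = r.id
)
SELECT l.*, m.extra_col
FROM left_tbl l
LEFT JOIN matched m ON l.id = m.id
'
```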
11-04-2017
12:19 PM
Hi @Jeff Watson. You are correct about SAS's use of String datatypes. Good catch! One of my customers also had to deal with this. String datatype conversions can perform very poorly in SAS. With SAS/ACCESS to Hadoop you can set the libname option DBMAX_TEXT (added with the SAS 9.4m1 release) to globally restrict the character length of all columns read into SAS. However, for restricting column size, SAS specifically recommends using the VARCHAR datatype in Hive whenever possible. http://support.sas.com/documentation/cdl/en/acreldb/67473/HTML/default/viewer.htm#n1aqglg4ftdj04n1eyvh2l3367ql.htm

Use case - large table, all columns of type String: Table A stored in Hive has 40 columns, all of type String, with 500M rows. By default, SAS/ACCESS converts String to $32K, i.e. 32K in length per character column. The math for a table this size yields a 1.2MB row length x 500M rows, which brings the system to a halt - too large to store in LASR or WORK. The following techniques can be used to work around the challenge in SAS, and they all work:

- Use char and varchar in Hive instead of String.
- Set the libname option DBMAX_TEXT to globally restrict the character length of all columns read in.
- In Hive, use "SET TBLPROPERTIES" to add SASFMT formats for SAS on the schema in Hive (see the sketch below).
- Add formatting to the SAS code during inbound reads, for example: Sequence Length 8 Informat 10. Format 10.

I hope this helps.
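A rough sketch of the TBLPROPERTIES technique (table and column names are hypothetical; check the SAS/ACCESS to Hadoop documentation for the exact format strings):

```bash
# Hedged sketch: attach an SASFMT table property so SAS/ACCESS reads this STRING column
# as CHAR(100) instead of the default 32K. Table and column names are hypothetical.
hive -e "ALTER TABLE mydb.table_a SET TBLPROPERTIES ('SASFMT:customer_name'='CHAR(100)');"
```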
08-12-2016
01:02 PM
1 Kudo
"If it runs in the Appmaster, what exactly are "the computed input splits" that jobclient stores into HDFS while submitting the Job ??" InputSplits are simply the work assignments of a mapper. I.e. you have the inputfolder /in/file1
/in/file2 And assume file1 has 200MB and file2 100MB ( default block size 128MB ) So the InputFormat per default will generate 3 input splits ( on the appmaster its a function of InputFormat) InputSplit1: /in/file1:0:128000000
InputSplit2: /in/file1:128000001:200000000
InputSplit3:/in/file2:0:100000000 ( per default one split = 1 block but he COULD do whatever he wants. He does this for example for small files where he uses MultiFileInputSplits which span multiple files ) "And how map works if the split spans over data blocks in two different data nodes??" So the mapper comes up ( normally locally to the block ) and starts reading the file with the offset provided. HDFS by definition is global and if you read non local parts of a file he will read it over the network but local is obviously more efficient. But he COULD read anything. The HDFS API makes it transparent. So NORMALLY the InputSplit generation will be done in a way that this does not happen. So data can be read locally but its not a necessary precondition. Often maps are non local ( you can see that in the resource manager ) and then he can simply read the data over the network. The API call is identical. Reading an HDFS file in Java is the same as reading a local file. Its just an extension to the Java FileSystem API.
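If you want to see the physical block layout that split generation starts from, you can ask HDFS directly (using the example folder from above):

```bash
# Show files, block boundaries, and the datanodes holding each block for the input folder;
# this physical layout is what InputSplit generation works from.
hdfs fsck /in -files -blocks -locations
```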
07-23-2016
07:57 PM
I've seen it in versions as far back as 2.1; that's why I was surprised it was missing in my install.
07-22-2016
10:10 AM
4 Kudos
It is a Tez application. They stay around for a while to wait for new DAGs (execution graphs); otherwise you would need to create a new session for every query, which adds around 20s to your query time. It's configured here (normally a couple of minutes): tez.session.am.dag.submit.timeout.secs
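To check what your cluster is actually using, you can print the property from inside a Hive session (the cluster-wide default lives in tez-site.xml):

```bash
# Print the effective Tez session idle timeout, in seconds (property name as referenced above).
hive -e "set tez.session.am.dag.submit.timeout.secs;"
```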
07-26-2016
12:58 PM
Very neatly explained!
11-17-2017
05:54 PM
I had the same problem and I also solved it by adding the user to all controller nodes. I run a script to add them from one node to all of them:

______________________________________________________________________________________________

#!/bin/bash
# Run a command (passed as $1) on every host listed in $SERVERS, collecting the output.
# Linux/UNIX box with ssh key based login
SERVERS=/root/hadoop_hosts
# SSH user name
USR="root"
# Email
SUBJECT="Server user login report"
EMAIL="your_e-mail@here"
EMAILMESSAGE="/tmp/sshpool_`date +%Y%m%d-%H:%M`.txt"
# create new file
> $EMAILMESSAGE
# connect to each host and pull up user listing
for host in `cat $SERVERS`
do
  echo "--------------------------------" >> $EMAILMESSAGE
  echo "* HOST: $host " >> $EMAILMESSAGE
  echo "--------------------------------" >> $EMAILMESSAGE
  ###ssh $USR@$host w >> $EMAILMESSAGE
  ssh -tq -o "BatchMode yes" $USR@$host $1 >> $EMAILMESSAGE
done
# send an email using /bin/mail
######/bin/mailx -s "$SUBJECT" "$EMAIL" < $EMAILMESSAGE
echo ">>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<"
echo ">>> check the output file " $EMAILMESSAGE

______________________________________________________________________________________________

Put the servers' DNS names into /root/hadoop_hosts. Also, in Linux there is a good command called pssh to run commands in parallel on computer clusters 🙂
07-13-2016
11:07 AM
1 Kudo
My first question would be: why do you want to do that? If you want to manage your cluster, you would normally install something like pssh, Ansible, or Puppet and use that to manage the cluster. You can put that on one control node, define a list of servers, and move data or execute scripts on all of them at the same time. You can do something very simple like that with a one-line ssh program.

To execute a script on all nodes:

for i in server1 server2; do echo $i; ssh $i $1; done

To copy files to all nodes:

for i in server1 server2; do scp $1 $i:$2; done

[Both need keyless ssh from the control node to the cluster nodes.]

If on the other hand you want this for job dependencies, something like the distributed MapReduce cache is normally a good idea. Oozie provides the <file> tag to upload files from HDFS to the execution directory of the job. So honestly, if you go into more detail about what you ACTUALLY want, we might be able to help more.