Member since: 09-26-2017 | Posts: 24 | Kudos Received: 0 | Solutions: 0
05-30-2018
04:56 PM
Hello, I have a non-partitioned table with tons of small files. I ran an "ALTER TABLE ... CONCATENATE" and it works, but not the way I expected: from tons of small files (let's say 10,000) I get 1,000; I run it again and get 100, and it eventually converges to 5 files. I use a lot of SET instructions, but I think I used them without understanding their purpose. So, is there a way to concatenate a table into a single file in one shot? Thanks 🙂 SF
Labels: Apache Hive
04-04-2018
07:26 AM
Hi @Joy Ndjama, Awesome! Exactly what I was expecting. Even if it is quite expensive, it is an elegant way to get a true sample. Thanks to @Scott Shaw as well; TABLESAMPLE is a very interesting feature too.
03-27-2018
05:47 PM
Hi, I was wondering if there is a way to perform a "local limit" in a Hive query. Let me explain: consider a query that uses "distribute by" on a partition column "X". This column has 30 distinct values and I want exactly 100 rows per value. When we use "limit", it generally stops the sink operation at the n-th row, so usually only one partition is covered. For building samples, I think it would be very helpful if reducers (or mappers) could be "limited" locally. I hope it is clear 🙂 Thanks for your replies. SF
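To make the "local limit" idea concrete, here is a minimal Python sketch of the semantics being asked for: keep at most n rows for each distinct value of the grouping column. It only illustrates the expected result and is not how Hive executes anything; the function name limit_per_group and the sample rows are hypothetical. In SQL engines the same result is usually expressed with a window function such as ROW_NUMBER() partitioned by the column.

```python
from collections import defaultdict


def limit_per_group(rows, key, n):
    """Keep at most n rows for each distinct key value, in arrival order."""
    counts = defaultdict(int)  # rows kept so far, per key value
    kept = []
    for row in rows:
        k = key(row)
        if counts[k] < n:
            counts[k] += 1
            kept.append(row)
    return kept


# Hypothetical data: 3 values of column "x", 5 rows each.
rows = [{"x": x, "id": i} for x in ("a", "b", "c") for i in range(5)]
sample = limit_per_group(rows, key=lambda r: r["x"], n=2)
print(len(sample))  # 6: two rows for each of the three values
```

A global limit of 6 on the same input would return rows from the first value only, which is the behaviour described above.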
Labels: Apache Hive
01-06-2018
08:39 AM
Hi @Gunther Hagleitner; thanks, your explanations make it very clear.
01-03-2018
02:36 PM
Hi, I know that Tez avoids storing intermediate results in HDFS (unlike MapReduce, which does), but I was wondering where they are stored then. I read "in memory" and "on local disk", but what if the task that emits the intermediate results is not on the same node as the task that receives them? Is it then just network I/O, streaming data from memory and/or local disk, instead of HDFS reads and writes? Thanks for your help 🙂
Labels: Apache Hadoop, Apache Tez
01-01-2018
11:52 AM
Thanks @Bala Vignesh N V; it helps 🙂
12-27-2017
04:37 PM
Hello, While the concept of MapReduce is pretty clear in my mind, I can't say the same for Tez. MapReduce does its work through Map > Partition, Sort, Shuffle > Reduce, and I know each of these phases well. But for Tez, and more precisely between two vertices (say a Map vertex and a Reduce vertex), how does it work? Is there a built-in "partition, sort, shuffle" as in MR? Or is it up to us to manage this internal logic (I read a word count example and it seems so, but I would rather be sure)? Thanks!
Labels: Apache Hadoop, Apache Tez
12-18-2017
08:46 AM
Thanks @Gunther Hagleitner for your answer. But how does Hive know that the data in both tables is sorted? When I join two tables that are both unsorted, the EXPLAIN command tells me that Hive will perform a "Merge Join" too. I know that an SMB (Sort Merge Bucket) join will improve a join, but it requires bucketed tables.
12-17-2017
04:43 PM
Thanks a lot @bkosaraju, it really helps me 🙂
12-17-2017
04:42 PM
Hi, In many articles about Hive query optimization I read the advice to presort tables in order to optimize joins. I also read, in other articles, that the sort algorithm used during the shuffle is QuickSort or a derivative. So I am a little confused: is QuickSort faster when its input is an array composed of two sorted arrays? Thanks.
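As a point of comparison for this question, here is a minimal Python sketch showing that two inputs that are already sorted can be combined in a single linear pass, with no general-purpose sort at all. This is only an illustration of the merge technique; it makes no claim about what Hive or the shuffle actually does internally, and the function name merge_sorted and the sample lists are hypothetical.

```python
def merge_sorted(left, right):
    """Merge two already-sorted lists in one linear pass."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller head element of the two lists.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


a = [1, 4, 7, 10]
b = [2, 3, 8, 9]
print(merge_sorted(a, b))  # [1, 2, 3, 4, 7, 8, 9, 10]
```

The merge touches each element once, so its cost grows linearly with the input size, whereas sorting the concatenated array from scratch costs O(n log n) on average with QuickSort.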
Labels: Apache Hive
12-11-2017
06:24 PM
Hi, I would like to query a table partitioned by YEAR, MONTH, DAY, but instead of writing "WHERE YEAR='2017' AND MONTH='12' AND DAY='11'", I would like to join this table to a table that contains the YEAR, MONTH and DAY fields: SELECT * FROM mypartitionedtable t1 INNER JOIN currentpartitiontable t2 ON t1.YEAR=t2.YEAR etc. But when I run an EXPLAIN EXTENDED, I see that the analyzer will fetch every partition. Is there something I missed? Thanks 🙂
Labels: Apache Hive
11-27-2017
08:35 AM
Hi, Say I have a Hive query that involves a Map phase and a Reduce phase. Is there a way to get the output file of the Map phase before it is processed by the Reduce phase? That would let me understand which phase does what (and then optimize the query). Thanks.
Labels: Apache Hive, Apache Tez
11-09-2017
09:55 AM
Thanks, it helps.
Before OVERWRITE:
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0
During OVERWRITE:
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 2 items
0 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/.hive-staging_hive_2017-11-09_10-35-38_682_2619781700846007196-1
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0
After OVERWRITE:
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/000000_0
What I understand is that a query involving this file that has been running since, say, 10:15 and is still executing at 10:35 is not guaranteed to execute correctly (though I can presume that the file, especially since it is small here, will already have been processed in a first stage of the M/R process). Is that so? I am wondering whether OVERWRITE is a good way to build an intermediate table in this case. Without the LOCK feature enabled, can you suggest a better way?
11-07-2017
10:24 AM
Hi, In my organization, Hive is used with hive.support.concurrency set to false. I am wondering what the consequences are of inserting data during a select (and vice versa). On insert, I think the table's metadata is updated at the very end of the Map/Reduce job. Thus a select should not be disturbed, because I think the files involved in the select are determined at the very beginning of the M/R job. For an insert overwrite, I think it is pretty similar, but I didn't find confirmation during my research. Could you validate (or not ;)) my thoughts? Thanks 🙂
Labels: Apache Hive
11-02-2017
04:04 PM
...and I have always wondered how benchmarks are performed: is it just timing an execution on a "clean" platform?
11-02-2017
03:42 PM
I have read a lot of articles advising on the fastest solutions for computing datasets. I saw that Hive / Tez is 100x faster than Hive / MapReduce, but Spark is 100x faster than Hive (Tez or MR not mentioned ;-)), and finally, "it depends on whether you compute huge datasets or not". My first question is: from what size can I consider a dataset "huge"? I presume the number of rows and columns matters. My second question is: what if I am querying a few partitions of a large dataset? I think that amounts to querying a small dataset?
Labels: Apache Hive, Apache Spark
10-09-2017
07:24 AM
thanks 🙂
10-04-2017
04:39 PM
Maybe it is obvious, but I was wondering: when we declare a dataset based on the date ($YEAR/$MONTH/$DAY/data for example) as an output-events, and it is used by an input-events whose "instance" looks at current(0): is the dated directory name used directly to check the input event, or is there a kind of database that registers it inside Oozie? In other words, if we don't declare the output-events and create the "right" directory ourselves, will it still work?
Labels: Apache Oozie
10-02-2017
01:29 PM
Thanks a lot for your reply. But through JMX, will I be able to monitor a particular query, or just the global activity of the JVM?
10-02-2017
09:55 AM
Hi, I would like to know if there is a way to get metrics on CPU and memory usage. For example, I would like to highlight the effect of a skew join on reducers, disk I/O, mapper memory usage during the query, etc. I saw some really interesting slides from Hortonworks on performance comparison, with graphs and bar charts, and I was wondering how to obtain those values. Thanks.
Labels: Apache Hive