About sebastien_frack

sebastien_frack · ‎04-04-2018

Hi @Joy Ndjama, awesome! Exactly what I was expecting. Even if it is quite expensive, it is an elegant way to get a true sample. Thanks @Scott Shaw as well, TABLESAMPLE is a very interesting functionality too.

sebastien_frack · ‎03-27-2018

Hi, I was wondering whether there is a way to perform a "local limit" in a Hive query. Let me explain: consider a query that uses "distribute by" on a partition column "X". This column has 30 distinct values and I want exactly 100 rows per value. When we use "limit", it generally stops the sink operation at the n-th row, so usually only one partition ends up in the result. For building samples, I think it would be very helpful if reducers (or mappers) could be "limited" locally. I hope that is clear 🙂 Thanks for your replies. SF
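To make the requested semantics concrete, here is a minimal Python sketch of a "local limit": at most n rows are kept for each distinct value of the distribution key, instead of n rows overall. The function name `local_limit` and the toy rows are hypothetical and only illustrate the behaviour; this is not necessarily the solution proposed in the thread. In HiveQL, the same effect is commonly obtained by numbering rows with `ROW_NUMBER() OVER (PARTITION BY x ORDER BY rand())` in a subquery and keeping the rows whose number is at most 100.

```python
from collections import defaultdict


def local_limit(rows, key, n):
    """Keep at most n rows for each distinct value of `key`, in input order."""
    kept = []
    seen = defaultdict(int)  # rows kept so far, per key value
    for row in rows:
        value = row[key]
        if seen[value] < n:
            seen[value] += 1
            kept.append(row)
    return kept


# Toy data (hypothetical): 3 distinct values of "x", 5 rows each.
rows = [{"x": f"v{i % 3}", "id": i} for i in range(15)]

# A global limit would simply be rows[:2]; the local limit keeps 2 rows per value.
sample = local_limit(rows, key="x", n=2)
```

With 30 values and n=100, as in the question, the result would contain up to 3000 rows, fewer only if some value has less than 100 rows.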

sebastien_frack · ‎01-06-2018

Hi @Gunther Hagleitner, thanks, your explanation makes it very clear.

sebastien_frack · ‎01-03-2018

Hi, I know that Tez avoids storing intermediate results in HDFS (unlike MapReduce, which does), but I was wondering where they are stored instead. I have read "in memory" and "on local disk", but what if the task that emits the intermediate results is not on the same node as the task that receives them? So, is it just network I/O, streaming data from memory and/or local disk, instead of HDFS reads and writes? Thanks for your help 🙂

sebastien_frack · ‎01-01-2018

Thanks @Bala Vignesh N V, it helps 🙂

sebastien_frack · ‎12-27-2017

Hello, while the concept of MapReduce is pretty clear in my mind, I can't say the same for Tez. MapReduce does its work through Map > Partition, Sort, Shuffle > Reduce, and I know each of these phases well. But for Tez, and more precisely between two vertices (say a Map vertex and a Reduce vertex), how does it work? Is there a built-in "partition, sort, shuffle" like in MR? Or is it up to us to manage this internal logic (I read a word count example and it seems so, but I would rather be sure)? Thanks!

sebastien_frack · ‎12-17-2017

Thanks a lot @bkosaraju, it really helps me 🙂

sebastien_frack · ‎12-11-2017

Hi, I would like to query a partitioned table (partitioned by YEAR, MONTH, DAY), but instead of writing "WHERE YEAR='2017' AND MONTH='12' AND DAY='11'", I would like to join this table to a table that contains the fields YEAR, MONTH and DAY. SELECT * FROM mypartitionedtable t1 INNER JOIN currentpartitiontable t2 ON t1.YEAR=t2.YEAR etc. etc. But when I run an EXPLAIN EXTENDED, I see that the analyzer will fetch every partition. Is there something I missed? Thanks 🙂
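One possible workaround, sketched here in Python, assumes a driver script can first read the single row of currentpartitiontable and then issue the main query with literal predicates, so that the partition filter is made of constants. The helper `partition_predicate` and the `current` dictionary are hypothetical names introduced for illustration; this is not necessarily what was recommended in the thread.

```python
def partition_predicate(values):
    """Build a static WHERE fragment from {partition column: value}.

    Only digit-only values are accepted, which is an assumption that fits
    YEAR/MONTH/DAY partitions and avoids any quoting issue.
    """
    parts = []
    for column, value in values.items():
        text = str(value)
        if not text.isdigit():
            raise ValueError(f"unexpected partition value for {column}: {text!r}")
        parts.append(f"{column}='{text}'")
    return " AND ".join(parts)


# Values assumed to have been fetched beforehand from currentpartitiontable.
current = {"YEAR": "2017", "MONTH": "12", "DAY": "11"}

query = f"SELECT * FROM mypartitionedtable WHERE {partition_predicate(current)}"
print(query)
# prints: SELECT * FROM mypartitionedtable WHERE YEAR='2017' AND MONTH='12' AND DAY='11'
```

The trade-off is an extra round trip to read the current partition values before the main query is submitted.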

sebastien_frack · ‎11-09-2017

Thanks, it helps.

before OVERWRITE :
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0

during OVERWRITE :
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 2 items
0 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/.hive-staging_hive_2017-11-09_10-35-38_682_2619781700846007196-1
718 2017-11-09 10:18 /apps/hive/warehouse/xyz.db/table_tmp/000000_0

after OVERWRITE :
$ hdfs dfs -ls /apps/hive/warehouse/xyz.db/table_tmp
Found 1 items
718 2017-11-09 10:35 /apps/hive/warehouse/xyz.db/table_tmp/000000_0

What I understand is that a query involving the file in this example that has been running since, say, 10:15 and is still executing at 10:35 is not guaranteed to execute correctly (although I can presume that the file, especially because it is small here, will already have been processed in a first stage of the M/R process). Is that so? I am wondering whether OVERWRITE is a good way to build an intermediate table in this case. Without the LOCK functionality enabled, would you suggest a better way?

sebastien_frack · ‎11-07-2017

Hi, in my organization, Hive is used with hive.support.concurrency set to false. I am wondering what the consequences are of inserting data during a select (and vice versa). On insert, I think the table's metadata is updated at the very end of the Map/Reduce job. A select should therefore not be disturbed, because I think the files involved in the select are determined at the very beginning of the M/R job. For an insert overwrite, I think it is pretty similar, but I could not find confirmation during my research. Could you validate (or not ;)) my thoughts? Thanks 🙂

Last Visited	‎05-30-2018 04:56 PM

Member Since	‎09-26-2017 02:10 PM
Posts	24

Cloudera Community
