Member since
12-30-2015
68
Posts
16
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2091 | 06-14-2016 08:56 AM
 | 1815 | 04-25-2016 11:59 PM
 | 2283 | 03-25-2016 06:50 PM
01-17-2017
08:13 PM
I could not paste both explain plans in my previous comment. Here is the explain plan with hive.explain.user set to false.
hive> set hive.explain.user=false;
hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: A515595_20170117140547_4494cba3-581e-441c-8fb6-8175b74d89c2:3
Edges:
Map 1 <- Map 2 (BROADCAST_EDGE)
DagName:
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
filterExpr: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Filter Operator
predicate: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 jdate (type: date)
1 jdate (type: date)
outputColumnNames: _col0, _col1, _col6
input vertices:
1 Map 2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
HybridGraceHashJoin: true
Filter Operator
predicate: (_col1 = _col6) (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: date)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: b
filterExpr: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: jdate (type: date)
sort order: +
Map-reduce partition columns: jdate (type: date)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.428 seconds, Fetched: 68 row(s)
01-17-2017
08:10 PM
Thanks for your comments! Here are the explain plan and the create table statements. The Hive version is 0.14. Also, regarding the third answer: in case both are partitioned tables, is there any way to ensure that the bigger of the two tables undergoes partition pruning instead of the smaller one? (Or is that the default behavior?) What does hive.explain.user=false do? I have attached the explain plan with this setting both enabled and disabled.
hive> show create table big_part;
OK
CREATE TABLE `big_part`(
`id` int)
PARTITIONED BY (
`jdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://littleredns/apps/hive/warehouse/big_part'
TBLPROPERTIES (
'transient_lastDdlTime'='1484615054')
Time taken: 1.749 seconds, Fetched: 14 row(s)
hive> show create table small_np;
OK
CREATE TABLE `small_np`(
`id2` int,
`jdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://littleredns/apps/hive/warehouse/small_np'
TBLPROPERTIES (
'transient_lastDdlTime'='1484615162')
Time taken: 0.16 seconds, Fetched: 13 row(s)
hive> set hive.optimize.ppd=true;
hive> set hive.tez.dynamic.partition.pruning=true;
hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
OK
Plan not optimized by CBO.
Vertex dependency in root stage
Map 1 <- Map 2 (BROADCAST_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Map 1 vectorized
File Output Operator [FS_21]
compressed:false
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Select Operator [OP_20]
outputColumnNames:["_col0","_col1"]
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator [FIL_19]
predicate:(_col1 = _col6) (type: boolean)
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Map Join Operator [MAPJOIN_18]
| condition map:[{"":"Inner Join 0 to 1"}]
| HybridGraceHashJoin:true
| keys:{"Map 2":"jdate (type: date)","Map 1":"jdate (type: date)"}
| outputColumnNames:["_col0","_col1","_col6"]
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
|<-Map 2 [BROADCAST_EDGE]
| Reduce Output Operator [RS_4]
| key expressions:jdate (type: date)
| Map-reduce partition columns:jdate (type: date)
| sort order:+
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
| Filter Operator [FIL_14]
| predicate:jdate is not null (type: boolean)
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
| TableScan [TS_1]
| alias:b
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
|<-Filter Operator [FIL_17]
predicate:jdate is not null (type: boolean)
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
TableScan [TS_0]
alias:a
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Time taken: 1.459 seconds, Fetched: 45 row(s)
01-17-2017
03:40 AM
Hi, could someone please help me understand the questions below on Hive partition pruning and explain plans?
1. How can I check whether partition pruning occurs by looking at the explain plan? I expected to see a "Dynamic Partitioning Event Operator" in the plan, but in my sample query below I am not seeing any such operator, even though I enabled hive.tez.dynamic.partition.pruning. Since the table does not have much data, the query goes for a map join; does that have anything to do with partition pruning not happening?
explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
big_part is partitioned on jdate, while small_np is a non-partitioned table. Even adding an explicit filter on jdate, like jdate = "2017-01-01", does not show this operator in the explain plan. The tables are just in text format. I tried disabling and enabling hive.optimize.ppd, but that only added or removed a filter operator higher up in the explain plan; there was no difference besides that. Does the hive.optimize.ppd parameter have any effect on partition pruning?
2. Is it correct to expect that dynamic partition pruning should happen on the big_part table in the above query?
3. If both tables used in the join are partitioned, can we expect dynamic partition pruning to happen on both tables?
4. Does dynamic partition pruning occur in the case of outer joins too (full and left outer, assuming the inner table's conditions are given in the ON clause and the outer table's conditions are given in the WHERE clause)?
5. What exactly does hive.optimize.ppd do in the case of text files? Does it just push the filter predicates down to the table read where possible? Thank you!
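For reference, a minimal sketch of the settings that typically need to be in place before Tez dynamic partition pruning (and its "Dynamic Partitioning Event Operator") can kick in; property names are from the Hive configuration, but the threshold values shown are illustrative assumptions, not recommendations:

```sql
-- Sketch only: settings commonly involved in Tez dynamic partition pruning.
-- Values are illustrative; verify names and defaults against your Hive version.
set hive.execution.engine=tez;
set hive.tez.dynamic.partition.pruning=true;
-- pruning is skipped when the pruning-side data exceeds these thresholds
set hive.tez.dynamic.partition.pruning.max.event.size=1048576;   -- 1 MB
set hive.tez.dynamic.partition.pruning.max.data.size=104857600;  -- 100 MB
-- fresh statistics help the optimizer choose the pruning path
analyze table big_part partition (jdate) compute statistics;

explain
select a.* from big_part a, small_np b where a.jdate = b.jdate;
```

With tiny tables like these, the plan may still collapse to a plain broadcast map join, since pruning only pays off when there are partitions worth skipping.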
Labels:
- Apache Hive
09-28-2016
07:00 PM
Thanks for the suggestion. I have not tried these parameters. What are they for? Are they the ones that help set the mapper memory size in Pig?
09-27-2016
10:37 PM
I am running my Pig scripts and Hive queries in Tez mode. For all of these Pig scripts and Hive queries, the mapper memory requested was more than the memory actually used, so I lowered mapreduce.map.memory.mb and also changed mapreduce.map.java.opts. Even after changing these values, the mapper memory requested is still more than the map memory used; nothing seemed to change in the performance metrics (this was from analyzing the job in Dr. Elephant). But the Pig script now aborts with the error message below after changing these settings:
"java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 1638 should be larger than 0 and should be less than the available task memory (MB):786"
I never set 786 MB anywhere in my settings; where did it take this value from? Also, how do I configure the map and reduce memory in Tez execution mode? (I see documentation for Hive on setting hive.tez.container.size, but nothing for Pig.) Is it possible to configure the map and reduce memory separately in Tez mode? The Hive-on-Tez documentation only mentions the map memory setting, nothing for reducer memory. And since Tez creates a DAG of tasks, they are not like map and reduce stages, right? Are both just seen as individual tasks in the DAG, or can these DAG tasks still be classified into mapper/reducer actions? Thanks!
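As a hedged sketch of how Tez task memory is usually sized for a Pig-on-Tez run: the property names below come from the Tez configuration, and the values are assumptions to adapt to your cluster (the JVM heap is typically around 80% of the container, and tez.runtime.io.sort.mb must fit inside the available task memory — which is exactly the constraint the error above complains about):

```shell
# Sketch only: passing Tez memory settings to a Pig script run in Tez mode.
# tez.task.resource.memory.mb  -> container size requested per Tez task
# tez.task.launch.cmd-opts     -> task JVM heap, ~80% of the container
# tez.runtime.io.sort.mb       -> sort buffer; must be less than the heap
pig -x tez \
    -Dtez.task.resource.memory.mb=2048 \
    -Dtez.task.launch.cmd-opts=-Xmx1640m \
    -Dtez.runtime.io.sort.mb=512 \
    myscript.pig
```

Under this reading, the 786 MB in the error is the task memory Tez computed from its own container setting (not from mapreduce.map.memory.mb), while tez.runtime.io.sort.mb was still sized for the old, larger container.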
Labels:
- Apache Hive
- Apache Pig
- Apache Tez
06-14-2016
08:56 AM
When I added hive-site.xml at first, I missed a few properties. I have now added all the properties mentioned by @allen huang in this link: https://community.hortonworks.com/questions/25121/oozie-execute-sqoop-falls.html#answer-25291 So even though Sqoop is called using an Oozie shell action, I had to add hive-site.xml with the properties mentioned by Allen. Thank you Allen :). My script is working fine now.
06-14-2016
06:56 AM
Hi, I checked the logs. There is no information as to why the script aborted; this is all that is shown in the log:
INFO hive.HiveImport: Loading uploaded data into Hive
WARN conf.HiveConf: HiveConf of name hive.metastore.pre-event.listeners does not exist
WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl does not exist
Logging initialized using configuration in jar:file:/grid/8/hadoop/yarn/local/filecache/5470/hive-common-1.2.1.2.3.4.0-3485.jar!/hive-log4j.properties
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
06-10-2016
12:43 PM
Hi, I am running an Oozie shell action that runs a Sqoop command to import data into Hive. When I run the Sqoop command directly, it works fine, but when I run it through the Oozie shell action, it aborts with:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Based on this link, https://community.hortonworks.com/questions/25121/oozie-execute-sqoop-falls.html#answer-25290 I added hive-site.xml using the <file> tag in the Oozie shell action, and based on another link I also added export HIVE_CONF_DIR=`pwd` before running the Sqoop command. But nothing worked. When I add the full hive-site.xml, it results in the same error as above; when I add just the important properties mentioned in this link http://ingest.tips/2014/11/27/how-to-oozie-sqoop-hive/, I get this error:
FAILED: IllegalStateException Unxpected Exception thrown: Unable to fetch table XYZ. java.net.SocketException: Connection reset
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Both times, the Sqoop command successfully creates the file in the target directory but fails while loading this data into Hive. The Hadoop cluster is Kerberos enabled; I do a kinit before submitting the workflow, and kinit is done again inside the Oozie shell action. Can someone please shed some light on how to fix this? Below is the Sqoop command used.
sqoop import \
--connect "jdbc:teradata://${server}/database=${db},logmech=ldap" \
--driver "com.teradata.jdbc.TeraDriver" \
--table "XYZ" \
--split-by "col1" \
--hive-import \
--delete-target-dir \
--target-dir "/user/test/" \
--hive-table "default.XYZ" \
--username "terauser" \
--password tdpwd \
--where "${CONDITION}" \
--m 2 \
--fetch-size 1000 \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
--lines-terminated-by '\n' \
--null-string '\\N' \
--null-non-string '\\N'
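For what it's worth, here is a hedged sketch of the shell-action wiring described above; the action name, script name, and `${appPath}` paths are placeholders, not the actual workflow:

```xml
<!-- Sketch only: shipping hive-site.xml alongside the shell script so the
     Sqoop hive-import step can reach the metastore. Paths are placeholders. -->
<action name="sqoop-import">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_sqoop.sh</exec>
        <file>${appPath}/run_sqoop.sh</file>
        <file>${appPath}/hive-site.xml</file>
    </shell>
    <ok to="end"/>
    <error to="kill"/>
</action>
```

Inside run_sqoop.sh, exporting HIVE_CONF_DIR=$(pwd) before invoking Sqoop makes the shipped hive-site.xml the one that gets picked up, since Oozie localizes `<file>` entries into the action's working directory.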
Labels:
- Apache Hive
- Apache Oozie
- Apache Sqoop
06-10-2016
07:53 AM
Thanks, I was able to set up SSH and it is working. But I have a question: what is this oozie ID? I log in to Linux using my own ID, and I submit the workflow either with my ID or after a kinit to another ID. In the UI logs, the Oozie workflow is shown as submitted using either my ID or the other ID for which the ticket was obtained via kinit. What I don't understand is where this "oozie" user ID fits in. I even had to go to the home directory of the "oozie" user, get its keys, and add them to the authorized_keys file on my destination server. Can you please explain the purpose of this ID, and how to find this "oozie" user ID? Is it available in oozie-site.xml? Based on this article I also searched for the oozie ID, but what if it were different? How would I find it? Thanks!
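A hedged sketch of one way the service user is commonly identified (the command is generic; the exact process name depends on the distribution): the SSH action originates from the Oozie server host, so it is the user owning the Oozie server process whose public key must be authorized on the remote machine.

```shell
# Sketch only: on the Oozie server host, the owner column of the server
# process shows the service user (often "oozie" on HDP-style installs).
ps -ef | grep -i oozie | grep -v grep
```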
06-07-2016
06:07 PM
Hi, I need to execute a shell script on a remote server from the Hadoop server, so I am planning to use the Oozie SSH action for this. I have two basic questions regarding Oozie actions.
1. For passwordless SSH, I need to share public keys between the two servers. In the case of the Oozie SSH action, where does the Oozie workflow initiate the SSH connection from? Does it execute from one of the data nodes? If so, how do I set up SSH or get the public keys?
2. Does the Oozie shell action execute on any of the available data nodes, or is there a specific way the execution server is identified? Thanks!
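To make question 1 concrete, here is a hedged sketch of an SSH action (the action name, host, and script path are placeholders); as documented for this action type, the connection is initiated from the Oozie server host rather than from a data node, which determines whose public key needs to be in the remote authorized_keys:

```xml
<!-- Sketch only: an Oozie SSH action. Host and command are placeholders. -->
<action name="remote-script">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>remoteuser@remote.example.com</host>
        <command>/path/to/script.sh</command>
        <capture-output/>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
</action>
```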
Labels:
- Apache Hadoop
- Apache Oozie