Member since: 12-30-2015
Posts: 68
Kudos Received: 16
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 565 | 06-14-2016 08:56 AM
 | 522 | 04-25-2016 11:59 PM
 | 724 | 03-25-2016 06:50 PM
02-13-2017
11:28 PM
@Jasper Thanks for your comments. Could you please also let me know whether this is the usual way Kafka consumers are run in a Hadoop cluster? If not, how are consumers and producers usually scheduled on a Hadoop cluster?
02-09-2017
02:05 AM
Question on scheduling a Kafka consumer client in a Hadoop cluster: I have coded a Kafka consumer client that reads messages from a topic and writes them to a local file. I want to schedule this consumer so that it runs continuously and reads from the topic as soon as messages are published. Can someone please explain the standard way of doing this in a Hadoop cluster? I have the following approach in mind, but I am not sure whether it is the usual one; please let me know your thoughts or suggestions. (The sample client writes to a file in the local filesystem, but that is just for testing. When I schedule it, I plan to write to an HDFS file and process it later; eventually I plan to write to HBase directly from the Kafka consumer.)

I am thinking of creating an Oozie workflow that calls the consumer client through a Java action and submitting that workflow as many times as the number of consumers I want. I will also change the consumer to write to an HDFS file instead of a local file (the HDFS filename will be suffixed with the partition number so that two consumers don't try to write to the same file). If I follow this approach, the Kafka clients run on YARN, right? Do I have to do anything specific for consumer rebalancing, or will that work as usual? I am just assigning topics to the consumer, not subscribing to specific partitions. And in general, do I have to code the Java client any differently to run it through Oozie? The entire Java client will be launched in a single mapper in my case, correct?
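For context, here is a minimal, hedged sketch of the kind of long-running consumer described above, using the 0.10-era Kafka consumer API and the HDFS FileSystem API. The broker address, topic name, and output path are placeholders, and the instance id passed as args[0] is just a hypothetical way to keep the per-consumer file names distinct; launched from an Oozie Java action of that era, the whole loop would run inside the single launcher task, which matches the expectation of one mapper per client.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicToHdfsConsumer {
    public static void main(String[] args) throws Exception {
        String instanceId = args[0];                       // hypothetical: passed in from the workflow

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:6667");    // placeholder broker list
        props.put("group.id", "hdfs-writer");              // all instances share one group, so Kafka balances partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        consumer.subscribe(Collections.singletonList("my_topic"));   // topic subscription, no manual partition assignment

        FileSystem fs = FileSystem.get(new Configuration());
        // One output file per consumer instance so two consumers never write to the same file.
        FSDataOutputStream out = fs.create(new Path("/tmp/kafka/my_topic_" + instanceId), true);

        while (true) {                                     // runs until the container is killed
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> rec : records) {
                out.writeBytes(rec.value() + "\n");
            }
            out.hsync();                                   // flush so readers can see the data
        }
    }
}
```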
01-17-2017
08:13 PM
Could not paste both explain plans in the previous comment. Here is the explain plan with hive.explain.user set to false. hive> set hive.explain.user=false;
hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
DagId: A515595_20170117140547_4494cba3-581e-441c-8fb6-8175b74d89c2:3
Edges:
Map 1 <- Map 2 (BROADCAST_EDGE)
DagName:
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
filterExpr: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Filter Operator
predicate: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 jdate (type: date)
1 jdate (type: date)
outputColumnNames: _col0, _col1, _col6
input vertices:
1 Map 2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
HybridGraceHashJoin: true
Filter Operator
predicate: (_col1 = _col6) (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: date)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: b
filterExpr: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: jdate is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: jdate (type: date)
sort order: +
Map-reduce partition columns: jdate (type: date)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.428 seconds, Fetched: 68 row(s)
01-17-2017
08:10 PM
Thanks for your comments! Here are the explain plan and the create table statements. The Hive version is 0.14. Also, regarding the third answer: if both tables are partitioned, is there any way to ensure that the bigger of the two tables undergoes partition pruning instead of the smaller one, or is that the default behavior? What does hive.explain.user=false do? I have attached the explain plan with this setting both enabled and disabled. > show create table big_part;
OK
CREATE TABLE `big_part`(
`id` int)
PARTITIONED BY (
`jdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://littleredns/apps/hive/warehouse/big_part'
TBLPROPERTIES (
'transient_lastDdlTime'='1484615054')
Time taken: 1.749 seconds, Fetched: 14 row(s)
hive> show create table small_np;
OK
CREATE TABLE `small_np`(
`id2` int,
`jdate` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://littleredns/apps/hive/warehouse/small_np'
TBLPROPERTIES (
'transient_lastDdlTime'='1484615162')
Time taken: 0.16 seconds, Fetched: 13 row(s)
hive> set hive.optimize.ppd=true;
hive> set hive.tez.dynamic.partition.pruning=true;
hive> explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
OK
Plan not optimized by CBO.
Vertex dependency in root stage
Map 1 <- Map 2 (BROADCAST_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Map 1 vectorized
File Output Operator [FS_21]
compressed:false
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Select Operator [OP_20]
outputColumnNames:["_col0","_col1"]
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator [FIL_19]
predicate:(_col1 = _col6) (type: boolean)
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Map Join Operator [MAPJOIN_18]
| condition map:[{"":"Inner Join 0 to 1"}]
| HybridGraceHashJoin:true
| keys:{"Map 2":"jdate (type: date)","Map 1":"jdate (type: date)"}
| outputColumnNames:["_col0","_col1","_col6"]
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
|<-Map 2 [BROADCAST_EDGE]
| Reduce Output Operator [RS_4]
| key expressions:jdate (type: date)
| Map-reduce partition columns:jdate (type: date)
| sort order:+
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
| Filter Operator [FIL_14]
| predicate:jdate is not null (type: boolean)
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
| TableScan [TS_1]
| alias:b
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
|<-Filter Operator [FIL_17]
predicate:jdate is not null (type: boolean)
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
TableScan [TS_0]
alias:a
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: PARTIAL
Time taken: 1.459 seconds, Fetched: 45 row(s)
01-17-2017
03:40 AM
Hi, could someone please help me understand the questions below on Hive partition pruning and explain plans?
1. How do I check whether partition pruning occurs by looking at the explain plan? I thought I would see a "Dynamic Partitioning Event Operator" in the plan, but in my sample query below I am not seeing any such operator, even though hive.tez.dynamic.partition.pruning is enabled. Since the table does not have much data, the query goes for a map join; does that have anything to do with partition pruning not happening? explain select a.* from big_part a, small_np b where a.jdate = b.jdate;
big_part is partitioned on jdate and small_np is a non-partitioned table. Even adding an explicit filter on jdate, such as jdate = "2017-01-01", does not show this operator in the explain plan. The tables are plain text format. I tried enabling and disabling hive.optimize.ppd, but that only added or removed a Filter Operator higher up in the plan; nothing else changed. Does hive.optimize.ppd have any effect on partition pruning?
2. Is it correct to expect dynamic partition pruning on the big_part table in the above query?
3. If both tables in the join are partitioned, can we expect dynamic partition pruning on both tables?
4. Will dynamic partition pruning occur with outer joins too (full and left outer, assuming the inner table's conditions are in the ON clause and the outer table's conditions are in the WHERE clause)?
5. What exactly does hive.optimize.ppd do in the case of text files? Just push the filter predicates down to the table read where possible? Thank you!
01-04-2017
07:33 PM
Thanks for your detailed reply. I created a Hive table with the column mapped to bigint in Hive, and even then the column is displayed as NULL. Only the values that were loaded into HBase as strings are displayed in Hive (whether the Hive column is string or bigint). Can you please let me know why? Is it always better to write values as strings when I plan to read them through Hive or the HBase shell, or is there something I am missing here? CREATE EXTERNAL TABLE `hbase_table_4`(
`key` string COMMENT 'from deserializer',
`value` bigint COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping'='offset:value',
'serialization.format'='1')
TBLPROPERTIES (
'hbase.table.name'='kafka_conn',
'transient_lastDdlTime'='1483506197')
hive> select * from hbase_table_4;
OK
test_0909 NULL
test_0910 NULL
test_0911 111
test_0919 5
test_0920 5
hbase(main):004:0> scan 'kafka_conn', {VERSIONS => 10}
ROW COLUMN+CELL
test_0909 column=offset:value, timestamp=1483558076087, value=\x00\x00\x00\x00\x00\x00\x00\x0F
test_0910 column=offset:value, timestamp=1483498353863, value=\x00\x00\x00\x00\x00\x00\x00\x0A
test_0911 column=offset:value, timestamp=1483504038021, value=111
test_0919 column=offset:value, timestamp=1483505296398, value=5
test_0920 column=offset:value, timestamp=1483505356278, value=5
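For reference, here is a minimal hedged sketch of the two write encodings involved, using the same Put.add call as in the code in question (row key, family, and qualifier names are placeholders taken from the scan output above): Bytes.toBytes(long) stores a raw 8-byte big-endian long, which the shell prints as \x00..\x0F and a string-serialized Hive mapping cannot parse, while storing the decimal digits as a string is readable by both the shell and Hive. Some Hive versions also allow the HBase column mapping to declare binary storage (e.g. a #b suffix in hbase.columns.mapping), but that is version-dependent, so treat this only as an illustration of the difference.

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OffsetEncodingSketch {
    // Placeholder family/qualifier matching the scan output above.
    private static final byte[] CF = Bytes.toBytes("offset");
    private static final byte[] Q  = Bytes.toBytes("value");

    // (a) Binary encoding: an 8-byte big-endian long.
    //     The shell shows \x00..\x0F and a string-mapped Hive column returns NULL.
    public static Put binaryPut(String rowKey, long newOffset) {
        Put p = new Put(Bytes.toBytes(rowKey));
        p.add(CF, Q, Bytes.toBytes(newOffset));
        return p;
    }

    // (b) String encoding: the decimal digits, e.g. "15".
    //     Readable in the shell, and Hive can read it whether the column is
    //     declared string or bigint (for bigint it parses the digits).
    public static Put stringPut(String rowKey, long newOffset) {
        Put p = new Put(Bytes.toBytes(rowKey));
        p.add(CF, Q, Bytes.toBytes(Long.toString(newOffset)));
        return p;
    }
}
```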
01-04-2017
04:43 AM
I am using the HBase Put API to write a long value to HBase with the code below:
p.add(Bytes.toBytes(this.hcol_fam_n), Bytes.toBytes(this.hcol_qual_n), Bytes.toBytes(this.newoffset));
When I run a scan in the HBase shell, the values written this way are displayed as hex-escaped binary, e.g. value=\x00\x00\x00\x00\x00\x00\x00\x07, whereas values written with the shell's put command are displayed as readable integers. Now, when I create a Hive external table on top of this HBase table and declare the column as string, I am unable to read the values loaded through the Java code; they are displayed as NULL. Only the rows loaded through the HBase shell's put command are returned when queried through Hive.
1. Any idea how to make Hive display this column for all the records?
2. I don't understand why the shell displays hex-escaped binary for the numbers inserted through Java but readable ASCII for records inserted through the shell's put command. Any reason?
3. How do I make the HBase shell display the values in readable ASCII with the scan or get command?
4. Even in Java code, I am unable to convert the result of a Get to a string using Bytes.toString(); it returns null, whereas Bytes.toLong() works. Why the difference?
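To make question 4 concrete, here is a small hedged sketch (hypothetical table and row names, old-style HTable client from the HBase 0.98/1.x era) of reading a cell back: the decoding call has to match the encoding used in the Put, which is why Bytes.toLong works for values written with Bytes.toBytes(long), while Bytes.toString only yields readable digits for values that were written as strings.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadOffsetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "kafka_conn");          // hypothetical table name
        try {
            Get g = new Get(Bytes.toBytes("test_0909"));        // hypothetical row key
            Result r = table.get(g);
            byte[] raw = r.getValue(Bytes.toBytes("offset"), Bytes.toBytes("value"));

            if (raw.length == Bytes.SIZEOF_LONG) {
                // Cell written with Bytes.toBytes(long): 8 raw bytes.
                // Only Bytes.toLong recovers the number; Bytes.toString would just
                // reinterpret the bytes as (mostly unprintable) characters, not digits.
                System.out.println("binary long: " + Bytes.toLong(raw));
            } else {
                // Cell written as a string such as "15": Bytes.toString recovers the
                // text and Long.parseLong turns it back into a number.
                System.out.println("string value: " + Long.parseLong(Bytes.toString(raw)));
            }
        } finally {
            table.close();
        }
    }
}
```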
12-29-2016
08:54 AM
I have read in many places that HBase does not perform well for joins but performs well for random reads and writes. My question: would it still perform well for a bulk read of an HBase table by full rowkey, say reading 30% of the table where the requested rowkeys are random and well distributed rather than concentrated in a few regions? Consider an HBase table whose regions are evenly distributed across many region servers. If an external table is created for it in Hive and that external table is joined with a Hive managed table on the HBase rowkey, would the huge number of rowkey lookups during the join be a performance bottleneck in this scenario? If so, could you please explain why? Thanks!
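To make the access pattern concrete, here is a hedged sketch (hypothetical table name and key list) of what "many random rowkey reads spread across regions" looks like from the Java client: a batch of point Gets, which the client groups by region server, as opposed to one contiguous Scan over a key range. Whether a Hive-on-HBase join actually issues point lookups or falls back to scanning the table depends on the storage handler and predicate pushdown, so this only illustrates the read pattern being asked about.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomReadSketch {
    // joinKeys: the rowkeys coming from the other side of the join (hypothetical input).
    public static Result[] fetch(List<String> joinKeys) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_hbase_table");   // hypothetical table name
        try {
            List<Get> gets = new ArrayList<Get>();
            for (String key : joinKeys) {
                gets.add(new Get(Bytes.toBytes(key)));       // one point read per join key
            }
            // The client batches these per region server; the work is spread across
            // however many regions the requested keys happen to land in.
            return table.get(gets);
        } finally {
            table.close();
        }
    }
}
```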
12-21-2016
12:51 AM
@mqureshi I have one question based on your comment (point 8). What if I join on the entire HBase row key (say the HBase table is huge, ~20 million rows, and the join touches almost 20% of the table at random, so the reads may hit many different region servers, not just one specific RS)? Could you please let me know whether this kind of join on the HBase row key would be efficient as long as it does not read everything from the same RS? Also, if I create an external table in Hive for the HBase table and join another Hive table (whose key is bucketed and not skewed) with it on the row key through that external table, would the HBase scan be a bottleneck in this scenario? Please let me know your thoughts.
09-28-2016
07:00 PM
Thanks for the suggestion. I have not tried these parameters. What are they for? Are these the ones that help set the mapper memory size in Pig?
09-27-2016
10:37 PM
I am running my Pig scripts and Hive queries in Tez mode. For all of these scripts and queries, the mapper memory requested was more than the memory actually used, so I lowered mapreduce.map.memory.mb and also changed mapreduce.map.java.opts. Even after changing these values, the requested mapper memory is still more than the memory used and nothing seemed to change in the performance metrics (this was from analyzing the jobs in Dr. Elephant). But after changing these settings, the Pig script also aborted with the error below:

"java.lang.IllegalArgumentException: tez.runtime.io.sort.mb 1638 should be larger than 0 and should be less than the available task memory (MB):786"

I never set 786 MB anywhere in my settings; where does this value come from? Also, how do I configure the map and reduce memory in Tez execution mode? (I see documentation for Hive that says to set hive.tez.container.size, but nothing for Pig.) Is it possible to configure map and reduce memory differently in Tez mode? The Hive-on-Tez documentation only mentions a map memory setting, nothing for reducer memory. And since Tez creates a DAG of tasks, they are not like MapReduce, right? Are map and reduce each just seen as an individual task in the DAG, or can these DAG tasks still be classified into mapper/reducer actions? Thanks!
08-24-2016
09:43 PM
1 Kudo
Hi, I am unable to cast a column from chararray to int when the column comes from the OVER operator. I am using OVER to derive the value of a column from the next record. I define the operator like this: DEFINE Over org.apache.pig.piggybank.evaluation.Over('chararray'); so the output is chararray. The following code produces a chararray (I can see the type with the describe command), but I am unable to cast it to int. dist_dates = foreach expr_d_set {
distinct_dates = distinct LND_DATA_REQD.file_date;
sorted_ip = order distinct_dates by file_date;
stitched = Stitch(sorted_ip, Over(sorted_ip.file_date, 'lead',0,1,1,'99991231'));
generate flatten(group) as (queueid:chararray, acdid:int), flatten(stitched) as (file_date:int, expr_d:chararray) ;
};
grunt> describe dist_dates;
...
dist_dates: {queueid: chararray,acdid: int,file_date: int,expr_d: chararray} --> Notice the chararray here?
I try this to cast to integer: I = FOREACH dist_dates generate (int)expr_d; but it fails with the error "java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String". I don't understand why this cast is not working; I am able to cast other columns to int with the same syntax. I am, however, able to convert the same column to a date: I = FOREACH dist_dates generate ToDate(expr_d,'YYYYMMdd') as EXPR_D; Can someone please suggest what I am missing here?
06-20-2016
06:57 AM
Hi, my input data is in the format below.

row | col1 | col2 | col3 | effective date | expiry date
---|---|---|---|---|---
1 | Q1 | A1 | Value1 | 01/01 | 01/02
2 | Q1 | A1 | Value1 | 01/02 | 01/03
3 | Q1 | A1 | Value1 | 01/03 | 01/05
4 | Q1 | A1 | Value2 | 01/05 | 01/06
5 | Q1 | A1 | Value2 | 01/06 | 01/07
6 | Q1 | A1 | Value2 | 01/07 | 01/08
7 | Q1 | A1 | Value1 | 01/08 | 01/11
8 | Q1 | A1 | Value1 | 01/11 | 12/31

I need to remove duplicates based on the values of col1, col2, and col3, but not all of them: records count as duplicates only until the value of col3 changes. For example, in the data above, Value1 changes to Value2 at record 4, so among records 1, 2, and 3 only record 1 should be retained; among records 4, 5, and 6 only record 4 should be retained; and among records 7 and 8 only record 7 should be retained. The last two columns are date columns (effective and expiry date). A run of duplicates like records 1-3 could be longer (e.g. records 1 through 5 could have the same value) or there could be no duplicates at all. I have two approaches in mind, but I am not sure how to code either of them; see the sketch after this list.
1. Generate a "keychange" column (1 or 0) that is 0 for all the duplicates and is set to 1 whenever the key (the combination of col1, col2, col3) changes; then I could filter on this column. For this I would need to write a UDF (or is there an existing UDF with similar functionality?). Since this requires the input to be sorted before being passed to the UDF, is it possible to pass sorted data to a UDF, and if so, how? What kind of UDF should this be? Alternatively, if I write MapReduce code, how should I proceed: should I just emit the records in the mapper and do all the sorting and column generation in the reducer? Please share your inputs (I am new to MapReduce programming, so your ideas will help me a lot in learning, thanks!).
2. From the OVER function documentation, it only compares the same column of the previous and current records. If I could somehow compare col5 (expiry date) of the current record with col4 (effective date) of the next record after sorting by col4 ascending, I could group by col1, col2, and col3 and eliminate the records whose effective date equals the previous record's expiry date. But I am not sure how to compare two different columns using the OVER function. Please share your thoughts on this one too.

Please let me know if there is a better way to solve this. Thank you for your time!
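To make approach 1 concrete, here is a hedged sketch in plain Java of the key-change test (the record type and field names are hypothetical). If the records for one (col1, col2) group arrive sorted by effective date, for example via a secondary sort in a reducer, this loop keeps only the first record of every run of identical (col1, col2, col3) values, which are exactly the rows to retain in the example above.

```java
import java.util.List;

public class KeyChangeDedup {
    // Hypothetical record type matching the sample rows above.
    static class Rec {
        String col1, col2, col3, effDate, expDate;
    }

    // Given records already sorted by effective date within a (col1, col2) group,
    // keep only the first record of each run of identical (col1, col2, col3).
    // This is the "keychange" idea from approach 1: in a reducer, apply the same
    // loop to the sorted values received for one key.
    static void emitFirstOfEachRun(List<Rec> sorted) {
        String prevKey = null;
        for (Rec r : sorted) {
            String key = r.col1 + "\u0001" + r.col2 + "\u0001" + r.col3;
            if (!key.equals(prevKey)) {               // keychange == 1 -> keep this record
                System.out.println(String.join(",", r.col1, r.col2, r.col3, r.effDate, r.expDate));
            }
            prevKey = key;                            // the rest of the run is dropped
        }
    }
}
```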
06-15-2016
07:12 AM
Hi, I have a workflow with four actions like this:
A -> B -> C (on success, via the ok-to transition)
B -> D (if B fails, via the error-to transition)
When I run this workflow, A succeeds and B fails, so C is run and succeeds. When I rerun the workflow, A is skipped (I have set oozie.wf.rerun.failnodes=true in the properties file). Now B runs again and fails, but this time during the rerun C does not run again since it succeeded in the previous run. I want C to run every time B fails. I tried oozie.wf.rerun.skip.nodes=, but that causes all actions to rerun after failure; action A also executes again, and I don't want that behaviour. I want all actions on the workflow path after the failed action to be rerun when the failed action is rerun by resubmitting the workflow (in this case, since B is rerun, if it fails again I want C to run again irrespective of its status the previous time). Is there any way to achieve this without splitting the workflow? (B is a Hive action and C is an email action, so every time B fails I want the email action to be triggered with its error message.) Please suggest.
06-14-2016
08:56 AM
When I first added the hive-site.xml, I missed a few properties; I have now added all the properties mentioned by @allen huang in this link: https://community.hortonworks.com/questions/25121/oozie-execute-sqoop-falls.html#answer-25291. So even when Sqoop is called from an Oozie shell action, I had to add a hive-site.xml with the properties mentioned by Allen. Thank you Allen :). My script is working fine now.
06-14-2016
06:56 AM
Hi, I checked the logs but found no information about why the script aborted. This is all that is shown in the log: INFO hive.HiveImport: Loading uploaded data into Hive
WARN conf.HiveConf: HiveConf of name hive.metastore.pre-event.listeners does not exist
WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl does not exist
Logging initialized using configuration in jar:file:/grid/8/hadoop/yarn/local/filecache/5470/hive-common-1.2.1.2.3.4.0-3485.jar!/hive-log4j.properties
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
06-10-2016
12:43 PM
Hi, I am running an Oozie shell action that runs a Sqoop command to import data into Hive. When I run the Sqoop command directly it works fine, but when I run it through the Oozie shell action it aborts with: Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]. Based on this link, https://community.hortonworks.com/questions/25121/oozie-execute-sqoop-falls.html#answer-25290, I added hive-site.xml using the <file> tag in the Oozie shell action, and based on another link I also added export HIVE_CONF_DIR=`pwd` before running the Sqoop command, but neither worked. When I add the full hive-site.xml I get the same error as above; when I add just the important properties mentioned in this link, http://ingest.tips/2014/11/27/how-to-oozie-sqoop-hive/, I get this error: FAILED: IllegalStateException Unxpected Exception thrown: Unable to fetch table XYZ. java.net.SocketException: Connection reset, followed by: Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]. Both times, the Sqoop command successfully creates the files in the target directory but fails while loading the data into Hive. The Hadoop cluster is Kerberos-enabled; I do a kinit before submitting the workflow and again inside the Oozie shell action. Can someone please shed some light on how to fix this? Below is the Sqoop command used: sqoop import \
--connect "jdbc:teradata://${server}/database=${db},logmech=ldap" \
--driver "com.teradata.jdbc.TeraDriver" \
--table "XYZ" \
--split-by "col1" \
--hive-import \
--delete-target-dir \
--target-dir "/user/test/" \
--hive-table "default.XYZ" \
--username "terauser" \
--password tdpwd \
--where "${CONDITION}" \
--m 2 \
--fetch-size 1000 \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
--lines-terminated-by '\n' \
--null-string '\\N' \
--null-non-string '\\N'
06-10-2016
07:53 AM
Thanks, I was able to set up SSH and it is working. But I have a question: what is this "oozie" ID? I log in to Linux with my own ID and submit the workflow either with my ID or after a kinit to another ID. In the UI logs the workflow is shown as submitted by either my ID or the ID for which the ticket was obtained via kinit. What I don't understand is where the "oozie" user ID fits in. I even had to go to the home directory of the "oozie" user, take its keys, and add them to the destination server's authorized_keys file. Can you please explain the purpose of this ID, and how to find it? Is this user ID available in oozie-site.xml? Based on this article I searched for the oozie ID, but what if it were different? How would I find it? Thanks!
06-07-2016
06:07 PM
Hi, I need to execute a shell script on a remote server from the Hadoop cluster, so I am planning to use the Oozie SSH action. I have two basic questions about Oozie actions.
1. For passwordless SSH I need to exchange public keys between the two servers. With the Oozie SSH action, where does the workflow initiate the SSH connection from? Does it execute on an arbitrary data node? If so, how do I set up SSH and which public keys do I use?
2. Does the Oozie shell action execute on any of the available data nodes, or is there a specific way the execution host is chosen?
Thanks!
06-07-2016
06:01 PM
Hi, I am trying to read a file in HDFS with the hadoop fs -cat command inside an Oozie shell action, on a Kerberized cluster. The Oozie workflow is submitted with my ID, A, but the file can only be read by ID B. Inside the shell script I do a kinit -kt with B's keytab and then klist; klist shows B as the default principal with a valid ticket. Yet even with B's valid ticket, hadoop fs -cat executes as my ID (A), not B, and fails with an insufficient-privileges error. Why is hadoop fs -cat using my ID instead of B's ticket?

The same thing works when I run the commands individually from Linux instead of an Oozie shell action: I log in with my ID, klist shows only my principal, I kinit as B, klist now shows B's ticket, and in the same shell (from the Linux command line, not Oozie) hadoop fs -cat filename displays the file's contents. Why does this work directly from Linux but not from the Oozie shell action? After a kinit as a different user, all hadoop commands in the Linux CLI seem to run as that second user, and I assumed the Oozie shell action would behave the same way. Please help me understand this.

Note: when I log in with my ID, do a kinit as the second user before submitting the Oozie workflow, and then submit the workflow, it works: all commands inside the workflow's shell action now seem to execute as the second user rather than my ID, and there are no issues. Please help me understand this as well.
05-19-2016
11:16 PM
Question on Pig's exec command: I am executing a Pig script from inside another Pig script. The outer script runs in Tez mode; will the script executed via the exec command also run in Tez mode? Will the entire script run through exec be executed in a single vertex? Will there be any difference in the DAG created for the inner script called with exec compared to running it directly? I ask because my Pig script was aborting due to the max counters limit (I posted that question here: https://community.hortonworks.com/questions/34420/oozie-pig-action-mapreduce-job-counters-limitexcep.html). The script aborts when run from Oozie, but when I execute the same script with an exec statement inside another Pig script, with the same settings, it runs without any issues. What is the difference?
05-19-2016
09:38 PM
Hi, my Pig action called from an Oozie workflow fails with the error below. Although Pig is executing on Tez, it is hitting the MapReduce max counters limit, and this is only an issue when run through Oozie.

org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 131 max=130

I think this is a known issue (https://issues.apache.org/jira/browse/PIG-4760); my Pig version is 0.15.0. I am trying to manually set mapreduce.job.counters.max to a higher value, but I am not sure how to do this correctly without modifying mapred-site.xml. I tried the following and nothing worked:
1. Setting it in the Oozie workflow under the <configuration> tag in a <property> element.
2. Adding a modified mapred-site.xml and referencing it in the <job-xml> tag of the Oozie workflow.
3. Setting it inside Pig with set, and also in workflow.xml using <configuration> and <property> within the Pig action.
None of these works, so is the only fix to modify the mapred-site.xml in the Hadoop home directory? Any suggestions would be helpful. Thanks!
05-19-2016
04:11 AM
For some reason I am unable to post a reply to your last comment. Yes, the table is in HCatalog. I was able to run hcat -e "describe table;"; it is just truncate that is not working. hcat -e "truncate table tablename;" gives the same error I got from Pig. According to https://cwiki.apache.org/confluence/display/Hive/HCatalog+CLI, all commands that do not require a MapReduce job should run through hcat, right? Describe works. And all Hive managed tables are HCatalog tables, right?
05-19-2016
01:25 AM
My bad, I didn't look at your sample script properly. I tried these commands, but only show tables works. I tried select * from table and truncate table tablename and neither worked. Please find the error below. grunt> sql select * from temp_dim_w;
2016-05-18 20:24:13,705 [main] INFO org.apache.pig.tools.grunt.GruntParser - Going to run hcat command: select * from temp_dim_w;
WARNING: Use "yarn jar" to launch YARN applications.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
16/05/18 20:24:15 WARN conf.HiveConf: HiveConf of name hive.metastore.pre-event.listeners does not exist
16/05/18 20:24:15 WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl does not exist
FAILED: SemanticException Operation not supported.
grunt> sql truncate table temp_dim_w;
2016-05-18 20:24:31,858 [main] INFO org.apache.pig.tools.grunt.GruntParser - Going to run hcat command: truncate table temp_dim_w;
WARNING: Use "yarn jar" to launch YARN applications.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
16/05/18 20:24:33 WARN conf.HiveConf: HiveConf of name hive.metastore.pre-event.listeners does not exist
16/05/18 20:24:33 WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl does not exist
FAILED: SemanticException Operation not supported.
05-19-2016
12:52 AM
I am able to read from and write to the table, but the writes are always appends; I could not truncate the Hive table.
05-18-2016
11:13 PM
1 Kudo
Is there a way to truncate a Hive table from Pig? I need to overwrite a temp table every time I load data into it, but I am unable to find anything on how to overwrite the contents. The only way I can think of is to remove the files in the table's directory manually and rewrite them using the store interface. Is there a better way to do this? Please advise. Thanks!
05-18-2016
08:07 PM
Thank you!
05-18-2016
08:07 PM
Thank you!
05-18-2016
07:59 PM
Thanks for the suggestion. Actually, I am just using this sequence-file table as a landing area for data from an external source, with little to no modification, and using an ORC table as the final table for querying. But since the data in the sequence-file table will be retained for around 100 days, I wanted it compressed without compromising read performance; that's why I chose Snappy compression over the others, thinking it might be faster to query.
05-18-2016
07:54 PM
Hi, thanks for your reply. So if I set the compression codec for a textfile or sequencefile table to gzip, bzip2, LZO, or Snappy, will the files created by an insert statement be splittable, or do they need to be read entirely on one node? What about the other file formats like Parquet, Avro, and RCFile? Is whether a file can be split determined by the file format or by the compression method? It is the compression method, right? Which compression methods make the file data non-splittable? Also, what does "compressed" actually mean in the output of describe formatted? Is that some table-level compression, and how is it enabled?

If I understand the answers correctly, the compression codec used for a sequence file depends on the parameters set before the insert statement. If that is the case, what if I enable compression and set it to Snappy before the first insert, set a different codec for the second insert, and disable compression before the third insert? Would that create three differently encoded sets of files underneath? I tried this and I am able to select the data from the table, but I am not sure whether it actually created three different encodings in the background. Any idea? And if one of these compression methods is non-splittable, then only the mapper reading that particular file has to read the whole file, right? The other mappers reading the other files would still read their splits, right?
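On the splittability question specifically, here is a hedged sketch of one way to check it programmatically: Hadoop's CompressionCodecFactory picks a codec from the file extension (the same lookup plain-text input formats use), and a codec that implements SplittableCompressionCodec (bzip2, for example) can be split as a raw file, while gzip and raw Snappy streams cannot. Container formats such as SequenceFile, ORC, Parquet, and Avro compress blocks internally, so they remain splittable regardless of the codec; the check below only matters for plain compressed text files. The path argument is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class CodecSplitCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Pick the codec from the file extension, e.g. /data/part-00000.gz or .snappy
        Path p = new Path(args[0]);                       // placeholder path
        CompressionCodec codec = factory.getCodec(p);

        if (codec == null) {
            System.out.println("uncompressed or unknown extension: splittable");
        } else {
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(codec.getClass().getSimpleName()
                    + (splittable ? " is splittable" : " is NOT splittable as a raw file"));
        }
    }
}
```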