Member since: 12-10-2015
Posts: 58
Kudos Received: 24
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 395 | 02-17-2016 04:12 AM
 | 733 | 02-03-2016 05:15 AM
 | 353 | 01-27-2016 09:13 AM
 | 1036 | 01-27-2016 07:00 AM
 | 387 | 01-02-2016 03:29 PM
11-09-2016
12:11 PM
1 Kudo
I want to ingest data from the local file system into HDFS. I have chosen the spooling directory source for that, but I need to write a custom event deserializer to read Excel files. How do I write one? Any help?
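If it helps, here is a minimal sketch of a spooling directory source configuration that plugs in a custom deserializer. The agent/source names, the spool directory, and the class com.example.flume.ExcelEventDeserializer$Builder are placeholders (a custom class you would still have to write, e.g. on top of Apache POI), not a tested setup:
# hypothetical agent and source names; channel wiring omitted
agent.sources = excel-src
agent.sources.excel-src.type = spooldir
agent.sources.excel-src.spoolDir = /var/spool/excel
# point the source at a custom EventDeserializer.Builder implementation
agent.sources.excel-src.deserializer = com.example.flume.ExcelEventDeserializer$Builder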
... View more
Labels:
11-08-2016
06:35 AM
1 Kudo
I am just wondering whether anybody has come across a scenario where you need to import or read data from Excel into Hadoop. Is there such a thing as a Flume Excel source? By the way, I know I can convert the Excel file to CSV and then deal with it; I am really just trying to explore Flume sources a bit further here.
... View more
Labels:
02-22-2016
01:42 PM
I already have a table with a union type column in Hive. My question is how to access that column so that I can get at the different data types inside it. For a struct we can access a field like s.x; how do we access the data types inside a union type?
... View more
02-22-2016
12:10 PM
1 Kudo
I have a column in a Hive table of type UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>. Selecting that column gives results like the following. 1) How do I get a particular tag/column type, e.g. the struct (tag 3)? 2) If the selected type is a struct, how do I access the fields of that struct?
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
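For context, Hive's UNIONTYPE support has historically been incomplete: as far as I know there is no built-in syntax (like s.x for structs) to extract the tag or the value of a union member, so workarounds usually involve a custom UDF or avoiding unions in query-facing tables. A minimal, hypothetical sketch only to illustrate how the {tag:value} output maps to the declared positions:
-- hypothetical table named union_demo
CREATE TABLE union_demo (u UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>);
-- selecting the column prints {tag:value}; tag 0 is the int branch, tag 3 the struct branch, e.g. {3:{"a":5,"b":"five"}}
SELECT u FROM union_demo;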
... View more
- Tags:
- Data Processing
- Hive
Labels:
02-17-2016
04:12 AM
I completed this task by downloading the hwi.*.war file from Hive 0.12, as I didn't find it in 0.13 or 0.14.
... View more
02-15-2016
08:41 AM
1 Kudo
What are the prerequisites for starting HWI (Hive Web Interface) on HDP 2.2?
... View more
Labels:
02-09-2016
04:02 PM
I installed the Hive ODBC Driver for HDP 2.2 on my Windows 7 machine and am trying to connect to Hive through ODBC (Hadoop is installed on CentOS). I encountered the following error. The configs are all default; for example, authentication for HiveServer2 is "none" (the default). Is there anything I missed? I followed the Hortonworks document. I gave the server IP, and the port is 10000. I assume HiveServer2 is running because the following beeline command works:
beeline -u jdbc:hive2://ip:10000
... View more
Labels:
02-05-2016
02:13 PM
@Artem Ervits Yes Artem, I know about casting, but this column is not accepting any type. See the following:
lead_result = foreach gprd {
C1 = order req_cols by time ASC;
generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.time, 'lead', 0, 1, 1, 0))) as (year,month,day,time,cust_id1,cust_id2,page_url,visit_num,next_url_hit_time:bytearray);
};
change_col_type = foreach lead_result generate next_url_hit_time as next_url:chararray;
This is the first time I am facing this issue, and the part in bold is completely new to me.
2016-01-16 09:04:54,994 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable field schema: declared is "next_url:chararray", infered is "next_url_hit_time:NULL"
2016-01-16 09:04:54,994 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2016-01-16 09:04:54,994 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse: <line 19, column 18> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable field schema: declared is "next_url:chararray", infered is "next_url_hit_time:NULL"
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:199)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1063)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:558)
at org.apache.pig.Main.main(Main.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
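One possible workaround (an untested sketch, not a confirmed fix): leave the column produced by Over untyped in the AS clause and cast it explicitly, by position, in a separate FOREACH. The alias next_url and the position $8 are assumptions based on the schema above:
lead_result = foreach gprd {
    C1 = order req_cols by time ASC;
    -- no type declared here for the column produced by Over
    generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.time, 'lead', 0, 1, 1, 0)));
};
-- explicit cast from the untyped (bytearray) field; $0 .. $7 keeps the original columns
change_col_type = foreach lead_result generate $0 .. $7, (chararray)$8 as next_url;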
... View more
02-05-2016
01:56 PM
@Artem Ervits Thank you for the reply. Before applying Over, the schema looks like:
req_cols: {year: int,month: int,day: int,time: int,cust_id1: chararray,cust_id2: chararray,post_page_url: bytearray,visit_num: int}
After I apply lead with Over, we get one more column, say "next_url_hit_time" ($8), which is exactly where I am facing the issue. See the following code:
lead_result = foreach gprd {
C1 = order req_cols by time ASC;
generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.time, 'lead', 0, 1, 1, 0))) as (year,month,day,time,cust_id1,cust_id2,page_url,visit_num,next_url_hit_time:chararray);
};
The above one generates an error like:
grunt> lead_result = foreach gprd {
>> C1 = order req_cols by time ASC;
>> generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.time, 'lead', 0, 1, 1, 0))) as (year,month,day,time,cust_id1,cust_id2,page_url,visit_num,next_url_hit_time:chararray);
>> };
2016-01-16 08:47:53,566 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable field schema: declared is "next_url_hit_time:chararray", infered is ":NULL"
2016-01-16 08:47:53,566 [main] WARN org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2016-01-16 08:47:53,567 [main] ERROR org.apache.pig.tools.grunt.Grunt - Failed to parse: Pig script failed to parse: <line 12, column 14> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable field schema: declared is "next_url_hit_time:chararray", infered is ":NULL"
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:199)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1707)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1680)
at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1063)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:558)
at org.apache.pig.Main.main(Main.java:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
But the one below executes OK:
lead_result = foreach gprd {
C1 = order req_cols by time ASC;
generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.time, 'lead', 0, 1, 1, 0))) as (year,month,day,time,cust_id1,cust_id2,page_url,visit_num,next_url_hit_time);
};
Except bytearray, it is not accepting any other type. What I actually need is to cast the column generated by the lead function.
... View more
02-05-2016
01:02 PM
1 Kudo
given_data = load '/clickstream/total_hitdata/05/hit_data.tsv' using PigStorage('\t');
filtered = FILTER given_data by ($133!=0);
req_cols = foreach filtered generate GetYear(ToDate((chararray)$25,'yyyy-MM-dd HH:mm:ss','GMT')) as year:int,GetMonth(ToDate((chararray)$25,'yyyy-MM-dd HH:mm:ss','GMT')) as month:int,GetDay(ToDate((chararray)$25,'yyyy-MM-dd HH:mm:ss','GMT')) as day:int,($161-1400000000) as time,$343 as cust_id1:chararray,$344 as cust_id2:chararray,$256 as post_page_url,$466 as visit_num:int;
gprd = group req_cols by (year,month,day,cust_id1,cust_id2,visit_num);
lead_result = foreach gprd {
C1 = order req_cols by time ASC;
generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.time, 'lead', 0, 1, 1, 0)));
};
In the lead_result relation I used the 'lead' function according to my requirement. $8 is the column generated by the lead function alongside the old schema, but I am unable to cast it to any type. I get the following error when I try to cast it to chararray with the alias my:
<line 57, column 4> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable field schema: declared is "my:chararray", infered is ":NULL"
The overall schema is:
lead_result: {stitched::year: int,stitched::month: int,stitched::day: int,stitched::time: int,stitched::cust_id1: chararray,stitched::cust_id2: chararray,stitched::post_page_url: bytearray,stitched::visit_num: int,NULL}
... View more
Labels:
02-03-2016
07:01 AM
As @Gangadhar Kadam said, it has a problem in 0.13 but works fine in 0.14.
... View more
02-03-2016
05:15 AM
1 Kudo
The configuration variable "sqoop.export.records.per.statement" can be set to 1 as a workaround for this problem. https://issues.apache.org/jira/browse/SQOOP-314
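As an illustration (connection details are placeholders mirroring the original command, not verified), the property can be passed as a generic -D option right after the tool name:
sqoop export -Dsqoop.export.records.per.statement=1 \
  --connect jdbc:oracle:thin:@ipaddress:1521:orcl \
  --username user -P \
  --table EMP --columns EMPNO,ENAME,JOB,MGR \
  --export-dir /sqooptest/export -m 1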
... View more
01-29-2016
03:35 PM
Yeah, @Artem Ervits, I got your point. Simple but logical.
... View more
01-29-2016
10:31 AM
Hi All, according to my requirement I need a script like the following:
A = load '/bsuresh/sample' USING PigStorage(',') as (id,name,sal,deptid);
B = GROUP A by deptid;
C = foreach B {
D = A.name,A.sal;--two fields
E = DISTINCT D;
generate group,COUNT(E);
};
};
In relation 'D' I am extracting two fields, which is exactly where I am facing the error. If I change the script like the following, it works fine:
C = foreach B {
D = A.name; -- one field
E = DISTINCT D;
generate group,COUNT(E);
};
But I need the count based on the distinct of two columns. Can anyone help me?
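For reference, a small sketch of how projecting two fields from the bag is usually written, using the A.(f1, f2) form inside the nested block (same relation and field names as above; treat it as an untested sketch):
C = foreach B {
    -- project both fields of the bag in a single expression
    D = A.(name, sal);
    E = DISTINCT D;
    generate group, COUNT(E);
};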
... View more
- Tags:
- Data Processing
- Pig
Labels:
01-27-2016
09:16 AM
Use an ISO time pattern instead of the dd-MMM-yyyy pattern from my code.
... View more
01-27-2016
09:13 AM
2 Kudos
All the comments mentioned here are correct; here is a small example:
emp = load 'data' using PigStorage(',') as (empno,ename ,job,mgr,hiredate ,sal,comm,deptno);
each_date = foreach emp generate ToDate(hiredate,'dd-MMM-yyyy') as mydate;
subt = foreach each_date generate mydate,SubtractDuration(mydate,'PT1M');
dump subt;
... View more
01-27-2016
07:00 AM
3 Kudos
I guess it is not considering the param file. Try this:
pig -param_file=hdfs://ip-XXX-XX-XX-XXX.ec2.internal:8020/home/hadoop/adh_time /home/hadoop/test.pig
When writing -param_file at the end, I encountered the same issue too.
... View more
01-05-2016
10:08 AM
1 Kudo
I have a number of files with names in formats like 1) filename+date.fileformat and 2) filename.fileformat. Now I need to copy only the files which have a number before the . (dot).
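A minimal sketch of one approach, assuming the HDFS shell's glob support for character ranges and placeholder source/target paths:
# copies only files whose name has a digit immediately before the dot, e.g. filename20160105.csv
hdfs dfs -cp '/source/dir/*[0-9].*' /target/dir/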
... View more
Labels:
01-05-2016
04:37 AM
@Kuldeep Kulkarni Is there any way to do it directly in the grunt shell using "set"? e.g. set exectype=tez;
... View more
01-05-2016
04:21 AM
I would like to know the different ways to enable HCatalog and Tez when writing Pig scripts.
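For example (a sketch; the script path is a placeholder, and Tez mode depends on the Pig version), both can be turned on from the command line when launching a script:
pig -useHCatalog -x tez /path/to/script.pig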
... View more
- Tags:
- Data Processing
- Pig
Labels:
01-04-2016
02:15 PM
@Benjamin Leonhardi Thank you. Yes, the script sets the environment variable and then executes the Pig launcher in $PIG_HOME, like:
exec /usr/hdp/2.2.8.0-3150/pig/bin/pig.distro "$@"
... View more
01-04-2016
01:53 PM
@Benjamin Leonhardi I used dump after illustrate, so I got the error; the problem is with the "illustrate" command. Actually I have a habit of using illustrate after every Pig command I run in the grunt shell to check the output.
... View more
01-04-2016
09:28 AM
I would like to print or read PIG_HOME from the terminal. Is there any way?
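A minimal sketch, assuming PIG_HOME is exported in the shell environment; otherwise the launcher location can give a hint:
echo $PIG_HOME
# or locate the pig launcher script and look at the install directory it points to
which pig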
... View more
- Tags:
- Data Processing
- Pig
Labels:
01-04-2016
08:56 AM
Apache Pig version 0.12.1.2.1.7.0-784. I have data where one of the fields has no value, like:
2015,,08
2015,,09
2015,,11
2015,,04
2015,,05
Now I run the Pig commands like:
grunt> given_input = load '/pigtest/flightdelays/' using PigStorage(',') as (year,month,day);
grunt> ori = foreach given_input generate month;
grunt> illustrate ori;
This generates an error like: Caused by: java.lang.RuntimeException: No (valid) input data found!
When I replace the loader with CSVExcelStorage like:
grunt> given_input = load '/pigtest/flightdelays/' using org.apache.pig.piggybank.storage.CSVExcelStorage(',') as (year,month,day);
grunt> ori = foreach given_input generate month;
grunt> illustrate ori;
I get output like:
-------------------------------------------------------------------------------
| given_input | year:bytearray | month:bytearray | day:bytearray |
-------------------------------------------------------------------------------
| | 2015 | | 05 |
-------------------------------------------------------------------------------
--------------------------------
| ori | month:bytearray |
--------------------------------
| | |
--------------------------------
So, I would like to know: 1) What is the problem with PigStorage? 2) Is it a loader problem or a Pig version problem? 3) If I want to use PigStorage here, how should I do it? Not only illustrate; even dump behaves the same.
... View more
- Tags:
- Data Processing
- Pig
Labels:
01-02-2016
03:29 PM
1 Kudo
@Vidya SK DISTINCT in Pig is a relational operator, so it applies to relations rather than to individual fields. Consider the following:
given_input = load '/given/path' using PigStorage(',') as (col1,col2,col3);
Consider these situations.
1) Suppose I want to keep only the unique values of col1. Then:
unique_col1 = foreach given_input generate col1;
unique_values = DISTINCT unique_col1;
(DISTINCT only operates on relations, i.e. unique_col1.) Suppose col1 contains data like:
hortonworks
hortonworks
cloudera
then you get:
cloudera
hortonworks
2) Suppose I want to keep only the unique combinations of col1 and col2. Then:
unique_two_fields = foreach given_input generate col1,col2;
unique_values = DISTINCT unique_two_fields;
(DISTINCT only operates on relations.) Suppose col1 and col2 contain data like:
hortonworks,cloudera
hortonworks,cloudera
hortonworks,hortonworks
then you get:
hortonworks,cloudera
hortonworks,hortonworks
Like this, first project the data you want to make unique into one relation and then apply the DISTINCT operator. If you want to perform any aggregations, then use GROUP and apply the aggregations.
... View more
12-31-2015
06:05 AM
@Guilherme Braccialli Yup, it's working.
... View more
12-31-2015
05:59 AM
@Artem Ervits Now I changed the sqoop command to this:
sqoop-export --connect jdbc:oracle:thin:@ipaddress:orcl --username username -P --table EMP --columns EMPNO,ENAME,JOB,MGR --export-dir /sqooptest/export -m 1 --direct
Even so, no result; it behaves the same as I mentioned. Current sqoop version:
15/12/31 11:18:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.7.0-784
See the following error once: sqoop database hanging error
... View more
12-31-2015
05:42 AM
@hrongali I tried it that way as well:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.auto.convert.join=true;
Now, when I make the tables sorted as well as bucketed, I get the following error:
hive> explain select * FROM bucket_small a JOIN bucket_big b ON a.key = b.key;
FAILED: SemanticException [Error 10135]: Sort merge bucketed join could not be performed. If you really want to perform the operation, either set hive.optimize.bucketmapjoin.sortedmerge=false, or set hive.enforce.sortmergebucketmapjoin=false.
When I set the following property to false and run the explain again, a map join is generated:
hive> set hive.optimize.bucketmapjoin.sortedmerge=false;
... View more
12-30-2015
06:28 AM
1 Kudo
Sqoop version: 1.4.4.2.1.7.0-784. The following is the sqoop command that I used to export simple (comma-separated) records:
sqoop-export --connect jdbc:oracle:thin:@ipaddress:1521:orcl --username user --password password --table EMP --columns EMPNO,ENAME,JOB,MGR --export-dir /sqooptest/export -m 1 --batch
The above command gets stuck at 95% and never completes.
... View more
Labels:
12-28-2015
12:20 PM
I have two tables, bucket_small and bucket_big, as follows:
hive> show create table bucket_big;
OK
CREATE TABLE `bucket_big`(
`id` int,
`student_id` string,
`student_name` string,
`course_id` int)
CLUSTERED BY (
course_id)
INTO 4 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hdp1.stratapps.com:8020/apps/hive/warehouse/bucket_big'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='4',
'numRows'='10',
'rawDataSize'='148',
'totalSize'='158',
'transient_lastDdlTime'='1451302285')
Time taken: 0.166 seconds, Fetched: 23 row(s)
hive>
hive> show create table bucket_small;
OK
CREATE TABLE `bucket_small`(
`course_id` int,
`course_name` string)
CLUSTERED BY (
course_id)
INTO 2 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hdp1.stratapps.com:8020/apps/hive/warehouse/bucket_small'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='2',
'numRows'='6',
'rawDataSize'='39',
'totalSize'='45',
'transient_lastDdlTime'='1451302349')
Time taken: 0.172 seconds, Fetched: 21 row(s)
And I inserted data as follows:
hive>insert overwrite table bucket_big select *from table_one;
hive>insert overwrite table bucket_small select *from table_two;
Now the tables have buckets as I expected, and I set the following configurations:
hive>set hive.auto.convert.join=true;
hive>set hive.auto.convert.sortmerge.join=true;
hive>set hive.optimize.bucketmapjoin = true;
hive>set hive.optimize.bucketmapjoin.sortedmerge = true;
When I run the following query,
hive> explain select /*+ MAPJOIN(a) */ b.student_id,a.course_name FROM bucket_small a JOIN bucket_big b ON a.course_id = b.course_id;
OK
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
a
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a
TableScan
alias: a
filterExpr: course_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 39 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: course_id is not null (type: boolean)
Statistics: Num rows: 3 Data size: 19 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
condition expressions:
0 {course_name}
1 {student_id}
keys:
0 course_id (type: int)
1 course_id (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: b
filterExpr: course_id is not null (type: boolean)
Statistics: Num rows: 10 Data size: 148 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: course_id is not null (type: boolean)
Statistics: Num rows: 5 Data size: 74 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
-----------------
condition map:
Inner Join 0 to 1
condition expressions:
0 {course_name}
1 {student_id}
keys:
0 course_id (type: int)
1 course_id (type: int)
outputColumnNames: _col1, _col6
Statistics: Num rows: 5 Data size: 81 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col6 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5 Data size: 81 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5 Data size: 81 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.299 seconds, Fetched: 70 row(s)
hive>
It is generating only a plain map join, not a bucket map join or SMB map join. What is the problem?
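For reference, SMB (sort-merge-bucket) map joins generally require the bucketed tables to also be sorted on the join key, with bucketing/sorting enforced when the data is inserted. A hypothetical sketch of what the big table's DDL would look like with SORTED BY added (abbreviated from the definition above, not a verified fix):
CREATE TABLE bucket_big_sorted (
  id int,
  student_id string,
  student_name string,
  course_id int)
CLUSTERED BY (course_id)
SORTED BY (course_id ASC)
INTO 4 BUCKETS;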
... View more
- Tags:
- Data Processing
- Hive
Labels: