Member since 04-27-2016
60 Posts
20 Kudos Received
0 Solutions
12-01-2016
08:54 PM
You might want to look at Workflow Designer too, which is in Technical Preview in HDP 2.5. You can work with it in the sandbox (http://hortonworks.com/downloads/#sandbox) and get an idea of how you can create Oozie workflows with Pig, Hive, and Spark actions.
10-06-2016
04:46 PM
RDD's saveAsTextFile does not give us the opportunity to do that (DataFrames have "save modes" for things like append/overwrite/ignore). You'll have to handle this either beforehand (maybe delete or rename the existing data) or afterwards (write the RDD to a different directory and then swap it out).
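As a rough sketch only (I haven't run this against your data), assuming a spark-shell where sc and spark are already defined; the output path and column name below are made up:

import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder data; in practice this is whatever RDD you are writing out.
val rdd = sc.parallelize(Seq("a", "b", "c"))
val outPath = "/tmp/rdd-output"   // hypothetical target directory

// Option 1: clear the existing directory yourself, then save the RDD.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(outPath), true)   // recursive delete; returns false if the path is absent
rdd.saveAsTextFile(outPath)

// Option 2: convert to a DataFrame and use its save modes instead.
val df = rdd.toDF("value")
df.write.mode(SaveMode.Overwrite).text(outPath)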
09-17-2016
11:22 PM
gkeys, many thanks! This was a fantastic answer and it cleared up all of my doubts! 😄 😄
09-04-2016
07:24 PM
@João Souza This requirement is based around FILTER, which retrieves the records that satisfy one or more conditions. There are two ways to do this. The first is using FILTER, as below:

X = FILTER Count BY Field > 10;
Y = FILTER Count BY Field <= 10;

The second way achieves the same result using different grammar:

SPLIT Count INTO X IF Field > 10, Y IF Field <= 10;

Please note that the use of SUM requires a GROUP operation beforehand. In your case, you would have needed to GROUP the data before you summed it, as shown in your first line of code. It would have to look something like the following.

data = LOAD ... AS (amt:int, name:chararray);
grouped_data = GROUP data BY name;
summed_data = FOREACH grouped_data GENERATE SUM(data.amt) AS amtSum, group AS name;
X = FILTER summed_data BY amtSum > 10;
Y = FILTER summed_data BY amtSum <= 10;

See:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SUM
http://www.thomashenson.com/sum-field-apache-pig/

(Let me know if this is what you are looking for by accepting the answer.)
08-08-2016
07:20 PM
Hi @João Souza Personally, I'd create a script for each individual table. This way I can focus on the one table (if something changes) rather than modifying a larger script that encompasses all the tables (which would of course mean more code, creating a steeper learning curve for another developer).
08-04-2016
10:46 AM
1 Kudo
I was missing some Jar files 🙂
08-03-2016
08:06 AM
Perfect, Lester 🙂 It's exactly what I need! 🙂 Many thanks!
07-19-2016
04:08 PM
1 Kudo
Spark has a GraphX component library (soon to be upgraded to GraphFrames), which can be used to model graph-type relationships. These relationships are modeled by combining a vertex table (vertices) with an edge table (edges). Read here for more info: http://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
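As a small, made-up sketch of that vertices-plus-edges model (assuming a spark-shell where sc is already defined; the names and relationships below are invented):

import org.apache.spark.graphx.{Edge, Graph}

// Vertex table: (vertexId, attribute) pairs -- the attribute here is just a name.
val vertices = sc.parallelize(Seq(
  (1L, "alice"),
  (2L, "bob"),
  (3L, "charlie")))

// Edge table: Edge(srcId, dstId, attribute) -- the attribute describes the relationship.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows")))

// Combine the two tables into a property graph.
val graph = Graph(vertices, edges)

// Simple query: how many "follows" relationships are in the graph?
graph.edges.filter(_.attr == "follows").count()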
06-30-2016
12:21 AM
I haven't tested it, but I believe using -tagFile will prepend the file name, which will place it at position 0 instead of 1. I.e.:

GENERATE
    (chararray)$0 AS Filename,
    (chararray)$1 AS ID,
    etc.

Hope this solves it!
07-19-2016
03:20 PM
I agree with the notes identified in the comments between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests that you can set pig.splitCombination to false to get over the hump that you may be running multiple files in a single mapper. That said, I did a simple test on the 2.4 Sandbox with three files (named file1.txt, file2.txt and file3.txt), each with the following contents.

a,b,c
a,b,c
a,b,c

I then ran the following simple script (tried it with MR as well as Tez as the execution engine).

a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ', '-tagFile');
DUMP a;

And I got the following (expected) output.

(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)

Good luck!