Member since 04-27-2016
60 Posts
20 Kudos Received
0 Solutions
12-01-2016
08:54 PM
You might want to look at Workflow Designer too, which is in Technical Preview in HDP 2.5. You can work with it in the sandbox (http://hortonworks.com/downloads/#sandbox) and get an idea of how you can create Oozie workflows with Pig, Hive, and Spark actions.
10-06-2016
04:46 PM
RDD's saveAsTextFile does not give us the opportunity to do that (DataFrames have "save modes" for things like append/overwrite/ignore). You'll have to handle this either beforehand (maybe delete or rename the existing data) or afterwards (write the RDD to a different directory and then swap it out).
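As a rough sketch only (I haven't run this against your data), assuming a spark-shell where sc and spark are already defined; the output path and column name below are made up:

import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder data; in practice this is whatever RDD you are writing out.
val rdd = sc.parallelize(Seq("a", "b", "c"))
val outPath = "/tmp/rdd-output"   // hypothetical target directory

// Option 1: clear the existing directory yourself, then save the RDD.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(outPath), true)   // recursive delete; returns false if the path is absent
rdd.saveAsTextFile(outPath)

// Option 2: convert to a DataFrame and use its save modes instead.
val df = rdd.toDF("value")
df.write.mode(SaveMode.Overwrite).text(outPath)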
09-17-2016
11:22 PM
gkeys, many thanks! This was a fantastic answer and it cleared up all of my doubts! 😄 😄
09-04-2016
07:24 PM
@João Souza This requirement is based around FILTER, which retrieves the records that satisfy one or more conditions. There are two ways to do this. The first is using FILTER, as below:

X = FILTER Count BY Field > 10;
Y = FILTER Count BY Field <= 10;

The second way achieves the same result using different grammar:

SPLIT Count INTO X IF Field > 10, Y IF Field <= 10;

Please note that the use of SUM requires a GROUP operation beforehand. In your case, you would have needed to GROUP the data before you summed it, as shown in your first line of code. It would have to look something like the following.

data = LOAD ... AS (amt:int, name:chararray);
grouped_data = GROUP data BY name;
summed_data = FOREACH grouped_data GENERATE SUM(data.amt) AS amtSum, group AS name;
X = FILTER summed_data BY amtSum > 10;
Y = FILTER summed_data BY amtSum <= 10;

See:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SUM
http://www.thomashenson.com/sum-field-apache-pig/

(Let me know if this is what you are looking for by accepting the answer.)
08-08-2016
07:20 PM
Hi @João Souza Personally, I'd create a script for each individual table. This way I can focus on the one table (if something changes) rather than modifying a larger script that encompasses all the tables (which would of course mean more code, creating a steeper learning curve for another developer).
08-04-2016
10:46 AM
1 Kudo
I was missing some Jar files 🙂
08-03-2016
08:06 AM
Perfect, Lester 🙂 It's exactly what I need! 🙂 Many thanks!
07-19-2016
04:08 PM
1 Kudo
Spark has a GraphX component library (soon to be upgraded to GraphFrames), which can be used to model graph-type relationships. These relationships are modeled by combining a vertex table (vertices) with an edge table (edges). Read here for more info: http://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
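As a small, made-up sketch of that vertices-plus-edges model (assuming a spark-shell where sc is already defined; the names and relationships below are invented):

import org.apache.spark.graphx.{Edge, Graph}

// Vertex table: (vertexId, attribute) pairs -- the attribute here is just a name.
val vertices = sc.parallelize(Seq(
  (1L, "alice"),
  (2L, "bob"),
  (3L, "charlie")))

// Edge table: Edge(srcId, dstId, attribute) -- the attribute describes the relationship.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows")))

// Combine the two tables into a property graph.
val graph = Graph(vertices, edges)

// Simple query: how many "follows" relationships are in the graph?
graph.edges.filter(_.attr == "follows").count()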
06-30-2016
12:21 AM
I haven't tested it, but I believe using -tagFile will prepend the file name, which will place it at position 0 instead of 1. I.e.:

GENERATE
    (chararray)$0 AS Filename,
    (chararray)$1 AS ID,
    etc.

Hope this solves it!
07-19-2016
03:20 PM
I agree with the notes identified in the comments between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests that you can set pig.splitCombination to false to get over the hump that you may be running multiple files in a single mapper. That said, I did a simple test on the 2.4 Sandbox with three files (named file1.txt, file2.txt and file3.txt), each with the following contents.

a,b,c
a,b,c
a,b,c

I then ran the following simple script (tried it with MR as well as Tez as the execution engine).

a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ', '-tagFile');
DUMP a;

And I got the following (expected) output.

(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)

Good luck!