Member since: 04-27-2016
Posts: 60
Kudos Received: 20
Solutions: 0
08-01-2016
10:50 AM
Hi Benjamin, I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.
07-31-2016
07:23 PM
I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to do some ETL jobs using Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to create some data quality activities in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't connect the Jar file with Pig. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:
1) Run the Pig scripts to clean the data and extract the new files into a new directory in HDFS
2) Have Java/Python independently read the new files and perform the data quality activities
3) If the data quality tests pass, load the files into Hive
In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what would be a good alternative for performing data quality jobs in this project?
Many thanks for your help!
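For context, this is roughly the kind of check I have in mind, expressed with Pig built-ins only so that no UDF Jar has to be registered (the path and field names below are just hypothetical placeholders):
Clean = LOAD '/path/cleaned' USING PigStorage('\t') AS (id:chararray, Date:chararray, amount:double);
SPLIT Clean INTO
    Good IF (id IS NOT NULL AND Date MATCHES '\\d{4}-\\d{2}-\\d{2}' AND amount IS NOT NULL),
    Bad OTHERWISE;
-- count the rejected rows; load Good into Hive only if this count is acceptable
BadCount = FOREACH (GROUP Bad ALL) GENERATE COUNT(Bad);
DUMP BadCount;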
Labels:
- Apache Pig
07-24-2016
07:09 PM
Thanks Lester 🙂 So in your opinion, is it better to do my transformation activities using Java? During my research I read that Apache Pig is where the ETL process is done in Big Data projects. What type of jobs would you recommend doing in Pig? Many thanks! 🙂
07-23-2016
11:15 PM
Hi experts, I have the following code:
SPLIT A INTO Src01 IF (Date=='2016-07-01'),
Src02 IF (Date=='2016-07-02'),
Src03 IF (Date=='2016-07-03'),
Src04 IF (Date=='2016-07-04'),
Src05 IF (Date=='2016-07-05'),
Src06 IF (Date=='2016-07-06'),
Src07 IF (Date=='2016-07-07'),
Src08 IF (Date=='2016-07-08'),
Src09 IF (Date=='2016-07-09'),
Src10 IF (Date=='2016-07-10'),
Src11 IF (Date=='2016-07-11'),
Src12 IF (Date=='2016-07-12'),
Src13 IF (Date=='2016-07-13'),
Src14 IF (Date=='2016-07-14'),
Src15 IF (Date=='2016-07-15'),
Src16 IF (Date=='2016-07-16'),
Src17 IF (Date=='2016-07-17'),
Src18 IF (Date=='2016-07-18'),
Src19 IF (Date=='2016-07-19'),
Src20 IF (Date=='2016-07-20'),
Src21 IF (Date=='2016-07-21'),
Src22 IF (Date=='2016-07-22'),
Src23 IF (Date=='2016-07-23'),
Src24 IF (Date=='2016-07-24'),
Src25 IF (Date=='2016-07-25'),
Src26 IF (Date=='2016-07-26'),
Src27 IF (Date=='2016-07-27'),
Src28 IF (Date=='2016-07-28'),
Src29 IF (Date=='2016-07-29'),
Src30 IF (Date=='2016-07-30'),
Src31 IF (Date=='2016-07-31'),
Src011 IF (Date=='2016-06-01');
STORE Src01 INTO '/path/2016-07-01' using PigStorage('\t');
STORE Src02 INTO '/path/2016-07-02' using PigStorage('\t');
STORE Src03 INTO '/path/2016-07-03' using PigStorage('\t');
STORE Src04 INTO '/path/2016-07-04' using PigStorage('\t');
STORE Src05 INTO '/path/2016-07-05' using PigStorage('\t');
STORE Src06 INTO '/path/2016-07-06' using PigStorage('\t');
STORE Src07 INTO '/path/2016-07-07' using PigStorage('\t');
STORE Src08 INTO '/path/2016-07-08' using PigStorage('\t');
STORE Src09 INTO '/path/2016-07-09' using PigStorage('\t');
STORE Src10 INTO '/path/2016-07-10' using PigStorage('\t');
STORE Src11 INTO '/path/2016-07-11' using PigStorage('\t');
STORE Src12 INTO '/path/2016-07-12' using PigStorage('\t');
STORE Src13 INTO '/path/2016-07-13' using PigStorage('\t');
STORE Src14 INTO '/path/2016-07-14' using PigStorage('\t');
STORE Src15 INTO '/path/2016-07-15' using PigStorage('\t');
STORE Src16 INTO '/path/2016-07-16' using PigStorage('\t');
STORE Src17 INTO '/path/2016-07-17' using PigStorage('\t');
STORE Src18 INTO '/path/2016-07-18' using PigStorage('\t');
STORE Src19 INTO '/path/2016-07-19' using PigStorage('\t');
STORE Src20 INTO '/path/2016-07-20' using PigStorage('\t');
STORE Src21 INTO '/path/2016-07-21' using PigStorage('\t');
STORE Src22 INTO '/path/2016-07-22' using PigStorage('\t');
STORE Src23 INTO '/path/2016-07-23' using PigStorage('\t');
STORE Src24 INTO '/path/2016-07-24' using PigStorage('\t');
STORE Src25 INTO '/path/2016-07-25' using PigStorage('\t');
STORE Src26 INTO '/path/2016-07-26' using PigStorage('\t');
STORE Src27 INTO '/path/2016-07-27' using PigStorage('\t');
STORE Src28 INTO '/path/2016-07-28' using PigStorage('\t');
STORE Src29 INTO '/path/2016-07-29' using PigStorage('\t');
STORE Src30 INTO '/path/2016-07-30' using PigStorage('\t');
STORE Src31 INTO '/path/2016-07-31' using PigStorage('\t');
STORE Src011 INTO '/path/2016-06-01' using PigStorage('\t');
Is there a way I can make this more automatic, like using a loop or some other iterative approach? Many thanks!
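For example, could something like piggybank's MultiStorage replace all of this? A rough sketch of what I am imagining, assuming the Date is the first field (index 0), the input path and schema below are placeholders, and the piggybank Jar is available on the cluster:
REGISTER /usr/hdp/current/pig-client/piggybank.jar;  -- Jar location is an assumption, adjust to the cluster
A = LOAD '/path/input' USING PigStorage('\t') AS (Date:chararray, col1:chararray, col2:chararray);
-- MultiStorage writes one sub-directory per distinct value of the chosen field (here field 0),
-- e.g. /path/by_date/2016-07-01/, /path/by_date/2016-07-02/, and so on
STORE A INTO '/path/by_date' USING org.apache.pig.piggybank.storage.MultiStorage('/path/by_date', '0', 'none', '\t');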
Labels:
- Apache Pig
07-19-2016
02:52 PM
Hi,
Has anyone already done (or read about) any social network analysis using Spark MLlib with Python? I need to do research with Spark to see the relationships between the people in an organization.
Many thanks!
Labels:
- Apache Spark
06-28-2016
04:45 PM
Emily, only one more question. My current code is attached. It executes successfully; however, my final data sets come back empty... Do you know why? pig-statement.txt
06-27-2016
01:32 PM
1 Kudo
Many thanks, Emily. One problem, I think: my column "date" isn't identified as a date because it appears like the filename "2016-06-23.txt", so I think it was created as a String. Can I do the SPLIT in the same way?
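For example, would something like this still work with the date kept as a plain String (a rough sketch, assuming the tagged filename is the first field; the other field names are placeholders)?
Src = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile') AS (fname:chararray, col1:chararray, col2:chararray);
-- string equality should be enough here, no date type needed
SPLIT Src INTO
    Day23 IF (fname == '2016-06-23.txt'),
    Day24 IF (fname == '2016-06-24.txt'),
    Day25 IF (fname == '2016-06-25.txt');
-- if a real date were ever needed: ToDate(SUBSTRING(fname, 0, 10), 'yyyy-MM-dd')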
06-26-2016
04:29 PM
2 Kudos
Hi experts, I used Apache Pig to add a new column to my 3 text files stored in HDFS. The three text files were:
2016-06-25.txt 2016-06-24.txt 2016-06-23.txt
However, after I execute my Pig code I have 7 files in HDFS (because of the MapReduce tasks):
part-m-0000 part-m-0001 part-m-0002 part-m-0003 ... part-m-0006
How can I obtain only 3 files with their original names? Basically, I want to add the new column but still have the same files with the same names... My code is:
Src = LOAD '/data/Src/' USING PigStorage(' ','-tagFile');
STORE Src INTO '/data/Src/Src2' USING PigStorage(' ');
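The only workaround I can think of is filtering on the tagged filename and storing each file into its own directory; is something like this the right direction (a rough sketch, with placeholder output paths)?
Src    = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile');
File23 = FILTER Src BY (chararray)$0 == '2016-06-23.txt';
File24 = FILTER Src BY (chararray)$0 == '2016-06-24.txt';
File25 = FILTER Src BY (chararray)$0 == '2016-06-25.txt';
STORE File23 INTO '/data/Src/Src2/2016-06-23' USING PigStorage(' ');
STORE File24 INTO '/data/Src/Src2/2016-06-24' USING PigStorage(' ');
STORE File25 INTO '/data/Src/Src2/2016-06-25' USING PigStorage(' ');
-- each directory still contains part-m-* files; getting back the exact original
-- file names would need a follow-up rename such as hdfs dfs -mv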
Labels:
- Apache Hadoop
- Apache Pig
06-17-2016
09:47 AM
Thanks Benjamin for your support 🙂
When you mention the Resource Manager, are you talking about the Job Browser, to see the logs?
06-16-2016
03:17 PM
No, only one 😞
I have put the files into 2 zipped files now. I don't know whether I will get better performance by doing this...