Member since: 04-27-2016
Posts: 60
Kudos Received: 20
Solutions: 0
08-01-2016
10:50 AM
Hi Benjamin, I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.
07-31-2016
07:23 PM
I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to do some ETL jobs using Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to create some data quality activities in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't connect the Jar file with Pig. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:
1) Run the Pig scripts to clean the data and extract the new files into a new directory in HDFS
2) Have Java/Python independently read the new files and perform the data quality activities
3) If the data quality tests pass, load the files into Hive
In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what would be a good alternative for performing data quality jobs in this project?
Many thanks for your help!
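For context, this is roughly the kind of check I have in mind, expressed with Pig built-ins only so that no UDF Jar has to be registered (the path and field names below are just hypothetical placeholders):
Clean = LOAD '/path/cleaned' USING PigStorage('\t') AS (id:chararray, Date:chararray, amount:double);
SPLIT Clean INTO
    Good IF (id IS NOT NULL AND Date MATCHES '\\d{4}-\\d{2}-\\d{2}' AND amount IS NOT NULL),
    Bad OTHERWISE;
-- count the rejected rows; load Good into Hive only if this count is acceptable
BadCount = FOREACH (GROUP Bad ALL) GENERATE COUNT(Bad);
DUMP BadCount;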
Labels:
- Apache Pig
07-24-2016
07:09 PM
Thanks Lester 🙂 So in your opinion, is it better to do my transformation activities using Java? During my research I read that Apache Pig is where the ETL process is done in Big Data projects. What type of jobs would you recommend doing in Pig? Many thanks! 🙂
07-23-2016
11:15 PM
Hi experts, I have the following code:
SPLIT A INTO Src01 IF (Date=='2016-07-01'),
Src02 IF (Date=='2016-07-02'),
Src03 IF (Date=='2016-07-03'),
Src04 IF (Date=='2016-07-04'),
Src05 IF (Date=='2016-07-05'),
Src06 IF (Date=='2016-07-06'),
Src07 IF (Date=='2016-07-07'),
Src08 IF (Date=='2016-07-08'),
Src09 IF (Date=='2016-07-09'),
Src10 IF (Date=='2016-07-10'),
Src11 IF (Date=='2016-07-11'),
Src12 IF (Date=='2016-07-12'),
Src13 IF (Date=='2016-07-13'),
Src14 IF (Date=='2016-07-14'),
Src15 IF (Date=='2016-07-15'),
Src16 IF (Date=='2016-07-16'),
Src17 IF (Date=='2016-07-17'),
Src18 IF (Date=='2016-07-18'),
Src19 IF (Date=='2016-07-19'),
Src20 IF (Date=='2016-07-20'),
Src21 IF (Date=='2016-07-21'),
Src22 IF (Date=='2016-07-22'),
Src23 IF (Date=='2016-07-23'),
Src24 IF (Date=='2016-07-24'),
Src25 IF (Date=='2016-07-25'),
Src26 IF (Date=='2016-07-26'),
Src27 IF (Date=='2016-07-27'),
Src28 IF (Date=='2016-07-28'),
Src29 IF (Date=='2016-07-29'),
Src30 IF (Date=='2016-07-30'),
Src31 IF (Date=='2016-07-31'),
Src011 IF (Date=='2016-06-01');
STORE Src01 INTO '/path/2016-07-01' using PigStorage('\t');
STORE Src02 INTO '/path/2016-07-02' using PigStorage('\t');
STORE Src03 INTO '/path/2016-07-03' using PigStorage('\t');
STORE Src04 INTO '/path/2016-07-04' using PigStorage('\t');
STORE Src05 INTO '/path/2016-07-05' using PigStorage('\t');
STORE Src06 INTO '/path/2016-07-06' using PigStorage('\t');
STORE Src07 INTO '/path/2016-07-07' using PigStorage('\t');
STORE Src08 INTO '/path/2016-07-08' using PigStorage('\t');
STORE Src09 INTO '/path/2016-07-09' using PigStorage('\t');
STORE Src10 INTO '/path/2016-07-10' using PigStorage('\t');
STORE Src11 INTO '/path/2016-07-11' using PigStorage('\t');
STORE Src12 INTO '/path/2016-07-12' using PigStorage('\t');
STORE Src13 INTO '/path/2016-07-13' using PigStorage('\t');
STORE Src14 INTO '/path/2016-07-14' using PigStorage('\t');
STORE Src15 INTO '/path/2016-07-15' using PigStorage('\t');
STORE Src16 INTO '/path/2016-07-16' using PigStorage('\t');
STORE Src17 INTO '/path/2016-07-17' using PigStorage('\t');
STORE Src18 INTO '/path/2016-07-18' using PigStorage('\t');
STORE Src19 INTO '/path/2016-07-19' using PigStorage('\t');
STORE Src20 INTO '/path/2016-07-20' using PigStorage('\t');
STORE Src21 INTO '/path/2016-07-21' using PigStorage('\t');
STORE Src22 INTO '/path/2016-07-22' using PigStorage('\t');
STORE Src23 INTO '/path/2016-07-23' using PigStorage('\t');
STORE Src24 INTO '/path/2016-07-24' using PigStorage('\t');
STORE Src25 INTO '/path/2016-07-25' using PigStorage('\t');
STORE Src26 INTO '/path/2016-07-26' using PigStorage('\t');
STORE Src27 INTO '/path/2016-07-27' using PigStorage('\t');
STORE Src28 INTO '/path/2016-07-28' using PigStorage('\t');
STORE Src29 INTO '/path/2016-07-29' using PigStorage('\t');
STORE Src30 INTO '/path/2016-07-30' using PigStorage('\t');
STORE Src31 INTO '/path/2016-07-31' using PigStorage('\t');
STORE Src011 INTO '/path/2016-06-01' using PigStorage('\t');
Is there a way I can make this more automatic, like using a loop or some other iterative approach? Many thanks!
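For example, could something like piggybank's MultiStorage replace all of this? A rough sketch of what I am imagining, assuming the Date is the first field (index 0), the input path and schema below are placeholders, and the piggybank Jar is available on the cluster:
REGISTER /usr/hdp/current/pig-client/piggybank.jar;  -- Jar location is an assumption, adjust to the cluster
A = LOAD '/path/input' USING PigStorage('\t') AS (Date:chararray, col1:chararray, col2:chararray);
-- MultiStorage writes one sub-directory per distinct value of the chosen field (here field 0),
-- e.g. /path/by_date/2016-07-01/, /path/by_date/2016-07-02/, and so on
STORE A INTO '/path/by_date' USING org.apache.pig.piggybank.storage.MultiStorage('/path/by_date', '0', 'none', '\t');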
Labels:
- Apache Pig
07-19-2016
02:52 PM
Hi,
Has anyone already done (or read about) any social network analysis using Spark MLlib with Python? I need to do research with Spark to see the relationships between the people in an organization.
Many thanks!
Labels:
- Apache Spark
06-28-2016
04:45 PM
Emily, only one more question. My current code is attached. It executes successfully; however, my final data sets come back empty... Do you know why? pig-statement.txt
06-27-2016
01:32 PM
1 Kudo
Many thanks, Emily. One problem, I think: my column "date" isn't identified as a date because it appears like the filename "2016-06-23.txt", so I think it was created as a String. Can I do the SPLIT in the same way?
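For example, would something like this still work with the date kept as a plain String (a rough sketch, assuming the tagged filename is the first field; the other field names are placeholders)?
Src = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile') AS (fname:chararray, col1:chararray, col2:chararray);
-- string equality should be enough here, no date type needed
SPLIT Src INTO
    Day23 IF (fname == '2016-06-23.txt'),
    Day24 IF (fname == '2016-06-24.txt'),
    Day25 IF (fname == '2016-06-25.txt');
-- if a real date were ever needed: ToDate(SUBSTRING(fname, 0, 10), 'yyyy-MM-dd')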
06-26-2016
04:29 PM
2 Kudos
Hi experts, I used Apache Pig to add a new column to my 3 text files stored in HDFS. The three text files were:
2016-06-25.txt 2016-06-24.txt 2016-06-23.txt
However, after I execute my Pig code I have 7 files in HDFS (because of the MapReduce tasks):
part-m-0000 part-m-0001 part-m-0002 part-m-0003 ... part-m-0006
How can I obtain only 3 files with their original names? Basically, I want to add the new column but still have the same files with the same names... My code is:
Src = LOAD '/data/Src/' USING PigStorage(' ','-tagFile');
STORE Src INTO '/data/Src/Src2' USING PigStorage(' ');
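The only workaround I can think of is filtering on the tagged filename and storing each file into its own directory; is something like this the right direction (a rough sketch, with placeholder output paths)?
Src    = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile');
File23 = FILTER Src BY (chararray)$0 == '2016-06-23.txt';
File24 = FILTER Src BY (chararray)$0 == '2016-06-24.txt';
File25 = FILTER Src BY (chararray)$0 == '2016-06-25.txt';
STORE File23 INTO '/data/Src/Src2/2016-06-23' USING PigStorage(' ');
STORE File24 INTO '/data/Src/Src2/2016-06-24' USING PigStorage(' ');
STORE File25 INTO '/data/Src/Src2/2016-06-25' USING PigStorage(' ');
-- each directory still contains part-m-* files; getting back the exact original
-- file names would need a follow-up rename such as hdfs dfs -mv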
Labels:
- Apache Hadoop
- Apache Pig
06-17-2016
09:47 AM
Thanks Benjamin for your support 🙂
When you mention the Resource Manager, are you talking about the Job Browser, to see the logs?
06-16-2016
03:17 PM
No, only one 😞
I have put the files into 2 zipped files now. I don't know whether I will get better performance by doing this...