Member since
04-27-2016
60 Posts
20 Kudos Received
0 Solutions
12-17-2016
11:08 AM
1 Kudo
Hi experts, I'm doing a bacheloor degree under Informatic Engineer and in my Big Data lesson I will need to create a project using Spark. The Schumacher for my dataset ise from a set of Cinemas and movies: - Customer_ID (identifier of a customer); - Ticket _ID (identifier for a ticket to a movie); - Freq_Movies (number of movies that customer see) - Value_Movies (total money that Customer already used to buy tickets) - Cinema_ID (the cinema where customer see the movie) - Movie_ID (the movie that Customer see) - Ticket _Value (purchase value of the movie ticket individually) - Year - Quarter - Month - Day - Day_Week My teacher told us to create some approach to predict "something ". I was thinking in creating a Forecasting Time-Series using Spark but I am not getting any good approach to use it... Can you give some help? Many thanks!!!!
11-30-2016
11:05 AM
Hi all, Is it possible to create a workflow in Oozie that automatically executes some Hive, Pig, and Spark scripts, in order to automate my analytics process? Many thanks!
10-29-2016
09:26 AM
Hi, Using the Hadoop ecosystem, in your opinion, is Impala powering the querying and Solr powering the visualization the best approach to calculate and visualize some KPIs?
What other tools would you recommend?
Thanks!
09-29-2016
06:38 PM
Hi experts,
How can I overwrite an existing file with a new one (a data update)? Imagine that I have this:
result.map(pair => pair.swap).sortByKey(true).saveAsTextFile("FILE/results")
And imagine that I want to do this: test.map(pair => pair.swap).sortByKey(false).saveAsTextFile("FILE/results")
How can I overwrite the results of result with the results of test in the same directory?
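One workaround I've seen, sketched here in PySpark under the assumption that deleting the old output first is acceptable (the same idea works from Scala): saveAsTextFile() refuses to write into an existing directory, so remove the previous results through the Hadoop FileSystem API before saving.

from pyspark import SparkContext

# A sketch, not tested: delete the old output directory via the Hadoop
# FileSystem API that Spark already carries, then save the new RDD.
sc = SparkContext(appName="OverwriteResults")

path = "FILE/results"
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
out = sc._jvm.org.apache.hadoop.fs.Path(path)
if fs.exists(out):
    fs.delete(out, True)  # recursive delete of the old output directory

test = sc.parallelize([("a", 2), ("b", 1)])  # stand-in for the real RDD
test.map(lambda pair: (pair[1], pair[0])).sortByKey(False).saveAsTextFile(path)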
09-17-2016
11:22 PM
gkeys, many thanks! This was a fantastic answer and it covers all of my doubts! 😄 😄
09-17-2016
08:15 PM
2 Kudos
What is the biggest advantage of using Hadoop instead of SQL Server or ODI when we aren't in a Big Data scenario?
Many thanks!
09-04-2016
02:53 PM
Hi experts,
I have this statement in Apache Pig:
...
Count = FOREACH data GENERATE SUM(Field);
...
How can I do an IF statement like this:
IF (SUM(Field) > 10) STORE INTO X;
ELSE
STORE INTO Y;
Is it possible to do this?
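To show the kind of branching I mean, here is a minimal sketch assuming Pig's embedded-Python (Jython) scripting API, run as `pig script.py`; the paths are placeholders:

from org.apache.pig.scripting import Pig

# A sketch, not tested: compute the sum in one Pig run, read it back in
# Python, then launch whichever STORE the condition selects.
P = Pig.compile("""
    data = LOAD '$input' USING PigStorage('\\t') AS (Field:int);
    grouped = GROUP data ALL;
    total = FOREACH grouped GENERATE SUM(data.Field) AS s;
    STORE total INTO '$sumout';
""")
stats = P.bind({'input': '/path/in', 'sumout': '/path/sum'}).runSingle()

# First field of the single tuple in 'total' is the sum
s = int(str(stats.result("total").iterator().next().get(0)))

dest = '/path/X' if s > 10 else '/path/Y'
Pig.compile("""
    data = LOAD '$input' USING PigStorage('\\t');
    STORE data INTO '$dest';
""").bind({'input': '/path/in', 'dest': dest}).run()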
Many thanks!
08-29-2016
01:28 PM
Alex Woolford, many thanks for your help 🙂 In your view, how would you plan this as a machine-learning project? What I'm seeing is that the algorithms I've seen so far only count the occurrences.
Thanks!
08-29-2016
12:03 PM
1 Kudo
Hi experts, I have the following dataset (just an example):
Customer_ID    Product_Desc
1              Jeans
1              T-Shirt
1              Food
2              Jeans
2              Food
2              Nightdress
2              T-Shirt
2              Hat
3              Jeans
3              Food
4              Food
4              Water
5              Water
5              Food
5              Beer
Is there any algorithm available that allows me to predict consumer behavior like this:
"When a customer buys Jeans, they also buy Food." The algorithms that I've found only calculate the most common products... not the associations between them 😞 Does anyone know a good tutorial that shows how I can predict the associations described above? The first step is to derive these relationships:
Jeans-T-Shirt-Food
Jeans-Food-Nightdress-T-Shirt-Hat
Jeans-Food
Food-Water
Water-Food-Beer
Does anyone have an idea? Many thanks!!!
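This looks like market-basket analysis, and Spark MLlib ships an FP-Growth implementation with a Python API (since Spark 1.4, as far as I know). A minimal sketch over the example data above:

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

# A sketch, not tested at scale: mine frequent itemsets such as
# (Jeans, Food) from the customer baskets in the example dataset.
sc = SparkContext(appName="BasketAnalysis")

pairs = sc.parallelize([
    (1, "Jeans"), (1, "T-Shirt"), (1, "Food"),
    (2, "Jeans"), (2, "Food"), (2, "Nightdress"), (2, "T-Shirt"), (2, "Hat"),
    (3, "Jeans"), (3, "Food"),
    (4, "Food"), (4, "Water"),
    (5, "Water"), (5, "Food"), (5, "Beer"),
])

# One basket (list of distinct products) per customer
baskets = pairs.groupByKey().map(lambda kv: list(set(kv[1])))

# Keep itemsets appearing in at least 40% of baskets
model = FPGrowth.train(baskets, minSupport=0.4, numPartitions=1)
for fi in model.freqItemsets().collect():
    print(fi.items, fi.freq)  # e.g. ['Jeans', 'Food'] appears in 3 baskets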
08-25-2016
06:50 PM
Hi experts,
I have multiple files (Parquet files) in an HDFS directory and I want to merge them all into one file using Apache Pig. I don't know how many files this directory will contain, so I can't declare a variable for each file. Is there a way to pick up all the files in the same directory that share the same schema?
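For comparison, a minimal sketch of the same job done in PySpark instead of Pig (an alternative, not necessarily the answer for Pig itself): Spark's Parquet reader accepts a whole directory, and coalesce(1) forces a single output file. Paths are placeholders.

from pyspark import SparkContext
from pyspark.sql import SQLContext

# A sketch, not tested: read every Parquet file in the directory in one call
# and write the merged data back out as a single file.
sc = SparkContext(appName="MergeParquet")
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("/user/cloudera/parquet_dir")  # whole directory
df.coalesce(1).write.parquet("/user/cloudera/parquet_merged")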
Thanks!
Regards!!!
08-08-2016
04:09 PM
Hi mqureshi, many thanks for your help 🙂 I will look for good articles/tutorials that show how to use complex types in Hive.
Thanks!
08-08-2016
03:37 PM
Hi,
I have four tables in .csv format. All of them can be connected through a fact table (also in .csv). I want to do some data cleansing on these files and then put them into one big table in Hive. But in Apache Pig, should I create a script for each table individually, or is it better to do the join in Pig and then apply the data cleansing to the joined table?
Thanks!
08-04-2016
10:46 AM
1 Kudo
I was missing some JAR files 🙂
08-03-2016
08:07 AM
If I use Python inside a file.py in my HDFS, I can run Python UDFs, but with Java I'm getting an error... I think I'm not including all the files.
08-03-2016
08:06 AM
Perfect Lester 🙂 It's exactly what I need!!! 🙂 Many thanks!!!
08-01-2016
10:50 AM
Hi Benjamin, I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.
07-31-2016
07:23 PM
I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to do some ETL jobs using Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to create some data-quality activities in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't connect the JAR file with Pig. Since I can't use Pig UDFs, I'm planning a new approach for the data-quality phase (sketched below):
1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS
2) Have Java/Python independently read the new files and perform the data-quality activities
3) If the data-quality tests pass, load the files into Hive
In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what is a good alternative for performing data-quality jobs in this project?
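A minimal sketch of what I mean by step 2, assuming tab-separated output files and that streaming them with `hdfs dfs -cat` is acceptable; the path and the nine-column rule are placeholders for the real checks:

import subprocess

# A sketch, not tested: stream the transformed files out of HDFS and count
# rows that violate simple quality rules before loading anything into Hive.
def rows(hdfs_glob):
    """Yield one split record per line from HDFS without local copies."""
    cat = subprocess.Popen(["hdfs", "dfs", "-cat", hdfs_glob],
                           stdout=subprocess.PIPE, universal_newlines=True)
    for line in cat.stdout:
        yield line.rstrip("\n").split("\t")

total = bad = 0
for r in rows("/user/cloudera/transformed/part-*"):
    total += 1
    # Example rules: expected column count, non-empty key field
    if len(r) != 9 or r[0] == "":
        bad += 1

print("checked %d rows, %d failed" % (total, bad))
# Only run the Hive load (step 3) when bad == 0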
Many thanks for your help!
07-24-2016
07:09 PM
Thanks, Lester 🙂 So, in your opinion, is it better to do my transformation activities using Java? During my research I read that Apache Pig is where the ETL process happens in Big Data projects. What type of jobs do you recommend doing in Pig? Many thanks! 🙂
07-23-2016
11:15 PM
Hi experts, I have the following code: SPLIT A INTO Src01 IF (Date=='2016-07-01'),
Src02 IF (Date=='2016-07-02'),
Src03 IF (Date=='2016-07-03'),
Src04 IF (Date=='2016-07-04'),
Src05 IF (Date=='2016-07-05'),
Src06 IF (Date=='2016-07-06'),
Src07 IF (Date=='2016-07-07'),
Src08 IF (Date=='2016-07-08'),
Src09 IF (Date=='2016-07-09'),
Src10 IF (Date=='2016-07-10'),
Src11 IF (Date=='2016-07-11'),
Src12 IF (Date=='2016-07-12'),
Src13 IF (Date=='2016-07-13'),
Src14 IF (Date=='2016-07-14'),
Src15 IF (Date=='2016-07-15'),
Src16 IF (Date=='2016-07-16'),
Src17 IF (Date=='2016-07-17'),
Src18 IF (Date=='2016-07-18'),
Src19 IF (Date=='2016-07-19'),
Src20 IF (Date=='2016-07-20'),
Src21 IF (Date=='2016-07-21'),
Src22 IF (Date=='2016-07-22'),
Src23 IF (Date=='2016-07-23'),
Src24 IF (Date=='2016-07-24'),
Src25 IF (Date=='2016-07-25'),
Src26 IF (Date=='2016-07-26'),
Src27 IF (Date=='2016-07-27'),
Src28 IF (Date=='2016-07-28'),
Src29 IF (Date=='2016-07-29'),
Src30 IF (Date=='2016-07-30'),
Src31 IF (Date=='2016-07-31'),
Src011 IF (Date=='2016-06-01');
STORE Src01 INTO '/path/2016-07-01' using PigStorage('\t');
STORE Src02 INTO '/path/2016-07-02' using PigStorage('\t');
STORE Src03 INTO '/path/2016-07-03' using PigStorage('\t');
STORE Src04 INTO '/path/2016-07-04' using PigStorage('\t');
STORE Src05 INTO '/path/2016-07-05' using PigStorage('\t');
STORE Src06 INTO '/path/2016-07-06' using PigStorage('\t');
STORE Src07 INTO '/path/2016-07-07' using PigStorage('\t');
STORE Src08 INTO '/path/2016-07-08' using PigStorage('\t');
STORE Src09 INTO '/path/2016-07-09' using PigStorage('\t');
STORE Src10 INTO '/path/2016-07-10' using PigStorage('\t');
STORE Src11 INTO '/path/2016-07-11' using PigStorage('\t');
STORE Src12 INTO '/path/2016-07-12' using PigStorage('\t');
STORE Src13 INTO '/path/2016-07-13' using PigStorage('\t');
STORE Src14 INTO '/path/2016-07-14' using PigStorage('\t');
STORE Src15 INTO '/path/2016-07-15' using PigStorage('\t');
STORE Src16 INTO '/path/2016-07-16' using PigStorage('\t');
STORE Src17 INTO '/path/2016-07-17' using PigStorage('\t');
STORE Src18 INTO '/path/2016-07-18' using PigStorage('\t');
STORE Src19 INTO '/path/2016-07-19' using PigStorage('\t');
STORE Src20 INTO '/path/2016-07-20' using PigStorage('\t');
STORE Src21 INTO '/path/2016-07-21' using PigStorage('\t');
STORE Src22 INTO '/path/2016-07-22' using PigStorage('\t');
STORE Src23 INTO '/path/2016-07-23' using PigStorage('\t');
STORE Src24 INTO '/path/2016-07-24' using PigStorage('\t');
STORE Src25 INTO '/path/2016-07-25' using PigStorage('\t');
STORE Src26 INTO '/path/2016-07-26' using PigStorage('\t');
STORE Src27 INTO '/path/2016-07-27' using PigStorage('\t');
STORE Src28 INTO '/path/2016-07-28' using PigStorage('\t');
STORE Src29 INTO '/path/2016-07-29' using PigStorage('\t');
STORE Src30 INTO '/path/2016-07-30' using PigStorage('\t');
STORE Src31 INTO '/path/2016-07-31' using PigStorage('\t');
STORE Src011 INTO '/path/2016-06-01' using PigStorage('\t'); Is there a way I can make this more automatic, like using a loop or some other iterative construct? Many thanks!
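What I mean by a loop, sketched with Pig's embedded-Python (Jython) API, run as `pig script.py`; the input path and the two-column schema are placeholders. (Piggybank's MultiStorage, which partitions output by a field in a single pass, might be another option.)

from org.apache.pig.scripting import Pig

# A sketch, not tested: bind one parameterized Pig script per date instead
# of hand-writing 32 SPLIT branches and STORE statements.
dates = ['2016-07-%02d' % d for d in range(1, 32)] + ['2016-06-01']

P = Pig.compile("""
    A = LOAD '$input' USING PigStorage('\\t') AS (Date:chararray, Rest:chararray);
    B = FILTER A BY Date == '$date';
    STORE B INTO '/path/$date' USING PigStorage('\\t');
""")

# bind() accepts a list of parameter maps, so all 32 jobs can be launched
# from the one compiled script
P.bind([{'input': '/path/source', 'date': d} for d in dates]).run()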
07-19-2016
02:52 PM
Hi,
Has anyone already done (or read about) any social network analysis using Spark MLlib with Python? I need to do some research with Spark to look at the relationships between the people in an organization.
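To make the question concrete, a minimal sketch of the sort of thing I'm after, assuming the relationships can be expressed as person-to-person pairs (GraphX has no Python API as far as I know, but simple measures like degree centrality fall out of plain RDD operations):

from pyspark import SparkContext

# A sketch, not tested: count each person's direct relationships
# (degree centrality) from a hypothetical edge list.
sc = SparkContext(appName="OrgNetwork")

edges = sc.parallelize([("ana", "bruno"), ("ana", "carla"), ("bruno", "carla")])

degrees = (edges.flatMap(lambda e: [(e[0], 1), (e[1], 1)])
                .reduceByKey(lambda a, b: a + b))
print(degrees.collect())  # e.g. [('ana', 2), ('bruno', 2), ('carla', 2)]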
Many thanks!
07-12-2016
06:35 AM
Hi experts, I have some tables in Hive and I want to run some clustering analysis using Spark MLlib in Python. Is it possible to do this using the cloudera-quickstart-vm-5.7.0-0-virtualbox? Is there any tutorial that shows how to work with Spark MLlib? Many thanks!
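A minimal sketch of what I want to do, assuming the quickstart VM's PySpark has Hive support enabled and a hypothetical Hive table customers with two numeric columns:

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.mllib.clustering import KMeans

# A sketch, not tested: pull a Hive table through HiveContext and cluster
# the rows with MLlib's KMeans. Table and column names are placeholders.
sc = SparkContext(appName="HiveClustering")
hive = HiveContext(sc)

df = hive.sql("SELECT freq_movies, value_movies FROM customers")
points = df.rdd.map(lambda row: [float(row[0]), float(row[1])])

model = KMeans.train(points, k=3, maxIterations=20)
print(model.clusterCenters)  # one centre per cluster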
06-29-2016
06:16 AM
Hi experts, I have this statement:
--Insert a new column based on filename
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile');
Data_Schema = FOREACH Data GENERATE (chararray)$1 AS Date, (chararray)$2 AS ID, (chararray)$3 AS Interval, (chararray)$4 AS Code, (chararray)$5 AS S_In, (chararray)$6 AS S_Out, (chararray)$7 AS C_In, (chararray)$8 AS C_Out, (chararray)$9 AS Traffic;
--Split into different directories
SPLIT Data_Schema INTO Src1 IF (Date == '2016-06-25.txt'), Src2 IF (Date == '2014-07-31.txt'), Src3 IF (Date == '2016-01-01.txt');
STORE Src1 INTO '/user/cloudera/Source_Data/2016-06-25' using PigStorage('\t');
STORE Src2 INTO '/user/cloudera/Source_Data/2014-07-31' using PigStorage('\t');
STORE Src3 INTO '/user/cloudera/Source_Data/2016-01-01' using PigStorage('\t');
And here is an example of my original source data:
10000 1388530800000 39 8.600870350350515 13.86183926855984 1.7218329193014124 3.424444103320796 25.972920214509095
When I execute it, it runs successfully; however, the files in HDFS contain no data... Note that I add a new column based on the filename, which is why I have one extra column in the FOREACH statement.
- Tags:
- Apache Pig
- HDFS
- Pig
06-28-2016
04:45 PM
Emily, only one more question. My current code is attached. It executes successfully; however, my final datasets come back empty... Do you know why? pig-statement.txt
06-27-2016
01:32 PM
1 Kudo
Many thanks, Emily. One problem, I think: my column "date" isn't identified as a date, because it appears as the filename, "2016-06-23.txt". So I think it was created as a string. Can I still do the SPLIT the same way?
06-26-2016
04:29 PM
2 Kudos
Hi experts, I used Apache Pig to add a new column to my 3 text files stored in HDFS. The three text files were:
2016-06-25.txt 2016-06-24.txt 2016-06-23.txt
However, after I execute my Pig code, I have 7 files in HDFS (because of MapReduce):
part-m-0000 part-m-0001 part-m-0002 part-m-0003 ... part-m-0006
How can I obtain only 3 files with their original names? Basically, I want to add the new column but still have the same files with the same names... My code is:
Src = LOAD '/data/Src/' using PigStorage(' ','-tagFile');
STORE Src INTO '/data/Src/Src2' USING PigStorage(' ');
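One direction I'm considering, sketched with Pig's embedded-Python (Jython) API: with '-tagFile' the filename arrives as field $0, so each input file can be filtered back out and stored under a directory carrying its original name (Pig still writes part files inside each directory). The file names are the three listed above.

from org.apache.pig.scripting import Pig

# A sketch, not tested: one filtered STORE per original input file.
P = Pig.compile("""
    Src = LOAD '/data/Src/' USING PigStorage(' ', '-tagFile');
    One = FILTER Src BY $0 == '$fname';
    STORE One INTO '/data/Src/Src2/$fname' USING PigStorage(' ');
""")

for f in ['2016-06-25.txt', '2016-06-24.txt', '2016-06-23.txt']:
    P.bind({'fname': f}).runSingle()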
06-17-2016
09:47 AM
Thanks Benjamin for your support 🙂
When you mention the Resource Manager, are you talking about the Job Browser, for seeing the logs?
06-16-2016
03:17 PM
No, only one 😞
I have put the files into 2 zipped files now. I don't know whether I will get better performance by doing this...
06-16-2016
10:01 AM
Hi experts, I'm trying to do some simple data transformations on my text files using Apache Pig. I have 80 text files in HDFS and I want to add a new column based on the filename. I tested the code on only one text file and it works fine. But when I point the code at all the files, it doesn't do the job (it stays at 0% for a long time). Here is my code:
A = LOAD '/user/data' using PigStorage(' ','-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
In your opinion, is Pig the best way to do this? Thanks!!
06-15-2016
12:46 PM
Hi Sindhu,
Yes, the directory is right. The files are located in:
user -> cloudera -> Analytics (folder created) -> source (folder created)
Do you think the script does what I want?