Member since: 06-18-2016
Posts: 52
Kudos Received: 14
Solutions: 0
05-15-2018
11:12 AM
Hi experts, is there any way to query the system tables like we do in T-SQL, for example: SELECT * FROM sys.tables
Is this possible in Impala or Hive? Thanks!
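For reference, a minimal sketch of how the same metadata can be reached from a Spark 2.x shell wired to the same Hive metastore (the database and table names below are placeholders); Hive and Impala themselves expose this through SHOW DATABASES / SHOW TABLES / DESCRIBE rather than a queryable sys.tables view:
// In spark-shell (Spark 2.x with Hive support), "spark" is the SparkSession.
spark.sql("SHOW DATABASES").show(false)
spark.sql("SHOW TABLES IN default").show(false)
spark.catalog.listTables("default").show(false)   // same information through the Catalog API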
12-19-2016
01:07 AM
2 Kudos
Hi, do you know any good tutorial/use case using Hadoop that shows a good approach to cleaning our data (especially the outlier-detection phase)? Thanks!
10-10-2016
09:34 PM
But when you put a file into HDFS, is a MapReduce job being used (even if you don't see it)?
10-10-2016
09:27 PM
Hi experts,
I have some basic questions about the relationship between MapReduce and HDFS: Is placing a data file on HDFS done through MapReduce? Do all transactions in HDFS use MapReduce jobs? Does anyone know the answer?
Many thanks!
09-29-2016
04:34 PM
@jfrazee but I can define the cut-off for the values that are higher than the average, right?
09-29-2016
09:37 AM
@jfrazee The usual approach is to reduce/remove products with few occurrences, right? Is it reasonable to think about eliminating the products that appear in only 20% of all transactions?
09-28-2016
10:59 PM
Hi jfrazee,
Many thanks for your response 🙂 I have some questions about this:
1) Is the structure of my data (each line corresponds to a set of product IDs) correct for this algorithm?
2) Does ".filter(_._2 > 2)" filter out the products that have an occurrence smaller than 2?
3) When I submit
val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
  (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5))
    .filter(_.nonEmpty).filter(_._2 > 2).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }
I'm getting the following error: <console>:31: error: value _2 is not a member of Array[String]. Do you know how to solve it?
Many thanks for your help and explanation of the association rules algorithm 🙂 And sorry for all these questions.
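For context, a minimal sketch of what I understand the fix to be (not a confirmed answer): move the count filter after reduceByKey, where each element is an (itemset, count) pair, so _._2 exists. It assumes transactions is an RDD[String] of comma-separated product IDs:
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// Count each candidate itemset first; after reduceByKey every element is a
// (List[String], Long) pair, so _._2 (the count) exists and can be filtered on.
val freqItemsets = transactions
  .map(_.split(","))
  .flatMap { xs =>
    (1 to 5)
      .flatMap(k => xs.combinations(k))   // candidate itemsets of size 1..5
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L))
  }
  .reduceByKey(_ + _)
  .filter(_._2 > 2)                        // keep itemsets that occur more than twice
  .map { case (items, cnt) => new FreqItemset(items.toArray, cnt) }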
09-28-2016
11:53 AM
Hi experts,
I have attached a dataset sample (sample.txt) to this post, and I am trying to extract some association rules using Spark MLlib:
val transactions = sc.textFile("DATA")
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
  (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5))
    .filter(_.nonEmpty)
    .map(x => (x.toList, 1L))
).reduceByKey(_ + _).map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }
val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)
results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
}
However, my code returns dozens of rules with confidence equal to 1, which makes little sense! Does anyone know if I am missing some parameterization?
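For completeness, a minimal sketch of an alternative I am considering: MLlib's FPGrowth mines the frequent itemsets with a minimum support threshold, so very rare itemsets (which trivially yield confidence 1) are dropped before rules are generated. The 0.1 and 0.8 thresholds are only illustrative, and transactions is the same RDD of comma-separated product IDs:
import org.apache.spark.mllib.fpm.FPGrowth

// Let FPGrowth mine frequent itemsets with a minimum support,
// then generate rules from those itemsets only.
val items = transactions.map(_.split(",").distinct)   // FPGrowth requires unique items per basket

val model = new FPGrowth()
  .setMinSupport(0.1)        // illustrative: itemset must appear in >= 10% of baskets
  .setNumPartitions(4)
  .run(items)

model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(rule.antecedent.mkString(",") + " => " + rule.consequent.mkString(",") + " @ " + rule.confidence)
}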
09-25-2016
02:54 PM
4 Kudos
Hi experts,
I have this line in a .txt file, which results from a GROUP operator:
1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5
Basically I have 7 fields. How can I obtain this:
1;7287026502032012,18;706;101200010;17286;oz;2.5
Many thanks!
09-09-2016
06:18 PM
3 Kudos
Having this statement:
Values = FILTER Input_Data BY Fields > 0
How can I count the number of records that were filtered out and the number that were kept?
Many thanks!
- Tags:
- Data Processing
- Pig
09-06-2016
01:03 PM
Hi,
Every time I run my Pig script it generates multiple files in HDFS (I never know how many). I need to do some analytics using Spark.
How can I join those multiple files into only one file, so that I can do something like:
val data = sc.textFile("PATH/Filejoined");
Thanks!
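A minimal sketch of one approach (the HDFS paths are placeholders): sc.textFile accepts a directory or a glob, so the part-* files Pig writes can be read directly, and coalesce(1) can produce a single output file if one is really needed:
// Read every part file that Pig wrote into a single RDD (no merge step needed).
val data = sc.textFile("hdfs:///PATH/pig_output/part-*")

// If a single physical file is really required, repartition to one partition
// before writing (fine for small outputs; everything funnels through one task).
data.coalesce(1).saveAsTextFile("hdfs:///PATH/Filejoined")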
09-06-2016
08:49 AM
Hi, I have this data in a text file:
1 4
2 5
2 2
1 5
How can I, using Spark and Scala, identify the rows that have a repeated number in the same row? And how can I delete them? In this case I want to remove the third row... Many thanks!
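A minimal sketch of one way this could be done, assuming the file is space-separated and each line is one row (the path is a placeholder):
// Keep only the lines whose values are all distinct within the line.
val lines = sc.textFile("hdfs:///PATH/data.txt")

val noDuplicatesInRow = lines.filter { line =>
  val values = line.trim.split("\\s+")
  values.distinct.length == values.length   // false for rows like "2 2"
}

noDuplicatesInRow.collect().foreach(println)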
09-05-2016
02:09 PM
I'm trying to return this:
val output = vertices.map(_.split(" ")).toArray
09-05-2016
01:57 PM
I'm trying to save my Array to HDFS. For that I have this:
array.saveAsTextFile("PATH")
but when I submit this I'm getting this error:
error: value saveAsTextFile is not a member of Array[Array[String]]
Anyone knows how to solve this?
Many thanks!
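A minimal sketch of two possible workarounds, since saveAsTextFile is an RDD method rather than an Array method (the paths are placeholders, and vertices/output follow my earlier reply):
// Option 1: keep the data as an RDD and save it directly (preferred).
val vertices = sc.textFile("hdfs:///PATH/input.txt")
val output = vertices.map(_.split(" "))               // RDD[Array[String]], not collected
output.map(_.mkString(" ")).saveAsTextFile("hdfs:///PATH/output")

// Option 2: if you already have a local Array[Array[String]], turn it back into an RDD first.
val array: Array[Array[String]] = output.collect()
sc.parallelize(array.map(_.mkString(" "))).saveAsTextFile("hdfs:///PATH/output2")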
09-03-2016
02:41 AM
Hi experts, after some Scala programming, I'm getting this output:
[40146844020121125,WrappedArray(1726)]
[40148356620121118,WrappedArray(7205)]
[40148813920120703,WrappedArray(3504, 1703)]
[40148991920121112,WrappedArray(5616)]
[40150340320130324,WrappedArray(9909)]
[40150796920120926,WrappedArray(3509)]
[40151143320130423,WrappedArray(9909)]
[40153957220120426,WrappedArray(9909)]
[40154761720120504,WrappedArray(9909)]
[40154969620130124,WrappedArray(9909, 9909)]
But I want to extract this:
40146844020121125,1726
40148356620121118,7205
40148813920120703,3504, 1703
40148991920121112,5616
40150340320130324,9909
40150796920120926,3509
40151143320130423,9909
40153957220120426,9909
40154761720120504,9909
40154969620130124,9909,9909
I'm trying to analyze the products frequently purchased together, and my Scala code is:
val data = sc.textFile("FILE")
case class Transactions(Transaction_ID: String, Dept: String, Category: String, Company: String, Brand: String, Product_Size: String, Product_Measure: String, Purchase_Quantity: String, Purchase_Amount: String)
def csvToMyClass(line: String) = {
  val split = line.split(',')
  Transactions(split(0), split(1), split(2), split(3), split(4), split(5), split(6), split(7), split(8))
}
val df = data.map(csvToMyClass).toDF("Transaction_ID", "Dept", "Category", "Company", "Brand", "Product_Size", "Product_Measure", "Purchase_Quantity", "Purchase_Amount")
df.show
val df2 = df.groupBy("Transaction_ID").agg(collect_list($"Category"))
df.groupBy("Transaction_ID").agg(collect_list($"Category")).show
How can I map the DataFrame to a normal list? Many thanks!!!
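A minimal sketch of one way I think the collected list could be flattened into plain comma-separated strings (assuming df2 is the grouped DataFrame above and the collect_list column holds strings):
// Turn each (Transaction_ID, WrappedArray(categories...)) row into "id,cat1,cat2,..."
val flat = df2.rdd.map { row =>
  val id = row.getString(0)
  val categories = row.getAs[Seq[String]](1)
  (id +: categories).mkString(",")
}

flat.collect().foreach(println)
// flat.saveAsTextFile("hdfs:///PATH/flattened")   // or save the flattened lines back to HDFS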
08-31-2016
09:40 AM
Like this:
{Stock_ID} -> {Sales_ID}:
{1}->A
{2}->A
{3}->C,D
{4}->A
08-30-2016
08:28 AM
Hi experts, imagine that I have this example stored on HDFS in a .csv file:
Stock_ID Sales_ID
1 A
2 A
3 C
3 D
4 A
I want to map the rows using combineByKey to list the elements, and after that I want to reduce them so I can get the RDD that I expect. I only have this line:
val textFile = sc.textFile("/input/transactions.csv")
How can I map and reduce it using Scala in Spark? Many thanks!!!
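A minimal sketch of one possible approach, assuming a space-separated file with a header row and exactly two fields per line; combineByKey is used since the question mentions it, though groupByKey would also work:
// Parse "Stock_ID Sales_ID" lines, skip the header, and collect sales per stock.
val textFile = sc.textFile("/input/transactions.csv")

val pairs = textFile
  .filter(line => !line.startsWith("Stock_ID"))        // drop the header row
  .map { line =>
    val Array(stockId, salesId) = line.trim.split("\\s+")   // assumes exactly two fields
    (stockId, salesId)
  }

// combineByKey: start a list per key, append within a partition, concatenate across partitions.
val grouped = pairs.combineByKey(
  (v: String) => List(v),
  (acc: List[String], v: String) => v :: acc,
  (a: List[String], b: List[String]) => a ::: b
)

grouped.collect().foreach { case (stock, sales) =>
  println(s"{$stock}->${sales.sorted.mkString(",")}")
}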
08-25-2016
12:26 PM
Hi experts, I have a .csv file stored in HDFS and I need to do 3 steps:
a) Create a Parquet file format
b) Load the data from the .csv into the Parquet file
c) Store the Parquet file in a new HDFS directory
The first step I have completed using Apache Hive:
create external table parquet_file (ID BIGINT, Date TimeStamp, Size Int)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '.../filedirectory';
How can I complete tasks b) and c)??? Many thanks!
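Not a Hive answer, but a minimal sketch of steps b) and c) done from Spark instead, in case that route is acceptable (the paths are placeholders, and the comma split and timestamp format are assumptions based on the table definition above):
import java.sql.Timestamp
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // in spark-shell, sqlContext already exists
import sqlContext.implicits._

// b) Load the .csv from HDFS and map it onto the three columns of the table.
case class Record(id: Long, date: Timestamp, size: Int)

val records = sc.textFile("hdfs:///PATH/input.csv").map { line =>
  val f = line.split(",")
  Record(f(0).toLong, Timestamp.valueOf(f(1)), f(2).toInt)   // assumes timestamps like "2016-08-25 12:00:00"
}.toDF()

// c) Write the data as Parquet into a new HDFS directory.
records.write.parquet("hdfs:///PATH/new_parquet_dir")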
08-22-2016
10:03 AM
Hi,
I need to create some graphs using PySpark to do some link analysis research. I have already seen this link:
http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx
But that algorithm is implemented in Scala, which is much more complex for me to understand.
Does anyone know of a white paper or tutorial that does link analysis research using PySpark?
Thanks!
08-12-2016
08:28 AM
1 Kudo
Hi guys,
I'm doing a Big Data analytics project and I am in the outlier-detection phase. I only have experience using SAS, where I usually use histogram charts to detect outliers. In Hadoop, which component is used to identify these records? Do you normally use Pig or Hive? Or just a tool outside Hadoop, like Python or Java?
Many thanks!
08-10-2016
02:13 PM
Hi guys,
I'm very new to Apache Pig, and I have already seen a lot of scripts using the GROUP statement without any aggregate operator such as SUM(X) applied afterwards (e.g. just A GROUP BY A). Why is it a good alternative to use the GROUP statement on its own?
Thanks!
08-09-2016
01:02 PM
Hi,
I have the following code:
from datetime import datetime

loc_offers = "offers.csv"
loc_transactions = "transactions.csv"
loc_reduced = "reduced2.csv"  # will be created

def reduce_data(loc_offers, loc_transactions, loc_reduced):
    start = datetime.now()
    # get all categories on offer in a dict
    offers = {}
    for e, line in enumerate(open(loc_offers)):
        offers[line.split(",")[1]] = 1
    # open output file
    with open(loc_reduced, "wb") as outfile:
        # go through transactions file and reduce
        reduced = 0
        for e, line in enumerate(open(loc_transactions)):
            if e == 0:
                outfile.write(line)  # print header
            else:
                # only write when category is in offers dict
                if line.split(",")[3] in offers:
                    outfile.write(line)
                    reduced += 1
            # progress
            if e % 5000000 == 0:
                print e, reduced, datetime.now() - start
        print e, reduced, datetime.now() - start

reduce_data(loc_offers, loc_transactions, loc_reduced)

I have the two .csv files in HDFS. How can I run this Python code over the two files in HDFS?
Thanks!
08-08-2016
04:30 PM
Hi experts,
This is probably a dumb question (but I still have it 🙂 ).
I want to know how Pig reads the headers from the following dataset stored as a .csv:
ID,Name,Function
1,Johnny,Student
2,Peter,Engineer
3,Cloud,Teacher
4,Angel,Consultant
I want to have the first row as the header of my file. Do I need to put:
A = LOAD 'file' USING PigStorage(',') AS (ID:int, ... etc.)?
Or do I only need to put:
A = LOAD 'file' USING PigStorage(',')
and with only this Apache Pig already knows that the first line contains the headers of my table? Thanks!
08-08-2016
02:02 PM
Hi, I'm doing a small Big Data project using Hadoop in cloudera-quickstart-vm-5.7.0-0-virtualbox. I have a file in HDFS that is 22 GB in size. When I try to run a job in Pig, like:
A = LOAD '/user/cloudera/file.csv';
DUMP A;
it stays in status Running for a very long time. Is there any configuration I need to change to process all this data? Thanks
- Tags:
- big-data
- performance
08-01-2016
08:10 AM
Hi experts, I'm doing a small project in Hadoop using cloudera-quickstart-vm-5.7.0-0 on VirtualBox. I'm trying to use Java in Apache Pig, basically Pig Java UDFs. I have already done the following in Eclipse:
1) Created the project
2) Converted it to a Maven project
3) Added the dependencies -> pig: 0.15.0 and hadoop-core: 0.20.2
4) Generated the JAR file in the directory /home/cloudera/workspace
Now I want to apply my UDF in my Pig script:
REGISTER '/home/cloudera/workspace/UDFs.jar';
emp_data = LOAD '/user/cloudera/teste.txt' USING PigStorage(' ') as (name:chararray, idade:chararray, func:chararray);
Upper_case = FOREACH emp_data GENERATE myUDFS.isNumeric(name);
DUMP Upper_case;
I also tried putting the JAR file in HDFS and adding it to the properties in the Pig editor, but when I submit the Pig script it gives me an error... Does anyone know if I'm missing any step? Many thanks!
07-29-2016
10:29 AM
Hi experts,
I'm using Apache Pig to do some data transformations, but I need Java operations to do some complex cleansing activities. I have already written the methods in Java and have already put the necessary code in Pig to register the Java code. However, I don't know which JARs I need to add in Eclipse to make the connection between Pig and Eclipse.
Is there any "dummies" tutorial for this interaction?
Thanks!
07-26-2016
03:43 PM
Sunile Manjee, many thanks! One more question: is it possible to create a variable and use it in an IF statement? Example:
A = Foreach X Generate A1,A2,A3;
--Create a variable
var = Concat(A1,A2);
Split A into B IF (var == "teste");
Is it possible to do this?
07-26-2016
01:38 PM
Hi experts, I have the following field:
ToString(ToDate((long) Time_Interval), 'yyyy-MM-dd hh:ss:mm') as Time
How can I obtain only the time (hh:ss:mm)? I have already tried:
ToString( ToDate(Time), 'HH:mm:ss.SSS')
07-20-2016
03:55 PM
src-data.txt (attached). I have the following code:
Data = LOAD '.../teste1.txt' using PigStorage('');
Fields = FOREACH Data GENERATE
(chararray)$0 AS ID,
(chararray)$1 AS Time,
(chararray)$2 AS Code,
(chararray)$3 AS B_In_Activity,
(chararray)$4 AS B_Out_Activity,
(chararray)$5 AS In_Activity,
(chararray)$6 AS Out_Activity,
(chararray)$7 AS Activity);
Transf = FOREACH Data_Fields GENERATE ID,
ToUnixTime(Time,'dd/MM/yyyyHH:mm:ss', 'GMT') AS Time,
Code,
B_Activity,
B_Activity,
In_Activity,
Out_Activity,
Activity;
SPLIT Transf INTO Src31 IF ToDate(Time) == ToDate('2014-12-31', 'yyyy-mm-dd');
STORE Src31 INTO '.../TESTE2' using PigStorage('');
I want to do the following:
1) Transform the field Time into a Unix timestamp
2) Split the dataset based on the date
When I execute my code it gives me an error... I have uploaded my source data so you can see what it looks like. Can anyone help me? Many thanks!!!!