Member since: 06-18-2016
Posts: 52
Kudos Received: 14
Solutions: 0
05-15-2018
11:12 AM
Hi experts, is there any way to make a query against sys.tables like we do in T-SQL, e.g.: SELECT * FROM sys.tables
Is this possible in Impala or Hive? Thanks!
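For reference, the closest equivalents I have found so far are statements rather than a queryable catalog table (the table name below is made up):

-- Hive/Impala expose metadata through SHOW/DESCRIBE statements:
SHOW DATABASES;
SHOW TABLES;                   -- tables in the current database
DESCRIBE FORMATTED my_table;   -- detailed metadata for one table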
Labels: Apache Hive, Apache Impala
10-10-2016
09:34 PM
But when you put a file into HDFS, is a MapReduce job being used (even if you don't see it)?
10-10-2016
09:27 PM
Hi experts,
I have some basic questions about the relationship between MapReduce and HDFS: Is placing a data file on HDFS done through MapReduce? Are all HDFS transactions carried out by MapReduce jobs? Does anyone know the answer?
Many thanks!
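For concreteness, this is the kind of operation I mean — a plain client-side copy into HDFS (a shell sketch; the paths are made up):

# Copies a local file into HDFS via the HDFS client, NameNode and DataNodes
hdfs dfs -put localfile.txt /user/data/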
Labels: Apache Hadoop, Cloudera DataFlow (CDF)
09-29-2016
04:34 PM
@jfrazee But I can define the cut-off as the values that are higher than the average, right?
09-29-2016
09:37 AM
@jfrazee The usual approach is to reduce/remove products with few occurrences, right? Is it reasonable to think about eliminating the products that appear in only 20% of all transactions?
09-28-2016
10:59 PM
Hi jfrazee,
Many thanks for your response 🙂 I have some questions about this:
1) Is the structure of my data (each line corresponds to a set of product_ids) correct for this algorithm?
2) Does ".filter(_._2 > 2)" filter out the products that have an occurrence smaller than 2?
3) When I submit

val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
  (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5))
    .filter(_.nonEmpty).filter(_._2 > 2).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

I get the following error: <console>:31: error: value _2 is not a member of Array[String]. Do you know how to solve it?
Many thanks for your help and explanation of the association rules algorithm 🙂 And sorry for all the questions.
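In case it helps: my current guess is that the count filter has to come after reduceByKey, once the elements actually carry counts — at the combinations stage each element is still an Array[String], which would explain the error. A sketch:

// Move the count filter after aggregation, where each element is (items, count)
val freqItemsets = transactions
  .map(_.split(","))
  .flatMap(xs => (1 to 5).flatMap(n => xs.combinations(n)).map(x => (x.toList, 1L)))
  .reduceByKey(_ + _)
  .filter(_._2 > 2)   // now _2 is the Long count, so the filter type-checks
  .map { case (items, cnt) => new FreqItemset(items.toArray, cnt) }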
09-28-2016
11:53 AM
Hi experts,
I have attached a dataset sample.txt to this post, and I am trying to extract some association rules using Spark MLlib:

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val transactions = sc.textFile("DATA")
val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
  (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5))
    .filter(_.nonEmpty).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }
val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)
results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
}

However, my code returns dozens of rules with confidence equal to 1, which makes little sense! Does anyone know if I am missing some parameterization?
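One alternative I am considering (a sketch, not yet tested on the attached data — the minSupport value is a guess): let FPGrowth do the counting with an explicit minimum support instead of building combinations by hand, since without a support threshold every rare itemset that always co-occurs yields confidence 1.

import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth().setMinSupport(0.2)   // keep itemsets seen in >= 20% of transactions
val model = fpg.run(transactions.map(_.split(",")))
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",") + "=>" + rule.consequent.mkString(",") + "]," + rule.confidence)
}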
Labels: Apache Spark
09-25-2016
02:54 PM
4 Kudos
Hi experts,
I have this line in a .txt file, which results from a GROUP operator:
1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5
Basically I have 7 fields; how can I obtain this:
1;7287026502032012,18;706;101200010;17286;oz;2.5
Many thanks!
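A sketch of the direction I'm exploring — FLATTEN on the wrapped fields (relation and field aliases below are made up, since I don't know the original ones; note that FLATTEN on the tuple field will split it into two separate columns):

-- Each {(value)} bag collapses to its single inner value via FLATTEN
B = FOREACH A GENERATE f1, FLATTEN(f2), FLATTEN(f3), FLATTEN(f4), FLATTEN(f5), FLATTEN(f6), f7;
STORE B INTO 'output' USING PigStorage(';');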
Labels: Apache Pig
09-09-2016
06:18 PM
3 Kudos
Given this statement:
Values = FILTER Input_Data BY Fields > 0;
How can I count the number of records that were filtered out and the number that passed?
Many thanks!
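A sketch of the approach I have in mind — SPLIT into two relations and COUNT each (GROUP ALL gives a single group to count over; how nulls should be routed is an open question):

SPLIT Input_Data INTO Kept IF Fields > 0, Dropped IF Fields <= 0;
Kept_Cnt    = FOREACH (GROUP Kept ALL)    GENERATE COUNT(Kept);
Dropped_Cnt = FOREACH (GROUP Dropped ALL) GENERATE COUNT(Dropped);
DUMP Kept_Cnt;
DUMP Dropped_Cnt;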
Tags: Data Processing, Pig
Labels: Apache Pig
09-06-2016
01:03 PM
Hi,
Every time I run my Pig script, it generates multiple part files in HDFS (I never know how many). I need to do some analytics using Spark.
How can I join those multiple files into a single one, so I can load it like:
val data = sc.textFile("PATH/Filejoined");
Thanks!
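Two workarounds I have come across (sketches — the paths are made up):

// Option 1 (Scala): Spark reads every part-file in a directory directly,
// so a physical merge may not be needed at all:
val data = sc.textFile("PATH/pig_output_dir")

// Option 2 (shell): physically merge the part-files, then put the result back:
//   hdfs dfs -getmerge PATH/pig_output_dir merged.txt
//   hdfs dfs -put merged.txt PATH/Filejoined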
Labels: Apache Spark
09-06-2016
08:49 AM
Hi, I have this data in a text file:
1 4
2 5
2 2
1 5
How can I, using Spark and Scala, identify the rows that have a number repeated within the same row? And how can I delete them? In this case I want to remove the third row... Many thanks!
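A sketch of what I have in mind (assuming whitespace-separated values, as in the sample above; the paths are made up):

// Keep only the rows whose values are all distinct
val rows = sc.textFile("PATH/data.txt")
val deduped = rows.filter { line =>
  val nums = line.trim.split("\\s+")
  nums.distinct.length == nums.length
}
deduped.saveAsTextFile("PATH/cleaned")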
Labels: Apache Spark
09-05-2016
02:09 PM
I'm trying to return this:
val output = vertices.map(_.split(" ")).toArray
09-05-2016
01:57 PM
I'm trying to save my Array to HDFS. For that I have this:
array.saveAsTextFile("PATH")
but when I submit it I get this error:
error: value saveAsTextFile is not a member of Array[Array[String]]
Does anyone know how to solve this?
Many thanks!
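My current understanding (a sketch): saveAsTextFile is defined on RDDs, not on plain Scala arrays, so the data has to stay an RDD or be re-parallelized first:

// Re-parallelize the Array[Array[String]] into an RDD, one line per inner array
val rdd = sc.parallelize(array.map(_.mkString(" ")))
rdd.saveAsTextFile("PATH")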
Labels: Apache Hadoop, Apache Spark
08-25-2016
12:26 PM
Hi experts, I have a .csv file stored in HDFS and I need to do 3 steps:
a) Create a Parquet file format
b) Load the data from the .csv into the Parquet file
c) Store the Parquet file in a new HDFS directory
I completed the first step using Apache Hive:
create external table parquet_file (ID BIGINT, Date TimeStamp, Size Int)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '.../filedirectory';
How can I complete steps b) and c)? Many thanks!
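A sketch of how I imagine b) and c) could work (the staging table name and paths below are made up, and the column list is assumed to match the CSV):

-- b) Put a staging table over the raw CSV, then let Hive write the Parquet files:
CREATE EXTERNAL TABLE csv_staging (ID BIGINT, `Date` TIMESTAMP, Size INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/data/csv_dir';

INSERT OVERWRITE TABLE parquet_file SELECT * FROM csv_staging;

-- c) The Parquet files land under the table's LOCATION; moving them to a new
--    HDFS directory could then be done with: hdfs dfs -mv .../filedirectory/* /new/dir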
Labels: Apache Hive
08-22-2016
10:03 AM
Hi,
I need to create some graphs using PySpark for link analysis research. I have already seen this link:
http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx
But that algorithm is implemented in Scala, which is much more complex for me to understand.
Does anyone know of a white paper or tutorial that does link analysis research using PySpark?
Thanks!
08-10-2016
02:13 PM
Hi guys,
I'm very new to Apache Pig, and I have already seen a lot of scripts using the GROUP statement without any aggregate operator (like SUM(X)) applied afterwards — just something like B = GROUP A BY A1. Why is the GROUP statement a good alternative on its own?
Thanks!
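For context, the pattern I keep seeing looks like this (aliases here are made up) — my understanding is that the bare GROUP yields one record per key with a bag of all matching tuples, and any aggregates come later, if at all:

B = GROUP A BY user_id;
-- each record of B is (group, {bag of matching A tuples})
C = FOREACH B GENERATE group, COUNT(A);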
Labels: Apache Pig
08-08-2016
04:30 PM
Hi experts,
This is probably a dumb question (but I have it anyway 🙂).
I want to know how Pig reads the headers from the following dataset stored as .csv:
ID,Name,Function
1,Johnny,Student
2,Peter,Engineer
3,Cloud,Teacher
4,Angel,Consultant
I want to treat the first row as the header of my file. Do I need to put:
A = LOAD 'file' USING PigStorage(',') AS (ID:int, ...etc);
Or do I only need to put:
A = LOAD 'file' USING PigStorage(',');
and with just this Pig will already know that the first line holds the headers of my table? Thanks!
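From what I have read, plain PigStorage does not treat the first row specially — the header comes in as an ordinary data row. A sketch of the workaround I found, using piggybank's CSVExcelStorage to skip it:

REGISTER piggybank.jar;
A = LOAD 'file' USING org.apache.pig.piggybank.storage.CSVExcelStorage(
        ',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (ID:int, Name:chararray, Function:chararray);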
Labels: Apache Pig
07-29-2016
10:29 AM
Hi experts,
I'm using Apache Pig to do some data transformation, but I need Java operations for some complex cleansing activities. I have already written the methods in Java and put the necessary code in Pig to register the Java code. However, I don't know which JARs I need to add to my Eclipse build path to make the connection between Pig and Eclipse work.
Is there any beginner ("for dummies") tutorial for this interaction?
Thanks!
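For reference, this is the kind of minimal UDF I mean — my understanding is that compiling it only requires the Pig JAR (e.g. pig-<version>.jar) on the Eclipse build path; the class and its logic below are just a made-up example:

// Java — a trivial EvalFunc UDF
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Clean extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        // example cleansing: trim and lower-case the first field
        return ((String) input.get(0)).trim().toLowerCase();
    }
}

and then in Pig: REGISTER myudfs.jar; B = FOREACH A GENERATE Clean(f1);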
Labels: Apache Pig
07-26-2016
03:43 PM
@Sunile Manjee, many thanks! One more question: is it possible to create a variable and use it in the IF condition? Example:
A = FOREACH X GENERATE A1, A2, A3;
-- create a variable
var = CONCAT(A1, A2);
SPLIT A INTO B IF (var == 'teste');
Is it possible to do this?
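A sketch of how I suspect it has to be written instead — Pig has no standalone variables, so the derived value would become a generated field (aliases as above):

A = FOREACH X GENERATE A1, A2, A3, CONCAT(A1, A2) AS var;
SPLIT A INTO B IF var == 'teste', C OTHERWISE;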
07-26-2016
01:38 PM
Hi experts, I have the following field:
ToString(ToDate((long) Time_Interval), 'yyyy-MM-dd hh:ss:mm') AS Time
How can I obtain only the time portion (hh:ss:mm)? I have already tried:
ToString(ToDate(Time), 'HH:mm:ss.SSS')
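A sketch of what I plan to try next — formatting straight from the original long value to a time-only pattern (the relation alias is made up; note HH:mm:ss, since in my original pattern the minutes and seconds look swapped):

Times = FOREACH Input GENERATE ToString(ToDate((long) Time_Interval), 'HH:mm:ss') AS Time_Only;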
Labels: Apache Pig
07-19-2016
07:47 AM
1 Kudo
Hi, recently I installed the cloudera-quickstart-vm-5.7.0-0-virtualbox. I need to do a project using Pig and Hive, and I have run into errors from the beginning... For example:
- I need to restart the Hue service in Cloudera Manager every time I begin a session in the VM;
- My Pig script (which is very simple) never gets past 0% progress.
I have attached the errors I see in Cloudera Manager and my VirtualBox settings. I don't know if the bad Pig performance is related to this. Has anyone had this problem? It's urgent! Many thanks!!!
07-08-2016
10:14 AM
I'll need to install a notebook to use Spark and Python (is there any tutorial on how to do that?). After that, I think I will use your idea 🙂
06-23-2016
09:38 AM
Hi, when I try to create a new directory it gives me the following error:
Cannot perform operation. Note: you are a Hue admin but not a HDFS superuser, "hdfs" or part of HDFS supergroup, "supergroup". SafeModeException: Cannot create directory /user/cloudera/Source_Data. Name node is in safe mode. The reported blocks 907 needs additional 2 blocks to reach the threshold 0.9990 of total blocks 909. The number of live datanodes 1 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached. (error 403)
What can I do to solve this?
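A workaround I have seen suggested (a sketch — safe mode normally clears on its own once enough blocks are reported, so forcing it off is a judgment call):

# Run as the hdfs superuser to take the NameNode out of safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave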
06-21-2016
02:20 AM
Hi, my Cloudera virtual machine in VirtualBox has already crashed twice. As I am at the beginning of the project, it wasn't a very "tragic" problem. However, in the future I will have some big solutions with Hadoop (HDFS, Pig, Hive and Spark), so my question is: how do I back up this solution, and where can I save the backups so as not to lose my work? Many thanks!
PS: The log when the VM crashes is this: "The application had a problem and crashed. Unfortunately, the crash reporter is unable to submit a report for this crash. Detail: The application did not identify itself."
06-18-2016
04:47 AM
I have 45 text files with 5 columns, and I'm using Pig to add a new column to each file based on its filename.
First question: I uploaded all the files into HDFS manually. Do you think it would be a better option to upload a compressed file?
Second question: I put my code below. In your opinion, is this the best way to add a new column to my files? I submitted this code and it has been processing for hours... All of my files are in the Data directory...
Data = LOAD '/user/data' USING PigStorage(' ', '-tagFile');
STORE Data INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
Thanks!!!
06-18-2016
04:29 AM
Hi experts, is there any complete tutorial for Hadoop in the Cloudera environment that demonstrates how to use HDFS, Pig, Hive and Spark? I have seen a lot of guides, but they do not correspond to practical cases, and I have had some difficulties developing a solution... I am very new to the Hadoop ecosystem. I need to deliver a prototype of a Hadoop solution at the end of July, and I'm getting frightened by the constant difficulties and doubts I have felt. I only want to use those components to do some data cleansing and transformation. I have already downloaded this virtual machine to use Spark: http://www.cloudera.com/downloads/quickstart_vms/5-7.html Can anyone help me? Many thanks 🙂
06-12-2016
08:53 PM
Hi experts, I have 100 text files in HDFS and I want to aggregate all of them into one big table in Hive (having the Date as key). How can I load these multiple files into one table created in Hive?
Thanks!
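A sketch of the approach I'm leaning towards (the column names and path below are made up, and it assumes all 100 files share the same delimited layout): point an external table at the directory, and Hive reads every file in it as one table.

CREATE EXTERNAL TABLE all_data (`Date` STRING, col2 STRING, col3 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/data/text_files_dir';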
Labels: Apache Hive
06-09-2016
05:46 PM
Hi Benjamin,
Yes, I'm talking about data-mining clustering. So, in your opinion, even if I know the schema, Spark is an excellent choice to achieve that?
06-09-2016
05:24 PM
Does it make sense to use Spark to divide structured data (I know the schema of my data) into clusters?
My question is because I don't know if I will gain any advantage from using Python instead of SQL (Hive) to divide the data into clusters.
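For concreteness, this is the kind of thing I understand Spark adds over plain Hive SQL — a minimal MLlib k-means sketch (Scala; the path, delimiter and all-numeric columns are assumptions):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each row into a dense feature vector
val points = sc.textFile("PATH/table_export")
  .map(line => Vectors.dense(line.split(";").map(_.toDouble)))
val model = KMeans.train(points, 5, 20)   // k = 5 clusters, 20 iterations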
Labels: Apache Hive, Apache Spark