Member since: 10-06-2016
Posts: 40
Kudos Received: 1
Solutions: 0
06-01-2017
08:11 AM
I am computing correlations on a CSV file using Apache Spark. When loading the data I have to skip the first row, since it is the header containing the column names; otherwise the data won't load. The correlation is computed, but I can't find a way to attach the column names as a header to the resulting matrix. Would you please help me get the matrix with its header? Thanks. This is what I did:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD

val data = sc.textFile(strfilePath).mapPartitionsWithIndex { case (index, iterator) =>
  if (index == 0) iterator.drop(1) else iterator
}
val inputMatrix = data.map { line =>
  val values = line.split(",").map(_.toDouble)
  Vectors.dense(values)
}
val correlationMatrix = Statistics.corr(inputMatrix, "pearson")
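One possible approach (a minimal sketch, assuming the same strfilePath and comma-separated numeric columns as above): capture the header row before dropping it, then use it to label the rows and columns when printing the correlation matrix.

val header = sc.textFile(strfilePath).first().split(",")
val n = header.length
// Print the matrix with the column names across the top and down the side.
println("\t" + header.mkString("\t"))
for (i <- 0 until n) {
  val row = (0 until n).map(j => f"${correlationMatrix(i, j)}%1.4f")
  println(header(i) + "\t" + row.mkString("\t"))
}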
Labels:
- Apache Spark
03-06-2017
10:37 AM
Hi guys, I am using a dataset whose column names contain ".". Here is the code I am using, and after it the issue I got. I think the "." character is the problem; what should I do? Thanks.
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("D:\ProcessDataSet\anis_data\Set _1Mud Pumps_Merged.csv")
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)
stddevs.show(5,false)
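One workaround that may help (a sketch, assuming Spark 2.x): the analyzer treats "." in a column name as struct-field access, so wrapping each name in backticks when building the aggregate expressions avoids that parsing.

import org.apache.spark.sql.functions.{col, stddev}
// Backticks keep dotted names like "a.b.c" from being read as nested fields.
val aggs = df.columns.map(c => stddev(col(s"`$c`")).as(c))
val stddevs = df.select(aggs: _*)
stddevs.show(5, false)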
Labels:
- Apache Spark
03-03-2017
09:21 AM
I have 4 CSV files that I want to join and merge into a single file based on a timestamp column, using Spark or Hadoop. Any help would be appreciated.
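A minimal sketch of one Spark approach, assuming each file has a common "timestamp" column and the file names below are placeholders:

val files = Seq("file1.csv", "file2.csv", "file3.csv", "file4.csv")
val dfs = files.map(f => spark.read.option("header", "true").option("inferSchema", "true").csv(f))
// A full outer join keeps every timestamp that appears in any of the files.
val merged = dfs.reduce((a, b) => a.join(b, Seq("timestamp"), "outer"))
merged.coalesce(1).write.option("header", "true").csv("merged_output")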
Labels:
- Apache Hadoop
- Apache Spark
03-02-2017
09:14 AM
Hi guys, I am trying to save a DataFrame that contains a timestamp column to a CSV file. The problem is that this column's format changes once written to the CSV file: when showing via df.show I get the correct format, but when I check the CSV file I get a different one. I also tried something like this and still got the same problem:
finalresult.coalesce(1).write.option("header", true).option("inferSchema", "true").option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")

The full code:

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
// val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:\dataSet.csv\datasetTest.csv")
// convert all columns to numeric values in order to apply aggregation functions
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
// add a new column holding the bucketed timestamp
val result2 = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time")
finalresult.coalesce(1).write.option("header", true).option("inferSchema", "true").csv("C:/mydata.csv")
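One detail worth checking (a hedged suggestion, not verified against this data): for timestamp columns the CSV writer honors the "timestampFormat" option rather than "dateFormat", which applies to DateType columns, and "inferSchema" is a read-side option with no effect on write. So the write might need to look like this:

finalresult.coalesce(1)
  .write
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("C:/mydata.csv")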
Labels:
- Apache Spark
03-01-2017
09:17 AM
I have written this code. My first question: how can I cast the data type of all columns in the dataset at once, except the timestamp column? My second question: how can I apply the avg function to all columns, again except the timestamp column? Thanks a lot.

val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest.csv")
val result = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp"))
result("value").cast("float") // here the first question
val finalresult = result.groupBy("new_time").agg(avg("value")).sort("new_time") // here the second question about avg
finalresult.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("C:/mydata.csv")
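A minimal sketch covering both questions, assuming "time" and "new_time" are the only columns to leave untouched:

import org.apache.spark.sql.functions.{avg, col}
val nonTime = result.columns.filterNot(Set("time", "new_time"))
// First question: cast every non-timestamp column in one pass with foldLeft.
val casted = nonTime.foldLeft(result) { (df, c) => df.withColumn(c, col(c).cast("float")) }
// Second question: average every non-timestamp column in a single agg call.
val avgs = nonTime.map(c => avg(col(c)).as(c))
val finalresult = casted.groupBy("new_time").agg(avgs.head, avgs.tail: _*).sort("new_time")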
Labels:
- Apache Spark
02-28-2017
09:02 AM
Hi guys, I am working with Scala and Spark, and I would like to know the best API or approach to aggregate data over timestamp intervals. My data includes a timestamp column sampled every second, and I would like to get a new DataFrame aggregated over a coarser interval. For example, given that my timestamp column is sampled every second, what would my data look like if it were sampled every minute?

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).csv("C:/Users/mhattabi/Desktop/Clean _data/Mud_Pumps _Cleaned/Set_1_Mud_Pumps_Merged.csv")
df("DateTime").cast("timestamp")

Thanks, any help would be appreciated.
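A minimal sketch using the built-in window() function (Spark 2.x), assuming "DateTime" is the timestamp column and "value" stands in for one of the numeric columns:

import org.apache.spark.sql.functions.{avg, col, window}
val byMinute = df
  .withColumn("ts", col("DateTime").cast("timestamp"))
  .groupBy(window(col("ts"), "1 minute"))
  .agg(avg(col("value")).as("avg_value"))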
Labels:
- Apache Spark
02-28-2017
07:30 AM
Hi, yes, there is a "." in the column name. Can this cause a problem in such an operation? Thanks.
02-27-2017
03:37 PM
Hi guys, I am trying this source code:

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).csv("C:/Users/mhattabi/Desktop/Clean _data/Mud_Pumps _Cleaned/Set_1_Mud_Pumps_Merged.csv")
df("DateTime").cast("timestamp")
df("ADCH_Mud Pumps.db.MudPump.2.On.value").cast("integer")
val result = df.select(
  col("*"),
  date_format(df("DateTime"), "yyyy-MM-dd hh:mm").alias("DateTime")).groupBy(df("DateTime"))
  .agg(avg(df("ADCH_Mud Pumps.db.MudPump.2.On.value")))
result.show(5)

It complains about that attribute, yet it really does exist in my dataset. What should I do? Thanks.
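Two things that may be going wrong here (a hedged sketch, assuming Spark 2.x; "on_value" and "minute" are hypothetical helper names): calls like df("DateTime").cast("timestamp") return a new Column without modifying df, and dotted column names need backticks so the analyzer doesn't parse them as struct access.

import org.apache.spark.sql.functions.{avg, col, date_format}
val typed = df
  .withColumn("DateTime", col("DateTime").cast("timestamp"))
  .withColumn("on_value", col("`ADCH_Mud Pumps.db.MudPump.2.On.value`").cast("integer"))
val result = typed
  .withColumn("minute", date_format(col("DateTime"), "yyyy-MM-dd HH:mm"))
  .groupBy(col("minute"))
  .agg(avg(col("on_value")))
result.show(5)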
Labels:
- Apache Spark
02-24-2017
09:49 AM
Hi guys, I have code written in PySpark; any help running it under Scala would be appreciated. It is urgent, please. Thanks.

from functools import reduce

files = ["/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv"]
df = reduce(lambda x, y: x.unionAll(y),
            [sqlContext.read.format('com.databricks.spark.csv')
                       .load(f, header="true", inferSchema="true")
             for f in files])
df.show()
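A hedged Scala equivalent of the snippet above, keeping the same file paths and the spark-csv data source from the question:

val files = Seq("/tmp/test_1.csv", "/tmp/test_2.csv", "/tmp/test_3.csv")
val df = files
  .map(f => sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true").option("inferSchema", "true").load(f))
  .reduce(_ unionAll _)
df.show()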
Labels:
- Apache Spark
02-24-2017
09:35 AM
@Bernhard Walter Thanks a lot, but could you do it in the Scala language please? It is very kind of you, thanks.
02-24-2017
09:09 AM
@Adnan Alvee Thanks, but the problem is that I do not know the schema of the CSV, so I can't initialize x_df. Any help please, thank you.
02-24-2017
08:51 AM
Hi friends, I have CSV files in the local file system, and they all have the same header. I want to loop over them and merge them into one CSV file with that header, using Spark. Is there a solution using spark-csv or anything else? Thanks.
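A minimal sketch, assuming the files sit under one local directory (the paths are placeholders): the CSV reader accepts a glob, and with header=true each file's header row is handled, so the result is a single DataFrame that can be written back out as one CSV.

val merged = spark.read
  .option("header", "true")
  .csv("file:///path/to/csvs/*.csv")
merged.coalesce(1).write.option("header", "true").csv("file:///path/to/output")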
Labels:
- Apache Spark
02-24-2017
08:40 AM
The data are in the local file system, and they all have the same header. I want to loop over them and merge them into one CSV file with that header. Is there a solution using spark-csv or anything else? Thanks.
02-23-2017
01:49 PM
Hi guys, I am new to Spark and Scala. I have CSV files that I want to merge into one CSV file or DataFrame; I just want to handle them as if they were a single file. Any help, thanks.
Labels:
- Apache Spark
02-17-2017
09:51 AM
@Hellmar Becker You mean creating a cluster and ensuring the data is duplicated so that it is reachable by the remote Hadoop cluster, and then using the normal load operation?
02-16-2017
01:31 PM
@Sagar Shimpi I am using the WebHDFS API through .NET and C# source code. When I load a small amount of data, the remote load operation works well, but when I tried a CSV file of about 850 MB it doesn't work. I have tried several data volumes and the problem persists. I am sure of my code because it works with a small CSV file. Is there another Hadoop API for accessing a remote Hadoop cluster? Thanks.
02-16-2017
01:05 PM
Hi, I was able to load data remotely through the WebHDFS REST API, but it doesn't allow loading a big volume of data remotely. Is there any possibility to load huge data remotely? Is there a Hadoop API for that? Thank you.
Labels:
- Apache Hadoop
02-15-2017
02:59 PM
@Michael Young Thanks for your reply. I am developing a .NET app in order to access a remote HDFS cluster. Here is the code I used: when it runs locally I get a correct answer, but when I put in the IP of the remote cluster I get an exception.

List<string> lstDirectoriesName = new List<string>();
try
{
WebHDFSClient hdfsClient = new WebHDFSClient(new Uri("http://io-dell-svr8:50070"), "Administrator");
Microsoft.Hadoop.WebHDFS.DirectoryListing directroyStatus = hdfsClient.GetDirectoryStatus("/dataset").Result;
List<DirectoryEntry> lst = directroyStatus.Files.ToList();
foreach (DirectoryEntry var in lst)
{
lstDirectoriesName.Add(var.PathSuffix);
}
return lstDirectoriesName;
}
catch (Exception exException)
{
Console.WriteLine(exException.Message);
return null;
}

Any help would be appreciated.
02-15-2017
01:06 PM
I am using HDP and would like to use the Hadoop WebHDFS API through .NET, in order to access Hadoop on a remote machine over HTTP. I would like to know the URI for reaching HDP remotely; it should be something like http://host:port/webhdfs. What port should I use? Thank you.
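For reference, a hedged example of the URI shape, assuming HDP 2.x defaults where the NameNode HTTP port is 50070:

http://<namenode-host>:50070/webhdfs/v1/<hdfs-path>?op=LISTSTATUS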
Labels:
- Hortonworks Data Platform (HDP)
02-14-2017
09:14 AM
Hi, same problem here; I want to make these interfaces hidden and running in the background. Thanks.
02-14-2017
07:31 AM
Hello guys, I am running a Hadoop cluster using HDP. When I run Hadoop, two CLI windows open, one for the datanode and one for the namenode. Is there any possibility to run them in the background? Thank you a lot!
Labels:
- Apache Hadoop
02-10-2017
02:49 PM
Hi @Sagar Shimpi, I would like to know about the HDP price, just in case we want to use your product (Hortonworks Data Platform) to develop our own software. Thank you.
Labels:
- Hortonworks Data Platform (HDP)
02-10-2017
02:14 PM
Yes, I'm good. I have already tried to log in via the browser and access the web UI, but I couldn't. Any help will be appreciated.
02-10-2017
01:10 PM
Hello, sorry, but I want to know: what is the difference between the Sandbox and HDP, as two different tools? Thanks.
Labels:
- Hortonworks Data Platform (HDP)
02-10-2017
12:52 PM
@Sagar Shimpi I have already tried that, and this is what I got. I have also tried to log in and browse the URL, and got nothing. Would you please help? Thanks.
02-10-2017
10:46 AM
Hi guys, I have downloaded the Sandbox as a VMware machine and started it, and I got this screen. I opened a web browser on my Windows host machine and entered the address, but nothing happened. What should I do? Thanks.
02-10-2017
09:47 AM
Thank you very much. Then what should I do to access the web UI in order to use the Sandbox tools? There is a [root@sandbox ~]# prompt; what now? Thanks.
02-10-2017
09:11 AM
Hello, I downloaded the Hortonworks Sandbox but couldn't log into the platform. What should I enter as the login and password before using the Sandbox web UI? Thanks.