Member since
10-06-2016
40
Posts
1
Kudos Received
0
Solutions
06-01-2017
08:11 AM
I am computing a correlation on a CSV file using Apache Spark. When loading the data I have to skip the first row, which is the header holding the column names; otherwise the data cannot be loaded. The correlation is computed fine, but the resulting correlation matrix has lost the column names, and I cannot find a way to add them back as a header on the new matrix. Could you please help me get the matrix with its header? Thanks. This is what I did:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD

val data = sc.textFile(strfilePath).mapPartitionsWithIndex { case (index, iterator) =>
  if (index == 0) iterator.drop(1) else iterator
}
val inputMatrix = data.map { line =>
  val values = line.split(",").map(_.toDouble)
  Vectors.dense(values)
}
val correlationMatrix = Statistics.corr(inputMatrix, "pearson")
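One way to get the header back, as a sketch building on the snippet above (it reuses `sc` and `strfilePath` from the question): the MLlib `Matrix` type carries no column names, so the names are read from the file's first line and applied when printing the matrix.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val lines = sc.textFile(strfilePath)
val header: Array[String] = lines.first().split(",")  // column names from row 1

val data = lines.mapPartitionsWithIndex { case (index, iterator) =>
  if (index == 0) iterator.drop(1) else iterator
}
val inputMatrix = data.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val m = Statistics.corr(inputMatrix, "pearson")

// Print the correlation matrix with the header as both row and column labels.
println(header.mkString("\t", "\t", ""))
for (i <- 0 until m.numRows) {
  val row = (0 until m.numCols).map(j => f"${m(i, j)}%.4f").mkString("\t")
  println(s"${header(i)}\t$row")
}
```

Since `Statistics.corr` keeps the input column order, `header(i)` labels row and column i directly.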
Labels:
- Apache Spark
03-03-2017
09:21 AM
I have four CSV files, and I want to join and merge them into a single file based on a timestamp column, using Spark or Hadoop. Any help would be appreciated.
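A minimal Spark sketch of one way to do this, assuming four files with hypothetical names (file1.csv through file4.csv) and a shared column named `timestamp` (both the paths and the column name are placeholders, not taken from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("join-csv").getOrCreate()

// Read each file into its own DataFrame.
val dfs = Seq("file1.csv", "file2.csv", "file3.csv", "file4.csv").map { path =>
  spark.read.option("header", "true").option("inferSchema", "true").csv(path)
}

// Fold the four frames into one by joining on the timestamp column; an
// outer join keeps rows whose timestamp appears in only some of the files.
val joined = dfs.reduce((a, b) => a.join(b, Seq("timestamp"), "outer"))

joined.coalesce(1).write.option("header", "true").csv("joined_output")
```

If the files share the full schema and "merge" means stacking rows rather than matching them, `dfs.reduce(_ union _)` would replace the join.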
Labels:
- Apache Hadoop
- Apache Spark
03-02-2017
09:14 AM
Hi guys, I am trying to save a DataFrame that contains a timestamp column to a CSV file. The problem is that this column changes format once written to the CSV file: when showing via df.show I get the correct format, but when I check the CSV file I get a different one. I also tried something like this and still got the same problem:

finalresult.coalesce(1).write.option("header", true).option("inferSchema", "true").option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:\dataSet.csv\datasetTest.csv")

// convert all columns to numeric values in order to apply aggregation functions
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }

// add a new column holding the timestamp rounded to 5-minute buckets
val result2 = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") // agg(avg(all columns...))
finalresult.coalesce(1).write.option("header", true).option("inferSchema", "true").csv("C:/mydata.csv")
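For what it's worth, in Spark 2.x the CSV writer formats timestamp columns via the "timestampFormat" option ("dateFormat" only applies to date columns, and "inferSchema" is a read-side option with no effect on write), so a sketch of the corrected write call, reusing the `finalresult` frame from the snippet above, would be:

```scala
// Format the timestamp column on write with "timestampFormat".
finalresult.coalesce(1)
  .write
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("C:/mydata")
```

Note that .csv(...) writes a directory of part files, so the path names a folder rather than a single file.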
Labels:
- Apache Spark
03-01-2017
09:17 AM
I have done this code. My first question is about the cast function: how can I cast the data type of all columns in the dataset at the same time, except the timestamp column? The other question is how to apply the avg function to all columns, again except the timestamp column. Thanks a lot.

val df = spark.read.option("header", true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest.csv")
val result = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp"))
result("value").cast("float") // here the first question
val finalresult = result.groupBy("new_time").agg(avg("value")).sort("new_time") // here the second question about avg
finalresult.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("C:/mydata.csv")
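A minimal sketch of both steps, assuming the `result` frame from the snippet above with its timestamp column named `new_time` (the remaining column names come from whatever the CSV header provides):

```scala
import org.apache.spark.sql.functions._

// Every column except the timestamp one.
val nonTimeCols = result.columns.filterNot(_ == "new_time")

// Question 1: cast all non-timestamp columns in one pass. withColumn
// returns a new frame, so the casts must be folded together rather
// than called in a bare map that discards each result.
val casted = nonTimeCols.foldLeft(result) { (df, c) =>
  df.withColumn(c, col(c).cast("float"))
}

// Question 2: average every non-timestamp column at once via an
// aggregation map of column name -> function name.
val finalresult = casted
  .groupBy("new_time")
  .agg(nonTimeCols.map(c => c -> "avg").toMap)
  .sort("new_time")
```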
Labels:
- Apache Spark
02-24-2017
09:35 AM
@Bernhard Walter Thanks a lot, but could you do it in Scala, please? That would be very kind of you. Thanks.
02-24-2017
08:51 AM
Hi friends, I have CSV files on the local file system, and they all have the same header. I want to loop over them and merge them into one final CSV file with that single header. Is there a solution using spark-csv, or anything else in Spark? Thanks.
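A sketch of one way to do this in Spark, assuming the files sit under a hypothetical local folder C:/data (the path is a placeholder): a glob path makes Spark read every file in one call, so no explicit loop is needed, and each file's identical header row is consumed on read.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("merge-csv").getOrCreate()

// Read all CSV files at once via a glob; the shared header of each
// file is recognized and dropped from the data rows.
val merged = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("C:/data/*.csv")

// coalesce(1) yields a single part file containing one header row.
merged.coalesce(1).write.option("header", "true").csv("C:/data/merged")
```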
Labels:
- Apache Spark
02-15-2017
02:59 PM
@Michael Young Thanks for your reply. I am developing a .NET app that needs to access a remote HDFS cluster. Here is the code I used: when it runs locally I get a correct answer, but when I put in the IP of the remote cluster I get an exception.

List<string> lstDirectoriesName = new List<string>();
try
{
WebHDFSClient hdfsClient = new WebHDFSClient(new Uri("http://io-dell-svr8:50070"), "Administrator");
Microsoft.Hadoop.WebHDFS.DirectoryListing directroyStatus = hdfsClient.GetDirectoryStatus("/dataset").Result;
List<DirectoryEntry> lst = directroyStatus.Files.ToList();
foreach (DirectoryEntry var in lst)
{
lstDirectoriesName.Add(var.PathSuffix);
}
return lstDirectoriesName;
}
catch (Exception exException)
{
Console.WriteLine(exException.Message);
return null;
}

Any help would be appreciated.
02-15-2017
01:06 PM
I am using HDP, and I would like to use the Hadoop WebHDFS API from .NET to access Hadoop on a remote machine. I would like to know the URI to use to reach HDP remotely over HTTP. It should be something like http://host:port/webhdfs — what is the port to use? Thank you.
Labels:
- Hortonworks Data Platform (HDP)
02-14-2017
09:14 AM
Hi, same problem here. I want to make these interfaces hidden and running in the background. Thanks.
02-14-2017
07:31 AM
Hello guys, I am running a Hadoop cluster using HDP. When I start Hadoop there are two CLI windows, one for the DataNode and one for the NameNode. Is there any possibility to run them in the background? Thank you a lot!
Labels:
- Apache Hadoop