Member since: 02-17-2017
Posts: 71
Kudos Received: 17
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4499 | 03-02-2017 04:19 PM
 | 32395 | 02-20-2017 10:44 PM
 | 19064 | 01-10-2017 06:51 PM
03-03-2017 04:22 PM
Hi @Maher Hattabi, I am seeing a similar question of yours in the link below. Here is one where I answered the question of combining any files, whether csv or txt: https://community.hortonworks.com/questions/85230/erge-csv-files-in-one-file.html#answer-85245
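As a rough sketch of one way to merge files (not necessarily the linked answer's exact approach), assuming Hadoop 2.x where FileUtil.copyMerge is available; the paths here are hypothetical placeholders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merge every part file under /tmp/input into a single /tmp/merged.txt;
// works the same for csv, txt, or any other text files.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(fs, new Path("/tmp/input"), fs, new Path("/tmp/merged.txt"), false, conf, null)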
03-03-2017 04:15 PM
1 Kudo
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.SaveMode;

// Build a HiveContext from the underlying SparkContext
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
// Save the DataFrame as a Hive table, overwriting it if it already exists
df.write().mode(SaveMode.Overwrite).saveAsTable("dbName.tableName");
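To sanity-check the write, one option is to read the table back from the Spark (Scala) shell; this assumes the same table name as above:
// Read the saved Hive table back and show a few rows (Spark shell).
sqlContext.table("dbName.tableName").show()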
03-02-2017 04:19 PM
You need to flatten the STRSPLIT before you can project:
C = FOREACH A GENERATE FLATTEN(STRSPLIT(a1,'\\u002E')) AS (a1:chararray, a1of1:chararray), a2, a3;
03-02-2017 04:06 PM
A quick hack would be to use Scala's "substring" (http://alvinalexander.com/scala/scala-string-examples-collection-cheat-sheet). So what you can do is write a UDF, run the "new_time" column through it, and grab up to the part of the timestamp you want. For example, if you want just "yyyy-MM-dd HH:mm" as seen when you run df.show, the substring call will be new_time.substring(0, 16), which yields "2015-12-06 12:40". For example:
import org.apache.spark.sql.functions.udf

// Keep only the first 16 characters, i.e. "yyyy-MM-dd HH:mm"
val getDateTimeSplit = udf((new_time: String) => new_time.substring(0, 16))
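A hypothetical usage, assuming a DataFrame df with a string column new_time as in the question:
import org.apache.spark.sql.functions.col

// df and the column name "new_time" come from the question's context.
val trimmed = df.withColumn("short_time", getDateTimeSplit(col("new_time")))
trimmed.show()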
02-28-2017 04:30 PM
1 Kudo
Is there any better storage format for Pig? Let's say I want to store a very large filtered Hive table/dataset before any further processing. Is there any format that makes processing faster?
Labels:
- Apache Pig
02-24-2017 05:00 AM
I think this is a duplicate of the question for which I already posted an answer: https://community.hortonworks.com/questions/84507/sql-query-to-sparkdataframe-to-get-date-add-interv.html#answer-85407
Here is the working code again; you can convert it to a DataFrame or do operations on the RDD. The `format` helper below assumes dates in "yyyy-MM-dd" form.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer
import java.util.{GregorianCalendar, Date, Calendar}
import java.text.SimpleDateFormat

// Parses the start/end date columns; the "yyyy-MM-dd" pattern is an assumption.
val format = new SimpleDateFormat("yyyy-MM-dd")

// Emit one "date,dayOfMonth,month" string per day from startdate up to enddate.
def generateDates(startdate: Date, enddate: Date): ListBuffer[String] = {
  val dateList = new ListBuffer[String]()
  val calendar = new GregorianCalendar()
  calendar.setTime(startdate)
  while (calendar.getTime().before(enddate)) {
    dateList += calendar.getTime().toString().substring(0, 10) + "," +
      calendar.get(Calendar.DAY_OF_MONTH) + "," + calendar.get(Calendar.MONTH)
    calendar.add(Calendar.DATE, 1)
  }
  dateList += calendar.getTime().toString()
  println("\n" + dateList + "\n")
  dateList
}

// Explode one input line "id,start,end,c1,c2,c3" into one tuple per generated date.
def getRddList(a: String): ListBuffer[(String, String, String, String, String)] = {
  val allDates = new ListBuffer[(String, String, String, String, String)]()
  val cols = a.split(",")
  for (x <- generateDates(format.parse(cols(1)), format.parse(cols(2)))) {
    allDates += ((cols(0), x, cols(3), cols(4), cols(5)))
  }
  allDates
}

val fileRdd = sc.textFile("/data_1/date1")
val myRdd = fileRdd.map(x => getRddList(x)).flatMap(y => y)
myRdd.collect()
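If you want the DataFrame form mentioned above, a minimal sketch; the column names here are hypothetical placeholders:
// Name the five tuple fields; adjust the names to your data.
import sqlContext.implicits._
val myDf = myRdd.toDF("id", "date", "col4", "col5", "col6")
myDf.show()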
02-23-2017 07:11 PM
Works perfectly now.
02-21-2017 07:52 PM
Late reply, but running it on a cluster and increasing memory worked like a charm!
02-21-2017 03:38 PM
Hi @Olga Svyryd
I recently had the chance to attend Spark Summit East 2017. One of the sessions I attended was "No More 'Sbt Assembly': Rethinking Spark-Submit Using CueSheet". CueSheet has a lot of features, including submitting jobs not only in client mode but also straight to the cluster. The presenter was using IntelliJ to demo the project. To dive deeper, please follow the links below.
Link to slides: https://spark-summit.org/east-2017/events/no-more-sbt-assembly-rethinking-spark-submit-using-cuesheet/
Link to code and documentation: https://github.com/kakao/cuesheet https://github.com/jongwook/cuesheet-starter-kit