Member since: 02-17-2017
Posts: 71
Kudos Received: 17
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4499 | 03-02-2017 04:19 PM
 | 32395 | 02-20-2017 10:44 PM
 | 19064 | 01-10-2017 06:51 PM
03-03-2017 04:22 PM
Hi @Maher Hattabi, I am seeing a similar question of yours in the link below. Here is one where I answered the question of combining any files, whether csv or txt: https://community.hortonworks.com/questions/85230/erge-csv-files-in-one-file.html#answer-85245
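As a rough sketch of one way to merge files (not necessarily the linked answer's exact approach), assuming Hadoop 2.x where FileUtil.copyMerge is available; the paths here are hypothetical placeholders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merge every part file under /tmp/input into a single /tmp/merged.txt;
// works the same for csv, txt, or any other text files.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(fs, new Path("/tmp/input"), fs, new Path("/tmp/merged.txt"), false, conf, null)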
03-03-2017 04:15 PM
1 Kudo
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.SaveMode;

// Build a HiveContext from the underlying SparkContext
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
// Save the DataFrame as a Hive table, overwriting it if it already exists
df.write().mode(SaveMode.Overwrite).saveAsTable("dbName.tableName");
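To sanity-check the write, one option is to read the table back from the Spark (Scala) shell; this assumes the same table name as above:
// Read the saved Hive table back and show a few rows (Spark shell).
sqlContext.table("dbName.tableName").show()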
03-02-2017 04:19 PM
You need to flatten the STRSPLIT before you can project:
C = FOREACH A GENERATE FLATTEN(STRSPLIT(a1,'\\u002E')) AS (a1:chararray, a1of1:chararray), a2, a3;
03-02-2017 04:06 PM
A quick hack would be to use Scala's "substring" (http://alvinalexander.com/scala/scala-string-examples-collection-cheat-sheet). So what you can do is write a UDF, run the "new_time" column through it, and grab up to the part of the timestamp you want. For example, if you want just "yyyy-MM-dd HH:mm" as seen when you run df.show, the substring call will be new_time.substring(0, 16), which yields "2015-12-06 12:40". For example:
import org.apache.spark.sql.functions.udf

// Keep only the first 16 characters, i.e. "yyyy-MM-dd HH:mm"
val getDateTimeSplit = udf((new_time: String) => new_time.substring(0, 16))
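A hypothetical usage, assuming a DataFrame df with a string column new_time as in the question:
import org.apache.spark.sql.functions.col

// df and the column name "new_time" come from the question's context.
val trimmed = df.withColumn("short_time", getDateTimeSplit(col("new_time")))
trimmed.show()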
02-28-2017 04:30 PM
1 Kudo
Is there any better storage format for Pig? Let's say I want to store a very large filtered Hive table/dataset before any further processing. Is there any format that makes processing faster?
Labels:
- Apache Pig
02-24-2017 05:00 AM
I think this is a duplicate of the question for which I already posted an answer: https://community.hortonworks.com/questions/84507/sql-query-to-sparkdataframe-to-get-date-add-interv.html#answer-85407
Here is the working code again; you can convert it to a DataFrame or do operations on the RDD. The `format` helper below assumes dates in "yyyy-MM-dd" form.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer
import java.util.{GregorianCalendar, Date, Calendar}
import java.text.SimpleDateFormat

// Parses the start/end date columns; the "yyyy-MM-dd" pattern is an assumption.
val format = new SimpleDateFormat("yyyy-MM-dd")

// Emit one "date,dayOfMonth,month" string per day from startdate up to enddate.
def generateDates(startdate: Date, enddate: Date): ListBuffer[String] = {
  val dateList = new ListBuffer[String]()
  val calendar = new GregorianCalendar()
  calendar.setTime(startdate)
  while (calendar.getTime().before(enddate)) {
    dateList += calendar.getTime().toString().substring(0, 10) + "," +
      calendar.get(Calendar.DAY_OF_MONTH) + "," + calendar.get(Calendar.MONTH)
    calendar.add(Calendar.DATE, 1)
  }
  dateList += calendar.getTime().toString()
  println("\n" + dateList + "\n")
  dateList
}

// Explode one input line "id,start,end,c1,c2,c3" into one tuple per generated date.
def getRddList(a: String): ListBuffer[(String, String, String, String, String)] = {
  val allDates = new ListBuffer[(String, String, String, String, String)]()
  val cols = a.split(",")
  for (x <- generateDates(format.parse(cols(1)), format.parse(cols(2)))) {
    allDates += ((cols(0), x, cols(3), cols(4), cols(5)))
  }
  allDates
}

val fileRdd = sc.textFile("/data_1/date1")
val myRdd = fileRdd.map(x => getRddList(x)).flatMap(y => y)
myRdd.collect()
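If you want the DataFrame form mentioned above, a minimal sketch; the column names here are hypothetical placeholders:
// Name the five tuple fields; adjust the names to your data.
import sqlContext.implicits._
val myDf = myRdd.toDF("id", "date", "col4", "col5", "col6")
myDf.show()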
02-23-2017 07:11 PM
Works perfectly now.
02-21-2017 07:52 PM
Late reply, but running it on a cluster and increasing memory worked like a charm!
02-21-2017 03:38 PM
Hi @Olga Svyryd
I recently had the chance to attend Spark Summit East 2017. One of the sessions I attended was "No More 'Sbt Assembly': Rethinking Spark-Submit Using CueSheet". CueSheet has a lot of features, including submitting jobs not only in client mode but also straight to the cluster. The presenter was using IntelliJ to demo the project. To dive deeper, please follow the links below.
Link to slides: https://spark-summit.org/east-2017/events/no-more-sbt-assembly-rethinking-spark-submit-using-cuesheet/
Link to code and documentation: https://github.com/kakao/cuesheet https://github.com/jongwook/cuesheet-starter-kit