Member since
02-17-2017
71
Posts
17
Kudos Received
3
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4077 | 03-02-2017 04:19 PM
 | 31631 | 02-20-2017 10:44 PM
 | 18289 | 01-10-2017 06:51 PM
04-17-2020
03:02 PM
@testingsauce I am also facing this issue. I saved the DataFrame in Hive using saveAsTable, but when I try to fetch results using hiveContext.sql(query), it doesn't return anything. Badly stuck. Please help.
04-20-2018
08:30 PM
Could be a data skew issue. Check whether any single partition holds a huge chunk of the data compared to the rest. https://github.com/adnanalvee/spark-assist/blob/master/spark-assist.scala From the link above, copy the function "partitionStats" and pass in your data as a DataFrame. It will show the maximum, minimum, and average amount of data across your partitions, like below:
+------+-----+------------------+
|MAX |MIN |AVERAGE |
+------+-----+------------------+
|135695|87694|100338.61149653122|
+------+-----+------------------+
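The linked partitionStats function is Spark-specific, but the core computation can be sketched in plain Scala. In Spark the per-partition record counts would come from something like df.rdd.mapPartitions(it => Iterator(it.size.toLong)).collect(); here a plain Seq stands in for that result, and the example sizes are illustrative, not taken from the linked file:

```scala
// Sketch of the idea behind partitionStats: given the record count of
// each partition, report the max, min, and average so skew is obvious.
object PartitionStats {
  def stats(partitionSizes: Seq[Long]): (Long, Long, Double) = {
    require(partitionSizes.nonEmpty, "need at least one partition")
    val max = partitionSizes.max
    val min = partitionSizes.min
    val avg = partitionSizes.sum.toDouble / partitionSizes.size
    (max, min, avg)
  }

  def main(args: Array[String]): Unit = {
    // A heavily skewed layout: one partition holds most of the data.
    val (max, min, avg) = stats(Seq(900000L, 5000L, 5000L, 5000L))
    println(s"MAX=$max MIN=$min AVERAGE=$avg")
  }
}
```

If MAX is far above AVERAGE, a few tasks are doing most of the work, which is the skew symptom described above.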
07-14-2017
04:23 PM
You can enable compression when you write your data. This will speed up the save because the data will be smaller on disk. Also, consider increasing the number of partitions.
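A minimal sketch of what that looks like, assuming a DataFrame `df` already exists; the partition count, codec, and output path are illustrative choices, not values from the post:

```scala
// Write with a compression codec and more partitions for parallelism.
// snappy is a fast codec and a common default for Parquet output.
df.repartition(200)                     // spread the write across more tasks
  .write
  .option("compression", "snappy")      // smaller files, faster I/O
  .mode("overwrite")
  .parquet("/tmp/output")               // illustrative output path
```

Compression trades a little CPU for less disk and network I/O, which usually wins for large writes.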
05-10-2017
01:24 PM
Thanks! I would be interested to learn more when you are ready to announce it.
01-09-2018
06:06 PM
IBM offers free courses in Scala and other languages. There are tests at the end of each course; once you pass, you can earn badges and showcase them. https://cognitiveclass.ai/
03-07-2017
06:25 PM
oh! that worked. Thanks a lot!
03-04-2017
12:42 AM
6 Kudos
These might help: https://community.hortonworks.com/questions/39017/can-someone-point-me-to-a-good-tutorial-on-spark-s.html https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
03-02-2017
04:06 PM
A quick hack would be to use Scala's "substring" (http://alvinalexander.com/scala/scala-string-examples-collection-cheat-sheet). Write a UDF, run the "new_time" column through it, and grab up to the timestamp precision you want. For example, if you want just "yyyy-MM-dd HH:mm" as seen when you run df.show, the call is new_time.substring(0, 16), which yields "2015-12-06 12:40". Pseudo code:

def getDateTimeSplit = udf((new_time: String) => new_time.substring(0, 16))
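The substring logic itself can be checked without Spark; the sample timestamp here is an illustrative extension of the one in the post:

```scala
// "yyyy-MM-dd HH:mm" is 16 characters, so substring(0, 16) keeps
// everything up to the minutes and drops seconds and beyond.
val newTime = "2015-12-06 12:40:31"
val truncated = newTime.substring(0, 16)
println(truncated)  // 2015-12-06 12:40
```

Note this is index-based, so it only works when the column has a fixed-width timestamp prefix.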
03-09-2017
04:36 PM
@Adnan Alvee that is impressive indeed, ORC has additional benefits you will see on the Hive side. Glad you found it of use.
04-28-2019
02:56 PM
1 Kudo
We can use a rank approach, which is faster than max: the max approach scans the table twice. Here, the partition column is load_date:

select ld_dt.txnno,
       ld_dt.txndate,
       ld_dt.custno,
       ld_dt.amount,
       ld_dt.productno,
       ld_dt.spendby,
       ld_dt.load_date
from (select *,
             dense_rank() over (order by load_date desc) as dt_rnk
      from datastore_s2.transactions) ld_dt
where ld_dt.dt_rnk = 1