Member since: 02-17-2017
- 71 Posts
- 17 Kudos Received
- 3 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5621 | 03-02-2017 04:19 PM |
| | 34026 | 02-20-2017 10:44 PM |
| | 20670 | 01-10-2017 06:51 PM |
04-17-2020 03:02 PM

@testingsauce I am also facing this issue. I saved a DataFrame to Hive using saveAsTable, but when I try to fetch the results with hiveContext.sql(query), it doesn't return anything. I'm badly stuck. Please help.
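For context, a minimal sketch of the failing pattern described above (Spark 1.6-era API; the table name and query are illustrative, not from the thread):

```scala
// Hypothetical repro: save a DataFrame as a Hive table, then query it back.
df.write.saveAsTable("my_table")                        // illustrative table name

val result = hiveContext.sql("SELECT * FROM my_table")  // reportedly comes back empty
result.show()
```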
04-20-2018 08:30 PM

Could be a data-skew issue. Check whether any partition holds a huge chunk of the data compared to the rest. https://github.com/adnanalvee/spark-assist/blob/master/spark-assist.scala

From the link above, copy the function "partitionStats" and pass in your data as a DataFrame. It will show the maximum, minimum, and average amount of data across your partitions, like below:

```
+------+-----+------------------+
|MAX   |MIN  |AVERAGE           |
+------+-----+------------------+
|135695|87694|100338.61149653122|
+------+-----+------------------+
```
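If you'd rather not pull in the repo, a minimal sketch of such a helper might look like this (illustrative only, not the actual partitionStats implementation from the link):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of a partition-stats helper: count the rows in each
// partition, then report the max, min, and average row counts.
def partitionStats(df: DataFrame): Unit = {
  val counts = df.rdd
    .mapPartitions(it => Iterator(it.size.toLong)) // rows in this partition
    .collect()
  val avg = counts.sum.toDouble / counts.length
  println(s"MAX=${counts.max} MIN=${counts.min} AVERAGE=$avg")
}
```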
07-14-2017 04:23 PM

You can add compression when you write your data. This will speed up the save because the data will be smaller. Also, increase the number of partitions.
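A minimal sketch of both suggestions (the codec, partition count, and output path are illustrative assumptions, not from the original post):

```scala
// Hypothetical example: repartition for more write parallelism and
// compress the output to shrink the data on disk.
df.repartition(200)                  // illustrative partition count
  .write
  .option("compression", "snappy")   // codec choice is an assumption
  .parquet("/tmp/output")            // illustrative output path
```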
05-10-2017 01:24 PM

Thanks! I would be interested to learn more when you are ready to announce it.
01-09-2018 06:06 PM

IBM offers free courses in Scala and other languages. There are tests at the end of each course; once you pass, you can earn badges and showcase them. https://cognitiveclass.ai/
03-07-2017 06:25 PM

oh! that worked. Thanks a lot!
03-04-2017 12:42 AM (6 Kudos)

These might help:
https://community.hortonworks.com/questions/39017/can-someone-point-me-to-a-good-tutorial-on-spark-s.html
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
03-02-2017 04:06 PM

A quick hack would be to use Scala's "substring": http://alvinalexander.com/scala/scala-string-examples-collection-cheat-sheet

What you can do is write a UDF, run the "new_time" column through it, and grab up to the part of the timestamp you want. For example, if you want just "yyyy-MM-dd HH:mm" as seen when you run df.show, your substring call will be new_time.substring(0, 16), which will yield "2015-12-06 12:40".

```scala
import org.apache.spark.sql.functions.udf

// Truncate the timestamp string to "yyyy-MM-dd HH:mm" (first 16 characters).
def getDateTimeSplit = udf((new_time: String) => new_time.substring(0, 16))
```
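A hedged usage sketch (df and the new_time column are from the thread; the output column name is illustrative):

```scala
// Apply the UDF to the "new_time" column and inspect the result.
val trimmed = df.withColumn("new_time_short", getDateTimeSplit(df("new_time")))
trimmed.show()
```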
03-09-2017 04:36 PM

@Adnan Alvee that is impressive indeed; ORC has additional benefits you will see on the Hive side. Glad you found it of use.
04-28-2019 02:56 PM (1 Kudo)

We can use a rank approach, which is faster than max; max scans the table twice. Here, the partition column is load_date:

```sql
select
  ld_dt.txnno,
  ld_dt.txndate,
  ld_dt.custno,
  ld_dt.amount,
  ld_dt.productno,
  ld_dt.spendby,
  ld_dt.load_date
from (
  select *, dense_rank() over (order by load_date desc) as dt_rnk
  from datastore_s2.transactions
) ld_dt
where ld_dt.dt_rnk = 1;
```