Member since: 02-17-2017
Posts: 71
Kudos Received: 17
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4499 | 03-02-2017 04:19 PM |
|  | 32395 | 02-20-2017 10:44 PM |
|  | 19065 | 01-10-2017 06:51 PM |
03-13-2017
05:35 PM
This is just a suggestion, but have you tried running Hive on Tez? It's a much faster and more efficient execution engine. Try this before you execute your code:
set hive.execution.engine=tez;
03-09-2017
05:26 PM
Hi @Evan Willett The official Spark documentation says this: "The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type." Link: http://spark.apache.org/docs/latest/tuning.html#data-serialization
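If you want to try it, here is a minimal sketch of enabling Kryo when building the conf, assuming a standalone Spark 1.x-style application; MyClass1 and MyClass2 are hypothetical classes standing in for your own types, not anything from this thread.
import org.apache.spark.SparkConf

// Hypothetical application classes, used only to illustrate registration
case class MyClass1(a: Int)
case class MyClass2(b: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Registering classes is the "custom registration requirement" the docs mention;
// it keeps Kryo from writing the full class name with every serialized object.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))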
03-09-2017
04:28 PM
Wow. ORC got me from 3 TB (PigStorage) down to 60 GB. This is insane. I didn't notice any performance improvement, though, but I am happy with the savings in storage. Thanks! 🙂
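For reference, here is one way to produce ORC from Spark. This is only a sketch under assumptions, not the actual conversion above (which was presumably done in Pig/Hive); the paths and column names are illustrative, and it assumes a Hive-enabled sqlContext, which is the default in an HDP spark-shell.
import sqlContext.implicits._

val raw = sc.textFile("/data/input_text") // e.g. data previously stored via PigStorage
val df = raw.map(_.split("\t")).map(a => (a(0), a(1))).toDF("col1", "col2")

// ORC is a compressed, columnar format, which is where the storage savings come from
df.write.format("orc").save("/data/output_orc")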
03-08-2017
06:11 PM
What is the error you are getting while trying to use it, then? This is what I used in Spark 1.6.1:
import org.apache.spark.sql.functions.broadcast
val joined_df = df1.join(broadcast(df2), "key")
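In case it helps, here is a self-contained sketch of the same broadcast join, assuming a Spark 1.6 spark-shell where sc and sqlContext already exist; the sample data and column names are made up for illustration.
import org.apache.spark.sql.functions.broadcast
import sqlContext.implicits._

val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("key", "value1")
val df2 = sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("key", "value2")

// broadcast(df2) hints Spark to ship the small table to every executor instead of shuffling df1
val joined_df = df1.join(broadcast(df2), "key")
joined_df.show()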
03-08-2017
05:42 PM
2 Kudos
Hi @X Long The official documentation does include it: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables Here is one tutorial using Spark 2: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
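A minimal sketch of a broadcast variable, assuming an existing SparkContext named sc (as in spark-shell); the lookup map and sample values are illustrative.
val lookup = Map("a" -> 1, "b" -> 2)
val bLookup = sc.broadcast(lookup) // shipped once to each executor, read-only there

val data = sc.parallelize(Seq("a", "b", "a", "c"))
val mapped = data.map(x => (x, bLookup.value.getOrElse(x, 0)))
mapped.collect().foreach(println)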
03-07-2017
06:25 PM
Oh! That worked. Thanks a lot!
03-07-2017
04:51 PM
I am trying to run some Spark Streaming examples online, but even before I start, I'm getting this error:
Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
I tried the setting below, but it doesn't help:
conf.set("spark.driver.allowMultipleContexts","true");
Sample code I was trying to run in HDP 2.5:
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))
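A common workaround, not quoted from this thread: in spark-shell a SparkContext named sc already exists, so build the StreamingContext from it instead of from a new SparkConf.
import org.apache.spark.streaming._

// Reuse the shell's existing SparkContext (sc) rather than creating a second one
val ssc = new StreamingContext(sc, Seconds(1))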
Labels:
- Apache Spark
03-06-2017
09:49 PM
@vamsi valiveti The result of the code you wrote gives a schema like this:
((a1),(a1of1)),(a2),(a3)
Your projection wouldn't work on a schema like this, because Pig still considers the first two fields, "((a1),(a1of1))", as one. You need to use FLATTEN in this case to split it into two separate columns. That's exactly what my code is doing. I tested your data using my code and it works perfectly.
03-06-2017
05:46 PM
Try this:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
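A quick way to check that the HiveContext works (an illustrative follow-up, not part of the original answer):
sqlContext.sql("SHOW TABLES").show()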
03-03-2017
04:34 PM
I need some advice on getting myself equipped with a Kafka and Spark Streaming skill set. Tutorials with best practices are welcome! Thanks.
Labels:
- Apache Flume
- Apache Kafka
- Apache Spark