Member since: 10-12-2015
Posts: 63
Kudos Received: 56
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 24489 | 02-10-2017 12:35 AM |
| | 1774 | 02-09-2017 11:00 PM |
| | 1153 | 02-08-2017 04:48 PM |
| | 2812 | 01-23-2017 03:11 PM |
| | 4640 | 11-22-2016 07:33 PM |
10-27-2016
06:48 PM
1 Kudo
Sounds like a JDBC connection is in order. There is an API for creating a DataFrame from a JDBC connection:
jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame
The catch with JDBC is that reading data from Teradata will be much slower than reading from HDFS. Is it possible to run a Sqoop job to move the data to HDFS before starting your Spark application?
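For illustration, a rough PySpark sketch of that reader; the URL, table, credentials, and predicate columns below are placeholders, not details from your environment:

```python
# Minimal sketch, assuming a SQLContext is in scope and the Teradata JDBC
# driver jar is on the classpath; all connection details are placeholders.
props = {
    "user": "your_user",
    "password": "your_password",
    "driver": "com.teradata.jdbc.TeraDriver",
}

# Each predicate becomes its own partition, so the pull from Teradata is
# parallelized instead of funneling through a single connection.
predicates = [
    "load_date >= '2016-01-01' AND load_date < '2016-07-01'",
    "load_date >= '2016-07-01' AND load_date < '2017-01-01'",
]

df = sqlContext.read.jdbc(
    url="jdbc:teradata://td-host/DATABASE=mydb",
    table="my_table",
    predicates=predicates,
    properties=props,
)
```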
10-24-2016
10:40 PM
1 Kudo
@Steevan Rodrigues saveAsTextFile takes a parameter for the codec to compress with:
rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
09-05-2016
08:10 PM
The Databricks CSV library skips the Python path through core Spark. A map function in PySpark is run through a Python subprocess on each executor, whereas with Spark SQL and the Databricks CSV library everything goes through the Catalyst optimizer and the output is generated Java bytecode. Scala/Java is about 40% faster than Python when using core Spark, so I would guess that is why the second implementation is much faster. The CSV library is probably also much more efficient at breaking up the records, likely applying the split partition by partition as opposed to record by record.
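As a rough illustration of the two paths being compared (the file path and options are placeholders, assuming the comparison was between a manual split() parse and the spark-csv reader):

```python
# Approach 1: parse the CSV manually in Python; every record passes through
# a Python subprocess on the executors.
rdd_parsed = sc.textFile("hdfs:///data/events.csv").map(lambda line: line.split(","))

# Approach 2: let the Databricks CSV data source do the parsing; the plan goes
# through Catalyst and executes as generated JVM code.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///data/events.csv"))
```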
08-01-2016
02:15 AM
1 Kudo
If I had to guess, you're using Spark 1.5.2 or earlier. What is happening is you're running out of memory; I think it's executor memory, and you're probably doing a map-side aggregate. How many keys do you have? I think we can fix this pretty simply. Are you caching data? If not, set spark.shuffle.memoryFraction to a number higher than 0.4.
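For reference, a minimal sketch of bumping that setting; the 0.5 and 0.3 values are illustrative, assuming the legacy pre-1.6 memory manager and no cached RDDs:

```python
# Minimal sketch, assuming Spark 1.5.x legacy memory management and a job that
# does not cache any RDDs; the exact fractions are illustrative, not tuned values.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("aggregation-job")                 # hypothetical app name
        .set("spark.shuffle.memoryFraction", "0.5")    # default is 0.2
        .set("spark.storage.memoryFraction", "0.3"))   # shrink storage since nothing is cached
sc = SparkContext(conf=conf)
```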
08-01-2016
02:08 AM
If you call groupByKey on a DataFrame, it implicitly converts the DataFrame to an RDD, and you lose all the benefits of the Catalyst optimizer for that step.
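A quick sketch of the difference; the DataFrame and column names are hypothetical:

```python
# Dropping to the RDD API: Catalyst can no longer optimize this step.
sums_rdd = (df.rdd                                   # DataFrame -> RDD of Rows
              .map(lambda row: (row.key, row.value))
              .groupByKey()
              .mapValues(sum))

# Staying in the DataFrame API keeps the whole plan inside the optimizer.
sums_df = df.groupBy("key").sum("value")
```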
07-31-2016
09:29 PM
Agreed 100%. If you can accomplish the same task using reduceByKey, it implements a combiner, so it basically does the aggregation locally within each partition and then shuffles the combined results. Just keep an eye on GC when doing this.
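A small sketch of the pattern; the pairs RDD is hypothetical:

```python
# reduceByKey combines values per key inside each partition before the shuffle,
# so far less data crosses the network.
counts = pairs.reduceByKey(lambda a, b: a + b)

# The groupByKey equivalent ships every individual value across the shuffle first.
counts_slow = pairs.groupByKey().mapValues(sum)
```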
07-31-2016
09:27 PM
Use the explain API; if a broadcast join is happening, you should see a broadcast operator (something like BroadcastHashJoin) in the physical plan. Also, make sure you've enabled code generation for Spark SQL ("spark.sql.codegen=true"). Older versions of Spark (1.4 and earlier) have it set to false.
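For example (the DataFrames and join key here are hypothetical):

```python
# Enable code generation and inspect the physical plan; names are placeholders.
sqlContext.setConf("spark.sql.codegen", "true")

joined = orders_df.join(customers_df, "customer_id")
joined.explain()   # look for a broadcast join operator (e.g. BroadcastHashJoin)
                   # rather than a shuffle-based sort-merge join
```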
07-31-2016
09:19 PM
3 Kudos
When creating a Kafka receiver, it's one receiver per topic partition. You can definitely repartition the data after receiving it from Kafka; this should distribute the data to all of your workers as opposed to only having 5 executors do the work.
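A rough sketch of that repartition step; the ZooKeeper quorum, group id, topic name, and partition count are placeholders:

```python
# Minimal sketch, assuming the receiver-based spark-streaming-kafka API and an
# existing StreamingContext `ssc`; all names and counts below are placeholders.
from pyspark.streaming.kafka import KafkaUtils

stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group",
                                 {"my_topic": 1})

# Spread the received records across the whole cluster before the heavy work,
# instead of leaving it on the handful of executors that host receivers.
repartitioned = stream.repartition(32)
```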
07-31-2016
09:17 PM
2 Kudos
Try setting your join to a broadcast join. By default, Spark SQL does a broadcast join for tables smaller than 10 MB. In this case, I think it would make a lot of sense to change the setting "spark.sql.autoBroadcastJoinThreshold" to around 250 MB. In MapReduce terms this gives you a map-side join, which should be much quicker than what you're experiencing.

Also, don't worry about having a large number of tasks; it's perfectly fine to have that many. I've found that beyond 4 cores per executor you get diminishing returns on performance (i.e. 4 cores = 400% throughput, while 5 cores is only ~430%). Another setting you might want to investigate is spark.sql.shuffle.partitions, which controls the number of partitions used when shuffling data for joins; the default is 200, and I think you might want to raise that number quite a bit.
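A minimal sketch of those two settings; the threshold is specified in bytes, and the DataFrame names and partition count are placeholders:

```python
# Minimal sketch, assuming Spark 1.x with a SQLContext; the 250 MB threshold is
# expressed in bytes, and the table names and partition count are illustrative.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(250 * 1024 * 1024))
sqlContext.setConf("spark.sql.shuffle.partitions", "800")   # default is 200

# With the threshold raised, the smaller table is broadcast to every executor
# and the join happens map-side, avoiding a full shuffle of the large table.
result = large_df.join(small_df, "customer_id")
result.explain()   # should now show a broadcast join operator
```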