Member since: 10-12-2015
Posts: 63
Kudos Received: 56
Solutions: 13

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 24480 | 02-10-2017 12:35 AM
 | 1771 | 02-09-2017 11:00 PM
 | 1151 | 02-08-2017 04:48 PM
 | 2803 | 01-23-2017 03:11 PM
 | 4627 | 11-22-2016 07:33 PM
04-20-2022
03:04 AM
Hello all, please reply to this ASAP. I am trying to install the VM on my PC, but my screen is stuck at the same point: "Extracting and loading the Hortonworks Sandbox". I have assigned 8 GB of RAM to the VM, and my laptop configuration is also 8 GB, with a 7th-gen i5.
10-01-2020
04:35 AM
My best practice: keep the number of executors equal to the number of Spark clients that your cluster is configured for, and go with 2 cores per executor. You will get optimal performance, with a consistent distribution of batches for processing, as confirmed from the Spark UI.
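As a rough illustration of that sizing, here is a minimal PySpark sketch; the executor count, memory setting, and application name are hypothetical placeholders rather than values from the post:

```python
# A minimal sketch, assuming a hypothetical cluster sized for 6 executors;
# the numbers below are illustrative, not prescriptive.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streaming-batch-sizing-example")    # hypothetical app name
    .config("spark.executor.instances", "6")      # match the executors your cluster is sized for
    .config("spark.executor.cores", "2")          # 2 cores per executor, as suggested above
    .config("spark.executor.memory", "4g")        # illustrative memory setting
    .getOrCreate()
)
```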
04-13-2020
12:25 PM
Hi, can I instead add the following line to the spark-defaults.conf file: spark.ui.port 4041? Will that have the same effect? Thanks
01-18-2020
02:50 AM
The Spark Catalyst Optimiser is smart. If it is not optimising well, then you have to think about it; otherwise, it is able to optimise on its own. Below is one example:

fr = spark.createDataFrame([{'a': 1}, {'b': 2}])
fr.select('a', 'b').drop('a')

The parsed logical plan for the above query is:

== Parsed Logical Plan ==
Project [b#69L]
+- Project [a#68L, b#69L]
   +- LogicalRDD [a#68L, b#69L], false

And the physical plan is:

== Physical Plan ==
*(1) Project [b#69L]
+- *(1) Scan ExistingRDD[a#68L,b#69L]

Spark optimises the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').
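For reference, the plans above can be reproduced with DataFrame.explain. This is a minimal sketch using the same toy data; the exact expression IDs (such as #68L and #69L) will differ per session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-projection-example").getOrCreate()

# Same toy data as in the post: two rows, each supplying only one of the columns.
fr = spark.createDataFrame([{'a': 1}, {'b': 2}])

# extended=True prints the parsed, analyzed, and optimized logical plans plus the
# physical plan, so you can see the two projections collapsed into one.
fr.select('a', 'b').drop('a').explain(extended=True)
```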
11-12-2019
06:44 AM
Hi, we don't provide any connectors for Teradata to Spark, but if you want to get data from Teradata into Spark, you can probably use any JDBC driver that Teradata provides. Thanks, AKR
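As an illustration only, a generic JDBC read from PySpark might look like the sketch below. The connection URL, driver class name, table, and credentials are placeholders you would replace with the details from your Teradata JDBC driver's documentation, and the driver jar is assumed to already be on the Spark classpath:

```python
# A minimal sketch, assuming the Teradata JDBC driver jar is available to the
# driver and executors; every connection detail below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-jdbc-example").getOrCreate()

df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:teradata://td-host/DATABASE=mydb")   # placeholder connection URL
    .option("driver", "com.teradata.jdbc.TeraDriver")          # driver class, per Teradata's docs
    .option("dbtable", "mydb.my_table")                        # placeholder table
    .option("user", "user")
    .option("password", "password")
    .load()
)

df.show(5)
```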
12-01-2016
11:18 AM
Hi all, fixing the following issue also fixed this one: https://community.hortonworks.com/questions/68989/datanodes-status-not-consistent.html#answer-69461 Regards, Alessandro
11-22-2016
07:33 PM
2 Kudos
To add to what @Scott Shaw said, the biggest thing we'd be looking for initially is data skew, and we can look at a couple of things to help determine this. The first is the input size. With input size, we can completely ignore the min and look at the 25th percentile, the median, and the 75th percentile. In your job they are fairly close together, and the max is never dramatically more than the median. If the max and 75th percentile were very large, we would definitely be seeing data skew.

Another indicator of data skew is task duration. Again, ignore the minimum; we're inevitably going to get a small partition for one reason or another. Focus on the 25th percentile, median, 75th percentile, and max. In a perfect world the separation between the four would be tiny, so while 6s, 10s, 11s, 17s may seem significantly different, they're actually relatively close. The only time there would be cause for concern is when the 75th percentile and max are quite a bit greater than the 25th percentile and median. When I say significant, I'm talking about most tasks taking ~30s and the max taking 10 minutes. That would be a clear indicator of data skew.
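If you want to spot-check skew outside the Spark UI, one quick way is to count the records held by each partition and compare the spread. This is a minimal sketch with toy data, where `df` stands in for the DataFrame feeding the stage whose task summary you are inspecting:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-check-example").getOrCreate()

# Toy data standing in for your real input; replace with the DataFrame you care about.
df = spark.range(0, 1_000_000).repartition(8)

# glom() groups each partition into a list, so map(len) gives records per partition.
counts = df.rdd.glom().map(len).collect()
print(sorted(counts))   # a long tail here mirrors a large max task duration in the UI
```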
08-01-2016
06:23 PM
I should have read the post a little closer; I thought you were doing a groupByKey. You are correct: you need to use groupBy to keep the execution within the DataFrame and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, then do an inner join first, and then do your left join on the result. The result will most likely be broadcast for the left join. This is a pattern Holden described at Strata this year in one of her sessions.
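Here is a minimal sketch of that pattern, using hypothetical DataFrames `left` and `right` joined on a hypothetical key column "id"; the explicit broadcast hint is my own addition to make the intent clear:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-pattern-example").getOrCreate()

# Hypothetical data: in practice `right` would be the much larger side.
left = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "l_val"])
right = spark.createDataFrame([(1, "x"), (3, "y"), (4, "z")], ["id", "r_val"])

# Step 1: inner join keeps only keys that actually match, shrinking the right side.
matched = left.join(right, on="id", how="inner")

# Step 2: left join the (now small) matched result back onto the full left side;
# broadcasting it keeps this second join cheap.
result = left.join(F.broadcast(matched.select("id", "r_val")), on="id", how="left")
result.show()
```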