Member since: 10-12-2015
Posts: 63
Kudos Received: 56
Solutions: 13
My Accepted Solutions

Views | Posted
---|---
24599 | 02-10-2017 12:35 AM
1805 | 02-09-2017 11:00 PM
1177 | 02-08-2017 04:48 PM
2884 | 01-23-2017 03:11 PM
4713 | 11-22-2016 07:33 PM
04-20-2022
03:04 AM
Hello all, please reply to this ASAP. I am trying to install the VM on my PC, but my screen is stuck at the same point: "Extracting and loading the Hortonworks Sandbox". I have assigned 8 GB of RAM, and my laptop itself has 8 GB with a 7th-gen i5.
10-01-2020
04:35 AM
My best practice: keep the number of executors equal to the number of Spark clients your cluster is configured for, and use 2 cores per executor. You will get optimum performance with a consistent distribution of batches for processing, as confirmed in the Spark UI.
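For illustration, here is a minimal sketch of that sizing expressed as SparkSession options in PySpark; the executor count of 4 and the application name are placeholders, not values taken from any particular cluster:

    from pyspark.sql import SparkSession

    # Hypothetical sizing: replace the executor count with the number that
    # matches your own cluster; 2 cores per executor as described above.
    spark = (SparkSession.builder
             .appName("streaming-sizing-example")
             .config("spark.executor.instances", "4")
             .config("spark.executor.cores", "2")
             .getOrCreate())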
04-13-2020
12:25 PM
Hi, can I instead add the following line to the spark-defaults.conf file: spark.ui.port 4041? Will that have the same effect? Thanks
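For reference, the same property can also be set per application when building the SparkSession; a minimal sketch (the application name is a placeholder):

    from pyspark.sql import SparkSession

    # Equivalent to the spark-defaults.conf entry "spark.ui.port 4041",
    # but scoped to this single application.
    spark = (SparkSession.builder
             .appName("ui-port-example")
             .config("spark.ui.port", "4041")
             .getOrCreate())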
01-18-2020
02:50 AM
The Spark Catalyst optimiser is smart. If it is not optimising well, then you have to think about it; otherwise it is able to optimise on its own. Below is one example:

    fr = spark.createDataFrame([{'a': 1}, {'b': 2}])
    fr.select('a', 'b').drop('a')

The parsed logical plan for the above query is:

    == Parsed Logical Plan ==
    Project [b#69L]
    +- Project [a#68L, b#69L]
       +- LogicalRDD [a#68L, b#69L], false

And the physical plan is:

    == Physical Plan ==
    *(1) Project [b#69L]
    +- *(1) Scan ExistingRDD[a#68L,b#69L]

Spark optimises the query from two projections down to a single projection, which is the same as the physical plan of fr.select('b').
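For anyone who wants to reproduce this, a minimal sketch of how to print the plans shown above, assuming an existing SparkSession named spark:

    # Build the same two-column DataFrame and ask Catalyst for its plans.
    fr = spark.createDataFrame([{'a': 1}, {'b': 2}])
    fr.select('a', 'b').drop('a').explain(True)   # True prints the logical plans as well
    fr.select('b').explain(True)                  # compare: the physical plans match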
11-12-2019
06:44 AM
Hi, we don't provide any connectors for Teradata to Spark, but if you want to get data from Teradata into Spark, you can probably use any JDBC driver that Teradata provides. Thanks, AKR
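As an illustration only, a read through Spark's generic JDBC data source might look like the sketch below; the host, database, table, and credentials are placeholders, and the Teradata JDBC driver jar has to be on the Spark classpath:

    # Hypothetical connection details; replace host, database, table and credentials.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:teradata://td-host/DATABASE=mydb")
          .option("driver", "com.teradata.jdbc.TeraDriver")
          .option("dbtable", "mydb.my_table")
          .option("user", "my_user")
          .option("password", "my_password")
          .load())
    df.show()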
12-01-2016
11:18 AM
Hi all, fixing the following issue also fixed this one: https://community.hortonworks.com/questions/68989/datanodes-status-not-consistent.html#answer-69461 Regards, Alessandro
11-22-2016
07:33 PM
2 Kudos
To add to what @Scott Shaw said, the biggest thing we'd be looking for initially is data skew, and we can look at a couple of things to help determine this.

The first is input size. With input size we can completely ignore the min and look at the 25th, median, and 75th percentiles. In your job they are fairly close together, and the max is never dramatically more than the median. If the max and 75th percentile were much larger than the median, that would be a clear sign of data skew.

Another indicator of data skew is task duration. Again, ignore the minimum; we're inevitably going to get a small partition for one reason or another. Focus on the 25th, median, 75th, and max. In a perfect world the separation between the four would be tiny, so 6s, 10s, 11s, 17s may seem significantly different, but they're actually relatively close. The only cause for concern is when the 75th percentile and max are quite a bit greater than the 25th and median. When I say significant, I'm talking about most tasks taking ~30s and the max taking 10 minutes. That would be a clear indicator of data skew.
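If you want to check the same thing outside the Spark UI, here is a minimal sketch (assuming an existing DataFrame named df) that computes per-partition record counts, whose spread gives a rough picture of skew:

    # Count records in each partition; a handful of partitions with far larger
    # counts than the rest is the same skew signal as a large max in the UI.
    counts = sorted(df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect())
    n = len(counts)
    print("25th:", counts[n // 4], "median:", counts[n // 2],
          "75th:", counts[3 * n // 4], "max:", counts[-1])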
08-01-2016
06:23 PM
I should have read the post a little closer; I thought you were doing a groupByKey. You are correct: you need to use groupBy to keep the execution within the DataFrame and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, then do an inner join first, then do your left join on the result. The result will most likely be broadcast for the left join. This is a pattern that Holden described at Strata this year in one of her sessions.
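One way to read that pattern as a PySpark sketch; left_df, right_df, and the join key "id" are placeholder names, not anything from the original question:

    # Step 1: an inner join against the distinct left-side keys keeps only the
    # rows of the large right side that can actually match.
    reduced_right = right_df.join(left_df.select("id").distinct(), on="id", how="inner")

    # Step 2: the left join now runs against a much smaller right side,
    # which Spark is far more likely to broadcast.
    result = left_df.join(reduced_right, on="id", how="left")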