We are using Spark 2.3 (Spark SQL) with Java 8, running on YARN. In the Spark UI we can see straggling tasks, data skew, and a lot of GC activity. We currently use the default Java serialization and hope to see some performance improvement when we move to Kryo.
For GC, please suggest best practices for tuning. We are planning to switch to G1GC.
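For reference, the Kryo and G1GC changes we have in mind would look roughly like the following `spark-defaults.conf` entries. This is only a sketch: the property names are standard Spark and JVM options, but the values (for example the `35` percent occupancy threshold) are untested placeholder assumptions, not settings we have validated.

```
# Switch from default Java serialization to Kryo
spark.serializer                 org.apache.spark.serializer.KryoSerializer

# Use G1 on the executors and log GC activity (Java 8 flags); values are placeholders
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```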
For the data skew and straggling tasks, we are planning to use salting; any salting examples would be helpful. Can we also use bucketing? We are not using Hive here, only file-based sources with Spark SQL typed Datasets. Will Spark bucketing work for file-based data? A bucketing example would be great.
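To make clear what we mean by salting, here is a minimal plain-Java sketch of the two-stage aggregation idea, with no Spark involved. The class and method names (`SaltedSum`, `partialSums`, `finalSums`), the `#` separator, and the sample data are all made up for illustration. Our assumption is that the Spark equivalent would be a first aggregation grouped by (key, salt) followed by a second aggregation grouped by key alone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Plain-Java illustration of two-stage ("salted") aggregation; hypothetical names, no Spark.
final class SaltedSum {

    // Stage 1: append a random salt in [0, saltBuckets) to each key and sum per salted key.
    // In a distributed job, the salted variants of one hot key can be processed separately.
    static Map<String, Long> partialSums(String[] keys, long[] values, int saltBuckets, long seed) {
        Random random = new Random(seed);
        Map<String, Long> partial = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            String saltedKey = keys[i] + "#" + random.nextInt(saltBuckets);
            partial.merge(saltedKey, values[i], Long::sum);
        }
        return partial;
    }

    // Stage 2: strip the salt and sum the partial results per original key.
    static Map<String, Long> finalSums(Map<String, Long> partial) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            String saltedKey = entry.getKey();
            String originalKey = saltedKey.substring(0, saltedKey.lastIndexOf('#'));
            result.merge(originalKey, entry.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // "hot" stands in for a skewed key that dominates the data set.
        String[] keys = {"hot", "hot", "hot", "hot", "hot", "hot", "cold"};
        long[] values = {1, 2, 3, 4, 5, 6, 10};
        Map<String, Long> partial = partialSums(keys, values, 4, 42L);
        System.out.println("salted groups: " + partial.size());
        System.out.println("final sums: " + finalSums(partial));
    }
}
```

The final sums per original key are the same as a direct sum; the salt only changes how the intermediate work is split, which is why this only applies to aggregations that can be combined in two steps (sum, count, min, max).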
We tried increasing parallelism by repartitioning and by changing spark.sql.shuffle.partitions, with no luck.
The input is only 160 MB, but it grows into a massive amount of data once it is loaded and aggregated in Spark. We are not sure why it seems to grow exponentially. Thank you.