Support Questions

pyspark df.count() runs 15,000 tasks (one per partition) and takes more than 1 hour, while the same count in Hive finishes within 5 minutes. How can we reduce the number of tasks and partitions for DataFrames?


Hi @Gundrathi babu

Using coalesce/repartition you can redistribute the data into fewer partitions, which reduces the number of tasks. As for the speed difference: when the table is stored in Hive, its metadata lives in the Hive metastore/HCatalog, which lets Hive answer a count much faster than Spark. If the table statistics are enabled and collected, even a per-partition row count can be answered from the metastore, again outperforming Spark. Spark maintains no separate metadata of its own, so it has to scan the data. That's the reason for the performance difference. Hope it helps!!