Member since
09-08-2025
4
Posts
1
Kudos Received
0
Solutions
09-08-2025
09:01 PM
1 Kudo
Thank you for the response.
09-08-2025
09:00 PM
I have a PySpark job that reads an HDFS path in its native format (text, ORC, or Parquet) and writes it in Parquet format to an Ozone path. The data is huge. 1) How do I size resources for the PySpark job, i.e., the number of cores, number of executors, and memory allocation? 2) Is there a way to read the data from HDFS dynamically, adjusting to its file type? 3) What is the optimal approach for this data movement, and what input mappings should the approach use?
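A minimal sketch of points 1) and 2). The extension-to-format mapping and the sizing rule of thumb (~128 MB per task, 5 cores per executor, ~4 GB per core) are assumptions, not anything stated in the post; tune them for your cluster.

```python
def detect_format(path):
    """Map a file extension to a Spark reader format (assumption:
    format is inferred from the extension)."""
    ext = path.rsplit(".", 1)[-1].lower()
    return {"orc": "orc", "parquet": "parquet",
            "txt": "text", "csv": "csv"}.get(ext, "text")

def size_executors(data_gb, cores_per_executor=5, partition_mb=128):
    """Rough sizing: one task per ~128 MB partition, two task waves
    per core, ~4 GB of executor memory per core (rule of thumb)."""
    partitions = max(1, (data_gb * 1024) // partition_mb)
    executors = max(2, int(partitions // (cores_per_executor * 2)))
    return {"num_executors": executors,
            "executor_cores": cores_per_executor,
            "executor_memory_gb": cores_per_executor * 4}

# Reading and writing (illustration only; requires a live SparkSession):
# df = spark.read.format(detect_format(hdfs_path)).load(hdfs_path)
# df.write.mode("overwrite").parquet(ozone_path)
```

With these assumptions, a 100 GB input would come out at roughly 80 executors with 5 cores and 20 GB each; enabling `spark.dynamicAllocation.enabled` lets Spark scale that number at runtime instead.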
Labels:
09-08-2025
05:04 AM
How can I run spark df.write inside a UDF called from rdd.foreach or rdd.foreachPartition, i.e., use the SparkSession object inside an executor?
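The short answer is that this cannot work: the SparkSession (and df.write) exists only on the driver, so calling it from a UDF or inside rdd.foreachPartition fails on the executors. A common workaround is a plain-Python per-partition writer. The sketch below writes each partition to local JSON files; in practice you would swap in an HDFS/Ozone client, and the function and directory names here are hypothetical.

```python
import json
import os
import tempfile

def write_partition(rows, out_dir):
    """Write one partition's rows as JSON lines. Runs on the executor,
    so it must not touch the SparkSession or any DataFrame API."""
    os.makedirs(out_dir, exist_ok=True)
    # One file per partition; pid keeps names unique within a worker.
    path = os.path.join(out_dir, "part-%d.json" % os.getpid())
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path

# On the driver (illustration only):
# rdd.foreachPartition(lambda it: write_partition(it, "/mnt/out"))
# ...or, if the output can be a DataFrame, keep the write on the
# driver side entirely: df.write.parquet(path)
```

If the goal is simply to persist the data, restructuring the job so that `df.write` stays on the driver is usually simpler and faster than per-partition custom I/O.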
Labels:
- Apache Spark