I am trying to use pandas udfs in my code. Internally it uses apache arrow for the data conversion. I am getting below issue with the pyarrow module despite of me importing it in my app code explicitly.
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 361, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 236, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type, runner_conf)
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in read_single_udf
return arg_offsets, wrap_scalar_pandas_udf(func, return_type)
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 84, in wrap_scalar_pandas_udf
arrow_return_type = to_arrow_type(return_type)
File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1585, in to_arrow_type
import pyarrow as pa
ModuleNotFoundError: No module named 'pyarrow'
I also tried to manually enable arrow but still no luck
spark.conf.set("spark.sql.execution.arrow.enabled", "true")