Pandas UDFs in PySpark: ModuleNotFoundError: No module named 'pyarrow'

New Contributor

I am trying to use pandas UDFs in my code. Internally, they use Apache Arrow for data conversion. I am getting the error below for the pyarrow module, even though I import it explicitly in my application code.

 

  File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 361, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 236, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type, runner_conf)
  File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 175, in read_single_udf
    return arg_offsets, wrap_scalar_pandas_udf(func, return_type)
  File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 84, in wrap_scalar_pandas_udf
    arrow_return_type = to_arrow_type(return_type)
  File "/opt/cloudera/parcels/CDH-6.3.3-1.cdh6.3.3.p3996.4056429/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1585, in to_arrow_type
    import pyarrow as pa
ModuleNotFoundError: No module named 'pyarrow'

I also tried enabling Arrow manually, but still no luck:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
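
Note: the traceback above is raised from pyspark/worker.py, which runs in the Python worker process rather than in the application code on the driver. An import that succeeds in the app code therefore does not show that pyarrow is available to the interpreter the workers use. A minimal sketch for checking this is below. `probe_module` is a hypothetical helper (not part of PySpark), and the commented-out section assumes an existing SparkSession named `spark`.

```python
import importlib.util
import socket
import sys


def probe_module(module_name="pyarrow"):
    """Report whether module_name is importable in the current interpreter."""
    spec = importlib.util.find_spec(module_name)
    return {
        "host": socket.gethostname(),
        "python": sys.executable,
        "module": module_name,
        "available": spec is not None,
    }


# Driver-side check: runs in the interpreter that launched the application.
print(probe_module())

# Executor-side check (assumption: a SparkSession named `spark` exists).
# Each task runs probe_module() inside a Python worker process.
# results = (spark.sparkContext
#            .parallelize(range(8), 8)
#            .map(lambda _: probe_module())
#            .collect())
# for row in results:
#     print(row)
```

If the driver reports the module as available but some workers do not, the package is probably missing from the Python environment on those hosts.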

Master Guru

@AnandG Is the pyarrow module installed? Please try installing it with:

pip install pyarrow
(Use pip3 for Python 3. You may also need to upgrade the pip utility itself.)
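
One thing to watch for: pip has to install into the same Python interpreter that the Spark workers use, on every node that runs executors. As a rough sketch, the helper below builds the install command for the interpreter named in the PYSPARK_PYTHON environment variable (which Spark reads to choose the worker interpreter) and falls back to the current interpreter when it is unset. `pip_install_command` is a hypothetical name used for illustration only.

```python
import os
import sys


def pip_install_command(package="pyarrow"):
    """Build a pip command bound to the interpreter Spark workers would use."""
    # Assumption: PYSPARK_PYTHON, if set, points at the worker interpreter.
    interpreter = os.environ.get("PYSPARK_PYTHON", sys.executable)
    # "python -m pip" ties the install to that specific interpreter.
    return [interpreter, "-m", "pip", "install", package]


# Print the command instead of running it, so it can be reviewed first.
print(" ".join(pip_install_command()))
```

Running the printed command on each worker host (or through your cluster's usual package management) keeps the install tied to the right Python environment.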

 


Cheers!