
PySpark issue

Rising Star

I am trying to load a CSV file into PySpark with the query below.

sample = sqlContext.load(source="com.databricks.spark.csv", path='/tmp/test/20170516.csv', header=True, inferSchema=True)

But I am getting an error saying:

py4j.protocol.Py4JJavaError: An error occurred while calling o137.load. 
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
1 ACCEPTED SOLUTION

Expert Contributor

@prsingh

You need to provide the Databricks spark-csv dependencies: either pull the package at run time, or download the JARs and pass them when starting the shell.

1) Pull the dependency at run time:

# start the PySpark shell with the spark-csv package resolved from Maven
pyspark --packages com.databricks:spark-csv_2.10:1.2.0
# then, inside the shell, load the file through the spark-csv data source
df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
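If you are submitting a standalone script instead of using the interactive shell, the same Maven coordinate works with spark-submit; a minimal sketch, where load_csv.py is a hypothetical script name:

# load_csv.py is hypothetical; --packages resolves the jar from Maven at launch
./bin/spark-submit --packages com.databricks:spark-csv_2.10:1.2.0 load_csv.py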

or

2) Pass the JARs when starting the shell:

a) Download the JARs as follows:

wget http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar -O commons-csv-1.1.jar
wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.0.0.jar -O spark-csv_2.10-1.0.0.jar

b) Then start the PySpark shell with the JARs on the classpath:

./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar" 

c) Load the file as a DataFrame:

df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
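As a side note: on Spark 2.x the CSV reader is built in, so neither the package nor the JARs are needed. A minimal sketch, assuming a Spark 2.x SparkSession named spark:

# Spark 2.x only: csv() is part of DataFrameReader, no external package required
df = spark.read.csv('file:///root/file.csv', header=True, inferSchema=True)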

Let me know if the above helps!

