Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

PySpark issue

Rising Star

I am trying to load a CSV file into PySpark with the query below.

sample = sqlContext.load(source="com.databricks.spark.csv", path = '/tmp/test/20170516.csv', header = True,inferSchema = True) 

But I am getting an error saying:

py4j.protocol.Py4JJavaError: An error occurred while calling o137.load. 
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
1 ACCEPTED SOLUTION

Expert Contributor

@prsingh

You need to pass the Databricks spark-csv dependency: either download the JARs and pass them when starting the shell, or pull the package in at runtime.

1) Download the dependency at runtime:

pyspark --packages com.databricks:spark-csv_2.10:1.2.0 
df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')

or

2) Pass the JARs when starting the shell:

a) Download the JARs as follows:

wget http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.ja... -O commons-csv-1.1.jar 
wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.... -O spark-csv_2.10-1.0.0.jar 

b) Then start the PySpark shell with the arguments:

./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar" 

c) Load the file as a DataFrame:

df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')

Let me know if the above helps!

