
Pyspark issue

Solved

Contributor

I am trying to load a CSV file into PySpark with the query below.

sample = sqlContext.load(source="com.databricks.spark.csv", path='/tmp/test/20170516.csv', header=True, inferSchema=True)

But I am getting an error saying:

py4j.protocol.Py4JJavaError: An error occurred while calling o137.load. 
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
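To see why this error appears, it helps to know that Spark resolves the `source` string to an implementing class on the JVM classpath, and `com.databricks.spark.csv` is an external package that is not on the classpath by default. The sketch below is a hypothetical illustration of that lookup in plain Python; the registry, function, and class names are made up for illustration and are not Spark internals.

```python
# Hypothetical sketch of how a data-source name resolves to an implementation.
# Spark's real resolution happens on the JVM via classpath lookup; the names
# and dicts here are illustrative only.

BUILTIN_SOURCES = {"json": "JsonRelation", "parquet": "ParquetRelation"}

def resolve_source(name, extra_jars=()):
    """Return the implementing class for a source name, or raise if it is missing."""
    registry = dict(BUILTIN_SOURCES)
    for jar in extra_jars:
        registry.update(jar)  # each jar contributes its data-source classes
    if name not in registry:
        raise LookupError(
            "Failed to find data source: %s. "
            "Please find packages at http://spark-packages.org" % name)
    return registry[name]

# Without the spark-csv jar on the classpath, the lookup fails:
try:
    resolve_source("com.databricks.spark.csv")
except LookupError as e:
    print(e)

# Adding the jar's registration makes it succeed:
csv_jar = {"com.databricks.spark.csv": "DefaultSource"}
print(resolve_source("com.databricks.spark.csv", extra_jars=[csv_jar]))
```

This is why the accepted answer below works: both `--packages` and `--jars` put the missing classes on the classpath before the shell starts.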
1 ACCEPTED SOLUTION

Accepted Solutions

Re: Pyspark issue

Rising Star

@prsingh

You need to supply the Databricks spark-csv dependency: either let Spark download the package at run time, or download the jars yourself and pass them when starting the shell.

1) Download the dependency at run time:

pyspark --packages com.databricks:spark-csv_2.10:1.2.0
df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')

or

2) Pass the jars when starting the shell:

a) Download the jars as follows:

wget http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.ja... -O commons-csv-1.1.jar 
wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.... -O spark-csv_2.10-1.0.0.jar 

b) Then start the PySpark shell with the arguments:

./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar" 

c) Load the file as a DataFrame:

df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
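For readers unfamiliar with the two options being passed, `header='true'` makes the first row supply column names, and `inferSchema='true'` makes Spark scan the values to pick column types instead of treating everything as strings. A minimal sketch of that behavior using only Python's standard `csv` module (not Spark's actual implementation; the helper names are made up):

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest type name (int, float, str) that fits every value in a column."""
    for cast in (int, float):
        try:
            for v in values:
                cast(v)
            return cast.__name__
        except ValueError:
            continue
    return "str"

def load_csv(text, header=True, infer_schema=True):
    """Toy stand-in for header/inferSchema: returns (schema, rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    names = rows[0] if header else ["_c%d" % i for i in range(len(rows[0]))]
    data = rows[1:] if header else rows
    schema = {}
    for i, name in enumerate(names):
        column = [row[i] for row in data]
        schema[name] = infer_type(column) if infer_schema else "str"
    return schema, data

schema, data = load_csv("id,amount\n1,3.5\n2,4.0\n")
print(schema)  # {'id': 'int', 'amount': 'float'}
```

With `header=False` the columns would be named `_c0`, `_c1`, ..., which mirrors the default names Spark assigns when no header row is declared.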

Let me know if the above helps!
