Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

PySpark issue

Rising Star

I am trying to load a CSV file into PySpark with the query below.

sample = sqlContext.load(source="com.databricks.spark.csv", path = '/tmp/test/20170516.csv', header = True,inferSchema = True) 

But I am getting an error saying:

py4j.protocol.Py4JJavaError: An error occurred while calling o137.load. 
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
1 ACCEPTED SOLUTION

Expert Contributor

@prsingh

You need to pass the Databricks spark-csv dependency: either download the JARs and pass them when starting the shell, or pull the package in at runtime.

1) Download the dependency at runtime:

pyspark --packages com.databricks:spark-csv_2.10:1.2.0 
df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')

or

2) Pass the JARs when starting the shell:

a) Download the JARs as follows:

wget http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.ja... -O commons-csv-1.1.jar 
wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.... -O spark-csv_2.10-1.0.0.jar 

b) Then start the PySpark shell with the arguments:

./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar" 

c) Load the file as a DataFrame:

df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')

Let me know if the above helps!

