Created on 06-28-2017 08:40 AM - edited 09-16-2022 04:50 AM
I am trying to load a CSV file into PySpark with the query below:
sample = sqlContext.load(source="com.databricks.spark.csv", path='/tmp/test/20170516.csv', header=True, inferSchema=True)
But I am getting an error saying:
py4j.protocol.Py4JJavaError: An error occurred while calling o137.load. : java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Created 06-28-2017 08:45 AM
You need to make the Databricks CSV dependency available: either download the package at run time, or download the jars and pass them when starting the shell.
1) Download the dependency at run time:
pyspark --packages com.databricks:spark-csv_2.10:1.2.0
df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
or
2) Pass the jars while starting:
a) Download the jars:
wget http://search.maven.org/remotecontent?filepath=org/apache/commons/commons-csv/1.1/commons-csv-1.1.ja... -O commons-csv-1.1.jar
wget http://search.maven.org/remotecontent?filepath=com/databricks/spark-csv_2.10/1.0.0/spark-csv_2.10-1.... -O spark-csv_2.10-1.0.0.jar
b) Then start the PySpark shell with the jars as arguments:
./bin/pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar"
c) Load the file as a DataFrame:
df = sqlContext.read.load('file:///root/file.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
Let me know if the above helps!
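As an aside, the `header='true'` and `inferSchema='true'` options mirror what you would otherwise do by hand: treat the first row as column names and guess each value's type. A minimal pure-Python sketch of that idea, using only the standard library and a hypothetical in-memory CSV standing in for the file on disk:

```python
import csv
import io

# Hypothetical CSV content standing in for /tmp/test/20170516.csv
raw = "id,name,score\n1,alice,9.5\n2,bob,8.0\n"

def infer(value):
    """Mimic inferSchema: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # header=True: first row becomes column names
rows = [dict(zip(header, map(infer, row))) for row in reader]
print(rows)
```

This is only an illustration of the options' semantics; in Spark the parsing and type inference happen on the executors, not in the driver like this.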