New Contributor
Posts: 3
Registered: ‎02-10-2016

Unable to load spark-csv package

I am using the Cloudera QuickStart VM for online training. For one particular task I need to load the spark-csv package so I can read CSV files into pyspark for practice. However, I am encountering problems.


First, I ran PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0


It seemed to work fine, but I got a warning message: util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable.


Then I tried the spark code to import csv:

yelp_df = sqlCtx.load(source="com.databricks.spark.csv", header='true', inferSchema='true', path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')


but I am getting an error saying "Py4JJavaError: An error occurred while calling o19.load. : java.lang.RuntimeException: Failed to load class for source: com.databricks.spark.csv"


Does anyone know how I can fix this? Thanks a lot!

Expert Contributor
Posts: 78
Registered: ‎01-08-2016

Re: Unable to load spark-csv package

Hello Cweeks,


Can you try sbt assembly?

New Contributor
Posts: 3
Registered: ‎02-10-2016

Re: Unable to load spark-csv package

Thanks for the suggestion. How can I use sbt assembly in the Cloudera VM? Should I initialize pyspark and then type it? I just tried that and it didn't work. (I've only been using Spark through ipython.) Thanks.

Cloudera Employee
Posts: 366
Registered: ‎07-29-2013

Re: Unable to load spark-csv package

The WARNing is irrelevant and ignorable.

This looks like a typo: " -- packages"
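
For anyone landing here with the same symptom: with the extra space, the shell passes "packages" and the package coordinates as separate arguments instead of a flag, so the package is never fetched. The working invocation has no space in the --packages flag:

```shell
# Launch pyspark with the spark-csv package ("--packages", no space)
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
```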
New Contributor
Posts: 3
Registered: ‎02-10-2016

Re: Unable to load spark-csv package

Thank you srowen. That's it. I didn't realize a space could make such a big difference. Now everything works fine. Thanks!

Expert Contributor
Posts: 71
Registered: ‎03-04-2015

Re: Unable to load spark-csv package

This solution doesn't work for us because we are behind a corporate firewall. I got it to work by manually downloading the jar files and specifying the path on the command line:


spark-shell --jars /tmp/commons-csv-1.4.jar,/tmp/spark-csv_2.10-1.5.0.jar


But what's the best practice Cloudera recommends in this case?  Is there a preferred location for installing third-party Spark packages?  In this case, since parsing CSV files is a major use case, would Cloudera consider including the spark-csv package in its Spark 1.6 distro (it's supposedly included in Spark 2.0)?
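
In the meantime, a sketch of how we avoid retyping the --jars flag (assuming the jars stay at those /tmp paths and that you can edit spark-defaults.conf on the gateway host):

```shell
# In spark-defaults.conf (e.g. /etc/spark/conf/spark-defaults.conf):
# spark.jars takes a comma-separated list of local jars to add to the
# driver and executor classpaths for every job launched from this host.
spark.jars /tmp/commons-csv-1.4.jar,/tmp/spark-csv_2.10-1.5.0.jar
```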






Posts: 6
Registered: ‎08-19-2016

Re: Unable to load spark-csv package

Hi, I'm running pyspark in a Hue Notebook.

Below is my script:

from pywebhdfs.webhdfs import PyWebHdfsClient
from pyspark.sql import functions as fx
from pyspark.sql import types as tx
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("default")    
#sc = SparkContext(conf=conf)
hc = HiveContext(sc)
sqlc = SQLContext(sc)
hc.setConf("spark.sql.hive.convertMetastoreOrc", "false")
#Read csv file and create table
df ='/user/sspandan/unzipped/surge_comp.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')


It runs fine in Zeppelin.


But when I try to run it in a Hue Notebook, I get the following error:



Traceback (most recent call last):
Py4JJavaError: An error occurred while calling o71.load. : java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(
at
at
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
at
at java.lang.ClassLoader.loadClass(
at java.lang.ClassLoader.loadClass(
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:82)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 14 more


Can anyone help me with how to pass the spark-csv package in the Hue Notebook?
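
One thing that may be worth trying (an assumption on my part; I haven't run this exact Hue setup): Hue notebooks don't go through your shell's --packages flag, so the package has to reach the session through Spark configuration instead. A sketch, assuming the notebook's session honors spark-defaults.conf (or equivalent session properties) and the hosts can reach a Maven repository:

```shell
# In spark-defaults.conf on the gateway host (assumption: the notebook
# session reads it). spark.jars.packages asks Spark to resolve the
# coordinates and ship the jars to the driver and executors.
spark.jars.packages com.databricks:spark-csv_2.10:1.5.0
```

If the hosts cannot reach a repository (e.g. behind a firewall, as mentioned above), the manually-downloaded-jars route with spark.jars is the fallback.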