New Contributor
Posts: 3
Registered: 02-10-2016

Unable to load spark-csv package

I am using the Cloudera QuickStart VM 5.4.2.0 for online training. For one particular task I need to load the spark-csv package so I can read CSV files into pyspark for practice. However, I am encountering problems.

 

First, I ran:

PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0

 

It seemed to work fine, but I got a warning message: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable.

 

Then I tried the Spark code to read the CSV:

yelp_df = sqlCtx.load(source="com.databricks.spark.csv", header='true', inferSchema='true', path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')

 

but I am getting an error saying: "Py4JJavaError: An error occurred while calling o19.load. : java.lang.RuntimeException: Failed to load class for source: com.databricks.spark.csv".

 

Does anyone know how I can fix this? Thanks a lot!

Contributor
Posts: 33
Registered: 01-08-2016

Re: Unable to load spark-csv package

Hello Cweeks,

 

Can you try sbt assembly?
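
(For readers unfamiliar with the term: sbt assembly builds a single "fat" jar that bundles an application together with its dependencies, which can then be handed to Spark. A rough sketch, assuming a Scala project that already has the sbt-assembly plugin configured; the jar path and name below are hypothetical:)

# run inside the project directory; writes a fat jar under target/
sbt assembly

# then point Spark at the bundled jar
pyspark --jars target/scala-2.10/myapp-assembly-1.0.jar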

New Contributor
Posts: 3
Registered: 02-10-2016

Re: Unable to load spark-csv package

Thanks, Consult, for the suggestion. How can I use sbt assembly in the Cloudera VM? Should I initialize pyspark and then type it? I just tried that and it didn't work. (I've only been using Spark through ipython.) Thanks.

Cloudera Employee
Posts: 366
Registered: 07-29-2013

Re: Unable to load spark-csv package

The WARNing is irrelevant and ignorable.

This looks like a typo: " -- packages"
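
(That is, --packages must be a single token, with no space after the dashes; with the stray space, the option is never recognized and the package is never fetched. The corrected invocation would be:)

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0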
New Contributor
Posts: 3
Registered: 02-10-2016

Re: Unable to load spark-csv package

Thank you, srowen. That's it. I didn't realize a space could make such a big difference. Now everything works fine. Thanks!

Expert Contributor
Posts: 64
Registered: 03-04-2015

Re: Unable to load spark-csv package

Hi: 

 

This solution doesn't work for us, as we are behind a corporate firewall. I got it to work by manually downloading the jar files and specifying their paths on the command line:

 

spark-shell --jars /tmp/commons-csv-1.4.jar,/tmp/spark-csv_2.10-1.5.0.jar
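
(For reference, the manual download step might look like the following, assuming Maven Central is reachable through your proxy or mirrored internally:)

wget -P /tmp https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.4/commons-csv-1.4.jar
wget -P /tmp https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar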

 

But what's the best practice Cloudera recommends in this case? Is there a preferred location for installing third-party Spark packages? And since parsing CSV files is a major use case, would Cloudera consider including the spark-csv package in its Spark 1.6 distro? (It's supposedly included in Spark 2.0.)
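
(One interim approach, a sketch rather than an official recommendation: copy the jars to a stable location on each node and list them in spark-defaults.conf via the standard spark.jars property, so individual users don't need to remember the --jars flag. The /opt path below is only an example:)

# spark-defaults.conf (location depends on your deployment)
spark.jars /opt/spark-extras/commons-csv-1.4.jar,/opt/spark-extras/spark-csv_2.10-1.5.0.jar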

 

Thanks,

Miles

 

 

Explorer
Posts: 6
Registered: 08-19-2016

Re: Unable to load spark-csv package

Hi, I'm running pyspark in a Hue Notebook.

Below is my script:

from pywebhdfs.webhdfs import PyWebHdfsClient
from pyspark.sql import functions as fx
from pyspark.sql import types as tx
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("default")
#sc = SparkContext(conf=conf)
hc = HiveContext(sc)
sqlc = SQLContext(sc)
hc.setConf("spark.sql.hive.convertMetastoreOrc", "false")

# Read csv file and create table
df = hc.read.format('csv').load(path='/user/sspandan/unzipped/surge_comp.csv',
                                format='com.databricks.spark.csv',
                                header='true', inferSchema='true')
df.show(1)

 

It runs fine in Zeppelin.

 

But when I try to run it in a Hue Notebook, I get the following error.

 

Error:

Traceback (most recent call last):
Py4JJavaError: An error occurred while calling o71.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
	at scala.util.Try.orElse(Try.scala:82)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
	... 14 more

 

Can anyone help me with how to pass the spark-csv package in the Hue Notebook?

 

Thanks,

MJ
