02-10-2016 11:58 AM
I am using Cloudera Quickstart VM 22.214.171.124 for online training. For one particular task I need to load the spark-csv package so I can read CSV files into PySpark for practice. However, I am encountering problems.
First, I ran PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
It seemed to work fine, but I got a warning message saying: util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Then I tried the Spark code to import the CSV:
yelp_df = sqlCtx.load(source="com.databricks.spark.csv", header='true', inferSchema='true', path='file:///usr/lib/hue/apps/search/examples/collections/solr_configs_yelp_demo/index_data.csv')
but I am getting an error saying "Py4JJavaError: An error occurred while calling o19.load. : java.lang.RuntimeException: Failed to load class for source: com.databricks.spark.csv"
Does anyone know how I can fix this? Thanks a lot!
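For reference, the same read can also be written with the DataFrameReader API (sqlCtx.read), which is available from Spark 1.4 onward. This is only a sketch: it assumes the shell was started with the spark-csv package on the classpath, and the helper name read_csv_with_header is hypothetical, not part of Spark or spark-csv.

```python
def read_csv_with_header(sql_ctx, path):
    """Read a CSV file that has a header row via the spark-csv data source.

    sql_ctx is the SQLContext that the pyspark shell provides (sqlCtx).
    The helper name is illustrative only.
    """
    return (sql_ctx.read
            .format("com.databricks.spark.csv")                 # data source from the spark-csv package
            .options(header="true", inferSchema="true")         # first line is the header; guess column types
            .load(path))

# Usage inside the pyspark shell (assumes spark-csv is on the classpath):
# yelp_df = read_csv_with_header(sqlCtx, "file:///path/to/index_data.csv")
```

If the package is not on the classpath, this call is expected to fail with the same "Failed to load class for source" error, so it does not replace loading the package itself.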
02-11-2016 06:23 PM
Thanks, Consult, for the suggestion. How can I use sbt assembly in the Cloudera VM? Should I initialize pyspark and then type it? I just tried that and it didn't work. (I've only been using Spark through ipython.) Thanks.
10-04-2016 03:31 PM
This solution doesn't work for us because we are behind a corporate firewall. I got it to work by manually downloading the jar files and specifying their paths on the command line:
spark-shell --jars /tmp/commons-csv-1.4.jar,/tmp/spark-csv_2.10-1.5.0.jar
But what's the best practice Cloudera recommends in this case? Is there a preferred location for installing third-party Spark packages? In this case, since parsing CSV files is a major use case, would Cloudera consider including the spark-csv package in its Spark 1.6 distro (it's supposedly included in Spark 2.0)?
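For PySpark started from a plain Python or IPython process rather than spark-shell, one way to pass the same manually downloaded jars is the PYSPARK_SUBMIT_ARGS environment variable. The following is a sketch, not a Cloudera recommendation: it assumes the jars sit at the /tmp paths used in the command above, that the variable is set before any SparkContext is created, and the helper name build_submit_args is hypothetical.

```python
import os

def build_submit_args(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that passes local jars with --jars.

    The value must end with 'pyspark-shell' for PySpark to accept it.
    The helper name is illustrative only.
    """
    return "--jars {} pyspark-shell".format(",".join(jar_paths))

# Jar locations taken from the spark-shell command above.
jars = ["/tmp/commons-csv-1.4.jar", "/tmp/spark-csv_2.10-1.5.0.jar"]

# Must be set before the SparkContext is created; it has no effect on a running context.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jars)
```

This only affects a driver launched from the same process afterwards, so it would not change a session that a notebook server has already started.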
11-28-2017 03:30 AM
Hi, I'm running PySpark in a Hue Notebook.
Below is my script:
from pywebhdfs.webhdfs import PyWebHdfsClient
from pyspark.sql import functions as fx
from pyspark.sql import types as tx
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("default")
#sc = SparkContext(conf=conf)
hc = HiveContext(sc)
sqlc = SQLContext(sc)
#Read csv file and create table
df = hc.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('/user/sspandan/unzipped/surge_comp.csv')
It runs fine in Zeppelin.
But when I try to run it in a Hue Notebook, I'm getting the following error.
Can anyone help me with how to pass the spark-csv package?