Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Reading data from HDFS on AWS EC2 cluster

I am trying to read data from HDFS on an AWS EC2 cluster (7 nodes) using a Jupyter Notebook. I am running HDP 2.4, and my code is below. The table has millions of rows, but the code returns no rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server (ambari-server).

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
demography.printSchema()
demography.cache()
print demography.count()
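As a quick sanity check, the same path can be read as plain text, which bypasses spark-csv entirely and confirms that the HDFS path itself is reachable (a minimal sketch reusing the sc and path from above):

# Read the raw file without spark-csv; take() forces the read,
# so a wrong or unreadable path fails loudly here instead of
# silently returning zero rows.
raw = sc.textFile("hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv")
print raw.take(5)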

1 ACCEPTED SOLUTION

Super Guru

@Fish Berh

This could be due to a problem with the spark-csv jar. I have run into this myself and found a solution, though I can no longer locate the original source.

Here are my notes at the time:

1. Create a folder on your local filesystem or in HDFS and place the correct versions of these jars in it (replace ? with the version you need):

  1. spark-csv_?.jar
  2. commons-csv-?.jar
  3. univocity-parsers-?.jar

2. Go to the /conf directory of your Spark installation and add the following line to spark-defaults.conf:

spark.driver.extraClassPath D:/Spark/spark_jars/*

The asterisk picks up every jar in that folder. Then start Python and create the SparkContext and SQLContext as you normally would; you should now be able to use spark-csv like this (two further notes follow the example):

sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('foobar.csv')
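One caveat on step 2: spark.driver.extraClassPath only puts the jars on the driver's classpath, and the CSV parsing tasks run on the executors, so it is usually safest to set the executor-side property as well. A sketch of both lines in spark-defaults.conf, assuming the same jar folder exists at the same path on every node:

spark.driver.extraClassPath   D:/Spark/spark_jars/*
spark.executor.extraClassPath D:/Spark/spark_jars/*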
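Alternatively, if the cluster can reach the internet (or an internal Maven mirror), the package can be pulled in at launch time instead of managing the jars by hand. The coordinates below are illustrative and must match your Spark build; HDP 2.4 ships Spark 1.6 on Scala 2.10:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0

This downloads spark-csv along with its commons-csv and univocity-parsers dependencies and puts them on both the driver and executor classpaths automatically.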
