Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Reading data from HDFS on AWS EC2 cluster

I am trying to read data from HDFS on an AWS EC2 cluster (7 nodes) using a Jupyter Notebook. I am running HDP 2.4, and my code is below. The table has millions of rows, but the code returns no rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server (ambari-server).

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
demography.printSchema()
demography.cache()
print demography.count()
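As a quick sanity check, the same path can be read as plain text, which bypasses spark-csv entirely and confirms that the HDFS path itself is reachable (a minimal sketch reusing the sc and path from above):

# Read the raw file without spark-csv; take() forces the read,
# so a wrong or unreadable path fails loudly here instead of
# silently returning zero rows.
raw = sc.textFile("hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv")
print raw.take(5)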

1 ACCEPTED SOLUTION

Super Guru

@Fish Berh

This could be due to a problem with the spark-csv jar. I have run into this myself and found a solution, though I can no longer locate the original source.

Here are my notes at the time:

1. Create a folder on your local filesystem or in HDFS and place the correct versions of these jars in it (replace ? with the version you need):

  1. spark-csv_?.jar
  2. commons-csv-?.jar
  3. univocity-parsers-?.jar

2. Go to the /conf directory of your Spark installation and add the following line to spark-defaults.conf:

spark.driver.extraClassPath D:/Spark/spark_jars/*

The asterisk picks up every jar in that folder. Then start Python and create the SparkContext and SQLContext as you normally would; you should now be able to use spark-csv like this (two further notes follow the example):

sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('foobar.csv')
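One caveat on step 2: spark.driver.extraClassPath only puts the jars on the driver's classpath, and the CSV parsing tasks run on the executors, so it is usually safest to set the executor-side property as well. A sketch of both lines in spark-defaults.conf, assuming the same jar folder exists at the same path on every node:

spark.driver.extraClassPath   D:/Spark/spark_jars/*
spark.executor.extraClassPath D:/Spark/spark_jars/*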
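Alternatively, if the cluster can reach the internet (or an internal Maven mirror), the package can be pulled in at launch time instead of managing the jars by hand. The coordinates below are illustrative and must match your Spark build; HDP 2.4 ships Spark 1.6 on Scala 2.10:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0

This downloads spark-csv along with its commons-csv and univocity-parsers dependencies and puts them on both the driver and executor classpaths automatically.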
