Created on 08-02-2016 07:02 PM - edited 08-18-2019 04:52 AM
I am trying to read data from HDFS on an AWS EC2 cluster (7 nodes) using a Jupyter Notebook. I am running HDP 2.4 and my code is below. The table has millions of rows, but the code does not return any rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server (ambari-server).
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv",
                                  format="com.databricks.spark.csv",
                                  header="true", inferSchema="true")
demography.printSchema()
demography.cache()
print demography.count()
Created 12-26-2016 10:34 PM
This could be due to a problem with the spark-csv jar. I encountered this myself and found a solution, though I can no longer locate the original source.
Here are my notes from the time:
1. Create a folder in your local OS or on HDFS and place the jars spark-csv needs there, in the versions appropriate for your setup (replace ? with the version you need), typically spark-csv_2.10-?.jar together with its dependencies commons-csv-?.jar and univocity-parsers-?.jar.
2. Go to the /conf directory of your Spark installation and add the following line to spark-defaults.conf (a Linux example of this entry follows after the snippet below):
spark.driver.extraClassPath D:/Spark/spark_jars/*
The asterisk picks up all the jars in that folder. Now start Python and create the SparkContext and SQLContext as you normally would; you should then be able to use spark-csv as follows:
sqlContext.read.format('com.databricks.spark.csv').\
options(header='true', inferSchema='true').\
load('foobar.csv')
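Note that D:/Spark/spark_jars/* above is a Windows path. On a Linux cluster such as the EC2/HDP setup in the question, the entry from step 2 would point at whatever directory you created in step 1; the path below is only an illustration:

spark.driver.extraClassPath /home/hadoop/spark_jars/*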
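Putting it together with the question's file, here is a minimal sketch; the HDFS URI and host are taken from the question, and sc is the SparkContext the notebook already provides:

# Read the CSV over HDFS with spark-csv once the jars are on the
# driver classpath (sc is the existing SparkContext).
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)

demography = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv')

demography.printSchema()
demography.cache()
print demography.count()

Alternatively, on Spark 1.x you can skip managing the jars by hand and pull the package in at launch time, e.g. pyspark --packages com.databricks:spark-csv_2.10:1.5.0 (the version shown is just an example).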