<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Reading data from HDFS on AWS EC2 cluster in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-data-from-HDFS-on-AWS-EC2-cluster/m-p/161557#M36700</link>
    <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/11887/fissehaberhane.html"&gt;Fish Berh&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This could be due to a problem with the spark-csv jar. I have encountered this myself and found a solution, though I can no longer locate the original source.&lt;/P&gt;&lt;P&gt;Here are my notes from the time:&lt;/P&gt;&lt;P&gt;1. Create a folder on your local filesystem or in HDFS and place the correct versions of the following jars in it (replace ? with the version you need):&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;spark-csv_?.jar&lt;/LI&gt;&lt;LI&gt;commons-csv-?.jar&lt;/LI&gt;&lt;LI&gt;univocity-parsers-?.jar&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;2. Go to the /conf directory of your Spark installation and add the following line to the spark-defaults.conf file:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;spark.driver.extraClassPath D:/Spark/spark_jars/*&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The asterisk picks up all the jars in that folder. Now start Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv as&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;sqlContext.read.format('com.databricks.spark.csv').\
options(header='true', inferSchema='true').\
load('foobar.csv')&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Tue, 27 Dec 2016 06:34:49 GMT</pubDate>
    <dc:creator>cstanca</dc:creator>
    <dc:date>2016-12-27T06:34:49Z</dc:date>
    <item>
      <title>Reading data from HDFS on AWS EC2 cluster</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-data-from-HDFS-on-AWS-EC2-cluster/m-p/161556#M36699</link>
      <description>&lt;P&gt;I am trying to read data from HDFS on an AWS EC2 cluster using a Jupyter Notebook. The cluster has 7 nodes. I am using HDP 2.4 and my code is below. The table has millions of rows, but the code does not return any rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server (ambari-server).&lt;/P&gt;&lt;PRE&gt;from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
demography.printSchema()
demography.cache()
print(demography.count())
&lt;/PRE&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="6281-screen-shot-2016-08-02-at-30425-pm.png" style="width: 475px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/20663i7D4F14D5CAF44601/image-size/medium?v=v2&amp;amp;px=400" role="button" title="6281-screen-shot-2016-08-02-at-30425-pm.png" alt="6281-screen-shot-2016-08-02-at-30425-pm.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 11:52:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-data-from-HDFS-on-AWS-EC2-cluster/m-p/161556#M36699</guid>
      <dc:creator>fisseha_berhane</dc:creator>
      <dc:date>2019-08-18T11:52:14Z</dc:date>
    </item>
    <item>
      <title>Re: Reading data from HDFS on AWS EC2 cluster</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-data-from-HDFS-on-AWS-EC2-cluster/m-p/161557#M36700</link>
      <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/11887/fissehaberhane.html"&gt;Fish Berh&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This could be due to a problem with the spark-csv jar. I have encountered this myself and found a solution, though I can no longer locate the original source.&lt;/P&gt;&lt;P&gt;Here are my notes from the time:&lt;/P&gt;&lt;P&gt;1. Create a folder on your local filesystem or in HDFS and place the correct versions of the following jars in it (replace ? with the version you need):&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;spark-csv_?.jar&lt;/LI&gt;&lt;LI&gt;commons-csv-?.jar&lt;/LI&gt;&lt;LI&gt;univocity-parsers-?.jar&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;2. Go to the /conf directory of your Spark installation and add the following line to the spark-defaults.conf file:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;spark.driver.extraClassPath D:/Spark/spark_jars/*&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The asterisk picks up all the jars in that folder. Now start Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv as&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;sqlContext.read.format('com.databricks.spark.csv').\
options(header='true', inferSchema='true').\
load('foobar.csv')&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 27 Dec 2016 06:34:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-data-from-HDFS-on-AWS-EC2-cluster/m-p/161557#M36700</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-12-27T06:34:49Z</dc:date>
    </item>
  </channel>
</rss>

